process-mining
2023.4
false
UiPath logo, featuring letters U and I in white

Process Mining

Automation CloudAutomation Cloud Public SectorAutomation SuiteStandalone
Last updated Dec 18, 2024

Editing transformations

Folder structure

The transformations of a process app consist of a dbt project. Below is a description of the contents of a dbt project folder.

Folder/file

Contains

dbt_packages\

the pm_utils package and its macros.

logs\

logs created when running dbt.

macros\

custom macros.

models\

.sql files that define the transformations.

models\schema\

.yml files that define tests on the data.

seed

.csv files with configuration settings.

dbt_project.yml

the settings of the dbtproject.

See the illustration below.



Data transformations

The data transformations are defined in .sql files in the models\ directory. The data transformations are organized in a standard set of sub directories:
  • 1_input,
  • 2_entities,
  • 3_events,
  • 4_event_logs,
  • 5_business_logic.
The .sql files are written in Jinja SQL, which allows you to insert Jinja statements inside plain SQL queries. When dbt runs all .sql files, each .sql file results in a new view or table in the database.
Typically, the .sql files have the following structure:
  1. With statements: One or more with statements to include the required sub tables.

    • {{ ref(‘My_table) }} refers to table defined by another .sql file.
    • {{ source(var("schema_sources"), 'My_table') }} refers to an input table.
  2. Main query: The query that defines the new table.
  3. Final query: Typically a query like Select * from table is used at the end. This makes it easy to make sub-selections while debugging.
    docs image

For more tips on how to write transformations effectively, see Tips for writing SQL

Adding source tables

To add a new source table to the dbt project, it must be listed in models\schema\sources.yml. This way, other models can refer to it by using {{ source(var("schema_sources"), 'My_table_raw') }}. See the illustration below for an example.


Important: Each new source table must be listed in sources.yml.
Note:

The suffix _raw is added to source tables table names when loading data. For example, a table called my_table should be referred to as my_table_raw.

For more detailed information, see the official dbt documentation on Sources.

Data output

The data transformations must output the data model that is required by the corresponding app; each expected table and field must be present.

Practically, this means that the tables in the models\5_business_logic should not be deleted. Also, the output fields in the corresponding queries should not be removed.

If you want to add new fields to your process app, you can use the custom fields that are available for the process app. Map the fields in the transformations to the custom fields to have them available in the output. Make sure the custom fields are named in the output as described in the data model of the process app.

Tip:
You can use the dbt docs commands to generate a documentation site for your dbt project and open it in your default browser. The documentation site also contains a Lineage Graph that provides an entity relationship diagram with an graphical representation of the linkage between each data table in your project.
For detailed information, see the official dbt documentation on dbt docs.

Macros

Macros make it easy to reuse common SQL constructions. For detailed information, see the official dbt documentation on Jinja macros.

pm_utils

The pm-utils package contains a set of macros that are typically used in Process Mining transformations. For more info about the pm_utils macros, see ProcessMining-pm-utils.
Below is an example of Jinja code calling the pm_utils.optional() macro.


Seeds

Seeds are csv files that are used to add data tables to your transformations. For detailed information, see the official dbt documentation on jinja seeds.

In Process Mining, this is typically used to make it easy to configure mappings in your transformations.

After editing seed files, these files are not automatically updated in the database immediately. To instruct dbt to load the new seed file contents into the database, run either

  • dbt seed - which will only update the seed file tables, or
  • dbt build - which will also run all models and tests.
    Note: If the seed file had no data records initially, the data types in the database might not have been set correctly. To fix this, call run dbt seed --full-refresh. This will also update the set of columns in the database.

Activity configuration

The activity_configuration.csv file is used to set additional fields related to activities. activity_order is used as a tie breaker when two events are happening on the same timestamp. See the illustration below for an example.


Tests

The models\schema\ folder contains a set of .yml files that define tests. These validate the structure and contents of the expected data. For detailed information, see the official dbt documentation on tests.
When the transformations are run in Process Mining, only the tests in sources.yml are run on each data ingestion. This is done to check if the input data is properly formatted.
Note: When you edit transformations, make sure to update the tests accordingly. The tests can be removed if desired.

Custom throughput time metrics

Introduction

With customizing data transformations and dashboard editing you can create and use custom throughput time metrics. Throughput times are the timings between two activities A and B. Below is a description of the steps you need to perform to create a custom throughput time metric when editing transformations and how to enable the througput time metric in the process app dashboards.

Creating a custom throughput time metric with editing transformations

You must first caclulate the throughput time and then make it available as a Case field.

Calculating the throughput time

Per case, you can compute throughput times between Activity A and Activity B. Since activities can occur more than once per case, you need to take into account whether you take the first or last occurrence of an activity.

  1. Create an additional model based on the Event log to calculate the desired throughput times. For example, Cases_with_throughput_times.

  2. In this model, create pre-processing tables in which you define which event ends you want to use for the computations. Per table, you need the Case ID, and the Event end of an activity. Below is an example of how to select the last occurrence of activity A for a case.

    Event_end_activity_A as (
        select
            Event_log."Case_ID",
            max(Event_log."Event_end") as "Event_end_activity_A"
        from Event_log
        where Event_log."Activity" = 'Activity A'
        group by Event_log."Case_ID")Event_end_activity_A as (
        select
            Event_log."Case_ID",
            max(Event_log."Event_end") as "Event_end_activity_A"
        from Event_log
        where Event_log."Activity" = 'Activity A'
        group by Event_log."Case_ID")
    Note:

    In this example, if you want to select the first occurrence of the activity, replace max with min.

  3. Define the throughput time table by joining the pre-processing tables to the Event log and calculating the actual throughput time.

    Tip:

    You can use the datediff function provided in the pm-utils package to compute the time difference between any two event ends.

    The throughput time should be computed in milliseconds for the instances where Activity A precedes Activity B. Milliseconds are the unit of time used to define durations in the app template. As the throughput times are already grouped per case in the pre-processing tables, you can pick any record. In the above example he aggregation min is used. The throughput time table, selecting the throughput time and a Case ID, can be defined as displayed below.

    Cases_with_throughput_times as (
        select
            Event_log."Case_ID",
            case
                when min(Event_end_activity_A."Event_end_activity_A") <= min(Event_end_activity_B."Event_end_activity_B")
                    then {{ pm_utils.datediff('millisecond',
                    'min(Event_end_activity_A."Event_end_activity_A")',
                    'min(Event_end_activity_B."Event_end_activity_B")') }}
            end as "Throughput_time_activity_A_to_activity_B"
        from Event_log
        left join Event_end_activity_A
            on Event_log."Case_ID" = Event_end_activity_A."Case_ID"
        left join Event_end_activity_B
            on Event_log."Case_ID" = Event_end_activity_B."Case_ID"
        group by Event_log."Case_ID)"Cases_with_throughput_times as (
        select
            Event_log."Case_ID",
            case
                when min(Event_end_activity_A."Event_end_activity_A") <= min(Event_end_activity_B."Event_end_activity_B")
                    then {{ pm_utils.datediff('millisecond',
                    'min(Event_end_activity_A."Event_end_activity_A")',
                    'min(Event_end_activity_B."Event_end_activity_B")') }}
            end as "Throughput_time_activity_A_to_activity_B"
        from Event_log
        left join Event_end_activity_A
            on Event_log."Case_ID" = Event_end_activity_A."Case_ID"
        left join Event_end_activity_B
            on Event_log."Case_ID" = Event_end_activity_B."Case_ID"
        group by Event_log."Case_ID)"

Calculating throughput time in days excluding weekends
You can use the date_from_timestamp function provided in the pm-utils package to compute the number of days between two activities. Additionally, the diff_weekdays function enables you to filter out weekend days.

See the code example below for how to calculate the number of weekdays between Activity A and Activity B.

with Event_log as (
    select * from {{ ref('Event_log') }}
),

Activity_A as (
    select
        Event_log."Case_ID",
        min({{ pm_utils.date_from_timestamp('Event_log."Event_end"') }}) as "Date_activity_A"
    from Event_log
    where Event_log."Activity" = 'Receive invoice'
    group by Event_log."Case_ID"
),

Activity_B as (
    select
        Event_log."Case_ID",
        min({{ pm_utils.date_from_timestamp('Event_log."Event_end"') }}) as "Date_activity_B"
    from Event_log
    where Event_log."Activity" = 'Pay invoice'
    group by Event_log."Case_ID"
),

Total_days_minus_weekends as (
    select
        Activity_A."Case_ID",
        Activity_A."Date_activity_A",
        Activity_B."Date_activity_B",
        {{ pm_utils.diff_weekdays('Activity_A."Date_activity_A"', 'Activity_B."Date_activity_B"') }}
    -- Only compute for cases where both dates are known.
    from Activity_A
    inner join Activity_B
        on Activity_A."Case_ID" = Activity_B."Case_ID"
)

select * from Total_days_minus_weekendswith Event_log as (
    select * from {{ ref('Event_log') }}
),

Activity_A as (
    select
        Event_log."Case_ID",
        min({{ pm_utils.date_from_timestamp('Event_log."Event_end"') }}) as "Date_activity_A"
    from Event_log
    where Event_log."Activity" = 'Receive invoice'
    group by Event_log."Case_ID"
),

Activity_B as (
    select
        Event_log."Case_ID",
        min({{ pm_utils.date_from_timestamp('Event_log."Event_end"') }}) as "Date_activity_B"
    from Event_log
    where Event_log."Activity" = 'Pay invoice'
    group by Event_log."Case_ID"
),

Total_days_minus_weekends as (
    select
        Activity_A."Case_ID",
        Activity_A."Date_activity_A",
        Activity_B."Date_activity_B",
        {{ pm_utils.diff_weekdays('Activity_A."Date_activity_A"', 'Activity_B."Date_activity_B"') }}
    -- Only compute for cases where both dates are known.
    from Activity_A
    inner join Activity_B
        on Activity_A."Case_ID" = Activity_B."Case_ID"
)

select * from Total_days_minus_weekends
Calculating throughput time in days excluding holidays

Follow the steps below to calculate the throughput time in days between Activity A and Activity B excluding weekend days and holidays.

1. Create a Holidays.csv file to define the days that should be counted as holidays. The file should at least contain a record for each holiday. Use the following format:
Holiday

Date

Weekday

New year's day

2024-01-01

Yes

Easter

2024-03-31

No

..

..

..

The records in the Holidays.csv file are used to count the number of days that need to be excluded exclude from a date range.
2. Load the Holidays.csv file as a seed file in the dbt project. For detailed information, see the official dbt documentation on jinja seeds.
3. Calculate the throughput time in days excluding weekends using the date_from_timestamp and the diff_weekdays functions provided in the pm-utils package as described above in Calculating throughput time in days excluding weekends.
4. Calculate the number of records that are stored in the holidays .csv file that fall within the given date range for each case. See the example code below.
Holidays_count as (
    select
        Total_days_minus_weekends."Case_ID",
        count(Holidays."Date") as "Number_of_holidays"
    from Total_days_minus_weekends
    left join Holidays
        on Holidays."Date" between Total_days_minus_weekends."Date_activity_A" and Total_days_minus_weekends."Date_activity_B"
    where Holidays."Weekday" = 'Yes'
    group by Total_days_minus_weekends."Case_ID"
)Holidays_count as (
    select
        Total_days_minus_weekends."Case_ID",
        count(Holidays."Date") as "Number_of_holidays"
    from Total_days_minus_weekends
    left join Holidays
        on Holidays."Date" between Total_days_minus_weekends."Date_activity_A" and Total_days_minus_weekends."Date_activity_B"
    where Holidays."Weekday" = 'Yes'
    group by Total_days_minus_weekends."Case_ID"
)
Note:
In the above example, the filter Weekday = 'Yes' is used to not subtract holidays when the holiday is on a Saturday or a Sunday. This is already taken care of in the diff_weekday function.

5. Subtract the computed number of holidays from the total number of days computed for each case. See the example code below.

Total_days_minus_weekends_and_holidays as (
    select
        Total_days_minus_weekends."Case_ID",
        Total_days_minus_weekends."Number_of_days" - Holidays_count."Number_of_holidays" as "Number_of_days_between_dates"
    from Total_days_minus_weekends
    inner join Holidays_count
        on Total_days_minus_weekends."Case_ID" = Holidays_count."Case_ID"
)Total_days_minus_weekends_and_holidays as (
    select
        Total_days_minus_weekends."Case_ID",
        Total_days_minus_weekends."Number_of_days" - Holidays_count."Number_of_holidays" as "Number_of_days_between_dates"
    from Total_days_minus_weekends
    inner join Holidays_count
        on Total_days_minus_weekends."Case_ID" = Holidays_count."Case_ID"
)

Making the throughput time available as case field

Once the throughput time table is created this table needs to be joined to the Cases table to add the additional throughput time data as case information. To have the new throughput time field available in the dashboards, it is necessary to cast the new throughput time field to one of the custom case duration fields.

Replace one of the custom case duration lines in the Cases table that look like this:

{{ pm_utils.optional(ref('Cases_base'), '"custom_case_duration_1"', 'integer') }} as "custom_case_duration_1",{{ pm_utils.optional(ref('Cases_base'), '"custom_case_duration_1"', 'integer') }} as "custom_case_duration_1",

with the newly created throughput time:

Cases_with_throughput_times."Throughput_time_activity_A_to_activity_B" as "custom_case_duration_1",Cases_with_throughput_times."Throughput_time_activity_A_to_activity_B" as "custom_case_duration_1",

The updates to the transformations for the custom throughput time metric are now done and can be imported into the app template.

Enabling the throughput time metric in the process app dashboards

When you have created a custom throughput time in your transformations it is available in your app template as a case property under its alias. You can customize your process app to create a throughput time metric based on the custom throughput time you created in the transformations.

Note:

By default, a new custom duration field is added as a field of type numeric. Make sure to edit the field and change the Type of the new field to duration. See also Data manager.

  • Go to Data manager and create a new metric.
  • Select the custom duration field to be used for the throughput time, and select Average or any other desired aggregation. You can also rename the custom case duration field to your desired name in the Data Manager.
  • Edit the application and put the new metric on the charts where you want to make it available for business users.
  • Publish the dashboards to make the throughput time metric available on the dashboards.
Note:

In Purchase-to-Pay and Order-to-Cash app templates a throughput time calculation is already available in the Purchase_order_items_with_throughput_times and Sales_order_items_with_throughput_times, respectively. Custom throughput times can be added there and then be made available as a custom duration on the Purchase_order_items or Sales_order_items.

SQL differences between Snowflake and SQL Server

SQL Server vs. Snowflake

In a local development environment, transformations are run on SQL Server, while Snowflake is used in Process Mining Automation Suite. Although most SQL statements will work both on SQL Server and Snowflake, there can be slight differences in syntax, which may lead to different return results.

To write SQL statements that work on both database systems:

  • Write field names in double quotes, e.g. Table."Field".
  • Prevent using SQL functions that are different in Snowflake and SQL Server, e.g. string_agg() and listagg().
    The pm_utils package comes with a set of functions that work on both database types, see Multiple databases. For example, instead of using string_agg() or listagg(), the pm_utils.string_agg() will result in the same behavior for both databases. If pm_utils does not contain the desired function, then a Jinja statement should be created to make sure the right function is called on each database.

String concatenation

To combine to strings, use the pm_utils.concat() function. This will yield the same results for both SQL Server and Snowflake.
Example: pm_utils.concat("This is a nice string", null) = "This is a nice string" Concatenating strings should not be done with operators like + or ||, as they are different for both databases (Snowflake uses || and SQL Server uses +). Also the standard concat() function has different behavior on both systems:

SQL Server

Snowflake

null values will be ignored and treated as an empty string.
null values will cause the entire result to be null.

Sorting

Sorting is handled differently in Snowflake and SQL server.

Example: ... order by "Attribute_1" desc, "Attribute_2" ...

Null values

SQL Server

Snowflake

null will default be sorted first (ascending)
null will default be sorted last (ascending)

Handling capital letters

SQL Server

Snowflake

capitals are sorted as expected (AaBbCc)

first sorts by capitals, then by non-capitals (ABCabc)

Dashes

Example: -Accountant-

SQL Server

Snowflake

dashes are ignored in sorting (so '-Accountant-' is treated same as 'Accountant')

dashes will be sorted at the top

Whitespace handling

When you group by values “A“ and “ A“, this is seen as one value in SQL Server, but as two different values in Snowflake. Therefore trimming is advised if your data may cause this issue.

Case sensitivity

By default, SQL Server is case insensitive whereas Snowflake is case sensitive. This means that Table."Field" = "Some_value" and Table."Field" = "SOME_VALUE" will return the same result set in SQL Server, but potentially two different result sets in Snowflake.

You are advised to change the behavior of your local SQL Server database to match Snowflakes behavior, to prevent any problems. This can be accomplished by setting the database collation to a case sensitive value.

Configuration settings for loading input data

Settings.json

The settings.json file contains settings related to loading input data. These settings are temporary and should be used to switch existing process apps (created before March 2023) to the new data loading behavior. You should not change the settings.json file for new process apps.

When you create a new process app, always make sure that the input data is in the required format for the app template that you use to create a new app. See App Templates.

Important:

New process apps will apply to the new data model by default. Existing process apps will continue to work in Process Mining 2023.4 and Process Mining 2023.10. Switch over the data loading settings of your process apps before Process Mining 2024.4, to make sure uploading data keeps working properly. The AddRawTablePostfix and StripSpecialCharacters settings are temporary and will be removed in Process Mining 2024.4. From then on, all process apps will function as if these settings were set to false.

Setting

Format

Description

AddRawTablePostfix

boolean

For adding a _raw suffix to your source tables when using file uploads through the Upload data option. For example, if the file you upload is named Event_log.csv it changes to Event_log_raw.csv if this setting is set to true.
StripSpecialCharacters

boolean

For removing special characters and replacing spaces with underscores in table names and/or field names.For example, a field called Event+end changes to Eventend.

Using existing process apps with the new data model settings

Follow these steps to use your existing process apps with the new data model settings.

  1. Download the settings.json.zip and unzip the file.

  2. Export the transformations of your process app. See Editing data transformations in a local environment.

  3. Add the settings.json file (if not already present) to transformations.

  4. Make sure both AddRawTablePostfix and StripSpecialCharacters are set to false.

  5. Change your input files, or transformations, such that the file names match exactly. For example, if your input transformations expect event_log_raw then your csv file should also be called event_log_raw.csv.

  6. If you are using special characters in table names or field names (for example ' ', '(' or '?'), then make sure the input transformations use the same name. For example, a field Activity category should be referred to as Activity category in your queries and not Activity_category.

  7. Import transformations and ingest new data.

The transformations will be run again in the platform and the source tables will be changed accordingly.

Dbt projects

Data transformations are used to transform input data into data suitable for Process Mining. The transformations in Process Mining are written as dbt projects.

This pages gives an introduction to dbt. For more detailed information, see the official dbt documentation.

pm-utils package

Process Mining app templates come with a dbt package called pm_utils. This pm-utils package contains utility functions and macros for Process Mining dbt projects. For more info about the pm_utils , see ProcessMining-pm-utils.

Updating the pm-utils version used for your app template

UiPath® constantly improves the pm-utils package by adding new functions.
When a new version of the pm-utils package is released, you are advised to update the version used in your transformations, to make sure that you make use of the latest functions and macros of the pm-utils package.
You find the version number of the latest version of the pm-utils package in the Releases panel of the ProcessMining-pm-utils.
Follow these steps to update the pm-utils version in your transformations.
  1. Download the source code (zip) from the release of pm-utils.
  2. Extract the zip file and rename to folder to pm_utils.
  3. Export transformations from the inline Data transformations editor and extract the files.

  4. Replace the pm_utils folder from the exported transformations with the new pm_utils folder.

  5. Zip the contents of the transformations again and import them in the Data transformations editor.

Was this page helpful?

Get The Help You Need
Learning RPA - Automation Courses
UiPath Community Forum
Uipath Logo White
Trust and Security
© 2005-2024 UiPath. All rights reserved.