Process Mining 2021.10
Last updated Sep 2, 2024

Data Volume

Introduction

The amount of data is always in a direct trade-off with performance. Process mining inherently depends on a high level of detail to construct the process graphs.

However, all these unique timestamps impact performance. In general, there are theoretical limits that all process mining tools and all in-memory tools run up against.

Types of Users

We make a clear distinction between the performance of the data used for an Application and for the Connector. Although both make use of the same platform, there are differences in what is acceptable to the users (developers versus end users) and in the type of actions performed.

Large amounts of data can impact both the Connector and the Application, but in all cases the issue can be addressed in the Connector.

Data Volume

The performance end users will experience is directly related to the data volume. The data volume is determined by the number of rows in the biggest tables. In general, only the number of rows determines the performance end users experience. The number of columns is only a factor when the data is loaded from the database.

Processes with up to about 5,000,000 (5M) cases and up to about 50,000,000 (50M) events per process are ideal. With more cases and events, parsing the data and showing the visualizations will take longer.

The UiPath Process Mining platform will continue to work when large amounts of data are inserted; however, the response speed may drop. It is recommended to check the data volume beforehand and, if it exceeds the numbers above, to consider optimizing or limiting the dataset.

Level of Detail

A higher level of detail leads to longer response times, which impacts performance.

The exact tradeoff between the amount of data, the level of detail, and the waiting time needs to be discussed with the end users. Sometimes historical data can be very important, but often only the last few years are needed.

Another factor is the number of unique values in your columns. UiPath Process Mining uses a proprietary method to reduce the size of the *.mvn files to a minimum. This works well for values that are similar. Many unique values for an attribute, e.g. the event detail, will also impact performance.

Solutions

There are two main solution directions for dealing with large data volumes:

  • optimization;
  • data minimization.

Optimization covers the adjustments Superadmins can make so that the dashboards render faster, which can be achieved by tailoring the application settings to the specific dataset (see Application Design for more information).

This section describes data minimization, i.e. the different techniques you can employ to reduce the data visible to the end user, tailored to the specific business question.

The techniques described here can exist alongside each other or can even be combined to leverage the benefits of multiple techniques. In addition, you may keep an application without data minimization alongside minimized applications because the level of detail might sometimes be required for specific analyses where slower performance is acceptable.

Data Scoping

Limiting the number of records that show up in your dataset will not only improve the performance of the application, it will also improve the comprehensibility of the process and, in turn, improve acceptance by the business.

The scoping of the data can be done in the Connector.

One of the options for scoping is to limit the time frame under analysis by filtering out dates or periods. For example, you could limit the time frame from ten years to one year, or from one year to one month. See the illustration below.
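In UiPath Process Mining, this scoping is configured in the Connector rather than in code. As a rough Python sketch of the idea, assuming a hypothetical event log stored as a list of dicts (the field names `case_id`, `activity`, and `timestamp` are illustrative, not the platform's actual schema):

```python
from datetime import datetime, timedelta

# Hypothetical event log; field names are illustrative only.
events = [
    {"case_id": "C1", "activity": "Create PO", "timestamp": datetime(2023, 11, 5)},
    {"case_id": "C1", "activity": "Approve PO", "timestamp": datetime(2023, 11, 7)},
    {"case_id": "C2", "activity": "Create PO", "timestamp": datetime(2014, 3, 1)},
]

# Limit the time frame to the last year ("now" is fixed for reproducibility).
now = datetime(2024, 1, 1)
window_start = now - timedelta(days=365)

# Drop whole cases that have events before the window, so the cases
# that remain are complete rather than truncated.
old_cases = {e["case_id"] for e in events if e["timestamp"] < window_start}
scoped = [e for e in events if e["case_id"] not in old_cases]
```

Scoping at the case level, rather than dropping individual events, avoids cases that would otherwise appear truncated in the process graph.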



A limited number of activities is advised, especially at the start of any process mining effort. From there you can build up as the expertise starts to ramp up.

Below is a guideline for the range of activities:

Range (nr. of activities) | Description
5-20 | Preferred range when starting with process mining. A simple process that provides insight.
20-50 | Expert range. Expanding with clear variants.
50-100 | Most useful if there are clear variants, i.e. somewhat related processes that primarily stand on their own.
100+ | Advised to split up into subprocesses.

Note: Filtering out activities will simplify your process and make it more comprehensible. Be aware that you may also lose information or details.

Below are some suggestions for filtering data:

  • Unrelated activities: activities that do not directly impact the process could be filtered out.
  • Secondary activities: some activities, e.g. a change activity, can happen anywhere in the process. These significantly blow up the number of variants.
  • Minimally occurring events: events that occur only a few times in your dataset could be filtered out.
  • Smaller process: only analyze a subprocess.
  • Grouping activities: some activities in your dataset may be more like small tasks, which together represent an activity that makes more sense to the business. Grouping them will require some logic in the Connector and may result in overlapping activities.
  • If the performance of the Connector allows it, use the Connector to filter out activities. In this way, any changes can easily be reverted, or activities can be added back. Avoid filtering out activities during data extraction or data loading.
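The filtering suggestions above can be sketched in Python. This is only an illustration of the logic, not the Connector's actual mechanism; the activity names, the exclusion list, and the `MIN_OCCURRENCES` threshold are hypothetical:

```python
from collections import Counter

# Hypothetical event log; activity names are illustrative.
events = [
    {"case_id": "C1", "activity": "Create PO"},
    {"case_id": "C1", "activity": "Change Price"},  # secondary activity
    {"case_id": "C1", "activity": "Approve PO"},
    {"case_id": "C2", "activity": "Create PO"},
    {"case_id": "C2", "activity": "Approve PO"},
    {"case_id": "C2", "activity": "Fax Sent"},      # rare, unrelated activity
]

# Count how often each activity occurs across the whole log.
activity_counts = Counter(e["activity"] for e in events)

MIN_OCCURRENCES = 2           # drop minimally occurring events
EXCLUDED = {"Change Price"}   # secondary activities agreed with the business

filtered = [
    e for e in events
    if activity_counts[e["activity"]] >= MIN_OCCURRENCES
    and e["activity"] not in EXCLUDED
]
```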

Remove Outliers

If there is one case with a lot of events (an outlier), it will impact expressions that calculate aggregates at the event level. The from/to dashboard item filter is affected by this and can be time-consuming to calculate if such outliers are present. It is recommended to filter out these cases in the Connector to remove them from the dataset.

Note: This does impact metrics. You should only remove outliers in agreement with the business user.
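As a sketch of the idea in Python (the list-of-dicts event log and the `MAX_EVENTS` threshold are illustrative assumptions; in practice this filter lives in the Connector):

```python
from collections import Counter

# Hypothetical event log: C1 is a normal case, C2 is an outlier.
events = [{"case_id": "C1"}] * 3 + [{"case_id": "C2"}] * 500

events_per_case = Counter(e["case_id"] for e in events)

MAX_EVENTS = 100  # threshold to be agreed with the business user

# Drop every event belonging to a case that exceeds the threshold.
kept = [e for e in events if events_per_case[e["case_id"]] <= MAX_EVENTS]
```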

Focus on Outliers

In other instances, the outliers may be the key area to focus on. If your process is running well, or if you adopt Six Sigma methodologies, you want to focus on the things going wrong: instead of showing all the cases going right, you show only the cases going wrong.

See the illustration below.
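The inverse selection can be sketched as follows (Python; the case table layout and the SLA value are hypothetical assumptions for illustration):

```python
from datetime import datetime, timedelta

# Hypothetical case table with start and end timestamps.
cases = [
    {"case_id": "C1", "start": datetime(2024, 1, 1), "end": datetime(2024, 1, 3)},
    {"case_id": "C2", "start": datetime(2024, 1, 1), "end": datetime(2024, 2, 15)},
]

SLA = timedelta(days=30)  # illustrative service-level target

# Keep only the cases that went wrong, i.e. breached the SLA.
problem_cases = [c for c in cases if c["end"] - c["start"] > SLA]
```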



Reducing the Size of the Dataset

In the Connector, you can remove attributes that have a lot of detail. For example, long strings in the Event Detail attribute.

When development is finished, a lot of unused attributes may remain in your dataset. It is recommended to set the availability of only those attributes that are used in the output dataset of the Connector to public, and to set the availability of all other attributes to private.
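Conceptually, restricting the output to public attributes amounts to projecting each record onto the columns the application actually uses. A Python sketch of that projection (the attribute names are hypothetical; in the product this is done via attribute availability settings, not code):

```python
# Hypothetical event records with both used and unused attributes.
events = [
    {
        "case_id": "C1",
        "activity": "Create PO",
        "event_detail": "a very long free-text string with lots of detail",
        "debug_info": "only needed during development",
    },
]

# Attributes the end-user application actually uses ("public");
# everything else stays out of the output dataset ("private").
PUBLIC_ATTRIBUTES = {"case_id", "activity"}

slim = [
    {k: v for k, v in e.items() if k in PUBLIC_ATTRIBUTES}
    for e in events
]
```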

Pre-aggregation

Pre-aggregation is a technique employed by many BI tools to gain insights into large data volumes. It involves aggregating data over specific attributes to reduce the number of records in a dataset. In BI, this would typically mean summing the values per supplier, so that there is only one record for each supplier.

See the illustration below.



Process mining requires more configuration, but a starting point is to aggregate on process variants only. For each variant you would have one case record and a related number of events. This can significantly reduce the data volume.

To show correct results, you would also have to record how many cases each variant represents; for the event end timestamps, you could use the median duration of each event. Aggregating on variants alone might be too coarse, so it is wise to check the most commonly used filters, e.g. a combination of variant, case type, and month of the case end (to show trends over time).

However, each attribute you add multiplies the number of records, so this requires a careful balance between performance and use case.
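A minimal Python sketch of variant-level pre-aggregation, assuming each case record carries its variant (the ordered tuple of its activities) and a duration; the field names and values are illustrative:

```python
from collections import defaultdict
from statistics import median

# Hypothetical case table; "variant" is the ordered activity sequence.
cases = [
    {"case_id": "C1", "variant": ("Create", "Approve"), "duration_days": 2},
    {"case_id": "C2", "variant": ("Create", "Approve"), "duration_days": 6},
    {"case_id": "C3", "variant": ("Create", "Change", "Approve"), "duration_days": 9},
]

# Collect the case durations per variant.
durations = defaultdict(list)
for c in cases:
    durations[c["variant"]].append(c["duration_days"])

# One record per variant: how many cases it represents, plus a median
# duration so dashboards can still show representative throughput times.
aggregated = [
    {
        "variant": variant,
        "case_count": len(ds),
        "median_duration_days": median(ds),
    }
    for variant, ds in durations.items()
]
```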

Pre-aggregation is most applicable for an overview of your process and spotting general trends.

Sampling

Sampling is a technique where you take a percentage of the cases, and their events, occurring in a specific period. You can, for instance, specify that only 10% of all cases and their events are shown. In this way you still see exceptions and outliers, since each case has the same chance of showing up in the dataset.
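Case-level sampling can be sketched as follows (Python, with a fixed seed for reproducibility; the 10% rate mirrors the example above, and the data layout is an assumption):

```python
import random

# Hypothetical dataset: 1,000 cases with one event each.
case_ids = [f"C{i}" for i in range(1000)]
events = [{"case_id": cid, "activity": "Create"} for cid in case_ids]

SAMPLE_RATE = 0.10
rng = random.Random(42)  # fixed seed so the sample is reproducible

# Sample at the case level, so every kept case keeps all of its events;
# each case has the same chance of ending up in the sample.
sampled_cases = {cid for cid in case_ids if rng.random() < SAMPLE_RATE}
sampled_events = [e for e in events if e["case_id"] in sampled_cases]
```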

See illustration below.



Cascaded Sampling

Cascaded sampling is a technique where the sampling percentage drops over time by a certain amount. For example: show 100% of last week's data, 90% of the data from two weeks ago, 80% from three weeks ago, and so on.
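Expressed as a sampling rate that is a function of case age, this can be sketched like so (Python; the 10-percentage-point weekly drop mirrors the example above, and the data layout is an assumption):

```python
import random
from datetime import datetime, timedelta

now = datetime(2024, 6, 1)

# Hypothetical cases, one ending each day going back 200 days.
cases = [{"case_id": f"C{i}", "end": now - timedelta(days=i)} for i in range(200)]

def sample_rate(case_end, now):
    """100% for the last week, dropping 10 percentage points per extra week."""
    weeks_ago = (now - case_end).days // 7
    return max(0.0, 1.0 - 0.10 * weeks_ago)

rng = random.Random(0)  # fixed seed so the sample is reproducible
kept = [c for c in cases if rng.random() < sample_rate(c["end"], now)]
```

Cases from the last week are always kept, while cases ten or more weeks old are always dropped.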

Data Sharding

Data sharding is a data scoping technique that allows organizations to split the data into multiple datasets, rather than just slicing off one part. This setup does require additional configuration, since the application needs to be split up using modules and multiple smaller datasets need to be exported from the Connector.

With data sharding, the original dataset is divided into multiple shards. The smaller each shard is, the faster it will be. When a user logs in to the application, only the applicable data shard will be loaded.

A typical unit for sharding would be “Company code” or “Department”. For example, in the case of 50 company codes, each shard will contain one company code and will essentially be about 50 times smaller, and therefore faster, than the original dataset.
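Splitting a dataset by shard key can be sketched as follows (Python; "company_code" is the hypothetical shard unit, and the record layout is an assumption — in the product the split is configured via modules and Connector exports):

```python
from collections import defaultdict

# Hypothetical case table with a company code per case.
cases = [
    {"case_id": "C1", "company_code": "DE01"},
    {"case_id": "C2", "company_code": "US01"},
    {"case_id": "C3", "company_code": "DE01"},
]

# One shard per company code; when a user logs in, only the shard
# applicable to that user would be loaded.
shards = defaultdict(list)
for c in cases:
    shards[c["company_code"]].append(c)
```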

See the illustration below for an overview of sharding.


