Sharding is an innovative solution for improving performance of your process mining applications. In short, sharding divides the data in your event log into smaller parts called “shards”. The smaller each shard is, the faster it will be.
With a shard, end-users only consider the applicable part of the data they are interested in. When a user logs in to the application, only the applicable data shard will be loaded.
Shards can be divided into two different types:
- Regular shards, which contain parts of your data at the detailed level.
- Benchmark shards, which contain an aggregated, high-level view of all your data.
Multiple techniques exist for creating regular shards as well as for benchmark shards. Regular shards can be created by splitting your data based on case attributes. Benchmark shards combine the data of all shards. Typically, the detail level of the data is reduced using pre-aggregation, filtering, or sampling.
An example attribute for sharding could be Company code, where each shard contains all cases belonging to a single company code. If you were to have 10 company codes in your dataset, every shard will then be approximately 10 times faster than the original (assuming equal splits).
See illustration below.
Besides splitting your data into separate shards, it is useful to have an overview shard containing a higher-level view of all data, a ‘benchmark shard’.
You can set this up in multiple ways:
- By pre-aggregating values or attributes: this prevents you from doing detailed analysis but allows you to still compare differences over shards.
- By lowering the level of detail by filtering out fine-grained events: this enables you to compare processes on a coarse level.
- By filtering: You can remove all event data and only keep tags and the respective cases, this way you can compare tags over multiple shards.
- By sampling: You can sample cases in your dataset to only keep part of the cases, keeping a representative sample of cases as your benchmark dataset.
You can also set up multiple benchmark shards using different methods.
You can use a single connector for your ETL, even when using sharding. You do this by setting up application modules, using one module per shard you want to create.
In your connector, add a system table with table scope set to “current user” to get the ActiveApplicationCode, which indicates the module that is currently active. You can use this attribute from the system table to create conditions for your data loading.
When applying sharding using case types, set up an expression Case_Type_Shard based on the ActiveApplicationCode attribute, to determine what case type belongs to which application code. Then, in the cases_base table, you set the join condition to:
Cases_preprocessing where Case_type_Shard = Case_type
This ensures that only the cases which have a case type belonging to the current shard are passed through in your final output.
You also need to make sure only events belonging to cases in the current shard are in the output. Therefore, in the
events_preprocessing table, create a lookup expression to the
cases_base table which checks whether cases are in the selected shard.
See illustration below.
Use this expression attribute in the join condition of the events_base table with the expression:
Events_preprocessing where Case_in_shard.
The benchmark shard is also set up using the
ActiveApplicationCode attribute. The filtering depends on what type of benchmark shard you want to use and is similar to what is described above for the regular shards.
To set up your application for sharding, you also need one module per shard. These modules must have the same module codes as the ones in your connector.
Furthermore, depending on what type of benchmark shard you use, the data structure may be different for the regular shards and the benchmark shard. If this is the case, you need a separate application for the benchmark shard.
Since you are using multiple modules, you need to reload the data using a script, to make sure data of all connector modules ends up in the same dataset. This way, the application knows, based on the opened module, what part of the data to consider. See Set up Automated Data Refreshes for the script for reloading your data.
Updated 7 months ago