Building custom regex entities
Permissions required: 'Modify Datasets'.
A Custom Regex Entity can be used to extract and format spans of text that have a known repetitive structure, such as IDs or reference numbers.
This is a useful option for simple, structured entities with little variation. For entities that vary significantly, or where context strongly influences predictions, a machine-learning-based entity is the better choice. Both kinds can be combined in any dataset within Communications Mining.
A broader regex (i.e. a set of rules that defines the entity) can also be used as the base of a custom entity. This combines the rules with contextual, machine-learning-based refinement through training within Communications Mining to create sophisticated custom entities. This approach gives the best performance while still restricting the extracted values as needed for automation.
A Custom Regex Entity is made up of one or more Custom Regex Templates. Each template expresses one way to extract (and format) the entity.
Combined together, these templates offer a flexible and powerful way to cover multiple representations of the same entity type.
A template is made of two parts:
- The regex (regular expression), which describes the constraints that need to be met by a span of text to be extracted as an entity
- The formatting, which expresses how to normalise the extracted string into a more standard format
For instance, if your customer IDs can be either the word “ID” followed by 7 digits, or an alphanumeric string of 9 characters, you would define two templates, one for each form. The regex of the first template could be, for example, ID\d{7}.
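The two customer ID forms described above can be tried out in Python before the templates are created. This is a minimal sketch under stated assumptions: the two patterns (including the \b word boundaries) are one possible reading of the description, and the names TEMPLATES and extract_customer_ids are hypothetical, not part of Communications Mining.

```python
import re

# Hypothetical templates: one regex per representation of the customer ID.
TEMPLATES = {
    "id_prefix": re.compile(r"\bID\d{7}\b"),             # "ID" followed by 7 digits
    "alphanumeric": re.compile(r"\b[A-Za-z0-9]{9}\b"),   # any 9 alphanumeric characters
}

def extract_customer_ids(text):
    """Run every template and keep the first template that claims each span."""
    found = {}
    for name, pattern in TEMPLATES.items():
        for m in pattern.finditer(text):
            # "ID1234567" is itself 9 alphanumeric characters, so both
            # templates match it; setdefault keeps the earlier template.
            found.setdefault(m.span(), (name, m.group()))
    return [found[span] for span in sorted(found)]

results = extract_customer_ids("Accounts ID1234567 and AB12CD34E are affected.")
print(results)
```

Note that the two forms overlap: a string such as ID1234567 also satisfies the 9-character alphanumeric pattern, which is why the sketch deduplicates by span.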
The Custom Regex Template can be tested on text to ensure that it behaves as expected. Any entity that would be extracted with the Template will be shown in a list, with its value, as well as the position of the start and end characters.
For instance, with the regex \d{4} and the formatting ID-{$}, a test string containing a single four-digit number will show one extraction.
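The test view described above can be approximated locally with Python's re module. This is only a sketch: the helper preview_template and the test string are hypothetical, the ID- prefix is applied by plain string concatenation rather than by the platform's formatting engine, and Python reports the end position as exclusive, which may differ from the platform's convention.

```python
import re

def preview_template(regex, text, prefix=""):
    """Return (formatted value, start, end) for every match of regex in text."""
    return [(prefix + m.group(), m.start(), m.end())
            for m in re.finditer(regex, text)]

# Assumed test string containing a single four-digit number.
extractions = preview_template(r"\d{4}", "My reference is 4821.", prefix="ID-")
for value, start, end in extractions:
    print(value, start, end)   # prints: ID-4821 16 20
```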
The regex is the pattern used to extract entities in the text. See the regex syntax documentation for details.
Named capture groups can be used to identify a specific section of the extracted string for subsequent formatting. The names of the capture groups should be unique across all templates, and should only contain lowercase letters or digits.
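As an illustration of named capture groups, the following Python sketch uses the (?P<name>...) syntax, which is the same notation that appears in this page's regex examples. The pattern, the group names prefix and number, and the sample text are assumptions chosen for the example.

```python
import re

# Group names use only lowercase letters, following the naming rule above.
pattern = re.compile(r"(?P<prefix>[A-Z]{2})-(?P<number>\d{4})")

m = pattern.search("Please quote reference AB-1234 in your reply.")
full_match = m.group()    # the whole extracted string
groups = m.groupdict()    # each named section of the extracted string
print(full_match, groups)
```

Each named group isolates one section of the match, so it can be referenced separately when the extracted string is formatted.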
Formatting can be provided to post-process the extracted entity.
By default, no formatting is applied and the string returned by the platform will be the string extracted by the regex. However, if needed, more complex transformations can be defined, using the following rules.
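Variables reference capture groups from the regex: a variable is the name of a capture group preceded by the $ symbol. Note that the $ symbol by itself represents the full regex match.
Text in the formatting is returned literally unless it is enclosed in { and } braces.
For instance, to extract 7 digits and prefix them with ID-, the regex and the formatting would be:
Regex | \d{7} |
Formatting | ID-{$} |
Text | My identification number is 1234567 |
Entity returned by the platform | ID-1234567 |
Strings can be concatenated using the & symbol.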
Regex | (?P<id1>\b\d{3}\b)|(?P<id2>\b\d{4}\b) |
Formatting | {$id1 & "-" & $id2} |
Text | The first id is 123 and the second one is 4567 |
Entity returned by the platform | 123-4567 |
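The variable, brace and concatenation rules can be emulated in a few lines of Python to check a formatting string offline. This is a rough sketch, not the platform's implementation: the helpers evaluate and apply_formatting are hypothetical, functions are not supported, and string literals containing & would be split incorrectly. The second example uses an assumed regex in which both groups belong to the same match.

```python
import re

def evaluate(expression, match):
    """Evaluate a brace expression made of $, $name and "literal" terms joined by &."""
    parts = []
    for term in expression.split("&"):
        term = term.strip()
        if term == "$":
            parts.append(match.group())            # full regex match
        elif term.startswith("$"):
            parts.append(match.group(term[1:]))    # named capture group
        else:
            parts.append(term.strip('"'))          # quoted string literal
    return "".join(parts)

def apply_formatting(formatting, match):
    """Keep text outside braces literally and evaluate each {...} expression."""
    return re.sub(r"\{([^{}]*)\}", lambda m: evaluate(m.group(1), match), formatting)

simple = apply_formatting(
    "ID-{$}", re.search(r"\d{7}", "My identification number is 1234567"))
joined = apply_formatting(
    '{$first & "-" & $second}',
    re.search(r"(?P<first>\d{3}) (?P<second>\d{4})", "Call 555 0199 today"))
print(simple, joined)
```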
Some functions can also be used in the formatting to transform the extracted string. The names of the functions and their signatures are inspired by Excel.
The upper function converts all characters in the extracted span to uppercase:
Regex | \w{3} |
Formatting | {upper($)} |
Text | abc |
Entity returned by the platform | ABC |
The lower function converts all characters in the extracted span to lowercase:
Regex | \w{3} |
Formatting | {lower($)} |
Text | AbC |
Entity returned by the platform | abc |
The proper function capitalises the first letter of each word in the extracted span and lowercases the rest:
Regex | \w+\s\w+ |
Formatting | {proper($)} |
Text | albert EINSTEIN |
Entity returned by the platform | Albert Einstein |
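For readers who want to predict the results of the three case functions, the closest Python equivalents are shown here. The mapping is an assumption based on the examples above; in particular, Python's str.title may treat apostrophes and other punctuation differently from the platform's proper function.

```python
def upper(text):
    return text.upper()

def lower(text):
    return text.lower()

def proper(text):
    # str.title capitalises the first letter of each word and lowercases the rest.
    return text.title()

case_results = (upper("abc"), lower("AbC"), proper("albert EINSTEIN"))
print(case_results)
```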
The pad function pads the extracted span on the left up to a given size with a given character.
Function arguments:
- The text containing the characters to be padded
- Size of the padded string
- Character to be used for padding
Regex | \d{2,5} |
Formatting | {pad($, 5, "0")} |
Text | 123 |
Entity returned by the platform | 00123 |
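The pad example above behaves like left-padding, which Python exposes as str.rjust. The sketch assumes that padding is always added on the left, as in the example, and that a span already at or beyond the target size is returned unchanged; the second point is Python's behaviour and is not confirmed for the platform.

```python
def pad(text, size, char):
    # rjust right-aligns the text, so the fill characters are added on the left.
    return text.rjust(size, char)

padded = pad("123", 5, "0")
print(padded)   # prints: 00123
```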
The substitute function replaces characters in the extracted span with other characters.
Function arguments:
- The text containing the characters to be substituted
- The characters to replace
- The characters to replace them with
Regex | ab |
Formatting | {substitute($, "a", "12")} |
Text | ab |
Entity returned by the platform | 12b |
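A Python approximation of substitute is str.replace. This sketch assumes that every occurrence of the old characters is replaced, which matches Python's default but is not stated explicitly for the platform.

```python
def substitute(text, old, new):
    # Replaces every occurrence of `old` with `new`.
    return text.replace(old, new)

substituted = substitute("ab", "a", "12")
print(substituted)   # prints: 12b
```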
The left function returns the first n characters from the span.
Function arguments:
- The text containing the characters to be extracted
- The number of characters to return
Regex | \w{4} |
Formatting | {left($, 2)} |
Text | ABCD |
Entity returned by the platform | AB |
The right function returns the last n characters from the span.
Function arguments:
- The text containing the characters to be extracted
- The number of characters to return
Regex | \w{4} |
Formatting | {right($, 2)} |
Text | ABCD |
Entity returned by the platform | CD |
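The left and right functions correspond to string slicing in Python. The sketch below is an approximation for checking expected values; how the platform handles an n larger than the span length is not documented here, so the slicing behaviour (returning the whole span) is an assumption.

```python
def left(text, n):
    return text[:n]

def right(text, n):
    # Computing the start index explicitly keeps n == 0 returning an empty string.
    return text[max(len(text) - n, 0):]

sliced = (left("ABCD", 2), right("ABCD", 2))
print(sliced)   # prints: ('AB', 'CD')
```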