Document Understanding User Guide
RegEx Based Extractor
The Regex Based Extractor is well suited to simple use cases in which, for certain fields, data is always found in a strict, predictable format and context. In other words, if you can define a regular expression that reliably matches the correct value for a field, the Regex Based Extractor is a good choice.
The activity comes with a configuration wizard that assists you in defining the regular expressions for the fields you want to target for data extraction in this way.
The activity supports both simple field and table field extraction.
If the context and format of the expected values vary widely, consider other extraction methods. In such cases, either a Form Extractor or a Machine Learning Extractor may be better suited.
This extractor does not have learning (training) capabilities and requires up-front configuration.
The Regex Based Extractor has two major configurations to be considered:
- the Configure Regular Expressions wizard - which allows you to define regular expressions for certain fields. This wizard also makes available the Regex Editor wizard, which assists you in building your regular expressions.
- the UseVisualAlignment setting - which allows you to control whether the regular expressions configured for an extractor should be applied to the text output of the digitization component, or to a text version in which text lines are organized visually, and words are rearranged on lines based on their visual alignment.
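To illustrate why the choice of text version can matter, here is a minimal Python sketch. The pattern and both sample texts are invented for this illustration and do not come from the product; the extractor itself uses .NET regular expressions, and the actual digitization output may look different.

```python
import re

# Hypothetical pattern: a "Total:" label followed, on the same line, by an amount.
TOTAL_PATTERN = re.compile(r"Total:[ \t]+(\d+\.\d{2})")

# Invented sample: text in which a two-column layout was read column by
# column, so labels and values end up on separate lines.
reading_order_text = "Invoice No:\nTotal:\nA-17\n123.45"

# Invented sample: the same content with words placed on lines according
# to their visual alignment on the page.
visually_aligned_text = "Invoice No: A-17\nTotal: 123.45"


def extract_total(text):
    """Return the captured amount, or None when the pattern does not match."""
    match = TOTAL_PATTERN.search(text)
    return match.group(1) if match else None


print(extract_total(reading_order_text))
print(extract_total(visually_aligned_text))
```

In this sketch the pattern only matches the visually aligned sample, because the label and the amount share a line there.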
The Configure Regular Expressions wizard can be used to define the regular expressions that capture data for both simple and table fields.
The following Regex Options are available:
- CultureInvariant - Specifies that linguistic cultural differences are ignored.
- ECMAScript - Enables ECMAScript-compliant behavior for the expression. This value can be used only in conjunction with the IgnoreCase and Multiline options.
- ExplicitCapture - Specifies that the only valid captures are groups that are explicitly named or numbered, defined as (?<name> subexpression). Any unnamed parentheses are ignored.
- IgnoreCase - Specifies that the search is not case sensitive.
- IgnorePatternWhitespace - Eliminates unescaped white space from the defined pattern and enables comments marked with #. This option does not apply to character classes, numeric quantifiers, or tokens marking the beginning of an individual RegEx language element.
- Singleline - Specifies that the search is performed in single-line mode. The dot (.) matches all characters, including \n, which it otherwise does not match.
- Multiline - Specifies that the search is performed in multiline mode. With this option, the special characters ^ and $ match the beginning and the end of any line.
- RightToLeft - Specifies that the search is performed from right to left.
Note: More information about the Regular Expression Options can be found here.
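As a rough illustration of what some of these options change, the sketch below uses the closest flags in Python's `re` module. This is an analogy only: the extractor uses .NET options, Python has no direct counterpart for options such as RightToLeft or ExplicitCapture, and Python writes named groups as `(?P<name>...)` rather than `(?<name>...)`. The sample text is invented.

```python
import re

text = "Invoice\nTOTAL: 42\ntotal: 7"

# IgnoreCase ~ re.IGNORECASE: the search is not case sensitive.
ignore_case_hits = re.findall(r"total", text, re.IGNORECASE)

# Multiline ~ re.MULTILINE: ^ and $ match at the start and end of every line.
multiline_hits = re.findall(r"^total: \d+$", text, re.MULTILINE)

# Singleline ~ re.DOTALL: the dot also matches the newline character.
dot_without_flag = re.search(r"Invoice.TOTAL", text)
dot_with_flag = re.search(r"Invoice.TOTAL", text, re.DOTALL)

# IgnorePatternWhitespace ~ re.VERBOSE: unescaped white space in the
# pattern is ignored and "#" starts a comment.
verbose_pattern = re.compile(
    r"""
    (?P<label>total)   # named group for the label
    :\s                # colon and one white space character
    (?P<amount>\d+)    # named group for the amount
    """,
    re.VERBOSE | re.IGNORECASE,
)
first = verbose_pattern.search(text)

print(ignore_case_hits)
print(multiline_hits)
print(dot_without_flag is None, dot_with_flag is not None)
print(first.group("label"), first.group("amount"))
```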
- Click the Edit button to edit the options of that field and the format of the regular expression.
- Add text in the Test Text field to test your search criteria against the text you want to apply the RegEx to.
- Select one of the RegEx formula types from the drop-down list. This sets the regular expression to match one of the following characteristics:
- Literal - Matches the exact characters specified by you. This option is case sensitive.
- Digit - Matches a digit.
- One of - Matches a single character present in the set.
- Not one of - Matches a single character not present in the set.
- Anything - Matches any character, except for \n.
- Any word character - Matches any letters and numbers.
- Whitespace - Matches one white space.
- Starts with - Initiates the search where the line starts.
- Ends with - Initiates the search where the line ends.
- Advanced - Requires a custom expression.
- Email - Matches an email address.
- URL - Matches a URL.
- US date - Matches the US date format.
- US phone number - Matches the US phone number format.
Note: More information about the Regular Expressions in .NET can be found here.
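The sketch below shows how several of the formula types listed above could correspond to standard regular expression fragments. The mapping is an assumption based on common regex syntax, not the wizard's confirmed output, and the helper names and the sample "PO-" pattern are hypothetical. Python is used for illustration; the extractor itself relies on .NET regular expressions.

```python
import re


def literal(value):
    """Literal: the exact characters, escaped so they carry no special meaning."""
    return re.escape(value)


def one_of(chars):
    """One of: a single character present in the set."""
    return "[" + re.escape(chars) + "]"


def not_one_of(chars):
    """Not one of: a single character not present in the set."""
    return "[^" + re.escape(chars) + "]"


# Assumed fragments for the remaining basic types.
DIGIT = r"\d"
ANYTHING = r"."
WORD_CHAR = r"\w"
WHITESPACE = r"\s"
STARTS_WITH = r"^"
ENDS_WITH = r"$"

# Hypothetical field: a line made of "PO-", one of A/B/C, and one digit.
pattern = STARTS_WITH + literal("PO-") + one_of("ABC") + DIGIT + ENDS_WITH

print(bool(re.search(pattern, "PO-B7")))
print(bool(re.search(pattern, "PO-D7")))
print(bool(re.search(pattern, "po-B7")))  # Literal is case sensitive
```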
- Use the Value field to write the value of the regular expression.
- Select a quantifier from the Quantifiers drop-down list.
1
.
- Any (0 or more) - Matches the preceding element zero or more times, but as few times as possible.
- At least one (1 or more) - Matches the preceding element one or more times.
- Zero or one - Matches the preceding element zero or one time, but as few times as possible.
- Between x and y times - Matches the preceding element between x and y times, where x and y are integers, but as few times as possible.
- Use the add button to add an extra RegEx field. Move fields up and down in the hierarchy by using the move up and move down buttons. Use the delete button to delete a field.
- Select the check box for the Capture option if you want to extract that specific field.
- The Full Expression field shows the entire expression, exactly how it was customized by you.
- Select one or more options from the Regex Options drop-down list.
- Click the Save button once all your configurations are done to exit the Edit mode, and then click Save once again to close the wizard.
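The steps above combine fragments, quantifiers, and the Capture option into one full expression. The sketch below is one way this assembly could look, assuming that the "as few times as possible" quantifiers correspond to lazy regex quantifiers and that Capture wraps a part in a group. The function names, the suffix mapping, and the sample "Ref:" field are hypothetical, and Python stands in for the .NET engine the extractor uses.

```python
import re

# Assumed quantifier suffixes; the lazy "?" forms mirror the
# "as few times as possible" wording of the quantifier descriptions.
QUANTIFIERS = {
    "any": "*?",
    "at_least_one": "+",
    "zero_or_one": "??",
}


def between(x, y):
    """Assumed suffix for a lazy 'between x and y times' quantifier."""
    return "{%d,%d}?" % (x, y)


def build_expression(parts):
    """Join (fragment, quantifier_suffix, capture) tuples into one pattern."""
    pieces = []
    for fragment, suffix, capture in parts:
        piece = fragment + suffix
        pieces.append("(" + piece + ")" if capture else piece)
    return "".join(pieces)


# Hypothetical field: "Ref:", optional white space, then 2 to 4 digits (captured).
full_expression = build_expression([
    (re.escape("Ref:"), "", False),
    (r"\s", QUANTIFIERS["any"], False),
    (r"\d", between(2, 4), True),
])

lazy_value = re.search(full_expression, "Ref: 12345").group(1)
greedy_value = re.search(r"Ref:\s*(\d{2,4})", "Ref: 12345").group(1)

print(lazy_value)
print(greedy_value)
```

In this sketch the lazy form captures only the minimum two digits because nothing follows the captured part, while the greedy comparison pattern captures four. When a lazy quantifier ends the expression, adding a trailing element such as an end-of-line anchor changes how far the match extends.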