Document Understanding Activities
RegEx Based Extractor
UiPath.IntelligentOCR.Activities.DataExtraction.RegexBasedExtractor
Enables you to create and use a custom regular expression to extract information from a document. This activity can be used only together with the Data Extraction Scope activity.
set
or boolean
fields.
Designer panel
Configure Expressions - Opens the Configure Regular Expressions wizard.
Properties panel
Common
- DisplayName - The display name of the activity.
Input
- Configuration - Specifies the configuration value for the extractor as a JSON escaped string. Use the extractor wizard to generate the configuration. You can keep the configuration in the Properties panel, as a string, or you can define it by using the wizard and bind it to a variable. It is advisable to edit the Configuration field by using the wizard and not the Properties panel.
- Timeout - Specifies the timeout value for any Regex search, in milliseconds. A timeout of 0, or a negative value, is interpreted as infinite. The default value is 2000.
- UseVisualAlignment - If selected, the regular expressions are applied to a text version generated based on visual word alignments (words separated by a single space character, lines separated by a single newline character, and pages separated by two newline characters). The default value is False. This option can be used for complex layouts where it is easier to write regular expressions based on how words are visually organized on lines, ignoring any sentence, paragraph, or layout group otherwise identified in the document.
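The text layout described for UseVisualAlignment can be pictured with a short Python sketch. This is only an illustration and not the activity's implementation: the nested-list document structure and the `visual_text` helper are hypothetical, and the sketch assumes that pages are separated by two newline characters.

```python
import re
from typing import List

# Hypothetical structure: a document is a list of pages, a page is a list of
# lines, and a line is a list of words already ordered by visual position.
Document = List[List[List[str]]]


def visual_text(document: Document) -> str:
    """Join words with one space, lines with one newline, pages with two newlines."""
    pages = []
    for page in document:
        lines = [" ".join(words) for words in page]
        pages.append("\n".join(lines))
    return "\n\n".join(pages)


doc = [
    [["Invoice", "No:", "INV-001"], ["Total:", "120.50"]],
    [["Page", "2"]],
]
text = visual_text(doc)

# A regular expression can now rely on words that share a visual line.
total = re.search(r"Total: (\d+\.\d{2})", text).group(1)
```

Because each visual line becomes one line of text, an expression such as the one above can anchor on a label and capture the value printed next to it.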
Misc
- Private - If selected, the values of variables and arguments are no longer logged at Verbose level.
- Add a RegEx Based Extractor activity to your workflow, within a Data Extraction Scope activity.
- Configure your regular expressions by selecting Configure Expressions. The wizard window opens.
Figure 1. Overview of the Configure Regular Expressions wizard
- Expand a document type entry to see all defined fields and to start configuring your regular expressions. Document types and their respective fields are automatically read from the project's Taxonomy. The Regex configuration option is available for every field in the taxonomy. The wizard can present the following configuration options:
- A document type that, when expanded, displays a single regular field.
For a simple field, only a single regular expression can be defined using the Configure Regular Expressions wizard that opens when you select Edit next to that field.
Figure 2. A document type in the Configure Regular Expressions wizard that has a regular field defined
- A document type that, when expanded, displays a table field, showing configuration options for a table, such as an expression for the entire table content or an expression for individual rows.
The following settings and options are available for a table field configuration:
- The Table Value RegEx can be used for capturing an entire table area. If no value is added in the Table field line, the entire text content of the document is considered onward for table processing.
- The Rows Value RegEx can be used for capturing an entire row from a given table capture. If no value is added in the Rows field line, the table area is split by end-of-line. Each captured value is considered from this point forward as a row on which the column extraction is to be applied.
- The Column Value RegEx can be used for capturing the value of a particular column, from each captured row.
Figure 3. A document type in the Configure Regular Expressions wizard that has a table field defined
Scenarios of using the table, rows and column RegEx
Check the following possible scenarios for using the available table RegEx options:
- If you leave the Table RegEx and the Rows RegEx fields empty, all lines in the text version of the document are used to apply the Column RegExes for cell value identification.
- If you define a RegEx to capture the table area, but leave the Rows RegEx empty, all lines in the table capture are individually processed using each Column RegEx to capture the cell values.
- If you leave the Table RegEx empty but define a Rows RegEx, then all text captured with the Rows RegEx is used and the Column RegExes are applied to capture cell values for each row.
- If you fill in both the Table RegEx and the Rows RegEx, the activity applies the Table RegEx to identify the table string, then applies the Rows RegEx to identify each row, followed by the Column RegExes for capturing cell values.
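The four scenarios above can be sketched as a three-stage pipeline. The following Python example is a minimal illustration under stated assumptions, not the activity's actual code: the `extract_table` function, its parameters, and the sample text are hypothetical, the first capture group is taken as the reported value at every stage, and empty lines are skipped when no Rows RegEx is given.

```python
import re
from typing import Dict, List, Optional


def extract_table(text: str,
                  column_patterns: Dict[str, str],
                  table_pattern: Optional[str] = None,
                  row_pattern: Optional[str] = None) -> List[Dict[str, Optional[str]]]:
    # Stage 1: table area. Without a table pattern, use the whole text.
    area = text
    if table_pattern:
        found = re.search(table_pattern, text)
        area = found.group(1) if found else ""
    # Stage 2: rows. Without a row pattern, split the area by end-of-line.
    if row_pattern:
        rows = [m.group(1) for m in re.finditer(row_pattern, area)]
    else:
        rows = [line for line in area.splitlines() if line.strip()]
    # Stage 3: columns. Apply every column pattern to every row.
    table = []
    for row in rows:
        cells = {}
        for name, pattern in column_patterns.items():
            found = re.search(pattern, row)
            cells[name] = found.group(1) if found else None
        table.append(cells)
    return table


sample = "Invoice INV-001\nItems:\nWidget 2 10.00\nGadget 1 5.50\nEnd of items\n"
columns = {
    "Description": r"^(\w+)\s",
    "Quantity": r"\s(\d+)\s",
    "Price": r"(\d+\.\d{2})$",
}
table_regex = r"(?s)Items:\n(.*?)End of items"
rows_regex = r"(?m)^(\w+ \d+ \d+\.\d{2})$"

no_table_no_rows = extract_table(sample, columns)
table_only = extract_table(sample, columns, table_pattern=table_regex)
rows_only = extract_table(sample, columns, row_pattern=rows_regex)
table_and_rows = extract_table(sample, columns, table_regex, rows_regex)
```

In this sample, the last three calls return the same two rows, while the first call treats every line of the text as a row, including lines that are not part of the table.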
- Add your regular expression in the Expression field. You have the option of either writing the whole RegEx in the Expression field or building it by using the Edit option.
Important: For any of the regular expressions you define, make sure you have at least one capture group. Only the captured parts of an expression are used for value reporting.
- Select the dropdown list from the Regex Options column. You can set various regex options from this multi-select option.
You can choose from the following options:
- CultureInvariant - Specifies that the linguistic cultural differences are ignored.
- ECMAScript - Enables ECMA (European Computer Manufacturers Association) Script compliant behavior for the expression. This value can be used only in conjunction with the IgnoreCase and Multiline options.
- ExplicitCapture - Specifies that the only valid captures are the ones of groups that are explicitly named or numbered and are defined as (?<name> subexpression). Any unnamed parentheses are ignored.
- IgnoreCase - Specifies that the search is not case sensitive.
- IgnorePatternWhitespace - Eliminates the unescaped white space from the defined pattern and enables the comments marked with # (hash symbol). This option does not apply to character classes, numeric quantifiers, or tokens marking the beginning of an individual RegEx language element.
- Singleline - Specifies that the search uses single-line mode. The dot (.) matches all characters, including \n, which it otherwise excludes.
- Multiline - Specifies that the search uses multiline mode. For this option, the special characters ^ and $ match the beginning and the end of any line.
- RightToLeft - Specifies that the search is performed from right to left.
Note: Visit RegexOptions Enum for more information about the regular expression options you can use.
Figure 4. The expanded Regex Options dropdown showing the available options
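The effect of several of these options can be tried out with Python's `re` module, which offers analogous flags. This is a hedged illustration only: the activity uses .NET regular expressions, the flag pairing below (IgnoreCase with `re.IGNORECASE`, Multiline with `re.MULTILINE`, Singleline with `re.DOTALL`, IgnorePatternWhitespace with `re.VERBOSE`) is an approximation, and details such as the named-group syntax differ between the two engines. The sample text is hypothetical.

```python
import re

text = "Total: 10\ntotal: 20"

# IgnoreCase analog: without the flag only the lowercase label matches.
case_sensitive = re.findall(r"total: (\d+)", text)
ignore_case = re.findall(r"total: (\d+)", text, re.IGNORECASE)

# Multiline analog: ^ matches at the start of every line, not only the text.
first_line_only = re.findall(r"^(\w+):", text)
every_line = re.findall(r"^(\w+):", text, re.MULTILINE)

# Singleline analog: the dot also matches the newline character.
dot_default = re.search(r"Total: (.+)", text).group(1)
dot_all = re.search(r"Total: (.+)", text, re.DOTALL).group(1)

# IgnorePatternWhitespace analog: unescaped white space in the pattern is
# dropped and # starts a comment.
commented = re.compile(r"""
    total:\s   # the label and one white-space character
    (\d+)      # capture group: the amount to report
""", re.VERBOSE | re.IGNORECASE)
verbose_result = commented.findall(text)
```

In every pattern above, only the part inside the parentheses is returned, which mirrors the requirement to define at least one capture group.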
- Select Edit to edit the options of that field and the format of the regular expression. The RegEx Builder wizard opens.
Figure 5. Overview of the RegEx Builder wizard
- Enter your desired text in the Test Text field. This is the text that the RegEx is applied to, based on the search criteria you choose. Then enter a value in the Value field of the RegEx. The matching value becomes highlighted in the Test Text field as well.
Figure 6. Entering text in the Test Text field and highlighting a certain value from it using the Value field
- Select one of the RegEx formula types from the dropdown list. This sets the regular expression to match one of the following characteristics:
- Literal - Matches the exact characters specified by you. This option is case sensitive.
- Digit - Matches a digit.
- One of - Matches a single character present in the set.
- Not one of - Matches a single character not present in the set.
- Anything - Matches any character, except for \n.
- Any word character - Matches any letters and numbers.
- Whitespace - Matches one white space.
- Starts with - Initiates the search where the line starts.
- Ends with - Initiates the search where the line ends.
- Advanced - Requires a custom expression.
- Email - Matches an email address.
- URL - Matches a URL.
- US date - Matches the US date format.
- US phone number - Matches the US phone number format.
Figure 7. The dropdown list showing the available characteristics for the regular expression
Note: Visit .NET regular expressions for more information about regular expressions in .NET.
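As a rough guide to what the simpler formula types stand for, the following Python sketch maps them to common regular expression tokens. The mapping is an assumption made for illustration and is not taken from the RegEx Builder itself: the `token` helper is hypothetical, the expressions that the builder generates may differ, and the Email, URL, US date, and US phone number types are left out because their patterns are not documented here.

```python
import re

# Hypothetical mapping from formula types to regex tokens (an assumption,
# not the builder's documented output).
SIMPLE_TOKENS = {
    "Digit": r"\d",
    "Anything": ".",
    "Any word character": r"\w",  # in most engines \w also matches underscore
    "Whitespace": r"\s",
    "Starts with": "^",
    "Ends with": "$",
}


def token(kind: str, value: str = "") -> str:
    """Return a regex fragment for one formula type."""
    if kind == "Literal":
        return re.escape(value)  # exact characters, special ones escaped
    if kind == "One of":
        return "[" + re.escape(value) + "]"
    if kind == "Not one of":
        return "[^" + re.escape(value) + "]"
    if kind == "Advanced":
        return value  # custom expression, used as written
    return SIMPLE_TOKENS[kind]


# "INV-" at the start of the line, followed by a captured run of digits.
pattern = token("Starts with") + token("Literal", "INV-") + "(" + token("Digit") + "+)"
match = re.search(pattern, "INV-00123 issued")
captured = match.group(1)
```

Escaping the Literal value keeps characters such as the dot or the hyphen from being read as regex syntax, which is the behavior you would expect from an exact-match option.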
- Use the Value field for writing the value of the regular expression.
- Select a quantifier from the Quantifiers dropdown list. You can choose from the following options:
- Exactly - Matches the preceding element exactly the specified number of times. By default, it is set to 1.
- Any (0 or more) - Matches the preceding element zero or more times, but as few times as possible.
- At least one (1 or more) - Matches the preceding element one or more times.
- Zero or one - Matches the preceding element zero or one time, but as few times as possible.
- Between x and y times - Matches the preceding element between x and y times, where x and y are integers, but as few times as possible.
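The phrase "as few times as possible" describes lazy matching. The Python sketch below shows the difference between lazy and greedy quantifiers. The suffixes assigned to each option in the hypothetical `quantify` helper are an assumption based on the descriptions above, not the builder's confirmed output.

```python
import re


def quantify(token: str, kind: str, x: int = 1, y: int = 1) -> str:
    """Append a quantifier suffix to a regex token (assumed mapping)."""
    suffixes = {
        "Exactly": "{%d}" % x,
        "Any (0 or more)": "*?",           # lazy
        "At least one (1 or more)": "+",   # greedy
        "Zero or one": "??",               # lazy
        "Between x and y times": "{%d,%d}?" % (x, y),  # lazy
    }
    return token + suffixes[kind]


tags = "<a><b>"
# Lazy: stops at the first closing bracket.
lazy = re.search("<" + quantify(".", "Any (0 or more)") + ">", tags).group(0)
# Greedy, for comparison: runs to the last closing bracket.
greedy = re.search("<.*>", tags).group(0)

digits = "12345"
exactly_four = re.search(quantify(r"\d", "Exactly", 4), digits).group(0)
between_two_and_four = re.search(
    quantify(r"\d", "Between x and y times", 2, 4), digits).group(0)
at_least_one = re.search(quantify(r"\d", "At least one (1 or more)"), digits).group(0)
```

A lazy quantifier on its own matches the minimum allowed count, so it is usually followed by another element, such as the closing bracket above, that forces the match to extend just far enough.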
- To edit fields, you can use the following options:
- Select Add to add an extra RegEx field.
- Select Move up and Move down to move fields up and down in the hierarchy.
- Select Remove to delete the field.
- Select the check box for the Capture option if you want to extract that specific field.
- The Full Expression field shows the entire expression exactly how you customized it.
- Select one or multiple options from the Regex Options dropdown list.
Figure 8. The available options in the Regex Options dropdown list
- Select Save once all your configurations are done to exit the Edit mode.
- Select Save again to close the wizard.