Form Extractor
UiPath.IntelligentOCR.Activities.DataExtraction.FormExtractor
The Form Extractor is best suited for extracting, matching, and reporting specific information by analyzing the word's position inside the document, or detecting a signature. This activity can be used only together with the Data Extraction Scope activity. Handwritten text can also be detected if the Form Extractor activity is used along with the UiPath Document OCR activity.
Properties panel
Common
- DisplayName - The display name of the activity.
Input
- ApiKey - Specifies the API key of the account. The API Key field is automatically pre-populated if defined in local project settings or in the Document Understanding framework.
- Endpoint - The URL to UiPath® server. By default, the endpoint is https://du.uipath.com/svc/formextractor. For more information, visit Document Understanding Public Endpoints.
- MinOverlapPercentage - Specifies the minimum overlap area (in percentage) between a box in the document and a box in the template required to make an extraction. The percentage value can be set between 0 and 100. The default value is 65.
- Timeout - Specifies the amount of time (in milliseconds) to wait for a response from the server before an error is thrown. The default value is 100000 milliseconds (100 seconds).
Misc
- Private - If selected, the values of variables and arguments are no longer logged at Verbose level.
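The MinOverlapPercentage check can be pictured with a minimal sketch. It assumes that overlap is measured as the intersection area divided by the area of the template box; the source does not state the exact formula, and the `Rect` type and function names are hypothetical, not part of the activity's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical axis-aligned box: top-left corner plus size."""
    left: float
    top: float
    width: float
    height: float

    @property
    def area(self) -> float:
        return self.width * self.height


def overlap_percentage(document_box: Rect, template_box: Rect) -> float:
    """Percentage of the template box covered by the document box.

    Assumption: the ratio is taken against the template box area.
    """
    x_overlap = (min(document_box.left + document_box.width,
                     template_box.left + template_box.width)
                 - max(document_box.left, template_box.left))
    y_overlap = (min(document_box.top + document_box.height,
                     template_box.top + template_box.height)
                 - max(document_box.top, template_box.top))
    if x_overlap <= 0 or y_overlap <= 0 or template_box.area == 0:
        return 0.0
    return 100.0 * x_overlap * y_overlap / template_box.area


def passes_min_overlap(document_box: Rect, template_box: Rect,
                       min_overlap_percentage: float = 65.0) -> bool:
    """True when the overlap reaches the configured threshold (default 65)."""
    return overlap_percentage(document_box, template_box) >= min_overlap_percentage
```

Under this assumed definition, a document box that covers half of the template box would be rejected at the default threshold of 65 and accepted if the threshold were lowered to 50.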
Note: Multiple templates can be defined for one Document Type. When the activity is run, the extractor selects the best matching template based on the information found on the first page.
The Template Manager wizard allows you to create, edit, manage, and export/import templates for the document types defined in the taxonomy.
Creating a template
- Add a Form Extractor activity to your workflow, within a Data Extraction Scope.
- Configure the extractor by selecting Manage Templates. The Template Manager window opens.
Figure 1. Overview of the Template Manager wizard
- Select Create Template to create a new template.
Figure 2. Overview of the Create a new template configuration fields
Note: If the UiPath.IntelligentOCR.Activities package has been updated to v5.1.0, the ForceApplyOCR parameter has been replaced with ApplyOcrOnPDF. The old values map to the new ones as follows:
- ForceApplyOCR = True is replaced by ApplyOcrOnPDF = Yes;
- ForceApplyOCR = False is replaced by ApplyOcrOnPDF = Auto;
- ForceApplyOCR = Empty is replaced by ApplyOcrOnPDF = Auto;
- ForceApplyOCR = <user-defined variable> is replaced by ApplyOcrOnPDF = Auto.
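The compatibility list above amounts to a small lookup. The sketch below restates it in Python for illustration only; the function name is hypothetical, `None` stands in for an empty value, and a string stands in for a user-defined variable.

```python
from typing import Optional, Union


def migrate_force_apply_ocr(old_value: Optional[Union[bool, str]]) -> str:
    """Map a legacy ForceApplyOCR setting to its ApplyOcrOnPDF replacement.

    Only a literal True becomes "Yes"; False, empty (None), and
    user-defined variables (represented here by their name) become "Auto".
    """
    if old_value is True:
        return "Yes"
    return "Auto"
```

For example, `migrate_force_apply_ocr(True)` gives `"Yes"`, while `False`, `None`, and a variable name all give `"Auto"`.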
The Apply OCR on PDF option establishes if the OCR process should be applied or not to PDF documents. Three options are available in the dropdown list: True, False, and Auto.
If set to True, OCR is applied to all PDF pages of the document. If set to False, only digitally typed text is extracted. The default value is Auto, which determines whether OCR needs to be applied based on the input document.
Each OCR engine comes with its own set of custom options. Visit OCR Engine for more details about all options available for each OCR engine. The default OCR engine is UiPath Document OCR.
- Select the document type for your template from the Document Type dropdown list.
Note: All Document Types are based on the Taxonomy. Make sure to add or create a taxonomy inside the project's folder.
- Add the name of the template in the Template name field. Choose a relevant name that reflects the version or the layout of your document.
- Add the document's path in the Template document field. Navigate to the file's path by using the Browse option.
- Select an OCR engine from the OCR Engine dropdown list, and configure it as needed.
- Select Configure to trigger the template editing.
If you have already created a template, then it can be edited, exported, or removed. Delete and Export options become available only when at least one template is selected. The Edit and Remove options for an individual template are always available.
Boolean content interpretation, which is mapping a captured value to a Yes or No reported value.
You can import templates created and exported from other workflows. Use these features to share templates between projects. Once a document type is configured using the Form Extractor, you don't need to reconfigure the templates in a new implementation.
Exporting Procedure
Here are the steps you need to follow to export a template:
- Create one or more templates by following the steps explained at the beginning of this page.
- Select the templates you want to export.
- Select an Export option:
  - Export with original files. Exporting with original files attaches them to the export.
  - Export without original files
Figure 5. The action of selecting the Export with original files options
- Save the template's archive with the desired name.
- A message is displayed once the template is saved. Select OK.
Figure 6. The "X" template(s) successfully exported message
Note: If you cannot share the content of the documents you have built your templates on, then use the Export without original files option. You are still able to share and import the template archive in other projects, but you cannot edit or view them anymore.
If you want to edit the templates once imported in a different project, make sure to use the Export with original files option when exporting and then importing them.
Importing Procedure
Here are the steps you need to follow to import a template:
- Select Import.
Figure 7. The action of selecting Import in the Template Manager wizard
- Select an archive. The import wizard appears and presents all document types and all templates available in the selected export archive. Select the templates you wish to import and choose the desired Import option:
  - Import with original files
  - Import without original files
Figure 8. The Import options in the Template Manager wizard
Note:
- When templates are imported, Document Types are created automatically in the project's Taxonomy. If a Document Type with the same name already exists, another one is created by appending a count to the Document Type name.
- If you are importing templates that have been exported without the original files, or if you choose to import templates without the original files, then you have no view or edit options for those templates.
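The name-collision rule in the note above (a count is appended when a Document Type with the same name already exists) could look like the following sketch. The exact suffix format is not given in the source; the `"Name (1)"` pattern and the function name are assumptions made purely for illustration.

```python
from typing import Set


def unique_document_type_name(name: str, existing_names: Set[str]) -> str:
    """Return `name` unchanged if it is free, otherwise append a count.

    Assumption: the count is appended as " (n)", starting at 1.
    """
    if name not in existing_names:
        return name
    count = 1
    while f"{name} ({count})" in existing_names:
        count += 1
    return f"{name} ({count})"
```

With an existing taxonomy containing `Invoice`, importing another `Invoice` would yield `Invoice (1)` under this assumed format.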
When a template is imported, several special situations might occur. The following list explains each situation and its particularities:
- New document type: If a new document type is imported, then a new field is added in the wizard configurator, informing you that a new template is to be created.
- Duplicate document type: If an identical document type is imported, then the following warning message appears: "This template already exists and it will be overwritten."
- Extended template: If a document type template that includes more fields than the existing one is imported, then the following warning message appears: "This document type will be updated as follows: The following field(s) do not exist and will be created".
- Extended document type: If the user imports a document type that includes more fields than the existing one, then the following warning message appears: "This document type will be updated as follows: The following field(s) don't have configurations to import".
- Document type with identical name but different content: If the user imports a document type that has the same name as the existing one but different fields, then the following warning message appears: "This document type will be updated as follows":
  - "The following fields do not exist and will be created"
  - "The following fields don't have configurations to import"
- Document type with missing table: If the user imports a document type that doesn't include a table, then the following warning message appears: "This document type will be updated as follows: The following field(s) don't have configurations to import."
- Document type with extended table: If the user imports a document type that includes a table with extra columns, then the following warning message appears: "This document will be updated as follows: The following field(s) do not exist and will be created".
- Document type with reduced table: If the user imports a document type that includes a table with missing columns, then the following warning message appears: "This document will be updated as follows: The following field(s) don't have configurations to import"
- Table template with different document types: If you import a document type template that includes a table with different document types, then a new template is created. If your taxonomy includes a table that has a field with a different document type, then the following message appears: "The field with id xyz was found both in the imported taxonomy as well in the existing taxonomy but their types are incompatible (either both should be tables or neither of them)."
General Considerations
The Template Editor is built on top of the functionality present in the Validation station. To access it, select Edit for a template.
Visit Validation Station to learn about the basic usage of the Validation Station.
- : Sets the anchor selection mode;
- : Clears the whole anchor selection.
When creating a new template, an explanation text appears when first opening the Template Editor. In case you want to access the text again, go in the document view section on the right side, select More Options, and then Show explanation text.
Table information can be modified at cell or table level. Visit Present Validation Station for more information about how to configure tables at cell level and at table level.
Anchors can be defined once the Template Editor is opened from the Template Manager and can be found among the Selection Mode options.
When defining or editing a page-level template, start with the Page 1 Matching Info selection. This step is mandatory only for fixed form templates; for other templates it is optional.
Situated on the left side of the screen, the Page 1 Matching Info selection requires a text input (tokens only are accepted) from the first page of the template that is always in the same position within that particular template layout and forms a unique graph of words (considering relative distances and angles between words) across all the templates defined for a particular document type.
In other words, the Page 1 Matching Info (and all other Page Matching Info fields) are "fingerprints" of a particular page and are extensively used in identifying the right matching template at runtime.
For this reason, for the Page 1 Matching Info field, it is strongly recommended to select 10 to 20 words, preferably longer, spread across the entire page area.
The other Page Matching Info fields (one for each template page) must be filled in only if you are attempting data extraction from that particular page, and do not require cross-template uniqueness anymore. If no fields need to be extracted from a particular page, defining the page-level matching info for that page is not mandatory.
For all fields other than tables, configuring the template consists of selecting a Custom Area and assigning it to a particular field.
For fixed form configurations, data fields can only be configured using Custom Area selections.
For a field you can define one or more such Custom Areas, using the Add button. If two or more Custom Areas are defined for a single field, then at runtime, if the field is defined in the Taxonomy as Single Value, all values are concatenated into a single reported value. If the field is defined as Multi Value, then each value is reported individually.
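The Single Value and Multi Value behavior described above can be sketched as follows. The function name is hypothetical, and the separator used when concatenating is an assumption, since the source does not say how the values are joined.

```python
from typing import List


def report_field_values(captured_values: List[str],
                        is_multi_value: bool,
                        separator: str = " ") -> List[str]:
    """Combine the values captured from several Custom Areas of one field.

    Multi Value fields report each captured value individually; Single
    Value fields report one concatenated value. The single-space
    separator is an assumption.
    """
    if is_multi_value:
        return list(captured_values)
    if not captured_values:
        return []
    return [separator.join(captured_values)]
```

For two Custom Areas capturing `"John"` and `"Smith"`, a Single Value field would report one value, `"John Smith"`, while a Multi Value field would report the two values separately.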
The icon beside each field indicates the type of supported selection: Tokens or Custom area.
If an empty area is selected, the selection is automatically set as Custom area. If text is detected inside the selected area, you are asked to choose the type of the selection between Tokens or Custom area.
Use the Validation Station selection mode feature to lock your selection between Tokens and Custom Areas.
As mentioned above, there are fields where information can be added only by using Tokens (like the Page Matching Info fields) or only by using a Custom Area (like simple fields). For Table fields, you can do the following:
- Define each cell one by one, once the Table Editor is expanded - by adding Custom Area selection to each cell individually;
- Use the table markup functionality - by marking the table area, drawing row, and column separators, and then assigning the thus marked table to the field. Make sure that the extracted area has the same number of columns and rows as the template area.
- Select More Options for the table field
- Select Extract new table.
- Select the table that you want to extract.
- For every field above each table column, select the column name that you want it to represent. You can also choose to Extract header.
- Lastly, select Save new table.
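The table markup approach (a marked table area plus row and column separators) can be illustrated with a short sketch that turns the markup into a grid of cell areas. The function name and the coordinate convention (left, top, right, bottom) are hypothetical and only show the idea; they do not reflect how the Template Editor stores tables.

```python
from typing import List, Sequence, Tuple

Cell = Tuple[float, float, float, float]  # (left, top, right, bottom)


def split_table(left: float, top: float, right: float, bottom: float,
                column_separators: Sequence[float],
                row_separators: Sequence[float]) -> List[List[Cell]]:
    """Split a marked table area into cells, row by row.

    Column separators are x coordinates and row separators are y
    coordinates, all assumed to lie inside the table area.
    """
    xs = [left] + sorted(column_separators) + [right]
    ys = [top] + sorted(row_separators) + [bottom]
    return [
        [(xs[c], ys[r], xs[c + 1], ys[r + 1]) for c in range(len(xs) - 1)]
        for r in range(len(ys) - 1)
    ]
```

One column separator and one row separator produce a 2 x 2 grid, which is why the extracted area needs the same number of rows and columns as the template area for the cells to line up.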
A distinctive method of defining the bounds of a custom area from which data is to be extracted is to use field-level anchors. These allow for targeting data extraction based on field-level configurations, thus allowing for more flexibility when defining your form extraction rules.
Consequently, at run-time, the Form Extractor knows how to perform the following:
- Identify if a page-level template matches, and extract information according to the best page-level template match it recognizes;
- Identify if any anchor-based settings match, and extract information according to their application in the document to be processed;
- Compute appropriate confidence scores for all possible matches, to be able to report the best result (highest probability match) of all available options.
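The last step above, reporting the highest-probability match, can be sketched as a simple selection over candidates. The `Candidate` type, its fields, and the example confidence values are hypothetical; how the Form Extractor actually computes confidence is not described in the source.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical extraction candidate for one field."""
    source: str        # e.g. "page_template" or "anchor"
    value: str
    confidence: float  # assumed to be in the range 0.0 to 1.0


def best_candidate(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Return the candidate with the highest confidence, or None if empty."""
    return max(candidates, key=lambda c: c.confidence, default=None)
```

Given a page-level template match and an anchor-based match for the same field, the one with the higher confidence would be reported.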
Creating a New Anchor Setting
- Make sure you are in the Anchor Selection mode.
- Draw a box around the value area.
- Select a Label (main anchor) for your value area by using one of the following methods:
  - Select the first word and then use Ctrl + Select for the last word of the selection.
  - Select, drag, and then release to capture a word range.
Note: A Label can only contain consecutive words from the same visual line.
- Select any additional anchors that would uniquely identify your Label. The same selection principle applies.
- Assign your anchor construct to the appropriate field by selecting Extract Value for that particular field.
Figure 12. Example of creating multiple anchors for a field
Note: You can also use the previous examples from this page to learn how to create a template and define extraction areas and anchors.
Edit an Existing Anchor Setting
- Highlight your anchor setting.
- Make changes to it (delete any anchors, the label, even the value area if you wish, add new elements, etc.).
- Select More Options for a field anchor, and then use the Change Extracted Value option to update your field association.
Figure 13. Example of changing the extracted value for a field
Note:
- If you delete the target area, all anchors get deleted and you start over.
- If you delete the Label (main anchor), the first anchor in the order it was created becomes the new Label.
Delete an Existing Anchor Setting
To delete an anchor setting, you can use one of the following options:
- Select More Options for a field anchor and use the Mark as Missing option for a saved value.
Figure 14. Example of using the Mark as Missing option to delete an anchor setting
- Select More Options for a field anchor and use the Remove Value option, in the case of a list of anchors defined for a given field.
Figure 15. Example of using the Remove Value option to delete an anchor setting
Mix and Match Configurations
You can define as many templates as you want for the same document type. You can have multiple page-level templates, multiple anchors for the same field, even templates containing both page-level as well as field-level anchors.
- When defining field-level anchors make sure your Label is close to your value area and it is supported by additional anchors if the same text construct can be found in multiple places within the same document.
- The longer your labels and anchors are, the more precision you get.
- The value area is always computed based on its relative position against your Label (main anchor). Choose your main anchors accordingly.
- Having field-level anchors allows fields to move within the template and still be captured, offering more flexibility in document layout changes.
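Because the value area is computed from its position relative to the Label, a field can move on the page and still be captured. The sketch below shows that idea under simplifying assumptions: a pure translation with no scaling or rotation, and hypothetical `Rect` and function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical axis-aligned box: top-left corner plus size."""
    left: float
    top: float
    width: float
    height: float


def locate_value_area(template_label: Rect, template_value: Rect,
                      runtime_label: Rect) -> Rect:
    """Place the value area at the same offset from the Label as in the template.

    Assumption: the document is neither scaled nor rotated relative to
    the template, so only the offset from the Label matters.
    """
    dx = template_value.left - template_label.left
    dy = template_value.top - template_label.top
    return Rect(runtime_label.left + dx, runtime_label.top + dy,
                template_value.width, template_value.height)
```

If the Label is found 100 units lower on the processed page than in the template, the value area is looked up 100 units lower as well, which is why a Label close to the value area gives more reliable results.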
The Form Extractor activity is part of the Document Understanding solutions. Visit the Document Understanding Guide for more information.