Document Understanding Activities
Anchor-based data extraction using Intelligent Form Extractor
This example shows how to extract data from a form that may also include handwritten text. The use-case scenario extracts data from a purchase order.
It uses activities such as Digitize Document, Data Extraction Scope, and Intelligent Form Extractor. You can find these activities in the UiPath.IntelligentOCR.Activities package.
Install the following packages before creating the workflow:
- UiPath.DocumentProcessing.Contracts.Activities
- UiPath.IntelligentOCR.Activities
- UiPath.OCR.Activities
- UiPath.OCR.Contracts
- UiPath.WebAPI.Activities
Steps:
- Open Studio and create a new Process.
- Add a Sequence container in the Workflow Designer, name it Sequence1, and create the variables shown in the following table:

Table 1. Variables to be created

Variable | Type | Default value |
---|---|---|
item | String | N/A |
classificationResult | ClassificationResult[] | N/A |
outputFileName | GenericValue | N/A |

- Add another Sequence container in the Workflow Designer, after the first one, name it Sequence2, and create the variables shown in the following table:

Table 2. Variables to be created

Variable | Type | Default value |
---|---|---|
text | String | N/A |
taxonomy | DocumentTaxonomy | N/A |
dom | Document | N/A |
documentPath | String | N/A |
classificationResult2 | ClassificationResult[] | N/A |
outputFileName2 | GenericValue | N/A |

- Add a Message Box activity inside the sequence.
  - In the Properties panel, select the Ok option from the Buttons dropdown. Add the following message in the Text field: "Select a PDF file".
  - Select the check box for the TopMost option. This brings the message box to the foreground.
- Add a Select File activity after the Message Box activity.
  - In the Properties panel, add the following text in the Filter field: `Pdf files (*.pdf)|*.pdf`
  - Add the `documentPath` variable in the SelectedFile field.
- Add an Assign activity after the Select File activity.
  - Add the `outputFileName2` variable in the To field.
  - Add the expression `".temp/" + Path.GetFileName(documentPath)` in the Value field.
- Add a Deserialize JSON activity after the Assign activity.
  - Add the expression `File.ReadAllText("DocumentProcessing\taxonomy.json")` in the JSON String field.
  - In the Properties panel, select the UiPath.DocumentProcessing.Contracts.Taxonomy.DocumentTaxonomy option from the TypeArgument dropdown list.
  - Add the `taxonomy` variable in the JsonObject field.
- Add a Digitize Document activity after the Deserialize JSON activity.
  - In the Properties panel, add the value `1` in the DegreeOfParallelism field.
  - Add the `documentPath` variable in the DocumentPath field.
  - Add the `dom` variable in the DocumentObjectModel field.
  - Add the `text` variable in the DocumentText field.
  - Add the UiPath® Document OCR engine inside the activity.
  - Add your API Key inside the ApiKey field.
  - Add the `"https://du.uipath.com/ocr"` expression in the Endpoint field.
- Add a Write Text File activity after the Digitize Document activity.
  - Add the `JsonConvert.SerializeObject(dom)` expression in the Text field.
  - Add the `outputFileName2 + ".dom.json"` expression in the FileName field.
- Add another Write Text File activity after the previous Write Text File activity.
  - Add the `text` variable in the Text field.
  - Add the `outputFileName2 + ".text.txt"` expression in the FileName field.
- Drag another Sequence container in the Workflow Designer, name it Sequence3, and create the variables shown in the following table:

Table 3. Variables to be created

Variable | Type | Default value |
---|---|---|
extractionResult | ExtractionResult | N/A |
validatedResults | ExtractionResult | N/A |
doubleValidatedResults | ExtractionResult | N/A |
dataset | DataSet | N/A |
i | Int32 | N/A |

- Add a Data Extraction Scope activity inside Sequence3.
  - In the Properties panel, add the `dom` variable in the DocumentObjectModel field.
  - Add the `documentPath` variable in the DocumentPath field.
  - Add the `text` variable in the DocumentText field.
  - Add the `"All.Benchmarks.Invoice"` expression in the DocumentTypeId field.
  - Add the `taxonomy` variable in the Taxonomy field.
  - Add the `extractionResult` variable in the ExtractionResults field.
- Add an Intelligent Form Extractor activity inside the Data Extraction Scope activity.
  - Add your API Key in the ApiKey field.
- Add a Write Text File activity after the Data Extraction Scope activity.
  - Add the `JsonConvert.SerializeObject(extractionResult)` expression in the Text field.
  - Add the `outputFileName2 + ".results.json"` expression in the FileName field.
- Add a Present Validation Station activity after the Write Text File activity.
  - Add the `extractionResult` variable in the AutomaticExtractionResults field.
  - Add the `dom` variable in the DocumentObjectModel field.
  - Add the `documentPath` variable in the DocumentPath field.
  - Add the `text` variable in the DocumentText field.
  - Add the `taxonomy` variable in the Taxonomy field.
  - Add the `validatedResults` variable in the ValidatedExtractionResults field.
- Add a Write Text File activity after the Present Validation Station activity.
  - Add the `JsonConvert.SerializeObject(validatedResults)` expression in the Text field.
  - Add the `outputFileName2 + ".savedinVS.results.json"` expression in the FileName field.
- Add another Write Text File activity after the previous Write Text File activity.
  - Add the `JsonConvert.SerializeObject(doubleValidatedResults)` expression in the Text field.
  - Add the `outputFileName2 + ".doubleSavedinVS.results.json"` expression in the FileName field.
- Run the process. The automation process should open the Validation Station, extract the data, validate it, and store it in the Output folder.
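
The Write Text File steps in this workflow all build their file names from one base path. As a rough illustration outside Studio, the following Python sketch mirrors the `".temp/" + Path.GetFileName(documentPath)` expression and the suffixes used in the steps. The function name `output_paths` and the sample input path are hypothetical and are not part of the workflow.

```python
import ntpath

# Suffixes appended to outputFileName2 by the Write Text File steps.
SUFFIXES = [
    ".dom.json",
    ".text.txt",
    ".results.json",
    ".savedinVS.results.json",
    ".doubleSavedinVS.results.json",
]


def output_paths(document_path):
    """Return the file names the workflow would write for one input PDF."""
    # Mirrors ".temp/" + Path.GetFileName(documentPath). ntpath accepts
    # both "\\" and "/" separators regardless of the host operating system.
    base = ".temp/" + ntpath.basename(document_path)
    return [base + suffix for suffix in SUFFIXES]


if __name__ == "__main__":
    # Hypothetical input path, used only for illustration.
    for path in output_paths(r"C:\Invoices\sample.pdf"):
        print(path)
```

Listing the expected names this way can help when checking that a run produced every intermediate file.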
You have created your workflow, defined all variables, and customized all activities. Now it's time to define your taxonomy. Visit Load Taxonomy to learn about defining your own taxonomy.
Create a taxonomy that lets you extract information from an invoice. Focus on creating an Invoice document type with the fields shown in the following table:

Field | Type |
---|---|
InvoiceNo | |
Subtotal | |
SalesTax | |
Total | |
Next, create the template for the extraction process using the settings listed below. Visit Load Taxonomy to learn how to create a template.
- Document Type: Invoice.
- Template Name: Invoice-example.
- Template Document: Select the target file.
- OCR Engine: Microsoft OCR.
- Languages: en.
- Profile: Scan.
- Scale: 1.
Anchors are useful when you need to extract precise information from a document. Defining an extraction area together with an anchor gives you high accuracy in data extraction.
Once the taxonomy is defined and the template is created, you can start configuring the template with anchors: the extraction area is defined as a box, and the anchors define the position of that box.
Review the following pointers before you start adding anchors to your template:
- The anchor box should be as big as possible (height, width) to cover any type of invoice number: long, short, large font, and so on.
- One extraction area can have as many anchors as needed, but only one of them, the first, is defined as the main anchor.
- Use anchors made up of multiple side-by-side words.
- The main anchor should be as close as possible to the extraction area.
- The positions of the extraction area and the main anchor are fixed in the template, even when applied to different documents. The only thing that can vary is the distance between the main anchor and the secondary ones.
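
One way to read the last pointer is that the template stores the extraction area at a fixed offset from the main anchor. The Python sketch below illustrates that idea only. It is a conceptual model, not how Intelligent Form Extractor is implemented, and the `Box` type, the `locate_extraction_area` function, and all coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical units)."""
    left: float
    top: float
    width: float
    height: float


def locate_extraction_area(template_anchor, template_area, found_anchor):
    """Place the extraction area in a new document relative to its main anchor."""
    # Offset between the main anchor and the extraction area in the template.
    dx = template_area.left - template_anchor.left
    dy = template_area.top - template_anchor.top
    # Reapply the same offset where the anchor text was found in the new document.
    return Box(
        found_anchor.left + dx,
        found_anchor.top + dy,
        template_area.width,
        template_area.height,
    )


if __name__ == "__main__":
    anchor_in_template = Box(100, 50, 80, 12)
    area_in_template = Box(190, 48, 120, 16)
    # The same anchor words appear slightly shifted in another document.
    anchor_in_new_document = Box(110, 65, 80, 12)
    print(locate_extraction_area(anchor_in_template, area_in_template, anchor_in_new_document))
```

Under this model, a main anchor placed close to the extraction area keeps the offset small, so a shift or skew in a scanned page has less effect on where the box lands.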
Let's continue configuring the template and see how you can extract data using an anchor.
- Set the extraction area:
  - In the right area of the Validation Station, select Selection modes.
  - Select Anchor.
  - Start selecting the desired area.
Figure 3. Animated image showing how to set the extraction area
Note: The main anchor should contain two or three words for high accuracy and better results in the extraction process. To select multiple words when tagging an anchor, press CTRL and select the desired words.
- Set the main anchor:
  - While still in the Anchor selection mode, select the desired area as your main anchor.
  - Select Extract value for the desired field.
Figure 4. Animated image example showing how to set the main anchor
- Set the secondary anchors:
  - Ensure you're still in the Anchor selection mode, with the main anchor selections activated.
  - Select the new areas for the secondary anchors.
  - Select Options for the desired field, and then select Change extracted value.
Figure 5. Animated image example showing how to set secondary anchors
Repeat the process until you have finished defining all extraction areas and adding all your anchors. Once finished, save the template.