
Document Understanding Activities

Last updated Dec 5, 2024

Anchor-based data extraction using Intelligent Form Extractor

This example shows how to extract data from a form that may also include handwritten text. The use-case scenario is extracting data from a purchase order.

It uses activities such as Digitize Document, Data Extraction Scope, and Intelligent Form Extractor. You can find these activities in the UiPath.IntelligentOCR.Activities package.

Creating the workflow

Install the following packages before creating the workflow:

  • UiPath.DocumentProcessing.Contracts.Activities
  • UiPath.IntelligentOCR.Activities
  • UiPath.OCR.Activities
  • UiPath.OCR.Contracts
  • UiPath.WebAPI.Activities

Steps:

  1. Open Studio and create a new Process.
  2. Add a Sequence container in the Workflow Designer, name it Sequence1, and create the variables shown in the following table:
    Table 1. Variables to be created

    Variable              Type                    Default value
    item                  String                  N/A
    classificationResult  ClassificationResult[]  N/A
    outputFileName        GenericValue            N/A
  3. Add another Sequence container in the Workflow Designer, after the first one, name it Sequence2, and create the variables shown in the following table:
    Table 2. Variables to be created

    Variable               Type                    Default value
    text                   String                  N/A
    taxonomy               DocumentTaxonomy        N/A
    dom                    Document                N/A
    documentPath           String                  N/A
    classificationResult2  ClassificationResult[]  N/A
    outputFileName2        GenericValue            N/A
  4. Add a Message Box activity inside the sequence.
    • In the Properties panel, select the Ok option from the Buttons dropdown. Add the following message in the Text field: "Select a PDF file".
  5. Select the check box for the TopMost option. This brings the message box to the foreground.
  6. Add a Select File activity after the Message Box activity.
    • In the Properties panel, add the following text in the Filter field: Pdf files (*.pdf)|*.pdf
    • Add the documentPath variable in the SelectedFile field.
  7. Add an Assign activity after the Select File activity.
    • Add the outputFileName2 variable in the To field.
    • Add the expression ".temp/" + Path.GetFileName(documentPath) in the Value field.
  8. Add a Deserialize JSON activity after the Assign activity.
    • Add the expression File.ReadAllText("DocumentProcessing\taxonomy.json") in the JSON String field.
    • In the Properties panel, select the UiPath.DocumentProcessing.Contracts.Taxonomy.DocumentTaxonomy option from the TypeArgument dropdown list.
    • Add the taxonomy variable in the JsonObject field.
  9. Add a Digitize Document activity after the Deserialize JSON activity.
    • In the Properties panel, add the value 1 in the DegreeOfParallelism field.
    • Add the documentPath variable in the DocumentPath field.
    • Add the dom variable in the DocumentObjectModel field.
    • Add the text variable in the DocumentText field.
    • Add the UiPath® Document OCR engine inside the activity.
    • Add your API Key inside the ApiKey field.
    • Add the "https://du.uipath.com/ocr" expression in the Endpoint field.
  10. Add a Write Text File activity after the Digitize Document activity.
    • Add the JsonConvert.SerializeObject(dom) expression in the Text field.
    • Add the outputFileName2 + ".dom.json" expression in the FileName field.
  11. Add another Write Text File activity after the previous Write Text File activity.
    • Add the text variable in the Text field.
    • Add the outputFileName2 + ".text.txt" expression in the FileName field.
  12. Add another Sequence container in the Workflow Designer, name it Sequence3, and create the variables shown in the following table:
    Table 3. Variables to be created

    Variable                Type              Default value
    extractionResult        ExtractionResult  N/A
    validatedResults        ExtractionResult  N/A
    doubleValidatedResults  ExtractionResult  N/A
    dataset                 DataSet           N/A
    i                       Int32             N/A
  13. Add a Data Extraction Scope activity inside Sequence3.
    • In the Properties panel, add the dom variable in the DocumentObjectModel field.
    • Add the documentPath variable in the DocumentPath field.
    • Add the text variable in the DocumentText field.
    • Add the "All.Benchmarks.Invoice" expression in the DocumentTypeId field.
    • Add the taxonomy variable in the Taxonomy field.
    • Add the extractionResult variable in the ExtractionResults field.
  14. Add an Intelligent Form Extractor activity inside the Data Extraction Scope activity.
    • Add your API Key in the ApiKey field.
  15. Add a Write Text File activity after the Data Extraction Scope activity.
    • Add the JsonConvert.SerializeObject(extractionResult) expression in the Text field.
    • Add the outputFileName2 + ".results.json" expression in the FileName field.
  16. Add a Present Validation Station activity after the Write Text File activity.
    • Add the extractionResult variable in the AutomaticExtractionResults field.
    • Add the dom variable in the DocumentObjectModel field.
    • Add the documentPath variable in the DocumentPath field.
    • Add the text variable in the DocumentText field.
    • Add the taxonomy variable in the Taxonomy field.
    • Add the validatedResults variable in the ValidatedExtractionResults field.
  17. Add a Write Text File activity after the Present Validation Station activity.
    • Add the JsonConvert.SerializeObject(validatedResults) expression in the Text field.
    • Add the outputFileName2 + ".savedinVS.results.json" expression in the FileName field.
  18. Add another Write Text File activity after the previous Write Text File activity.
    • Add the JsonConvert.SerializeObject(doubleValidatedResults) expression in the Text field.
    • Add the outputFileName2 + ".doubleSavedinVS.results.json" expression in the FileName field.
  19. Run the process. The automation process should open the Validation Station, extract the data, validate it, and store it in the Output folder.
Visit the following link to download the example in a ZIP format: Example.
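
The file names written by the Write Text File activities in the steps above can be traced outside Studio. The following Python sketch is an illustration only and is not part of the workflow: it mirrors the ".temp/" + Path.GetFileName(documentPath) expression and the suffixes from the steps. The function name output_paths and the sample path are assumptions made for this illustration, and ntpath.basename stands in for Path.GetFileName.

```python
import ntpath

# Suffixes appended to outputFileName2 by the Write Text File activities.
SUFFIXES = [
    ".dom.json",
    ".text.txt",
    ".results.json",
    ".savedinVS.results.json",
    ".doubleSavedinVS.results.json",
]


def output_paths(document_path: str) -> list[str]:
    """Return the file names the workflow writes for one input document."""
    # ntpath.basename accepts both "\" and "/" separators, like a Windows path.
    base = ".temp/" + ntpath.basename(document_path)
    return [base + suffix for suffix in SUFFIXES]


if __name__ == "__main__":
    for path in output_paths(r"C:\Documents\invoice.pdf"):
        print(path)
```

Note that the original .pdf extension stays in each name, so the digitized document for invoice.pdf is written to .temp/invoice.pdf.dom.json.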

Defining your taxonomy

You have created your workflow, defined all variables, and customized all activities. Now it's time to define your taxonomy. Visit Load Taxonomy to learn about defining your own taxonomy.

Create a taxonomy that lets you extract information from an invoice. Focus on creating an Invoice document type with the fields shown in the following table:

Table 4. Invoice document type fields

Field      Type
InvoiceNo  Text
Subtotal   Number
SalesTax   Number
Total      Number

Figure 1. Overview of the finished taxonomy with the previously mentioned fields
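
As a quick sanity check on extracted values, the Invoice fields can be written down as a simple mapping and compared against a set of results. The Python sketch below is hypothetical: the INVOICE_FIELDS dictionary and the check_values helper are names invented for this illustration, and the structure is not the schema of the taxonomy.json file that Studio generates.

```python
# Field names and types of the Invoice document type.
INVOICE_FIELDS = {
    "InvoiceNo": "Text",
    "Subtotal": "Number",
    "SalesTax": "Number",
    "Total": "Number",
}


def check_values(values: dict[str, str]) -> list[str]:
    """Return a list of problems found in a dict of extracted field values."""
    problems = []
    for name, field_type in INVOICE_FIELDS.items():
        if name not in values:
            problems.append(f"missing field: {name}")
        elif field_type == "Number":
            # Number fields must parse as a numeric value.
            try:
                float(values[name])
            except ValueError:
                problems.append(f"not a number: {name}")
    return problems


if __name__ == "__main__":
    sample = {"InvoiceNo": "INV-001", "Subtotal": "100.50", "SalesTax": "8.04"}
    print(check_values(sample))
```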

Creating your template

It is now time to create the template for the extraction process. Visit Load Taxonomy to learn how to create a template.

For this example, configure the template using the following values:
  • Document Type: Invoice.
  • Template Name: Invoice-example.
  • Template Document: Select the target file.
  • OCR Engine: Microsoft OCR.
  • Languages: en.
  • Profile: Scan.
  • Scale: 1.
Figure 2. Animated image example showing the configuration of the template

Setting anchors in the template

Anchors are useful when you need to extract precise information from a document. By defining an extraction area together with an anchor, you can expect high accuracy in data extraction.

Once the taxonomy is defined and the template created, you can start configuring the template by using anchors, meaning that the extraction area is defined in a box, and anchors are used for defining the box position.

Review the following pointers before you start adding anchors to your template:

  • The anchor box should be as big as possible (height, width) so that it covers any variation of the value, such as a long or short invoice number or one printed in a large font.
  • One extraction area can have as many anchors as needed, but only one, the first, is defined as the main anchor.
  • Use anchors formed of multiple side-by-side words.
  • The main anchor should be as close as possible to the extraction area.
  • The positions of the extraction area and the main anchor are fixed in the template, even when applied to different documents. The only thing that can vary is the distance between the main anchor and the secondary ones.
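
The last pointer can be pictured as simple geometry: if the extraction area keeps a fixed offset from the main anchor, finding the anchor on a new page is enough to place the area. The Python sketch below illustrates that reading only. It is not how the Intelligent Form Extractor is implemented, and the Box type, the locate_extraction_area function, and all coordinates are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical units)."""
    left: float
    top: float
    width: float
    height: float


def locate_extraction_area(template_anchor: Box, template_area: Box,
                           found_anchor: Box) -> Box:
    """Place the extraction area relative to where the main anchor was found."""
    # Offset of the area from the main anchor, as defined in the template.
    dx = template_area.left - template_anchor.left
    dy = template_area.top - template_anchor.top
    # Apply the same offset to the anchor position on the new document.
    return Box(found_anchor.left + dx, found_anchor.top + dy,
               template_area.width, template_area.height)


if __name__ == "__main__":
    anchor_in_template = Box(100, 50, 80, 12)
    area_in_template = Box(200, 48, 150, 16)
    anchor_on_new_page = Box(110, 70, 80, 12)
    print(locate_extraction_area(anchor_in_template, area_in_template,
                                 anchor_on_new_page))
```

Under this reading, a main anchor placed close to the extraction area keeps the offset small, so small shifts or skew in a scan move the area less.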

Let's continue configuring the template and see how you can extract data using an anchor.

  1. Set the extraction area:
    • In the right area of the Validation Station, select Selection modes.
    • Select Anchor.
    • Start selecting the desired area.
      Figure 3. Animated image showing how to set the extraction area

      Note:

      The main anchor should contain two or three words for high accuracy and better results in the extraction process.

      Select multiple words when tagging an anchor by pressing CTRL and selecting the desired words.

  2. Set the main anchor:
    1. While still in the Anchor selection mode, select the desired area as your main anchor.
    2. Select Extract value for the desired field.
      Figure 4. Animated image example showing how to set the main anchor

  3. Set the secondary anchors:
    1. Ensure you're still in the Anchor selection mode, and with the main anchor selections activated.
    2. Select the new areas for the secondary anchors.
    3. Select Options for the desired field, and then select Change extracted value.
      Figure 5. Animated image example showing how to set secondary anchors

Repeat the process until you have defined all extraction areas and added all your anchors. Once finished, save the template.

© 2005-2024 UiPath. All rights reserved.