Document Understanding Activities
Last updated Apr 10, 2024

Anchor-Based Data Extraction Using Intelligent Form Extractor

This example shows how to extract data from a form that may also include handwritten text, using a purchase order as the use case.

It uses activities such as Digitize Document, Data Extraction Scope, and Intelligent Form Extractor. You can find these activities in the UiPath.IntelligentOCR.Activities package.

Creating the workflow

Install the following packages before creating the workflow:

  • UiPath.DocumentProcessing.Contracts.Activities
  • UiPath.IntelligentOCR.Activities
  • UiPath.OCR.Activities
  • UiPath.OCR.Contracts
  • UiPath.WebAPI.Activities

This is how the automation process can be built:

  1. Open Studio and create a new Process.
  2. Drag a Sequence container in the Workflow Designer, name it Sequence1, and create the following variables:

    Variable Name          Variable Type            Default value
    item                   String
    classificationResult   ClassificationResult[]
    outputFileName         GenericValue
  3. Drag another Sequence container in the Workflow Designer, below the first one, name it Sequence2, and create the following variables:

    Variable Name           Variable Type            Default value
    text                    String
    taxonomy                DocumentTaxonomy
    dom                     Document
    documentPath            String
    classificationResult2   ClassificationResult[]
    outputFileName2         GenericValue
  4. Add a Message Box activity inside the sequence.

    • In the Properties panel, select the Ok option from the Buttons dropdown. Add the following message in the Text field: "Select a PDF file".

  5. Select the checkbox for the TopMost option. This brings the message box to the foreground.
  6. Add a Select File activity below the Message Box activity.

    • In the Properties panel, add the following text in the Filter field: Pdf files (*.pdf)|*.pdf
    • Add the documentPath variable in the SelectedFile field.
  7. Add an Assign activity below the Select File activity.

    • Add the outputFileName2 variable in the To field.
    • Add the expression ".temp/" + Path.GetFileName(documentPath) in the Value field.
  8. Add a Deserialize JSON activity below the Assign activity.

    • Add the expression File.ReadAllText("DocumentProcessing\taxonomy.json") in the JSON String field.
    • In the Properties panel, select the UiPath.DocumentProcessing.Contracts.Taxonomy.DocumentTaxonomy option from the TypeArgument dropdown list.
    • Add the taxonomy variable in the JsonObject field.
  9. Add a Digitize Document activity below the Deserialize JSON activity.

    • In the Properties panel, add the value 1 in the DegreeOfParallelism field.
    • Add the documentPath variable in the DocumentPath field.
    • Add the dom variable in the DocumentObjectModel field.
    • Add the text variable in the DocumentText field.
    • Add the UiPath Document OCR engine inside the activity.
    • Add your API Key inside the ApiKey field.
    • Add the "https://du.uipath.com/ocr" expression in the Endpoint field.
  10. Add a Write Text File activity below the Digitize Document activity.

    • Add the JsonConvert.SerializeObject(dom) expression in the Text field.
    • Add the outputFileName2 + ".dom.json" expression in the FileName field.
  11. Add another Write Text File activity below the previous one.

    • Add the text variable in the Text field.
    • Add the outputFileName2 + ".text.txt" expression in the FileName field.
  12. Drag another Sequence container in the Workflow Designer, name it Sequence3, and create the following variables:

    Variable Name            Variable Type      Default value
    extractionResult         ExtractionResult
    validatedResults         ExtractionResult
    doubleValidatedResults   ExtractionResult
    dataset                  DataSet
    i                        Int32
  13. Add a Data Extraction Scope activity inside the Sequence3.

    • In the Properties panel, add the dom variable in the DocumentObjectModel field.
    • Add the documentPath variable in the DocumentPath field.
    • Add the text variable in the DocumentText field.
    • Add the "All.Benchmarks.Invoice" expression in the DocumentTypeId field.
    • Add the taxonomy variable in the Taxonomy field.
    • Add the extractionResult variable in the ExtractionResults field.
  14. Add an Intelligent Form Extractor activity inside the Data Extraction Scope activity.

    • Add your API Key in the ApiKey field.
  15. Add a Write Text File activity below the Data Extraction Scope activity.

    • Add the JsonConvert.SerializeObject(extractionResult) expression in the Text field.
    • Add the outputFileName2 + ".results.json" expression in the FileName field.
  16. Add a Present Validation Station activity below the Write Text File activity.

    • Add the extractionResult variable in the AutomaticExtractionResults field.
    • Add the dom variable in the DocumentObjectModel field.
    • Add the documentPath variable in the DocumentPath field.
    • Add the text variable in the DocumentText field.
    • Add the taxonomy variable in the Taxonomy field.
    • Add the validatedResults variable in the ValidatedExtractionResults field.
  17. Add a Write Text File activity below the Present Validation Station activity.

    • Add the JsonConvert.SerializeObject(validatedResults) expression in the Text field.
    • Add the outputFileName2 + ".savedinVS.results.json" expression in the FileName field.
  18. Add another Write Text File activity below the previous one.

    • Add the JsonConvert.SerializeObject(doubleValidatedResults) expression in the Text field.
    • Add the outputFileName2 + ".doubleSavedinVS.results.json" expression in the FileName field.
  19. Run the process. The automation process should open the Validation Station, extract the data, validate it, and store it in the Output folder.
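
The file names written by the Write Text File steps all follow the same pattern: the value of outputFileName2 plus a fixed suffix. The short Python sketch below only illustrates that naming pattern; it is not part of the UiPath workflow. The function name output_files and the sample path are hypothetical, and the ".temp/" prefix is taken from the Assign step as written above.

```python
from pathlib import PureWindowsPath

# Suffixes appended to outputFileName2 by the Write Text File steps.
SUFFIXES = [
    ".dom.json",
    ".text.txt",
    ".results.json",
    ".savedinVS.results.json",
    ".doubleSavedinVS.results.json",
]


def output_files(document_path: str, prefix: str = ".temp/") -> list[str]:
    """Mirror the expression prefix + Path.GetFileName(documentPath) + suffix.

    output_files is a hypothetical helper used only for illustration.
    """
    # Keep only the file name, as Path.GetFileName does in the workflow.
    base = prefix + PureWindowsPath(document_path).name
    return [base + suffix for suffix in SUFFIXES]


if __name__ == "__main__":
    # Hypothetical input path, used only to show the resulting names.
    for name in output_files(r"C:\docs\order.pdf"):
        print(name)
```

Because every output shares the same base name, all artifacts of one run (document object model, text, and extraction results) can be matched back to the input PDF.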

Download the example from here.

Defining your taxonomy

You have created your workflow, defined all variables, and customized all activities. Now it's time to define your taxonomy. To do so, please follow the steps described here.

Create a taxonomy that lets you extract information from an invoice. Focus on creating an Invoice document type with the following fields:

Field Name   Field Type
InvoiceNo    Text
Subtotal     Number
SalesTax     Number
Total        Number
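
As a rough illustration, the fields above can be written down as a small data structure and used to sanity-check a set of extracted values. This Python sketch is not the format of the taxonomy.json file and does not use any UiPath API; the Field class, the check_values helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    type: str  # "Text" or "Number", as listed in the table above


# The four fields of the Invoice document type from the table above.
INVOICE_FIELDS = [
    Field("InvoiceNo", "Text"),
    Field("Subtotal", "Number"),
    Field("SalesTax", "Number"),
    Field("Total", "Number"),
]


def check_values(values: dict[str, str]) -> list[str]:
    """Return a list of problems: missing fields or non-numeric Number fields.

    check_values is a hypothetical helper, not a UiPath activity.
    """
    problems = []
    for field in INVOICE_FIELDS:
        raw = values.get(field.name)
        if raw is None:
            problems.append(f"missing: {field.name}")
        elif field.type == "Number":
            try:
                # Assumption: thousands separators are commas.
                float(raw.replace(",", ""))
            except ValueError:
                problems.append(f"not a number: {field.name}")
    return problems


if __name__ == "__main__":
    # Hypothetical extracted values, used only for illustration.
    sample = {"InvoiceNo": "INV-1", "Subtotal": "100.00", "SalesTax": "8.25"}
    print(check_values(sample))
```
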

Here is how your taxonomy should look:



Creating your template

It is now time to create the template for the extraction process. Create it by following the instructions from here.

The following gif explains how to create the template:



Setting anchors in the template

Anchors are useful when you need to extract precise information from a document. By defining an extraction area together with an anchor, you can expect high accuracy in data extraction.

The following gifs explain how to use the anchors on the invoice document used for the above example. More details about anchors can be found here.

Once the taxonomy is defined and the template is created, you can start configuring the template by using anchors: the extraction area is defined as a box, and anchors are used to define the position of that box.

Here are some pointers to keep in mind before you start adding anchors to your template:

  • The anchor box should be as big as possible (height, width) to cover any type of invoice number: long, short, big font, etc.
  • One extraction area can have as many anchors as needed, but only one can be defined as the main anchor (the first one).
  • Use anchors formed of multiple side-by-side words.
  • The main anchor should be as close as possible to the extraction area.
  • The positions of the extraction area and the main anchor are fixed in the template, even when applied to different documents. The only thing that can vary is the distance between the main anchor and the secondary ones.
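
The last pointer can be pictured as simple geometry: the offset between the main anchor and the extraction area is fixed in the template, so finding the anchor in a new document tells you where the extraction area is. The Python sketch below illustrates only this idea. It is not how Intelligent Form Extractor is implemented, and the Box class, the locate_extraction_area function, and all coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def locate_extraction_area(template_anchor: Box,
                           template_area: Box,
                           found_anchor: Box) -> Box:
    """Place the extraction area relative to the main anchor found in a document.

    The offset between anchor and area comes from the template and stays fixed;
    only the anchor's position changes from one document to the next.
    """
    dx = template_area.x - template_anchor.x
    dy = template_area.y - template_anchor.y
    return Box(found_anchor.x + dx, found_anchor.y + dy,
               template_area.width, template_area.height)


if __name__ == "__main__":
    # Hypothetical template coordinates for an "Invoice No" anchor and its value box.
    anchor_in_template = Box(100, 50, 80, 12)
    area_in_template = Box(190, 48, 120, 16)
    # The same anchor, found slightly shifted in another document.
    anchor_in_document = Box(110, 70, 80, 12)
    print(locate_extraction_area(anchor_in_template, area_in_template, anchor_in_document))
```

This also shows why the main anchor should sit close to the extraction area: the shorter the fixed offset, the less a small scaling or skew in the scanned document displaces the computed box.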

Let's continue configuring the template and see how you can extract data using an anchor. The gifs below explain how to mark the extraction area and to add the main and secondary anchors.

  • Set the extraction area



    Note:

    The main anchor should contain two or three words for high accuracy and better results in the extraction process.

    Select multiple words when tagging an anchor by pressing CTRL and clicking on the desired words.

  • Set the main anchor



  • Set the secondary anchors



Repeat the process until you have defined all extraction areas and added all your anchors. Once finished, save the template.


