Document Understanding
2022.4
Document Understanding User Guide
Last updated Mar 13, 2024

Taxonomy Overview

What Is a Taxonomy

The Taxonomy is the metadata that the Document Understanding framework considers in each of its steps.

  • A Taxonomy is a collection of Document Types.

    • A Document Type is the definition of a logical type of document that must be handled by a distinct business process. Examples of Document Types are Invoices, Medical Records, IRS W-2 Forms, and Contracts. Besides a name, group, and category (for easier handling), a Document Type usually contains a collection of Fields.

      • A Field is one piece of information that is expected to be found and captured from a specific Document Type.

As seen above, a Taxonomy is a hierarchical structure that contains the schema of the information the Document Understanding framework will use throughout. Each entity definition (for document types or fields) found in the Taxonomy has a unique ID.
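
The hierarchy described above can be pictured with a small data model. The following is only a conceptual sketch in Python, not the UiPath object model: the class names, attribute names, and the dotted ID strings are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    field_id: str          # unique ID of the field definition (hypothetical format)
    name: str
    field_type: str        # for example "Text", "Number", "Date"


@dataclass
class DocumentType:
    document_type_id: str  # unique ID of the document type definition
    name: str
    group: str
    category: str
    fields: List[Field] = field(default_factory=list)


@dataclass
class Taxonomy:
    document_types: List[DocumentType] = field(default_factory=list)

    def all_ids(self) -> List[str]:
        """Collect every entity ID so that uniqueness can be checked."""
        ids = []
        for doc_type in self.document_types:
            ids.append(doc_type.document_type_id)
            ids.extend(f.field_id for f in doc_type.fields)
        return ids


# One document type with two fields; all values are made up for the example.
invoice = DocumentType(
    document_type_id="Finance.Payables.Invoice",
    name="Invoice",
    group="Finance",
    category="Payables",
    fields=[
        Field("Finance.Payables.Invoice.Number", "Invoice Number", "Text"),
        Field("Finance.Payables.Invoice.Total", "Total", "Number"),
    ],
)
taxonomy = Taxonomy(document_types=[invoice])

ids = taxonomy.all_ids()
ids_are_unique = len(ids) == len(set(ids))
```

The point of the sketch is the nesting (taxonomy, then document types, then fields) and the fact that every definition carries its own ID.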

How Does It Help in Document Classification?

If you want to classify incoming files into different document types, the taxonomy must contain the document types you want to handle specifically. These definitions let you configure your document understanding processes on a uniform data schema: the structure of your taxonomy.

How Does It Help in Data Extraction?

If you want to extract data from certain document types, the taxonomy must contain the list of fields you are targeting for automatic data extraction. These field definitions let you configure the various extraction methods and rules against a single source of truth for the data schema: the structure of your document type.

Field Types and Details

A Field may have derived parts: formatted information extracted or edited from the underlying textual value found in a document.

| Field Type | Allows Multi-Value | Purpose | Derived Parts for Formatting | Additional Information |
|---|---|---|---|---|
| Text | Yes | Textual information | N/A | N/A |
| Number | Yes | Numeric values | Value | N/A |
| Date | Yes | Dates | Day; Month; Year | Date fields allow for the definition of an Expected Format, which must be an MSDN-compliant date format string (for example, dd-MM-yyyy or MM, dd, yyyy). This format is used by the Data Extraction Scope activity when trying to parse a date into its constituent day, month, and year parts. |
| Name | Yes | Person names | Given Name; Middle Name; Last Name | N/A |
| Address | Yes | Addresses | Address Line 1; Address Line 2; Address Line 3; City; State / County / Province; Country; Zip Postal Code | N/A |
| Set | Yes | Define a list of possible values from a predefined set | N/A | A Set field must define the allowed options as values. These are reflected in the Validation Station. |
| Boolean | Yes | Yes/No values | N/A | A Boolean field can only have Yes or No as possible values, and is reflected in the Validation Station. |
| Table | No | Tabular data | N/A | A Table field contains the definition of the columns. |
| Table Column | No | Each cell in the table | N/A | Table Columns in a Table field are defined as one of the regular fields in the Components list. They cannot be of Table type. |
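
To make the Date field's derived parts concrete, here is a minimal sketch of splitting a date string into Day, Month, and Year according to an expected format. It is written in Python for illustration and is not how the Data Extraction Scope activity is implemented. The helper names and the token mapping are assumptions, and only the `dd`, `MM`, `yy`, and `yyyy` tokens are handled.

```python
from datetime import datetime

# Hypothetical mapping from a few .NET-style date tokens to strptime directives.
# "yyyy" is listed before "yy" so the longer token is replaced first.
_TOKEN_MAP = [("yyyy", "%Y"), ("yy", "%y"), ("MM", "%m"), ("dd", "%d")]


def to_strptime_format(expected_format: str) -> str:
    """Translate an expected format such as 'dd-MM-yyyy' into '%d-%m-%Y'."""
    result = expected_format
    for token, directive in _TOKEN_MAP:
        result = result.replace(token, directive)
    return result


def split_date(text: str, expected_format: str) -> dict:
    """Parse the text with the expected format and return the derived parts."""
    parsed = datetime.strptime(text, to_strptime_format(expected_format))
    return {"Day": parsed.day, "Month": parsed.month, "Year": parsed.year}


parts = split_date("13-03-2024", "dd-MM-yyyy")
same_parts = split_date("03, 13, 2024", "MM, dd, yyyy")
```

Both calls describe the same calendar date, so they yield the same three derived parts even though the underlying text differs.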

Other Information Captured in the Taxonomy

The Taxonomy also contains the list of groups and categories, as well as a collection of supported languages that can be associated with the processed documents. For example, to process documents in Japanese and English, the Supported Languages tag must contain the display name and language code of each language. Adding an Undetermined Language (code und) is also recommended, to support exceptional cases.

Taxonomy Extension Methods

Serialize()

Called on a DocumentTaxonomy object, the Serialize() method returns a JSON representation of the object, so that it can be stored and retrieved for later usage.

Deserialize(String)

The DocumentTaxonomy.Deserialize(jsonString) static extension returns a DocumentTaxonomy object, hydrated with the JSON-encoded data passed as a parameter.

GetFields(String)

Called on a DocumentTaxonomy object with a DocumentTypeId string as its argument, the GetFields() method returns the list of fields defined within that document type.
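
The behavior of these three methods can be mimicked with a short stand-in class. This is a sketch in Python under stated assumptions, not the UiPath implementation: the `SketchTaxonomy` class, its method names, and the JSON key names (`DocumentTypes`, `DocumentTypeId`, `Fields`, `FieldName`) are hypothetical and may not match the real taxonomy.json layout.

```python
import json
from typing import Dict, List


class SketchTaxonomy:
    """Hypothetical stand-in for a taxonomy object; not the UiPath class."""

    def __init__(self, document_types: List[Dict]):
        self.document_types = document_types

    def serialize(self) -> str:
        """Return a JSON string that can be stored and reloaded later."""
        return json.dumps({"DocumentTypes": self.document_types})

    @staticmethod
    def deserialize(json_string: str) -> "SketchTaxonomy":
        """Rebuild a taxonomy object from a previously serialized string."""
        data = json.loads(json_string)
        return SketchTaxonomy(data["DocumentTypes"])

    def get_fields(self, document_type_id: str) -> List[Dict]:
        """Return the fields of the matching document type, or an empty list."""
        for doc_type in self.document_types:
            if doc_type["DocumentTypeId"] == document_type_id:
                return doc_type["Fields"]
        return []


original = SketchTaxonomy([
    {
        "DocumentTypeId": "Finance.Payables.Invoice",
        "Name": "Invoice",
        "Fields": [
            {"FieldId": "Finance.Payables.Invoice.Number", "FieldName": "Invoice Number"},
            {"FieldId": "Finance.Payables.Invoice.Total", "FieldName": "Total"},
        ],
    }
])

stored = original.serialize()                   # string that could be saved anywhere
restored = SketchTaxonomy.deserialize(stored)   # same data, new object
field_names = [f["FieldName"] for f in restored.get_fields("Finance.Payables.Invoice")]
```

The round trip is the useful property: whatever is serialized can be deserialized into an equivalent object and queried by document type ID.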

How to Create and Edit Your Project's Taxonomy

Once the UiPath.IntelligentOCR.Activities package is installed in your project in UiPath Studio, a Taxonomy Manager button appears in the main ribbon of Studio's Design tab. Use the Taxonomy Manager wizard to edit your project taxonomy.

The Taxonomy is stored within your UiPath Studio project, in the taxonomy.json file inside the DocumentProcessing folder.

The file is automatically created when you first open the Taxonomy Manager wizard. You can see the exact location of the file in the Taxonomy Manager, by hovering over the button. Alternatively, each time you open the Taxonomy Manager, a pop-up message appears in the upper right corner, informing you of the location of the file. When a project is published from Studio, the taxonomy is published as well, as an artifact of the project.

The taxonomy.json file is unique to each project, but it can be reused if you manually copy it over to a new project. To do so, create a new project, then go to the project folder and copy the taxonomy file of your choice into the right location (the DocumentProcessing folder).

Important: For data integrity purposes, we recommend you always edit the taxonomy using Taxonomy Manager.
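
If you prefer to script the manual copy described above, it could look roughly like the following Python sketch. The `copy_taxonomy` helper and the project folder names are hypothetical. Only the DocumentProcessing folder and the taxonomy.json file name come from this page. The demonstration runs in a temporary directory so it does not touch real projects.

```python
import shutil
import tempfile
from pathlib import Path


def copy_taxonomy(source_project: Path, target_project: Path) -> Path:
    """Copy taxonomy.json from one project folder into another project's
    DocumentProcessing folder, creating that folder if it does not exist."""
    source_file = source_project / "DocumentProcessing" / "taxonomy.json"
    target_dir = target_project / "DocumentProcessing"
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source_file, target_dir / "taxonomy.json"))


# Demonstration with throwaway folders standing in for two Studio projects.
with tempfile.TemporaryDirectory() as workspace:
    old_project = Path(workspace) / "OldProject"
    new_project = Path(workspace) / "NewProject"
    (old_project / "DocumentProcessing").mkdir(parents=True)
    new_project.mkdir()
    (old_project / "DocumentProcessing" / "taxonomy.json").write_text("{}")

    copied = copy_taxonomy(old_project, new_project)
    copied_ok = copied.read_text() == "{}"
```

The file is copied unchanged. Any later edits are still best made through the Taxonomy Manager, as the note above recommends.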

How to Use Your Taxonomy Within Your Project

The taxonomy for document understanding is required as an Object throughout the Document Understanding framework.

The simplest and most convenient way to load your object is by using the Load Taxonomy activity. Once your taxonomy object is loaded, you can use it in all subsequent framework components requiring it.

Advanced Use Cases

  • If you choose to store your taxonomy in a different location, you can still load it in your project (once you obtain the string content of the taxonomy file, let's say in a myTaxonomyContentString variable), by using a simple Assign activity, as follows:

    myTaxonomy = DocumentTaxonomy.Deserialize(myTaxonomyContentString)

  • If your use case demands it, remember that the Taxonomy is a POCO (plain old CLR object) that, when needed, can be edited even at run-time.

© 2005-2024 UiPath. All rights reserved.