2023.10

Document Understanding User Guide

Last updated Apr 15, 2025

Taxonomy overview

What is a taxonomy

The Taxonomy is the metadata that the Document Understanding™ framework considers in each of its steps.

A Taxonomy is a collection of Document Types.
A Document Type is the definition of a logical type of document that must be handled by different business processes. Examples of Document Types are Invoices, Medical Records, IRS Forms W-2, Contracts, etc. Besides a name, group, and category (for easier handling), a Document Type usually contains a collection of Fields.
A Field is one piece of information that is expected to be found and captured from a specific Document Type.

As seen above, a Taxonomy is a hierarchical structure that contains the schema of the information the Document Understanding framework will use throughout. Each entity definition (for document types or fields) found in the Taxonomy has a unique ID.
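The hierarchy described above can be pictured with a minimal Python sketch. This is not the UiPath object model: the class names, attribute names, and sample IDs below are all made up for illustration, and only the overall shape (Taxonomy, then Document Types, then Fields, each with a unique ID) comes from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    # Hypothetical attribute names, chosen for this sketch only.
    field_id: str
    name: str
    field_type: str  # for example "Text", "Number", "Date"


@dataclass
class DocumentType:
    document_type_id: str
    name: str
    group: str
    category: str
    fields: List[Field] = field(default_factory=list)


@dataclass
class Taxonomy:
    document_types: List[DocumentType] = field(default_factory=list)

    def all_ids(self) -> List[str]:
        """Collect every document type ID and field ID, in definition order."""
        ids: List[str] = []
        for doc_type in self.document_types:
            ids.append(doc_type.document_type_id)
            ids.extend(f.field_id for f in doc_type.fields)
        return ids


# Sample data: one document type with two fields (IDs are invented).
invoice = DocumentType(
    document_type_id="invoice",
    name="Invoice",
    group="Finance",
    category="Payables",
    fields=[
        Field("invoice.number", "Invoice Number", "Text"),
        Field("invoice.date", "Invoice Date", "Date"),
    ],
)
taxonomy = Taxonomy([invoice])

# Every entity definition is expected to carry a unique ID.
assert len(taxonomy.all_ids()) == len(set(taxonomy.all_ids()))
```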

How does it help in document classification?

If you want to classify incoming files into different document types, the taxonomy should contain every document type you want to handle specifically. These document types let you configure your document understanding processes based on a uniform data schema: the structure of your taxonomy.

How does it help in data extraction?

If you want to extract data from certain document types, the taxonomy should contain the list of fields that you are targeting for automatic data extraction. These fields let you configure various extraction methods and rules, again based on a single-source-of-truth data schema: the structure of your document type.

Field types and details

A Field may have derived parts: formatted information extracted or edited from the underlying textual value found in a document.

Field Type	Allows Multi-Value	Purpose	Derived Parts for Formatting	Additional Information
Text	Yes	Textual information	N/A	N/A
Number	Yes	Numeric values	Value (up to eight decimals)	N/A
Date	Yes	Dates	Day, Month, Year	Date fields allow for the definition of an Expected Format, which must be an MSDN-compliant date format string (for example, `dd-MM-yyyy` or `MM, dd, yyyy`). This format is used by the Data Extraction Scope activity when trying to parse a date into its constituent day, month, and year parts.
Name	Yes	Person names	Given Name, Middle Name, Last Name	N/A
Address	Yes	Addresses	Address Line 1, Address Line 2, Address Line 3, City, State / County / Province, Country, Zip / Postal Code	N/A
Set	Yes	Define a list of possible values from a predefined set	N/A	A Set field must define the allowed options as values. These are reflected in the Validation Station.
Boolean	Yes	Yes/No values	N/A	A Boolean field can only have Yes or No as possible values, and is reflected in the Validation Station.
Table	No	Tabular data	N/A	A Table field contains the definition of the columns.
Table Column	No	Each cell in the table.	N/A	Table Columns in a Table field are defined as one of the regular fields in the Components list. They cannot be of Table type.
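
To illustrate what the Expected Format of a Date field is for, here is a rough Python sketch that splits a date string into Day, Month, and Year parts. It is not how the Data Extraction Scope activity is implemented. The helper names are hypothetical, and the token mapping only covers the three tokens used in the examples from the table (`dd`, `MM`, `yyyy`), which is a small subset of the .NET custom date format strings.

```python
from datetime import datetime

# Assumed mapping for a small subset of .NET date format tokens.
# Real .NET format strings support many more tokens than these three.
_TOKEN_MAP = [("yyyy", "%Y"), ("MM", "%m"), ("dd", "%d")]


def to_strptime(expected_format: str) -> str:
    """Translate a format such as 'dd-MM-yyyy' into a strptime pattern."""
    pattern = expected_format
    for token, directive in _TOKEN_MAP:
        pattern = pattern.replace(token, directive)
    return pattern


def derived_parts(text: str, expected_format: str) -> dict:
    """Parse the text with the expected format and return its derived parts."""
    parsed = datetime.strptime(text, to_strptime(expected_format))
    return {"Day": parsed.day, "Month": parsed.month, "Year": parsed.year}


parts = derived_parts("15-04-2025", "dd-MM-yyyy")
# parts is {"Day": 15, "Month": 4, "Year": 2025}
```

The same date written as `04, 15, 2025` yields identical parts when the Expected Format is `MM, dd, yyyy`, which is why stating the format removes the day/month ambiguity.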

Other information captured in the taxonomy

The Taxonomy also contains the list of groups and categories, as well as a collection of supported languages that can be associated with the processed documents. For example, to process documents in Japanese and English, the Supported Languages tag must contain their respective display names and language codes. We recommend also adding an Undetermined Language entry (code und) to support exceptional cases.
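
The role of the Undetermined Language entry can be sketched in a few lines of Python. The structure and key names below are hypothetical and do not reproduce the taxonomy file layout. The codes `ja` and `en` are assumed here for Japanese and English, and only `und` is taken from the text above.

```python
# Hypothetical in-memory list of supported languages (display name + code).
SUPPORTED_LANGUAGES = [
    {"name": "Japanese", "code": "ja"},               # assumed code
    {"name": "English", "code": "en"},                # assumed code
    {"name": "Undetermined Language", "code": "und"},
]


def resolve_language(detected_code):
    """Return the detected code if it is supported, otherwise fall back to 'und'."""
    known_codes = {language["code"] for language in SUPPORTED_LANGUAGES}
    return detected_code if detected_code in known_codes else "und"
```

With this fallback, a document whose language is missing or not listed is still associated with a valid entry instead of failing the lookup.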

Taxonomy extension methods

Serialize()

Called on a DocumentTaxonomy object, the Serialize() method returns a JSON representation of the object, so that it can be stored and retrieved for later usage.

Deserialize(String)

The DocumentTaxonomy.Deserialize(jsonString) static extension returns a DocumentTaxonomy object, hydrated with the JSON-encoded data passed as a parameter.
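
The round trip that Serialize() and Deserialize(String) make possible can be pictured with a Python analogy. This sketch does not call the UiPath API: the classes, the lowercase method names, and the JSON key names are invented for illustration and do not match the real taxonomy JSON.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DocumentType:
    document_type_id: str
    name: str


@dataclass
class Taxonomy:
    document_types: List[DocumentType] = field(default_factory=list)

    def serialize(self) -> str:
        """Return a JSON string that can be stored and reloaded later."""
        return json.dumps(asdict(self))

    @staticmethod
    def deserialize(json_string: str) -> "Taxonomy":
        """Rebuild a Taxonomy object from a JSON string produced by serialize()."""
        data = json.loads(json_string)
        return Taxonomy([DocumentType(**item) for item in data["document_types"]])


original = Taxonomy([DocumentType("invoice", "Invoice")])
stored = original.serialize()             # a plain string, safe to save anywhere
restored = Taxonomy.deserialize(stored)   # an equivalent object, rebuilt from the string

assert restored == original
```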

GetFields(String)

Called on a DocumentTaxonomy object with a DocumentTypeId string, the GetFields() method returns a list of the fields defined within that document type.
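
As a rough analogy for this lookup, the following Python sketch returns the fields of one document type by its ID. The data layout, the `get_fields` helper, and the sample IDs are hypothetical. Raising an error for an unknown ID is a choice made for this sketch, not documented behavior of GetFields().

```python
from typing import Dict, List

# Hypothetical taxonomy: document type ID -> names of its fields.
SAMPLE_TAXONOMY: Dict[str, List[str]] = {
    "invoice": ["Invoice Number", "Invoice Date", "Total"],
    "contract": ["Party A", "Party B", "Effective Date"],
}


def get_fields(taxonomy: Dict[str, List[str]], document_type_id: str) -> List[str]:
    """Return a copy of the field list defined for the given document type ID."""
    if document_type_id not in taxonomy:
        # Sketch-only decision: fail loudly on an unknown document type ID.
        raise KeyError(f"Unknown document type ID: {document_type_id}")
    return list(taxonomy[document_type_id])


invoice_fields = get_fields(SAMPLE_TAXONOMY, "invoice")
```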

How to create and edit your project's taxonomy

Once the UiPath.IntelligentOCR.Activities package is installed in your project in UiPath® Studio, a Taxonomy Manager button appears in the main ribbon of Studio's Design tab. Use the Taxonomy Manager wizard to edit your project taxonomy.

The Taxonomy is stored within your UiPath Studio project, in the taxonomy.json file inside the DocumentProcessing folder.

The file is automatically created when you first open the Taxonomy Manager wizard. You can see the exact location of the file in the Taxonomy Manager by hovering over the button. Alternatively, each time you open the Taxonomy Manager, a pop-up message appears in the upper-right corner, informing you of the location of the file. When a project is published from Studio, the taxonomy is published as well, as an artifact of the project.

The taxonomy.json file is unique to each project, but it can be reused if you manually copy it over to a new project. To do so, create a new project, then go to the project folder and copy the taxonomy file of your choice into the right location (the DocumentProcessing folder).
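
If you prefer to script the manual copy described above, a minimal Python sketch could look like the following. The `copy_taxonomy` helper and its parameters are hypothetical. Only the DocumentProcessing folder name and the taxonomy.json file name come from this page.

```python
import shutil
from pathlib import Path


def copy_taxonomy(source_project: Path, target_project: Path) -> Path:
    """Copy DocumentProcessing/taxonomy.json from one project folder to another."""
    source = source_project / "DocumentProcessing" / "taxonomy.json"
    target_dir = target_project / "DocumentProcessing"
    # Create the folder in the new project if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 keeps file metadata and overwrites an existing target file.
    return Path(shutil.copy2(source, target_dir / "taxonomy.json"))
```

Note that an existing taxonomy.json in the target project is overwritten, so run this only against a project whose taxonomy you intend to replace.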

Important: For data integrity purposes, we recommend you always edit the taxonomy using Taxonomy Manager.

How to use your taxonomy within your project

The taxonomy for document understanding is required as an Object throughout the Document Understanding framework.

The simplest and most convenient way to load your object is by using the Load Taxonomy activity. Once your taxonomy object is loaded, you can use it in all subsequent framework components requiring it.

Advanced use cases

If you choose to store your taxonomy in a different location, you can still load it in your project. Once you have the string content of the taxonomy file (for example, in a myTaxonomyContentString variable), use a simple Assign activity, as follows:

myTaxonomy = DocumentTaxonomy.Deserialize(myTaxonomyContentString)

If your use case demands it, remember that the Taxonomy is a POCO (plain old CLR object) that can be edited when needed, even at run time.
