# Taxonomy overview

> The Taxonomy is the metadata that the **Document Understanding<sup>TM</sup>** framework considers in each of its steps.

## What is a taxonomy

* A **Taxonomy** is a collection of Document Types.
* A **Document Type** is the definition of a logical type of document that must be handled by different business processes. Examples of Document Types are Invoices, Medical Records, IRS Forms W-2, and Contracts. Besides a name, group, and category (for easier handling), a Document Type usually contains a collection of Fields.
* A **Field** is one piece of information that is expected to be found and captured from a specific Document Type.

A Taxonomy is a hierarchical structure that contains the schema of the information the Document Understanding framework will use throughout. Each entity definition (for document types or fields) found in the Taxonomy has a unique ID.
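
The hierarchy described above can be pictured with a small, language-neutral sketch. The Python below is only an illustration of the Taxonomy → Document Type → Field structure and of the unique-ID requirement. It is not the UiPath API: every class, attribute, and ID value in it is a hypothetical name chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    """One piece of information to capture (hypothetical model)."""
    field_id: str    # unique ID of the field
    name: str
    field_type: str  # for example "Text", "Number", or "Date"


@dataclass
class DocumentType:
    """A logical type of document with its own collection of fields."""
    document_type_id: str  # unique ID of the document type
    name: str
    group: str
    category: str
    fields: List[Field] = field(default_factory=list)


@dataclass
class Taxonomy:
    """A collection of document types."""
    document_types: List[DocumentType] = field(default_factory=list)

    def all_ids(self) -> List[str]:
        """Collect the IDs of every document type and every field."""
        ids = []
        for doc_type in self.document_types:
            ids.append(doc_type.document_type_id)
            ids.extend(f.field_id for f in doc_type.fields)
        return ids

    def ids_are_unique(self) -> bool:
        """Return True when no ID is used by more than one entity."""
        ids = self.all_ids()
        return len(ids) == len(set(ids))


# Example values are made up for illustration.
invoice = DocumentType(
    document_type_id="invoice",
    name="Invoice",
    group="Finance",
    category="Accounting",
    fields=[
        Field("invoice.number", "Invoice Number", "Text"),
        Field("invoice.total", "Total", "Number"),
    ],
)
taxonomy = Taxonomy(document_types=[invoice])
```

Modeling fields as children of a document type mirrors the description above: a field only has meaning in the context of the document type that defines it.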

### How does it help in document classification?

If you want to classify incoming files into different document types, the taxonomy should contain every document type you want to handle specifically. These definitions allow you to configure your document understanding processes based on a uniform data schema: the structure of your taxonomy.

### How does it help in data extraction?

If you want to extract data from certain document types, the taxonomy should contain the list of fields you are targeting for automatic data extraction. These definitions allow you to configure various extraction methods and rules based on a single source of truth for the data schema: the structure of your document type.

### Field types and details

A Field may have derived parts: formatted information extracted or edited from the underlying textual value found in a document.

| Field Type | Allows Multi-Value | Purpose | Derived Parts for Formatting | Additional Information |
| --- | --- | --- | --- | --- |
| Text | Yes | Textual information | N/A | N/A |
| Number | Yes | Numeric values | Value (up to eight decimals) | N/A |
| Date | Yes | Dates | Day, Month, Year | Date fields allow for the definition of an **Expected Format**, which must be an MSDN-compliant date format string (for example, `dd-MM-yyyy` or `MM, dd, yyyy`). This format is used by the **Data Extraction Scope** activity when trying to parse a date into its constituent day, month, and year parts. |
| Name | Yes | Person names | Given Name, Middle Name, Last Name | N/A |
| Address | Yes | Addresses | Address Line 1, Address Line 2, Address Line 3, City, State / County / Province, Country, Zip Postal Code | N/A |
| Set | Yes | Define a list of possible values from a predefined set | N/A | A Set field must define the allowed options as values. These are reflected in the **Validation Station**. |
| Boolean | Yes | Yes/No values | N/A | A Boolean field can only have Yes or No as possible values, and is reflected in the **Validation Station**. |
| Table | No | Tabular data | N/A | A Table field contains the definition of the columns. |
| Table Column | No | Each cell in the table | N/A | Table Columns in a Table field are defined as one of the regular fields in the Components list. They cannot be of Table type. |
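
To make the idea of derived parts concrete, here is a minimal sketch of how an Expected Format such as `dd-MM-yyyy` could be used to split a date string into Day, Month, and Year. It is written in Python purely for illustration and does not show how the Data Extraction Scope activity is implemented. The function names and the token mapping are assumptions, and the mapping covers only the three tokens used in the examples above.

```python
from datetime import datetime

# Hypothetical, partial mapping from .NET-style date tokens to Python
# strptime directives. Only the tokens used in this page are covered.
_TOKEN_MAP = [("yyyy", "%Y"), ("MM", "%m"), ("dd", "%d")]


def to_strptime_format(expected_format: str) -> str:
    """Translate a format such as 'dd-MM-yyyy' into '%d-%m-%Y'."""
    result = expected_format
    for token, directive in _TOKEN_MAP:
        result = result.replace(token, directive)
    return result


def split_date(raw_text: str, expected_format: str) -> dict:
    """Parse raw_text with the expected format and return its derived parts."""
    parsed = datetime.strptime(raw_text, to_strptime_format(expected_format))
    return {"Day": parsed.day, "Month": parsed.month, "Year": parsed.year}


parts = split_date("25-12-2023", "dd-MM-yyyy")
```

The same raw text parsed with a different Expected Format would yield different parts or fail to parse, which is why the format is defined once on the field rather than guessed for each document.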

### Other information captured in the taxonomy

The Taxonomy also contains the list of groups and categories, as well as a collection of supported languages that can be associated with the processed documents. For example, to process documents in Japanese and English, the Supported Languages tag must contain the display name and language code of each language. We recommend also adding an **Undetermined Language** (code `und`) to support exceptional cases.

## Taxonomy extension methods

### Serialize()

Called on a `DocumentTaxonomy` object, the `Serialize()` method returns a JSON representation of the object, so that it can be stored and retrieved for later use.

### Deserialize(String)

The `DocumentTaxonomy.Deserialize(jsonString)` static extension returns a `DocumentTaxonomy` object, hydrated with the JSON-encoded data passed as a parameter.

### GetFields(String)

Called on a `DocumentTaxonomy` object with a `DocumentTypeId` string as its argument, the `GetFields()` method returns the list of fields defined within that document type.
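
The behavior of these three methods can be sketched with plain dictionaries and the standard `json` module. The Python below is an analogy, not the UiPath implementation: the function names mirror the methods above, but the key names (`documentTypes`, `documentTypeId`, `fields`) and the sample data are assumptions and do not describe the actual layout of `taxonomy.json`.

```python
import json


def serialize(taxonomy: dict) -> str:
    """Return a JSON string that can be stored and loaded again later."""
    return json.dumps(taxonomy, indent=2)


def deserialize(json_string: str) -> dict:
    """Rebuild the taxonomy structure from its JSON representation."""
    return json.loads(json_string)


def get_fields(taxonomy: dict, document_type_id: str) -> list:
    """Return the fields of the document type with the given ID."""
    for doc_type in taxonomy["documentTypes"]:
        if doc_type["documentTypeId"] == document_type_id:
            return doc_type["fields"]
    return []  # unknown document type: nothing to return


# Hypothetical sample data, for illustration only.
sample = {
    "documentTypes": [
        {
            "documentTypeId": "invoice",
            "name": "Invoice",
            "fields": [
                {"fieldId": "invoice.number", "type": "Text"},
                {"fieldId": "invoice.total", "type": "Number"},
            ],
        }
    ]
}

restored = deserialize(serialize(sample))
invoice_fields = get_fields(restored, "invoice")
```

The round trip at the end is the property that matters: a serialized taxonomy, once deserialized, describes the same document types and fields as the original object.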

## How to create and edit your project's taxonomy

Once the **UiPath.IntelligentOCR.Activities** package is installed in your project in UiPath® Studio, a **Taxonomy Manager** button appears in the main ribbon of Studio's Design tab. Use the **Taxonomy Manager** wizard to edit your project taxonomy.

The Taxonomy is stored within your UiPath Studio project, in the `taxonomy.json` file inside the **DocumentProcessing** folder.

The file is automatically created when you first open the **Taxonomy Manager** wizard. To check the exact location of the file, hover over the ![](https://dev-assets.cms.uipath.com/assets/images/document-understanding/document-understanding-image-info_button_1-f110c81c-34da4e16.png) button in the **Taxonomy Manager**. Alternatively, each time you open the **Taxonomy Manager**, a pop-up message in the upper-right corner shows the location of the file. When a project is published from Studio, the taxonomy is published along with it, as an artifact of the project.

The `taxonomy.json` file is unique to each project, but you can reuse it by manually copying it to a new project. To do so, create a new project, then go to the project folder and copy the `taxonomy.json` file of your choice into the **DocumentProcessing** folder.
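
If you reuse taxonomies often, the manual copy can be scripted. The sketch below uses only the Python standard library and assumes that the **DocumentProcessing** folder sits directly under each project's root folder. The `copy_taxonomy` function and the example paths are hypothetical.

```python
import shutil
from pathlib import Path


def copy_taxonomy(source_project: Path, target_project: Path) -> Path:
    """Copy taxonomy.json from one project folder into another.

    Assumes DocumentProcessing is located directly under the project root.
    """
    source_file = source_project / "DocumentProcessing" / "taxonomy.json"
    target_dir = target_project / "DocumentProcessing"
    # Create the folder in the new project if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    target_file = target_dir / "taxonomy.json"
    shutil.copy2(source_file, target_file)
    return target_file


# Example call (paths are placeholders):
# copy_taxonomy(Path("C:/Projects/OldProject"), Path("C:/Projects/NewProject"))
```

Note that this overwrites any `taxonomy.json` already present in the target project, so run it only against a new project or after backing up the existing file.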

:::important
For data integrity purposes, we recommend you always edit the taxonomy using Taxonomy Manager.
:::

## How to use your taxonomy within your project

The taxonomy for document understanding is required as an Object throughout the **Document Understanding** framework.

The simplest and most convenient way to load your object is by using the [Load Taxonomy](https://docs.uipath.com/activities/other/latest/document-understanding/load-taxonomy) activity. Once your taxonomy object is loaded, you can use it in all subsequent framework components requiring it.

## Advanced use cases

* If you choose to store your taxonomy in a different location, you can still load it in your project. First obtain the string content of the taxonomy file (for example, in a `myTaxonomyContentString` variable), then use a simple **Assign** activity, as follows:

  `myTaxonomy = DocumentTaxonomy.Deserialize(myTaxonomyContentString)`
* If your use case demands it, remember that the Taxonomy is a POCO (Plain Old CLR Object) that, when needed, can be edited *even at run-time*.
