Document Understanding User Guide

Last updated Feb 4, 2025

Intelligent Keyword Classifier

What Is Intelligent Keyword Classifier

The Intelligent Keyword Classifier is a classifier that uses the word vector it learns from files of certain document types to perform document classification.

The algorithm relies on the fact that documents of the same type tend to repeat content: each document type has a set of words that usually occur in it. This makes it possible to represent each document type as a word vector and to compute the similarity between vectors.

When classifying a file into a document type, the Intelligent Keyword Classifier:

finds the learned word vector that the file is most similar to,
reports the highest-scoring document type, along with the main words that matched.
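
The idea behind the two steps above can be illustrated with a minimal sketch. This is not the classifier's actual implementation, which this guide does not describe. It assumes a plain bag-of-words vector and cosine similarity, and the function names (word_vector, cosine_similarity, classify) and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def word_vector(text):
    """Count lowercase word occurrences in a text (a simple bag-of-words vector)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, learned_vectors):
    """Return (best document type, score, matching words) for a text."""
    if not learned_vectors:
        raise ValueError("at least one learned document type is required")
    file_vector = word_vector(text)
    best_type, best_score = None, -1.0
    for doc_type, vector in learned_vectors.items():
        score = cosine_similarity(file_vector, vector)
        if score > best_score:
            best_type, best_score = doc_type, score
    # Words shared by the file and the winning document type.
    matching = sorted(set(file_vector) & set(learned_vectors[best_type]))
    return best_type, best_score, matching


# Hypothetical training texts, one per document type (assumption, not real data).
learned = {
    "Invoice": word_vector("invoice number total amount due payment terms"),
    "Receipt": word_vector("receipt cash change thank you store"),
}
doc_type, score, words = classify("Invoice 1001: total amount due in 30 days", learned)
print(doc_type, round(score, 3), words)
```

In this sketch the file shares four words with the hypothetical Invoice vector and none with the Receipt vector, so Invoice is reported together with those four matching words.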

The Intelligent Keyword Classifier also has file splitting capabilities, meaning that it can report more than one class for a given file, for separate page ranges.
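
What "more than one class for separate page ranges" means can be sketched as follows. The guide does not explain how the classifier decides where to split, so this example assumes that a label is already available for each page and only shows how consecutive pages with the same label would be merged into ranges. The function name split_into_ranges and the sample labels are hypothetical.

```python
from itertools import groupby


def split_into_ranges(page_labels):
    """Merge consecutive pages with the same label into (label, first_page, last_page)."""
    ranges = []
    page = 1  # page numbers are 1-based in this sketch
    for label, group in groupby(page_labels):
        length = len(list(group))
        ranges.append((label, page, page + length - 1))
        page += length
    return ranges


# Hypothetical per-page labels for a five-page file (assumption, not real output).
ranges = split_into_ranges(["Invoice", "Invoice", "Receipt", "Receipt", "Receipt"])
print(ranges)
```

Here the five-page file is reported as two documents: an Invoice on pages 1 to 2 and a Receipt on pages 3 to 5.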

When To Use

You should consider using this classifier if:

your files contain one or more document types within a single file
your document types are relatively easy to differentiate as far as content goes.

Special Requirements

To use this classifier, you need either your Automation Cloud Document Understanding API key or your own instance of the Intelligent Keyword Classifier hosted in AI Center on-premises.

How To Configure at Design-Time

You can configure the Intelligent Keyword Classifier at design-time by opening the Manage Learning wizard of the activity. You can also use the wizard to review data collected during the document classification training phase, by opening it with an updated learning file path.

This wizard allows you to configure and manage the training data that the activity uses to identify document types and classify documents. It is designed to edit the learning file at a given file path. If the activity is configured with a Learning Data variable instead, you are asked whether you want to edit a specific file path or abort the operation.

Note: The Manage Learning wizard only works when the activity is configured with a Learning File Path string. It does not work with a Learning File Path set as a variable input, or with a LearningData string input.

Add an Intelligent Keyword Classifier/Intelligent Keyword Classifier Trainer activity to your workflow.
Configure your Intelligent Keyword Classifier activity by adding the path of a .json file.
- If no path is provided and the Manage Learning option is clicked, then a popup is displayed, asking for a Learning File Path input. Once the path is provided, the wizard opens.
- You can provide a variable instead of a .json file path. However, because the wizard cannot apply the learning pattern to a LearningData variable, it asks for a specific file path that can be edited.
Click on the Manage Learning option.
- The Wizard window opens.

Note: If no .json file exists yet, you can enter the name of a new .json file directly in the activity. The file is then created automatically in the specified folder.

The screenshot below shows a document type that has been trained, one that has not, and one that has been trained and opened to be viewed or deleted.

For document types that have not been trained yet, you can perform design-time training using the Start Training option. For document types that already have some training, you can either delete the existing training to start over, using the delete option, or perform extra training (cumulative to the existing one) using the edit option.

Note:

Training Files Fed into Design Time Training Should Contain Single Document Types

Each training file must contain a single instance of a single document type. Do not run design-time training on files that contain two or more document types, because the resulting training data will be incorrect.

Once a new training session is started, a new screen is displayed, asking for the training files and the OCR engine to use.

Each OCR engine comes with its own set of custom options. See the documentation for each OCR engine for details about its available options.

Note:

The following OCR engines do not support rotated documents and should not be used to process such documents:

Microsoft OCR
Tesseract OCR

Only training data from document types that have been trained is eligible for export. Document types that have not been trained cannot be selected.

Exporting Training Data

You can export training data following these steps:

Select document types that have been trained.
Click the Export button.
If you have unsaved changes, the following message is displayed.
Click Yes.
Save the training data archive with the desired name.
A message is displayed stating how many document type training data sets were exported.
Click OK. The wizard closes.

Importing Training Data

You can import training data following these steps:

Click the Import button.
Select the training data archive and click Open.
Select the document types you want.
Click the Import button.
The training data is imported.

The table below explains each message displayed when importing training data:

Import Type	Message displayed
New Document Type and Word Vectors	This document type will be added to the taxonomy
New Word Vector (none was previously defined)	N/A
Identical Document Type and Word Vector	The word vector for this document type will be overwritten