Document Understanding User Guide
Intelligent Keyword Classifier
The Intelligent Keyword Classifier performs document classification using the word vectors it learns from files of known document types.
The algorithm relies on content that repeats across documents of the same type. It starts from the premise that each document type has a set of words that usually occur in it, which makes a vector similarity computation possible.
When classifying a file into a document type, the Intelligent Keyword Classifier:
- finds the learned word vector that the file is most similar to,
- reports the highest-scoring document type, along with the main words that produced the match.
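The two steps above can be pictured with a small sketch. It is not the classifier's actual implementation: it assumes plain word-count vectors and cosine similarity, and the names `word_vector`, `cosine_similarity`, and `classify` are hypothetical helpers introduced only for illustration.

```python
import math
import re
from collections import Counter


def word_vector(text: str) -> Counter:
    """Build a word-count vector from raw text (assumed representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text: str, learned: dict[str, Counter]) -> tuple[str, float, list[str]]:
    """Return (best document type, score, matching words) for the text."""
    vector = word_vector(text)
    # Score the text against every learned document type and keep the best.
    best_type, best_score = max(
        ((doc_type, cosine_similarity(vector, ref)) for doc_type, ref in learned.items()),
        key=lambda pair: pair[1],
    )
    # Words shared by the text and the winning document type's vector.
    matched = sorted(set(vector) & set(learned[best_type]))
    return best_type, best_score, matched


# Hypothetical learned vectors for two document types.
learned = {
    "Invoice": word_vector("invoice number total amount due payment terms"),
    "Receipt": word_vector("receipt cashier change paid thank you"),
}
doc_type, score, matched = classify("Invoice total amount due in 30 days", learned)
```

Here `doc_type` is the highest-scoring document type and `matched` lists the shared words behind the score, mirroring what the classifier reports.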
The Intelligent Keyword Classifier also has file splitting capabilities, meaning that it can report more than one class for a given file, for separate page ranges.
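As a rough illustration of what reporting separate page ranges means, the sketch below collapses a list of per-page document types into contiguous ranges. It assumes each page has already been assigned a document type; the function `page_ranges` and the sample labels are hypothetical and do not reflect how the classifier splits files internally.

```python
from itertools import groupby


def page_ranges(page_labels: list[str]) -> list[tuple[str, int, int]]:
    """Collapse per-page labels into (document type, first page, last page).

    Page numbers are 1-indexed and inclusive.
    """
    ranges = []
    page = 1
    # groupby yields one group per run of consecutive identical labels.
    for label, run in groupby(page_labels):
        length = len(list(run))
        ranges.append((label, page, page + length - 1))
        page += length
    return ranges


# Hypothetical per-page results for a four-page file.
labels = ["Invoice", "Invoice", "Receipt", "Invoice"]
result = page_ranges(labels)
```

With these sample labels, the file is reported as three separate ranges: pages 1 to 2, page 3, and page 4.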
You should consider using this classifier if:
- your files contain one or more document types within a single file
- your document types are relatively easy to differentiate as far as content goes.
You need to use your Automation Cloud Document Understanding API Key, or host your own instance of the Intelligent Keyword Classifier in AI Center on-premises, to use this classifier.
You can configure the Intelligent Keyword Classifier at design time by opening the Manage Learning wizard of the activity. You can also use the wizard to review the data collected during the document classification training phase, by opening it with an updated learning file path.
This wizard allows you to configure and manage the training data that the activity uses to identify the document type and classify documents. It is designed to edit a learning file referenced by a file path. If a Learning Data option with a variable is used instead, you are asked whether you want to edit a specific file path or abort the operation.
The below screenshot presents a document type that has been trained, one that hasn't, and one that has been trained and accessed to be viewed or deleted.
For document types that have not been trained yet, design-time training can be performed using the Start Training option. For document types that already have some training, you can either delete it to start over, by using this option, or perform extra training (cumulative to the already existing one) using the edit option.
Training Files Fed into Design Time Training Should Contain Single Document Types
Training files to be used must contain a single document type instance per file. Do not run design-time training on files that contain two or more document types, as your training data will be erroneous.
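The cumulative training described above can be pictured with the following sketch. It assumes, purely for illustration, that a document type's word vector is a table of word counts and that extra training adds new counts to the existing ones; the function `train` is hypothetical and is not part of the activity.

```python
import re
from collections import Counter


def train(existing: Counter, training_texts: list[str]) -> Counter:
    """Return a new word vector: existing counts plus counts from new texts."""
    updated = Counter(existing)  # copy, so the previous vector is untouched
    for text in training_texts:
        updated.update(re.findall(r"[a-z0-9]+", text.lower()))
    return updated


# Initial training for one document type, then an extra cumulative round.
invoice_vector = train(Counter(), ["Invoice number 17", "Invoice total due"])
invoice_vector = train(invoice_vector, ["Invoice payment terms"])
```

Under this assumption, feeding in a file that mixes two document types would add the second type's words to the wrong vector, which is why each training file should contain a single document type.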
Once a new training has been initiated, a new screen is displayed asking for the training files and the OCR engine that should be used.
Each OCR engine comes with its own set of custom options. Here you can find more details about all options available for each OCR engine.
The following OCR engines do not support rotated documents and should not be used to process such documents:
- Microsoft OCR
- Tesseract OCR
Only training data from document types that have been trained is eligible for export. Document types that have not been trained cannot be selected.
You can export training data following these steps:
- Select document types that have been trained.
- Click the Export button.
- If you have unsaved changes, the following message is displayed.
- Click Yes.
- Save the training data archive with the desired name.
- A message is displayed stating how many document type training data sets were exported.
- Click OK. The wizard closes.
You can import training data following these steps:
- Click the Import button.
- Select the training data archive and click Open.
- Select the document types you want.
- Click the Import button.
- The training data is imported.
The table below explains each message displayed when importing training data:
Import Type | Message displayed |
---|---|
New Document Type and Word Vectors | This document type will be added to the taxonomy |
New Word Vector (none was previously defined) | N/A |
Identical Document Type and Word Vector | The word vector for this document type will be overwritten |
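One way to read the table is as a simple decision rule, sketched below. This is an interpretation, not the wizard's code: it assumes the taxonomy can be modeled as a mapping from document type to its word vector, with `None` meaning that no word vector was defined. The function `import_message` and the sample taxonomy are hypothetical.

```python
from typing import Optional

ADD_MESSAGE = "This document type will be added to the taxonomy"
OVERWRITE_MESSAGE = "The word vector for this document type will be overwritten"


def import_message(doc_type: str, taxonomy: dict[str, Optional[dict]]) -> Optional[str]:
    """Return the message for importing doc_type, or None when no message applies."""
    if doc_type not in taxonomy:
        # New document type and word vector.
        return ADD_MESSAGE
    if taxonomy[doc_type] is None:
        # Document type exists, but no word vector was previously defined (N/A).
        return None
    # Document type and word vector already exist.
    return OVERWRITE_MESSAGE


# Hypothetical taxonomy: one trained type, one untrained type.
taxonomy = {"Invoice": {"invoice": 3, "total": 1}, "Receipt": None}
messages = {name: import_message(name, taxonomy) for name in ["Contract", "Receipt", "Invoice"]}
```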
Place the Intelligent Keyword Classifier Trainer activity in a Train Classifiers Scope, and configure it accordingly.
For more information, check Document Classification Training.