Document Understanding User Guide
Intelligent Keyword Classifier
The Intelligent Keyword Classifier classifies documents using word vectors that it learns from sample files of each document type.
The algorithm relies on content that repeats within a document type: it assumes that each document type has a set of words that usually occur in it, which makes a vector similarity computation possible.
When classifying a file into a document type, the Intelligent Keyword Classifier:
- finds the learned word vector that the file is most similar to,
- reports the highest-scoring document type, along with the main words that matched.
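As a rough illustration of the idea (not the product's actual implementation, which this guide does not describe), the two steps above can be sketched with bag-of-words vectors and cosine similarity. The function names, the tokenizer, and the sample learned vectors are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Turn text into a bag-of-words vector (word -> count)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, learned):
    """Return (best document type, score, matching words) for a text."""
    vec = to_vector(text)
    scores = {doc_type: cosine(vec, ref) for doc_type, ref in learned.items()}
    best = max(scores, key=scores.get)
    matches = sorted(vec.keys() & learned[best].keys())
    return best, scores[best], matches


# Hypothetical learned vectors, one per document type.
learned = {
    "Invoice": to_vector("invoice total amount due payment invoice number"),
    "Receipt": to_vector("receipt cash change thank you store"),
}

doc_type, score, words = classify("Invoice number 42, total amount due: 100", learned)
print(doc_type, round(score, 2), words)
```

The sample text shares five words with the hypothetical Invoice vector and none with the Receipt vector, so Invoice gets the highest score and those five words are reported as the matches.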
The Intelligent Keyword Classifier also has file splitting capabilities, meaning that it can report more than one class for a given file, for separate page ranges.
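How the classifier decides where one document ends and the next begins is not described here. The sketch below only illustrates the shape of the result, assuming a class label is already available for every page; the function name and the sample labels are hypothetical.

```python
from itertools import groupby


def page_ranges(page_labels):
    """Merge consecutive pages that share a class into (class, first, last) ranges."""
    ranges = []
    page = 1  # pages are numbered from 1
    for label, group in groupby(page_labels):
        count = len(list(group))
        ranges.append((label, page, page + count - 1))
        page += count
    return ranges


# Hypothetical per-page classification results for a five-page file.
labels = ["Invoice", "Invoice", "Receipt", "Receipt", "Receipt"]
print(page_ranges(labels))
```

For the sample labels, pages 1 to 2 are reported as Invoice and pages 3 to 5 as Receipt.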
You should consider using this classifier if:
- your files contain one or more document types within a single file
- your document types are relatively easy to differentiate as far as content goes.
You need to use your Automation Cloud Document Understanding API Key, or host your own instance of the Intelligent Keyword Classifier in AI Center on-prem, to use this classifier.
You can configure the Intelligent Keyword Classifier at design time by opening the Manage Learning wizard of the activity. To review data collected during the document classification training phase, open the same wizard with an updated learning file path.
This wizard allows you to configure and manage the training data that the activity uses to identify the document type and classify documents. The wizard works by editing a file at a specific path. If the Learning Data option is set with a variable instead, you are asked whether you want to edit a specific file path or abort the operation.
- Add an Intelligent Keyword Classifier or Intelligent Keyword Classifier Trainer activity to your workflow.
- Configure your Intelligent Keyword Classifier activity by adding the path of a .json file.
  - If no path is provided and the Manage Learning option is clicked, a popup is displayed, asking for a Learning File Path input. Once the path is provided, the wizard opens.
  - A variable can be added instead of a .json file, but, because the wizard cannot apply the learning pattern to a LearningData variable, it asks for a specific file path that can be edited.
- Click on the Manage Learning option.
- The Wizard window opens.
Note: Even if no .json file is available, you can add the name of a new .json file straight into the activity, and the .json file is automatically created inside the specified folder.
The screenshot below shows a document type that has been trained, one that has not, and one that has been trained and opened for viewing or deletion.
For document types that have not been trained yet, you can perform design-time training using the Start Training option. For document types that already have some training, you can either delete the existing training to start over, or perform extra training (cumulative with the existing training) using the edit option.
Once a new training has been initiated, a new screen is displayed asking for the training files and the OCR engine that should be used.
Each OCR engine comes with its own set of custom options. Here you can find more details about all options available for each OCR engine.
The following OCR engines do not support rotated documents and should not be used to process such documents:
- Microsoft OCR
- Tesseract OCR
Only training data from document types that have been trained is eligible for export. Document types that have not been trained cannot be selected.
You can export training data following these steps:
- Select document types that have been trained.
- Click the Export button.
- If you have unsaved changes, the following message is displayed.
- Click Yes.
- Save the training data archive with the desired name.
- A message is displayed stating how many document type training data sets were exported.
- Click OK to return to the main screen of the wizard.
You can import training data following these steps:
- Click the Import button.
- Select the training data archive and click Open.
- Select the document types you want.
- Click the Import button.
- The training data is imported.
The below table explains each message displayed when importing training data:
| Import Type | Message displayed |
|---|---|
| New Document Type and Word Vectors | This document type will be added to the taxonomy |
| New Word Vector (none was previously defined) | N/A |
| Identical Document Type and Word Vector | The word vector for this document type will be overwritten |
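The three cases in the table can be read as a simple decision on the state of the existing taxonomy. The sketch below is one interpretation, not the wizard's actual logic: it assumes the existing taxonomy can be modeled as a mapping from document type to word vector (or to `None` when no vector has been defined), and it treats "N/A" as "no message is shown". The function and constant names are hypothetical.

```python
ADDED = "This document type will be added to the taxonomy"
OVERWRITTEN = "The word vector for this document type will be overwritten"


def import_message(doc_type, existing):
    """Pick the message for one imported document type.

    `existing` maps known document types to their word vector,
    or to None when no vector has been defined yet (assumed model).
    """
    if doc_type not in existing:
        return ADDED  # new document type and word vector
    if existing[doc_type] is None:
        return None  # "N/A" in the table, read here as "no message"
    return OVERWRITTEN  # document type already has a word vector


# Hypothetical existing taxonomy.
existing = {"Invoice": {"total": 3, "due": 2}, "Receipt": None}
for name in ("Contract", "Receipt", "Invoice"):
    print(name, "->", import_message(name, existing))
```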
Place the Intelligent Keyword Classifier Trainer activity in a Train Classifiers Scope, and configure it accordingly.
We cannot enforce training file consistency across parallel trainings at the activity level. Two possible solutions for this issue are provided by Document Understanding Process. Both consist of traffic control:
- lock files (implemented by default in the process): rename the file using the .lock extension, modify and save the file, then rename the file again, removing the .lock extension
- manual setup of a special queue: make an empty queue in Orchestrator and integrate your two activities from the project
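The lock-file approach can be sketched as follows. This is a minimal illustration of the rename, modify, rename-back sequence, not the code used by the Document Understanding Process. It assumes the .lock extension is appended to the file name, that a worker that finds the file missing waits and retries, and that the learning file is a JSON object mapping document types to word lists; that layout and all names are hypothetical.

```python
import json
import os
import time
from contextlib import contextmanager


@contextmanager
def locked(path, retries=50, delay=0.1):
    """Rename `path` to `path + '.lock'` while it is edited, then rename it back."""
    lock_path = path + ".lock"
    for _ in range(retries):
        try:
            # Fails with FileNotFoundError if another worker already renamed the file.
            os.rename(path, lock_path)
            break
        except FileNotFoundError:
            time.sleep(delay)
    else:
        raise TimeoutError(f"could not lock {path}")
    try:
        yield lock_path
    finally:
        os.rename(lock_path, path)  # release the lock


def add_training(path, doc_type, words):
    """Append words to one document type in a (hypothetical) JSON learning file."""
    with locked(path) as lock_path:
        with open(lock_path, encoding="utf-8") as f:
            data = json.load(f)
        data.setdefault(doc_type, []).extend(words)
        with open(lock_path, "w", encoding="utf-8") as f:
            json.dump(data, f)
```

Because only one worker can succeed in renaming the original file, the others keep retrying until the file reappears under its original name.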
For more information on how to train a Classifier, check Document Classification Training.