Document Understanding User Guide

Last updated Feb 4, 2025

Keyword Based Classifier

What Is Keyword Based Classifier

The Keyword Based Classifier is a simple classifier that searches for repeating string sequences within a given file, in order to perform document classification.

The algorithm is built around the concept of document titles. It starts from the premise that, for document types that have titles, those titles usually vary little from one document to the next.

When classifying a file into a document type, the Keyword Based Classifier:

finds the best matching string or string collection, from its learning data, that applies to a taxonomy document type. Confidence is computed based on:
- how close the match is to the beginning of the document,
- how many times the match has been confirmed by knowledge workers and reinforced in the learning data.
reports the highest-scoring document type, together with the underlying matching configuration.

The Keyword Based Classifier can work with a single string entry (one string that is considered as one entry in the learning data the Classifier is using), or with an entry containing multiple strings (two or more strings that form a single entry). In the case of multiple strings, the Classifier applies the matching algorithm to each string individually and then computes a simple average of the confidences of the identified matches.

Example

Let's take the example below:

if an entry contains a single string, for instance, "this is my match", then the Keyword Based Classifier searches and rates this string as a potential document type match (according to which document type the string is attributed to).
if an entry contains three strings, for instance, ["this is a match", "needs more evidence for filtering", "yet another one"], then the Keyword Based Classifier searches and rates each one of the three strings, and then computes a simple average of the matching confidences for reporting.
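
The two cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the classifier's actual implementation. The scoring function `toy_match_confidence`, the learning-data layout passed to `classify`, and the choice to count an unmatched string as 0.0 are all hypothetical. The confirmation-count factor is left out because the guide does not describe how it is weighted.

```python
from statistics import mean


def toy_match_confidence(text: str, keyword: str) -> float:
    """Hypothetical position-based score, not the product's formula.

    Returns 1.0 when the keyword starts the text, decreases linearly as the
    match moves toward the end, and returns 0.0 when there is no match.
    """
    pos = text.find(keyword)
    if pos < 0:
        return 0.0
    return 1.0 - pos / max(len(text), 1)


def entry_confidence(text: str, entry: list[str]) -> float:
    """Score one entry: a simple average over the strings it contains.

    Assumption: a string that is not found contributes 0.0 to the average.
    """
    return mean(toy_match_confidence(text, keyword) for keyword in entry)


def classify(text: str, learning: dict[str, list[list[str]]]) -> tuple[str, float]:
    """Return the highest-scoring document type and its confidence.

    `learning` maps a document type to its entries (hypothetical layout);
    each entry is a list of one or more strings.
    """
    scores = {
        doc_type: max(entry_confidence(text, entry) for entry in entries)
        for doc_type, entries in learning.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

With this sketch, a single-string entry is scored directly, while a three-string entry such as ["this is a match", "needs more evidence for filtering", "yet another one"] receives the mean of its three per-string scores.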

When to Use

You should consider using this classifier if:

your files contain one and only one document type each (so no file splitting is required);
your files contain evidence related to the document type in the first three pages of the file.

Special Requirements

There are no special requirements for using the Keyword Based Classifier.

How to Configure at Design-Time

You can configure the Keyword Based Classifier at design-time through the Manage Learning wizard of the activity. You can also use the wizard to review data collected during the document classification training phase, by opening it with an updated learning file path.

This wizard allows you to configure and manage the keywords that the activity uses to identify the document type. It can only edit a learning file referenced by a file path. If the Learning Data parameter is set with a variable instead, you are asked whether you want to edit a specific file path or abort the operation.

Note: The Manage Keyword Based Classifier Learning wizard can be used only for editing and configuring a file path.

Add a Keyword Based Classifier or Keyword Based Classifier Trainer activity to your workflow.
Configure the activity by adding the path of a .json file.
- If no path is provided and the Manage Learning option is clicked, then a popup is displayed, asking for a Learning File Path input. Once the path is provided, the wizard opens.
- You can provide a variable instead of a .json file path. However, because the wizard cannot apply the learning pattern to a LearningData variable, it asks for a specific file path that it can edit.
Click on the Manage Learning option.
- The Wizard window opens.

Note: If no .json file exists yet, you can enter the name of a new .json file directly in the activity. The file is then created automatically in the specified folder.

The wizard displays one category for each document type defined in your taxonomy. You can add single or multiple keywords for each document type. The activity stores these keywords per document type and later uses them as rules to classify a document as that type.

Enter every entry as a string enclosed in double quotes (""). Each entry can contain a single value or multiple values.

Clicking the Add new keyword set button adds an extra field to that category.
Clicking the remove button next to a field removes that field and its keywords.
Click the Save button to save your wizard configuration. You can find all the added values in the .json file of the project.

Note: Double quotes entered as part of a keyword in the Manage Keywords wizard are always escaped according to the Visual Basic convention (double double quotes), even in a C# flavored project.
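
To illustrate the Visual Basic convention mentioned in the note, the short Python sketch below doubles every double quote inside a keyword and reverses the operation. It only demonstrates the escaping rule itself. The helper names `escape_vb_quotes` and `unescape_vb_quotes` are hypothetical and are not part of the wizard or of any UiPath API.

```python
def escape_vb_quotes(keyword: str) -> str:
    """Escape a keyword the Visual Basic way: each " becomes ""."""
    return keyword.replace('"', '""')


def unescape_vb_quotes(escaped: str) -> str:
    """Reverse the escaping: each "" becomes a single "."""
    return escaped.replace('""', '"')
```

For example, a keyword that reads Form "W-2" would appear as Form ""W-2"" once escaped, regardless of whether the project uses Visual Basic or C#.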