Document Understanding User Guide
Keyword Based Classifier
The Keyword Based Classifier is a simple classifier that searches for repeating string sequences within a given file in order to perform document classification.
The algorithm is built around the concept of document titles and starts from the premise that document types with titles usually show relatively little variation in how those titles appear in documents.
When classifying a file into a document type, the Keyword Based Classifier:
- finds the best matching string or string collection, from its learning data, that applies to a taxonomy document type. Confidence is computed based on:
  - how close the match is to the beginning of the document,
  - how many times the match has been confirmed by knowledge workers and reinforced in the learning data.
- reports the highest-scoring document type, together with the underlying matching configuration.
The Keyword Based Classifier can work with a single string entry (one string that is considered as one entry in the learning data the Classifier is using), or with an entry containing multiple strings (two or more strings that form a single entry). In the case of multiple strings, the Classifier applies the matching algorithm to each string individually and then computes a simple average of the confidences of the identified matches.
Let's take the example below:
- if an entry contains a single string, for instance, "this is my match", then the Keyword Based Classifier searches and rates this string as a potential document type match (according to which document type the string is attributed to).
- if an entry contains three strings, for instance, ["this is a match", "needs more evidence for filtering", "yet another one"], then the Keyword Based Classifier searches and rates each one of the three strings, and then computes a simple average of the matching confidences for reporting.
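The averaging behavior described above can be sketched in Python. This is a minimal illustration, not the classifier's actual implementation: the function names, the `learning_data` layout, and the linear position-based score are all assumptions made for the example. The sketch ignores the confirmation count from the learning data, and it assumes that a string that is not found contributes a confidence of 0.0 to the average.

```python
from statistics import mean


def match_confidence(text: str, keyword: str) -> float:
    """Hypothetical score: 1.0 for a match at the very start of the text,
    decreasing linearly with the match position; 0.0 if the keyword is absent."""
    position = text.lower().find(keyword.lower())
    if position < 0:
        return 0.0
    return 1.0 - position / max(len(text), 1)


def entry_confidence(text: str, entry: list[str]) -> float:
    """Score each string of an entry individually, then take a simple average.
    Assumption: strings that are not found count as 0.0."""
    return mean(match_confidence(text, keyword) for keyword in entry)


def classify(text: str, learning_data: dict[str, list[list[str]]]):
    """Return (document_type, confidence) for the highest-scoring entry,
    or (None, 0.0) when nothing matches."""
    best_type, best_score = None, 0.0
    for document_type, entries in learning_data.items():
        for entry in entries:
            score = entry_confidence(text, entry)
            if score > best_score:
                best_type, best_score = document_type, score
    return best_type, best_score


# Hypothetical learning data: one single-string entry and one three-string entry.
learning_data = {
    "TypeA": [["this is my match"]],
    "TypeB": [["this is a match", "needs more evidence for filtering", "yet another one"]],
}
result = classify("This is my match, followed by the rest of the file.", learning_data)
```

In this sketch a single-string entry is simply a list with one element, so both kinds of entries go through the same averaging path.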
You should consider using this classifier if:
- your files contain one and only one document type each (so no file splitting is required);
- your files contain evidence related to the document type in the first three pages of the file.
You can configure the Keyword Based Classifier at design time by opening the Manage Learning wizard of the activity. To review the data collected during the document classification training phase, open the same wizard with an updated learning file path.
This wizard allows you to configure and manage the keywords this activity uses to identify the document type. It is designed to edit a learning file specified as a file path. If the Learning Data parameter is used with a variable instead, you are asked whether you want to edit a specific file path or abort the operation.
The wizard has as many document type categories as you defined in your taxonomy. You can add single or multiple keywords for each document type. The activity learns the keywords of a specific document and can later identify and classify a document as a specific type based on these rules.
- Keywords are entered between "" (quotes), and you can add single or multiple values.
- Clicking the Add new keyword set button adds an extra field to that category.
- Clicking the button removes the field and its keywords.
- Click the Save button to save your wizard configuration. You can find all the added values in the .json file of the project.
Note: Double quotes entered as part of a keyword in the Manage Keywords wizard are always escaped according to the Visual Basic convention (each double quote is doubled), even in a C# flavored project.
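The Visual Basic escaping convention mentioned in the note can be illustrated with a short Python sketch. The helper names are hypothetical, and the example only demonstrates the doubling convention itself. It does not reproduce how the wizard stores keywords in the .json file.

```python
def escape_vb_quotes(keyword: str) -> str:
    """Escape double quotes the Visual Basic way: each " becomes ""."""
    return keyword.replace('"', '""')


def unescape_vb_quotes(escaped: str) -> str:
    """Reverse the Visual Basic escaping: each "" becomes a single "."""
    return escaped.replace('""', '"')


keyword = 'the "final" notice'
escaped = escape_vb_quotes(keyword)  # the ""final"" notice
assert unescape_vb_quotes(escaped) == keyword
```

If you read or edit the learning file by hand in a C# project, keep in mind that quotes inside keywords follow this doubling convention rather than the backslash escaping used in C# string literals.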
Place the Keyword Based Classifier Trainer activity in a Train Classifiers Scope, and configure it accordingly.
For more information, check Document Classification Training.