The Keyword Based Classifier is a simple classifier that searches for repeating string sequences within a given file, in order to perform document classification.
The algorithm is built around the concept of document titles and starts from the premise that document types with titles usually show relatively little variation in how those titles appear in documents.
When classifying a file into a document type, the Keyword Based Classifier:
- finds the best matching string or string collection, from its learning data, that applies to a taxonomy document type. Confidence is computed based on:
  - how close the match is to the beginning of the document,
  - how many times the match has been confirmed by knowledge workers and reinforced in the learning data.
- reports on the highest scoring document type, with the underlying matching configuration.
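The two confidence factors and the "highest score wins" step can be sketched as follows. This is only an illustration: the actual formula used by the Keyword Based Classifier is not documented here, so the `Entry` type, the linear position decay, and the saturating reinforcement term are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    """Hypothetical learning-data entry (not the real file format)."""
    doc_type: str       # taxonomy document type the keyword is attributed to
    keyword: str        # string to search for
    confirmations: int  # times knowledge workers confirmed this match


def score(entry: Entry, text: str) -> float:
    """Toy confidence in [0, 1]; 0.0 when the keyword is absent."""
    pos = text.lower().find(entry.keyword.lower())
    if pos < 0:
        return 0.0
    # Assumption: matches closer to the start of the document score higher.
    position_score = 1.0 - pos / max(len(text), 1)
    # Assumption: more confirmations raise the score, saturating towards 1.
    reinforcement = entry.confirmations / (entry.confirmations + 1)
    return position_score * (0.5 + 0.5 * reinforcement)


def classify(entries: List[Entry], text: str) -> Tuple[Optional[str], float]:
    """Return the highest-scoring document type and its confidence."""
    best = max(entries, key=lambda e: score(e, text), default=None)
    if best is None:
        return None, 0.0
    confidence = score(best, text)
    if confidence == 0.0:
        return None, 0.0
    return best.doc_type, confidence
```

With these assumed weights, a keyword found at the very start of the document and confirmed several times outranks an unconfirmed keyword found later in the text.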
The Keyword Based Classifier can work with a single-string entry (one string that is considered as one entry in the learning data the Classifier is using), or with an entry containing multiple strings (two or more strings that form a single entry). In the case of multiple strings, the Classifier applies the matching algorithm to each string individually and then computes a simple average of the confidences of the identified matches.
Let's take the example below:
- if an entry contains a single string, for instance, "this is my match", then the Keyword Based Classifier searches and rates this string as a potential document type match (according to which document type the string is attributed to).
- if an entry contains three strings, for instance, ["this is a match", "needs more evidence for filtering", "yet another one"], then the Keyword Based Classifier searches and rates each one of the three strings, and then computes a simple average of the matching confidences for reporting.
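The averaging step for a multi-string entry can be sketched as below. The per-string confidence function is a hypothetical stand-in (the real matching algorithm is not described here), and the sketch assumes that a string that is not found contributes 0.0 to the average.

```python
from statistics import mean
from typing import List


def match_confidence(keyword: str, text: str) -> float:
    """Hypothetical per-string confidence: 1.0 at the start, 0.0 if absent."""
    pos = text.lower().find(keyword.lower())
    if pos < 0:
        return 0.0
    return 1.0 - pos / len(text)


def entry_confidence(entry: List[str], text: str) -> float:
    """Rate each string of the entry on its own, then take a simple average."""
    if not entry:
        return 0.0
    return mean(match_confidence(keyword, text) for keyword in entry)
```

A single-string entry is simply the special case where the average is taken over one value.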
A keyword set can be defined on a single line or across multiple lines. Keywords on the same line are combined with AND: if x, y, and z are listed on one line, the search looks for x and y and z.
Separate lines are combined with OR: the search looks for the keywords on the first line, or the second line, or the third, and so on through all lines, and identifies the best matches. Matching more of the available keywords increases the confidence score.
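The line semantics described above (all keywords within a line, any line across lines) can be sketched as follows. The function names are hypothetical, and case-insensitive substring matching is an assumption made for the example.

```python
from typing import List


def line_matches(keywords: List[str], text: str) -> bool:
    """A line matches only if every one of its keywords occurs in the text."""
    lowered = text.lower()
    return bool(keywords) and all(k.lower() in lowered for k in keywords)


def matching_lines(keyword_lines: List[List[str]], text: str) -> List[List[str]]:
    """Lines are alternatives: return every line whose keywords all occur."""
    return [line for line in keyword_lines if line_matches(line, text)]


# Hypothetical keyword lines for one document type.
lines = [["invoice", "total"], ["receipt"], ["purchase order"]]
hits = matching_lines(lines, "Invoice #1 Total due: 10")
```

Here only the first line matches, because it is the only line whose keywords are all present in the text.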
You should consider using this classifier if:
- your files contain one and only one document type each (so no file splitting is required);
- your files contain evidence related to the document type in the first three pages of the file.
There are no special requirements for using the Keyword Based Classifier.
You can configure the Keyword Based Classifier at design time by opening the Manage Learning wizard of the activity. You can also use the wizard to review data collected during the document classification training phase, by opening it with the updated learning file path.
This wizard allows you to configure and manage the keywords used by this activity for identifying the document type. It is designed for editing the learning file at a given file path. If a Learning Data parameter with a variable is used instead, you are asked whether you want to edit a specific file path or abort the operation.
The Manage Keyword Based Classifier Learning wizard can be used only for editing and configuring a file path.
- Add a Keyword Based Classifier/Keyword Based Classifier Trainer activity to your workflow.
- Configure your Keyword Based Classifier activity by adding the path of a .json file.
- If no path is provided and the Manage Learning option is clicked, then a popup is displayed, asking for a Learning File Path input. Once the path is provided, the wizard opens.
- A variable can be added instead of a .json file, but, because the wizard cannot apply the learning pattern to a LearningData variable, it asks for a specific file path that can be edited.
- Click on the Manage Learning option.
- The Wizard window opens.
Even if no .json file is available, you can add the name of a new .json file straight into the activity, and the .json file is automatically created inside the specified folder.
The wizard has as many document type categories as you defined in your taxonomy. You can add single or multiple keywords for each document type. The activity learns the keywords of a specific document type and can later identify and classify documents into that type based on these rules.
All entries should be entered as strings, between "" (quotes), and you can add single or multiple values.
- Clicking the Add new keyword set button adds an extra field to that category.
- Clicking the remove button deletes the field and its keywords.
- Click the Save button to save your wizard configuration. You can find all the added values in the .json file of the project.
Double quotes entered as part of a keyword in the Manage Keywords wizard are always escaped according to the Visual Basic convention (double double quotes), even in a C# flavored project.
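The Visual Basic convention mentioned above doubles every double quote inside a string literal. The sketch below only illustrates that convention; the helper names are hypothetical, and it is not a description of how the wizard stores keywords internally.

```python
def vb_escape(keyword: str) -> str:
    """Wrap a keyword in quotes, doubling any embedded double quotes."""
    return '"' + keyword.replace('"', '""') + '"'


def vb_unescape(literal: str) -> str:
    """Reverse vb_escape: strip the outer quotes and collapse doubled quotes."""
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("expected a string wrapped in double quotes")
    return literal[1:-1].replace('""', '"')


escaped = vb_escape('say "hello"')
```

For the keyword `say "hello"`, the escaped form is `"say ""hello"""`, which is also what you would see in a C# project, since the wizard does not switch to backslash escaping.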
For more information, check Document Classification Training.