For the needed volumes of documents, see Pipelines.
For more details about how to assemble a high-quality dataset, see Training High Performing Models.
There are many situations where a field appears in multiple places in the same document or even on the same page. These should all be labelled, as long as they have the same meaning.
For instance, consider the total amount on a utility bill. It often appears at the top, within the line item list in the middle, and in a payslip at the bottom, which can be detached and sent in the mail with the check. In this situation, all three occurrences should be labelled. This is useful because, if there is an OCR error or the layout is different and one occurrence cannot be identified, the model can still identify the others.
What counts is the meaning of the value, not the value itself. For instance, on some invoices which carry no tax, the net amount and the total amount have the same value, but they are clearly different concepts. Consequently, do not label both as total amount. Label only the one whose meaning is the total amount.
You can have multiple users use the same instance to label at the same time, even on the same document.
If several users change the schema concurrently, the change goes through for one user, and the other user(s) see a warning message stating that the changes could not be performed. The other user(s) should immediately refresh their browser to see the changes.
When you import a dataset without checking the Make this an Evaluation set checkbox on the Import Data dialog box, then that dataset is used for training and you only need to focus on the labeling of the words (grey boxes) on the document.
If the text filled into the sidebar fields is occasionally incorrect, this is not a problem, as the ML model still learns. In some cases, you may need to adjust the configuration of the fields, for instance by checking the Multi-line checkbox. But in general, the main focus is on labeling the words on the page.
When you import a dataset and you check the Make this an Evaluation set checkbox on the Import Data dialog, then that dataset is ignored by Training Pipelines in AI Center and used only by Evaluation Pipelines.
It is important that the correct text is filled into the fields in the sidebar (or the top bar for Column fields). This takes much longer to verify for each field, but it is the only way you get a reliable metric of the accuracy of the ML model you are building.
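To illustrate why the sidebar text matters for evaluation sets, here is a minimal sketch of a field-level exact-match score. It is not the metric that AI Center computes; the function name `field_accuracy` and the sample field names and values are hypothetical. It only shows that an uncorrected value in the sidebar makes a correct prediction count as an error.

```python
def field_accuracy(labelled: dict, predicted: dict) -> float:
    """Fraction of labelled fields whose predicted text matches exactly.

    Hypothetical helper for illustration only, not an AI Center API.
    """
    if not labelled:
        return 0.0
    hits = sum(1 for name, text in labelled.items() if predicted.get(name) == text)
    return hits / len(labelled)


# Hypothetical model output for one document.
prediction = {"total_amount": "120.50", "invoice_no": "INV-7"}

# Sidebar text verified by the labeller.
correct_labels = {"total_amount": "120.50", "invoice_no": "INV-7"}

# Sidebar text left with an OCR-style error (letter O instead of zero).
sloppy_labels = {"total_amount": "120.5O", "invoice_no": "INV-7"}

print(field_accuracy(correct_labels, prediction))
print(field_accuracy(sloppy_labels, prediction))
```

With the unverified labels, the score drops even though the prediction is right, which is why each sidebar value in an evaluation set needs to be checked.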
Starting with the 2021.10 release, Document Manager supports labeling multi-page documents. Consequently, fields in the sidebar have a single value for the entire document. This closely reflects the behavior at run time in the RPA workflow and enables Evaluation Pipelines in AI Center to produce realistic scores reflecting the real run time performance of the ML models.
However, keep in mind that this is a major change from previous releases where each page was labelled separately. Labeling and exporting multi-page documents assumes each document represents a single logical document. For instance, a six-page document may contain a single six-page invoice but it should not contain three different invoices, two pages each. This is particularly important for evaluation sets.
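If a file bundles several logical documents, it needs to be split before import. The sketch below only computes the page ranges for such a split, assuming you already know the page on which each logical document starts. The function name `split_pages` is hypothetical, and the actual splitting of the file is left to whatever PDF tool you use.

```python
def split_pages(page_count: int, first_pages: list) -> list:
    """Return one range of 1-based page numbers per logical document.

    first_pages lists the page on which each logical document starts.
    Hypothetical helper for illustration only.
    """
    starts = sorted(set(first_pages))
    if not starts or starts[0] != 1:
        raise ValueError("the first logical document must start on page 1")
    if starts[-1] > page_count:
        raise ValueError("a start page lies beyond the end of the file")
    # Each document ends right before the next one starts.
    ends = starts[1:] + [page_count + 1]
    return [range(start, end) for start, end in zip(starts, ends)]


# A six-page file holding three two-page invoices becomes three documents.
parts = split_pages(6, [1, 3, 5])
print([list(p) for p in parts])

# A six-page file holding one six-page invoice stays a single document.
whole = split_pages(6, [1])
print([list(p) for p in whole])
```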
See below the main actions you need to perform when labeling documents. A given field may be labelled in multiple places on the same page.
Select an individual text box by clicking it.
To select multiple words, click the first word and then Shift+click the rest of the desired words or select an entire area by dragging the mouse (the rubber banding) over it.
To unselect certain text boxes from your selection, while Shift is pressed, click or rubber band the unwanted text boxes again.
When your selection is accurate, tap the shortcut key to label the field.
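The selection steps above behave like toggling membership in a set. The following is a minimal sketch of that idea, not Document Manager code: the function names and box identifiers are hypothetical, and it assumes that a plain click replaces the current selection.

```python
def click(box: str) -> frozenset:
    """Plain click: assumed to select only the clicked box."""
    return frozenset({box})


def shift_click(selection: frozenset, box: str) -> frozenset:
    """Shift+click: add the box if it is absent, remove it if it is present."""
    return selection ^ {box}


selection = click("Total")
selection = shift_click(selection, "amount")
selection = shift_click(selection, "due")
# Shift+clicking an already selected box removes it again.
selection = shift_click(selection, "due")
print(sorted(selection))
```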
Make sure that the multivalued option of the field is selected.
Select the first batch of information and tap the shortcut key to label the field.
Repeat the steps above until all the values are labelled for the multivalued field.
- Multivalued fields can be used only with Machine Learning Packages version 2022.10, or higher.
- A multivalued field displays two values in its collapsed state and all values in its expanded state. Click the expand arrow on the multivalued field to see the list of all tagged values.
Select text boxes, then press the Delete or the Backspace key on your keyboard.
After you have labelled some Column fields, and only if some rows span multiple lines of text, then you may group them together by pressing the / key to indicate that they are part of the same table row. A green box appears around the group.
When a labelled column field is grouped together, the table is parsed and displayed at the top, highlighting the extracted data.
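As a rough mental model of what grouping changes, the sketch below builds table rows from labelled boxes. It is not how Document Manager parses tables internally. The `LabelledBox` type, its fields, and the rule that ungrouped boxes on the same line of text form one row are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabelledBox:
    text: str
    column: str                  # Column field the box is labelled as
    line: int                    # line of text on the page, top to bottom
    group: Optional[int] = None  # set when boxes were grouped with the / key


def parse_rows(boxes: list) -> list:
    """Build one dict per table row, keyed by Column field name."""
    rows = {}
    # The stable sort keeps the original order of boxes within a line.
    for box in sorted(boxes, key=lambda b: b.line):
        # Grouped boxes share a row; otherwise the line decides (assumption).
        key = ("group", box.group) if box.group is not None else ("line", box.line)
        rows.setdefault(key, defaultdict(list))[box.column].append(box.text)
    return [{col: " ".join(words) for col, words in row.items()} for row in rows.values()]


boxes = [
    LabelledBox("Widget", "description", line=1, group=1),
    LabelledBox("2", "quantity", line=1, group=1),
    LabelledBox("deluxe", "description", line=2, group=1),  # second line of the same row
    LabelledBox("Bolt", "description", line=3),
    LabelledBox("10", "quantity", line=3),
]
table = parse_rows(boxes)
print(table)
```

Without the group, the word on line 2 would form a row of its own, which is the situation grouping is meant to prevent.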
Select the group and press the / key again.
Click on the text in the sidebar or the top bar and edit the content. A small lock appears to indicate the field has been manually edited. This is necessary when labeling evaluation sets.
Click on the lock, and the field reverts to its auto-extracted value.
Use the left or right mouse buttons to select a box or to find out more information about it.
- Left Click - selects the box
- Right Click - selects the box and displays information about the OCR text and current label.
- Alt + Arrow Left / Arrow Right - Navigates between documents.
- Ctrl + Scroll - Changes the document scaling by zooming in or out.
- Alt + Delete - Deletes a document.
- Alt + Delete - Recovers a deleted document.