Document Understanding User Guide
Dataset diagnostics
Training a new model from scratch can sometimes be a very demanding job.
The Dataset Diagnostics feature helps you build effective datasets by providing feedback and hints on the steps needed to achieve good accuracy for the trained model.
Located in the Management Bar of the Document Manager, Dataset Diagnostics provides visual and written guidance throughout the whole process of training a new model.
There are three dataset status levels exposed in the Management Bar:
- Red - More labelled training data is required.
- Orange - More labelled training data is recommended.
- Green - The needed level of labelled training data is achieved.
If no fields are created in the session, the dataset status level is grey.
More information on each status is available in the Dataset Diagnostics popup menu. Click the Dataset Diagnostics button to open it.
The Dataset tab provides information about the documents used for training the model: the total number of imported pages and the total number of labelled pages.
The divisions on the color status bar are determined by the recommended number of labelled pages needed for training the model and by the actual state of your dataset, including labelled and unlabelled data. Hovering over each color of the status bar displays a tooltip with extra information about that status.
The numbers available on the Dataset tab are calculated based on the number of regular fields and item fields from the training session.
- Red - The dataset requires more labelled data for training the model.
- Orange - More labelled data is recommended to increase the accuracy of the trained model. You can choose to proceed with the current data, but the accuracy may not be as high as desired.
- Green - There is enough labelled data to train the model and obtain accurate results.
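The status levels described above can be sketched as a simple threshold check. This is only an illustration: the function name and the `required_pages` and `recommended_pages` thresholds are hypothetical, and the actual values are computed by Dataset Diagnostics from the regular fields and item fields in the session.

```python
from enum import Enum


class DatasetStatus(Enum):
    GREY = "grey"      # no fields created in the session
    RED = "red"        # more labelled training data is required
    ORANGE = "orange"  # more labelled training data is recommended
    GREEN = "green"    # the needed level of labelled data is achieved


def dataset_status(field_count, labelled_pages, required_pages, recommended_pages):
    """Map a labelled-page count to a status level.

    required_pages and recommended_pages are assumed thresholds used
    for illustration; they are not values published by the product.
    """
    if field_count == 0:
        return DatasetStatus.GREY
    if labelled_pages < required_pages:
        return DatasetStatus.RED
    if labelled_pages < recommended_pages:
        return DatasetStatus.ORANGE
    return DatasetStatus.GREEN
```

With assumed thresholds of 10 required and 50 recommended pages, a session with fields and 30 labelled pages would fall in the orange range under this sketch.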
The Fields tab provides information about each labelled field: the total number of training pages on which the field is labelled, the total number of evaluation documents containing the labelled field, and its status for the current training set.
- Field - The name of the labelled field.
- Training Pages - The number of pages in the Training+Validation set on which the field is labelled.
- Evaluation Documents - The number of documents in the Evaluation set on which this field is labelled.
- Status - The status of each field, which is one of three options: Red, Orange, or Green.
Here are all the options available for the Status bar:
- Red - There is insufficient data about the field; more labels are required.
- Orange - More pages need to be labelled for the results to be relevant.
- Green - There are enough labelled pages for the results to be relevant.
The Refresh and Close buttons apply to both tabs, meaning that if the Refresh button is clicked on the Dataset tab, the Fields tab is also refreshed.
- Refresh - Use the refresh option after changes have been made to the dataset, whether to the total number of pages or to the number of labelled pages. The popup menu refreshes automatically every few minutes, on both tabs simultaneously. Use this button when a refresh is needed outside the automatic interval.
- Close - Once you have gathered all the needed information, close the menu by clicking the Close button. The entire popup menu is closed, regardless of the tab from which the button is clicked.
You can modify the following fields with the Dataset Calculator:
- Out-of-the-box document type
- Number of languages
- Number of layouts
The following fields from the Calculator tab are read-only. Their values are determined by intersecting the selected out-of-the-box document type with the current schema fields:
- Out-of-the-box regular fields
- Out-of-the-box column fields
- Out-of-the-box classification fields
Modifying any of the editable fields changes the recommended size of the dataset. The Dataset tab in the currently open popup is updated to a green, orange, or red status based on the new recommended size. Once the changes are saved, the overall Dataset Diagnostics indicator takes the new Dataset tab health into account.
Let's say that when you initially created the document type, you selected Invoices for the Out-of-the-box document type field. If you change your initial choice to something else, Receipts for example, then the dataset assimilates the information for both document types and displays the information common to both selected types (Invoices and Receipts).
Fields that are present in only one of the models show up under Custom regular fields or Custom column fields, because these changes apply to both regular and classification fields.
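The intersection described above can be illustrated with plain set operations. This is a minimal sketch of one reading of that behavior, not the product's implementation: the function name and all field names below are hypothetical, and it assumes that schema fields not matched by the out-of-the-box document type are the ones counted as custom.

```python
def split_schema_fields(oob_type_fields, schema_fields):
    """Separate schema fields into out-of-the-box and custom groups.

    Fields found in both the out-of-the-box document type and the
    current schema count as out-of-the-box; the remaining schema
    fields count as custom (an assumed interpretation).
    """
    oob = set(oob_type_fields)
    schema = set(schema_fields)
    return {
        "out_of_the_box": sorted(schema & oob),
        "custom": sorted(schema - oob),
    }


# Hypothetical field names, for illustration only.
invoice_fields = ["date", "invoice-no", "total", "vendor-name"]
schema = ["date", "total", "po-reference"]

result = split_schema_fields(invoice_fields, schema)
print(result)
```

In this sketch, only the fields shared by the document type and the schema feed the read-only out-of-the-box counts, while the schema-only field lands in the custom group.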