Document Understanding User Guide
About pipelines
Document Understanding™ ML Packages can run all three types of pipelines: Training, Evaluation, and Full.
Once completed, a pipeline run has associated outputs and logs. To see this information, click a pipeline in the Pipelines tab from the left sidebar to open the Pipeline view, which consists of:
- the Pipeline details such as type, ML Package name and version, dataset, GPU usage, parameters, and execution time
- the Outputs pane; this always includes a _results.json file containing a summary of the Pipeline details
- the Logs pane; the logs can also be obtained in the ML Logs tab from the left sidebar
The evaluation outputs include the following files:
- evaluation_scores_<package name>.txt - contains the accuracy scores for all fields
- evaluation_<package name>.xlsx - contains a detailed accuracy breakdown per field and per batch, as well as a side-by-side comparison for each field, with color highlighting for missed (red) or partially matched (yellow) fields
- evaluation_F1_scores.txt - contains the F1 scores for all fields
Partial matches using Levenshtein distance are the default scoring method on fields with Content Type: String. All other Content Types (Dates, Numbers, ID Numbers, Phone Numbers) only use Exact Match scoring.
For String fields you can change this setting in the Advanced tab of the Field Settings dialog in the Document Type view of Document Understanding.
For example, suppose an evaluation dataset has 100 documents, and a field, say Purchase Order Number, appears on half of them. If the model predicted 40 of those correctly and 10 partially correctly with a Levenshtein-based score of 0.8, the accuracy would be (40 + 10 x 0.8 + 50) / 100 = 98%.
Note that the 50 documents where the field is missing and the model did not predict anything are also counted as successful predictions.
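The arithmetic above can be sketched in code. This is an illustrative scoring function only, assuming a normalized Levenshtein similarity; the `levenshtein` and `field_score` helpers are hypothetical and not the product's actual implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def field_score(predicted, expected):
    """Per-document score: 1.0 for exact match, a normalized partial-match
    score otherwise. A missing field with no prediction counts as correct."""
    if expected is None and predicted is None:
        return 1.0
    if expected is None or predicted is None:
        return 0.0
    dist = levenshtein(predicted, expected)
    return 1.0 - dist / max(len(predicted), len(expected))

# Worked example from the text: 40 exact matches, 10 partial matches
# scoring 0.8 each, and 50 documents where the field is absent.
scores = [1.0] * 40 + [0.8] * 10 + [1.0] * 50
accuracy = sum(scores) / len(scores)
print(accuracy)  # 0.98
```

Note how counting the 50 absent-field documents as correct predictions is what lifts the score to 98% rather than 48%.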
On Training pipelines, the scores are calculated on the Validation dataset. The Validation dataset is a randomly selected subset of 20% of the total training dataset submitted in the Training Pipeline.
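A minimal illustration of such an 80/20 hold-out split, assuming a flat list of labelled pages (the actual sampling performed by the product is not specified here):

```python
import random

def split_train_validation(pages, validation_fraction=0.2, seed=42):
    """Randomly hold out a fraction of the pages as a validation set."""
    shuffled = pages[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]

# 100 labelled pages -> 80 for training, 20 for validation scoring.
train, val = split_train_validation(list(range(100)))
print(len(train), len(val))  # 80 20
```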
Training pipelines or Full pipelines can also be used to:
- Fine-tune ML models with data from Validation Station
- Auto-Fine-tune an ML model
Training Pipelines and Full Pipelines support training sets of up to 18,000 labelled pages.