UiPath Document Understanding
These are out-of-the-box Machine Learning models that classify and extract commonly occurring data points from semi-structured or unstructured documents, including regular fields, table columns and classification fields, using a template-less approach.
Document Understanding contains multiple ML Packages split into 4 main categories:
UiPath Document OCR
This is a non-retrainable model that can be used with the UiPath Document OCR Engine activity as part of the Digitize Document activity. Before it can be used, it must be made public so that its URL can be pasted into the UiPath Document OCR Engine activity.
Document Understanding
This is a generic, retrainable model for extracting commonly occurring data points from any type of structured or semi-structured document by building a model from scratch. This ML Package must be trained. If it is deployed without being trained first, the deployment fails with an error stating that the model is not trained.
Out of the box Pre-Trained ML Models
These are retrainable ML Packages that hold the knowledge of different Machine Learning Models.
These ML Packages can be customized to extract additional fields or to support additional languages using Pipeline runs. Using state-of-the-art transfer learning capabilities, these models can be retrained on additional labelled documents and tailored to a specific use case, or expanded to support additional Latin, Cyrillic or Greek languages.
The dataset used may have the same fields, a subset of the fields, or have additional fields. In order to benefit from the intelligence already contained in the pre-trained model you need to use fields with the same names as in the OOB model itself.
These ML Packages are:
Invoices: The fields extracted out-of-the-box can be found here
Receipts: The fields extracted out-of-the-box can be found here
Purchase Orders (Preview): The fields extracted out-of-the-box can be found here
Utility Bills (Preview): The fields extracted out-of-the-box can be found here
Invoices India (Preview): The fields extracted out-of-the-box can be found here
Invoices Australia (Preview): The fields extracted out-of-the-box can be found here
Invoices Japan (Preview): The fields extracted out-of-the-box can be found here
These models are deep learning architectures built by UiPath. A GPU can be used at both serving time and training time, but it is not mandatory. A GPU delivers a more than 10x improvement in speed, for training in particular.
Other Out of the Box Packages
These are non-retrainable Packages that are required for non-ML components of the Document Understanding suite.
These ML Packages are:
Form Extractor (FE): Deploy as a Public Skill and paste the URL into the Form Extractor activity.
Intelligent Form Extractor (IFE): Deploy as a Public Skill and paste the URL into the Intelligent Form Extractor activity. Make sure to first deploy the Handwriting OCR Skill and configure it as the OCR for the IFE package.
Intelligent Keyword Classifier (IKC): Deploy as a Public Skill and paste the URL into the Intelligent Keyword Classifier activity.
Handwriting OCR: Deploy as a Public Skill and use it as the OCR when creating the IFE Package.
Input: a file (accepted formats: pdf, png, bmp, jpeg, tiff).
The file needs to be digitized first, and the input is provided through the Data Extraction Scope activity. Details: https://docs.uipath.com/activities/docs/data-extraction-scope.
Output: a JSON file with all the fields extracted by the Machine Learning model.
The output is configured in the Data Extraction Scope activity and stored in the ExtractionResults variable. This result can be converted to a DataSet type using the Export Extraction Results activity. Details: https://docs.uipath.com/activities/docs/export-extraction-results.
Document Understanding ML Packages can run all three types of pipelines (Full pipeline, Training and Evaluation).
For most use cases, no parameters need to be specified; the training process uses advanced techniques to find a performant model.
You can find information about a Pipeline in two places: in the Details view, accessible from the contextual dropdown menu on the right side of the Pipelines table, or in the ML Logs tab in the left sidebar. The Details view contains an Outputs pane and a Logs page. The Outputs pane always contains a "_results.json" file with a summary of the Pipeline details, such as Package version, dataset, GPU usage and execution time.
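As a minimal sketch, a downloaded "_results.json" file could be inspected as shown below. The sketch assumes the file has been saved to a local folder and that its top level is a JSON object; the function names are hypothetical, and no particular key names are assumed because the exact keys are not listed here.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_results_summary(outputs_dir: str) -> Dict[str, Any]:
    """Load _results.json from a local copy of the Outputs pane (assumed location)."""
    path = Path(outputs_dir) / "_results.json"
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def format_summary(summary: Dict[str, Any]) -> str:
    """Render every top-level entry as 'key: value', sorted by key."""
    return "\n".join(f"{key}: {summary[key]}" for key in sorted(summary))
```

Calling format_summary(load_results_summary("some/local/folder")) prints whatever entries the file actually contains, which avoids relying on undocumented key names.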
There are two kinds of training pipelines:
- On an ML Package of type Document Understanding
Training with a "Document Understanding" package trains a model from scratch on the dataset provided as input. For use cases with low-diversity documents (forms), you might get good results with as few as 30-50 samples. For use cases with diverse documents where you only need regular ("header") fields, you need at least 20-50 samples per field, so extracting 10 regular fields requires at least 200-500 samples. When you need to extract column fields (for example, line items), you need 50-200 samples per column field, so for 5 column fields with clean and simple layouts you might get good results with 300-400 samples, while highly complex and diverse layouts might require up to 1000. If you also need to cover multiple languages, you need at least 200-300 samples per language. These numbers do not need to add up, except for the languages. So for 10 header fields and 5 column fields, 500 samples might be enough, but some cases might require over 1000.
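The guidance above can be turned into a rough planning estimate. The following sketch is only an illustration of the arithmetic: the function name is hypothetical, it takes the larger of the header and column requirements because the numbers do not add up, and it assumes (as one reading of the text) that the per-language floor applies only when more than one language is needed and scales with the number of languages.

```python
from typing import Tuple


def estimate_training_samples(
    header_fields: int, column_fields: int, languages: int = 1
) -> Tuple[int, int]:
    """Return a rough (low, high) range of labelled samples for training from scratch."""
    # Per-field ranges quoted in the text: 20-50 per header field, 50-200 per column field.
    header_low, header_high = 20 * header_fields, 50 * header_fields
    column_low, column_high = 50 * column_fields, 200 * column_fields

    # The per-field numbers do not add up, so take the larger requirement.
    low = max(header_low, column_low)
    high = max(header_high, column_high)

    # Assumption: the 200-300 samples per language only matter for multilingual datasets.
    if languages > 1:
        low = max(low, 200 * languages)
        high = max(high, 300 * languages)
    return low, high


print(estimate_training_samples(10, 0))  # (200, 500)
print(estimate_training_samples(10, 5))  # (250, 1000)
```

The second call brackets the figures given above for 10 header fields and 5 column fields. Real requirements depend on layout diversity, so the range is a starting point rather than a guarantee.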
- On an ML Package of a different type, such as Invoices, Receipts, Purchase Orders, Utility Bills, Invoices India or Invoices Australia.
Training using one of these packages has one additional input: a base model. We also refer to this as retraining because you are not starting from scratch but from a base model. This approach uses a technique called Transfer Learning, where the model takes advantage of the information encoded in another, preexisting model. When you are training on the same fields only to optimize accuracy, you might get good results with only 100-500 additional documents. If you are adding new fields to the model, you need 30-50 documents per new field to get good results. When choosing which base model version to use, we strongly suggest you always use 1.0, the pre-trained version provided by UiPath out-of-the-box. Note that classification fields are not retrained, so when you retrain a model, make sure that the dataset you label has at least 10-20 samples from each class you want the model to recognize, regardless of the performance of the pre-trained model you are using as a base model.
The September 2020 release of AI Fabric includes the capability of fine-tuning ML models using data which has been validated by a human using Validation Station. As your RPA workflow processes documents using an existing ML model, some documents may require human validation using the Validation Station activity (available on attended bots or in the browser using Orchestrator Action Center). The validated data generated in Validation Station can be exported using the Machine Learning Extractor Trainer activity, and can be used to fine-tune ML models in AI Fabric. We recommend using data from Validation Station only to fine-tune existing ML models (including Out-of-the-Box ML models), not to train ML models from scratch (i.e. using the DocumentUnderstanding ML Package).
For the detailed steps involved in fine-tuning an ML model, see the Validation Station Dataset Import section of the Data Manager documentation.
Minimal dataset size
To run Training or Full pipelines successfully, we strongly recommend at least 25 documents and at least 10 samples of each labelled field in your dataset. Otherwise, the pipeline will show the error "Dataset Creation Failed".
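A pre-flight check along these lines can catch an undersized dataset before a pipeline is started. This is a minimal sketch: the function name is hypothetical, and the document count and per-field sample counts are assumed to be supplied by the caller, since the way to derive them from the labelled data is not described here.

```python
from typing import Dict, List

MIN_DOCUMENTS = 25          # recommended minimum number of documents
MIN_SAMPLES_PER_FIELD = 10  # recommended minimum samples per labelled field


def check_minimal_dataset(document_count: int, samples_per_field: Dict[str, int]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the minimums are met."""
    problems = []
    if document_count < MIN_DOCUMENTS:
        problems.append(
            f"only {document_count} documents, at least {MIN_DOCUMENTS} recommended"
        )
    # Report under-sampled fields in alphabetical order for stable output.
    for field, count in sorted(samples_per_field.items()):
        if count < MIN_SAMPLES_PER_FIELD:
            problems.append(
                f"field '{field}' has {count} samples, at least {MIN_SAMPLES_PER_FIELD} recommended"
            )
    return problems
```

For example, check_minimal_dataset(20, {"total": 5, "date": 15}) reports two problems: too few documents and too few samples for the hypothetical "total" field.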
Retraining on top of previously trained models
As more data gets labelled, either using Data Manager or coming from Validation Station, best results are obtained by maintaining a single dataset and adding more data to it, and always retraining on the base model provided by UiPath, with minor version 0. It is strongly recommended to avoid retraining using a base model which you trained yourself previously (minor version 1 or higher).
Using a GPU (AI Robot Pro) for training is at least 10 times faster than using a CPU (AI Robot). Moreover, training on CPU is supported for datasets up to 500 images in size only. For larger datasets you will need to train using GPU.
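The 500-image limit can be checked before choosing the hardware for a training run. The sketch below assumes the exported dataset is available locally and that every file in its images folder is one labelled page image; the function names are hypothetical.

```python
from pathlib import Path

CPU_MAX_IMAGES = 500  # CPU training is supported only up to this dataset size


def count_images(dataset_dir: str) -> int:
    """Count the files in the dataset's 'images' folder (assumed: one file per page image)."""
    images_dir = Path(dataset_dir) / "images"
    return sum(1 for entry in images_dir.iterdir() if entry.is_file())


def recommended_hardware(image_count: int) -> str:
    """Return 'GPU' when the dataset exceeds the CPU limit, otherwise 'CPU or GPU'."""
    return "GPU" if image_count > CPU_MAX_IMAGES else "CPU or GPU"
```

Even below the limit a GPU remains the faster option, so "CPU or GPU" only indicates that both are supported.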
A folder containing the exported dataset coming from Data Manager. This includes:
- images : folder containing images of all the labelled pages;
- latest : folder containing json files with the labelled data from each page;
- schema.json : file containing the fields to be extracted and their types;
- split.csv : file containing the split for each document, which is used either for TRAIN or for VALIDATE during the Training Pipeline.
- ml_model.epochs: customize number of epochs for Training or Full Pipeline (default 150)
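The exported dataset layout described above (images, latest, schema.json and split.csv) can be verified with a small check before the folder is used in a pipeline. This is a sketch under the assumption that the export has been downloaded to a local directory; the function names are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List

# Entries the exported dataset folder is expected to contain, per the list above.
EXPECTED_ENTRIES = ("images", "latest", "schema.json", "split.csv")


def missing_entries(present: Iterable[str]) -> List[str]:
    """Return the expected entry names that are absent from 'present', in a fixed order."""
    present_set = set(present)
    return [name for name in EXPECTED_ENTRIES if name not in present_set]


def check_export_folder(dataset_dir: str) -> List[str]:
    """List the top level of a local export folder and report the missing entries."""
    return missing_entries(entry.name for entry in Path(dataset_dir).iterdir())
```

An empty result from check_export_folder means all four expected entries are present; it does not validate their contents.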
When the pipeline is a Full or Evaluation pipeline, the Outputs pane also contains an "artifacts" folder with two files:
- evaluation_metrics.txt contains the F1 scores of the fields which were predicted. Note that for line items only a global score is obtained for all columns taken together.
- evaluation.xlsx is an Excel spreadsheet with a side-by-side comparison of ground truth versus predicted value for each field predicted by the model, as well as a per-document accuracy metric, in order of increasing accuracy. Hence, the most inaccurate documents are presented at the top to facilitate diagnosis and troubleshooting.
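For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below shows the standard definition only; it is not taken from the pipeline itself, and how the pipeline counts correct and incorrect predictions per field is not described here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard F1: harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A field predicted with perfect precision but only half of the values found:
print(round(f1_score(1.0, 0.5), 4))  # 0.6667
```

Because it is a harmonic mean, F1 stays low whenever either precision or recall is low, which makes it a stricter summary than a simple average.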