Document Understanding User Guide
Training and Evaluation Pipelines
Document Understanding ML Packages can run all three types of pipelines (Full pipeline, Training, and Evaluation).
For most use cases, no parameters need to be specified; the model uses advanced techniques to find a performant model.
The pipeline outputs a _results.json file containing a summary of the pipeline details, such as package version, dataset, GPU usage, and execution time.
There are two kinds of training pipelines:
- On an ML Package of type Document Understanding
- On an ML Package of a different type, such as Invoices, Receipts, Purchase Orders, Utility Bills, Invoices India, or Invoices Australia.
Training using a “Document Understanding” Package just trains a model from scratch on the dataset provided as input.
For use cases with low-diversity documents (forms), you might get good results with as few as 30-50 samples.
For use cases with diverse documents where you only need regular ("header") fields, you need at least 20-50 samples per field. For example, to extract 10 regular fields, you would need at least 200-500 samples.
When you need to extract column fields (for example, line items), you need 50-200 samples per column field. For 5 column fields with clean and simple layouts, you might get good results with 300-400 samples, but highly complex and diverse layouts might require up to 1000.
If you also need to cover multiple languages, you need at least 200-300 samples per language. These numbers do not need to add up, except for the languages. So for 10 header fields and 5 column fields, 500 samples might be enough, but some cases might require over 1000.
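The arithmetic behind these guidelines can be sketched as a small helper. This is only one possible reading of the text: it assumes the header-field and column-field requirements overlap (so the larger one wins), while the per-language requirement scales with the number of languages. The function name and constants are hypothetical and are not part of any UiPath API.

```python
from typing import Tuple

# Per-item sample ranges quoted in the guidelines (low, high).
REGULAR_FIELD_RANGE = (20, 50)    # samples per regular ("header") field
COLUMN_FIELD_RANGE = (50, 200)    # samples per column field
LANGUAGE_RANGE = (200, 300)       # samples per language


def estimate_sample_range(regular_fields: int,
                          column_fields: int,
                          languages: int = 1) -> Tuple[int, int]:
    """Return a rough (low, high) estimate of the dataset size.

    Assumption: field requirements do not add up, so the larger of the
    header and column requirements is used. The language requirement is
    multiplied by the number of languages and acts as a floor.
    """
    candidates = [
        (regular_fields * REGULAR_FIELD_RANGE[0], regular_fields * REGULAR_FIELD_RANGE[1]),
        (column_fields * COLUMN_FIELD_RANGE[0], column_fields * COLUMN_FIELD_RANGE[1]),
        (languages * LANGUAGE_RANGE[0], languages * LANGUAGE_RANGE[1]),
    ]
    low = max(low for low, _ in candidates)
    high = max(high for _, high in candidates)
    return low, high
```

Under these assumptions, `estimate_sample_range(10, 5)` gives a range of 250 to 1000 samples, which is in line with the "500 might be enough, but some cases might require over 1000" guidance. Treat the result as a starting point, not a guarantee.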
Training using one of the packages of the second kind listed above requires one additional input: a base model. We also refer to this as retraining because you are not starting from scratch but from a base model. This approach uses a technique called Transfer Learning, where the model takes advantage of the information encoded in another, preexisting model.
When you are training on the same fields only to optimize accuracy, you might get good results with only 100-500 additional documents. If you are adding new fields to the model, you need 30-50 documents per new field to get good results.
When choosing which Base model version to use, we strongly suggest you always use 1.0, the pre-trained version provided by UiPath out-of-the-box.
Classification fields are not retrained, so you need to make sure, when you retrain a model, that the dataset you label has at least 10-20 samples from each class you want the model to be able to recognize, regardless of the performance of the Pre-trained model you are using as a base model.
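A quick way to check the per-class requirement before retraining is to count labelled samples per class. The sketch below assumes you have already extracted the class label of each labelled document into a plain list; how you read those labels from the exported dataset is not covered here, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

# Lower bound of the 10-20 samples per class mentioned above.
MIN_SAMPLES_PER_CLASS = 10


def underrepresented_classes(labels: Iterable[str],
                             expected_classes: Iterable[str],
                             minimum: int = MIN_SAMPLES_PER_CLASS) -> Dict[str, int]:
    """Return the expected classes that have fewer than `minimum` samples.

    Classes that never occur in `labels` are reported with a count of 0.
    """
    counts = Counter(labels)
    return {name: counts[name] for name in expected_classes if counts[name] < minimum}
```

For example, with 12 documents labelled `USD` and 3 labelled `EUR`, and `GBP` expected but absent, the function reports `EUR` and `GBP` as needing more samples.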
The September 2020 release of AI Fabric includes the capability of fine-tuning ML models using data that has been validated by a human using Validation Station.
As your RPA workflow processes documents using an existing ML model, some documents may require human validation using the Validation Station activity (available on attended bots or in the browser using Orchestrator Action Center).
The validated data generated in Validation Station can be exported using the Machine Learning Extractor Trainer activity and can be used to fine-tune ML models in AI Fabric.
We do not recommend using data from Validation Station to train ML models from scratch (that is, using the DocumentUnderstanding ML Package). Use this data only to fine-tune existing ML models, including Out-of-the-Box ML models.
For the detailed steps involved in fine-tuning an ML model, see the Validation Station Dataset Import section of the Data Manager documentation.
Using a GPU (AI Robot Pro) for training is at least 10 times faster than using a CPU (AI Robot). Training Document Understanding models on GPU requires a GPU with at least 11 GB of video RAM.
The GPU must support NVIDIA drivers version 418.0 or later and CUDA drivers version 9.0 or later.
Training on CPU is supported only for datasets of up to 500 images. For larger datasets, you need to train using a GPU.
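The hardware constraints above can be collected into a simple pre-flight check. This is a minimal sketch: the function and constant names are hypothetical, and the values for video RAM, driver version, and CUDA version are assumed to be supplied by you (for example, read from your own inventory or from the output of a tool such as nvidia-smi).

```python
from typing import List, Optional, Tuple

# Thresholds taken from the requirements stated above.
MIN_GPU_VRAM_GB = 11
MIN_NVIDIA_DRIVER = (418, 0)
MIN_CUDA = (9, 0)
MAX_CPU_DATASET_IMAGES = 500


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '418.0' or '450.80.02' into a comparable tuple of integers."""
    parts = [int(part) for part in text.split(".")]
    while len(parts) < 2:  # pad '418' to (418, 0) so tuple comparison is fair
        parts.append(0)
    return tuple(parts)


def check_training_setup(num_images: int,
                         use_gpu: bool,
                         vram_gb: Optional[float] = None,
                         driver: Optional[str] = None,
                         cuda: Optional[str] = None) -> List[str]:
    """Return a list of problems; an empty list means no issue was found.

    Values left as None are simply not checked.
    """
    problems: List[str] = []
    if not use_gpu:
        if num_images > MAX_CPU_DATASET_IMAGES:
            problems.append(
                f"CPU training supports up to {MAX_CPU_DATASET_IMAGES} images, got {num_images}")
        return problems
    if vram_gb is not None and vram_gb < MIN_GPU_VRAM_GB:
        problems.append(f"GPU has {vram_gb} GB of video RAM, need at least {MIN_GPU_VRAM_GB} GB")
    if driver is not None and parse_version(driver) < MIN_NVIDIA_DRIVER:
        problems.append(f"NVIDIA driver {driver} is older than 418.0")
    if cuda is not None and parse_version(cuda) < MIN_CUDA:
        problems.append(f"CUDA {cuda} is older than 9.0")
    return problems
```

For example, `check_training_setup(600, use_gpu=False)` reports that the dataset is too large for CPU training, while `check_training_setup(2000, use_gpu=True, vram_gb=11, driver="418.0", cuda="9.0")` returns an empty list.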
A folder containing the exported dataset coming from Data Manager. This includes:
- images: a folder containing images of all the labelled pages;
- latest: a folder containing .json files with the labelled data from each page;
- schema.json: a file containing the fields to be extracted and their types;
- split.csv: a file specifying, for each document, whether it will be used for TRAIN or VALIDATE during the Training Pipeline
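Before starting a pipeline, it can help to confirm that the exported folder has the layout listed above. The following is a minimal sketch that only checks for the presence of the four entries; it does not validate their contents, and the function name is hypothetical.

```python
from pathlib import Path
from typing import List

# Entries expected in a dataset exported from Data Manager, per the list above.
REQUIRED_DIRS = ("images", "latest")
REQUIRED_FILES = ("schema.json", "split.csv")


def find_missing_entries(dataset_dir: str) -> List[str]:
    """Return the names of expected folders and files that are absent.

    Folder names are returned with a trailing slash to tell them apart
    from files. An empty list means the basic layout looks complete.
    """
    root = Path(dataset_dir)
    missing = [name + "/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    missing += [name for name in REQUIRED_FILES if not (root / name).is_file()]
    return missing
```

Calling `find_missing_entries("path/to/export")` on a complete export returns an empty list; any other result names what to look for in the export.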
- ml_model.epochs: customize the number of epochs for Training or Full Pipeline (the default value is 150)
When the pipeline is a Full or Evaluation pipeline, the Outputs pane also contains an "artifacts" folder, which contains two files:
- evaluation_metrics.txt contains the F1 scores of the fields which were predicted. Note that for line items only a global score is obtained for all columns taken together.
- evaluation.xlsx is an Excel spreadsheet with a side-by-side comparison of ground truth versus predicted value for each field predicted by the model, as well as a per-document accuracy metric, in order of increasing accuracy. Hence, the most inaccurate documents are presented at the top to facilitate diagnosis and troubleshooting.
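For reference, the F1 score is the harmonic mean of precision and recall. The sketch below shows the standard formula computed from counts of true positives, false positives, and false negatives. How the pipeline decides that a predicted field value counts as correct is not specified here, so this illustrates the metric itself, not the pipeline's implementation.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Compute F1 = 2 * precision * recall / (precision + recall).

    Returns 0.0 when there are no true positives, which also avoids
    division by zero when nothing was predicted or nothing was expected.
    """
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

A field that is always predicted correctly scores 1.0; missed values (false negatives) and wrong extra predictions (false positives) both pull the score down.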