Document Understanding ML Packages can run all three types of pipelines (Full pipeline, Training, and Evaluation).
For most use cases, no parameters need to be specified, because the training process uses advanced techniques to find a performant model.
You can find information about a Pipeline in two places: in the Details view, accessible from the contextual dropdown menu on the right side of the Pipelines table, or in the ML Logs tab in the left sidebar. The Details view contains an Outputs pane and a Logs page. The Outputs pane always contains a _results.json file with a summary of the Pipeline details, such as Package version, dataset, GPU usage, and execution time.
There are two kinds of training pipelines:
- On an ML Package of type Document Understanding
Training using a “Document Understanding” Package just trains a model from scratch on the dataset provided as input.
For use cases with low-diversity documents (forms), you might get good results with as few as 30-50 samples.
For use cases with diverse documents where you only need regular ("header") fields, you need at least 20-50 samples per field, so if you need to extract 10 regular fields, you would need at least 200-500 samples.
When you need to extract column fields (for example, line items), you need 50-200 samples per column field. For 5 column fields with clean and simple layouts, you might get good results with 300-400 samples, but highly complex and diverse layouts might require up to 1000.
If you also need to cover multiple languages, you need at least 200-300 samples per language. These numbers do not need to add up, except for the languages. So for 10 header fields and 5 column fields, 500 samples might be enough, but some cases might require over 1000.
- On an ML Package of a different type, such as Invoices, Receipts, Purchase Orders, Utility Bills, Invoices India, or Invoices Australia.
Training using one of these packages has one additional input: a base model. We also refer to this as retraining because you are not starting from scratch but from a base model. This approach uses a technique called Transfer Learning where the model takes advantage of the information encoded in another, preexisting model.
When you retrain on the same fields only to improve accuracy, you might get good results with only 100-500 additional documents. If you are adding new fields to the model, you need 30-50 documents per new field to get good results.
When choosing which Base model version to use, we strongly suggest you always use 1.0, the pre-trained version provided by UiPath out-of-the-box.
When using the latest versions of ML Packages, classification fields are retrained.
For older versions of ML Packages, classification fields are not retrained. When you retrain such a model, make sure that the dataset you label has at least 10-20 samples from each class you want the model to recognize, regardless of the performance of the pre-trained model you are using as a base model.
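The sample-count guidelines above are simple per-field arithmetic, so a rough sketch can make them easier to apply. The constants are copied from the text; the function names are illustrative and not part of any UiPath API. Because the text says the numbers do not need to add up, the sketch reports each category separately instead of combining them into a single total.

```python
# Rule-of-thumb (low, high) sample counts copied from the guidelines above.
# All names here are illustrative; none of this is a UiPath API.
HEADER_PER_FIELD = (20, 50)    # regular ("header") fields on diverse documents
COLUMN_PER_FIELD = (50, 200)   # column fields such as line items
PER_LANGUAGE = (200, 300)      # samples per language
PER_NEW_FIELD = (30, 50)       # retraining: documents per newly added field


def scale(per_item, count):
    """Multiply a (low, high) range by a number of items."""
    low, high = per_item
    return (low * count, high * count)


def from_scratch_ranges(header_fields, column_fields, languages=1):
    """Return one (low, high) range per category, without summing them."""
    return {
        "header": scale(HEADER_PER_FIELD, header_fields),
        "column": scale(COLUMN_PER_FIELD, column_fields),
        "languages": scale(PER_LANGUAGE, languages),
    }


def retraining_range_for_new_fields(new_fields):
    """Return the (low, high) document range for fields added during retraining."""
    return scale(PER_NEW_FIELD, new_fields)


print(from_scratch_ranges(header_fields=10, column_fields=5))
print(retraining_range_for_new_fields(new_fields=3))
```

These ranges are starting points only; as the text notes, clean layouts may need fewer samples and highly diverse layouts may need more.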
AI Center includes the capability of fine-tuning ML models using data that has been validated by a human using Validation Station.
As your RPA workflow processes documents using an existing ML model, some documents may require human validation using the Validation Station activity (available on attended bots or in the browser using Orchestrator Action Center).
The validated data generated in Validation Station can be exported using Machine Learning Extractor Trainer activity, and can be used to fine-tune ML models in AI Center.
We do not recommend using data from Validation Station to train ML models from scratch (that is, using the DocumentUnderstanding ML Package). Use it only to fine-tune existing ML models (including Out-of-the-Box models). For the detailed steps involved in fine-tuning an ML model, see the Import Documents section of the Data Manager documentation.
Regarding the volume of data to use for fine-tuning: if the initial model deployed to the RPA automation used 500 pages, add the first 1000-1500 pages coming through Validation Station directly. After that, add only documents with layouts (that is, vendors, in the case of invoices) that are not already represented at least 5 times in the dataset. To keep track of vendors already present, you might use the UiPath Data Service.
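The selection rule above can be sketched as a small filter. This is a minimal illustration under stated assumptions: each validated document is represented as a hypothetical (layout_key, page_count) pair, the initial phase is measured in pages using the lower end of the 1000-1500 range, and layout representation is counted per document. None of these names come from a UiPath API.

```python
from collections import Counter


def select_for_fine_tuning(documents, initial_pages=1000, min_per_layout=5,
                           existing_counts=None):
    """Pick validated documents to add to the fine-tuning dataset.

    documents: iterable of (layout_key, page_count) pairs (hypothetical shape).
    existing_counts: optional mapping of layout_key -> documents already in
    the dataset, for example loaded from wherever vendors are tracked.
    """
    layout_counts = Counter(existing_counts or {})
    pages_added = 0
    selected = []
    for layout_key, page_count in documents:
        in_initial_phase = pages_added < initial_pages
        under_represented = layout_counts[layout_key] < min_per_layout
        # Add everything during the initial phase; afterwards add only
        # layouts that appear fewer than min_per_layout times.
        if in_initial_phase or under_represented:
            selected.append((layout_key, page_count))
            layout_counts[layout_key] += 1
            pages_added += page_count
    return selected


# Tiny thresholds so the behaviour is visible on a handful of documents.
incoming = [("vendor_a", 1), ("vendor_a", 1), ("vendor_a", 1), ("vendor_b", 1)]
print(select_for_fine_tuning(incoming, initial_pages=2, min_per_layout=1))
```

In practice the layout key would have to come from your own workflow (for invoices, a vendor name or identifier), since the rule only works if layouts can be told apart.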
Minimal dataset size
To run Training or Full pipelines successfully, we strongly recommend at least 25 documents and at least 10 samples of each labelled field in your dataset. Otherwise, the pipeline will show the error "Dataset Creation Failed".
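A quick pre-flight check of these two thresholds might look like the sketch below. It assumes you already know the document count and have a per-field sample count (for example, tallied from your labelled data); the function name and the input shape are hypothetical.

```python
MIN_DOCUMENTS = 25           # from the recommendation above
MIN_SAMPLES_PER_FIELD = 10   # from the recommendation above


def check_minimum_dataset(num_documents, samples_per_field):
    """Return a list of human-readable problems; an empty list means OK.

    samples_per_field: mapping of field name -> number of labelled samples.
    """
    problems = []
    if num_documents < MIN_DOCUMENTS:
        problems.append(
            f"only {num_documents} documents, need at least {MIN_DOCUMENTS}"
        )
    for field, count in sorted(samples_per_field.items()):
        if count < MIN_SAMPLES_PER_FIELD:
            problems.append(
                f"field '{field}' has {count} samples, "
                f"need at least {MIN_SAMPLES_PER_FIELD}"
            )
    return problems


print(check_minimum_dataset(18, {"invoice-no": 18, "po-no": 4}))
```

Running a check like this before starting a pipeline is cheaper than waiting for the pipeline to fail.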
Retraining on top of previously trained models
As more data gets labelled, either using Data Manager or coming from Validation Station, the best results are obtained by maintaining a single dataset, adding more data to it, and always retraining on the base model provided by UiPath, with minor version 0. We strongly recommend that you avoid retraining on a base model that you previously trained yourself (minor version 1 or higher).
Using a GPU (AI Robot Pro) for training is at least 10 times faster than using a CPU (AI Robot).
Moreover, training on CPU is supported for datasets up to 500 images in size only. For larger datasets you will need to train using GPU.
A folder containing the exported dataset coming from Data Manager. This includes:
- images : folder containing images of all the labelled pages;
- latest : folder containing json files with the labelled data from each page;
- schema.json : file containing the fields to be extracted and their types;
- split.csv : file containing the assignment of each document to either TRAIN or VALIDATE during the Training Pipeline.
- model.epochs: customize the number of epochs for Training or Full Pipeline (the default value is 100)
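As a rough sketch, the exported dataset layout listed above (images, latest, schema.json, split.csv) can be checked locally before starting a pipeline. The function names and the folder name exported_dataset are hypothetical. The 500-image CPU limit is taken from the guidance earlier in this page, and counting every file in the images folder as one image is an assumption.

```python
from pathlib import Path

EXPECTED_DIRS = ("images", "latest")
EXPECTED_FILES = ("schema.json", "split.csv")
CPU_MAX_IMAGES = 500  # CPU training limit stated earlier in this page


def missing_entries(dataset_dir):
    """Return the names of expected folders/files that are not present."""
    root = Path(dataset_dir)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing


def needs_gpu(dataset_dir):
    """Return True if the images folder holds more files than the CPU limit.

    Assumes the images folder exists and that each file is one page image.
    """
    images = Path(dataset_dir) / "images"
    image_count = sum(1 for entry in images.iterdir() if entry.is_file())
    return image_count > CPU_MAX_IMAGES


# "exported_dataset" is a placeholder path for illustration.
print(missing_entries("exported_dataset"))
```

This only checks that the expected entries exist; it does not validate the contents of schema.json or split.csv.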
When the pipeline is a Full or Evaluation pipeline, the Outputs pane also contains an "artifacts" folder, which contains two files:
evaluation_metrics.txt contains the F1 scores of the fields which were predicted. Note that for line items only a global score is obtained for all columns taken together.
evaluation.xlsx is an Excel spreadsheet with a side-by-side comparison of ground truth versus predicted value for each field predicted by the model, as well as a per-document accuracy metric, in order of increasing accuracy. Hence, the most inaccurate documents are presented at the top to facilitate diagnosis and troubleshooting.
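For readers unfamiliar with the metric reported in evaluation_metrics.txt, the sketch below shows the standard definition of the F1 score as the harmonic mean of precision and recall. How the pipeline counts correct and incorrect predictions per field is not described here, so the inputs are plain counts and the function name is illustrative.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Standard F1: harmonic mean of precision and recall.

    Returns 0.0 when there are no predictions or no ground-truth values.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: 8 correct predictions, 2 wrong predictions, 2 missed values.
print(round(f1_score(8, 2, 2), 3))
```

A score close to 1 means the field is both rarely missed and rarely predicted wrongly; a low score on one field is a hint to look at that field's rows in evaluation.xlsx.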