Document Understanding
2020.10
DEPRECATED
Document Understanding User Guide
Last updated Feb 28, 2024

Training and Evaluation Pipelines

Document Understanding ML Packages can run all three types of pipelines (Full pipeline, Training, and Evaluation).

For most use cases, no parameters need to be specified; the training process uses advanced techniques to find a performant model.

You may obtain information about a Pipeline in two places: in the Details view accessible from the contextual dropdown menu on the right side of the Pipelines table, or in the ML Logs tab from the left sidebar. The Details view contains an Outputs pane and a Logs page. The Outputs pane will always contain a _results.json file containing a summary of the Pipeline details, such as Package version, dataset, GPU usage, and execution time.

Training and Retraining Pipelines

There are two kinds of training pipelines:

  1. On an ML Package of type Document Understanding
  2. On an ML Package of a different type, such as Invoices, Receipts, Purchase Orders, Utility Bills, Invoices India, or Invoices Australia.

Training using a “Document Understanding” Package just trains a model from scratch on the dataset provided as input.

For use cases with low-diversity documents (forms), you might get good results with as few as 30-50 samples.

For use cases with diverse documents where you only need regular ("header") fields, you need at least 20-50 samples per field, so if you need to extract 10 regular fields, you would need at least 200-500 samples.

When you need to extract column fields (for example, line items), you need 50-200 samples per column field. For 5 column fields with clean and simple layouts, you might get good results with 300-400 samples, but highly complex and diverse layouts might require up to 1000.

If you also need to cover multiple languages, you need at least 200-300 samples per language. The per-field numbers do not need to be added together; only the per-language numbers do. So for 10 header fields and 5 column fields, 500 samples might be enough, but some cases might require over 1000.
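The arithmetic behind these guidelines can be sketched in a few lines of Python. This is only a rough illustration: the function names are hypothetical, and the combination rule (take the largest requirement rather than the sum, and multiply only the language requirement by the number of languages) is one reading of the guidance above, not an official formula.

```python
# Per-unit (low, high) sample ranges taken from the guidance above.
HEADER_PER_FIELD = (20, 50)    # samples per regular ("header") field
COLUMN_PER_FIELD = (50, 200)   # samples per column field
PER_LANGUAGE = (200, 300)      # samples per language, when covering several


def scaled(per_unit, count):
    """Multiply a (low, high) per-unit range by a count."""
    low, high = per_unit
    return (low * count, high * count)


def estimate_samples(header_fields, column_fields, languages=1):
    """Return a rough (low, high) dataset size estimate.

    Assumption: field requirements overlap, so the largest one wins,
    while the language requirement scales with the number of languages.
    """
    candidates = [
        scaled(HEADER_PER_FIELD, header_fields),
        scaled(COLUMN_PER_FIELD, column_fields),
    ]
    if languages > 1:
        candidates.append(scaled(PER_LANGUAGE, languages))
    low = max(c[0] for c in candidates)
    high = max(c[1] for c in candidates)
    return (low, high)


print(estimate_samples(10, 0))   # 10 header fields only
print(estimate_samples(10, 5))   # 10 header fields and 5 column fields
```

For 10 header fields alone this gives the 200-500 range quoted above. Treat the result as a starting point, since layout diversity matters more than any formula.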

Training using one of the packages listed in item 2 requires one additional input: a base model. We also refer to this as retraining because you are not starting from scratch but from a base model. This approach uses a technique called Transfer Learning, where the model takes advantage of the information encoded in another, preexisting model.

When you are training on the same fields only to optimize accuracy, you might get good results with only 100-500 additional documents. If you are adding new fields to the model, you need 30-50 documents per new field to get good results.

When choosing which Base model version to use, we strongly suggest you always use 1.0, the pre-trained version provided by UiPath out-of-the-box.

Note:

Classification fields are not retrained, so you need to make sure, when you retrain a model, that the dataset you label has at least 10-20 samples from each class you want the model to be able to recognize, regardless of the performance of the Pre-trained model you are using as a base model.

Fine-tuning Using Data From Validation Station (Preview)

The September 2020 release of AI Fabric includes the capability of fine-tuning ML models using data that has been validated by a human using Validation Station.

As your RPA workflow processes documents using an existing ML model, some documents may require human validation using the Validation Station activity (available on attended bots or in the browser using Orchestrator Action Center).

The validated data generated in Validation Station can be exported using the Machine Learning Extractor Trainer activity and then used to fine-tune ML models in AI Fabric.

We do not recommend using data from Validation Station to train ML models from scratch (that is, using the DocumentUnderstanding ML Package). Use it only to fine-tune existing ML models, including Out-of-the-Box ML models.

For the detailed steps involved in fine-tuning an ML model, see the Validation Station Dataset Import section of the Data Manager documentation.

Important: To run Training or Full pipelines successfully, we strongly recommend at least 25 documents and at least 10 samples of each labelled field in your dataset. Otherwise, the pipeline fails with the error "Dataset Creation Failed".
Important: As more data gets labelled, either using Data Manager or coming from Validation Station, best results are obtained by maintaining a single dataset and adding more data to it, and always retraining on the base model provided by UiPath, with minor version 0. It is strongly recommended to avoid retraining using a base model which you trained yourself previously (minor version 1 or higher).
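The dataset minimums recommended above (25 documents, 10 samples per labelled field) can be checked before a pipeline is started. The sketch below is a minimal illustration: the function name and the `samples_per_field` mapping are hypothetical, and you would need to compute the counts from your own labelled data.

```python
MIN_DOCUMENTS = 25          # recommended minimum number of documents
MIN_SAMPLES_PER_FIELD = 10  # recommended minimum samples per labelled field


def check_dataset_minimums(document_count, samples_per_field):
    """Return a list of problems; an empty list means the minimums are met.

    samples_per_field is a hypothetical mapping of
    field name -> number of labelled samples for that field.
    """
    problems = []
    if document_count < MIN_DOCUMENTS:
        problems.append(
            f"only {document_count} documents, need at least {MIN_DOCUMENTS}"
        )
    for field, count in sorted(samples_per_field.items()):
        if count < MIN_SAMPLES_PER_FIELD:
            problems.append(
                f"field '{field}' has {count} samples, "
                f"need at least {MIN_SAMPLES_PER_FIELD}"
            )
    return problems


# Example with made-up counts: one field is below the minimum.
for problem in check_dataset_minimums(30, {"invoice-no": 28, "due-date": 7}):
    print(problem)
```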

Training on GPU or on CPU

Using a GPU (AI Robot Pro) for training is at least 10 times faster than using a CPU (AI Robot). Training Document Understanding models on GPU requires a GPU with at least 11 GB of video RAM.

The GPU must support NVIDIA driver version 418.0 or later and CUDA version 9.0 or later.

Training on CPU is supported only for datasets of up to 500 images. For larger datasets, you need to train using a GPU.
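As a rough illustration of the limits in this section, the choice between CPU and GPU can be expressed as a small helper. The function name and its parameters are hypothetical, and the sketch checks only the dataset size and video RAM limits, not the driver versions.

```python
CPU_MAX_IMAGES = 500   # CPU training is supported up to this dataset size
MIN_GPU_VRAM_GB = 11   # minimum video RAM for GPU training


def choose_training_hardware(image_count, gpu_vram_gb=None):
    """Return "GPU" or "CPU", or raise ValueError if neither fits.

    gpu_vram_gb is None when no GPU is available.
    """
    if gpu_vram_gb is not None and gpu_vram_gb >= MIN_GPU_VRAM_GB:
        return "GPU"  # preferred whenever a suitable GPU exists
    if image_count <= CPU_MAX_IMAGES:
        return "CPU"
    raise ValueError(
        f"{image_count} images exceeds the CPU limit of {CPU_MAX_IMAGES} "
        f"and no GPU with at least {MIN_GPU_VRAM_GB} GB of video RAM is available"
    )


print(choose_training_hardware(300))                   # small dataset, no GPU
print(choose_training_hardware(2000, gpu_vram_gb=16))  # large dataset, GPU
```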

Dataset Format

The input dataset is a folder containing the dataset exported from Data Manager. It includes:

  • images: a folder containing images of all the labelled pages;
  • latest: a folder containing .json files with the labelled data from each page;
  • schema.json: a file containing the fields to be extracted and their types;
  • split.csv: a file containing the split assignment for each document, that is, whether it is used for TRAIN or VALIDATE during the Training Pipeline.
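Before uploading, it can help to confirm that an exported folder contains the four entries listed above. The following sketch checks only for their presence; the function name is hypothetical, and it does not validate the contents of any file.

```python
from pathlib import Path

# Entries expected in a dataset exported from Data Manager (see list above).
REQUIRED_DIRS = ("images", "latest")
REQUIRED_FILES = ("schema.json", "split.csv")


def missing_dataset_entries(dataset_dir):
    """Return the names of expected folders and files that are missing."""
    root = Path(dataset_dir)
    missing = [name for name in REQUIRED_DIRS if not (root / name).is_dir()]
    missing += [name for name in REQUIRED_FILES if not (root / name).is_file()]
    return missing


if __name__ == "__main__":
    import sys

    # Usage: python check_dataset.py path/to/exported_dataset
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_dataset_entries(target)
    print("Missing entries:", ", ".join(missing) if missing else "none")
```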


Environment Variables

  • ml_model.epochs: customize the number of epochs for Training or Full Pipeline (the default value is 150)

Artifacts

When the pipeline is a Full or Evaluation pipeline, the Outputs pane also contains an "artifacts" folder with two files:

  • evaluation_metrics.txt contains the F1 scores of the fields which were predicted. Note that for line items only a global score is obtained for all columns taken together.
  • evaluation.xlsx is an Excel spreadsheet with a side-by-side comparison of ground truth versus predicted value for each field predicted by the model, as well as a per-document accuracy metric, in order of increasing accuracy. Hence, the most inaccurate documents are presented at the top to facilitate diagnosis and troubleshooting.

© 2005-2024 UiPath. All rights reserved.