2022.4

Document Understanding User Guide

Last updated Apr 4, 2025

The Auto-Fine-tuning Loop (Public Preview)

When training/retraining an ML Model, the first thing to keep in mind is that best results are obtained by accumulating all data into a single large and, ideally, carefully curated dataset. Training on dataset A, and then retraining the resulting model on dataset B will produce much worse results than training on the combined dataset A+B.

The second thing to keep in mind is that not all data is the same. Data labeled in a dedicated tool like Document Manager is generally of better quality, and results in a better-performing model, than data labeled in tools with a different focus, such as Validation Station. Data from Validation Station may be high quality from a business process point of view, but less so from a model training point of view, because an ML Model needs data in a very specific form, which is almost always different from the form needed by business processes. For instance, on a 10-page invoice, the invoice number may appear on each page. In Validation Station it is sufficient to indicate it on the first page, while in Document Manager you would label it on every page. In this case, 90% of the correct labels are missing from the Validation Station data. For this reason, Validation Station data has limited utility for training.
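The 90% figure in the invoice example can be checked with a short calculation. The sketch below is only an illustration: the function name is hypothetical, and it assumes the field appears exactly once on each page.

```python
def missing_label_fraction(total_occurrences: int, labeled_occurrences: int) -> float:
    """Return the fraction of correct field occurrences that carry no label."""
    if total_occurrences <= 0:
        raise ValueError("total_occurrences must be positive")
    if not 0 <= labeled_occurrences <= total_occurrences:
        raise ValueError("labeled_occurrences must be between 0 and total_occurrences")
    return (total_occurrences - labeled_occurrences) / total_occurrences


# 10-page invoice: the invoice number appears on every page,
# but only the first page was confirmed in Validation Station.
print(missing_label_fraction(total_occurrences=10, labeled_occurrences=1))  # 0.9
```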

To effectively train an ML Model, you need a single, well-rounded, high-quality, and representative dataset. A cumulative approach is therefore to keep adding data to the input dataset, so that the ML Model is trained on a larger dataset each time. One way to do this is to use the Auto-Fine-tuning loop.

To get a better understanding of this feature, let's see where Auto-Fine-tuning fits into the ML Model lifecycle.

The Lifecycle of an ML Model

In the lifecycle of any Machine Learning model, there are two major phases:

the build phase
the maintenance phase

The Build Phase

In this first phase, you use Document Manager to prepare the training dataset and the evaluation dataset in order to obtain the best performance possible.

At the same time, you build the RPA automation and business logic around the ML Model, which is at least as important as the model itself for obtaining the Return on Investment you expect.

The Maintenance Phase

In this second phase, you try to maintain the high-performance level you achieved in the build phase, preventing regressions.

Auto-Fine-tuning (and Validation Station data in general) pertains strictly to the maintenance phase. The objective of Auto-Fine-tuning is mainly to prevent the ML Model from regressing as data flowing through the process changes.

Important: Data fed back from the human validation using Validation Station should not be used to build a model from scratch. Building a model should be done by preparing training and evaluation datasets in Document Manager.

The Auto-Fine-tuning Loop Components

The Auto-Fine-tuning loop has the following components:

1. Robot Workflow: Machine Learning Extractor Trainer activity
2. Document Manager: Schedule Export feature
3. AI Center: Scheduled Auto-retraining Pipeline
4. (Optional) Auto-update ML Skills

Prerequisites

To be able to implement this functionality, two requirements have to be met beforehand:

You need to have created a Document Manager session in AI Center and configured a certain number of fields, so that you can label high-quality Training and Evaluation datasets. You can either define your fields manually or import a schema. If no fields are configured, the Schedule (Preview) tab is not enabled and a message is displayed on the screen.
You need to have trained a few versions of your ML Model, tested it, fixed any issues that could have occurred, and deployed it to your RPA+AI automation.

1. Robot Workflow: Machine Learning Extractor Trainer Activity

Add the Machine Learning Extractor Trainer activity to your workflow inside a Train Extractors Scope and configure the scope properly, making sure the Framework Alias contains the same alias as the Machine Learning Extractor alias in the Data Extraction Scope.

Then, select the Project and the Dataset associated with the Document Manager session that contains your Training and Evaluation datasets. The drop-down menus are prepopulated once you are connected to Orchestrator.

Note: You can set a value for the Output Folder property if you want to export the data locally in the workflow.

You can see the Dataset name on the Data Labelling view in AI Center, next to the name of the Data Labelling session.

For the selected dataset, the Machine Learning Extractor Trainer activity creates a folder called fine-tune and writes the exported documents there in three subfolders: documents, metadata, and predictions.

This is the folder from which the data is then automatically imported into Document Manager, merged with the previously existing data, and exported in the right format to be consumed by a Training or a Full pipeline.
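If you also export the data locally through the Output Folder property, a quick check of the folder layout can catch misconfiguration early. The following is a minimal sketch, assuming the local copy uses the same three subfolder names as the fine-tune folder described here (this guide does not confirm that). The function name is hypothetical.

```python
from pathlib import Path

# Subfolder names taken from the description of the fine-tune folder.
EXPECTED_SUBFOLDERS = ("documents", "metadata", "predictions")


def missing_fine_tune_subfolders(fine_tune_dir: Path) -> list[str]:
    """Return the expected subfolders that do not exist under fine_tune_dir."""
    return [
        name
        for name in EXPECTED_SUBFOLDERS
        if not (fine_tune_dir / name).is_dir()
    ]
```

An empty list means all three subfolders are present. Any returned name points to a part of the export that was not written.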

2. Document Manager: Schedule Export Feature

From a Document Manager session, click the Export button, go to the Schedule (Preview) tab, and enable the Scheduling slider. Then select a start time and a recurrence. When ready, click the Schedule button.

The Backwards-compatible export checkbox enables you to apply the legacy export behavior, which is to export each page as a separate document. Try this if a model trained using the default export performs below expectations. Leave it unchecked to export the documents in their original multi-page form.

Note:

The minimum recurrence is 1 day and the maximum recurrence is 60 days.

Because AI Center training pipelines are mainly configured to run weekly, a recurrence of 7 days is recommended.

When you set the schedule for export, the imported data from the fine-tune folder is exported to the export folder under auto-export time_stamp.

To be more specific, the Scheduled Export imports the data which exists in the fine-tune folder created in Step 1, and then it exports the full dataset, including the previously existing data and the newly imported Validation Station data, into the export folder. So with each scheduled export, the exported dataset gets larger and larger.

The file latest.txt is updated with each scheduled export, or created if this is the first scheduled export. It contains the name of the latest export made by Document Manager. A schema export, however, does not update latest.txt. The auto-retraining pipeline in AI Center uses this file to determine which export is the latest, so that it always trains on the latest data. Never remove or modify this file; otherwise, your auto-retraining pipelines will fail.
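To illustrate how a marker file like latest.txt can be used to find the most recent export, here is a minimal read-only sketch. It assumes a local copy of the export folder, that latest.txt holds only the export name on a single line, and that the export is a subfolder with that name. None of these details are confirmed by this guide, and the function name is hypothetical. The sketch does not describe how the AI Center pipeline itself reads the file.

```python
from pathlib import Path


def resolve_latest_export(export_dir: Path) -> Path:
    """Return the path of the export named in export_dir/latest.txt.

    The marker file is only read, never written or deleted.
    """
    marker = export_dir / "latest.txt"
    if not marker.is_file():
        raise FileNotFoundError(f"No latest.txt found in {export_dir}")

    # Assumption: the file contains just the export folder name.
    export_name = marker.read_text(encoding="utf-8").strip()
    if not export_name:
        raise ValueError("latest.txt is empty")

    return export_dir / export_name
```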

Note: The Scheduled import+export operation might take up to 1-2 hours, depending on how much data was sent from Step 1 during the previous week. We recommend you choose a time when you will not be using Document Manager, because no other exports or imports are allowed while an export operation is ongoing. However, labeling is always possible.

3. AI Center: Scheduled Auto-retraining Pipeline

When scheduling a Training or Full Pipeline in AI Center, there are a few aspects that need to be taken into consideration.

First, we strongly recommend you create an Evaluation dataset and only schedule Full pipelines. Full pipelines run Training and Evaluation together, and the Evaluation pipeline uses the Evaluation dataset to produce a score. This score is critical for deciding whether the new version is better than the previous version and whether it can be deployed to be consumed by Robots.

Second, for the Full Pipeline you need to specify two datasets: an Input Dataset and an Evaluation Dataset.

There is no change to the Evaluation dataset in the context of the Auto-Fine-tuning Loop feature. You still need to select a dataset as usual, containing the two folders: images and latest, and the two files: schema.json and split.csv.
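The expected contents of the Evaluation dataset can be expressed as a small checklist. The sketch below assumes you have a local copy of the dataset to inspect. The function name is hypothetical; the folder and file names are the ones listed above.

```python
from pathlib import Path

# Names taken from the description of the Evaluation dataset.
REQUIRED_FOLDERS = ("images", "latest")
REQUIRED_FILES = ("schema.json", "split.csv")


def evaluation_dataset_problems(dataset_dir: Path) -> list[str]:
    """Return a list of human-readable problems; empty if the layout is complete."""
    problems = [
        f"missing folder: {name}"
        for name in REQUIRED_FOLDERS
        if not (dataset_dir / name).is_dir()
    ]
    problems += [
        f"missing file: {name}"
        for name in REQUIRED_FILES
        if not (dataset_dir / name).is_file()
    ]
    return problems
```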

However, for the Input dataset you no longer select a regular dataset. Instead, you need to select the export folder in the AI Center dataset that is connected to the Data Labelling session. This way, the Training runs on the latest export from your Data Labelling session, while the Evaluation runs on the same Evaluation dataset you specify.

Important: If you do not select the export folder, the auto-retraining does not work.

Third, you need to set the auto-retraining environment variable to True.

Finally, you need to select Recurring and set a day and time that leave enough time for the export from Document Manager to finish. For example, if the Document Manager export runs at 1 AM on Saturday, then the Pipeline might run at 2 or 3 AM on Saturday. If the export is not finished when the pipeline runs, the pipeline uses the previous export, and it might retrain on the same data it trained on the previous week.
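The timing constraint can be sketched as a simple calculation. The example below assumes the export takes at most 2 hours, the upper end of the estimate given in this guide, which makes 3 AM the conservative choice for a 1 AM export. The function names and the date are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: upper bound of the 1-2 hour import+export estimate.
MAX_EXPORT_DURATION = timedelta(hours=2)


def earliest_pipeline_start(
    export_start: datetime,
    max_export_duration: timedelta = MAX_EXPORT_DURATION,
) -> datetime:
    """Return the earliest time at which the export should have finished."""
    return export_start + max_export_duration


def pipeline_leaves_enough_time(
    export_start: datetime,
    pipeline_start: datetime,
    max_export_duration: timedelta = MAX_EXPORT_DURATION,
) -> bool:
    """Check that the pipeline does not start before the export can finish."""
    return pipeline_start >= earliest_pipeline_start(export_start, max_export_duration)


export_start = datetime(2025, 4, 5, 1, 0)  # a Saturday, 1 AM
print(earliest_pipeline_start(export_start))  # 2025-04-05 03:00:00
```

If your exports reliably finish faster, pass a shorter `max_export_duration` rather than changing the default.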

4. (Optional) Auto-update ML Skills

If you want to automatically deploy the latest version of the ML package which is produced by the automatically scheduled Training Pipelines, you can enable the Auto-update feature on the ML Skill.

Note:

The ML Skill is automatically updated regardless of whether the accuracy score improved over the previous training, so please use this feature with care.

In some cases, the overall score may improve even though a specific field regresses a little. However, that field might be critical for your business process, so auto-updating and auto-retraining, in general, require careful monitoring in order to be successful.
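One way to monitor this is to compare per-field scores between the previous and the new version before relying on the overall score. The sketch below is illustrative only: the field names and scores are invented, the function name is hypothetical, and it assumes you have already collected per-field scores into dictionaries. It does not use any AI Center API.

```python
def regressed_critical_fields(
    previous: dict[str, float],
    candidate: dict[str, float],
    critical_fields: list[str],
    tolerance: float = 0.0,
) -> list[str]:
    """Return critical fields whose score dropped by more than tolerance."""
    return [
        field
        for field in critical_fields
        if candidate[field] < previous[field] - tolerance
    ]


# Invented scores: the average improves, yet one critical field regresses.
previous = {"invoice-no": 0.95, "total": 0.90, "vendor": 0.70}
candidate = {"invoice-no": 0.92, "total": 0.93, "vendor": 0.85}

print(regressed_critical_fields(previous, candidate, ["invoice-no", "total"]))
# ['invoice-no']
```

A non-empty result is a signal to review the new version manually before it reaches production, even if its overall score is higher.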