Document Understanding
Document Understanding User Guide for Modern Experience
Last updated Apr 4, 2024

Training High Performing Models

The power of Machine Learning Models is that they are defined by training data rather than by explicit logic expressed in computer code. This means that extraordinary care is needed when preparing datasets because a model is only as good as the dataset that was used to train it. In that sense, what UiPath Studio is to RPA workflows, a Document Type session (in Document Understanding Cloud) is to Machine Learning capabilities. Both require some experience to be used effectively.

What Can a Data Extraction ML Model Do?

An ML Model can extract data from a single type of document, though it might cover several different languages. It is essential that each field (Total Amount, Date, etc.) has a single consistent meaning. If a human can be confused about the correct value for a field, then an ML Model will be too.

Ambiguous situations can appear. For instance, is a utility bill just another type of invoice? Or are these two different document types that require two different ML Models? If the fields you need to extract are the same (that is, they have the same meaning), then you can treat them as a single document type. However, if you need to extract different fields for different reasons (different business processes), then you need to treat these as two different document types and, hence, train two different models.

When in doubt, start by training a single model, but keep the documents in different Document Manager batches (see the Filter drop-down at the top of the view) so you can easily separate them later if needed. In this way, the labeling work is not lost. When it comes to ML Models, the more data, the better. So, having a single model with ample data is a good place to start.

Training and Evaluation Datasets

Document Manager can be used to build two types of datasets: training datasets and evaluation datasets.

Evaluation datasets are relevant for data scientists or for doing detailed troubleshooting to reproduce certain issues. For the vast majority of enterprise automation scenarios however, only the training dataset is needed and recommended. Training Pipelines now return full scoring artefacts, including the following:
  • Model overall Accuracy – this is used as the main score displayed in both Document Understanding Extractor Details view, and in the AI Center artefacts folder (also accessible from Document Understanding).
  • Model overall and field level F1 score – this is provided for historical reasons in the AI Center artefacts folder (also accessible from Document Understanding).
  • Detailed metrics per field and per batch, as well as side-by-side performance comparison in the Excel file available in the AI Center artefacts folder (also accessible from Document Understanding).

Scoring the model as part of the training is possible because the training set is automatically split randomly 80%/20% between Train and Validation data, and the scores are calculated on the Validation data.

As the dataset increases, both the Train and the Validation splits evolve, which means the scores are not directly comparable over time (which is what a data scientist would be interested in). However, they do reflect the performance of the model on the most recent data, so they are more representative of the current performance of the model on current production data (which is what business process owners are interested in).

With this approach, you have the following benefits:
  • You don't run the risk of looking at scores measured on obsolete data, potentially years old.
  • You reduce the amount of labelling you need to do.
  • All data you label contributes to actually improving the model, leading to better performance faster.

Data Extraction Components

Data extraction relies on the following components:

  • Optical Character Recognition
  • OCR Post-processing
    • Word and Line building
    • Grouping characters into words and words into lines of text
  • Machine Learning Model prediction for each word/box on the page
  • Data Normalization
    • For example, grouping words on several lines into an address, formatting a date to the standard yyyy-mm-dd format
  • Value selection
    • Applying an algorithm for selecting which of multiple instances of a certain value is actually returned, for the cases in which the document has two or more pages, and some fields appear on more than one page.

Build a High Performing ML Model

For the best outcome in terms of automation rate (percent reduction of manual work measured in person-months per year required to process your document flow) you need to carefully follow these steps:

  1. Choose the best OCR engine for your documents

    This influences both the OCR and the Word and Line building (which depends partially on the OCR), and everything downstream, of course.

  2. Define the fields to be extracted
  3. Select a well-balanced and representative dataset for Training
  4. Label the Training dataset
  5. Train the Extractor
  6. Define and implement the business rules for processing model output
  7. (Optional) Choose confidence threshold(s) for the Extraction
  8. Train using data from Validation Station
  9. Deploy your automation

1. Choose an OCR Engine

Your default option should be UiPath Document OCR for European Latin-based languages, or UiPath Chinese-Japanese-Korean OCR. In case you need to process other languages, including Cyrillic, Devanagari script, Thai, Vietnamese, Arabic, Hebrew or Persian, you may prefer Microsoft Azure Read OCR (Cloud only), Google Cloud Vision OCR (Cloud only) or OmniPage OCR (Activity pack in Studio). It is worth creating a few different Document Manager sessions with different OCR engines to check which performs best on your documents. Changing the OCR engine later in the project can be costly.

Please be aware that the Digitize Document activity has the ApplyOcrOnPDF setting set to Auto by default. This means that, when operating on .pdf documents, Digitize tries to scrape as much text as possible from the .pdf itself, applies OCR only to images such as logos, and combines the results.

However, .pdf documents are sometimes corrupt or unusually formatted, leading to errors in the extracted text. In this case, set ApplyOcrOnPDF to Yes.

Another benefit of applying OCR on all .pdf documents using UiPath Document OCR is that UiPath Document OCR recognizes checkboxes which are critical elements of documents like forms. Be aware however that applying OCR on everything will slow down the Digitization a bit.

2. Define Fields

Defining fields is a conversation that needs to happen with the Subject Matter Expert or Domain Expert who owns the business process itself. For invoices, it would be the Accounts Payable process owner. This conversation is critical: it needs to happen before any documents are labeled, to avoid wasted time, and it requires looking together at a minimum of 20 randomly chosen document samples. A one-hour slot needs to be reserved for this, and it often needs to be repeated after a couple of days, as the person preparing the data runs into ambiguous situations or edge cases.

It is not uncommon that the conversation starts with the assumption that you need to extract, let's say, 10 fields, and later you end up with 15. Some examples are described in the subsections below.

Some key configurations you need to be aware of:

  • Content type

    This is the most important setting as it determines the postprocessing of the values, especially for dates (it detects whether the format is US-style or non-US-style, and then formats them as yyyy-mm-dd) and for numbers (it detects the decimal separator: comma or period). For ID numbers, anything that comes before a colon or hash symbol is removed. The String content type performs no cleanup and can be used when you want to do your own parsing in the RPA workflow.

  • Multi-line checkbox

    This is for parsing strings like addresses that may appear on more than 1 line of text.

  • Multi-valued checkbox

    This is for handling multiple choice fields or other fields which may have more than one value, but are NOT represented as a table column. For example, an Ethnic group question on a government form may contain multiple checkboxes where you can select all that apply.

  • Hidden fields

    Fields marked as Hidden can be labelled but they are held out when data is exported, so the model cannot be trained on them. This is handy when labeling a field is a work in progress, when it is too rare, or when it is low priority.

  • Scoring

    This is relevant only for Evaluation pipelines, and it affects how the accuracy score is calculated. A field that uses Levenshtein scoring is more permissive: if a single character out of 10 is wrong, the score is 0.9. However, if scoring is Exact Match it is more strict: a single wrong character leads to a score of zero. Only String type fields have the option to select Levenshtein scoring by default.
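To make the difference between the two scoring modes concrete, here is a minimal Python sketch. It assumes the Levenshtein score is one minus the edit distance divided by the length of the longer string, which reproduces the 0.9 figure above; the exact formula used by the product is not stated here, and all function names are illustrative.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_score(expected: str, predicted: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(expected), len(predicted))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(expected, predicted) / longest


def exact_match_score(expected: str, predicted: str) -> float:
    """Strict scoring: any difference yields zero."""
    return 1.0 if expected == predicted else 0.0
```

With a 10-character value that has one wrong character, `levenshtein_score` gives 0.9 while `exact_match_score` gives 0.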

Amounts on Utility Bills

A total amount might seem straightforward enough, but utility bills contain many amounts. Sometimes you need the total amount to be paid. Other times you need only the current bill amount – without the amounts carried forward from previous billing periods. In the latter case, you need to label differently even if the current bill and total amount can be the same. The concepts are different and the amounts are often different.

Note: Each field represents a different concept, and they need to be defined as cleanly as possible, so there is no confusion. If a human might be confused, the ML model will also be confused.

Moreover, the current bill amount can sometimes be composed of a few different amounts, fees, and taxes and may not appear individualized anywhere on the bill. A possible solution to this is to create two fields: a previous-charges field, and a total field. These two always appear as distinct explicit values on the utility bill. Then the current bill amount can be obtained as the difference between the two. You might even want to include all 3 fields (previous-charges, total, and current-charges) in order to be able to do some consistency checks in cases where the current bill amount appears explicitly on the document. So you could go from one to three fields in some cases.
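The derivation and the consistency check described above could be sketched in Python as follows. The function names and the one-cent tolerance are assumptions made for illustration, not part of the product.

```python
from decimal import Decimal
from typing import Optional


def current_charges(total: Decimal, previous_charges: Decimal) -> Decimal:
    """Derive the current bill amount from the two explicit values."""
    return total - previous_charges


def amounts_consistent(
    total: Decimal,
    previous_charges: Decimal,
    extracted_current: Optional[Decimal],
    tolerance: Decimal = Decimal("0.01"),  # assumed tolerance
) -> bool:
    """Cross-check the extracted current-charges value, when one was found."""
    if extracted_current is None:
        return True  # nothing to cross-check on this document
    derived = current_charges(total, previous_charges)
    return abs(derived - extracted_current) <= tolerance
```

A failed check is a natural candidate for routing the document to manual review.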

Purchase Order numbers on Invoices

PO numbers can appear as single values for an invoice, or they might appear as part of the table of line items on an invoice, where each line item has a different PO number. In this case, it could make sense to have two different fields: po-no and item-po-no. By keeping each field visually and conceptually consistent, the model is likely to do a much better job. However, you need to make sure both are well represented in your Training and your Evaluation datasets.

Vendor name and Payment address name on Invoices

The company name usually appears at the top of an invoice or a utility bill, but sometimes it might not be readable because there is just a logo, and the company name is not explicitly written out. There could also be some stamp, or handwriting, or wrinkle over the text. In these cases, people might label the name that appears at the bottom right, in the Remit payment to section of the payslip on utility bills. That name is often the same, but not always, since it is a different concept. Payments can be made to some other parent or holding company, or other affiliate entity, and it is visually different on the document. This might lead to poor model performance. In this case, you should create two fields, vendor-name and payment-addr-name. Then you can look both up in a vendor database and use the one that matches, or use payment-addr-name when the vendor-name is missing.
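As a rough illustration of the lookup logic, the sketch below checks both extracted names against a set of known vendors. The function name, the case-insensitive comparison, and the choice to return nothing when vendor-name is present but unknown are all assumptions; a real workflow would query the vendor database instead of an in-memory set.

```python
from typing import Iterable, Optional


def resolve_vendor(
    vendor_name: Optional[str],
    payment_addr_name: Optional[str],
    known_vendors: Iterable[str],
) -> Optional[str]:
    """Return the extracted name to use, or None if manual review is needed."""
    known = {name.strip().lower() for name in known_vendors}
    # Prefer whichever extracted name matches the vendor database.
    for candidate in (vendor_name, payment_addr_name):
        if candidate and candidate.strip().lower() in known:
            return candidate.strip()
    # No match: fall back to payment-addr-name only when vendor-name is missing.
    if not vendor_name and payment_addr_name:
        return payment_addr_name.strip()
    return None
```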

Rows of tables

There are two distinct concepts to keep in mind: table rows and lines of text. A table row includes the values of all column fields which belong together in that row. Sometimes they may all be part of the same line of text going across the page. Other times they may be on different lines.

If a table row consists of more than one line of text, then you need to group all values on that table row using the “/” hotkey. When you do this, a green box will appear covering the entire table row. Here is an example of a table where the top two rows consist of multiple lines of text and need to be grouped using the “/” hotkey, while the third row is a single line of text and does not need to be grouped.

Here is an example of a table where each table row consists of a single line of text. You do not need to group these using the “/” hotkey because this is done implicitly by Document Manager.

Identifying where a row ends and another begins as one reads from top to bottom can often be a major challenge for ML extraction models, especially on documents like forms where there are no visual horizontal lines separating rows. Inside our ML Packages there is a special model which is trained to split tables into rows correctly. This model is trained by using these groups you label using the “/” or “Enter” keys and which are indicated by the green transparent boxes.

3. Create a Training Dataset

Machine Learning technology has the main benefit of being able to handle complex problems with high diversity. When estimating the size of a training dataset, one looks first at the number of fields and their types, and the number of languages. A single model can handle multiple languages as long as they are not Chinese/Japanese/Korean. The latter scenarios generally require separate Training datasets and separate models.

There are 3 types of fields:

  1. Regular fields (date, total amount)

    For Regular fields, you need at least 20-50 document samples per field. So, if you need to extract 10 regular fields, you need at least 200-500 samples. If you need to extract 20 regular fields, you need at least 400-1000 samples. The number of samples you need grows with the number of fields: each additional field adds roughly 20-50 samples.

  2. Column fields (item unit price, item quantity)

    For Column fields, you need at least 50-200 document samples per column field. For 5 column fields with clean and simple layouts, you might get good results with 300 document samples, but highly complex and diverse layouts might require over 1000. To cover multiple languages, you need at least 200-300 samples per language, assuming they cover all the different fields. So, for 10 header fields and 4 column fields with 2 languages, 600 samples might be enough (400 for the columns and headers plus 200 for the additional language), but some cases might require 1200 or more.

  3. Classification fields (currency)

    Classification fields generally require at least 10-20 samples from each class.
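The per-field guidelines above can be turned into a rough back-of-the-envelope estimate. The sketch below simply adds the per-field ranges and a per-language increment; combining them additively is an assumption made for illustration, classification fields are left out because they are counted per class, and the result is no substitute for the estimate shown in the product.

```python
from typing import Tuple


def estimate_sample_range(
    regular_fields: int,
    column_fields: int,
    languages: int = 1,
) -> Tuple[int, int]:
    """Return an assumed (low, high) range of document samples."""
    extra_languages = max(languages - 1, 0)
    # 20-50 samples per regular field, 50-200 per column field,
    # 200-300 per additional language.
    low = 20 * regular_fields + 50 * column_fields + 200 * extra_languages
    high = 50 * regular_fields + 200 * column_fields + 300 * extra_languages
    return low, high
```

For 10 regular fields, 4 column fields and 2 languages, the low end is 600 samples, which matches the example given for column fields.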

The recommended range for dataset size is based on the information provided in the Calculator tab. For simpler scenarios with few regular fields and clear document layouts, good results may be obtained with datasets in the low orange range. For complex scenarios, especially those involving complex tables with many columns, good results may require datasets in the high orange or even the green range.

Important: ML technology is designed to handle high diversity scenarios. Using it to train models on low diversity scenarios (1-10 layouts) requires special care to avoid brittle models that are sensitive to slight changes in the OCR text. Avoid this by having some deliberate variability in the training documents, by printing and then scanning or photographing them using mobile phone scanner apps. The slight distortions or changing resolutions make the model more robust.

In addition, these estimates assume that most pages contain all or most of the fields. In cases where you have documents with multiple pages but most fields are on a single page, then the relevant number of pages is the number of examples of that one page where most fields appear.

The numbers above are general guidelines, not strict requirements. In general, you can start with a smaller dataset, and then keep adding data until you get good accuracy. This is especially useful to parallelize the RPA work with the model building. Also, a first version of the model can be used to prelabel additional data (see Settings view and Predict button in Document Manager) which can accelerate labeling additional Training data.

Deep Learning models can generalize

You do not need to have every single layout represented in a training set. In fact, it is possible that most layouts in your production document flow have zero samples in your training set, or perhaps 1 or 2 samples. This is desirable, because you want to leverage the power of AI to understand the documents and be able to make correct predictions on documents that it has not seen during training. A large number of samples per layout is not mandatory because most layouts might either not be present at all, or be present only 1 or 2 times, and the model would still be able to predict correctly, based on the learning from other layouts.

Training over an Out-of-the-box model

A common situation is extracting data from invoices, for which we have a pretrained Out-of-the-box model, but additionally you have 2 more Regular fields and 1 more Column field which the pretrained Invoices model does not recognize. In this case you will usually need a much smaller dataset than if you had trained all the fields from scratch. The dataset size estimation is provided when you create a Document Type in Document Understanding Cloud, and it is then accessible in the Dataset Diagnostics view. However, keep in mind that whatever documents you use to train a model must be labelled fully, including the new fields, which the out-of-the-box model does not recognize, and also the original fields, which the out-of-the-box model does recognize.

Unequal field occurrences

Some fields might occur on every document (e.g., date, invoice number) while some fields might appear only on 10% of documents (e.g., handling charges, discount). In these cases, you need to make a business decision. If those rare fields are not critical to your automation, you can get away with a small number of samples (10-15) of that particular field, i.e. pages which contain a value for that field. However, if those fields are critical, then you need to make sure that you include in your Training set at least 30-50 samples of that field to make sure to cover the full diversity.

Balanced datasets

In the case of invoices, if a dataset contains invoices from 100 vendors, but half of the dataset consists only of invoices from one single vendor, then that is a very unbalanced dataset. A perfectly balanced dataset is where each vendor appears an equal number of times. Datasets do not need to be perfectly balanced, but you should avoid having more than 20% of your entire dataset coming from any single vendor. At some point, more data does not help, and it might even degrade the accuracy on other vendors because the model optimizes (overfits) so much for one vendor.
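A quick way to check this guideline is to count documents per vendor and flag any vendor above the 20% share. The sketch below assumes you can produce one vendor label per document; the function name is illustrative.

```python
from collections import Counter
from typing import Dict, Iterable


def overrepresented_vendors(
    vendor_per_document: Iterable[str],
    max_share: float = 0.20,
) -> Dict[str, float]:
    """Return vendors whose share of the dataset exceeds max_share."""
    counts = Counter(vendor_per_document)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        vendor: count / total
        for vendor, count in counts.items()
        if count / total > max_share
    }
```

Vendors returned by this check are candidates for downsampling before training.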

Representative datasets

Data should be chosen to cover the diversity of the documents likely to be seen in the production workflow. For example, if you get invoices in English but some of them come from the US, India and Australia, they will likely look different, so you need to make sure you have samples from all three. This is relevant not only for the model training itself, but also for labeling purposes because as you label the documents you might discover that you need to extract new, different fields from some of these regions, like GSTIN code from India, or ABN code from Australia.

Training/Validation Split

Any training dataset is automatically split behind the scenes into a Training set (randomly selected 80%) and Validation set (randomly selected 20%). During training of deep learning models, the Training set is used for backpropagation, the part that actually modifies the weights of the nodes in the network, while the Validation set is only used to know when to stop training. In other words it is used to prevent overfit on the Training set.

This is how we can get full evaluation scores at the end of any training run: we return the scores on the Validation set. This set is not, technically speaking, used for training the network, only for preventing overfit. However, since it is a randomly selected 20% of the entire dataset, it does tend to be similarly distributed to the Training set. This consistency is a good thing because it means that the scores you get reflect the model’s ability to learn from the data, which is what we usually care about. If a model is not able to learn, it means the data is inconsistent or the model has a limitation, and those are critical things we need to know immediately when training a model.

The downside of this approach is that scores cannot necessarily be compared exactly apples to apples as the dataset grows, and also that scores do not reflect the model’s ability to generalize – only its ability to learn. However, apples to apples comparisons and measuring ability to generalize are technical, data science concerns which have only indirect impact on business performance or ROI for most automations.
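The random 80%/20% split described in this section can be illustrated with a few lines of Python. This is only a sketch of the idea, not the code used inside the training pipeline; the function name and the optional seed are assumptions.

```python
import random
from typing import Iterable, List, Optional, Tuple


def split_train_validation(
    document_ids: Iterable,
    validation_fraction: float = 0.2,
    seed: Optional[int] = None,
) -> Tuple[List, List]:
    """Shuffle the documents and hold out a random fraction for validation."""
    rng = random.Random(seed)
    shuffled = list(document_ids)
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    # The first n_validation shuffled documents form the Validation set.
    return shuffled[n_validation:], shuffled[:n_validation]
```

Because the split is redrawn whenever the dataset changes, two runs on datasets of different sizes are scored on different Validation documents, which is why the scores are not strictly comparable over time.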

4. Label the Training Dataset

When labeling Training data, you need to focus on the bounding boxes of the words in the document pane of Document Manager. The parsed values in the right or top sidebars are not important as they are not used for training.

Whenever a field appears multiple times on a page, as long as they represent the same concept, all of them should be labeled.

When the OCR misses a word or gets a few characters wrong, just label the bounding box if there is one, and if not, then just skip it and keep going. There is no possibility to add a word in Document Manager because even if you did, the word would still be missing at run time, so adding it doesn't help the model at all.

As you label, remain vigilant about fields that may have multiple or overlapping meanings/concepts, in case you might need to split a field into two separate fields, or fields that you do not explicitly need, but which, if labelled, might help you to do certain validation or self-consistency check logic in the RPA workflow. Typical examples are quantity, unit-price, and line-amount on invoice line items. Line-amount is the product of quantity and unit-price, and this redundancy is very useful for checking consistency without the need for confidence levels.

5. Train the Extractor

To create an extractor, go to the Extractors view in Document Understanding and click the Create Extractor button at the top right. You can then select the Document Type, the ML Model and Version you would like to use. You can monitor the progress on the Extractors tab, or in the Details view of the Extractor, which contains a link to the AI Center pipeline, where you can see the detailed logs in real time.

When evaluating an ML model, the most powerful tool is the evaluation_<package name>.xlsx file generated in the artifacts/eval_metrics folder in AI Center pipeline details view. On the first sheet you can see a detailed Accuracy scores report, including overall scores, and also per field and per batch.

In this Excel file you can see what predictions are failing and on which files, and you can see immediately if it is an OCR error or an ML Extraction or parsing error, and if it may be fixed by simple logic in the RPA workflow, or it requires a different OCR engine, more training data, or improving the labelling or the field configurations in Document Manager.

This Excel file is also very useful to identify the most relevant business rules you need to apply to the RPA workflow in order to catch common mistakes for routing to Validation Station in Action Center for manual review. Business rules are by far the most reliable way to detect errors.

For those errors which cannot be caught by business rules, you may also use confidence levels. The Excel file also contains confidence levels for each prediction, so you can use Excel features like sorting and filtering to determine what a good confidence threshold is for your business scenario.

Overall, the evaluation_<package name>.xlsx Excel file is a key resource you need to focus on to get the best results from your AI automation.

Important: GPU training is highly recommended for large and production datasets. CPU training is much slower and should be used sparingly, for small datasets for demo or testing purposes. For more information, check the Training Pipelines page.

6. Define and Implement Business Rules

In this step, you should be concerned with model errors and how to detect them. There are three main ways to detect errors:

  • through enforcing business rules
  • through applying lookups in Systems of Record in the customer organization
  • through enforcing a minimum confidence level threshold

The most effective and reliable way to detect errors is by defining business rules and lookups. Confidence levels can never be 100% perfect: there will always be a small but non-zero percentage of correct predictions with low confidence or wrong predictions with high confidence. In addition, and perhaps most importantly, a missing field has no confidence, so a confidence threshold can never catch errors whereby a field is not extracted at all. Consequently, confidence level thresholds should only be used as a fallback, a safety net, but never as the main way to detect business-critical errors.

Examples of business rules:

  • Net amount plus Tax amount must equal Total amount
  • Total amount must be greater than or equal to Net amount
  • Invoice number, Date, Total amount (and other fields) must be present
  • PO number (if present) must exist in PO database
  • Invoice date must be in the past and cannot be more than X months old
  • Due date must be in the future and not more than Y days/months
  • For each line item, the quantity multiplied by unit price must equal the line amount
  • Sum of line amounts must equal net amount or total amount
  • Etc.
Note: In case of numbers, a rounding to eight decimals is performed.
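A few of the rules listed above could be sketched in Python as shown below. The dictionary keys, the function names, and the failure messages are hypothetical; the comparison rounds both sides to eight decimals to mirror the note above, but how the product applies that rounding internally is not specified here.

```python
from datetime import date
from typing import Dict, List

DECIMALS = 8  # amounts are compared after rounding to eight decimals


def same_amount(a: float, b: float) -> bool:
    """Compare two amounts after rounding, to absorb floating-point noise."""
    return round(a, DECIMALS) == round(b, DECIMALS)


def failed_rules(invoice: Dict, today: date) -> List[str]:
    """Return a description of every business rule the invoice violates."""
    failures = []
    # Mandatory fields must be present.
    for name in ("invoice_number", "invoice_date", "total"):
        if invoice.get(name) is None:
            failures.append(f"missing {name}")
    net, tax, total = invoice.get("net"), invoice.get("tax"), invoice.get("total")
    if None not in (net, tax, total) and not same_amount(net + tax, total):
        failures.append("net + tax != total")
    if None not in (net, total) and total < net:
        failures.append("total < net")
    invoice_date = invoice.get("invoice_date")
    if invoice_date is not None and invoice_date > today:
        failures.append("invoice date in the future")
    # Each line item: quantity multiplied by unit price must equal line amount.
    for index, item in enumerate(invoice.get("line_items", [])):
        expected = item["quantity"] * item["unit_price"]
        if not same_amount(expected, item["line_amount"]):
            failures.append(f"line {index}: quantity * unit price != line amount")
    return failures
```

An empty list means the document passed every rule checked here; a non-empty list can be passed along to the human validator to direct their attention.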

In particular, the confidence levels of column fields should almost never be used as an error detection mechanism. Column fields (e.g., line items on invoices or POs) can have dozens of values, so at least one value is very likely to have low confidence. Enforcing a minimum threshold over so many values would therefore route most or all documents to human validation, often unnecessarily.
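A small calculation shows why thresholds on column fields scale badly. The sketch assumes that each value clears the threshold independently with the same probability, which is a simplification, and the 98% pass rate is an invented figure used only for illustration.

```python
def probability_all_above_threshold(per_value_pass_rate: float, value_count: int) -> float:
    """Chance that every value in a document clears the threshold.

    Assumes the values pass or fail independently of each other.
    """
    return per_value_pass_rate ** value_count


# With an assumed 98% pass rate per value and 30 line-item values,
# only a little over half of the documents would go straight through.
straight_through = probability_all_above_threshold(0.98, 30)
```

Under these assumptions, nearly half of the documents would be sent to human validation even though almost every individual value is above the threshold.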

Business rules must be enforced as part of the RPA workflow, and the business rule failure is passed to the human validator so as to direct their attention and accelerate the process.
Note: When defining Business Rules, please keep in mind that the Starts with, Ends with, and Contains values are case sensitive.

7. (Optional) Choose a Confidence Threshold

After the business rules have been defined, sometimes there might remain a small number of fields for which there are no business rules in place, or for which the business rules are unlikely to catch all errors. For this, you may need to use a confidence threshold as a last resort.

The main tool to set this threshold is the Excel spreadsheet which is output by the Training pipeline in the Outputs > artifacts > eval_metrics folder.

This evaluation_<package name>.xlsx file contains a column for each field, and a column for the confidence level of each prediction. By sorting the table based on the confidence columns you may see where the errors start appearing for any given field and set a threshold above that level to ensure that only correctly extracted documents are sent straight through.
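The same sort-and-inspect procedure can be expressed in Python. The sketch below takes (confidence, is_correct) pairs for one field, which you would first have to read from the spreadsheet; the exact column layout of that file is not assumed here, and the function name is illustrative.

```python
from typing import Iterable, Optional, Tuple


def suggest_threshold(predictions: Iterable[Tuple[float, bool]]) -> Optional[float]:
    """Lowest confidence above which every prediction in the sample is correct.

    Returns None when the highest-confidence prediction is itself wrong.
    """
    # Sort by descending confidence; on ties, wrong predictions come first
    # so that a tied correct prediction is never used as the threshold.
    rows = sorted(predictions, key=lambda row: (-row[0], row[1]))
    threshold = None
    for confidence, is_correct in rows:
        if not is_correct:
            break  # first error reached: stop lowering the threshold
        threshold = confidence
    return threshold
```

The result only reflects the predictions in the file, so it is worth leaving some margin above the suggested value.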

8. Fine-tune with Data From Validation Station

Validation Station data can help improve the model predictions, yet, in many cases, it turns out that most errors are NOT due to the model itself but to the OCR, labelling errors or inconsistencies, or to postprocessing issues (e.g., date or number formatting). So, the first key aspect is that Validation Station data should be used only after the other Data Extraction Components have been verified and optimized to ensure good accuracy, and the only remaining area of improvement is the model prediction itself.

The second key aspect is that Validation Station data has a lower information density than the data labeled in Document Manager. Fundamentally, the Validation Station user only cares about getting the right value once. If an invoice has 5 pages, and the invoice number appears on every page, the Validation Station user validates it only on the first page. So, 80% of the values remain unlabelled. In Document Manager, all the values are labelled.

Finally, keep in mind Validation Station data needs to be added to the original manually labelled dataset, so that you always have a single training dataset which increases in size over time. You always need to train on the ML Package with the 0 (zero) minor version, which is the version released by UiPath Out-of-the-box.

Important: It is often wrongly assumed that the way to use Validation Station data is to iteratively train the previous model version, so the current batch is used to train package X.1 to obtain X.2. Then the next batch trains on X.2 to obtain X.3 and so on. This is the wrong way to use the product. Each Validation Station batch needs to be imported into the same Document Manager session as the original manually labeled data, making a larger dataset, which must always be used to train on the X.0 ML Package version.

Cautions on using Validation Station data

Validation Station data can potentially be of much higher volume since it is used in the production workflow. You do not want the dataset to become overwhelmed with Validation Station data because this can degrade the quality of the model due to the information density issue mentioned above.

The recommendation is to add a maximum of 2-3X the number of pages of Document Manager data and, beyond that, only cherry pick those vendors or samples where you see major failures. If there are known major changes to the production data, such as a new language, or a new geographic region being onboarded to the business process (expanding from US to Europe or South Asia), then representative data for those languages and regions should be added to Document Manager for manual labelling. Validation Station data is not appropriate for such major scope expansion.

One other potential issue with Validation Station data is balance. In production it is common that a majority of traffic comes from a small subset of vendors/customers/world regions. If allowed into the training set as is, this can lead to a highly biased model which performs well on a small subset of the data but performs poorly on most of the rest of the data. Therefore it is important to take special care when adding Validation Station data into a training set.

Here is a sample scenario. You have chosen a good OCR engine, labeled 500 pages in Document Manager, resulting in good performance, and you have deployed the model in a production RPA workflow. Validation Station is starting to generate data. You should randomly select up to a maximum of 1000-1500 pages from Validation Station and import them into the Document Manager together with the first 500 pages and train your ML model again. After that, you should look very carefully at the evaluation_<package name>.xlsx to make sure the model actually improved, and then you should deploy the new model to production.
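The random selection in this scenario can be sketched as follows. The function name and parameters are assumptions; the cap is expressed as a multiple of the manually labeled page count, following the 2-3X recommendation above.

```python
import random
from typing import Iterable, List, Optional


def sample_validation_station_pages(
    vs_page_ids: Iterable,
    manual_page_count: int,
    max_ratio: int = 3,
    seed: Optional[int] = None,
) -> List:
    """Randomly pick at most max_ratio * manual_page_count pages."""
    cap = max_ratio * manual_page_count
    pages = list(vs_page_ids)
    if len(pages) <= cap:
        return pages  # everything fits under the cap
    return random.Random(seed).sample(pages, cap)
```

With 500 manually labeled pages and the default ratio of 3, at most 1500 Validation Station pages are selected. Plain random sampling does not address the vendor balance issue described above, so the selection should still be checked for balance.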

9. Deploy Your Automation

Make sure to use the Document Understanding Process: Studio Template from the Templates section in the Studio start screen in order to apply best practices in Enterprise RPA architecture.
