The power of Machine Learning Models is that they are defined by training data rather than by explicit logic expressed in computer code. This means that extraordinary care is needed when preparing datasets because a model is only as good as the dataset that was used to train it. In that sense, what UiPath Studio is to RPA workflows, Document Manager is to Machine Learning capabilities. Both require some experience to be used effectively.
An ML Model can extract data from a single type of document, though it might cover several different languages. It is essential that each field (Total Amount, Date, etc.) has a single consistent meaning. If a human can be confused about the correct value for a field, then the same happens for an ML Model.
Ambiguous situations can appear. For instance, is a utility bill just another type of invoice? Or are these two different document types that require two different ML Models? If the fields you need to extract are the same (i.e., they have the same meaning), then you can treat them as a single document type. However, if you need to extract different fields for different reasons (different business processes), then you need to treat these as two different document types and, hence, train two different models.
When in doubt, start by training a single model, but keep the documents in different Document Manager batches (see the Filter dropdown at the top center of the Document Manager view). You can easily separate the documents later if needed, avoiding losing the labeling work. When it comes to ML Models, the more data, the better. Start by having a single model with ample data.
Document Manager can be used to build two types of datasets:
- training datasets
- evaluation datasets
Both types of dataset are essential for building a high-performing ML Model and require time and effort for creating and maintaining them. An evaluation dataset that is representative of the production document traffic is required for obtaining a high-performing ML model.
Each dataset type is labeled in a different way:
- Training datasets rely on the bounding boxes of the words on the page representing the different pieces of information you need to extract.
- When labeling a Training set, focus on the page itself and the word boxes.
- Evaluation datasets rely on the values of the fields, which appear in the sidebar (for Regular fields) or the top bar (for Column fields).
- When labeling an Evaluation set, focus on the values under the field names in the sidebar or top bar. That does not mean you need to type them in manually. We recommend labeling by clicking on the boxes on the page and checking the correctness of the values.
More details on how to conduct proper Evaluations can be found below.
Data extraction relies on the following components:
- Optical Character Recognition
- Word and Line building
- Grouping characters into words and words into left-to-right lines of text
- Machine Learning Model prediction for each word/box on the page
- Cleanup, parsing, and formatting of the spans of text
- For example, grouping words on several lines into an address, formatting a date to the standard yyyy-mm-dd format
- Applying an algorithm for selecting which value is returned
- For the cases when the document has two or more pages and some fields appear on more than one page
For the best outcome in terms of automation rate (percent reduction of manual work measured in person-months per year required to process your document flow) follow these steps:
- Choose an OCR engine
- This influences the OCR, the Word and Line building (which partially depends on the OCR), and everything that follows.
- Select a well-balanced and representative dataset for Training
- Select a representative dataset for Evaluation
- Define the fields to be extracted
- Configure the fields
- Label the Training dataset
- Label the Evaluation dataset
- Train and Evaluate the model in AI Center
- Define and implement the business rules for processing model output
- (Optional) Choose confidence threshold(s) for the Extraction
- Training using data from Validation Station
- The Auto-Fine-tuning Loop (Preview)
- Deploy your automation
To choose an OCR engine, you should create different Document Manager sessions, configure different OCR engines, and try to import the same files into each of them to examine differences. Focus on the areas that you want to extract. For example, if you need to extract company names that appear as part of logos on invoices, you may want to see which OCR engine performs better on the text in logos.
Your default option should be UiPath Document OCR since it is included with Document Understanding licenses at no charge. However, in cases where some unsupported languages are required, or some very hard-to-read documents are involved, you might want to try Google Cloud (Cloud only) or Microsoft Read (Cloud or On Premises), which have better language coverage. These engines come at a cost, although a low one. If the accuracy is higher on some data fields that are critical for your business process, it is strongly recommended to use the best OCR available. This saves time later on, since everything downstream depends on it.
Please be aware that the Digitize Document activity has the ApplyOcrOnPDF setting set to Auto by default, which means the activity decides whether to apply OCR based on the input document. To avoid missing important information (from logos, headers, footers, etc.), set ApplyOcrOnPDF to Yes. This makes sure that all text is detected, though it might slow down your process.
Machine Learning technology has the main benefit of being able to handle complex problems with high diversity. When estimating the size of a training dataset, one looks first at the number of fields and their types, and the number of languages. A single model can handle multiple languages as long as they are not Chinese/Japanese/Korean. Chinese/Japanese/Korean scenarios generally require separate Training datasets and separate models.
There are 3 types of fields:
- Regular fields (date, total amount)
- For Regular fields, you need at least 20-50 document samples per field. So, if you need to extract 10 regular fields, you need at least 200-500 document samples. If you need to extract 20 regular fields, you need at least 400-1000 document samples. The number of document samples you need grows with the number of fields: each additional field requires roughly 20-50 more document samples.
- Column fields (item unit price, item quantity)
- For Column fields, you need at least 50-200 document samples per column field, so for 5 column fields with clean and simple layouts, you might get good results with 300 document samples. Highly complex and diverse layouts might require over 1000 document samples. To cover multiple languages, you need at least 200-300 document samples per language, assuming they cover all the different fields. So, for 10 header fields and 4 column fields with 2 languages, 600 document samples might be enough (400 for the columns and headers, plus 200 for the additional language), but some cases might require 1200 or more document samples.
- Classification fields (currency)
- Classification fields generally require at least 10-20 document samples from each class.
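The per-field guidelines above can be turned into a rough calculation. The sketch below is only an illustration: the function name is hypothetical, and adding the per-field and per-language ranges together is an assumption chosen because it reproduces the worked example in the text (10 header fields, 4 column fields, and 2 languages give a lower bound of 600). Classification fields are left out for brevity.

```python
def estimate_training_samples(regular_fields, column_fields, languages=1):
    """Return a rough (low, high) range of document samples for a high-diversity scenario.

    Assumption: the guideline ranges are simply added together.
    """
    # 20-50 samples per regular field, 50-200 per column field (figures from the text).
    low = regular_fields * 20 + column_fields * 50
    high = regular_fields * 50 + column_fields * 200
    # Each language beyond the first adds 200-300 samples (figures from the text).
    extra_languages = max(languages - 1, 0)
    low += extra_languages * 200
    high += extra_languages * 300
    return low, high


if __name__ == "__main__":
    # Worked example from the text: 10 header fields, 4 column fields, 2 languages.
    print(estimate_training_samples(10, 4, languages=2))
```

Treat the result as a starting point only. The upper bound in particular is loose, and real requirements depend on layout diversity.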
The above guidelines assume that you are solving a high-diversity scenario like invoices or purchase orders with dozens to hundreds or thousands of layouts. However, if you are solving a low-diversity scenario like a tax form or invoices with very few layouts (under 5-10), then the dataset size is determined more by the number of layouts. In this case, you should start with 20-30 pages per layout and add more if needed, especially if the pages are very dense with a large number of fields to be extracted. For instance, creating a model for extracting 10 fields from 2 layouts might require 60 pages, but if you need to extract 50 or 100 fields from 2 layouts, then you might start with 100 or 200 pages and add more as needed to get the desired accuracy. In this case, the distinction between regular fields and column fields is less important.
ML technology is designed to handle high diversity scenarios. Using it to train models on low diversity scenarios (1-10 layouts) requires special care to avoid brittle models that are sensitive to slight changes in the OCR text. Avoid this by having some deliberate variability in the training documents, by printing and then scanning or photographing them using mobile phone scanner apps. The slight distortions or changing resolutions make the model more robust.
These estimates assume that most pages contain all or most of the fields. For multi-page documents where most fields are on a single page, the relevant number of pages is the number of examples of that one page where most fields appear.
The numbers above are general guidelines, not strict requirements. In general, you can start with a smaller dataset, and then keep adding data until you get good accuracy. This is especially useful to parallelize the RPA work with the model building. Also, a first version of the model can be used to prelabel additional data (see Settings view and Predict button in Document Manager) which can accelerate labeling additional Training data.
You do not need to have every single layout represented in a training set. In fact, most layouts in your production document flow have zero samples in your training set, or only one or two document samples. This is desirable, because you want to leverage the power of AI to understand documents and make correct predictions on documents that it has not seen during training. Having many document samples per layout is not mandatory: the model can still predict correctly based on what it learned from other layouts.
There are three main types of scenarios when training an ML model for Document Understanding:
- training a new type of document from scratch using the DocumentUnderstanding ML Package in AI Center
- retraining over a pre-trained Out Of the Box model for optimizing the accuracy
- retraining over a pre-trained Out Of the Box model for optimizing the accuracy and adding some new fields
The dataset size estimates for the first type of scenario are described in the first part of this section titled "Create a Training Set".
For the second type of scenario, the dataset size depends on how well the pre-trained models already work on your documents. If they work very well already, then you may need very little data (50-100 pages). If they fail on a number of important fields, you may need more, but a good starting point would still be four times smaller than if you were training from scratch.
And finally for the third type of scenario, start with the dataset size for the second scenario above, and then increase the dataset depending on how many new fields you have, using the same guidance as for training from scratch: at least 20-50 pages per new regular field, or at least 50-200 pages per column field.
In all these cases, all the documents need to be labelled fully, including both the new fields, which the out-of-the-box model does not recognize, and the original fields, which the out-of-the-box model does recognize.
Some fields might appear on every document (e.g., date, invoice number) while some fields might appear only on 10% of the pages (e.g., handling charges, discount). In these cases, you need to make a business decision. If those rare fields are not critical to your automation, you can get away with a small number of document samples (10-15) of that particular field, i.e. pages which contain a value for that field. However, if those fields are critical, then you need to make sure that you include in your Training set at least 30-50 document samples of that field to make sure to cover the full diversity.
In the case of invoices, if a dataset contains invoices from 100 vendors, but half of the dataset consists of invoices from one single vendor, then that is a very unbalanced dataset. A perfectly balanced dataset is where each vendor appears an equal number of times. Datasets do not need to be perfectly balanced, but you should avoid having more than 20% of your entire dataset coming from any single vendor. At some point, more data does not help, and it might even affect the accuracy on other vendors because the model optimizes (overfits) so much for one vendor.
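The 20% guideline above can be checked before labeling starts. The following is a minimal sketch, assuming you already have one vendor name per document in a list; the function name and the input shape are hypothetical and not part of any UiPath tool.

```python
from collections import Counter


def overrepresented_vendors(vendor_per_document, max_share=0.20):
    """Return {vendor: share} for every vendor exceeding max_share of the dataset."""
    counts = Counter(vendor_per_document)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        vendor: count / total
        for vendor, count in counts.items()
        if count / total > max_share
    }


if __name__ == "__main__":
    # 20 documents: vendor "A" supplies 5 of them (25%), which is above the 20% limit.
    vendors = ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + [f"V{i}" for i in range(10)]
    print(overrepresented_vendors(vendors))
```

Any vendor reported here is a candidate for downsampling in the Training set.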
Data should be chosen to cover the diversity of the documents likely to be seen in the production workflow. For example, if you get invoices in English but some of them come from the US, India and Australia, they probably look different, so you need to make sure you have document samples from all three. This is relevant not only for the model training itself, but also for labeling purposes. When you label the documents you might discover that you need to extract new, different fields from some of these regions, like GSTIN code from India, or ABN code from Australia. See more in the Define fields section.
For Training sets, the pages and the number of pages are the most important. For Evaluation sets, we refer only to documents and the number of documents. The scores for releases v2021.10 and after are calculated per document.
Evaluation datasets can be smaller. They can be 50-100 documents (or even just 30-50 in low diversity scenarios), and they can grow over time to be a few hundred documents. It is important that they are representative of the production data flow. So, a good approach is to randomly select from the documents processed in the RPA workflow. Even if some vendors are overrepresented, that is ok. For example, if a single vendor represents 20% of your invoice traffic, it is ok to have that vendor be 20% of your Evaluation set too, so the Evaluation metrics approximate your business metric, i.e. reduction in person-months spent on manual document processing.
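Random selection from production traffic can be as simple as the sketch below. It assumes you have a list of document paths that went through the RPA workflow; the function name and file names are hypothetical. Because the selection is uniform, vendors keep roughly their production share in the Evaluation set, which is what the paragraph above asks for.

```python
import random


def pick_evaluation_documents(production_documents, size=100, seed=None):
    """Randomly select up to `size` documents from the production document list."""
    documents = list(production_documents)
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return rng.sample(documents, min(size, len(documents)))


if __name__ == "__main__":
    processed = [f"invoice_{i:04d}.pdf" for i in range(500)]
    evaluation_set = pick_evaluation_documents(processed, size=50, seed=1)
    print(len(evaluation_set))
```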
When importing Evaluation data into Document Manager, you need to check the “Make this an evaluation set” box on the Import dialog window. This guarantees that the data is held out when training, and also you can easily export it for running Evaluations using the evaluation-set option in the Filter dropdown in Document Manager.
Starting with the 21.9 Preview release in Automation Cloud and the 21.10 GA On Premises release, Document Manager has switched to handling multi-page documents rather than each page as a separate entity. This is a major change, especially for Evaluations, which were its main motivation. Evaluations need to be representative of the runtime process, and at runtime, multipage documents are processed as a whole, rather than as separate pages. To benefit from this enhancement in ML Packages with version 21.10 or later, you just need to leave the "Backward compatible export" box unchecked in the Export dialog. If you check this box, the dataset is exported in the old page-by-page way, and evaluation scores are not as representative of the runtime performance.
Defining fields is a conversation that needs to happen with the Subject Matter Expert or Domain Expert who owns the business process itself. For invoices, it would be the Accounts Payable process owner. This conversation is critical: it needs to happen before any documents are labeled, to avoid wasted time, and it requires looking together at a minimum of 20 randomly chosen document samples. A one-hour slot needs to be reserved for this, and it often needs to be repeated after a couple of days, as the person preparing the data runs into ambiguous situations or edge cases.
It is not uncommon that the conversation starts with the assumption that you need to extract, let's say, 10 fields, and later you end up with 15. Some examples are described in the subsections below.
Some key configurations you need to be aware of:
- Content type
- This is the most important setting as it determines the postprocessing of the values, especially for dates (detects if the format is US-style or non-US style, and then formats them as yyyy-mm-dd) and for numbers (detects the decimal separator – comma or period). ID numbers clean up anything coming before a colon or hash symbol. String content type performs no cleanup and can be used when you want to do your own parsing in the RPA workflow.
- Multi-line checkbox
- This is for parsing strings like addresses that may appear on more than 1 line of text.
- Hidden fields
- Fields marked as Hidden can be labelled but they are held out when data is exported, so the model cannot be trained on them. This is handy when labeling a field is a work in progress, when it is too rare, or when it is low priority.
- Scoring
- This is relevant only for Evaluation pipelines, and it affects how the accuracy score is calculated. A field that uses Levenshtein scoring is more permissive: if a single character out of 10 is wrong, the score is 0.9. Exact Match scoring is stricter: a single wrong character leads to a score of zero. All fields are by default Exact Match. Only String type fields have the option to select Levenshtein scoring.
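To make the difference between the two scoring modes concrete, here is a small sketch. The edit-distance function is the standard dynamic-programming algorithm; dividing by the length of the longer string is an assumption made for illustration, and the exact normalization used by the Evaluation pipeline is not stated here. It does reproduce the example above, where one wrong character out of 10 gives 0.9.

```python
def levenshtein_distance(a, b):
    """Classic edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def levenshtein_score(expected, predicted):
    """Similarity in [0, 1]; normalizing by the longer string is an assumption."""
    longest = max(len(expected), len(predicted))
    if longest == 0:
        return 1.0
    return (longest - levenshtein_distance(expected, predicted)) / longest


def exact_match_score(expected, predicted):
    """1.0 only when the two values are identical, otherwise 0.0."""
    return 1.0 if expected == predicted else 0.0


if __name__ == "__main__":
    expected, predicted = "INV-123456", "INV-123457"  # one wrong character out of 10
    print(levenshtein_score(expected, predicted), exact_match_score(expected, predicted))
```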
A total amount might seem straightforward enough, but utility bills contain many amounts. Sometimes you need the total amount to be paid. Other times you need only the current bill amount – without the amounts carried forward from previous billing periods. In the latter case, you need to label differently even if the current bill and total amount can be the same. The concepts are different and the amounts are often different.
Each field represents a different concept, and each needs to be defined as cleanly and crisply as possible, so there is no confusion. If a human can confuse them, so can the ML model.
Moreover, the current bill amount can sometimes be composed of a few different amounts, fees, and taxes and may not appear individualized anywhere on the bill. A possible solution to this is to create two fields: a previous-charges field and a total field. These two always appear as distinct explicit values on the utility bill. Then the current bill amount can be obtained as the difference between the two. You might even want to include all 3 fields (previous-charges, total, and current-charges) in order to be able to do some consistency checks in cases where the current bill amount appears explicitly on the document. So you could go from one to three fields in some cases.
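The subtraction and the optional consistency check described above could look like the following sketch. The function name is hypothetical and the arguments mirror the three fields from the text (total, previous-charges, current-charges); amounts are passed as strings and handled with `Decimal` to avoid floating-point rounding surprises.

```python
from decimal import Decimal


def current_charges(total, previous_charges, labelled_current=None):
    """Derive the current bill amount and cross-check it when it is printed explicitly.

    Returns (derived_amount, is_consistent). When no explicit current-charges value
    was extracted, there is nothing to compare against and is_consistent is True.
    """
    derived = Decimal(total) - Decimal(previous_charges)
    consistent = labelled_current is None or Decimal(labelled_current) == derived
    return derived, consistent


if __name__ == "__main__":
    print(current_charges("150.25", "40.00"))
    print(current_charges("150.25", "40.00", labelled_current="100.00"))
```

A failed check is a good reason to route the document to manual review.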
PO numbers can appear as single values for an invoice, or they might appear as part of the table of line items on an invoice, where each line item has a different PO number. In this case, it could make sense to have two different fields: po-no and item-po-no. By keeping each field visually and conceptually consistent, the model is likely to do a much better job. However, you need to make sure both are well represented in your Training and your Evaluation datasets.
The company name usually appears at the top of an invoice or a utility bill, but sometimes it might not be readable because there is just a logo, and the company name is not explicitly written out. There could also be some stamp, or handwriting, or wrinkle over the text. In these cases, people might label the name that appears at the bottom right, in the 'Remit payment to' section of the payslip on utility bills. That name is often the same, but not always, since it is a different concept. Payments can be made to some other parent or holding company, or other affiliate entity, and it is visually different on the document. This might lead to poor model performance. In this case, you should create two fields, vendor-name and payment-addr-name. Then you can look both up in a vendor database and use the one that matches, or use payment-addr-name when the vendor-name is missing.
There are two distinct concepts to keep in mind: table rows and lines of text. A table row includes all the values of all column fields which belong together in that row. Sometimes they may all be part of the same line of text going across the page. Other times they may be on different lines.
If a table row consists of more than one line of text, then you need to group all values on that table row using the “/” hotkey. When you do this, a green box appears, covering the entire table row. Here is an example of a table where the top two rows consist of multiple lines of text and need to be grouped using the “/” hotkey, while the third row is a single line of text and does not need to be grouped.
Here is an example of a table where each table row consists of a single line of text. You do not need to group these using the “/” hotkey because this is done implicitly by Document Manager.
Split Items is a setting that appears only for Column fields, and it helps the model know when a line item ends and another begins. As a human looking at a document, to tell how many rows there are, you probably look at how many amounts are on the right side. Each amount refers to a line item in general. This indicates that line-amount is a column where you should enable Split Items. Other columns can also be marked, in case the OCR misses the line-amount, or the model does not recognize it: quantity and unit-price are usually also marked as 'Split Items'.
The most important configuration is the Content Type. The content types other than String include:
- Number
- Date
- ID Number
These impact post-processing, especially the cleanup, the parsing, and the formatting. The most complex is the Date formatting, but Number formatting also requires determining the decimal separator and the thousands separator. In some cases, if parsing fails, your option is to report the issue to UiPath support and to fall back on the String content type, which does no parsing. In that case, you need to parse the value in your RPA workflow logic.
Another relevant configuration is the Multi-line checkbox, which is relevant mainly for String type fields. Whenever some other field produces unexpected results or no results, the first thing to try is to change it to String Multi-line field, to see the unaltered output of the model prediction.
When labeling Training data, you need to focus on the bounding boxes of the words in the document pane of Document Manager. The parsed values in the right or top sidebars are not important as they are not used for training.
All fields should be labelled, even if a field appears multiple times on a page, as long as they represent the same concept (see the Define fields section above).
When the OCR misses a word or gets a few characters wrong, just label the bounding box if there is one, and if not, then just skip it and keep going. There is no possibility to add a word in Document Manager because even if you did, the word would still be missing at run time, so adding it doesn't help the model at all.
As you label, remain vigilant about fields that may have multiple or overlapping meanings/concepts, in case you need to split a field into two separate fields. Also watch for fields that you do not explicitly need, but which, if labelled, might help you implement validation or self-consistency checks in the RPA workflow. Typical examples are quantity, unit-price, and line-amount on invoice line items. Line-amount is the product of quantity and unit-price, which makes it very useful for checking consistency without the need for confidence levels.
When labeling Evaluation datasets (also called Test datasets) you need to focus on something slightly different from labeling Training datasets. Whereas for Training datasets only the bounding boxes of the fields on the document matter, for Evaluation datasets only the values of the fields matter. You may edit them by clicking on the value in the right or top sidebar and editing it. To return to the automatically parsed value, click on the lock icon.
For convenience, speed, and to avoid typos, we recommend clicking on the boxes on the document when labeling and only making corrections manually. Typing full values manually is slower and more error-prone.
Exporting the entire dataset, including both Training and Test batches, is allowed because the Training pipelines in AI Center ignore Test data. However, Evaluation pipelines run the evaluation on the whole evaluation dataset, regardless of whether it consists of training or test data. The type of a given document is displayed right under the file name, at the top center of the Document Manager window.
When evaluating an ML model, the most powerful tool is the evaluation.xlsx file generated in the artifacts/eval_metrics folder. In this Excel file you can see which predictions are failing and on which files, and you can see immediately whether it is an OCR error, an ML extraction error, or a parsing error, and whether it can be fixed by simple logic in the RPA workflow or requires a different OCR engine, more training data, or improving the labelling or the field configurations in Document Manager.
This Excel file is also very useful to identify the most relevant business rules you need to apply in the RPA workflow to catch common mistakes for routing to Validation Station in Action Center for manual review. Business rules are by far the most reliable way to detect errors.
For those errors that cannot be caught by business rules, you may also use confidence levels. The Excel file also contains confidence levels for each prediction, so you can use Excel features like sorting and filtering to determine what a good confidence threshold is for your business scenario.
Overall, the evaluation.xlsx Excel file is a key resource you need to focus on to get the best results from your AI automation.
In this step, you should be concerned with model errors and how to detect them. There are two main ways to detect errors:
- through enforcing business rules
- through enforcing a minimum confidence level threshold
The most effective and reliable way to detect errors is by defining business rules. Confidence levels can never be 100% perfect. There is always a small, but non-zero percentage of correct predictions with low confidence or wrong predictions with high confidence. Also, and most importantly, a missing field has no confidence, so a confidence threshold can never catch errors whereby a field is not extracted at all. Consequently, confidence level thresholds should only be used as a fallback, a safety net, but never as the main way to detect business-critical errors.
Examples of business rules:
- Net amount plus Tax amount must equal Total amount
- Total amount must be greater than or equal to Net amount
- Invoice number, Date, Total amount (and other fields) must be present
- PO number (if present) must exist in PO database
- Invoice date must be in the past and cannot be more than X months old
- Due date must be in the future and not more than Y days/months away
- For each line item, the quantity multiplied by unit price must equal the line amount
- Sum of line amounts must equal net amount or total amount
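A few of the rules listed above are sketched below in Python purely for illustration; in practice they are implemented in the RPA workflow. The field names, the dictionary layout of the extracted invoice, the rounding tolerance, and the 180-day age limit are all assumptions introduced for this example.

```python
from datetime import date
from decimal import Decimal

REQUIRED_FIELDS = ("invoice-no", "date", "total")  # hypothetical field names


def check_business_rules(invoice, today, max_age_days=180, tolerance=Decimal("0.01")):
    """Return a list of human-readable rule failures for one extracted invoice."""
    failures = []
    for name in REQUIRED_FIELDS:
        if invoice.get(name) is None:
            failures.append(f"missing required field: {name}")

    net = invoice.get("net-amount")
    tax = invoice.get("tax-amount")
    total = invoice.get("total")
    if None not in (net, tax, total) and abs(net + tax - total) > tolerance:
        failures.append("net amount + tax amount != total amount")
    if None not in (net, total) and total < net:
        failures.append("total amount is smaller than net amount")

    invoice_date = invoice.get("date")
    if invoice_date is not None:
        age_days = (today - invoice_date).days
        if age_days < 0 or age_days > max_age_days:
            failures.append("invoice date is in the future or too old")

    for index, item in enumerate(invoice.get("items", []), start=1):
        expected = item["quantity"] * item["unit-price"]
        if abs(expected - item["line-amount"]) > tolerance:
            failures.append(f"line {index}: quantity * unit price != line amount")
    return failures
```

Returning the list of failed rules, rather than a single pass/fail flag, makes it possible to show the human validator exactly which check failed. Note that the missing-field check catches errors that a confidence threshold cannot see.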
In particular, the confidence levels of column fields should rarely be used as an error detection mechanism. Column fields (e.g., line items on invoices or POs) can have dozens of values, so a minimum threshold over so many values is especially unreliable: at least one value is more than likely to have low confidence. This would lead to most or all documents being routed to human validation, often unnecessarily.
Business rules must be enforced as part of the RPA workflow, and the business rule failure is passed to the human validator so as to direct their attention and accelerate the process.
After the business rules have been defined, sometimes there might remain a small number of fields for which there are no business rules in place, or for which the business rules are unlikely to catch all errors. For this, you may need to use a confidence threshold as a last resort.
The main tool to set this threshold is the Evaluation pipeline feature in AI Center, and specifically, the Excel spreadsheet which is output by the Evaluation pipeline in the Outputs > artifacts > eval_metrics folder.
This spreadsheet contains a column for each field and a column for the confidence level of each prediction. You can add a column called min_confidence that takes the minimum of all the confidences over all fields that are important for your business process and are not already covered by business rules. For instance, you may not want to put a threshold on the line items confidences, but rather on vendor name, total amount, date, due date, invoice number, and other essential fields. By sorting the table based on the min_confidence column you may see where the errors start appearing and set a threshold above that level to ensure that correctly extracted documents are sent straight through.
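The same min_confidence analysis can be sketched in code. The row layout below (a `confidence` dictionary per document and an `all_correct` flag for the key fields) is a hypothetical simplification and not the actual column layout of the evaluation spreadsheet; the function names are also made up for this example.

```python
def add_min_confidence(rows, key_fields):
    """Attach min_confidence: the lowest confidence among the key fields of each document."""
    for row in rows:
        row["min_confidence"] = min(row["confidence"][field] for field in key_fields)
    return rows


def highest_failing_confidence(rows):
    """Return the highest min_confidence seen on a document with an error, or None.

    In the evaluation data, every document strictly above this value was correct,
    so the threshold should be set above it.
    """
    failing = [row["min_confidence"] for row in rows if not row["all_correct"]]
    return max(failing) if failing else None


if __name__ == "__main__":
    rows = [
        {"file": "a.pdf", "confidence": {"total": 0.99, "date": 0.97}, "all_correct": True},
        {"file": "b.pdf", "confidence": {"total": 0.62, "date": 0.95}, "all_correct": False},
        {"file": "c.pdf", "confidence": {"total": 0.91, "date": 0.80}, "all_correct": True},
    ]
    add_min_confidence(rows, ["total", "date"])
    for row in sorted(rows, key=lambda r: r["min_confidence"]):
        print(row["file"], row["min_confidence"], row["all_correct"])
    print(highest_failing_confidence(rows))
```

How far above that value to set the threshold is a business decision that trades automation rate against the risk of undetected errors.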
Validation Station data can help improve the model predictions, yet, often, it turns out that most errors are not due to the model itself but to the OCR, labelling errors, inconsistencies, or to postprocessing issues (e.g., date or number formatting). So, the first key aspect is that Validation Station data should be used only after the other Data Extraction Components have been verified and optimized to ensure good accuracy, and the only remaining area of improvement is the model prediction itself.
The second key aspect is that Validation Station data has a lower information density than the data labeled in Document Manager. Fundamentally, the Validation Station user only cares about getting the right value once. If an invoice has 5 pages, and the invoice number appears on every page, the Validation Station user validates it only on the first page. So, 80% of the values remain unlabelled. In Document Manager, all the values are labelled.
Finally, keep in mind Validation Station data needs to be added to the original manually labelled dataset, so that you always have a single training dataset which increases in size over time. You always need to train on the ML Package with the 0 (zero) minor version, which is the version released by UiPath Out of the Box.
It is often wrongly assumed that the way to use Validation Station data is to iteratively retrain the previous model version, so the current batch is used to train package X.1 to obtain X.2. Then the next batch trains on X.2 to obtain X.3 and so on. This is the wrong way to use the product. Each Validation Station batch needs to be imported into the same Document Manager session as the original manually labeled data, making a larger dataset, which must always be used to train on the X.0 ML Package version.
Validation Station data can potentially be of much higher volume since it is used in the production workflow. As a consequence, you need a guideline for how much data is likely to be useful, given that model Training requires time and infrastructure. Also, you do not want the dataset to become overwhelmed with Validation Station data because this can degrade the quality of the model due to the information density issue mentioned above.
The recommendation is to add a maximum of 2-3X the number of pages of Document Manager data and, beyond that, only cherry pick those vendors or document samples where you see major failures. If there are known major changes to the production data, such as a new language, or a new geographic region being onboarded to the business process (expanding from US to Europe or South Asia), then representative data for those languages and regions should be added to Document Manager for manual labelling. Validation Station data is not appropriate for such major scope expansion.
Here is a sample scenario. You have chosen a good OCR engine, labeled 500 pages in Document Manager, resulting in good performance and you have deployed the model in a production RPA workflow. Validation Station is starting to generate data. You should randomly select up to a maximum of 1000-1500 pages from Validation Station and import them into the Document Manager together with the first 500 pages and retrain your ML model. After that, you should run an Evaluation Pipeline to make sure the model actually improved, and then you should deploy the new model to production.
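The cap used in this scenario (2-3X the number of Document Manager pages) can be applied with a small random-sampling helper. This is a sketch under the assumption that Validation Station pages are available as a list; the function and parameter names are hypothetical.

```python
import random


def sample_validation_pages(validation_pages, document_manager_page_count, factor=2, seed=None):
    """Randomly keep at most factor x the number of Document Manager pages."""
    pages = list(validation_pages)
    limit = factor * document_manager_page_count
    if len(pages) <= limit:
        return pages
    return random.Random(seed).sample(pages, limit)


if __name__ == "__main__":
    # 500 manually labeled pages, 5000 pages collected from Validation Station.
    collected = [f"vs_page_{i}" for i in range(5000)]
    print(len(sample_validation_pages(collected, 500, factor=2, seed=0)))
    print(len(sample_validation_pages(collected, 500, factor=3, seed=0)))
```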
The Auto-Fine-tuning Loop is a Preview capability that becomes useful for maintaining a high-performing model which you have already created using the steps described above. To ensure that auto-fine-tuning produces better versions of the model, it is critical that you have a good Evaluation dataset and that you use an automatically rescheduled Full Pipeline which runs both Training and Evaluation at the same time. In this way, you can see if the most recent Training produced a more accurate model than the previous one, and if so, you are ready to deploy the new model to the ML Skill invoked by the Robots in your business process.
The Training dataset keeps changing as more data comes in and Document Manager exports periodically as scheduled in the Scheduled Export dialog. The Evaluation runs on the same Evaluation dataset you specify in the pipeline. Evaluation datasets never change automatically; they always need to be curated, labeled, and exported manually. You should change your Evaluation set rarely so that accuracy scores can be compared between different Training runs.
AI Center offers the possibility to automatically update the ML Skill when a new version of an ML Package is retrained. However, this automatic update does not take into account the score of the Full Pipeline, and therefore it is not recommended to use this feature with Document Understanding auto-retraining pipelines.
As mentioned above in the Create an Evaluation Dataset section, Evaluation Pipeline implementations for ML Packages release 21.10 or later calculate scores on a per-document basis - which reflects accurately the results you see in an RPA workflow. This assumes your dataset was labelled on a per-document basis in Document Manager. You can tell if a multi-page document is labeled on a per-document basis if you can scroll naturally through the pages like in a regular PDF reader. If you need to click next to pass from one page to the next, then each page is considered a separate document.
Make sure to use the Document Understanding Process from the Templates section in the Studio start screen in order to apply best practices in Enterprise RPA architecture.