The power of Machine Learning Models is that they are defined by training data rather than by explicit logic expressed in computer code. This means that extraordinary care is needed when preparing datasets because a model is only as good as the dataset that was used to train it. In that sense, what UiPath Studio is to RPA workflows, Document Manager is to Machine Learning capabilities. Both require some experience to be used effectively.
An ML Model can extract data from a single type of document, though it might cover several different languages. It is essential that each field (Total Amount, Date, etc.) has a single consistent meaning. If a human might get confused about the correct value for a field, then an ML Model will be confused too.
Ambiguous situations might appear. For instance, is a utility bill just another type of invoice? Or are these two different document types that require two different ML Models? If the fields you need to extract are the same (i.e., they have the same meaning), then you can treat them as a single document type. However, if you need to extract different fields for different reasons (different business processes), that might indicate you need to treat these as two different document types and, hence, train two different models.
Whenever in doubt, start by training a single model, but keep the documents in different Document Manager batches (see the Filter dropdown at the top center of the Document Manager view) so you can easily separate them later if needed. In this way, the labeling work is not lost. When it comes to ML Models, the more data, the better. So, having a single model with ample data is a good place to start.
Document Manager can be used to build two types of datasets: training datasets and evaluation datasets. Both are essential for building a high-performing ML Model and you need to budget time and effort for creating and maintaining both. You cannot obtain a high-performing ML model without an evaluation dataset that is representative of the production document traffic.
Each dataset type is labeled in a different way:
Training datasets rely only on the bounding boxes of the words on the page representing the different pieces of information you need to extract.
- When labeling a Training set, focus only on the page itself and the word boxes.
Evaluation datasets rely only on the values of the fields, which appear in the sidebar (for Regular fields) or the top bar (for Column fields).
- When labeling an Evaluation set, focus on the values under the field names in the sidebar or top bar. That does not mean you need to type them in manually. In fact, typing is prone to typographical errors, so it is still recommended that you label by clicking on the boxes on the page, but make sure to check that the values are correct.
More details on how to conduct proper Evaluations can be found below.
Data extraction relies on the following components:
- Optical Character Recognition
- Word and Line building
- Grouping characters into words and words into left-to-right lines of text
- Machine Learning Model prediction for each word/box on the page
- Cleanup, parsing, and formatting of the spans of text
- For example, grouping words on several lines into an address, formatting a date to the standard yyyy-mm-dd format
- Applying an algorithm for selecting which value is returned
- For the cases in which the document has 2 or more pages and some fields appear on more than one page
For the best outcome in terms of automation rate (percent reduction of manual work measured in person-months per year required to process your document flow) you need to carefully follow these steps:
- Choose the best OCR engine for your documents
- This influences both the OCR and the Word and Line building (which depends partially on the OCR), and everything downstream of course.
- Select a well balanced and representative dataset for Training
- Select a representative dataset for Evaluation
- Define the fields to be extracted
- Configure the fields
- Label the Training dataset
- Label the Evaluation dataset
- Train and Evaluate the model in AI Center
- Define and implement the business rules for processing model output
- (Optional) Choose confidence threshold(s) for the Extraction
- Training using data from Validation Station
- The Auto-Fine-tuning Loop (Preview)
- Deploy your automation
To choose an OCR engine, you should create different Document Manager sessions, configure different OCR engines, and try to import the same files into each of them to examine differences. You should focus on those areas which you intend to extract. For example, if you need to extract company names that appear as part of logos on invoices you may want to see which OCR engine performs better on the text in logos.
Your default option should be UiPath Document OCR since it is included with Document Understanding licenses at no charge. However, in cases where some unsupported languages are required, or some very hard-to-read documents are involved, you might want to try Google Cloud (Cloud only) or Microsoft Read (Cloud or On Premises), which have better language coverage. These engines come at a cost, although a low one. If the accuracy is higher on some critical data fields for your business process, it is strongly recommended to use the best OCR available. It will be well worth your time later on, since everything downstream depends on it.
Please be aware that the Digitize Document activity has the ForceApplyOCR setting set to False by default. This means that, by default, the OCR engine is not used at all on native .pdf documents, and the text is extracted directly from the .pdf document itself. However, many native .pdf documents may contain logos or even headers or footers which are not picked up. These may contain the identifying information of the company, such as its name, address, VAT code or payment information, which is highly relevant in business processes. If your process has this situation, then set ForceApplyOCR to True to ensure that all text is detected, though it might slow down your process.
Machine Learning technology has the main benefit of being able to handle complex problems with high diversity. When estimating the size of a training dataset, one looks first at the number of fields and their types and the number of languages. A single model can handle multiple languages as long as they are not Chinese/Japanese/Korean. CJK scenarios will generally require separate Training datasets and separate models.
There are 3 types of fields:
- Regular fields (date, total amount)
- For Regular fields, you need at least 20-50 samples per field. So, if you need to extract 10 regular fields, you would need at least 200-500 samples. If you need to extract 20 regular fields, you would need at least 400-1000 samples. The number of samples you need grows with the number of fields: each additional field adds roughly another 20-50 samples.
- Column fields (item unit price, item quantity)
- For Column fields, you need at least 50-200 samples per column field, so for 5 column fields, with clean and simple layouts you might get good results with 300 samples, but for highly complex and diverse layouts, it might require over 1000. To cover multiple languages, you need at least 200-300 samples per language, assuming they cover all the different fields. So, for 10 header fields and 4 column fields with 2 languages, 600 samples might be enough (400 for the columns and headers plus 200 for the additional language), but some cases might require 1200 or more.
- Classification fields (currency)
- Classification fields generally require at least 10-20 samples from each class.
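The per-field guidelines above can be turned into a rough estimate. The following is a minimal sketch, not an official sizing formula: the function name is hypothetical, and adding the per-field and per-language contributions together is an assumption inferred from the worked example (400 for the columns and headers plus 200 for the additional language). Classification fields are left out because their requirement is stated per class rather than per field.

```python
def estimate_training_samples(regular_fields, column_fields, languages=1):
    """Return a (low, high) range of training samples from the rules of thumb above."""
    # Per-field ranges quoted in the guidelines.
    regular_low, regular_high = 20, 50
    column_low, column_high = 50, 200
    # Extra samples for each language beyond the first (assumption: additive).
    language_low, language_high = 200, 300

    extra_languages = max(languages - 1, 0)
    low = (regular_fields * regular_low
           + column_fields * column_low
           + extra_languages * language_low)
    high = (regular_fields * regular_high
            + column_fields * column_high
            + extra_languages * language_high)
    return low, high


# 10 header fields, 4 column fields, 2 languages
print(estimate_training_samples(regular_fields=10, column_fields=4, languages=2))
# prints (600, 1600)
```

The low end matches the 600 samples from the example above. Treat the range as a starting point and add data as needed.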
The above guidelines assume that you are solving a high diversity scenario like invoices or purchase orders with dozens to hundreds or thousands of layouts. However, if you are solving a low-diversity scenario like a tax form or invoices with very few layouts (under 5-10), then the dataset size is determined more by the number of layouts. In this case, you should start with 20-30 pages per layout and add more if needed, especially if the pages are very dense with a large number of fields to be extracted. For instance, creating a model for extracting 10 fields from 2 layouts might require 60 pages, but if you need to extract 50 or 100 fields from 2 layouts, then you might start with 100 or 200 pages and add more as needed to get the desired accuracy. In this case, the distinction between regular fields and column fields is less important.
ML technology is designed to handle high diversity scenarios. Using it to train models on low diversity scenarios (1-10 layouts) requires special care to avoid brittle models which are sensitive to slight changes in the OCR text. One way to avoid this is to make sure that there is some deliberate variability in the training documents by printing them and then scanning or photographing them using mobile phone scanner apps. The slight distortions or changing resolutions make the model more robust.
In addition, these estimates assume that most pages contain all or most of the fields. In cases where you have documents with multiple pages but most fields are on a single page, then the relevant number of pages is the number of examples of that one page where most fields appear.
The numbers above are general guidelines, not strict requirements. In general, you can start with a smaller dataset, and then keep adding data until you get good accuracy. This is especially useful in order to parallelize the RPA work with the model building. Also, a first version of the model can be used to prelabel additional data (see Settings view and Predict button in Document Manager) which can accelerate labeling additional Training data.
You do not need to have every single layout represented in a training set. In fact, it is possible that most layouts in your production document flow have zero samples in your training set, or perhaps 1 or 2 samples. This is desirable, because you want to leverage the power of AI to understand the documents and be able to make correct predictions on documents that it has not seen during training. A large number of samples per layout is not mandatory because most layouts might either not be present at all, or be present only 1 or 2 times, and the model would still be able to predict correctly, based on the learning from other layouts.
A common situation is extracting data from invoices, but you also have 2 more Regular fields and 1 more Column field which the out-of-the-box Invoices model does not recognize. In this case, you need a training set with 50 samples per new Regular field and a minimum of 100 samples per new Column field. So, 200 pages is a good start. This is usually a much smaller dataset than if you had trained all the fields from scratch.
These 200 pages need to be labeled fully, including the new fields, which the out-of-the-box model does not recognize, and also the original fields, which the out-of-the-box model does recognize.
Some fields might occur on every document (e.g., date, invoice number) while some fields might appear only on 10% of the pages (e.g., handling charges, discount). In these cases, you need to make a business decision. If those rare fields are not critical to your automation, you can get away with a small number of samples (10-15) of that particular field, i.e. pages which contain a value for that field. However, if those fields are critical, then you need to make sure that you include in your Training set at least 30-50 samples of that field to make sure to cover the full diversity.
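One way to check this before training is to count how many labeled pages contain each field. The sketch below is illustrative only: the input format (one collection of field names per labeled page) and the function names are hypothetical, and the default minimums are taken from the low end of the ranges above.

```python
from collections import Counter


def field_sample_counts(labeled_pages):
    """Count, for each field, the pages that contain at least one value for it."""
    counts = Counter()
    for fields_on_page in labeled_pages:
        # A field labeled several times on one page still counts as one sample.
        counts.update(set(fields_on_page))
    return counts


def under_sampled_fields(labeled_pages, critical_fields, critical_min=30, other_min=10):
    """Return {field: (samples_found, samples_required)} for fields below their minimum."""
    counts = field_sample_counts(labeled_pages)
    shortfalls = {}
    for field in sorted(set(counts) | set(critical_fields)):
        required = critical_min if field in critical_fields else other_min
        if counts[field] < required:
            shortfalls[field] = (counts[field], required)
    return shortfalls


pages = [["date", "discount"]] * 12 + [["date"]] * 40
print(under_sampled_fields(pages, critical_fields={"discount"}))
# prints {'discount': (12, 30)}
```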
In the case of invoices, if a dataset contains invoices from 100 vendors, but half of the dataset consists only of invoices from one single vendor, then that is a very unbalanced dataset. A perfectly balanced dataset is where each vendor appears an equal number of times. Datasets do not need to be perfectly balanced, but you should avoid having more than 20% of your entire dataset coming from any single vendor. At some point, more data does not help, and it might even affect the accuracy on other vendors because the model optimizes (overfits) so much for one vendor.
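The 20% guideline is easy to check if you know which vendor each document comes from. This is a minimal sketch with a hypothetical function name; it assumes you can supply one vendor identifier per document in the dataset.

```python
from collections import Counter


def overrepresented_vendors(vendor_per_document, max_share=0.20):
    """Return {vendor: share} for vendors exceeding max_share of the dataset."""
    counts = Counter(vendor_per_document)
    total = sum(counts.values())
    return {
        vendor: count / total
        for vendor, count in counts.items()
        if count / total > max_share
    }


# 100 documents: half from vendor A, 10 from vendor B, the rest from 40 other vendors.
vendors = ["A"] * 50 + ["B"] * 10 + [f"V{i}" for i in range(40)]
print(overrepresented_vendors(vendors))
# prints {'A': 0.5}
```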
Data should be chosen to cover the diversity of the documents likely to be seen in the production workflow. For example, if you get invoices in English but some of them come from the US, India and Australia, they will likely look different, so you need to make sure you have samples from all three. This is relevant not only for the model training itself, but also for labeling purposes because as you label the documents you might discover that you need to extract new, different fields from some of these regions, like GSTIN code from India, or ABN code from Australia. See more in the Define fields section.
For Training sets, the pages and the number of pages matter most. For Evaluation sets, we refer only to documents and the number of documents. The scores for releases v2021.10 and later are calculated per document.
Evaluation datasets can be smaller. They can be 50-100 documents (or even just 30-50 in low diversity scenarios), and they can grow over time to be a few hundred documents. It is important that they are representative of the production data flow. So, a good approach is to randomly select from the documents processed in the RPA workflow. Even if some vendors will be overrepresented, that is ok. For example, if a single vendor represents 20% of your invoice traffic, it is ok to have that vendor be 20% of your Evaluation set too, so the Evaluation metrics approximate your business metric, i.e. reduction in person-months spent on manual document processing.
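Random selection from production traffic can be done with a few lines of code. The sketch below assumes you have a list of the file names processed by the RPA workflow; the function name and the fixed seed (used only to make the selection reproducible) are illustrative choices.

```python
import random


def sample_evaluation_documents(production_documents, sample_size=100, seed=0):
    """Pick a random, reproducible subset of production documents for evaluation."""
    documents = list(production_documents)
    rng = random.Random(seed)
    # Never ask for more documents than are available.
    size = min(sample_size, len(documents))
    return rng.sample(documents, size)


processed = [f"invoice_{i:04d}.pdf" for i in range(300)]
evaluation_set = sample_evaluation_documents(processed, sample_size=100)
print(len(evaluation_set))
# prints 100
```

Because the selection is uniform, a vendor that makes up 20% of the traffic will make up roughly 20% of the sample, which is the behavior described above.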
When importing Evaluation data into Document Manager you need to make sure you check the “Make this an evaluation set” box on the Import dialog window. In this way, you guarantee that the data is held out when training, and also you can easily export it for running Evaluations using the evaluation-set option in the Filter dropdown in Document Manager.
Starting with the 21.9 Preview release in Automation Cloud and the 21.10 GA On Premises release, Document Manager has switched to handling multi-page documents rather than each page as a separate entity as was the case previously. This is a major change, especially for Evaluations, which were its main motivation. Evaluations need to be representative of the runtime process, and at runtime multipage documents are processed as a whole, rather than as separate pages. To benefit from this enhancement in ML Packages with version 21.10 or later, you just need to leave the "Backward compatible export" box unchecked in the Export dialog. If you check this box, the dataset will be exported in the old page-by-page way, and evaluation scores will not be as representative of the runtime performance.
Defining fields is a conversation that needs to happen with the Subject Matter Expert or Domain Expert who owns the business process itself. For invoices, it would be the Accounts Payable process owner. This conversation is critical: it needs to happen before any documents are labeled to avoid wasted time, and it requires looking together at a minimum of 20 randomly chosen document samples. A one-hour slot needs to be reserved for this, and it often needs to be repeated after a couple of days, as the person preparing the data runs into ambiguous situations or edge cases.
It is not uncommon that the conversation starts with the assumption that you need to extract, let's say, 10 fields, and later you end up with 15. Some examples are described in the subsections below.
Some key configurations you need to be aware of:
- Content type
- This is the most important setting as it determines the postprocessing of the values, especially for dates (it detects whether the format is US-style or non-US style, and then formats them as yyyy-mm-dd) and for numbers (it detects the decimal separator, comma or period). The ID Number content type cleans up anything coming before a colon or hash symbol. The String content type performs no cleanup and can be used when you want to do your own parsing in the RPA workflow.
- Multi-line checkbox
- This is for parsing strings like addresses that may appear on more than 1 line of text.
- Hidden fields
- Fields marked as Hidden can be labeled but they are held out when data is exported, so the model will not be trained on them. This is handy when labeling a field is work in progress, when it is too rare, or when it is low priority.
- Scoring
- This is relevant only for Evaluation pipelines, and it affects how the accuracy score is calculated. A field that uses Levenshtein scoring is more permissive: if a single character out of 10 is wrong, the score will be 0.9. However, if scoring is Exact Match it is more strict: a single character wrong will lead to a score of zero. All fields are by default Exact Match. Only String type fields have the option to select Levenshtein scoring.
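The difference between the two scoring modes can be illustrated with a short sketch. This is not the Evaluation pipeline's implementation: normalizing the edit distance by the length of the longer string is an assumption, chosen because it reproduces the "one wrong character out of 10 gives 0.9" example above. All function names are hypothetical.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def exact_match_score(expected, predicted):
    """Strict scoring: any difference gives zero."""
    return 1.0 if expected == predicted else 0.0


def levenshtein_score(expected, predicted):
    """Permissive scoring: 1 minus the edit distance relative to the longer string."""
    longest = max(len(expected), len(predicted))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(expected, predicted) / longest


# One wrong character out of ten.
print(exact_match_score("INV-123456", "INV-123457"))
print(round(levenshtein_score("INV-123456", "INV-123457"), 2))
```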
A total amount might seem straightforward enough, but utility bills contain many amounts. Sometimes you need the total amount to be paid, other times you need only the current bill amount – without the amounts carried forward from previous billing periods. In the latter case, you need to label differently even if the current bill and total amount can be the same. The concepts are different and the amounts are often different.
Each field represents a different concept, and they need to be defined as cleanly and crisply as possible so there is no confusion. If a human might confuse it, the ML model will too.
Moreover, the current bill amount can sometimes be composed of a few different amounts, fees, and taxes and may not appear individualized anywhere on the bill. A possible solution to this is to create two fields: a previous-charges field and a total field. These two always appear as distinct explicit values on the utility bill. Then the current bill amount can be obtained as the difference between the two. You might even want to include all 3 fields (previous-charges, total, and current-charges) in order to be able to do some consistency checks in cases where the current bill amount appears explicitly on the document. So you could go from one to three fields in some cases.
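The derivation and the consistency check described above might look like this in post-processing logic. It is a sketch under stated assumptions: the function names and the one-cent tolerance are hypothetical, and the amounts are assumed to arrive as already-parsed decimal strings.

```python
from decimal import Decimal


def current_charges(total, previous_charges):
    """Derive the current bill amount as total minus previous charges."""
    return Decimal(total) - Decimal(previous_charges)


def amounts_are_consistent(total, previous_charges, extracted_current=None,
                           tolerance=Decimal("0.01")):
    """Cross-check an explicitly extracted current-charges value, if there is one."""
    if extracted_current is None:
        # The bill does not show the current amount, so there is nothing to compare.
        return True
    derived = current_charges(total, previous_charges)
    return abs(derived - Decimal(extracted_current)) <= tolerance


print(current_charges("182.40", "57.15"))
# prints 125.25
print(amounts_are_consistent("182.40", "57.15", extracted_current="152.25"))
# prints False
```

A failed check is a good reason to route the document to manual review.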
PO numbers can appear as single values for an invoice, or they might appear as part of the table of line items on an invoice, where each line item has a different PO number. In this case, it could make sense to have two different fields: po-no and item-po-no. By keeping each field visually and conceptually consistent, the model is likely to do a much better job. However, you need to make sure both are well represented in your Training and your Evaluation datasets.
The company name usually appears at the top of an invoice or a utility bill, but sometimes it might not be readable because there is just a logo, and the company name is not explicitly written out. Or there may be some other stamp or handwriting or wrinkle over the text. In these cases, people might label the name which appears at the bottom right, in the 'Remit payment to' section of the payslip on utility bills. That name is often the same, but not always since it is a different concept. Payments can be made to some other parent or holding company or other affiliate entity, and it is visually different on the document. This might lead to poor model performance. In this case, you would want to create 2 fields, vendor-name and payment-addr-name. Then you can look both up in a vendor database and use the one that matches or only use payment-addr-name when the vendor-name is missing.
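The lookup described above could be sketched as follows. The vendor database is represented here as a plain list of names, and the matching (case-insensitive, whitespace-normalized equality) is a deliberately simple assumption; a real process would likely need fuzzier matching. The function name is hypothetical.

```python
def resolve_vendor(vendor_name, payment_addr_name, vendor_database):
    """Return the database entry matching vendor-name, else payment-addr-name, else None."""
    def normalize(name):
        # Lowercase and collapse whitespace so trivial differences do not block a match.
        return " ".join(name.lower().split())

    known = {normalize(name): name for name in vendor_database}
    # Prefer vendor-name; fall back to payment-addr-name when it is missing or unmatched.
    for candidate in (vendor_name, payment_addr_name):
        if candidate and normalize(candidate) in known:
            return known[normalize(candidate)]
    return None


database = ["Acme Corp", "Acme Holdings Ltd"]
print(resolve_vendor(None, "acme  holdings ltd", database))
# prints Acme Holdings Ltd
```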
There are two distinct concepts to keep in mind: table rows and lines of text. A table row includes all the values of all columns fields which belong together in that row. Sometimes they may all be part of the same line of text going across the page. Other times they may be on different lines.
If a table row consists of more than one line of text, then you need to group all values on that table row using the “/” hotkey. When you do this a green box will appear covering the entire table row. Here is an example of a table where the top 2 rows consist of multiple lines of text and need to be grouped using the “/” hotkey, while the third row is a single line of text and does not need to be grouped.
Here is an example of a table where each table row consists of a single line of text. You do not need to group these using the “/” hotkey because this is done implicitly by Document Manager.
Split Items is a setting that appears only for Column fields, and it helps the model know when a line item ends and another begins. As a human looking at a document, to tell how many rows there are you probably look at how many amounts there are on the right side. Each amount refers to a line item in general. This is an indication that line-amount is a column where you should enable Split Items. Other columns can be also marked, in case the OCR misses the line-amount, or the model does not recognize it: quantity and unit-price are usually also marked as 'Split Items'.
The most important configuration is the Content Type, in particular the types other than String, such as:
- Number
- Date
- ID Number
These impact post-processing, especially the cleanup, the parsing, and the formatting. The most complex is the Date formatting, but also Number formatting requires determining the decimal point separator and thousands decorator. In some cases, if parsing fails your option is to report the issue to UiPath support and to fall back on the String content type, which does no parsing. In that case, you would need to parse the value in your RPA workflow logic.
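If you fall back on the String content type, the parsing moves into your own logic. The sketch below shows one possible heuristic for amounts, written in Python for illustration rather than as workflow code. It is not how the built-in Number parsing works: the rule that the last separator is the decimal point, and that a lone separator followed by exactly three digits is a thousands separator, is an assumption that will be wrong for some inputs (for example, "1,234" meant as 1.234).

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text):
    """Parse strings like '1,234.56' or '1.234,56' into a Decimal, or return None."""
    # Keep only digits, separators, and a minus sign (drops currency symbols, spaces).
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch in ",.-")
    last_comma = cleaned.rfind(",")
    last_period = cleaned.rfind(".")

    if last_comma != -1 and last_period != -1:
        # Both separators present: the one appearing last is the decimal separator.
        decimal_sep = "," if last_comma > last_period else "."
    elif last_comma != -1 or last_period != -1:
        sep = "," if last_comma != -1 else "."
        digits_after = len(cleaned) - cleaned.rfind(sep) - 1
        # Repeated, or followed by exactly three digits: assume thousands separator.
        is_thousands = cleaned.count(sep) > 1 or digits_after == 3
        decimal_sep = None if is_thousands else sep
    else:
        decimal_sep = None

    for sep in {",", "."} - {decimal_sep}:
        cleaned = cleaned.replace(sep, "")
    if decimal_sep:
        cleaned = cleaned.replace(decimal_sep, ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


print(parse_amount("1.234,56"))
# prints 1234.56
```

Returning None on failure lets the calling logic route the document to manual review instead of passing on a wrong number.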
Another relevant configuration is the Multi-line checkbox which is relevant mainly for String type fields. Whenever some other field produces unexpected results or no results, the first thing to try is to change it to String Multi-line field, to see the unaltered output of the model prediction.
When labeling Training data you need to focus on the bounding boxes of the words in the document pane of Document Manager. The parsed values in the right or top sidebars are not important as they are not used for training.
Whenever a field appears multiple times on a page, as long as they represent the same concept (see the Define fields section above), all of them should be labeled.
When the OCR misses a word or gets a few characters wrong, just label the bounding box if there is one, and if not, then just skip it and keep going. There is no possibility to add a word in Document Manager because even if you did, the word would still be missing at run time so adding it would not help the model at all.
As you label, remain vigilant about fields that may have multiple or overlapping meanings or concepts, in case you need to split a field into two separate fields. Also watch for fields that you do not explicitly need but which, if labeled, might help you implement validation or self-consistency checks in the RPA workflow. Typical examples are quantity, unit-price, and line-amount on invoice line items. Line-amount is the product of quantity and unit-price, which makes it very useful for checking consistency without the need for confidence levels.
When labeling Evaluation datasets (also called Test datasets) you need to focus on something slightly different from labeling Training datasets. Whereas for Training datasets only the bounding boxes of the fields on the document matter, for Evaluation datasets only the values of the fields matter. You may edit them by clicking on the value in the right or top sidebar and editing it. To return to the automatically parsed value click on the lock icon.
For convenience, speed, and to avoid typos we recommend clicking on the boxes on the document when labeling and only making corrections manually. Typing full values manually is slower and more error-prone.
Exporting the entire dataset, including both Training and Test batches, is allowed because the Training pipelines in AI Center ignore Test data. However, Evaluation pipelines will run the evaluation on the whole evaluation dataset regardless of whether it consists of training or test data. The type of a given document is displayed right under the file name, at the top center of the Document Manager window.
When evaluating an ML model, the most powerful tool is the evaluation.xlsx file generated in the artifacts/eval_metrics folder. In this Excel file you can see which predictions are failing and on which files, and you can see immediately whether it is an OCR error, an ML Extraction error, or a parsing error, and whether it may be fixed by simple logic in the RPA workflow or requires a different OCR engine, more training data, or improving the labeling or the field configurations in Document Manager.
This Excel file is also very useful to identify the most relevant business rules you need to apply in the RPA workflow in order to catch common mistakes for routing to Validation Station in Action Center for manual review. Business rules are by far the most reliable way to detect errors.
For those errors which cannot be caught by business rules, you may also use confidence levels. The Excel file also contains confidence levels for each prediction, so you can use Excel features like sorting and filtering to determine what a good confidence threshold is for your business scenario.
Overall, the evaluation.xlsx Excel file is a key resource you need to focus on in order to get the best results from your AI automation.
In this step, you should be concerned with model errors and how to detect them. There are 2 main ways to detect errors:
- through enforcing business rules
- through enforcing a minimum confidence level threshold
The most effective and reliable way to detect errors is by defining business rules. Confidence levels can never be 100% perfect: there will always be a small but non-zero percentage of correct predictions with low confidence or wrong predictions with high confidence. In addition, and perhaps most importantly, a missing field has no confidence, so a confidence threshold can never catch errors whereby a field is not extracted at all. Consequently, confidence level thresholds should only be used as a fallback, a safety net, but never as the main way to detect business-critical errors.
Examples of business rules:
- Net amount plus Tax amount must equal Total amount
- Total amount must be greater than or equal to Net amount
- Invoice number, Date, Total amount (and potentially other fields) must be present
- PO number (if present) must exist in PO database
- Invoice date must be in the past and cannot be more than X months old
- Due date must be in the future and not more than Y days/months
- For each line item the quantity multiplied by unit price must equal the line amount
- Sum of line amounts must equal net amount or total amount
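Several of the rules above can be sketched as a single check that returns the list of failed rules. This is an illustration in Python, not workflow code: the dictionary keys, the function name, the one-cent tolerance, and the default maximum age are all hypothetical, and values are assumed to be already parsed into Decimal and date objects.

```python
from datetime import date
from decimal import Decimal


def check_business_rules(invoice, today, max_age_days=365, tolerance=Decimal("0.01")):
    """Return a list of human-readable rule failures (empty if all rules pass)."""
    failures = []
    for name in ("invoice_number", "date", "total"):
        if invoice.get(name) is None:
            failures.append(f"missing required field: {name}")

    net, tax, total = invoice.get("net"), invoice.get("tax"), invoice.get("total")
    if net is not None and tax is not None and total is not None:
        if abs(net + tax - total) > tolerance:
            failures.append("net + tax != total")
    if net is not None and total is not None and total < net:
        failures.append("total < net")

    invoice_date = invoice.get("date")
    if invoice_date is not None:
        age_days = (today - invoice_date).days
        if age_days < 0 or age_days > max_age_days:
            failures.append("invoice date out of range")

    line_items = invoice.get("line_items", [])
    for index, item in enumerate(line_items):
        if abs(item["quantity"] * item["unit_price"] - item["line_amount"]) > tolerance:
            failures.append(f"line {index}: quantity * unit price != line amount")

    targets = [value for value in (net, total) if value is not None]
    if line_items and targets:
        line_sum = sum(item["line_amount"] for item in line_items)
        if all(abs(line_sum - target) > tolerance for target in targets):
            failures.append("sum of line amounts matches neither net nor total")
    return failures


invoice = {
    "invoice_number": "INV-1", "date": date(2024, 1, 10),
    "net": Decimal("100.00"), "tax": Decimal("19.00"), "total": Decimal("191.00"),
    "line_items": [{"quantity": Decimal("2"), "unit_price": Decimal("50.00"),
                    "line_amount": Decimal("100.00")}],
}
print(check_business_rules(invoice, today=date(2024, 2, 1)))
# prints ['net + tax != total']
```

Returning the failed rules, rather than a single pass/fail flag, makes it possible to show the human validator which check failed.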
In particular, the confidence levels of column fields should almost never be used as an error detection mechanism. Column fields (e.g., line items on invoices or POs) can have dozens of values, so setting a minimum threshold over so many values can be especially unreliable: at least one value is more than likely to have a low confidence. This would lead to most or all documents being routed to human validation, many times unnecessarily.
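A back-of-the-envelope calculation shows why. The sketch below assumes, purely for illustration, that each value clears the threshold independently with the same probability; real confidences are not independent, and the numbers used are made up rather than measured.

```python
def share_of_documents_passing(values_per_document, pass_probability):
    """Probability that every value in a document clears the threshold,
    assuming each value clears it independently with the same probability."""
    return pass_probability ** values_per_document


# 30 line-item values, each 98% likely to clear the threshold.
print(round(share_of_documents_passing(30, 0.98), 2))
```

Under these assumptions only a little over half of the documents would pass, even though each individual value almost always clears the threshold.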
Business rules must be enforced as part of the RPA workflow, and it would be ideal if the business rule failure is passed to the human validator so as to direct their attention and expedite the process.
After the business rules have been defined, sometimes there might remain a small number of fields for which there are no business rules in place, or for which the business rules are unlikely to catch all errors. For this, you may need to use a confidence threshold as a last resort.
The main tool to set this threshold is the Evaluation pipeline feature in AI Center, and specifically, the Excel spreadsheet which is output by the Evaluation pipeline in the Outputs > artifacts > eval_metrics folder.
This spreadsheet contains a column for each field and a column for the confidence level of each prediction. You can add a column called min_confidence which takes the minimum of all the confidences over all fields which are important for your business process and are not already covered by business rules. For instance, you may not want to put a threshold on the line items confidences, but rather on vendor name, total amount, date, due date, invoice number, and other essential fields. By sorting the table based on the min_confidence column you may see where the errors start appearing and set a threshold above that level to ensure that only correctly extracted documents are sent straight through.
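The same analysis can be sketched in code once the spreadsheet data is loaded. The row structure below (one dictionary per document with per-field confidences and correctness flags) is a hypothetical stand-in and does not mirror the actual column names in the Excel file; the function names are also illustrative.

```python
def min_confidence(row, fields):
    """Lowest confidence among the fields that matter for the business process."""
    return min(row["confidence"][field] for field in fields)


def highest_error_confidence(rows, fields):
    """Highest min_confidence observed on a document with an error, or None."""
    error_levels = [
        min_confidence(row, fields)
        for row in rows
        if any(not row["correct"][field] for field in fields)
    ]
    return max(error_levels) if error_levels else None


def straight_through(rows, fields, threshold):
    """Documents whose min_confidence is strictly above the threshold."""
    return [row for row in rows if min_confidence(row, fields) > threshold]


fields = ["total", "date"]
rows = [
    {"confidence": {"total": 0.99, "date": 0.97}, "correct": {"total": True, "date": True}},
    {"confidence": {"total": 0.62, "date": 0.95}, "correct": {"total": False, "date": True}},
    {"confidence": {"total": 0.91, "date": 0.80}, "correct": {"total": True, "date": False}},
    {"confidence": {"total": 0.93, "date": 0.90}, "correct": {"total": True, "date": True}},
]
level = highest_error_confidence(rows, fields)
print(level, len(straight_through(rows, fields, level)))
# prints 0.8 2
```

The threshold should be set above the returned level. On a small Evaluation set, leave some margin, since the next error may appear at a higher confidence.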
Validation Station data can help improve the model predictions, yet, in many cases, it turns out that most errors are not due to the model itself but to the OCR, labelling errors or inconsistencies, or to postprocessing issues (e.g., date or number formatting). So, the first key aspect is that Validation Station data should be used only after the other Data Extraction Components have been verified and optimized to ensure good accuracy, and the only remaining area of improvement is the model prediction itself.
The second key aspect is that Validation Station data has lower information density than the data labeled in Document Manager. Fundamentally, the Validation Station user only cares about getting the right value once. If an invoice has 5 pages, and the invoice number appears on every page, the Validation Station user validates it only on the first page. So, 80% of the values remain unlabeled. In Document Manager, all the values are labeled.
Finally, keep in mind Validation Station data needs to be added to the original manually labelled dataset, so that you always have a single training dataset which increases in size over time. You will always need to train on the ML Package with the 0 (zero) minor version, which is the version released by UiPath Out of the Box.
Always add Validation Station data to the same dataset and train on ML Package minor version 0 (zero)
It is often wrongly assumed that the way to use Validation Station data is to iteratively retrain the previous model version, so the current batch is used to train package X.1 to obtain X.2. Then the next batch trains on X.2 to obtain X.3 and so on. This is the wrong way to use the product. Each Validation Station batch needs to be imported into the same Document Manager session as the original manually labeled data, making a larger dataset, which must always be used to train on the X.0 ML Package version.
Validation Station data can potentially be of much higher volume since it is used in the production workflow. As a consequence, you need a guideline for how much data is likely to be useful, given that model Training requires time and infrastructure. Also, you do not want the dataset to become overwhelmed with Validation Station data because this can degrade the quality of the model due to the information density issue mentioned above.
The recommendation is to add a maximum of 2-3X the number of pages of Document Manager data and, beyond that, only cherry pick those vendors or samples where you see major failures. If there are known major changes to the production data, such as a new language, or a new geographic region being onboarded to the business process (expanding from US to Europe or South Asia), then representative data for those languages and regions should be added to Document Manager for manual labelling. Validation Station data is not appropriate for such major scope expansion.
Here is a sample scenario. You have chosen a good OCR engine, labeled 500 pages in Document Manager, resulting in good performance and you have deployed the model in a production RPA workflow. Validation Station is starting to generate data. You should randomly select up to a maximum of 1000-1500 pages from Validation Station and import them into the Document Manager together with the first 500 pages and retrain your ML model. After that, you should run an Evaluation Pipeline to make sure the model actually improved, and then you should deploy the new model to production.
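The random selection with a cap can be sketched as follows. The function name, the list-of-page-identifiers input, and the fixed seed are hypothetical; the default ratio of 3 is the upper end of the 2-3X recommendation above.

```python
import random


def select_validation_station_pages(validation_pages, document_manager_page_count,
                                    max_ratio=3, seed=0):
    """Randomly keep at most max_ratio times the Document Manager page count."""
    pages = list(validation_pages)
    cap = max_ratio * document_manager_page_count
    if len(pages) <= cap:
        return pages
    return random.Random(seed).sample(pages, cap)


# 500 manually labeled pages, 5000 pages available from Validation Station.
available = [f"vs_page_{i}" for i in range(5000)]
selected = select_validation_station_pages(available, document_manager_page_count=500)
print(len(selected))
# prints 1500
```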
The Auto-Fine-tuning Loop is a Preview capability that becomes useful for maintaining a high-performing model which you have already created using the steps described above. To ensure that auto-fine-tuning produces better versions of the model it is critical that you have a good Evaluation dataset and that you use an automatically rescheduled Full Pipeline which runs both Training and Evaluation at the same time. In this way, you can see if the most recent Training produced a more accurate model than the previous one, and if so, you are ready to deploy the new model to the ML Skill invoked by the Robots in your business process.
The Training dataset keeps changing as more data comes in and Document Manager exports periodically as scheduled in the Scheduled Export dialog. The Evaluation runs on the same Evaluation dataset you specify in the pipeline. Evaluation datasets never change automatically; they always need to be curated, labeled and exported manually. You should change your Evaluation set rarely so that accuracy scores can be compared between different Training runs.
AI Center offers the possibility to automatically update the ML Skill when a new version of an ML Package is retrained. However, this automatic update does not take into account the score of the Full Pipeline, and therefore it is not recommended to use this feature with Document Understanding auto-retraining pipelines.
As mentioned above in the Create an Evaluation Dataset section, Evaluation Pipeline implementations for ML Packages release 21.10 or later calculate scores on a per-document basis - which reflects accurately the results you see in an RPA workflow. This assumes your dataset was labelled on a per-document basis in Document Manager. You can tell if a multi-page document is labeled on a per-document basis if you can scroll naturally through the pages like in a regular PDF reader. If you need to click next to pass from one page to the next, then each page is considered a separate document.
Make sure to use the Document Understanding Process from the Templates section in the Studio start screen in order to apply best practices in Enterprise RPA architecture.