Document Understanding - Training high performing models

2023.4

Document Understanding User Guide

Training high performing models

The power of Machine Learning Models is that they are defined by training data rather than by explicit logic expressed in computer code. This means that extraordinary care is needed when preparing datasets, because a model is only as good as the dataset that was used to train it. In that sense, what UiPath® Studio is to RPA workflows, a Document Type session (in Document Understanding™ Cloud) is to Machine Learning capabilities. Both require some experience to be used effectively.

What can a data extraction ML model do?

An ML Model can extract data from a single type of document, though it might cover several different languages. It is essential that each field (Total Amount, Date, etc.) has a single consistent meaning. If a human can be confused about the correct value for a field, then an ML Model will be confused too.

Ambiguous situations can appear. For instance, is a utility bill just another type of invoice? Or are these two different document types that require two different ML Models? If the fields you need to extract are the same (i.e., they have the same meaning), then you can treat them as a single document type. However, if you need to extract different fields for different reasons (different business processes), then you need to treat these as two different document types and, hence, train two different models.

When in doubt, start by training a single model, but keep the documents in different Document Manager batches (check the Filter drop-down at the top of the view) so you can easily separate them later if needed. In this way, the labeling work is not lost. When it comes to ML Models, the more data, the better. So, having a single model with ample data is a good place to start.

Training and evaluation datasets

Document Manager can be used to build two types of datasets:

training datasets
evaluation datasets

Both types of dataset are essential for building a high-performing ML Model and require time and effort for creating and maintaining them. An evaluation dataset that is representative of the production document traffic is required for obtaining a high-performing ML model.

Each dataset type is labeled in a different way:

Training datasets rely on the bounding boxes of the words on the page representing the different pieces of information you need to extract.
When labeling a Training set, focus on the page itself and the word boxes.
Evaluation datasets rely on the values of the fields that appear in the sidebar (for Regular fields) or the top bar (for Column fields).
When labeling an Evaluation set, focus on the values under the field names in the sidebar or top bar. That does not mean you need to type them in manually. We recommend labeling by selecting the boxes on the page, and then checking the correctness of the values.

Data extraction components

Data extraction relies on the following components:

Optical Character Recognition
Word and Line building
Grouping characters into words and words into left-to-right lines of text
Machine Learning Model prediction for each word/box on the page
Cleanup, parsing, and formatting of the spans of text
For example, grouping words on several lines into an address, formatting a date to the standard yyyy-mm-dd format
Applying an algorithm for selecting which value is returned
For the cases when the document has two or more pages and some fields appear on more than one page

Confidence levels

What are confidence levels?

When ML models make predictions, they are basically statistical guesses. The model is saying "this is probably the Total amount" of this Invoice. This raises the question: how probable? Confidence levels are an attempt to answer that question, on a scale from 0 to 1. However, they are NOT true probability estimates. They are just how confident the model is about its guesses, and therefore they depend on what the model has been trained on. A better way to think of them is as a measure of familiarity: how familiar is the model with this input? If it resembles something the model has seen in training, then it might have a higher confidence. Otherwise it might have a lower confidence.

What are confidence levels useful for?

Business automations need ways to detect and handle exceptions, i.e. instances where an automation goes wrong. In traditional automations this is pretty obvious because when an RPA workflow breaks, it will just stop, hang, or throw an error. That error can be caught and handled accordingly. However, Machine Learning models do not throw errors when they make bad predictions. So how do we determine when an ML model has made a mistake and the exception handling flow needs to be triggered? Handling such exceptions often requires manual human involvement, perhaps using Action Center.

The best way to catch bad predictions, by far, is through enforcing business rules. For example, we know that on an invoice, the net amount plus the tax amount must equal the total amount. Or that the part numbers for the components ordered must have 9 digits. When these conditions do not hold, we know something has gone wrong, and we can trigger the exception handling flow. This is the preferred and strongly recommended approach. It is worth investing significant effort in building these kinds of rules, even using complex Regular Expressions, or even lookups in databases to validate Vendor names, part numbers, etc. In some cases you may even want to extract data from another document that is not of interest in itself, purely to cross-reference and validate some values against the original document of interest.
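As a minimal sketch of such a rule, the snippet below checks the nine-digit part number condition with a regular expression. The function name is hypothetical, and the assumption that a part number consists of exactly nine digits and nothing else is an illustration, not a product rule.

```python
import re

# Assumption for illustration: a valid part number is exactly nine ASCII digits.
PART_NUMBER_PATTERN = re.compile(r"[0-9]{9}")


def is_valid_part_number(value: str) -> bool:
    """Return True when the extracted value is exactly nine digits.

    Surrounding whitespace is ignored; anything else fails the rule and
    should send the document to the exception handling flow.
    """
    return PART_NUMBER_PATTERN.fullmatch(value.strip()) is not None
```

A value that fails this check would be routed to manual review rather than passed straight through.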

However, in some cases, none of these options exist and you still want to detect potentially bad predictions. In these cases you can fall back on the confidence level. When confidence level for a prediction is low, say, below 0.6, the risk that the prediction is incorrect is higher than if the confidence is 0.95. However, this correlation is fairly weak. There are many instances where a value is extracted with low confidence but it is correct. It is also possible, though relatively rare, that a value is extracted with high confidence (over 0.9) but it is incorrect. For these reasons we strongly encourage users to rely on business rules as much as possible and only use confidence levels as a last resort.

What types of confidence levels are there?

Most components in the Document Understanding™ product return a confidence level. The main components of a Document Understanding™ workflow are Digitization, Classification and Extraction. Each of these has a certain confidence for each prediction. The Digitization and the Extraction confidence are both visually exposed in the Validation Station, so you can filter predictions and focus only on low confidence ones, to save time.

Confidence score scaling (or calibration)

The confidence levels of different models will be scaled differently, depending on the model design. For example some models return confidence levels in the range 0.9-1 almost always, and only very rarely below 0.8. Other models have confidence levels spread much more evenly between 0 and 1, even if usually they are clustered at the higher end of the scale. As a result, the confidence thresholds on different models will be different. For example a confidence threshold on the OCR will not be the same as the threshold on the ML Extractor or the ML Classifier. Also, whenever there is a major architecture update on a model, like the one coming up with the release of the Helix Extractor Generative AI-based model architecture, the confidence level distribution will change, and confidence thresholds will need to be re-assessed.

Build a high performing ML model

For the best outcome in terms of automation rate (percent reduction of manual work measured in person-months per year required to process your document flow) you need to carefully follow these steps:

Choose the best OCR engine for your documents
This influences both the OCR and the Word and Line building (which depends partially on the OCR), and everything downstream.
Define the fields to be extracted
Select a well-balanced and representative dataset for Training
Label the Training dataset
Train the Extractor
Define and implement the business rules for processing model output
(Optional) Choose confidence threshold(s) for the Extraction
Fine-tune using data from Validation Station
Deploy your automation

1. Choose an OCR engine

To choose an OCR engine, you should create different Document Manager sessions, configure different OCR engines, and try to import the same files into each of them to examine differences. Focus on the areas that you want to extract. For example, if you need to extract company names that appear as part of logos on invoices, you may want to check which OCR engine performs better on the text in logos.

Your default option should be UiPath Document OCR since it is included with Document Understanding licenses at no charge. However, in cases where some unsupported languages are required, or some very hard-to-read documents are involved, you might want to try Google Cloud (Cloud only) or Microsoft Read (Cloud or On Premises), which have better language coverage. These engines come at a cost, albeit a low one. If the accuracy is higher on some critical data fields for your business process, it is strongly recommended to use the best OCR available. This saves you time later on, since everything downstream depends on it.

Be aware that the Digitize Document activity has the ApplyOcrOnPDF setting set to Auto by default, which means it decides whether OCR needs to be applied based on the input document. To avoid missing important information (from logos, headers, footers, etc.), set ApplyOcrOnPDF to Yes. This makes sure that all text is detected, though it might slow down your process.

2. Define fields

Defining fields is a conversation that needs to happen with the Subject Matter Expert or Domain Expert who owns the business process itself. For invoices, it would be the Accounts Payable process owner. This conversation is critical: it needs to happen before any documents are labeled to avoid wasted time, and it requires looking together at a minimum of 20 randomly chosen document samples. A one-hour slot needs to be reserved for this, and it often needs to be repeated after a couple of days, as the person preparing the data runs into ambiguous situations or edge cases.

It is not uncommon that the conversation starts with the assumption that you need to extract, let's say, 10 fields, and later you end up with 15.

Some key configurations you need to be aware of:

Content type
This is the most important setting as it determines the postprocessing of the values, especially for dates (detects if the format is US-style or non-US style, and then formats them as yyyy-mm-dd) and for numbers (detects the decimal separator – comma or period). ID numbers clean up anything coming before a colon or hash symbol. String content type performs no cleanup and can be used when you want to do your own parsing in the RPA workflow.
Multi-line checkbox
This is for parsing strings like addresses that may appear on more than 1 line of text.
Multi-valued checkbox
This is for handling multiple choice fields or other fields which may have more than one value, but are NOT represented as a table column. For example, an Ethnic group question on a government form may contain multiple checkboxes where you can select all that apply.
Hidden fields
Fields marked as Hidden can be labelled but they are held out when data is exported, so the model cannot be trained on them. This is handy when labeling a field is a work in progress, when it is too rare, or when it is low priority.
Scoring
This is relevant only for Evaluation pipelines, and it affects how the accuracy score is calculated. A field that uses Levenshtein scoring is more permissive: if a single character out of 10 is wrong, the score is 0.9. However, if scoring is Exact Match it is more strict: a single wrong character leads to a score of zero. Only String type fields have the option to select Levenshtein scoring by default.
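The difference between the two scoring modes can be illustrated as follows. This is only a sketch: it assumes the Levenshtein score is one minus the edit distance divided by the length of the longer string, which reproduces the 0.9 example above but is not confirmed to be the exact formula used by the Evaluation pipeline. All function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def levenshtein_score(expected: str, predicted: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(expected), len(predicted))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(expected, predicted) / longest


def exact_match_score(expected: str, predicted: str) -> float:
    """Strict scoring: any difference at all gives zero."""
    return 1.0 if expected == predicted else 0.0
```

With a ten-character value such as "INV-123456" and a prediction of "INV-123457", the Levenshtein score under this assumption is 0.9, while the Exact Match score is 0.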

Amounts on Utility Bills

A total amount might seem straightforward enough, but utility bills contain many amounts. Sometimes you need the total amount to be paid. Other times you need only the current bill amount – without the amounts carried forward from previous billing periods. In the latter case, you need to label differently even if the current bill and total amount can be the same. The concepts are different and the amounts are often different.

Note: Each field represents a different concept, and they need to be defined as cleanly as possible, so there is no confusion. If a human might be confused, the ML model will also be confused.

Moreover, the current bill amount can sometimes be composed of a few different amounts, fees, and taxes and may not appear individualized anywhere on the bill. A possible solution to this is to create two fields: a previous-charges field, and a total field. These two always appear as distinct explicit values on the utility bill. Then the current bill amount can be obtained as the difference between the two. You might even want to include all 3 fields (previous-charges, total, and current-charges) in order to be able to do some consistency checks in cases where the current bill amount appears explicitly on the document. So you could go from one to three fields in some cases.
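The arithmetic described above can be sketched in a few lines. The function names are hypothetical, and treating a missing current-charges value as "nothing to cross-check" is an assumption made for this illustration.

```python
from decimal import Decimal
from typing import Optional


def current_charges(total: Decimal, previous_charges: Decimal) -> Decimal:
    """Derive the current bill amount as total minus previous charges."""
    return total - previous_charges


def current_charges_consistent(
    total: Decimal,
    previous_charges: Decimal,
    extracted_current: Optional[Decimal],
) -> bool:
    """Cross-check an explicitly extracted current-charges value, if any.

    Assumption: when the bill does not show the current amount explicitly,
    there is nothing to compare against, so the check passes.
    """
    if extracted_current is None:
        return True
    return current_charges(total, previous_charges) == extracted_current
```

Using Decimal rather than float keeps the subtraction exact for currency values.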

Purchase Order numbers on Invoices

PO numbers can appear as single values for an invoice, or they might appear as part of the table of line items on an invoice, where each line item has a different PO number. In this case, it could make sense to have two different fields: po-no and item-po-no. By keeping each field visually and conceptually consistent, the model is likely to do a much better job. However, you need to make sure both are well represented in your Training and your Evaluation datasets.

Vendor name and Payment address name on Invoices

The company name usually appears at the top of an invoice or a utility bill, but sometimes it might not be readable because there is just a logo, and the company name is not explicitly written out. There could also be some stamp, handwriting, or wrinkle over the text. In these cases, people might label the name that appears at the bottom right, in the Remit payment to section of the payment slip on utility bills. That name is often the same, but not always, since it is a different concept. Payments can be made to some other parent or holding company, or other affiliate entity, and the name is visually different on the document. This might lead to poor model performance. In this case, you should create two fields, vendor-name and payment-addr-name. Then you can look both up in a vendor database and use the one that matches, or use payment-addr-name when the vendor-name is missing.

Rows of tables

There are two distinct concepts to keep in mind: table rows and lines of text. A table row includes all the values of all column fields which belong together in that row. Sometimes they may all be part of the same line of text going across the page. Other times they may be on different lines.

If a table row consists of more than one line of text, then you need to group all values on that table row using the “/” hotkey. When you do this, a green box will appear covering the entire table row. Here is an example of a table where the top two rows consist of multiple lines of text and need to be grouped using the “/” hotkey, while the third row is a single line of text and does not need to be grouped.

Here is an example of a table where each table row consists of a single line of text. You do not need to group these using the “/” hotkey because this is done implicitly by Document Manager.

Identifying where a row ends and another begins as one reads from top to bottom can often be a major challenge for ML extraction models, especially on documents like forms where there are no visual horizontal lines separating rows. Inside our ML Packages there is a special model which is trained to split tables into rows correctly. This model is trained by using these groups you label using the “/” or “Enter” keys and which are indicated by the green transparent boxes.

3. Select a well-balanced and representative dataset for Training

Machine Learning technology has the main benefit of being able to handle complex problems with high diversity. When estimating the size of a training dataset, one looks first at the number of fields and their types, and the number of languages. A single model can handle multiple languages as long as they are not Chinese/Japanese/Korean. Chinese/Japanese/Korean scenarios generally require separate Training datasets and separate models.

There are 3 types of fields:

Regular fields (date, total amount)
- For Regular fields, you need at least 20-50 document samples per field. So, if you need to extract 10 regular fields, you need at least 200-500 document samples. If you need to extract 20 regular fields, you need at least 400-1000 document samples. The number of document samples you need grows with the number of fields: each additional field adds roughly 20-50 document samples.
Column fields (item unit price, item quantity)
- For Column fields, you need at least 50-200 document samples per column field, so for 5 column fields with clean and simple layouts you might get good results with 300 document samples. For highly complex and diverse layouts, it might require over 1000 document samples. To cover multiple languages, you need at least 200-300 document samples per language, assuming they cover all the different fields. So, for 10 header (regular) fields and 4 column fields with 2 languages, 600 document samples might be enough (400 for the columns and headers, plus 200 for the additional language), but in some cases it might require 1200 or more document samples.
Classification fields (currency)
- Classification fields generally require at least 10-20 document samples from each class.
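The per-field guidelines above can be turned into a rough calculator. This is a sketch with hypothetical function names; it only multiplies the per-field ranges quoted above. How the ranges for different field types combine is not specified here, so the sketch reports them separately instead of adding them up.

```python
from typing import Tuple


def sample_range(count: int, low_per_item: int, high_per_item: int) -> Tuple[int, int]:
    """Scale a per-item (low, high) guideline by the number of items."""
    return count * low_per_item, count * high_per_item


def regular_field_samples(field_count: int) -> Tuple[int, int]:
    """At least 20-50 document samples per regular field."""
    return sample_range(field_count, 20, 50)


def column_field_samples(field_count: int) -> Tuple[int, int]:
    """At least 50-200 document samples per column field."""
    return sample_range(field_count, 50, 200)


def classification_field_samples(class_count: int) -> Tuple[int, int]:
    """At least 10-20 document samples from each class."""
    return sample_range(class_count, 10, 20)
```

For example, regular_field_samples(10) gives (200, 500), matching the figures quoted for 10 regular fields.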

The guidelines assume that you are solving a high diversity scenario like invoices or purchase orders with dozens to hundreds or thousands of layouts. However, if you are solving a low-diversity scenario like a tax form or invoices with very few layouts (under 5-10), then the dataset size is determined more by the number of layouts. In this case, you should start with 20-30 pages per layout and add more if needed - especially if the pages are very dense with large number of fields to be extracted. For instance, creating a model for extracting 10 fields from 2 layouts might require 60 pages, but if you need to extract 50 or 100 fields from 2 layouts, then you might start with 100 or 200 pages and add more as needed to get the desired accuracy. In this case, the distinction regular fields/column fields is less important.

Important: ML technology is designed to handle high diversity scenarios. Using it to train models on low diversity scenarios (1-10 layouts) requires special care to avoid brittle models that are sensitive to slight changes in the OCR text. Avoid this by having some deliberate variability in the training documents, by printing and then scanning or photographing them using mobile phone scanner apps. The slight distortions or changing resolutions make the model more robust.

These estimates assume that most pages contain all or most of the fields. For documents with multiple pages where most fields are on a single page, the relevant number of pages is the number of examples of that one page where most fields appear.

The numbers described are general guidelines, not strict requirements. In general, you can start with a smaller dataset, and then keep adding data until you get good accuracy. This is especially useful to parallelize the RPA work with the model building. Also, a first version of the model can be used to prelabel additional data (check Settings view and Predict button in Document Manager) which can accelerate labeling additional Training data.

Deep Learning models can generalize

You do not need to have every single layout represented in a training set. In fact, most layouts in your production document flow may have zero samples in your training set, or only one or two. This is desirable, because you want to leverage the power of AI to understand documents and make correct predictions on documents that it has not seen during training. Many document samples per layout are not mandatory: the model can still predict correctly on layouts that are absent or barely present, based on what it has learned from other layouts.

Training over an out-of-the-box model

There are three main types of scenarios when training a ML model for Document Understanding:

training a new type of document from scratch using the DocumentUnderstanding ML Package in AI Center
retraining over a pre-trained Out Of the Box model for optimizing the accuracy
retraining over a pre-trained Out Of the Box model for optimizing the accuracy and adding some new fields

The dataset size estimates for the first type of scenario are described at the beginning of this section (Select a well-balanced and representative dataset for Training).

For the second type of scenario, the dataset size depends on how well the pre-trained models already work on your documents. If they work very well already, then you may need very little data (50-100 pages). If they fail on a number of important fields, you may need more, but a good starting point would still be four times smaller than if you were training from scratch.

And finally for the third type of scenario, start with the dataset size for the second scenario, and then increase the dataset depending on how many new fields you have, using the same guidance as for training from scratch: at least 20-50 pages per new regular field, or at least 50-200 pages per column field.

In all these cases, all the documents need to be labelled fully, including the new fields, which the out-of-the-box model does not recognize, and also the original fields, which the out-of-the-box model does recognize.

Unequal field occurrences

Some fields might appear on every document (e.g., date, invoice number) while some fields might appear only on 10% of the pages (e.g., handling charges, discount). In these cases, you need to make a business decision. If those rare fields are not critical to your automation, you can get away with a small number of document samples (10-15) of that particular field, i.e. pages which contain a value for that field. However, if those fields are critical, then you need to make sure that you include in your Training set at least 30-50 document samples of that field to make sure to cover the full diversity.

Balanced datasets

In the case of invoices, if a dataset contains invoices from 100 vendors, but half of the dataset consists of invoices from one single vendor, then that is a very unbalanced dataset. A perfectly balanced dataset is where each vendor appears an equal number of times. Datasets do not need to be perfectly balanced, but you should avoid having more than 20% of your entire dataset coming from any single vendor. At some point, more data does not help, and it might even affect the accuracy on other vendors because the model optimizes (overfits) so much for one vendor.
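A quick way to check the 20% guideline before training is to count documents per vendor. The sketch below assumes you already have one vendor label per document in your dataset; the function name and the input format are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def overrepresented_vendors(
    vendors: Iterable[str], max_share: float = 0.20
) -> Dict[str, float]:
    """Return vendors whose share of the dataset exceeds max_share.

    `vendors` holds one entry per document, e.g. the vendor name.
    """
    counts = Counter(vendors)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        vendor: count / total
        for vendor, count in counts.items()
        if count / total > max_share
    }
```

Any vendor returned by this function is a candidate for downsampling before the dataset is exported for training.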

Representative datasets

Data should be chosen to cover the diversity of the documents likely to be seen in the production workflow. For example, if you get invoices in English but some of them come from the US, India and Australia, they probably look different, so you need to make sure you have document samples from all three. This is relevant not only for the model training itself, but also for labeling purposes. When you label the documents you might discover that you need to extract new, different fields from some of these regions, like GSTIN code from India, or ABN code from Australia. Check the Define fields section for more information.

4. Label the training dataset

When labeling Training data, you need to focus on the bounding boxes of the words in the document pane of Document Manager. The parsed values in the right or top sidebars are not important as they are not used for training.

Whenever a field appears multiple times on a page, as long as they represent the same concept, all of them should be labeled.

When the OCR misses a word or gets a few characters wrong, just label the bounding box if there is one, and if not, then just skip it and keep going. There is no possibility to add a word in Document Manager because even if you did, the word would still be missing at run time, so adding it doesn't help the model at all.

As you label, remain vigilant about fields that may have multiple or overlapping meanings/concepts, in case you need to split a field into two separate fields. Also watch for fields that you do not explicitly need, but which, if labelled, might help you build validation or self-consistency checks in the RPA workflow. Typical examples are quantity, unit-price, and line-amount on invoice line items. Line-amount is the product of quantity and unit-price, which makes it very useful for checking consistency without the need for confidence levels.

5. Train the extractor

To create an extractor, go to the Extractors view in Document Understanding and select the Create Extractor button at the top right. You can then select the Document Type, the ML Model and Version you would like to use. You can monitor the progress on the Extractors tab, or in the Details view of the Extractor, which contains a link to the AI Center pipeline, where you can check the detailed logs in real time.

When evaluating an ML model, the most powerful tool is the evaluation_<package name>.xlsx file generated in the artifacts/eval_metrics folder in AI Center pipeline details view. On the first sheet you can check a detailed Accuracy scores report, including overall scores, and also per field and per batch.

In this Excel file you can check what predictions are failing and on which files, and you can observe immediately if it is an OCR error or an ML Extraction or parsing error, and if it may be fixed by simple logic in the RPA workflow, or it requires a different OCR engine, more training data, or improving the labelling or the field configurations in Document Manager.

This Excel file is also very useful to identify the most relevant business rules you need to apply to the RPA workflow in order to catch common mistakes for routing to Validation Station in Action Center for manual review. Business rules are by far the most reliable way to detect errors.

For those errors which cannot be caught by business rules, you may also use confidence levels. The Excel file also contains confidence levels for each prediction, so you can use Excel features like sorting and filtering to determine what a good confidence threshold is for your business scenario.

Overall, the evaluation_<package name>.xlsx Excel file is a key resource you need to focus on to get the best results from your AI automation.

Important: GPU training is highly recommended for large and production datasets. CPU training is much slower and should be used sparingly, for small datasets for demo or testing purposes. For more information, check the Training pipelines page.

6. Define and implement business rules

In this step, you should be concerned with model errors and how to detect them. There are three main ways to detect errors:

through enforcing business rules
through applying lookups in Systems of Record in the customer organization
through enforcing a minimum confidence level threshold

The most effective and reliable way to detect errors is by defining business rules and lookups. Confidence levels can never be 100% perfect: there will always be a small but non-zero percentage of correct predictions with low confidence or wrong predictions with high confidence. In addition, and perhaps most importantly, a missing field has no confidence, so a confidence threshold can never catch errors whereby a field is not extracted at all. Consequently, confidence level thresholds should only be used as a fallback, a safety net, but never as the main way to detect business-critical errors.

Examples of business rules:

Net amount plus Tax amount must equal Total amount
Total amount must be greater than or equal to Net amount
Invoice number, Date, Total amount (and other fields) must be present
PO number (if present) must exist in PO database
Invoice date must be in the past and cannot be more than X months old
Due date must be in the future and not more than Y days/months
For each line item, the quantity multiplied by unit price must equal the line amount
Sum of line amounts must equal net amount or total amount
Etc.

Note: In case of numbers, a rounding to eight decimals is performed.
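A few of the rules listed above can be sketched as a single validation function. Everything here is illustrative: the Invoice and LineItem classes, the rule messages, and the 0.01 comparison tolerance are assumptions for this example, not part of the product. In practice these checks would live in the RPA workflow.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import List, Optional


@dataclass
class LineItem:
    quantity: Decimal
    unit_price: Decimal
    line_amount: Decimal


@dataclass
class Invoice:
    invoice_number: Optional[str]
    invoice_date: Optional[date]
    net_amount: Optional[Decimal]
    tax_amount: Optional[Decimal]
    total_amount: Optional[Decimal]
    line_items: List[LineItem] = field(default_factory=list)


def failed_rules(
    invoice: Invoice, today: date, tolerance: Decimal = Decimal("0.01")
) -> List[str]:
    """Return a description of every business rule the invoice violates."""
    failures: List[str] = []
    net, tax, total = invoice.net_amount, invoice.tax_amount, invoice.total_amount
    # Required fields must be present (a missing field has no confidence at all).
    if not invoice.invoice_number or invoice.invoice_date is None or total is None:
        failures.append("required field missing")
    if invoice.invoice_date is not None and invoice.invoice_date > today:
        failures.append("invoice date is in the future")
    if net is not None and tax is not None and total is not None:
        if abs(net + tax - total) > tolerance:
            failures.append("net + tax != total")
    if net is not None and total is not None and total < net:
        failures.append("total < net")
    for index, item in enumerate(invoice.line_items):
        if abs(item.quantity * item.unit_price - item.line_amount) > tolerance:
            failures.append(f"line {index}: quantity * unit price != line amount")
    return failures
```

An empty result means the document can proceed; a non-empty list can be passed to the human validator to show which checks failed.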

In particular, the confidence levels of column fields should almost never be used as an error detection mechanism. Column fields (e.g., line items on invoices or POs) can have dozens of values, so a minimum threshold applied over so many values is especially unreliable: at least one value is more than likely to have low confidence. This would lead to most or all documents being routed to human validation, many times unnecessarily.
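The effect can be quantified with a small calculation. The sketch below assumes, purely for illustration, that each extracted value independently has the same probability of falling below the threshold. Real confidence values are not independent, so treat the result as a rough intuition only. The function name is hypothetical.

```python
def probability_any_below_threshold(
    per_value_probability: float, value_count: int
) -> float:
    """Chance that at least one of value_count values falls below the threshold.

    Assumes independent values with identical probability (a simplification).
    """
    return 1.0 - (1.0 - per_value_probability) ** value_count
```

Under this assumption, if each value has a 5% chance of low confidence, a document with 30 line item values has roughly a 0.79 chance of being routed to human validation, even when every value is correct.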

Business rules must be enforced as part of the RPA workflow, and the business rule failure is passed to the human validator so as to direct their attention and accelerate the process.

Note: When defining Business Rules, please keep in mind that the Starts with, Ends with, and Contains values are case sensitive.

7. (Optional) Choose a confidence threshold

After the business rules have been defined, sometimes there might remain a small number of fields for which there are no business rules in place, or for which the business rules are unlikely to catch all errors. For this, you may need to use a confidence threshold as a last resort.

The main tool to set this threshold is the Excel spreadsheet which is output by the Training pipeline in the Outputs > artifacts > eval_metrics folder.

This evaluation_<package name>.xlsx file contains a column for each field, and a column for the confidence level of each prediction. By sorting the table based on the confidence columns you may check where the errors start appearing for any given field and set a threshold above that level to ensure that only correctly extracted documents are sent straight through.
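The same sorting exercise can be sketched in code. The input format below, a list of (confidence, is_correct) pairs for one field, is an assumption about how you might export the spreadsheet columns; it is not the file's actual layout, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

Prediction = Tuple[float, bool]  # (confidence, is_correct) for one field


def highest_error_confidence(predictions: List[Prediction]) -> Optional[float]:
    """Highest confidence among incorrect predictions, or None if all are correct.

    A threshold for this field should be set above the returned value.
    """
    error_confidences = [conf for conf, is_correct in predictions if not is_correct]
    return max(error_confidences) if error_confidences else None


def straight_through_share(predictions: List[Prediction], threshold: float) -> float:
    """Fraction of predictions with confidence strictly above the threshold."""
    if not predictions:
        return 0.0
    passed = sum(1 for conf, _ in predictions if conf > threshold)
    return passed / len(predictions)
```

Comparing the straight-through share at different thresholds shows how much automation is given up in exchange for fewer undetected errors.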

8. Fine-tune with data from Validation Station

Validation Station data can help improve the model predictions, yet, in many cases, it turns out that most errors are NOT due to the model itself but to the OCR, labelling errors or inconsistencies, or to postprocessing issues (e.g., date or number formatting). So, the first key aspect is that Validation Station data should be used only after the other Data extraction components have been verified and optimized to ensure good accuracy, and the only remaining area of improvement is the model prediction itself.

The second key aspect is that Validation Station data has a lower information density than the data labeled in Document Manager. Fundamentally, the Validation Station user only cares about getting the right value once. If an invoice has 5 pages, and the invoice number appears on every page, the Validation Station user validates it only on the first page. So, 80% of the values remain unlabelled. In Document Manager, all the values are labelled.

Finally, keep in mind Validation Station data needs to be added to the original manually labelled dataset, so that you always have a single training dataset which increases in size over time. You always need to train on the ML Package with the 0 (zero) minor version, which is the version released by UiPath Out-of-the-box.

Important: It is often wrongly assumed that the way to use Validation Station data is to iteratively train the previous model version, so the current batch is used to train package X.1 to obtain X.2. Then the next batch trains on X.2 to obtain X.3 and so on. This is the wrong way to use the product. Each Validation Station batch needs to be imported into the same Document Manager session as the original manually labeled data making a larger dataset, which must be used to train always on the X.0 ML Package version.

Cautions on using Validation Station data

Validation Station data can potentially be of much higher volume since it is generated in the production workflow. You do not want the dataset to become overwhelmed with Validation Station data because this can degrade the quality of the model due to the information density issue previously mentioned.

The recommendation is to add a maximum of 2-3X the number of pages of Document Manager data and, beyond that, only cherry pick those vendors or samples where you observe major failures. If there are known major changes to the production data, such as a new language, or a new geographic region being onboarded to the business process (expanding from US to Europe or South Asia), then representative data for those languages and regions should be added to Document Manager for manual labelling. Validation Station data is not appropriate for such major scope expansion.

One other potential issue with Validation Station data is balance. In production it is common that a majority of traffic comes from a small subset of vendors/customers/world regions. If allowed into the training set as is, this can lead to a highly biased model which performs well on a small subset of the data but performs poorly on most of the rest of the data. Therefore it is important to take special care when adding Validation Station data into a training set.

Here is a sample scenario. You have chosen a good OCR engine, labeled 500 pages in Document Manager, resulting in good performance, and you have deployed the model in a production RPA workflow. Validation Station is starting to generate data. You should randomly select up to a maximum of 1000-1500 pages from Validation Station and import them into the Document Manager together with the first 500 pages and train your ML model again. After that, you should look very carefully at the evaluation_<package name>.xlsx to make sure the model actually improved, and then you should deploy the new model to production.
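The random selection step in this scenario can be sketched as follows. The page identifiers, the function name, and the default ratio of 3 (the upper end of the 2-3X recommendation) are assumptions for illustration. The sketch does not address vendor balance, which needs separate care.

```python
import random
from typing import List, Sequence


def sample_validation_pages(
    validation_pages: Sequence[str],
    manual_page_count: int,
    max_ratio: float = 3.0,
    seed: int = 0,
) -> List[str]:
    """Randomly pick Validation Station pages, capped relative to manual data.

    The cap is max_ratio times the number of manually labeled pages.
    """
    cap = int(manual_page_count * max_ratio)
    if len(validation_pages) <= cap:
        return list(validation_pages)
    # A fixed seed keeps the selection reproducible between runs.
    return random.Random(seed).sample(list(validation_pages), cap)
```

With 500 manually labeled pages and the default ratio, at most 1500 Validation Station pages are selected.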

9. Deploy your automation

Make sure to use the Document Understanding™ Process: Studio Template from the Templates section in the Studio start screen in order to apply best practices in Enterprise RPA architecture.
