
UiPath Document Understanding

Training High-Performing Models

The power of Machine Learning models is that they are defined by training data rather than by explicit logic in computer code. This means that special care is needed when preparing datasets, because a model is only as good as the dataset it was trained with. So, just as UiPath Studio is the tool for RPA workflows, Document Manager is the tool for Machine Learning capabilities. Both require some experience to be used effectively.

What Can a Data Extraction ML Model Do?

An ML Model can extract data from a single type of document, though it might cover several different languages. It is essential that each field (Total Amount, Date, etc.) has a single, consistent meaning. If a human can be confused about the correct value for a field, an ML Model will be confused too.

Ambiguous situations can appear. For instance, is a utility bill just another type of invoice, or are these two different document types that require two different ML Models? If the fields you need to extract are the same (i.e., they have the same meaning), you can treat them as a single document type. However, if you need to extract different fields for different reasons (different business processes), then you need to treat them as two different document types and, hence, train two different models.

When in doubt, start by training a single model, but keep the documents in different Document Manager batches (see the Filter drop-down at the top center of the Document Manager view). That way you can easily separate the documents later if needed, without losing the labeling work. When it comes to ML Models, the more data, the better: start with a single model and ample data.

Training and Evaluation Datasets

Document Manager can be used to build two types of datasets:
  • training datasets
  • evaluation datasets

Both types of dataset are essential for building a high-performing ML Model, and both require time and effort to create and maintain. An evaluation dataset that is representative of the production document traffic is required for obtaining a high-performing ML model.

Each type of dataset is labeled in a different way:

Training datasets rely on the bounding boxes of the words on the page representing the different pieces of information you need to extract.
When labeling a Training set, focus on the page itself and the word boxes.

Evaluation datasets rely on the values of the fields that appear in the sidebar (for Regular fields) or the top bar (for Column fields).
When labeling an Evaluation set, focus on the values under the field names in the sidebar or top bar. That does not mean you need to type them in manually; we recommend labeling by clicking on the boxes on the page and then checking the correctness of the values.

Detailed information on how to run proper Evaluations can be found below.

Data Extraction Components

Data extraction relies on the following components:

  • Optical Character Recognition (OCR)
  • Word and line building
    Grouping characters into words and words into left-to-right lines of text
  • Machine Learning model prediction for each word/field on the page
  • Cleaning, parsing, and formatting of the text spans
    For example, grouping words on several lines into an address, or formatting a date to the standard yyyy-mm-dd format (a small sketch of this step follows the list)
  • Applying an algorithm for selecting which value is returned
    For cases in which the document has two or more pages and some fields appear on more than one page
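To make the cleaning-and-formatting step concrete, here is a minimal Python sketch of date normalization, assuming the third-party python-dateutil package; the function name and logic are illustrative, not the product's actual implementation.

```python
# Minimal sketch of the date cleanup/formatting step (illustrative only).
# Requires the third-party package: pip install python-dateutil
from dateutil import parser

def normalize_date(raw_text: str, us_style: bool = False) -> str:
    """Parse an OCR-extracted date string and format it as yyyy-mm-dd."""
    # dayfirst=True reads 05/04/2021 as April 5 (non-US style).
    parsed = parser.parse(raw_text, dayfirst=not us_style)
    return parsed.strftime("%Y-%m-%d")

print(normalize_date("05/04/2021"))                   # 2021-04-05 (non-US)
print(normalize_date("May 4, 2021", us_style=True))   # 2021-05-04
```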

Building a High-Performing ML Model

For the best outcome in terms of automation rate (the percent reduction of manual work, measured in person-months per year, required to process your document flow), follow these steps:

  1. Choose the most suitable OCR engine for your documents
    This influences the OCR, the word and line building (which partially depends on the OCR), and everything that follows.
  2. Select a well-balanced and representative dataset for Training
  3. Select a representative dataset for Evaluation
  4. Define the fields to be extracted
  5. Configure the fields
  6. Label the Training dataset
  7. Label the Evaluation dataset
  8. Train and evaluate the model in AI Center
  9. Define and implement the business rules for processing the model output
  10. Choose the confidence threshold(s) for the extraction (optional)
  11. Train with data from Validation Station
  12. The Auto-Fine-Tuning Loop (Preview)
  13. Deploy your automation

1. Choose an OCR Engine


To choose an OCR engine, create different Document Manager sessions, configure different OCR engines, and import the same files into each of them to examine the differences. Focus on the areas you want to extract. For example, if you need to extract company names that appear as part of logos on invoices, check which OCR engine performs better on the text in logos.

Your default option should be UiPath Document OCR, since it is included with Document Understanding licenses at no extra charge. However, when unsupported languages are required, or some very hard-to-read documents are involved, you might want to try Google Cloud (Cloud only) or Microsoft Read (Cloud or on-premises), which have better language coverage. These engines come at a small cost, but if their accuracy is higher on data fields critical to your business process, using the best available OCR is strongly recommended: it saves you time later, since everything downstream depends on it.

Be aware that the Digitize Document activity has its ApplyOcrOnPDF setting set to Auto by default, which determines per input document whether OCR needs to be applied. To avoid missing important information (in logos, headers, footers, etc.), set ApplyOcrOnPDF to Yes so that all text is detected, though this might slow down your process.

2. Create a Training Dataset


The main benefit of Machine Learning technology is its ability to handle complex, high-diversity problems. When estimating the size of a training dataset, look first at the number of fields, their types, and the number of languages. A single model can handle multiple languages as long as they are not Chinese, Japanese, or Korean; those scenarios generally require separate Training datasets and separate models.

There are three types of fields:

  • Regular fields (date, total amount)
    For Regular fields, you need at least 20-50 document samples per field. So, if you need to extract 10 regular fields, you need at least 200-500 document samples; for 20 regular fields, at least 400-1000. In general, the required number of document samples scales with the number of fields, at roughly 20-50 samples per field.

  • Column fields (item unit price, item quantity)
    For Column fields, you need at least 50-200 document samples per column field. For 5 column fields with clean and simple layouts, you might get good results with 300 document samples; highly complex and diverse layouts might require over 1000. To cover multiple languages, you need at least 200-300 document samples per language, assuming they cover all the different fields. So, for 10 header fields and 4 column fields in 2 languages, 600 document samples might be enough (400 for the columns and headers, plus 200 for the additional language), but some cases might require 1200 or more.

  • Classification fields (currency)
    Classification fields generally require at least 10-20 document samples from each class. A rough size estimator combining these guidelines follows this list.
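As a rough illustration only, the per-field ranges above can be combined into a back-of-the-envelope estimator. The function below is hypothetical arithmetic, not a UiPath tool, and real needs depend heavily on layout diversity.

```python
# Back-of-the-envelope training-set size estimate from the guideline ranges
# above. Illustrative only; actual needs depend on layout diversity.

def estimate_training_pages(regular_fields: int, column_fields: int,
                            languages: int = 1) -> tuple[int, int]:
    low = regular_fields * 20 + column_fields * 50
    high = regular_fields * 50 + column_fields * 200
    extra_langs = max(0, languages - 1)  # each extra language: ~200-300 samples
    return low + 200 * extra_langs, high + 300 * extra_langs

# 10 header fields, 4 column fields, 2 languages -> (600, 1600), in line with
# the "600 might be enough, sometimes 1200 or more" range quoted above.
print(estimate_training_pages(10, 4, languages=2))
```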

The guidance above assumes a high-diversity scenario (invoices or purchase orders with dozens to hundreds or thousands of layouts). If, however, you are dealing with a low-diversity scenario, such as a tax form or invoices with very few layouts (fewer than 5-10), then the dataset size is determined mainly by the number of layouts. In that case, start with 20-30 pages per layout and add more if needed, especially if the pages are very dense (i.e., have a large number of fields to extract). For example, building a model to extract 10 fields from 2 layouts might require 60 pages, but if you need to extract 50 or 100 fields from 2 layouts, you might start with 100 or 200 pages and add more as needed to reach the desired accuracy. In this case, the distinction between regular fields and column fields matters less.

❗️

Training on very few layouts leads to brittle models

ML technology is designed to handle high-diversity scenarios. Using it to train models on low-diversity scenarios (1-10 layouts) requires special care to avoid brittle models that are sensitive to slight changes in the OCR text. Avoid this by introducing some deliberate variability in the training documents: print them and then scan or photograph them using mobile-phone scanner apps. The slight distortions and resolution changes make the model more robust.

These estimates assume that most pages contain all or most of the fields. For multi-page documents where most fields appear on a single page, the relevant number of pages is the number of examples of that one page where most fields appear.

The numbers above are general guidelines, not strict requirements. In general, you can start with a smaller dataset, and then keep adding data until you get good accuracy. This is especially useful to parallelize the RPA work with the model building. Also, a first version of the model can be used to prelabel additional data (see Settings view and Predict button in Document Manager) which can accelerate labeling additional Training data.

Deep Learning Models Can Generalize

You do not need to have every single layout represented in the training set. In fact, most layouts in your production document flow will have zero samples in your training set, or only one or two. This is desirable: you want to leverage the power of AI to understand documents and make correct predictions on documents it has not seen during training. The model can still predict correctly on layouts that are absent or rare in the training set, based on what it learned from other layouts.

Training on Top of an Out-of-the-Box Model

There are three main scenarios when training a Document Understanding ML model:

  • training a new document type from scratch, using the DocumentUnderstanding ML Package in AI Center
  • retraining over a pre-trained out-of-the-box model to optimize accuracy
  • retraining over a pre-trained out-of-the-box model to optimize accuracy and add some new fields

The dataset size estimates for the first scenario are described in the first part of this section ("2. Create a Training Dataset").

For the second scenario, the dataset size depends on how well the pre-trained models already work on your documents. If they already work very well, you may need very little data, perhaps 50-100 pages. If they fail on a number of important fields, you may need more, but a good starting point is still four times smaller than if you were training from scratch.

For the third scenario, start with the dataset size of the second scenario (see above) and grow the dataset depending on how many new fields you have, using the same guidance as for training from scratch: at least 20-50 pages per new regular field, or at least 50-200 pages per new column field.

In all these cases, the documents need to be labelled fully, including both the new fields, which the out-of-the-box model does not recognize, and the original fields, which it does recognize.

Uneven Field Occurrence

Some fields might appear on every document (e.g., date, invoice number) while others might appear on only 10% of the pages (e.g., handling charges, discount). In these cases, you need to make a business decision. If those rare fields are not critical to your automation, you can get away with a small number of document samples (10-15) of that particular field, i.e., pages that contain a value for that field. However, if those fields are critical, then you need to make sure your Training set includes at least 30-50 document samples of that field, so its full diversity is covered.

Balanced Datasets

In the case of invoices, if a dataset contains invoices from 100 vendors, but half of the dataset consists of invoices from one single vendor, then that is a very unbalanced dataset. A perfectly balanced dataset is one where each vendor appears an equal number of times. Datasets do not need to be perfectly balanced, but you should avoid having more than 20% of your entire dataset come from any single vendor. At some point, more data does not help, and it might even hurt accuracy on other vendors because the model overfits to that one vendor.
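If you keep a simple record of which vendor each document comes from, the 20% guideline is easy to check. The sketch below is illustrative; the docs list and its format are assumptions, not something Document Manager produces.

```python
# Flag vendors exceeding 20% of the dataset, per the guideline above.
from collections import Counter

def overrepresented_vendors(docs, max_share=0.20):
    """docs: iterable of (document_id, vendor) pairs you maintain yourself."""
    counts = Counter(vendor for _, vendor in docs)
    total = len(docs)
    return {v: n / total for v, n in counts.items() if n / total > max_share}

docs = [("d1", "ACME"), ("d2", "ACME"), ("d3", "ACME"),
        ("d4", "Globex"), ("d5", "Initech")]
print(overrepresented_vendors(docs))  # {'ACME': 0.6}
```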

Representative Datasets

Data should be chosen to cover the diversity of the documents likely to be seen in the production workflow. For example, if you receive invoices in English but some come from the US, India, and Australia, they will probably look different, so you need to make sure you have document samples from all three. This is relevant not only for the model training itself but also for labeling purposes: as you label, you may discover that you need to extract new, different fields from some of these regions, such as the GSTIN code in India or the ABN code in Australia. See the Define Fields section for more.

3. Create an Evaluation Dataset


For Training sets, pages and the number of pages matter most. For Evaluation sets, only documents and the number of documents matter. The scores for releases v2021.10 and later are calculated per document.

Evaluation datasets can be smaller. They can be 50-100 documents (or even just 30-50 in low-diversity scenarios), and they can grow over time to a few hundred documents. It is important that they are representative of the production data flow, so a good approach is to select randomly from the documents processed in the RPA workflow. Even if some vendors are overrepresented, that is fine. For example, if a single vendor represents 20% of your invoice traffic, it is fine for that vendor to make up 20% of your Evaluation set too, so that the Evaluation metrics approximate your business metric: the reduction in person-months spent on manual document processing.
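A minimal sketch of that random selection, assuming your processed documents sit in a local folder; the folder name and file type are placeholders.

```python
# Draw a representative Evaluation set by sampling production documents.
import random
from pathlib import Path

production_docs = list(Path("processed_invoices").glob("*.pdf"))  # assumed folder
random.seed(42)  # fixed seed so the selection is reproducible
evaluation_set = random.sample(production_docs, k=min(100, len(production_docs)))

for doc in evaluation_set:
    print(doc.name)
```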

When importing Evaluation data into Document Manager, you need to check the "Make this an evaluation set" box in the Import dialog. This guarantees that the data is held out during training, and it also lets you export it easily for running Evaluations, using the evaluation-set option in the Filter drop-down in Document Manager.

❗️

Important!

Starting with the 21.9 Preview release in Automation Cloud and the 21.10 GA On-Premises release, Document Manager has switched to handling multi-page documents rather than treating each page as a separate entity. This is a major change, especially for Evaluations, which were its main motivation. Evaluations need to be representative of the runtime process, and at runtime multi-page documents are processed as a whole rather than as separate pages. To benefit from this enhancement in ML Packages version 21.10 or later, just leave the "Backward compatible export" box unchecked in the Export dialog. If you check this box, the dataset is exported in the old page-by-page way, and evaluation scores are less representative of the runtime performance.

4. Define Fields


Defining fields needs to happen in conversation with the subject matter expert who owns the business process; for invoices, that would be the accounts payable owner. This conversation is critically important. It must happen before labeling documents, to avoid wasted time, and it requires examining at least 20 randomly chosen document samples together. A one-hour slot needs to be set aside for it, and it often needs to be repeated after a couple of days, as the person preparing the data runs into ambiguous situations or edge cases.

It is not uncommon to assume at the beginning that you need to extract 10 fields, only to end up with 15. Some examples are described in the subsections below.

Some important configurations you need to be aware of:

  • Content Type
    This is the most important setting, as it determines the post-processing of the values, especially for dates (it detects whether the format is US-style or non-US-style and then formats them as yyyy-mm-dd) and for numbers (it detects the decimal separator, comma or period). The ID Number type cleans up anything coming before a colon or hash symbol. The String content type performs no cleanup and can be used when you want to do your own parsing in the RPA workflow.
  • Multi-line checkbox
    This is for parsing strings, such as addresses, that may appear on more than one line of text.
  • Hidden fields
    Fields marked as Hidden can be labelled, but they are held out when the data is exported, so the model cannot be trained on them. This is handy while labeling a field is still a work in progress, or when a field is too rare or low priority.
  • Scoring
    This is relevant only for Evaluation pipelines, and it affects how the accuracy score is calculated. A field using Levenshtein scoring is more permissive: if a single character out of 10 is wrong, the score is 0.9. Exact Match scoring is stricter: a single wrong character leads to a score of zero. All fields default to Exact Match; only String-type fields offer the option of Levenshtein scoring. A small illustration of the two behaviors follows this list.
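The sketch below illustrates the difference between the two scoring behaviors described above; the exact formula used by the product is not documented here, so treat this as an approximation.

```python
# Illustrative comparison of Exact Match vs. Levenshtein scoring.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def field_score(expected: str, predicted: str, exact_match: bool = True) -> float:
    if exact_match:
        return 1.0 if expected == predicted else 0.0
    return max(0.0, 1 - levenshtein(expected, predicted) / max(len(expected), 1))

print(field_score("1234567890", "1234567891"))                      # 0.0 (Exact Match)
print(field_score("1234567890", "1234567891", exact_match=False))   # 0.9 (Levenshtein)
```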

Amounts on Utility Bills

A total amount might seem straightforward, but utility bills contain many amounts. Sometimes you need the total amount to be paid. Other times you need only the current bill amount, without the amounts carried forward from previous billing periods. In the latter case, you need to label differently, even if the current bill amount and the total amount happen to be the same: the concepts are different, and the amounts are often different.

📘

Note:

Each field represents a different concept, and fields need to be defined as cleanly and crisply as possible so there is no confusion. If a human can confuse them, the ML model will too.

Moreover, the current bill amount may sometimes be composed of several different amounts, payments, and taxes, and never appear in isolation anywhere on the bill. A possible solution is to create two fields: a previous-charges field and a total field. These two always appear as distinct, explicit values on a utility bill, and the current bill amount can then be obtained as the difference between them. You might even keep all three fields (previous charges, total, and current charges) so you can run consistency checks whenever the current bill amount does appear explicitly on the document. So in some cases you go from one field to three.
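A small sketch of the resulting derivation and consistency check; the field names and tolerance are hypothetical.

```python
# Derive the current bill amount from total and previous charges, and
# cross-check it when the document also states it explicitly.
from typing import Optional

def current_bill(total_due: float, previous_charges: float,
                 stated_current: Optional[float] = None,
                 tolerance: float = 0.01) -> float:
    derived = round(total_due - previous_charges, 2)
    if stated_current is not None and abs(derived - stated_current) > tolerance:
        raise ValueError(f"Inconsistent amounts: {total_due} - {previous_charges} "
                         f"does not match {stated_current}")
    return derived

print(current_bill(total_due=180.50, previous_charges=30.00))   # 150.5
print(current_bill(180.50, 30.00, stated_current=150.50))       # 150.5
```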

Purchase Order Numbers on Invoices

PO numbers may appear as a single value per invoice, or they may appear as part of the table of line items, where each line item has a different PO number. In that case, it might make sense to have two different fields: po-no and item-po-no. By keeping each field visually and conceptually consistent, the model is likely to perform much better. However, you need to make sure both are well represented in your Training and Evaluation datasets.

Vendor Name and Payment Address Name on Invoices

The company name usually appears at the top of an invoice or a utility bill, but sometimes it is not readable because there is just a logo and the company name is not explicitly written out. There could also be a stamp, handwriting, or a wrinkle over the text. In these cases, people might label the name that appears at the bottom right, in the "Remit payment to" section of the payment slip on utility bills. That name is often the same, but not always, since it is a different concept: payments can be made to some parent or holding company or other affiliate entity, and it is visually different on the document. This can lead to poor model performance. In this case, you should create two fields, vendor-name and payment-addr-name. Then you can look both up in a vendor database and use the one that matches, or fall back on payment-addr-name when vendor-name is missing.

Table Rows

There are two distinct concepts to keep in mind: table rows and lines of text. A table row includes all the values of all the column fields that belong together in that row. Sometimes they are all part of the same line of text on the document page; sometimes they span multiple lines.

If a table row consists of more than one line of text, then you need to group all values on that table row using the “/” hotkey. When you do this, a green box appears, covering the entire table row. Here is an example of a table where the top two rows consist of multiple lines of text and need to be grouped using the “/” hotkey, while the third row is a single line of text and does not need to be grouped.


Here is an example of a table in which each table row consists of a single line of text. You do not need to group these using the "/" hotkey, since Document Manager does this implicitly.


Split Items

Split Items is a setting that appears only for Column fields, and it helps the model know where one line item ends and the next begins. As a human looking at a document, to tell how many rows there are, you probably look at how many amounts appear on the right side: each amount generally corresponds to one line item. This indicates that line-amount is a column where you should enable Split Items. Other columns can also be marked in case the OCR misses the line-amount or the model fails to recognize it: quantity and unit-price are usually also marked as Split Items.

5. Configure the Fields


The most important configuration is the Content Type which, with the exception of String, can be one of:

  • Number
  • Date
  • Phone
  • ID Number

These impact post-processing, especially the cleanup, parsing, and formatting. The most complex is date formatting, but number formatting also requires determining the decimal separator and the thousands separator. In some cases, if parsing fails, your only option is to report the issue to UiPath support and fall back on the String content type, which does no parsing. In that case, you need to parse the value in your RPA workflow logic.
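If you do fall back on String, the custom parsing in your workflow logic might look like the following heuristic, shown here in Python for brevity; it is a sketch, not the product's parser.

```python
# Heuristic amount parser: the rightmost separator is taken as the decimal point.
import re

def parse_amount(raw: str) -> float:
    digits = re.sub(r"[^\d.,-]", "", raw)      # drop currency symbols and spaces
    last_dot, last_comma = digits.rfind("."), digits.rfind(",")
    if last_comma > last_dot:                  # e.g. "1.234,56" (comma decimal)
        digits = digits.replace(".", "").replace(",", ".")
    else:                                      # e.g. "1,234.56" (period decimal)
        digits = digits.replace(",", "")
    return float(digits)

print(parse_amount("$1,234.56"))    # 1234.56
print(parse_amount("1.234,56 €"))   # 1234.56
```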

Another relevant configuration is the Multi-line checkbox, which matters mainly for String-type fields. Whenever some other field produces unexpected results or no results at all, the first thing to try is to change it to a Multi-line String field, to see the unaltered output of the model prediction.

6. Label the Training Dataset


When labeling Training data, you need to focus on the bounding boxes of the words in the document pane of Document Manager. The parsed values in the right or top sidebars are not important as they are not used for training.

All fields should be labelled, even if a field appears multiple times on a page, as long as the occurrences represent the same concept (see the Define Fields section above).

When the OCR misses a word or gets a few characters wrong, just label the bounding box if there is one; if not, skip it and keep going. There is no way to add a word in Document Manager, because even if you could, the word would still be missing at run time, so adding it would not help the model at all.

As you label, stay alert for fields that may have multiple or overlapping meanings/concepts, in case you need to split a field into two separate fields, and for fields that you do not explicitly need but which, if labelled, might support validation or self-consistency-check logic in the RPA workflow. Typical examples are quantity, unit-price, and line-amount on invoice line items: since line-amount is the product of quantity and unit-price, labeling all three is very useful for checking consistency without any need for confidence levels.

7. Label the Evaluation Dataset


When labeling Evaluation datasets (also called Test datasets), you need to focus on something slightly different from what you focus on when labeling Training datasets. Whereas for Training datasets only the bounding boxes of the fields on the document matter, for Evaluation datasets only the values of the fields matter. You can edit a value by clicking on it in the right or top sidebar and typing. To return to the automatically parsed value, click the lock icon.

📘

Note:

For convenience, speed, and to avoid typos, we recommend clicking on the boxes on the document when labeling and only making corrections manually. Typing full values manually is slower and more error-prone.

8. Train and Evaluate the Model


Exporting the entire dataset, including both Training and Test batches, is allowed because the Training pipelines in AI Center ignore Test data. However, Evaluation pipelines run the evaluation on the whole evaluation dataset, regardless of whether it consists of Training or Test data. The type of a given document is displayed right under the file name, at the top center of the Document Manager window.

When evaluating an ML model, the most powerful tool is the evaluation.xlsx file generated in the artifacts/eval_metrics folder. In this Excel file you can see which predictions are failing and on which files, and you can tell immediately whether it is an OCR error, an ML extraction error, or a parsing error, and whether it can be fixed by simple logic in the RPA workflow or requires a different OCR engine, more training data, or improving the labeling or the field configurations in Document Manager.

This Excel file is also very useful to identify the most relevant business rules you need to apply in the RPA workflow to catch common mistakes for routing to Validation Station in Action Center for manual review. Business rules are by far the most reliable way to detect errors.

For those errors that cannot be caught by business rules, you may also use confidence levels. The Excel file also contains confidence levels for each prediction, so you can use Excel features like sorting and filtering to determine what a good confidence threshold is for your business scenario.

Overall, the evaluation.xlsx Excel file is a key resource you need to focus on to get the best results from your AI automation.

9. Define and Implement Business Rules


In this step, you should be concerned with model errors and how to detect them. There are two main ways to detect errors:

  • by enforcing business rules
  • by enforcing a minimum confidence threshold

The most effective and reliable way to detect errors is by defining business rules. Confidence levels can never be 100% perfect; there is always a small but non-zero percentage of correct predictions with low confidence, or of wrong predictions with high confidence. Also, and most importantly, a missing field has no confidence, so a confidence threshold can never catch errors where a field is not extracted at all. Consequently, confidence-level thresholds should only be used as a fallback, a safety net, never as the main way to detect business-critical errors.

Examples of business rules (a minimal sketch implementing a few of them follows this list):

  • Net amount plus tax amount must equal the total amount
  • The total amount must be greater than or equal to the net amount
  • Invoice number, date, total amount (and other fields) must be present
  • The PO number (if present) must exist in the corresponding database
  • The invoice date must be in the past and no older than X months
  • The due date must be no more than Y days/months in the future
  • For each line item, the quantity multiplied by the unit price must equal the line amount
  • The sum of line amounts must equal the net amount or the total amount
  • etc.
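Here is a minimal sketch implementing a few of these rules; the field names and the rounding tolerance are assumptions, and the real checks belong in your RPA workflow.

```python
# Check a few of the business rules listed above against extracted fields.

def check_invoice(doc: dict, tol: float = 0.01) -> list[str]:
    failures = []
    if abs(doc["net"] + doc["tax"] - doc["total"]) > tol:
        failures.append("net + tax must equal total")
    for i, line in enumerate(doc.get("lines", [])):
        if abs(line["qty"] * line["unit_price"] - line["amount"]) > tol:
            failures.append(f"line {i}: qty * unit price must equal line amount")
    if abs(sum(l["amount"] for l in doc.get("lines", [])) - doc["net"]) > tol:
        failures.append("sum of line amounts must equal net amount")
    return failures  # non-empty -> route to Validation Station with these reasons

doc = {"net": 100.0, "tax": 19.0, "total": 119.0,
       "lines": [{"qty": 2, "unit_price": 50.0, "amount": 100.0}]}
print(check_invoice(doc))  # []
```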

In particular, the confidence levels of Column fields should rarely be used as an error-detection mechanism. Column fields (e.g., line items on invoices or POs) can have dozens of values, and over so many values at least one is quite likely to have low confidence, so setting a minimum threshold there would route most or all documents to human validation, often unnecessarily.

Business rules must be enforced as part of the RPA workflow, and any business rule failure should be passed to the human validator, to direct their attention and speed up the review.

10. Choose a Confidence Threshold (Optional)


After the business rules have been defined, there sometimes remains a small number of fields for which there are no business rules, or for which the business rules are unlikely to catch all errors. For these, you may need to use a confidence threshold as a last resort.

The main tool for setting this threshold is the Evaluation pipeline in AI Center, and specifically the Excel spreadsheet the Evaluation pipeline outputs in the Outputs > Artifacts > eval_metrics folder.

This spreadsheet contains a column for each field and a column with the confidence level of each prediction. You can add a column called min_confidence that takes the minimum of the confidences over all fields that are important for your business process and are not already covered by business rules. For instance, you may not want to put a threshold on the line-item confidences, but rather on vendor name, total amount, date, due date, invoice number, and other essential fields. By sorting the table on the min_confidence column, you can see where the errors start appearing and set a threshold above that level, ensuring that only correctly extracted documents are sent straight through.
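A sketch of that analysis using pandas; the file path and the confidence column names are assumptions that depend on your pipeline output.

```python
# Compute a min_confidence column from the evaluation spreadsheet and sort.
import pandas as pd

df = pd.read_excel("artifacts/eval_metrics/evaluation.xlsx")  # assumed path

# Confidence columns for fields not already covered by business rules
# (hypothetical column names).
key_fields = ["vendor-name confidence", "total-amount confidence",
              "invoice-no confidence", "date confidence"]

df["min_confidence"] = df[key_fields].min(axis=1)
df = df.sort_values("min_confidence")

# Inspect where errors start appearing, then set the threshold above that point.
print(df[["min_confidence"] + key_fields].head(20))
```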

11. Train with Data from Validation Station


Validation Station data can help improve the model predictions, yet it often turns out that most errors are due not to the model itself but to the OCR, to labelling errors or inconsistencies, or to post-processing issues (e.g., date or number formatting). So, the first key point is that Validation Station data should be used only after the other data extraction components have been verified and optimized for good accuracy, and the only remaining area of improvement is the model prediction itself.

The second key aspect is that Validation Station data has a lower information density than the data labeled in Document Manager. Fundamentally, the Validation Station user only cares about getting the right value once. If an invoice has 5 pages, and the invoice number appears on every page, the Validation Station user validates it only on the first page. So, 80% of the values remain unlabelled. In Document Manager, all the values are labelled.

Finally, keep in mind that Validation Station data needs to be added to the original manually labelled dataset, so that you always have a single training dataset that grows over time. You always need to train on the ML Package with minor version 0 (zero), which is the version released out of the box by UiPath.

❗️

Always add Validation Station data to the same dataset, and always train on ML Package minor version 0 (zero)

It is often wrongly assumed that Validation Station data should be used to iteratively retrain the previous model version, i.e., that the current batch is used to train package X.1 and obtain X.2, the next batch is trained on X.2 to obtain X.3, and so on. This is the wrong way to use the product. Each Validation Station batch needs to be imported into the same Document Manager session as the original manually labelled data, producing a larger dataset, which must then always be used to train on ML Package version X.0.

Validation Station data can potentially have a much higher volume, since it is generated in the production workflow. Consequently, you need a guideline for how much of it is likely to be useful, given that model training requires time and infrastructure. Moreover, you do not want the dataset to be overwhelmed by Validation Station data, as this can degrade model quality due to the information-density issue mentioned above.

The recommendation is to add at most 2-3X the number of pages of Document Manager data and, beyond that, to only cherry-pick those vendors or document samples where you see major failures. If there are known major changes to the production data, such as a new language or a new geographic region being onboarded to the business process (say, expanding from the US to Europe or South Asia), then representative data for those languages and regions should be added to Document Manager for manual labelling. Validation Station data is not appropriate for such a major scope expansion.
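Expressed as trivial arithmetic (illustrative only), the 2-3X guideline gives a page budget like this:

```python
# Cap on Validation Station pages relative to manually labelled
# Document Manager pages, per the 2-3X guideline above.

def validation_station_page_budget(document_manager_pages: int) -> tuple[int, int]:
    return 2 * document_manager_pages, 3 * document_manager_pages

print(validation_station_page_budget(500))  # (1000, 1500), as in the scenario below
```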

Here is an example scenario. You chose a good OCR engine and labelled 500 pages in Document Manager, which resulted in good performance. You then deployed the model to a production RPA workflow, and Validation Station started generating data. You should randomly select up to a maximum of 1000-1500 pages from Validation Station, import them into Document Manager together with the first 500 pages, and retrain your ML model. Afterwards, run an Evaluation pipeline to make sure the model has actually improved, and then deploy the new model to production.

12. The Auto-Fine-Tuning Loop (Preview)


The Auto-Fine-Tuning Loop is a Preview capability that is useful for maintaining a high-performing model you have already built using the steps described above. To ensure that auto-fine-tuning produces better versions of the model, it is critical that you have a good Evaluation dataset and that you use an automatically scheduled Full Pipeline, which runs both Training and Evaluation at the same time. This way you can see whether the most recent Training produced a more accurate model than the previous one; if so, you are ready to deploy the new model version to the ML Skill invoked by the Robots in your business process.

📘

Note:

The Training dataset changes continually as more data comes in and Document Manager exports it periodically, as configured in the Scheduled Export dialog. The Evaluation runs on the same Evaluation dataset you specify in the pipeline. Evaluation datasets never change automatically; they always need to be curated, labelled, and exported manually. You should rarely change your Evaluation set, so that the accuracy scores of different Training runs remain comparable.

AI Center offers the option to automatically update the ML Skill whenever a new version of an ML Package is retrained. However, this automatic update does not take the score of the Full Pipeline into account, so it is not recommended to use this feature with Document Understanding auto-retraining pipelines.

As mentioned above in the Create an Evaluation Dataset section, Evaluation pipeline implementations for ML Packages version 21.10 or later calculate scores on a per-document basis, which accurately reflects the results you see in an RPA workflow. This assumes your dataset was labelled on a per-document basis in Document Manager. You can tell whether a multi-page document is labelled on a per-document basis if you can scroll naturally through its pages as in a regular PDF reader. If you need to click Next to pass from one page to the next, then each page is treated as a separate document.

13. Deploy Your Automation


Make sure to use the Document Understanding Process from the Templates section in the Studio start screen in order to apply best practices in Enterprise RPA architecture.
