UiPath AI Fabric

Using Data Manager

Importing

❗️

Exporting to AI Fabric Cloud or AI Fabric on premises 2020.7 or 2020.10

AI Fabric Cloud and AI Fabric on-premises 2020.7 and 2020.10 do not support filenames containing special characters. Before importing documents into Data Manager, we strongly recommend making sure that their names contain only Latin characters, numbers, dashes (-), and underscores (_).
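As a quick pre-import check, the filename rule above can be enforced with a short script. This is an illustrative sketch, not part of any UiPath tooling; the helper names are mine:

```python
import re
from pathlib import Path

# Allowed in the name (extension excluded): Latin letters, digits, dash, underscore.
SAFE_STEM = re.compile(r"^[A-Za-z0-9_-]+$")

def is_safe(filename: str) -> bool:
    """Return True if the filename (without extension) uses only supported characters."""
    return bool(SAFE_STEM.match(Path(filename).stem))

def sanitize(filename: str) -> str:
    """Replace every unsupported character in the name with an underscore."""
    p = Path(filename)
    stem = re.sub(r"[^A-Za-z0-9_-]", "_", p.stem)
    return stem + p.suffix
```

Running `sanitize` over a folder of documents before import avoids export failures later.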

There are four types of import supported in Data Manager:

  • Schema Import
  • Raw documents import
  • Data Manager dataset import
  • Validation Station dataset import (PREVIEW feature)

Schema import

If you would like to launch a new instance of Data Manager using the same schema as an existing instance, you can follow these steps:

  1. Enter a random string in the filter of the existing instance, so that no documents remain in the view.
  2. Click the Export button. A zip file is exported.
  3. Import the zip file directly into the new instance of Data Manager (do not unzip it). The schema will be imported.

You may also use one of the predefined schemas provided in the Configuring Data Manager section of this documentation.

Raw documents import

The types of documents that can be imported for labeling are: .pdf, .tiff, .png, .jpg. The steps are:

  1. Click Import. The Import Data window is displayed.
  2. Provide a batch name in the Batch Name field. This enables you to easily filter and find these documents using the Filter drop-down later on.
  3. If you want to use this document batch for training an ML model, leave the Make this a test set checkbox unselected.
  4. If you want to use this document batch for evaluating an ML model (i.e. measuring its performance), select the Make this a test set checkbox. This ensures the data is ignored by the training pipelines.
  5. Upload or drag & drop a file or set of files into the Browse or drop files section.
    Any type of file is accepted. The application inspects them and indicates how many of them can be imported. .zip files are also accepted. The application unzips the archive and goes through folders recursively to find all files inside.
    Importing a dataset zip file exported from another Data Manager instance will import the documents with the labels. This works only if the dataset schema is the same or is a subset of the pre-existing schema in the Data Manager.
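The recursive file discovery described above can also be done locally, for example to pre-bundle a folder of scattered documents into one zip before import. A minimal sketch, assuming nothing beyond the supported extensions listed above (function names are illustrative, not a UiPath API):

```python
import zipfile
from pathlib import Path

# Document types Data Manager accepts for labeling.
SUPPORTED = {".pdf", ".tiff", ".png", ".jpg"}

def collect_documents(folder: str) -> list:
    """Recursively find files with a supported extension (case-insensitive)."""
    return sorted(p for p in Path(folder).rglob("*")
                  if p.suffix.lower() in SUPPORTED)

def build_import_zip(folder: str, out_zip: str) -> int:
    """Bundle all supported files into one archive; returns the file count."""
    docs = collect_documents(folder)
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for doc in docs:
            zf.write(doc, doc.relative_to(folder))  # keep relative paths
    return len(docs)
```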

Data Manager dataset import

To import a dataset that was labelled previously on another instance of Data Manager, you need to get the zip file which was exported originally, and import it directly into the new Data Manager instance. If your new Data Manager instance is completely empty (no data and no fields defined) then both the data and the schema will be imported. If your new Data Manager instance already has fields defined, then the newly imported dataset needs to have the same fields, or a subset of those fields. Otherwise the import will be rejected.

Validation Station dataset import (PREVIEW feature)

As your RPA workflow processes documents using an existing ML model, some documents may require human validation using the Validation Station activity (available on attended bots or in the browser using Orchestrator Action Center). The validated data generated in Validation Station can be exported using Machine Learning Extractor Trainer activity, and can be used to train ML models using the feature described here. The steps involved are:

  1. Configure ML Extractor Trainer to output data into a folder with path <Trainer/Output/Folder> (use any empty folder path).
  2. Run RPA workflow including Validation Station and ML Extractor Trainer.
  3. ML Extractor Trainer will create 3 subfolders named: documents, metadata and predictions inside of the output folder.
  4. Zip the <Trainer/Output/Folder> to obtain a zip file such as TrainerOutputFolder.zip
  5. Import zip file into Data Manager. The Data Manager will detect that the import contains data produced by ML Extractor Trainer and will import the data accordingly.
  6. Export data as usual, and upload to AI Fabric.
  7. Launch Training pipeline or Full pipeline and make sure to select the ML Package and version which you would like to fine tune.

Adding and Configuring Fields

Fields cannot be deleted or renamed, so please think carefully before adding new fields. If, however, there are fields which you later decide you do not want to use for training an ML model, you can always hide them using the Hidden checkbox in the Edit Field window.
Click here for details about fields, their meaning and when to use them.

Column Fields

A line item Description or Unit Price on an invoice document would be examples of Column fields.

  1. Click in the table section at the top of the page to add a new Column field. The Create Column Field window is displayed.
  2. In the Enter Unique Field Name field, fill in a unique name for the field. The field does not accept uppercase letters.
  3. Click Create. The Edit Field window is displayed.
  4. From the Content Type drop-down, select the content type.
  5. From the Scoring drop-down, select the measure used to determine accuracy when running evaluations of model predictions.
  6. Click the Hotkey field and press a key on your keyboard to automatically populate it.
  7. In the Color field, fill in the hex code of the desired field color.
  8. Select the Multi line checkbox if the field to be checked against might span across multiple text lines, such as addresses or descriptions. If this option is not selected, only the first line is returned.
  9. Select the Split items checkbox if you want this field to be used as a delimiter between line items or rows in a table. Any line on which this field appears is considered to be a new line item or row in the table. Most commonly this is used on Line Amount fields on Invoice line items.
  10. Select the Hidden checkbox if you do not want this field to be part of exported datasets.
  11. Click Save to save your settings.

Regular Fields

These are fields which appear only once on a given document. An Invoice Number or Total Amount on an invoice document would be examples of Regular fields.

  1. Click on the right pane in the Regular Fields section. The Create Regular Field window is displayed.
  2. Fill in a unique name for the field in the Enter Unique Field Name field. The field does not accept uppercase letters.
  3. Click Create. The Edit Field window is displayed.
  4. Select the content type from the Content Type drop-down.
  5. From the Post processing drop-down, select the mechanism to apply in case the model predicts more than one instance of the field on a given page.
  6. Click the Hotkey field and press a key on your keyboard to automatically populate it.
  7. In the Color field, fill in the hex code of the desired field color.
  8. From the Multi page drop-down, select the data retrieval strategy. This option defines how the model decides which value to return when the field appears on several different pages of a multi-page document.
  9. From the Scoring drop-down, select the measure used to determine accuracy when running evaluations of model predictions.
  10. Select the Multi line checkbox if the field to be checked against might span across multiple text lines, such as addresses or descriptions. If this option is not selected, only the first line is returned.
  11. Select the Hidden checkbox if you do not want this field to be part of exported datasets.
  12. Click Save to save your settings.

Classification Fields

Data points which refer to a document as a whole. For instance, the Expense Type of a receipt (Food, Hotel, Airline, Transportation) or the Currency of an invoice (USD, EUR, JPY) would be examples of Classification fields.

  1. Click on the right pane in the Classification Fields section. The Create Classification Field window is displayed.
  2. Fill in a unique name for the field in the Enter Unique Field Name field. The field does not accept uppercase letters.
  3. Click Create. The Edit Field window is displayed.
  4. In the text area, fill in the names of the classes as a comma-separated list.
  5. Click Save to save your settings.

🚧

Classification fields are not retrained

Contrary to Regular and Column fields, Classification fields are not retrained. For example, if you retrain the Invoices model on a dataset containing only USD and INR invoices, the resulting model will only be able to recognize those two currencies for the Currency field.

Labeling Data

Data preparation

For the volumes of documents needed, see the Training and Retraining Pipelines section here.

When selecting the documents to be used for training, you also need to be aware of a few details. First, you need to remove garbage pages which do not include fields of interest, or which include only one or two. You can do this in Data Manager using the Delete button. Pages are not lost; they can always be recovered from the Deleted view.

Then, if your use case involves a highly diverse document type (like invoices or receipts), you need a highly diverse training set. You do not need more than 3-4 documents (i.e. ~5-10 pages if there are 2 pages per document on average) from a given layout. However, if your use case involves a document type with a very consistent layout (like a form), you need at least 30 samples of it, because if the training set is too small, the ML model training might fail.

Multiple users labelling in parallel

You can have multiple people use the same instance to label at the same time only if the following conditions are observed:

  • no two users should be labelling the same document at the same time
  • whenever fields are added, removed or their configuration is edited, this should be done by one user and all other users should immediately refresh their browser to see the changes. Making changes to fields while other people are labelling will cause unexpected behavior.

Labelling for training

When you import a dataset without selecting the Make this a test set checkbox in the Import Data dialog, that dataset will be used for training. In this case you only need to focus on labelling the words (grey boxes) on the document. If once in a while the text that gets filled into the sidebar fields is not correct, that is not a problem; the ML model will still learn. In some cases, you may need to adjust the configuration of the fields, for instance by selecting the Multi line checkbox. But in general, the main focus is on labelling the words on the page.

Fields which occur multiple times on the same document

There are many situations where a field appears in multiple places in the same document, or even on the same page. These should all be labelled, as long as they have the same meaning. A common example, from many utility bills, is the total amount. It often appears at the top, again within a line item list in the middle, and then also in a pay slip at the bottom, which can be detached and sent in the mail with the check. In this situation, all three occurrences would be labelled. This is useful because if there is an OCR error, or the layout is different, and one occurrence cannot be identified, the model can still identify the others.

It is important to note that what counts is the meaning of the value, not the value itself. For instance, on some invoices which carry no tax, the net amount and the total amount have the same value, but they are clearly different concepts. Consequently, they should not both be labelled as total amount. Only the one whose meaning is to represent the total amount should be labelled as such.

Labelling for testing

When you import a dataset and select the Make this a test set checkbox in the Import Data dialog, that dataset will not be used by Training pipelines in AI Fabric, but only by Evaluation pipelines. In this case, it is important that the correct text is filled into the fields in the sidebar (or the top bar in the case of Column fields). This takes much longer to verify for each field, but it is the only way you will get a reliable metric of the accuracy of the ML model you are building.

Labelling actions

See below the main actions you need to perform when labeling documents. A given field may be labeled in multiple places on the same page.

  1. Label field
    • Select words by dragging mouse (rubber banding) or by clicking on them, holding down Shift to select multiple words.
    • Tap the shortcut key to label the field
  2. Remove label
    • Select words, then tap the Delete or Backspace key on your keyboard.
  3. Group table row
    • After you have labeled some Column fields, and only if some rows span multiple lines of text, then you may group them together by using the “/” key to indicate that they are part of the same table row. A green box will appear around the group.
  4. Ungroup table row
    • Select the group and tap “/” again
  5. Make correction to OCR
    • Right-click on the word and edit the text in the tooltip that appears. This is rarely recommended, since when in production the OCR will still make those errors. Consequently, it is usually best to just skip and move on.
  6. Make correction to labeled value
    • Click on the text in the sidebar or the top bar and edit the content. A small lock will appear to indicate the field has been manually edited. This is necessary when labelling test sets.
  7. Reset labeled value to auto-extracted value
    • Click on the lock, and the field will revert to its auto-extracted value.

Exporting Labeled Documents

A labelled image is an image with at least one labelled field, of any kind. You can see how many images are visible at the top-left of the page. The Export button enables you to easily export data for training ML models.

Exporting labeled documents takes into consideration the active filter:

  • If you have no filter applied, all labeled images visible in the current view except for testset images are exported.
  • If you have applied a filter, all labeled images visible in the view, including testset images are exported.
  • If you want to export all testset images, select the test-set option from the Filter drop-down.

🚧

Export requirements

Exporting a dataset requires that the following conditions be satisfied:

  • each Regular or Column field is labelled on at least 10 different images
  • each class of any one Classification field appears at least once
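The two export conditions above can be pre-checked with a small script. The data shapes here are illustrative, not Data Manager's internal format; only the thresholds come from the documentation:

```python
def check_export_requirements(field_labels, class_labels):
    """Return a list of problems, empty if the dataset can be exported.

    field_labels: {field_name: set of image ids where the field is labelled}
    class_labels: {classification_field: {class_name: occurrence count}}
    """
    problems = []
    # Each Regular or Column field must be labelled on at least 10 images.
    for field, images in field_labels.items():
        if len(images) < 10:
            problems.append(f"{field}: labelled on {len(images)} images, need 10")
    # Each class of every Classification field must appear at least once.
    for field, counts in class_labels.items():
        for cls, n in counts.items():
            if n < 1:
                problems.append(f"{field}: class '{cls}' never appears")
    return problems
```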

Uploading dataset to AI Fabric

The dataset is exported as a zip file together with a log file. Before you can use it in AI Fabric, you need to unzip the file. The extracted folder can then be uploaded as a new dataset, or as a subfolder of an existing dataset, as described here.
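The unzip step can be done in the AI Fabric upload UI's expected layout simply by extracting the archive locally first. A minimal sketch (the helper name is mine):

```python
import zipfile
from pathlib import Path

def extract_export(zip_path: str, dest: str) -> Path:
    """Unzip a Data Manager export so the folder can be uploaded to AI Fabric."""
    out = Path(dest)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out)
    return out
```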
