Document Understanding User Guide

Last updated Apr 4, 2025

Import Documents

The Import data dialog box enables you to easily import new documents to be labeled or revised.

Click the Import button from the management bar.

The dialog box contains the following controls:

Batch name text field - a name for the import is mandatory; until you enter one, the Browse or drop files section is disabled. A valid name can have up to 24 characters and should not contain special characters.
Make this an evaluation set checkbox - if selected, the dataset is used for evaluation purposes.
Enable large documents checkbox - if selected, you can upload documents with more than 150 pages.
Browse or drop files section - click Browse files to upload to navigate through your directory, or drag and drop the files inside the frame.
Status section - click (load previous import log) to check the status of the latest import. When you upload data, the Status section shows an overview of your files and prompts you to proceed with the import by clicking YES or to abort it by clicking CANCEL.
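
The batch name rule above can be checked before you open the dialog box. The following Python sketch is illustrative only. The function name is_valid_batch_name is hypothetical, and because this guide does not list which characters count as special, the sketch assumes that only ASCII letters, digits, spaces, underscores, and hyphens are allowed.

```python
import re

MAX_BATCH_NAME_LENGTH = 24  # limit stated in this guide

# Assumption: "special characters" means anything other than ASCII letters,
# digits, spaces, underscores, and hyphens. The guide does not enumerate them.
_ALLOWED = re.compile(r"[A-Za-z0-9 _-]+")


def is_valid_batch_name(name: str) -> bool:
    """Return True if the name is non-empty, at most 24 characters long,
    and made up only of the characters assumed to be allowed."""
    if not name or len(name) > MAX_BATCH_NAME_LENGTH:
        return False
    return _ALLOWED.fullmatch(name) is not None
```

For example, is_valid_batch_name("invoices_2025_q1") passes the check, while a 25-character name or a name such as "batch#1" does not. Document Manager itself remains the authority on which names it accepts.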

Important:
The 2021.10 release of Document Manager supports labeling multi-page documents. This is a major change from previous releases, where each page was labeled separately. Labeling and exporting multi-page documents assumes that each file represents a single logical document. For instance, a six-page file may contain a single six-page invoice, but it should not contain three different invoices of two pages each. This is particularly important for evaluation sets.

This requirement is not relevant for Backwards-compatible exports.

Import Types

Document Manager supports four types of import:

Schema import
Raw documents import (max 2000 pages or 1GB per import)
Document Manager dataset import (max 2000 pages or 1GB per import)
Validation Station dataset import (max 2000 pages or 1GB per import)

Schema Import

If you would like to launch a new Document Manager session using the same schema as in an existing session, you can follow these steps:

Click the Export button from the management bar.
In the Export files dialog box, check the Schema option.
Click the Export button inside the dialog box. A .zip file is exported.
Click the Import button from the management bar.
Upload or drag & drop the .zip file directly into the new Document Manager session (do not unzip). In this step, you can also upload a predefined schema.
Click YES in the Status section to proceed with the import. The schema is imported.

You could also use one of the predefined schemas provided in the Use a Predefined Schema page.

Raw Documents Import

The types of documents that can be imported for labeling are: .pdf, .tiff, .png, .jpg.

.zip files are not supported for raw documents import.
OCR settings need to be configured before import.
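
The supported file types above can be checked locally before an upload. The following Python sketch is only an illustration. The function name split_by_support is hypothetical, and the sketch assumes that extensions are compared case-insensitively and that variants not listed in this guide (for example .tif or .jpeg) are treated as unsupported.

```python
from pathlib import Path

# Extensions listed in this guide for raw documents import.
SUPPORTED_EXTENSIONS = {".pdf", ".tiff", ".png", ".jpg"}


def split_by_support(paths):
    """Partition file paths into (supported, unsupported) lists by extension.

    Assumption: matching is case-insensitive, and anything not listed in
    SUPPORTED_EXTENSIONS (including .zip) is reported as unsupported.
    """
    supported, unsupported = [], []
    for p in paths:
        ext = Path(p).suffix.lower()
        if ext in SUPPORTED_EXTENSIONS:
            supported.append(p)
        else:
            unsupported.append(p)
    return supported, unsupported
```

Calling split_by_support(["a.pdf", "b.zip"]) places a.pdf in the first list and b.zip in the second, so unsupported files can be set aside before the import starts.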

Follow the steps below:

Click the Import button. The Import data dialog box is displayed.
Provide a batch name in the Batch name field. This enables you to easily filter and find these documents using the Search drop-down later on.
- If you want to use this document batch for training an ML model, leave the Make this an evaluation set checkbox unselected.
- If you want to use this document batch for evaluating an ML model (i.e. measuring its performance), select the Make this an evaluation set checkbox. This ensures the data is ignored by the Training Pipelines.
If you have documents with more than 150 pages, select the Enable large documents checkbox. Otherwise, leave the checkbox unselected.
Upload or drag & drop a file or set of files into the Browse or drop files section.
Click YES. The file or set of files is imported.

Document Manager Dataset Import

To import a dataset that was previously labeled in another Document Manager session, you need to get the .zip file which was exported originally, and import it directly into the new Document Manager instance.

If your new Document Manager instance is completely empty (no data and no fields defined), then both the documents with labels and the schema are imported.

If your new Document Manager instance already has fields defined, then the newly imported dataset needs to have the same fields, or a subset of those fields. Otherwise, the import is rejected.
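
The field compatibility rule above amounts to a subset check. The following Python sketch illustrates it under stated assumptions: the function name can_import and the idea of passing field names as plain lists of strings are hypothetical, and the sketch looks only at field names, not at field types or at whether the instance already holds data.

```python
def can_import(existing_fields, imported_fields):
    """Return (ok, extra_fields) for a dataset import.

    ok is True when the instance has no fields defined, or when every
    imported field already exists in the instance (same fields or a subset).
    extra_fields lists the imported fields that the instance does not define.
    """
    if not existing_fields:
        # Empty instance: per the guide, the schema is imported with the data.
        return True, []
    extra = sorted(set(imported_fields) - set(existing_fields))
    return (not extra), extra
```

For instance, an instance that defines total and date accepts a dataset that only uses total, while a dataset that also uses a vendor field is reported as incompatible with vendor listed as the extra field.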

Split large datasets

To import Document Manager datasets that are larger than 1GB or that contain more than 1500 files, we recommend using this script, which splits the .zip file into multiple .zip files that are each smaller than 1GB and contain fewer than 1500 files.
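
To make the two limits concrete, the following Python sketch shows one way to group files into batches that respect both of them. It is not the script referred to above and it knows nothing about the internal layout of a Document Manager export. The function name plan_batches is hypothetical, 1GB is assumed to mean 1024**3 bytes, and the sizes passed in are assumed to be a reasonable stand-in for each file's contribution to the final .zip size.

```python
MAX_FILES = 1499           # "fewer than 1500 files"
MAX_BYTES = 1024**3 - 1    # "smaller than 1GB", assuming 1GB = 1024**3 bytes


def plan_batches(files, max_files=MAX_FILES, max_bytes=MAX_BYTES):
    """Greedily group (name, size_in_bytes) pairs into batches.

    Files are taken in the given order. A new batch is started whenever
    adding the next file would exceed the file-count or the byte limit.
    """
    batches, current, current_bytes = [], [], 0
    for name, size in files:
        if size > max_bytes:
            raise ValueError(f"{name} alone exceeds the size limit")
        if current and (len(current) >= max_files
                        or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

The greedy approach keeps files in their original order and is simple to reason about, at the cost of sometimes producing one more batch than an optimal packing would. Files that belong together in an export (for example a document and its label data) would need to be grouped before batching, which this sketch does not attempt.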

Validation Station Dataset Import

As your RPA workflow processes documents using an existing ML model, some documents may require human validation using the Validation Station activity (available on attended bots or in the browser using Orchestrator Action Center).

The validated data generated in Validation Station can be exported using the Machine Learning Extractor Trainer activity and can then be used to train ML models using the feature described below.

Note: For Validation Station dataset import, it is mandatory to have a schema defined.

Follow the steps below:

Configure the Machine Learning Extractor Trainer to output data into a folder with path <Trainer/Output/Folder> (use any empty folder path).
Run an RPA workflow including Validation Station and Machine Learning Extractor Trainer.
Machine Learning Extractor Trainer creates three subfolders: documents, metadata, and predictions inside of the output folder.
Zip the <Trainer/Output/Folder> to obtain a .zip file, for instance TrainerOutputFolder.zip.
Import the .zip file into Document Manager, which detects that the import contains data produced by Machine Learning Extractor Trainer and imports the data accordingly.
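
The zipping step above can also be scripted. The following Python sketch checks that the three subfolders exist and then creates the archive with the standard library. The function name zip_trainer_output is hypothetical, and the sketch assumes that the output folder itself should be the top-level entry in the .zip file, as happens when you compress a folder from a file manager. This guide does not state which layout Document Manager expects.

```python
import shutil
from pathlib import Path

# Subfolders that Machine Learning Extractor Trainer creates, per this guide.
EXPECTED_SUBFOLDERS = ("documents", "metadata", "predictions")


def zip_trainer_output(output_folder, archive_stem):
    """Zip the trainer output folder and return the path of the .zip file.

    archive_stem is the target path without the .zip extension. It should
    point outside output_folder so the archive is not added to itself.
    """
    folder = Path(output_folder).resolve()
    missing = [n for n in EXPECTED_SUBFOLDERS if not (folder / n).is_dir()]
    if missing:
        raise FileNotFoundError(f"Missing subfolders: {', '.join(missing)}")
    # Assumption: keep the folder itself as the top-level entry in the archive.
    return shutil.make_archive(
        str(archive_stem), "zip",
        root_dir=str(folder.parent), base_dir=folder.name,
    )
```

Checking for the three subfolders first catches the case where the workflow has not yet produced any trainer output, before an incomplete archive is uploaded.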