The Import data dialog box enables you to easily import new documents to be labeled or revised.
Click the Import button from the management bar.
The dialog box contains the following controls:
- Batch name text field - it is mandatory to enter a name for your import batch, otherwise the Browse or drop files section is disabled; a valid name can have up to 24 characters and should not contain special characters.
- Make this an evaluation set checkbox - if selected, the dataset is used for evaluation purposes and is ignored by the Training Pipelines.
- Enable large documents checkbox - if selected, you can upload documents with more than 150 pages.
- Browse or drop files section - click the Browse files to upload to navigate through your directory or simply drag and drop the files inside the frame.
- Status section - click (load previous import log) to check the status of the latest import; when you upload data, the Status section shows an overview of your files and prompts you to proceed with the import by clicking YES or to abort it by clicking CANCEL.
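The batch name rule above can be checked before you open the dialog, for example when batch names are generated by a script. The sketch below is illustrative only: the 24-character limit comes from the description above, but the exact set of characters Data Manager treats as "special" is not stated here, so the allowed set (ASCII letters, digits, spaces, hyphens, underscores) is an assumption, and `is_valid_batch_name` is a hypothetical helper name.

```python
import re

MAX_BATCH_NAME_LENGTH = 24  # limit stated in the Batch name field description

# Assumption: "special characters" means anything other than ASCII letters,
# digits, spaces, hyphens, and underscores. The real rule may differ.
_ALLOWED_NAME = re.compile(r"[A-Za-z0-9 _-]+")


def is_valid_batch_name(name: str) -> bool:
    """Return True if the name is non-empty, at most 24 characters long,
    and contains only characters from the assumed allowed set."""
    if not name or len(name) > MAX_BATCH_NAME_LENGTH:
        return False
    return _ALLOWED_NAME.fullmatch(name) is not None
```

For instance, `is_valid_batch_name("invoices_2021-10")` passes under these assumptions, while a 25-character name or a name containing `#` does not.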
Labeling multi-page documents
The 2021.10 release of Data Manager supports labeling multi-page documents. This is a major change from previous releases where each page was labeled separately. Labeling and exporting multi-page documents assumes each document represents a single logical document. For instance, a six-page document may contain a single six-page invoice but it should not contain three different invoices, two pages each. This is particularly important for evaluation sets.
This requirement is not relevant for Backwards-compatible exports.
Data Manager supports four types of import:
- Schema import
- Raw documents import (max 2000 pages or 2GB per import)
- Data Manager dataset import (max 2000 pages or 2GB per import)
- Validation Station dataset import (max 2000 pages or 2GB per import)
If you would like to launch a new Data Manager session using the same schema as in an existing session, you can follow these steps:
- Click the Export button from the management bar.
- In the Export files dialog box, check the Schema option.
- Click the Export button inside the dialog box. A .zip file is exported.
- Click the Import button from the management bar.
- Upload or drag & drop the .zip file directly into the new Data Manager session (do not unzip). In this step, you can also upload a predefined schema.
- Click YES in the Status section to proceed with the import. The schema is imported.
You could also use one of the predefined schemas provided in the Configure Data Manager page.
The types of documents that can be imported for labeling are:
.zip files are not supported for raw documents import.
OCR settings need to be configured before import.
Follow the steps below:
- Click the Import button. The Import data dialog box is displayed.
- Provide a batch name in the Batch name field. This enables you to easily filter and find these documents using the Search drop-down later on.
- If you want to use this document batch for training an ML model, leave the Make this an evaluation set checkbox unselected.
- If you want to use this document batch for evaluating an ML model (i.e. measuring its performance), select the Make this an evaluation set checkbox. This ensures the data is ignored by the Training Pipelines.
- If you have documents with more than 150 pages, select the Enable large documents checkbox. Otherwise, leave the checkbox unselected.
- Upload or drag & drop a file or set of files into the Browse or drop files section.
- Click YES. The file or set of files is imported.
To import a dataset that was previously labeled in another Data Manager session, you need to get the .zip file which was exported originally, and import it directly into the new Data Manager instance.
If your new Data Manager instance is completely empty (no data and no fields defined), then both the documents with labels and the schema are imported.
If your new Data Manager instance already has fields defined, then the newly imported dataset needs to have the same fields, or a subset of those fields. Otherwise, the import is rejected.
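The two cases above amount to a subset check on field names. The following is a minimal sketch of that rule, not Data Manager's actual implementation; the function names and the field names in the usage note are hypothetical.

```python
from typing import Iterable, List


def unexpected_fields(existing_fields: Iterable[str],
                      dataset_fields: Iterable[str]) -> List[str]:
    """Return the dataset fields that are not defined in the target session."""
    return sorted(set(dataset_fields) - set(existing_fields))


def can_import_dataset(existing_fields: Iterable[str],
                       dataset_fields: Iterable[str]) -> bool:
    """Mirror the rule described above.

    - A session with no fields defined accepts the dataset (assuming it also
      holds no data, in which case the schema is imported with the documents).
    - Otherwise the dataset fields must be the same as, or a subset of,
      the fields already defined.
    """
    existing = set(existing_fields)
    if not existing:
        return True
    return set(dataset_fields) <= existing
```

With a session that defines `invoice-no`, `total`, and `date`, a dataset using only `invoice-no` and `total` would be accepted, while one that also uses `vendor` would be rejected, and `unexpected_fields` would report `["vendor"]`.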
To import Data Manager datasets that are larger than 1GB or that have more than 1500 files, we recommend using this script, which splits the .zip file into multiple .zip files that are each smaller than 1GB and have fewer than 1500 files.
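To illustrate what staying under both limits means, here is a small sketch that groups files into batches, each below 1GB and below 1500 files. It is not the script referenced above and it knows nothing about how an exported dataset keeps documents and their labels together, so treat it as an explanation of the limits rather than a replacement for that script. `plan_batches` is a hypothetical name, and 1GB is assumed to mean 1024³ bytes.

```python
from typing import Iterable, List, Tuple

MAX_BATCH_BYTES = 1024 ** 3  # assumption: "1GB" means 1024**3 bytes
MAX_BATCH_FILES = 1500       # each batch must have fewer files than this


def plan_batches(files: Iterable[Tuple[str, int]],
                 max_bytes: int = MAX_BATCH_BYTES,
                 max_files: int = MAX_BATCH_FILES) -> List[List[str]]:
    """Greedily group (name, size_in_bytes) pairs, in order, into batches
    whose total size stays below max_bytes and whose file count stays
    below max_files."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_bytes = 0
    for name, size in files:
        if size >= max_bytes:
            raise ValueError(f"{name} alone reaches the size limit")
        # Start a new batch if adding this file would reach either limit.
        if current and (current_bytes + size >= max_bytes
                        or len(current) + 1 >= max_files):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

The function only produces a plan (lists of file names); writing the actual archives, and keeping related files in the same archive, is left out on purpose.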
As your RPA workflow processes documents using an existing ML model, some documents may require human validation using the Validation Station activity (available on attended bots or in the browser using Orchestrator Action Center).
The validated data generated in Validation Station can be exported using the Machine Learning Extractor Trainer activity and can be used to train ML models using the feature described below.
For Validation Station dataset import, it is mandatory to have a schema defined.
Follow the steps below:
- Configure the Machine Learning Extractor Trainer to output data into a folder with path <Trainer/Output/Folder> (use any empty folder path).
- Run an RPA workflow including Validation Station and Machine Learning Extractor Trainer.
- Machine Learning Extractor Trainer creates three subfolders: documents, metadata, and predictions inside of the output folder.
- Zip the <Trainer/Output/Folder> to obtain a .zip file, for instance TrainerOutputFolder.zip.
- Import the .zip file into Data Manager, which detects that the import contains data produced by Machine Learning Extractor Trainer and imports the data accordingly.
If there are missing fields required by the dataset, an error message is displayed in the import dialog box.
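The zipping step in the procedure above can be scripted. The sketch below first checks that the three subfolders created by Machine Learning Extractor Trainer (documents, metadata, predictions) are present, then zips the output folder using Python's standard `shutil.make_archive`. The helper name `zip_trainer_output` is hypothetical, and keeping the folder itself as the top-level entry of the archive is an assumption about the layout Data Manager expects; it simply mirrors zipping the folder by hand.

```python
import shutil
from pathlib import Path

# Subfolders that Machine Learning Extractor Trainer creates in its output folder.
EXPECTED_SUBFOLDERS = ("documents", "metadata", "predictions")


def zip_trainer_output(output_folder: str, archive_base: str) -> str:
    """Zip the Trainer output folder and return the path of the .zip file.

    archive_base is the target path without the .zip extension.
    """
    folder = Path(output_folder).resolve()
    missing = [name for name in EXPECTED_SUBFOLDERS
               if not (folder / name).is_dir()]
    if missing:
        raise FileNotFoundError(
            f"Missing subfolders in {folder}: {', '.join(missing)}")
    # Assumption: the folder itself is kept as the top-level zip entry.
    return shutil.make_archive(
        str(Path(archive_base).resolve()), "zip",
        root_dir=folder.parent, base_dir=folder.name)
```

For example, `zip_trainer_output("TrainerOutputFolder", "TrainerOutputFolder")` would write TrainerOutputFolder.zip to the current directory, ready to be imported.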