Document Understanding User Guide

Last updated Apr 4, 2025

Export Documents

The Export files dialog box enables you to easily export data for training ML models.

To open it, click the Export button in the management bar.

The dialog box contains three tabs:

Export Now

The Export Now tab allows you to:

Download the data locally using the Download button.
Export the data to AI Center using the Export button. The exported folders can be found in AI Center under the export folder (Datasets > dataset_name > export).

If no schema is defined, all export options are disabled.

If a schema is defined, you must enter a name for the export; otherwise, the Download and Export buttons remain disabled. A valid name has at most 24 characters and contains no special characters.
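The naming rule above can be sketched as a small check. This is only an illustration, not Document Manager's actual validation: the function name `is_valid_export_name` is hypothetical, and the sketch assumes that "special characters" means anything other than ASCII letters, digits, underscores, and hyphens, which the guide does not specify.

```python
import re

MAX_EXPORT_NAME_LENGTH = 24  # limit stated in the guide

# Assumption: only ASCII letters, digits, "_" and "-" count as non-special.
_ALLOWED = re.compile(r"[A-Za-z0-9_-]+")


def is_valid_export_name(name: str) -> bool:
    """Return True if the name is non-empty, at most 24 characters long,
    and made up only of the allowed characters."""
    if not name or len(name) > MAX_EXPORT_NAME_LENGTH:
        return False
    return _ALLOWED.fullmatch(name) is not None
```

Checking a candidate name this way before opening the dialog box can help explain why the Download and Export buttons stay disabled.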

You can choose to export one of the following options:

Current search results - the labeled documents filtered by a predefined keyword/named batch or by a text query. If no filter is applied, all labeled documents in the current view are exported.
All labelled - all documents with at least one labeled field of any kind; more precisely, the documents returned by the labelled filter.
Schema - a zip file containing the fields and their configurations which can be imported into a different Document Manager session.

The Backwards-compatible export checkbox applies the legacy export behavior, which exports each page as a separate document. Try this option if a model trained on the default export performs below expectations. Leave the checkbox unselected to export the documents in their original multi-page form.

Important:

The 2021.10 release of Document Manager supports labeling multi-page documents. This is a major change from previous releases where each page was labeled separately. Labeling and exporting multi-page documents assumes each document represents a single logical document. For instance, a six-page document may contain a single six-page invoice but it should not contain three different invoices, two pages each. This is particularly important for evaluation sets.

This requirement is not relevant for Backwards-compatible exports.

Export Validation

To export a dataset, every field must be labeled in at least 10 different documents. Otherwise, the export fails with an error message.

For Classification fields, there is an additional requirement: each option must be labeled in at least one document. Otherwise, the export fails with an error message.

When exporting only Evaluation set data, all validations are disabled.
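The validation rules in this section can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the product's implementation: the function `validate_export` and its in-memory inputs (one dictionary of labeled fields per document) are hypothetical, and the returned strings are not the messages Document Manager displays.

```python
from collections import defaultdict

MIN_DOCS_PER_FIELD = 10  # threshold stated in the guide


def validate_export(fields, classification_options, documents, evaluation_only=False):
    """Return a list of problems; an empty list means the export would pass.

    fields: names of all fields in the schema.
    classification_options: maps each Classification field to its options.
    documents: one dict per document, mapping labeled field name -> value.
    """
    if evaluation_only:
        # Exports of Evaluation set data only skip all validations.
        return []

    docs_per_field = defaultdict(int)
    seen_options = defaultdict(set)
    for labels in documents:
        for field, value in labels.items():
            docs_per_field[field] += 1
            if field in classification_options:
                seen_options[field].add(value)

    problems = []
    for field in fields:
        if docs_per_field[field] < MIN_DOCS_PER_FIELD:
            problems.append(
                f"field '{field}' is labeled in {docs_per_field[field]} documents, "
                f"needs at least {MIN_DOCS_PER_FIELD}"
            )
    for field, options in classification_options.items():
        for option in options:
            if option not in seen_options[field]:
                problems.append(f"option '{option}' of field '{field}' is never labeled")
    return problems
```

For example, ten documents that each label a `currency` field as `USD` satisfy the per-field threshold, but the check still reports a problem if the field also defines an `EUR` option that no document uses.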

Dataset Format

The dataset exported from Document Manager is a folder that includes:

schema.json: a file containing the fields to be extracted and their types
split.csv: a file specifying, for each document, whether it is used for TRAIN or VALIDATE during the Training Pipeline
images: a folder containing images of all the labeled pages
latest: a folder containing .json files with the labeled data from each page
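As a quick sanity check, the layout listed above can be verified before using an exported folder. The sketch below only tests that the four entries exist; it does not inspect their contents, since the guide does not describe the internal structure of the files. The function name `missing_dataset_entries` is hypothetical.

```python
from pathlib import Path

# Entries listed in the guide: two files and two folders.
EXPECTED_FILES = ("schema.json", "split.csv")
EXPECTED_DIRS = ("images", "latest")


def missing_dataset_entries(dataset_dir):
    """Return the names of expected files or folders that are absent."""
    root = Path(dataset_dir)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing
```

An empty result means all four entries are present in the folder.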