Document Understanding Activities
Document data
Document Data is a resource that serves as both an input and an output variable within your Document Understanding workflows. The Document Data object holds all the necessary information about a single document. If you classify a document, the object includes the Document Type. If you extract data, the object contains the corresponding extracted fields. Irrespective of the activity, Document Data always contains the document's text and DOM (Document Object Model).
With Document Data you can: collect all the necessary information about a document in one variable, save data to each property of the object, and reuse it for other activities in the workflow.
Document Data holds information about the following attributes:
- DocumentType: Provides data about the identified Document type, populated by activities such as Classify Document or Create Classification Validation Task.
- Data: Contains the extracted field values. It is generated on demand by the Generate Data Type property, which generates an output type of IDocumentData<ExtractorType>. When the Generate Data Type property is set to False, you can access the extracted field values only through methods of type Get.
- FileDetails: Contains details about the IResource.
- SubDocuments: Includes a collection of Document Data, populated by activities such as Create Classification Validation Task.
- DocumentMetadata: Contains information about processing the document, such as:
  - Text detected language
  - Extracted fields as Data Table
  - Document Object Model (DOM): Holds the Document Object Model which is used by all activities.
Tip: Use Document Data as input for every activity except the first Document Understanding activity in a Studio workflow. Use the File variable as input only for that first Document Understanding activity.
Document Data also has get and set methods on it, designed for advanced implementations, to increase flexibility.
The Generate Data Type property in the Extract Document Data activity lets you choose whether the data is generated on demand. Refer to the following scenarios:
- When you set Generate Data Type to True (the default setting): Document Data outputs as IDocumentData<ExtractorType>. This data is generated on demand and changes based on modifications made in the Extract Document Data activity. With this setting, you cannot change the document type in the Validation Station, and JIT (Just in Time) is selected by default.
- When you set Generate Data Type to False: Document Data outputs as IDocumentData<DictionaryData>. With this setting, the Document Data property is no longer generated, and you cannot browse through it. You can access its data using specific methods, relying on the field ID. These IDs become available when configuring the document type or when retrieving the information using APIs. Visit Editing or adding new fields and Get the extraction request API for more information.
  - When you set Generate Data Type to False for generative extraction, the retrieved fields correspond to the names provided in the prompt. For example, if the field name in the prompt is defined as a b c (including the spaces), use that exact string as the field ID when using the specific methods.
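The dictionary-style access described for the False setting can be pictured with a small Python model. This is only a sketch under stated assumptions, not the UiPath API: the class DictionaryDataSketch, its get_field_value method, and the sample field IDs are all hypothetical names chosen for illustration. The point it shows is that lookups rely on the exact field ID, including any spaces.

```python
from typing import Dict, List


class DictionaryDataSketch:
    """Hypothetical stand-in for dictionary-style Document Data (not the UiPath API)."""

    def __init__(self, fields: Dict[str, List[str]]) -> None:
        # Maps a field ID to the values identified for that field.
        self._fields = fields

    def get_field_value(self, field_id: str) -> str:
        """Return the first value stored under field_id, or "" if it has none."""
        if field_id not in self._fields:
            # The lookup is exact: "abc" does not match "a b c".
            raise KeyError(f"Unknown field ID: {field_id!r}")
        values = self._fields[field_id]
        return values[0] if values else ""


# Illustrative data; "a b c" mimics a field name taken verbatim from a prompt.
data = DictionaryDataSketch({"invoice-no": ["INV-001"], "a b c": ["42"], "notes": []})
value = data.get_field_value("a b c")
```

In this model, asking for "abc" instead of "a b c" raises a KeyError, which mirrors why the field ID must match the prompt name character for character.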
When you use Document Data, the first output object is created from your input file. After you create this object, we recommend that you pass it along to your next activities so that they can reuse the Text and DOM from your original file. This approach saves you from re-digitizing the file each time.
If you configure a document type field to be multi-valued, the system expects multiple values. An example might be a multiple-choice question on a form. The results appear in the multi-value attribute on the field, returned as a list. If the document type field is configured to be single-valued, the system returns the result in the value attribute on the field by default.
The following table shows you how Document Data returns single and multi-value fields:
Field configuration | Has no value | Has one value | Has two or more values | DocumentData.Data.FieldName.Value | DocumentData.Data.FieldName.MultiValues |
---|---|---|---|---|---|
Single value | Yes | No | N/A | "" | null |
Single value | No | Yes | N/A | <value that was identified> | null |
Multi-value | Yes | No | No | "" | [] (empty array) |
Multi-value | No | Yes | No | <value that was identified> | [<array with one value identical to the .Value>] |
Multi-value | No | No | Yes | <first value that was identified> | [<array with n values, with the first value being identical to the .Value>] |
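The rules in the table above can be restated as a short Python sketch. It is an illustrative model only, not UiPath code: the function project_field and its parameters are hypothetical, and None stands in for the null shown in the table.

```python
from typing import List, Optional, Tuple


def project_field(values: List[str], multi_valued: bool) -> Tuple[str, Optional[List[str]]]:
    """Return (Value, MultiValues) for a field, following the table above.

    `values` is a hypothetical list of the values identified for the field,
    in the order in which they were found.
    """
    # Value is the first identified value, or an empty string when there is none.
    value = values[0] if values else ""
    if not multi_valued:
        # Single-value fields never populate MultiValues.
        return value, None
    # Multi-value fields return every identified value; the first equals Value.
    return value, list(values)


# One call per table row.
rows = [
    project_field([], multi_valued=False),
    project_field(["A"], multi_valued=False),
    project_field([], multi_valued=True),
    project_field(["A"], multi_valued=True),
    project_field(["A", "B"], multi_valued=True),
]
```

Note that for a multi-value field, Value and the first element of MultiValues always agree, so reading only Value silently drops any additional values.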
You can return the fields you extracted from a document as a Data Table, using the Document Data object. You can then use the Data Table variable inside Excel activities.
To return the extracted fields as a Data Table, choose the ResultsAsDatatable output for the Extract Document Data activity.
The properties of the Document Data variable can be populated and consumed by one or multiple activities. Depending on the activity populating the variable, the properties can differ. Check the following list:
- DocumentType - Classify Document activity populates the following values:
  - DisplayName (used for custom models): Name of the Document Type.
  - ID (used for out-of-the-box models): Name of the Document Type.
  - Confidence: Classification confidence.
  - URL: URL where the Document Type is accessible; this can be either custom or predefined, referenced via the respective project in Document Understanding center.
- Fields - Extract Document Data, Create Validation Task, Create Validation Task and Wait, Wait for Validation Task and Resume activities populate the following values:
  - Field Value: Extraction value of the field.
  - Extraction Confidence Score: Confidence score of the extraction, as provided by the model.
  - OCR Confidence Score: Confidence score provided by the OCR engine.
- File Details - Activities that create the Document Data object and receive a file as input populate the following values:
  - Full Name: Full name of the file.
  - Extension: Extension of the file.
  - Page Range: Page range of the file.
- Sub Documents: Collection of Document Data, populated by the Classify Document activity.
  Note: This is not currently populated and will be added in the future together with classification validation and splitting capabilities.
- DocumentMetaData:
  - DOM: The document object model, used by all activities. (Populated by activities that create the Document Data object and receive a file as input.)
  - Text: All extracted text. (Populated by activities that create the Document Data object and receive a file as input.)
  - Language: The language detected in the document. (Populated by activities that create the Document Data object and receive a file as input.)
  - Split confidence: If the document is split, the confidence returned by the splitting model. (Populated by the Classify Document activity.)
    Note: This is not currently populated and will be added in the future together with classification validation and splitting capabilities.
  - Results as Data Tables: Fields exported as Data Table. (Populated by the Extract Document Data activity.)
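To visualize how these properties fit together, here is a rough Python data-structure sketch. It is an assumption-laden illustration, not the actual UiPath type definitions: every class name, attribute name, type, and sample value below is hypothetical, chosen only to mirror the properties listed above and to show that different activities fill in different parts of the same object.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class DocumentTypeInfo:
    """Values described as populated by the Classify Document activity."""
    display_name: str
    id: str
    confidence: float
    url: str


@dataclass
class FieldResult:
    """Values described as populated by extraction and validation activities."""
    value: str
    extraction_confidence: float
    ocr_confidence: float


@dataclass
class FileDetails:
    """Values described as populated when the object is created from a file."""
    full_name: str
    extension: str
    page_range: str


@dataclass
class DocumentMetadata:
    dom: Any = None
    text: str = ""
    language: str = ""
    split_confidence: Optional[float] = None
    results_as_data_tables: List[Any] = field(default_factory=list)


@dataclass
class DocumentDataSketch:
    file_details: FileDetails
    metadata: DocumentMetadata
    document_type: Optional[DocumentTypeInfo] = None
    fields: Dict[str, FieldResult] = field(default_factory=dict)
    sub_documents: List["DocumentDataSketch"] = field(default_factory=list)


# Step 1 (hypothetical): the object is created from a file.
doc = DocumentDataSketch(
    file_details=FileDetails("invoice.pdf", ".pdf", "1-3"),
    metadata=DocumentMetadata(text="Invoice INV-001", language="en"),
)
# Step 2 (hypothetical): classification fills in the document type.
doc.document_type = DocumentTypeInfo("Invoices", "invoices", 0.97, "")
# Step 3 (hypothetical): extraction adds field results keyed by field ID.
doc.fields["invoice-no"] = FieldResult("INV-001", 0.93, 0.99)
```

The same object is enriched step by step, which matches the recommendation to pass one Document Data variable from activity to activity rather than recreating it.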