Document Understanding User Guide

Last updated Apr 4, 2025

Form Extractor

What Is Form Extractor

The Form Extractor is best suited for extracting, matching, and reporting specific information by analyzing the position of words inside the document, or by detecting a signature.

The Form Extractor relies on templates defined up-front, at the design stage. A complex set of rules applies the configured templates to incoming documents that are to be processed, thus identifying and reporting the expected information.

The activity comes with a configuration wizard that helps you define the templates for the document types and fields you want to target for data extraction.

The activity supports both simple field and table field extraction, and, as mentioned before, can detect a signature field.

Note:
Related information about the Form Extractor:

Form Extractor activity page
Taxonomy Manager - setup instructions
Template Manager wizard - setup instructions
Anchor-based workflow example

Consider other extraction methods if:

there are many layouts that need to be handled
documents are not only skewed, rotated, or of different sizes, but also show "warping" (curving in certain areas).

Note:
For fixed form extraction, to evaluate whether the layouts of two files are the same, overlay them in an image tool with some transparency and check whether all non-variable content overlaps (after de-rotating, de-skewing, and bringing the two images to the same scale).

If you notice variability (non-variable content appears more to the left / right / top / bottom for certain areas of the document), then the layouts are not considered the same.
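The manual overlay check described in the note can be approximated in code. The following is only a rough sketch, not part of the Form Extractor: it assumes both pages have already been de-rotated, de-skewed, scaled to the same size, and binarized into grids of 0 (blank) and 1 (ink), with variable content blanked out. The function name fixed_content_overlap is hypothetical.

```python
def fixed_content_overlap(page_a, page_b):
    """Fraction of ink pixels shared by both pages, out of ink pixels in either page.

    Assumption: each page is a list of rows of 0/1 values that contains only
    non-variable content, already aligned and scaled to the same dimensions.
    """
    if len(page_a) != len(page_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(page_a, page_b)
    ):
        raise ValueError("pages must have the same dimensions")

    shared = 0  # pixels that are ink in both pages
    total = 0   # pixels that are ink in at least one page
    for row_a, row_b in zip(page_a, page_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                shared += 1
            if a or b:
                total += 1
    # Two blank pages are treated as identical layouts.
    return shared / total if total else 1.0


# Hypothetical 2x4 pages: the second has the same mark shifted one pixel right.
original = [[1, 1, 0, 0], [0, 0, 0, 0]]
shifted = [[0, 1, 1, 0], [0, 0, 0, 0]]
same_score = fixed_content_overlap(original, original)
shifted_score = fixed_content_overlap(original, shifted)
```

A score close to 1 suggests the fixed content lines up. A noticeably lower score suggests that content has shifted and the files may need separate templates. Any threshold you pick is a judgment call for your own documents.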

The Form Extractor allows you to define multiple templates for the same document type, and, at run-time, it:

identifies the best matching template for the incoming document and document type
applies the template matching algorithm, based on page-level anchors, to each page from which data needs to be extracted (missing or repeating pages are not supported)
applies all field-level anchor settings to each page, to capture values associated with any potential matches
reports the identified information from the target value areas.

It also supports fine-tuning of checkbox / boolean field processing, by allowing you to configure the "Synonyms for Yes" and "Synonyms for No" values according to your use case.
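To illustrate the idea behind the synonym settings, here is a minimal sketch of how a captured checkbox value might be mapped to a boolean. It is not the activity's actual implementation: the function name interpret_checkbox, the case-insensitive comparison, and the sample synonym lists are all assumptions made for the example.

```python
def interpret_checkbox(raw_value, synonyms_for_yes, synonyms_for_no):
    """Map captured text to True, False, or None (no synonym matched).

    Assumption: matching ignores surrounding whitespace and letter case.
    """
    normalized = raw_value.strip().casefold()
    yes_values = {s.strip().casefold() for s in synonyms_for_yes}
    no_values = {s.strip().casefold() for s in synonyms_for_no}
    if normalized in yes_values:
        return True
    if normalized in no_values:
        return False
    return None


# Hypothetical configuration: a ticked box is read as "X", an empty box as "".
yes_synonyms = ["x", "yes", "checked"]
no_synonyms = ["", "no", "unchecked"]
ticked = interpret_checkbox(" X ", yes_synonyms, no_synonyms)
empty = interpret_checkbox("", yes_synonyms, no_synonyms)
unknown = interpret_checkbox("maybe", yes_synonyms, no_synonyms)
```

Returning None for unmatched text keeps ambiguous values visible, so they are not silently treated as "No".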

This extractor does not have learning (training) capabilities and requires configuration.

How to Configure

Activity Configuration

The Form Extractor has two major configurations to be considered:

the Template Manager wizard, which allows you to define the templates to be applied to incoming documents. This wizard enables the Template Editor and the Boolean field interpretation settings.
the MinOverlapPercentage setting, which controls how strict the value area matching should be. It accepts a value between 0 and 100 and determines which words are accepted into or rejected from a given value, based on how well their location fits the area defined in the template.
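The effect of MinOverlapPercentage can be pictured with a small sketch. It assumes, purely for illustration, that the overlap is measured as the share of a word's bounding box that falls inside the value area defined in the template. The guide does not state the exact formula. The Box class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical units)."""
    left: float
    top: float
    width: float
    height: float


def overlap_percentage(word: Box, value_area: Box) -> float:
    """Share of the word's box lying inside the value area, from 0 to 100.

    Assumption: overlap is measured relative to the word's own area.
    """
    x_overlap = min(word.left + word.width, value_area.left + value_area.width) - max(
        word.left, value_area.left
    )
    y_overlap = min(word.top + word.height, value_area.top + value_area.height) - max(
        word.top, value_area.top
    )
    word_area = word.width * word.height
    if x_overlap <= 0 or y_overlap <= 0 or word_area == 0:
        return 0.0
    return 100.0 * (x_overlap * y_overlap) / word_area


def words_in_value(words, value_area, min_overlap_percentage):
    """Keep the words whose overlap meets the configured threshold."""
    return [
        text
        for text, box in words
        if overlap_percentage(box, value_area) >= min_overlap_percentage
    ]


# Hypothetical value area and words: fully inside, half inside, and outside.
area = Box(left=0, top=0, width=100, height=20)
words = [
    ("Total", Box(left=10, top=5, width=30, height=10)),
    ("Due", Box(left=90, top=5, width=20, height=10)),
    ("Page", Box(left=200, top=5, width=30, height=10)),
]
strict = words_in_value(words, area, 60)
lenient = words_in_value(words, area, 50)
```

Under this assumption, a higher setting admits only words that sit almost entirely inside the value area, while a lower setting also accepts words that straddle its edge.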

More information about using the Form Extractor activity wizard can be found here.