Clipboard AI User Guide

Last updated Dec 10, 2024

Data extractors

Data extractors can be used to retrieve the relevant information from various documents and other sources.

When talking about document types, there are three main categories:

Structured documents - have a fixed format with precise fields for the required data, which makes them easy to process. Each document is designed to capture a specific type of data. Examples of structured documents: tax forms, surveys, questionnaires, etc.
Semi-structured documents - combine a predictable set of information with a variable layout. Unlike structured documents, they are not bound to specified data fields, but their content is predictable. For example, an invoice always contains a unique identifier, such as an invoice number, and a date, but their placement might vary depending on the provider. These documents mainly contain label:value pairs and may contain paragraphs as well. Examples of semi-structured documents: invoices, receipts, purchase orders, utility bills, etc.
Unstructured documents - the information is not organized according to a fixed format. These documents mainly contain plain text, and most of the data is embedded in that text in unstructured form. Examples of unstructured documents: contracts, emails, health records, etc.

Data extractors can differ based on how they extract data from documents. In this regard, there are two types of extractors:

Fixed output extractors - trained to extract a predefined set of information from a document. For example, the Invoice extractor always tries to extract the company name, address, total sum, etc.
Question-answering extractors - trained to answer questions based on a given context. These extractors rely on natural language understanding to parse the text, identify the exact value that needs to be extracted, and provide an appropriate answer or choose an option from a list of given options.

Clipboard AI uses the following set of data extractors:

Universal extractor
Specific documents extractors
Plain-text extractor
Tables and name-value pairs extractor

The Universal extractor

The Universal extractor is the default option for extracting data from your documents. It scans your data (plain text or tabular) and decides on the best way to extract it. It uses a combination of the existing extractors and also accepts queries to find the best match in your data.
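
The idea of scanning the data and routing it to a suitable extractor can be sketched in Python as follows. The guide does not describe the actual decision logic, so everything here is an assumption: the tab-count heuristic in looks_tabular and the functions extract_table, extract_plain_text, and universal_extract are hypothetical stand-ins.

```python
from typing import Any, Dict, List


def looks_tabular(data: str) -> bool:
    """Assumed heuristic: at least two non-empty lines that all
    contain the same non-zero number of tab separators."""
    lines = [line for line in data.splitlines() if line.strip()]
    if len(lines) < 2:
        return False
    tab_counts = {line.count("\t") for line in lines}
    return len(tab_counts) == 1 and tab_counts.pop() > 0


def extract_table(data: str) -> List[List[str]]:
    """Stand-in for a tables extractor: split rows into cells."""
    return [line.split("\t") for line in data.splitlines() if line.strip()]


def extract_plain_text(data: str) -> List[str]:
    """Stand-in for a plain-text extractor: split into sentences."""
    return [part.strip() for part in data.split(".") if part.strip()]


def universal_extract(data: str) -> Dict[str, Any]:
    """Inspect the data, then delegate to the matching extractor."""
    if looks_tabular(data):
        return {"kind": "table", "rows": extract_table(data)}
    return {"kind": "text", "sentences": extract_plain_text(data)}
```

For example, universal_extract("Name\tQty\nBolt\t4") is routed to the table path, while a sentence such as "Pay by Friday. Thanks." is routed to the plain-text path.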

Learn how to interact with the Universal extractor.

Specific documents extractors

The Specific documents extractors are a set of fixed-output extractors, each trained on a specific document type. Each document type is extracted using its corresponding Document Understanding machine learning model. The following document types are supported:

Invoice
Passport
Receipt
ID card
W-2 form
Utility bill
Purchase order
Web/desktop forms

You can select the preferred Document Understanding model based on your document type.

Plain-text extractor

The Plain-text extractor is a question-answering extractor that uses GPT-3 to retrieve data from plain text documents, webpages, emails, etc. It can be used either for semi-structured documents, to handle the variable parts, or for unstructured documents, where the layout is irrelevant.

This extractor supports semantic understanding. Besides question answering, it has other advanced capabilities, such as summarization, machine translation, document type classification, and sentiment detection.