Clipboard AI User Guide (Public Beta)
Last updated Mar 18, 2024

Data extractors

Important: UiPath Clipboard AI is currently in Public Beta.

Data extractors can be used to retrieve the relevant information from various documents and other sources.

Documents fall into three main categories:

  • Structured documents - have a fixed format that guides you to fill in the required data in precise fields, which makes them easy to process. Each document is designed to hold a certain type of data. Examples of structured documents: tax forms, surveys, questionnaires, etc.
  • Semi-structured documents - combine fixed and variable parts. They are not bound to specified data fields the way structured documents are, but they contain a predictable set of information. For example, an invoice always contains a unique identifier, such as an invoice number, and a date, but their placement might vary depending on the provider. These documents mainly contain label:value pairs and may contain paragraphs as well. Examples of semi-structured documents: invoices, receipts, purchase orders, utility bills, etc.
  • Unstructured documents - the information is not organized according to a fixed format. These documents mainly contain plain text, and most of the data sits inside that text in unstructured form. Examples of unstructured documents: contracts, emails, health records, etc.

Data extractors can differ based on how they extract data from documents. In this regard, there are two types of extractors:

  • Fixed output extractors - trained to extract a predefined set of information from a document. For example, the Invoice extractor always tries to extract the company name, address, total sum, etc.
  • Question-answering extractors - trained to answer questions based on a given context. These extractors rely on natural language understanding to parse the text, determine the exact value that needs to be extracted, and provide an appropriate answer, or even choose one option out of a list of given options.
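
The difference between the two extractor types can be illustrated with a small Python sketch. This is not Clipboard AI code: the field names, the regular expressions, and the functions `fixed_output_extract` and `choose_option` are hypothetical and serve only to show the contrast. A fixed output extractor always returns the same set of fields, while the question-answering stand-in picks one option from a list. A real question-answering extractor relies on natural language understanding rather than the literal string matching used here.

```python
import re
from typing import Dict, List, Optional

# Hypothetical field set; a real fixed output extractor defines its own fields.
INVOICE_FIELDS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*(\S+)", re.I),
    "total": re.compile(r"Total\s*:?\s*\$?([\d.,]*\d)", re.I),
}


def fixed_output_extract(text: str) -> Dict[str, Optional[str]]:
    """Always return the same keys; a field that is not found is None."""
    result: Dict[str, Optional[str]] = {}
    for name, pattern in INVOICE_FIELDS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


def choose_option(text: str, options: List[str]) -> Optional[str]:
    """Toy stand-in for a question-answering extractor: return the first
    option that literally appears in the text (no language understanding)."""
    lowered = text.lower()
    for option in options:
        if option.lower() in lowered:
            return option
    return None


sample = "Invoice No: INV-001\nPayment method: credit card\nTotal: $1,250.00"
print(fixed_output_extract(sample))
print(choose_option(sample, ["bank transfer", "credit card", "cash"]))
```

Note that the fixed output function returns both keys even for text that contains neither field, whereas the question-answering stand-in only answers what it is asked.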

Now that we have explained the essential differences between document layouts and data extractor types, we can look at Clipboard AI's own set of data extractors:

  • Specific documents extractors
  • Plain-text extractor
  • Tables and name-value pairs extractor
  • Semi-structured extractor

An extractor is chosen automatically when you copy the data. Results can differ considerably from one extractor to another, so it's highly recommended to try all of them and see which extractor is best suited for your document.

To use a different extractor than the one automatically selected, select the Change type button at the bottom of the Mapper. This opens the data extractors panel, from which you can select another extractor. Once a new extractor is selected, the data fields are updated in the Mapper and you can compare the results.

Specific documents extractors

The Specific documents extractors are a set of fixed output extractors trained on specific document types. Each document type is extracted using its corresponding Document Understanding machine learning model, as follows:

  • Invoice
  • Passport
  • Receipt
  • ID card
  • W-2 form
  • Utility bill
  • Purchase order
  • Web/desktop forms

The automatically identified document type is highlighted and marked with a star. For any document type not listed above, use one of the other extractors.

Plain-text extractor

The Plain-text extractor is a question-answering extractor that uses GPT-3 to retrieve data from plain-text documents, webpages, emails, etc. It can be used either for semi-structured documents, to handle the variable parts, or for unstructured documents, where the layout is irrelevant.

This extractor supports semantic understanding and, besides question answering, has other advanced capabilities, such as summarization, machine translation, document type classification, and sentiment detection.

Tables and name-value pairs extractor

The Tables and name-value pairs extractor is a fixed output extractor that works best for documents containing tables and label:value pairs (for instance, Name: John, Surname: Doe).
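
To make the label:value idea concrete, here is a minimal Python sketch that splits text such as the example above into pairs. It is only an illustration of the data shape, not how Clipboard AI extracts pairs: the `parse_pairs` function and its regular expression are assumptions made for this example. The sketch is deliberately naive, so values that themselves contain a comma or a colon (amounts such as 1,250.00 or times such as 10:30) are not handled.

```python
import re

# A label is a run of text without ':' or ','; the value runs until the
# next comma or the end of the line.
PAIR = re.compile(r"([^:,\n]+?)\s*:\s*([^,\n]+)")


def parse_pairs(text: str) -> dict:
    """Return a dict mapping each label to its value, both stripped of spaces."""
    return {label.strip(): value.strip() for label, value in PAIR.findall(text)}


print(parse_pairs("Name: John, Surname: Doe"))
```

The same function also handles pairs placed on separate lines, because a line break ends a value just as a comma does.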

Semi-structured extractor

The Semi-structured extractor is a question-answering extractor that, as the name suggests, can extract data from semi-structured documents other than the ones covered by the Specific documents extractors. For instance, you can use this extractor for bank statements, bills of sale, tax forms, etc.

© 2005-2024 UiPath. All rights reserved.