Clipboard AI user guide
Data extractors
Data extractors retrieve relevant information from various documents and other sources.
Documents fall into three main categories:
- Structured documents - have a fixed format and are easy to process, because they guide you to fill in the required data in precise fields. These documents are designed to hold a specific type of data. Examples of structured documents: tax forms, surveys, questionnaires, etc.
- Semi-structured documents - combine predictable content with a variable layout. Unlike structured documents, they are not bound to specified data fields, but they contain a predictable set of information. For example, an invoice always contains a date and a unique identifier such as an invoice number, but the placement might vary depending on the provider. These documents mainly contain label:value pairs and may contain paragraphs as well. Examples of semi-structured documents: invoices, receipts, purchase orders, utility bills, etc.
- Unstructured documents - the information is not organized according to a fixed format. These documents mainly contain plain text, and most of the data is embedded in that text in unstructured form. Examples of unstructured documents: contracts, emails, health records, etc.
Data extractors can also differ in how they extract data from documents. In this respect, there are two types of extractors:
- Fixed output extractors - trained to extract a predefined set of information from a document; for example, the Invoice extractor always tries to extract the company name, address, total sum, etc.
- Question-answering extractors - trained to answer questions based on a given context. These extractors rely on natural language understanding to parse the text, identify the exact value that needs to be extracted, and provide an appropriate answer, or even choose one option from a given list.
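The difference between the two types lies mainly in the shape of their output. The sketch below is only an illustration of that difference, not how Clipboard AI is implemented: the field names, the patterns, and both function names are hypothetical, and a simple keyword match stands in for the natural language understanding a real question-answering extractor would use.

```python
import re
from typing import Dict, List, Optional

# Hypothetical fixed schema: one pattern per field the extractor always reports.
INVOICE_PATTERNS = {
    "company_name": re.compile(r"^Company name:\s*(.+)$", re.MULTILINE | re.IGNORECASE),
    "address": re.compile(r"^Address:\s*(.+)$", re.MULTILINE | re.IGNORECASE),
    "total": re.compile(r"^Total:\s*(.+)$", re.MULTILINE | re.IGNORECASE),
}


def fixed_output_extract(text: str) -> Dict[str, Optional[str]]:
    """Return the same set of keys for every document; missing fields are None."""
    result: Dict[str, Optional[str]] = {}
    for field, pattern in INVOICE_PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1).strip() if match else None
    return result


def choose_option(context: str, options: List[str]) -> Optional[str]:
    """Toy question-answering step: pick the first listed option found in the context.

    A real extractor would use a language model here; substring matching is
    only a placeholder for that step.
    """
    lowered = context.lower()
    for option in options:
        if option.lower() in lowered:
            return option
    return None
```

A fixed output call always yields the keys `company_name`, `address`, and `total`, whatever the document contains, whereas the question-answering call returns whatever answer the supplied context and option list support.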
Clipboard AI uses the following set of data extractors:
- Universal extractor
- Specific documents extractors
- Plain-text extractor
- Tables and name-value pairs extractor
The Universal extractor
The Universal extractor is the default option for extracting data from your documents. It scans your data (plain text or tabular) and decides on the best way to extract it. It uses a combination of the existing extractors, and it also accepts queries to find the best match in your data.
Learn how to interact with the Universal extractor.
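This guide does not describe how the Universal extractor decides between plain-text and tabular handling. Purely as an illustration of the routing idea, the following sketch picks an extractor name based on how many lines contain a column delimiter. The function names, the delimiter set, the 50% threshold, and the returned labels are all assumptions made for this example.

```python
def looks_tabular(text: str, delimiters: str = "\t|") -> bool:
    """Guess whether text is tabular: at least half of the non-empty lines
    contain one of the delimiter characters (assumed heuristic)."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    delimited = sum(1 for line in lines if any(d in line for d in delimiters))
    return delimited / len(lines) >= 0.5


def choose_extractor(text: str) -> str:
    """Return a hypothetical extractor label for the given input."""
    if looks_tabular(text):
        return "tables_and_name_value_pairs"
    return "plain_text"
```

A production router would likely weigh many more signals, such as document type or layout, so treat this as a reading aid rather than a description of the product.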
Specific documents extractors
The Specific documents extractors are a fixed-output set of extractors trained on specific document types. Each document type is extracted using its corresponding Document Understanding machine learning model, as follows:
- Invoice
- Passport
- Receipt
- ID card
- W-2 form
- Utility bill
- Purchase order
- Web/desktop forms
You can select your preferred Document Understanding model based on your document type.
Plain-text extractor
The Plain-text extractor is a question-answering extractor that uses GPT-3 to retrieve data from plain text documents, webpages, emails, etc. It can be used either for semi-structured documents, to handle the variable parts, or for unstructured documents, where the layout is irrelevant.
This extractor supports semantic understanding and, in addition to answering questions, offers other advanced capabilities such as summarization, machine translation, document type classification, and sentiment detection.
Tables and name-value pairs extractor
The Tables and name-value pairs extractor is a fixed output extractor that works best for documents containing tables and name:value pairs.
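To make the two kinds of content concrete, here is a minimal sketch that separates name:value pairs from table rows in a small text sample. It assumes a simplified input in which table cells are separated by `|` and the first table line is the header; the function name `parse_pairs_and_table` is hypothetical, and the real extractor works on actual document layouts rather than delimited text.

```python
from typing import Dict, List, Optional, Tuple


def parse_pairs_and_table(text: str) -> Tuple[Dict[str, str], List[Dict[str, str]]]:
    """Split text into name:value pairs and table rows (as header-keyed dicts)."""
    pairs: Dict[str, str] = {}
    rows: List[Dict[str, str]] = []
    header: Optional[List[str]] = None
    for line in text.splitlines():
        if "|" in line:
            # Treat the line as a table row; the first such line is the header.
            cells = [cell.strip() for cell in line.split("|")]
            if header is None:
                header = cells
            else:
                rows.append(dict(zip(header, cells)))
        elif ":" in line:
            # Treat the line as a name:value pair, splitting on the first colon.
            name, _, value = line.partition(":")
            pairs[name.strip()] = value.strip()
    return pairs, rows
```

For example, an input with the line `Invoice number: INV-001` followed by a two-column `Item | Qty` table yields one pair and one dictionary per table row.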