This pack contains the infrastructure for enabling document processing flows using a complete, open, extensible approach.
C# Project Flavor Support
Starting with version 4.10.0, this activity package is validated for use in C# projects.
If you encounter a runtime error that mentions the Docotic.Pdf library, upgrade the UiPath.IntelligentOCR.Activities package to version 3.1.0 or higher.
UiPath.IntelligentOCR.Activities version 3.0 or higher is incompatible with UiPath.PDF.Activities versions lower than 3.0, and UiPath.PDF.Activities version 3.0 or higher is incompatible with UiPath.IntelligentOCR.Activities versions lower than 3.0. Use compatible versions if both packages are used in the same project.
Starting with UiPath.IntelligentOCR.Activities version 4.0.0, all Abbyy-related activities have been moved to a separate package. Install the UiPath.Abbyy.Activities package if you want to use its activities for OCR, Cloud OCR, classification, and data extraction.
Due to the upgrade of CefSharp dependencies, UiPath.IntelligentOCR.Activities version 4.10.2 is compatible only with UiPath.Form.Activities versions 1.1.8 or higher if both are used in the same workflow.
Also, when updating UiPath.IntelligentOCR.Activities from a version lower than 4.10.2 to version 4.10.2 or higher, an error might be thrown. This is a known issue that will be fixed, and the error can be ignored.
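The compatibility notes above can be expressed as a small check over a project's package versions. The following is a minimal sketch in Python, not part of the package: the helper names `parse_version` and `compatibility_problems` are hypothetical, and the check encodes only the rules stated in the notes above.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '4.10.2' or '4.3.0-preview' into a comparable tuple of ints."""
    core = text.split("-", 1)[0]  # drop any pre-release suffix such as '-preview'
    parts = [int(part) for part in core.split(".")]
    parts += [0] * (3 - len(parts))  # pad '3.0' to (3, 0, 0) so comparisons behave
    return tuple(parts)


def compatibility_problems(installed: Dict[str, str]) -> List[str]:
    """Return a list of problems for a {package name: version} mapping."""
    versions = {name: parse_version(value) for name, value in installed.items()}
    ocr = versions.get("UiPath.IntelligentOCR.Activities")
    pdf = versions.get("UiPath.PDF.Activities")
    form = versions.get("UiPath.Form.Activities")
    problems: List[str] = []

    # Rule 1: IntelligentOCR and PDF must sit on the same side of version 3.0.
    if ocr is not None and pdf is not None:
        if (ocr >= (3, 0, 0)) != (pdf >= (3, 0, 0)):
            problems.append(
                "IntelligentOCR and PDF must both be below 3.0 or both be 3.0 or higher"
            )

    # Rule 2: the note names IntelligentOCR 4.10.2 specifically, so only that
    # version is checked here; other versions are not covered by this sketch.
    if ocr == (4, 10, 2) and form is not None and form < (1, 1, 8):
        problems.append("IntelligentOCR 4.10.2 requires Form.Activities 1.1.8 or higher")

    return problems
```

Calling `compatibility_problems` with the versions listed in a project's dependencies returns an empty list when none of the stated rules is violated.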
The package allows you to:
Digitize documents
You can achieve this using the Digitize Document activity. It retrieves the text from any PDF or image, applying the OCR engine of your choice only when necessary.
- Documents are processed one by one, and each goes through digitization. Non-digital (scanned) documents additionally require the OCR engine of your choice. This step outputs the Document Object Model and a string variable containing all the document text; both are passed down to the next steps.
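The "OCR only if necessary" behavior can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the Digitize Document implementation: `SourceFile`, `DocumentObjectModel`, and `digitize` are hypothetical names, and the real Document Object Model carries far more structure than the simplified class shown here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class SourceFile:
    """Hypothetical stand-in for an input PDF or image."""
    name: str
    native_text: Optional[str]  # None for scanned documents and images


@dataclass
class DocumentObjectModel:
    """Hypothetical, heavily simplified stand-in for the Document Object Model."""
    source_name: str
    text: str
    used_ocr: bool


def digitize(
    source: SourceFile, ocr_engine: Callable[[SourceFile], str]
) -> Tuple[DocumentObjectModel, str]:
    """Return (document object model, full text), calling OCR only when needed."""
    if source.native_text:
        # Digital document: the embedded text is used directly.
        text, used_ocr = source.native_text, False
    else:
        # Scanned document or image: fall back to the chosen OCR engine.
        text, used_ocr = ocr_engine(source), True
    return DocumentObjectModel(source.name, text, used_ocr), text
```

Passing the OCR engine in as a callable mirrors the idea that the engine is your choice; the two return values correspond to the two outputs described above.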
Classify documents
You can achieve this using the Classify Document Scope activity. It identifies what type of document a file is by using any classification algorithm.
- After digitization, the document is classified. If you are working with multiple document types in the same project, you need to know which type you are dealing with in order to extract data properly. You can use multiple classifiers in the same scope, configure them and, later in the framework, train them. The classification results help in applying the right extraction strategy.
- The Keyword Based Classifier activity is the first such classifier, targeting classification for titled documents.
- The Intelligent Keyword Classifier activity can not only classify but also "split" files that contain multiple document types within them.
- The FlexiCapture Classifier, which embeds the Abbyy FlexiCapture technology, is also incorporated into our product. This activity is part of the
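To make keyword-based classification of titled documents more concrete, here is a toy Python sketch. It is not the algorithm used by the Keyword Based Classifier activity; the function name `classify_by_keywords`, the scoring rule, and the `title_lines` parameter are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def classify_by_keywords(
    text: str, keywords: Dict[str, List[str]], title_lines: int = 5
) -> Optional[Tuple[str, float]]:
    """Return (document type, score) for the best keyword match near the top of the text."""
    # Titled documents usually carry their title in the first few lines.
    head = " ".join(text.splitlines()[:title_lines]).lower()
    best: Optional[Tuple[str, float]] = None
    for doc_type, words in keywords.items():
        if not words:
            continue
        hits = sum(1 for word in words if word.lower() in head)
        score = hits / len(words)  # fraction of this type's keywords that were found
        if hits and (best is None or score > best[1]):
            best = (doc_type, score)
    return best  # None means no classifier keyword matched
```

A `None` result corresponds to a document that is unknown to the classifier, which is the case that human validation and training are meant to address.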
Validate automatic classification
You can achieve this using the Present Classification Station attended activity, which presents a document processing specific user interface for validating and correcting automatic classification outputs.
- Especially for use cases in which file splitting is involved, using the human classification validation step is strongly recommended, to ensure that downstream processing for data extraction works properly.
- An alternative to the attended activity is available through the usage of Long-Running Workflows, which are designed to optimally enable human-robot collaboration. The Create Document Classification Action and the Wait for Document Classification Action and Resume activities enable this scenario.
Train classifiers
You can achieve this using the Train Classifiers Scope activity. It closes the feedback loop for any classification algorithm capable of learning. Drag and drop your classifier trainers within this Scope activity and enable them using the Configure Classifiers wizard to ensure that the information validated by humans through the Classification Station or Validation Station is used by your classifiers to improve their own performance.
- Classification is only as effective as the classifiers used. If a document wasn’t classified properly, it means it was unknown to the active classifiers. The Framework provides the opportunity to train the classifiers, to improve recognition of the document classes.
- The Keyword Based Classifier Trainer is the trainer activity paired with the Keyword Based Classifier.
- The Intelligent Keyword Classifier Trainer enables the feedback loop for Intelligent Keyword Classifier.
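The feedback loop described above can be illustrated with a toy Python sketch: a human-confirmed document type is used to extend what a keyword classifier knows. This is an assumption-laden illustration, not how the trainer activities work internally; `learn_from_validation` and the "first non-empty line is the title" heuristic are hypothetical.

```python
from typing import Dict, List


def learn_from_validation(
    keywords: Dict[str, List[str]], text: str, confirmed_type: str
) -> None:
    """Add the first non-empty line of a validated document as a keyword for its confirmed type."""
    title = next(
        (line.strip().lower() for line in text.splitlines() if line.strip()), ""
    )
    if not title:
        return  # nothing usable to learn from an empty document
    known = keywords.setdefault(confirmed_type, [])
    if title not in known:
        known.append(title)  # the classifier now recognizes this title
```

After a call such as `learn_from_validation(keywords, document_text, "PurchaseOrder")`, a document type that was previously unknown has at least one keyword to match against.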
Extract data from documents
You can achieve this using the Data Extraction Scope activity. This allows the usage of any data extraction algorithm for identifying different fields in a classified document.
- Extraction means retrieving only the data you are interested in from a given document type. For example, extracting specific data from a 5-page document with string manipulation alone is cumbersome. In this framework, you can use different extractors for different document structures in the same data extraction scope. The extraction results are then passed on for validation.
- The Regex Based Extractor is a basic data extractor that applies regular expression matching to identify the best candidates for a specific field.
- The Form Extractor uses predefined templates to enable the processing of structured, fixed form documents.
- The Intelligent Form Extractor is an extension of the Form Extractor with extended capabilities related to processing handwritten forms and signatures on documents.
- The Machine Learning Extractor leverages the power of AI and Machine Learning to identify information in structured or semi-structured documents by either using one of UiPath's public data extraction services or by calling custom trained Machine Learning models that you can build and host in AI Fabric. This activity is part of the
- The FlexiCapture Extractor incorporates the Abbyy FlexiCapture technology into our product and is part of the
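As an illustration of regular-expression-based extraction, the following Python sketch tries a list of patterns per field and keeps the first match. It is a simplified stand-in, not the Regex Based Extractor itself; the function name `extract_with_regex` and the "first matching pattern wins" rule are assumptions.

```python
import re
from typing import Dict, List, Optional


def extract_with_regex(
    text: str, patterns: Dict[str, List[str]]
) -> Dict[str, Optional[str]]:
    """For each field, return the first match of the first pattern that matches, else None."""
    results: Dict[str, Optional[str]] = {}
    for field, candidates in patterns.items():
        results[field] = None
        for pattern in candidates:
            match = re.search(pattern, text, flags=re.IGNORECASE)
            if match:
                # Prefer the first capture group when the pattern defines one.
                results[field] = match.group(1) if match.groups() else match.group(0)
                break
    return results
```

Fields left as `None` are the ones a later validation step would need to fill in or confirm.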
Validate automatic data extraction results
You can achieve this using the Present Validation Station attended activity, which presents a document processing specific user interface for data validation and correction.
- The extracted data can be validated by a human user through the Validation Station. A best practice is to build logic that decides whether to add a human validation step, with rules that depend on the specific use case. Validation results can then be exported and used in further automation activities.
- You can also enable human validation through Long-Running Workflows, optimizing human-robot collaboration. The Create Document Validation Action and the Wait for Document Validation Action and Resume activities enable this scenario.
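The decision logic mentioned above, whether or not to route a document to human validation, might look like the following Python sketch. The rule (missing required fields or low confidence trigger validation), the 0.9 threshold, and the names `ExtractedField` and `needs_human_validation` are all assumptions for illustration; real rules depend on the use case.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ExtractedField:
    """Hypothetical, simplified view of one extracted field."""
    name: str
    value: Optional[str]
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical extractor


def needs_human_validation(
    fields: Iterable[ExtractedField],
    required: Iterable[str],
    min_confidence: float = 0.9,
) -> bool:
    """Return True when a required field is missing or any field is below the threshold."""
    by_name = {field.name: field for field in fields}
    for name in required:
        field = by_name.get(name)
        if field is None or not field.value:
            return True  # a required value was not extracted at all
    return any(field.confidence < min_confidence for field in by_name.values())
```

Documents for which this returns `False` could skip the human step, while the rest are sent to validation.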
Train extractors
You can achieve this using the Train Extractors Scope activity. It closes the feedback loop for any data extraction algorithm capable of learning. Drag and drop your extractor trainers within this Scope activity and enable them using the Configure Extractors wizard to ensure that the information validated by humans through the Validation Station is used by your extractors to improve their own performance.
- Extraction is only as effective as the extractors used. If field values were not extracted properly, it means they were unknown to the active extractors. The Framework provides the opportunity to train the extractors, to improve recognition of field values.
- The Machine Learning Extractor Trainer closes the feedback loop for ML-based data extraction, by collecting the data required for retraining a Machine Learning model hosted in AI Fabric. This activity is the companion of Machine Learning Extractor and is part of the
Export extracted information
You can achieve this using the Export Extraction Results activity. This allows you to export the complex structure of extracted data to a simple DataSet (collection of DataTables).
- Once you have your validated information, you can use it as it is, or save it in a DataTable format that can be converted very easily into an Excel file.
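The idea of flattening a nested extraction result into a collection of simple tables can be sketched in Python as follows. This is an analogy, not the output format of the Export Extraction Results activity: the table name "Simple Fields", the dictionary-based input, and both function names are assumptions.

```python
import csv
import io
from typing import Any, Dict, List

Table = List[Dict[str, Any]]


def export_results(results: Dict[str, Any]) -> Dict[str, Table]:
    """Split results into a 'Simple Fields' table plus one table per list-valued field."""
    tables: Dict[str, Table] = {"Simple Fields": []}
    for field, value in results.items():
        if isinstance(value, list):
            # A table field: each list item becomes one row of its own table.
            tables[field] = [dict(row) for row in value]
        else:
            tables["Simple Fields"].append({"Field": field, "Value": value})
    return tables


def table_to_csv(table: Table) -> str:
    """Serialize one table to CSV text; columns come from the first row."""
    if not table:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(table[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(table)
    return buffer.getvalue()
```

Each CSV string produced by `table_to_csv` can be saved to a file and opened in Excel, which parallels the conversion described above.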
The UiPath.IntelligentOCR.Activities package is compatible with any custom classification or data extraction activity built on the public UiPath.DocumentProcessing.Contracts package. This gives you full flexibility to build your own algorithm specific to your use case, or to integrate with any third-party solution for document classification and data extraction.
The following versions of the package have been removed from the official feed. Should you have any issues, please reach out to our support teams.
4.3.0-preview | 4.4.0-preview
2.1.0 | 2.2.0 | 2.3.0
1.4.0 | 1.5.0 | 1.6.0 | 1.6.1 | 2.0.0 | 2.0.1
1.2.0 | 1.2.1 | 1.3.0