Digitize Document
UiPath.IntelligentOCR.Activities.Digitization.DigitizeDocument
Digitizes a document, extracting its Document Object Model (DOM) and text and storing them in their corresponding variable types.
Properties
Common
- DisplayName - The display name of the activity.
Input
- ApplyOcrOnPdf - Specifies whether OCR is applied to PDF documents. If set to Yes, OCR is applied to all PDF pages of the document. If set to No, only digitally typed text is extracted. The default value is Auto, which decides whether OCR is needed based on the input document.
- DegreeOfParalelism - Specifies how many pages, if any, are analyzed in parallel. The -1 value uses the "Number of Cores on the machine - 1" (meaning it tries to process as many pages in parallel as the number of cores minus 1), while a positive value uses that specific number of logical processors. By default, this property is set to -1.
- DetectCheckboxes - Detects the available checkboxes from the document while digitizing it. The default value is True.
- DocumentPath - The file path of the document you want to digitize. This field supports only strings and String variables.
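The DegreeOfParalelism rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the activity's actual implementation: the function name `effective_parallelism` is hypothetical, and both the lower bound of one worker on single-core machines and the rejection of other values are assumptions.

```python
import os
from typing import Optional


def effective_parallelism(setting: int = -1, cores: Optional[int] = None) -> int:
    """Return how many pages would be processed in parallel (illustrative only)."""
    if cores is None:
        # os.cpu_count() can return None, so fall back to a single core.
        cores = os.cpu_count() or 1
    if setting == -1:
        # Documented default: "Number of Cores on the machine - 1".
        # Assumption: never drop below one worker on a single-core machine.
        return max(cores - 1, 1)
    if setting > 0:
        # A positive value uses exactly that number of logical processors.
        return setting
    # Assumption: values other than -1 or a positive number are rejected here.
    raise ValueError("setting must be -1 or a positive integer")


if __name__ == "__main__":
    print(effective_parallelism(-1, cores=8))
    print(effective_parallelism(4, cores=8))
```

With the default of -1 on an 8-core machine, the sketch resolves to 7 parallel pages.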
Note:
- If a document that contains enough data fails to classify, set the ApplyOcrOnPdf property to Yes on the Digitize Document activity.
- Text extraction from PDF files has been upgraded, resulting in an optimized extraction process where both native and scanned text are retrieved at the same time, with OCR applied only to the images identified in the PDF file. This improvement is available only when the ApplyOcrOnPdf option is set to Auto.
Note: The supported file types for the DocumentPath property field are .png, .gif, .jpe, .jpg, .jpeg, .tiff, .tif, and .pdf.
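Before passing a path to the activity, a workflow could pre-check the file extension against the supported types listed above. The following Python sketch shows one way to do that; the helper name `is_supported_document` is hypothetical, and treating extensions as case-insensitive is an assumption, not documented behavior.

```python
from pathlib import Path

# Extensions taken from the list of supported file types above.
SUPPORTED_EXTENSIONS = {
    ".png", ".gif", ".jpe", ".jpg", ".jpeg", ".tiff", ".tif", ".pdf",
}


def is_supported_document(document_path: str) -> bool:
    """Check whether the file extension is one of the supported types."""
    # Assumption: extension matching is case-insensitive.
    return Path(document_path).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for candidate in ("invoice.PDF", "scan.tif", "notes.docx"):
        print(candidate, is_supported_document(candidate))
```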
Misc
- Private - If selected, the values of variables and arguments are no longer logged at Verbose level.
Output
- DocumentObjectModel - The Document Object Model (DOM) of the file, stored in a Document variable. This field supports only Document variables.
- DocumentText - The text extracted from the specified document. This variable can be subsequently used in the Present Validation Station activity. This field supports only String variables.
Note: Starting with UiPath.IntelligentOCR.Activities package v6.3.0-preview, the Digitize Document activity comes with a default preselected OCR engine, the UiPath Document OCR engine.
Because these two output variables depend on each other, they are used as a pair throughout the entire Document Processing Framework (classification, data extraction, human validation, and so on).
Important
If the UiPath.IntelligentOCR.Activities package has been updated to v5.1.0, the ForceApplyOCR parameter has been replaced with ApplyOcrOnPdf. The old values map to the new ones as follows:
- ForceApplyOCR = True is replaced by ApplyOcrOnPdf = Yes
- ForceApplyOCR = False is replaced by ApplyOcrOnPdf = Auto
- ForceApplyOCR = Empty is replaced by ApplyOcrOnPdf = Auto
- ForceApplyOCR = <user-defined variable> is replaced by ApplyOcrOnPdf = Auto
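The compatibility rules above amount to a small lookup: only a literal True becomes Yes, and every other case becomes Auto. The Python sketch below restates that mapping for illustration. The function name `migrate_force_apply_ocr` is hypothetical, and representing an empty value as `None` and a user-defined variable as its name in a string are assumptions made for the example.

```python
from typing import Any


def migrate_force_apply_ocr(force_apply_ocr: Any) -> str:
    """Map an old ForceApplyOCR setting to the new ApplyOcrOnPdf value."""
    # Only the literal True maps to Yes. False, an empty value (None here),
    # and a user-defined variable (any other object here) all map to Auto.
    return "Yes" if force_apply_ocr is True else "Auto"


if __name__ == "__main__":
    for old_value in (True, False, None, "myOcrFlagVariable"):
        print(repr(old_value), "->", migrate_force_apply_ocr(old_value))
```

Note that a user-defined variable maps to Auto regardless of the value it held, so workflows that relied on a variable evaluating to True may need ApplyOcrOnPdf set to Yes by hand after the update.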
Document Object Model
The Document Object Model is captured in a proprietary object documented here.
For an image to be digitized successfully, both its width and its height must be between 50 and 10000 pixels. Any image outside this range is rejected with an exception message. An image that passes this check but has a total size larger than 14 MP is scaled down to 14 MP while maintaining its aspect ratio (width/height ratio).
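The size rules above can be expressed as a short calculation. The Python sketch below is an approximation under stated assumptions, not the activity's actual code: the name `target_size` is hypothetical, the 50 to 10000 pixel bounds are treated as inclusive, 1 MP is taken to be 1,000,000 pixels, and the scaled sides are rounded down so the result never exceeds the limit.

```python
import math
from typing import Tuple

MIN_SIDE = 50             # minimum width/height in pixels (assumed inclusive)
MAX_SIDE = 10_000         # maximum width/height in pixels (assumed inclusive)
MAX_PIXELS = 14_000_000   # 14 MP, assuming 1 MP = 1,000,000 pixels


def target_size(width: int, height: int) -> Tuple[int, int]:
    """Validate image dimensions and return the size after any downscaling."""
    for side in (width, height):
        if not MIN_SIDE <= side <= MAX_SIDE:
            raise ValueError(
                f"image side {side}px is outside {MIN_SIDE}-{MAX_SIDE}px"
            )
    pixels = width * height
    if pixels <= MAX_PIXELS:
        return width, height
    # Scale both sides by the same factor so the aspect ratio is preserved
    # and the total pixel count drops to at most 14 MP.
    scale = math.sqrt(MAX_PIXELS / pixels)
    return math.floor(width * scale), math.floor(height * scale)


if __name__ == "__main__":
    print(target_size(1000, 1000))
    print(target_size(10_000, 5_000))
```

A 1000 x 1000 image is returned unchanged, while a 10000 x 5000 image (50 MP) is reduced to just under 14 MP with its 2:1 ratio approximately preserved.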
OCR results on scanned documents have been improved. The best results are obtained when the skew angle is kept within +/- 20 degrees.
Example of using the Digitize Document activity
You can see how the Digitize Document activity is used in an example that incorporates multiple activities.
You can check and download the example from here.