# Digitize Document

> `UiPath.IntelligentOCR.Activities.Digitization.DigitizeDocument`

## Description

Digitizes a document, extracting its Document Object Model (DOM) and text and storing them in their corresponding variable types.

:::note
You must assign an OCR engine to this activity by dragging it into the body of the activity. The chosen OCR engine is used only if the incoming documents require OCR processing. Visit [OCR Engines](https://docs.uipath.com/document-understanding/automation-cloud/latest/classic-user-guide/ocr-engines) for the list of available OCR engines. The input and output parameters of the selected OCR engine are automatically set by the **Digitize Document** activity.
:::

## Project compatibility

**Windows-Legacy | Windows**

## Configuration

#### Properties panel

**Common**
* **DisplayName** - The display name of the activity.

**Input**
* **ApplyOcrOnPdf** - Specifies whether OCR is applied to PDF documents. If set to **Yes**, OCR is applied to all PDF pages of the document. If set to **No**, only digitally typed text is extracted. The default value is **Auto**, in which case the activity determines whether OCR is needed based on the input document.
* **DegreeOfParallelism** - Specifies how many pages, if any, are analyzed in parallel. The `-1` value uses the number of cores on the machine minus one, meaning that the activity tries to process that many pages in parallel. A positive value uses that specific number of logical processors. By default, this property is set to `-1`.

  This property accepts any value that is not greater than `LogicalProcessorCount - 1`.
* **DetectCheckboxes** - Detects the available check-boxes from the document while digitizing it. The default value is **True**.
* **DocumentPath** - The file path of the document you want to digitize. This field supports only strings and `String` variables.
  :::note
  * Set the **ApplyOcrOnPdf** property to **Yes** for native PDF documents that contain logos, hidden images, or other elements that corrupt the digitization output and might lead to suboptimal extractions and/or classifications.
  * Text extraction from PDF files has been upgraded. This results in an optimized extraction process, where both native and scanned text are retrieved at the same time. The process applies OCR only to the images identified in the PDF file. This improvement is available only when the **ApplyOcrOnPdf** option is set to **Auto**.
  :::
  :::note
  The supported file types for this property field are `.png`, `.jpe`, `.jpg`, `.jpeg`, `.tiff`, `.tif`, and `.pdf`.
  :::
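
The input rules above can be illustrated outside of Studio. The following is a minimal Python sketch, not part of the activity or of any UiPath API: the helper names `resolve_degree_of_parallelism` and `is_supported_document` are hypothetical, and the handling of single-core machines, of `0` and other negative values, and of upper-case file extensions are assumptions made for illustration only.

```python
import os
from pathlib import Path
from typing import Optional

# File types listed in the DocumentPath note.
SUPPORTED_EXTENSIONS = {".png", ".jpe", ".jpg", ".jpeg", ".tiff", ".tif", ".pdf"}


def resolve_degree_of_parallelism(value: int = -1,
                                  logical_processors: Optional[int] = None) -> int:
    """Return how many pages would be processed in parallel (hypothetical helper)."""
    if logical_processors is None:
        logical_processors = os.cpu_count() or 1
    # Assumption: keep at least one worker on a single-core machine.
    limit = max(1, logical_processors - 1)
    if value == -1:
        # Default: number of cores on the machine minus one.
        return limit
    if value < 1:
        # Assumption: this sketch rejects 0 and negative values other than -1.
        raise ValueError("value must be -1 or a positive integer")
    if value > limit:
        # The property accepts no value greater than LogicalProcessorCount - 1.
        raise ValueError(f"value must not be greater than {limit}")
    return value


def is_supported_document(document_path: str) -> bool:
    """Check the file extension against the supported types.

    Assumption: the comparison ignores letter case.
    """
    return Path(document_path).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, on a machine with 8 logical processors, `resolve_degree_of_parallelism(-1, 8)` resolves to 7, while a requested value of 8 is rejected.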

**Misc**
* **Private** - If selected, the values of variables and arguments are no longer logged at Verbose level.

**Output**
* **DocumentObjectModel** - The Document Object Model (DOM) of the file, stored in a `Document` variable. This field supports only `Document` variables.
* **DocumentText** - The text extracted from the specified document. This variable can be subsequently used in the **Present Validation Station** activity. This field supports only `String` variables.
  :::note
  Starting with UiPath.IntelligentOCR.Activities package v6.3.0-preview, the **Digitize Document** activity comes with a default preselected OCR engine, the **UiPath® Document OCR** engine.
  :::

The two output variables depend on each other and are meant to be used as a pair throughout the rest of the document processing framework (classification, data extraction, human validation, and so on).

## Important

Starting with v5.1.0 of the UiPath.IntelligentOCR.Activities package, the **ForceApplyOCR** parameter is replaced by **ApplyOcrOnPdf**. After updating the package, the old values map to the new ones as follows:

* **ForceApplyOCR** = **True** is replaced by **ApplyOcrOnPdf** = **Yes**;
* **ForceApplyOCR** = **False** is replaced by **ApplyOcrOnPdf** = **Auto**;
* **ForceApplyOCR** = **Empty** is replaced by **ApplyOcrOnPdf** = **Auto**;
* **ForceApplyOCR** = Your defined variable is replaced by **ApplyOcrOnPdf** = **Auto**.
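
The mapping above can be summarized as a small lookup. This is a minimal Python sketch for illustration only: the function name `migrate_force_apply_ocr` is hypothetical, and modeling an empty value as `None` and a user-defined variable as a string holding the variable name are assumptions of this sketch, not how the package stores the setting.

```python
from typing import Optional, Union


def migrate_force_apply_ocr(legacy_value: Union[bool, str, None]) -> str:
    """Map a legacy ForceApplyOCR setting to the new option (hypothetical helper).

    Assumed encoding of the legacy setting:
      True / False -> literal Boolean value
      None         -> property left empty
      str          -> name of a user-defined variable
    """
    # Only a literal True becomes Yes.
    if legacy_value is True:
        return "Yes"
    # False, empty, and variable bindings all become Auto.
    return "Auto"


if __name__ == "__main__":
    for legacy in (True, False, None, "myForceOcrFlag"):
        print(repr(legacy), "->", migrate_force_apply_ocr(legacy))
```

Note that a variable binding becomes **Auto** regardless of the value the variable holds at run time, so workflows that relied on a variable evaluating to `True` may need **ApplyOcrOnPdf** set to **Yes** manually.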

:::note
The **Digitize Document** activity extracts the text from a PDF file and, for complex documents, applies pre-processing and post-processing algorithms. This activity can be used together with other Document Understanding activities.
:::

## Document Object Model

The **Document Object Model** is captured in a proprietary object. Visit [Document Class](https://docs.uipath.com/activities/other/latest/document-understanding/document-class "Document is a public class that represents a digitized document.") for more information.

:::tip
To successfully digitize and process your documents, consider the following advice:
* For an image to be successfully digitized/processed, its width and height should each be between 50 and 10000 pixels. Any image outside this range is rejected with an exception message. An image that meets these dimensions but has a total size larger than 14 MP is scaled down to 14 MP, while maintaining the aspect ratio (width-to-height ratio).
* The best results are obtained by keeping the skew angle within +/- 20 degrees.
:::
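
The image size rules in the tip above can be checked before sending a file to the activity. The following is a minimal Python sketch under stated assumptions: the function name `plan_image_size` is hypothetical, the 50 and 10000 pixel limits are treated as inclusive, 1 MP is taken as 1,000,000 pixels, and the scaled dimensions are rounded down. The activity's actual rounding and resampling behavior is not documented here.

```python
import math
from typing import Tuple

MIN_SIDE = 50             # minimum width/height in pixels (assumed inclusive)
MAX_SIDE = 10_000         # maximum width/height in pixels (assumed inclusive)
MAX_PIXELS = 14_000_000   # 14 MP, assuming 1 MP = 1,000,000 pixels


def plan_image_size(width: int, height: int) -> Tuple[int, int]:
    """Return the dimensions an image would have after the 14 MP cap.

    Raises ValueError when either side is outside the accepted range.
    """
    for side in (width, height):
        if not MIN_SIDE <= side <= MAX_SIDE:
            raise ValueError(
                f"side of {side} px is outside the {MIN_SIDE}-{MAX_SIDE} px range"
            )
    pixels = width * height
    if pixels <= MAX_PIXELS:
        return width, height
    # Scale both sides by the same factor so the aspect ratio is preserved.
    scale = math.sqrt(MAX_PIXELS / pixels)
    return math.floor(width * scale), math.floor(height * scale)
```

For instance, a 1000 x 1000 image is returned unchanged, while a 10000 x 10000 image (100 MP) is planned at 3741 x 3741, just under the 14 MP cap.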

## Example of using the Digitize Document activity

Visit [Manual validation for digitize documents](https://docs.uipath.com/activities/other/latest/document-understanding/manual-validation-for-digitize-documents#manual-validation-for-digitize-documents) to check how the **Digitize Document** activity is used in an example that incorporates multiple activities.
