# Read PDF files

> You can read and separately extract the content of `.pdf` files using activities that can read all characters included in the document.

Depending on your needs, you can use a simple activity that reads the characters directly, or one that uses an OCR engine. The benefit of an OCR engine is that it can also read scanned, signed, or handwritten documents.

The example below presents two ways of reading a `.pdf` file:

1. The first one explains how to read the `.pdf` file using the [Read PDF Text](https://docs.uipath.com/activities/other/latest/document-understanding/read-pdf-text) activity.
2. The second one explains how to read the `.pdf` file using the [Read PDF With OCR](https://docs.uipath.com/activities/other/latest/document-understanding/read-pdf-with-ocr) activity.

The main difference between the two scenarios is that the second one uses OCR engines, so the extracted information is more accurate than in the first case when the analyzed file is an image or a scan, or includes signed or handwritten fields. You can find both activities in the **UiPath.PDF.Activities** package.

Both scenarios share a single workflow, which is common up to the point where the user is asked to choose the reading method.

#### Steps

1. Open Studio and create a new **Process**.
2. Add a **Flowchart** container in the **Workflow Designer**.
   1. Create a variable named `chooseOption`, with the **GenericValue** type, and no default value.
      :::note
      Add your `.pdf` files to the project directory so that you can run the entire process from one place, or download this example to use the provided files.
      :::
3. Add an **Input Dialog** activity and connect it to the **Start Node**.
   1. In the **Properties** panel, add the expression `"Choose one option below:"` in the **Label** field.
   2. Add the expression `{"Read PDF Text", "Read PDF With OCR"}` in the **Options** field.
   3. Add the value `"Options"` in the **Title** field.
   4. Add the variable `chooseOption` in the **Result** field.
4. Add a **Flow Decision** activity after the **Input Dialog** activity and connect the two.
   1. In the **Properties** panel, add the expression `chooseOption = "Read PDF Text"` in the **Condition** field.
5. Add a **Sequence** container and connect it to the **True** branch of the **Flow Decision** activity. Name the **Sequence** **Read PDF Text**. This sequence extracts information by using string operations such as `Split`.
   1. Create the variables shown in the following table:

      Table 1. Variables to be created

      | **Variable Name** | **Variable Type** | **Default Value** |
      | --- | --- | --- |
      | `extractedText` | **String** | N/A |
      | `arrayText` | **System.String[]** | N/A |
      | `address` | **GenericValue** | N/A |
      | `city` | **String** | N/A |
      | `phoneNumber` | **String** | N/A |
      | `invoiceNumber` | **String** | N/A |
      | `vendor` | **GenericValue** | N/A |
      | `bankName` | **String** | N/A |
      | `bankAccount` | **String** | N/A |
      | `ibanCode` | **String** | N/A |
6. Add a **Sequence** container and connect it to the **False** branch of the **Flow Decision** activity. Name the **Sequence** **Read PDF With OCR**. This sequence extracts information by using two OCR engines (Microsoft OCR and Tesseract OCR).
   1. Create the variables shown in the following table:

      Table 2. Variables to be created

      | **Variable Name** | **Variable Type** | **Default Value** |
      | --- | --- | --- |
      | `extractedTextTesseract` | **String** | N/A |
      | `extractedTextMicrosoft` | **String** | N/A |

      Figure 1. Overview of the beginning of the workflow

      ![Overview of the beginning of the workflow](https://dev-assets.cms.uipath.com/assets/images/activities/document-understanding-overview-of-the-beginning-of-the-workflow-181316-fe264fda-3e297389.webp)
7. Read a PDF file using the **Read PDF Text** activity:
   1. Open the **Read PDF Text** sequence container by double-selecting it.
   2. Add a **Read PDF Text** activity inside the sequence.
      1. In the **Properties** panel, add the expression `"NPO Invoice.pdf"` in the **FileName** field.
      2. Add the value `"All"` in the **Range** field.
      3. Add the variable `extractedText` in the **Text** field.
8. Add an **Assign** activity after the **Read PDF Text** activity.
   1. Add the variable `arrayText` in the **To** field.
   2. Add the expression `extractedText.Split(Environment.NewLine.ToArray, StringSplitOptions.RemoveEmptyEntries)` in the **Value** field.
9. Add an **If** activity below the **Assign** activity.
   1. Add the expression `arrayText(0).Equals("Tiefland Glass AG")` in the **Condition** field.
10. Add an **Assign** activity inside the **Then** field of the **If** activity.
    1. Add the variable `address` in the **To** field.
    2. Add the expression `arrayText(2)` in the **Value** field.
11. Add another **Assign** activity and place it after the previous one.
    1. Add the variable `city` in the **To** field.
    2. Add the expression `arrayText(3).Split(","c)(0)` in the **Value** field.
12. Add another **Assign** activity and place it after the previous one.
    1. Add the variable `phoneNumber` in the **To** field.
    2. Add the expression `arrayText(4).Split(":"c)(1).Split({"INVOICE"},StringSplitOptions.None)(0)` in the **Value** field.
13. Add another **Assign** activity and place it after the previous one.
    1. Add the variable `invoiceNumber` in the **To** field.
    2. Add the expression `arrayText(4).Split(":"c)(1).Split({"INVOICE"},StringSplitOptions.None)(1).Split("#"c)(1)` in the **Value** field.
14. Add another **Assign** activity and place it after the previous one.
    1. Add the variable `vendor` in the **To** field.
    2. Add the expression `arrayText(arrayText.Count-5)` in the **Value** field.
15. Add an **Assign** activity inside the **Else** field.
    1. Add the variable `address` in the **To** field.
    2. Add the expression `arrayText(1)` in the **Value** field.
16. Add another **Assign** activity and place it after the previous one.
    1. Add the variable `city` in the **To** field.
    2. Add the expression `arrayText(2).Split(","c)(0)` in the **Value** field.
17. Add another **Assign** activity and place it after the previous one.
    1. Add the variable `phoneNumber` in the **To** field.
    2. Add the expression `arrayText(3).Split(":"c)(1).Split({"INVOICE"},StringSplitOptions.None)(0)` in the **Value** field.
18. Add another **Assign** activity and place it after the previous one.
    1. Add the variable `invoiceNumber` in the **To** field.
    2. Add the expression `arrayText(3).Split(":"c)(1).Split({"INVOICE"},StringSplitOptions.None)(1).Split("#"c)(1)` in the **Value** field.
19. Add another **Assign** activity and place it after the previous one.
    1. Add the variable `vendor` in the **To** field.
    2. Add the expression `arrayText(arrayText.Count-5)` in the **Value** field.

       Figure 2. Overview of the sequence containing the **Assign** activities

       ![Overview of the sequence containing the Assign activities](https://dev-assets.cms.uipath.com/assets/images/activities/document-understanding-overview-of-the-sequence-containing-the-assign-activities-187494-17b91efc-d7673566.webp)
20. Place a **For Each** activity after the **If** container.
    1. Add the variable `arrayText` in the **Value** field.
21. Add an **If** activity inside the **Body** container of the **For Each** activity.
    1. Add the expression `item.Contains("Bank Name:")` in the **Condition** field.
22. Add an **Assign** activity inside the **Then** field.
    1. Add the variable `bankName` in the **To** field.
    2. Add the expression `item.Split(":"c)(1)` in the **Value** field.
23. Add an **If** activity after the previous one.
    1. Add the expression `item.Contains("Bank Account:")` in the **Condition** field.
24. Add an **Assign** activity inside the **Then** field.
    1. Add the variable `bankAccount` in the **To** field.
    2. Add the expression `item.Split(":"c)(1)` in the **Value** field.
25. Add an **If** activity after the previous one.
    1. Add the expression `item.Contains("IBAN Code:")` in the **Condition** field.
26. Add an **Assign** activity inside the **Then** field.
    1. Add the variable `ibanCode` in the **To** field.
    2. Add the expression `item.Split(":"c)(1)` in the **Value** field.

       Figure 3. Overview of the **For Each** activity

       ![Overview of the For Each activity](https://dev-assets.cms.uipath.com/assets/images/activities/document-understanding-overview-of-the-for-each-activity-185318-afaa721e-f6218e6e.webp)
27. Return to the **Read PDF Text** sequence and add a **Write Text File** activity below the **For Each** activity.
    1. Add the value `"InvoiceDetails.txt"` in the **FileName** field.
    2. Add the expression `"Invoice details"+Environment.NewLine+Environment.NewLine+"Vendor: "+vendor+Environment.NewLine+"Vendor address: "+address+Environment.NewLine+"City: "+city+Environment.NewLine+"Phone number:"+phoneNumber+Environment.NewLine+"Invoice number:"+invoiceNumber+Environment.NewLine+"Bank name:"+bankName+Environment.NewLine+"Bank account:"+bankAccount+Environment.NewLine+"IBAN Code:"+ibanCode` in the **Text** field.

       Figure 4. Overview of the **For Each** container

       ![Overview of the For Each container](https://dev-assets.cms.uipath.com/assets/images/activities/document-understanding-overview-of-the-for-each-container-181324-6d05b3dc-9a540b9e.webp)
28. Return to the **Main** workflow working area.
29. Read a PDF file using the **Read PDF With OCR** activity:
    1. Open the **Read PDF With OCR** sequence container.
    2. Drag a **Read PDF With OCR** activity inside the sequence.
       1. Add the value `"Invoice02.pdf"` in the **FileName** field.
       2. In the **Properties** panel, add the value `1` in the **DegreeOfParallelism** field.
    3. Drag the **Tesseract OCR** engine inside the **Read PDF With OCR** activity.
       1. In the **Properties** panel, add the variable `extractedTextTesseract` in the **Text** field.
    4. Drag another **Read PDF With OCR** activity and place it after the previous one.
       1. Add the value `"Invoice02.pdf"` in the **FileName** field.
       2. In the **Properties** panel, add the value `1` in the **DegreeOfParallelism** field.
    5. Drag the **Microsoft OCR** engine inside the **Read PDF With OCR** activity.
       1. In the **Properties** panel, add the variable `extractedTextMicrosoft` in the **Text** field.
    6. Drag a **Write Text File** activity below the **Read PDF With OCR** activity.
       1. Add the value `"OCRMicrosoft.txt"` in the **FileName** field.
       2. Add the variable `extractedTextMicrosoft` in the **Text** field.
    7. Drag a **Write Text File** activity below the previous **Write Text File** activity.
       1. Add the value `"OCRTesseract.txt"` in the **FileName** field.
       2. Add the variable `extractedTextTesseract` in the **Text** field.

          Figure 5. Overview of the **Read PDF With OCR** activity

          ![Overview of the Read PDF with OCR activity](https://dev-assets.cms.uipath.com/assets/images/activities/document-understanding-overview-of-the-read-pdf-with-ocr-activity-183445-40db9ce8-7a40c813.webp)
30. Run the process. The robot extracts the data using the reading method you selected and saves the output in a `.txt` file.
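
To check the line indexing outside Studio, the `Split`-based extraction in steps 8 to 19 can be sketched in Python. This is a minimal sketch and not part of the UiPath workflow: the names `split_lines` and `parse_header` and the sample text are hypothetical, and the sample only imitates the line layout that the expressions assume, not the actual contents of `NPO Invoice.pdf`.

```python
import re

COMPANY_NAME = "Tiefland Glass AG"


def split_lines(text):
    # Mirrors Split(Environment.NewLine.ToArray, RemoveEmptyEntries):
    # break on every CR or LF character and drop the empty entries.
    return [line for line in re.split(r"[\r\n]", text) if line]


def parse_header(lines):
    # When the first line is the company name, the fields sit one line lower
    # (indexes 2, 3, 4 instead of 1, 2, 3), as in the Then and Else branches.
    offset = 1 if lines[0] == COMPANY_NAME else 0
    # Same chain as Split(":"c)(1).Split({"INVOICE"}, StringSplitOptions.None)
    parts = lines[3 + offset].split(":")[1].split("INVOICE")
    return {
        "address": lines[1 + offset],
        "city": lines[2 + offset].split(",")[0],
        "phoneNumber": parts[0],
        "invoiceNumber": parts[1].split("#")[1],
        # Same as arrayText(arrayText.Count-5): fifth line from the end.
        "vendor": lines[len(lines) - 5],
    }


# Hypothetical sample text, invented only to exercise the indexes.
SAMPLE = "\r\n".join([
    "Tiefland Glass AG",
    "Invoice",
    "12 Example Street",
    "Springfield, 00000",
    "Phone: 555-0100 INVOICE #A-42",
    "",
    "Vendor Example Ltd",
    "Bank Name: Example Bank",
    "Bank Account: 000111",
    "IBAN Code: XX00 0000",
    "Thank you",
])

fields = parse_header(split_lines(SAMPLE))
for name, value in fields.items():
    print(f"{name}: {value!r}")
```

As in the workflow expressions, the values are not trimmed, so `phoneNumber` keeps its surrounding spaces. Calling `strip()` here, or `Trim` on the corresponding string in Studio, would remove them.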

Download the example as a ZIP archive: [Example](https://documentationexamplerepo.blob.core.windows.net/examples/Activities/ReadPDFFileSample.zip).
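
The **For Each** loop in steps 20 to 26 and the bank lines written in step 27 can be sketched in Python in the same way. This is an illustrative sketch only, assuming the three checks fill `bankName`, `bankAccount`, and `ibanCode` respectively. The names `parse_bank_details` and `format_bank_details` and the sample lines are hypothetical.

```python
# Label searched for in each line, mapped to the variable it fills.
LABELS = {
    "Bank Name:": "bankName",
    "Bank Account:": "bankAccount",
    "IBAN Code:": "ibanCode",
}


def parse_bank_details(lines):
    details = dict.fromkeys(LABELS.values(), "")
    for item in lines:
        for label, name in LABELS.items():
            if label in item:
                # Same as item.Split(":"c)(1): the text after the first colon,
                # up to the next colon if there is one.
                details[name] = item.split(":")[1]
    return details


def format_bank_details(details):
    # The labels carry no trailing space because the extracted values
    # still start with the space that followed the colon in the source line.
    # "\n" stands in for Environment.NewLine here.
    return "\n".join([
        "Bank name:" + details["bankName"],
        "Bank account:" + details["bankAccount"],
        "IBAN Code:" + details["ibanCode"],
    ])


# Hypothetical sample lines, invented for illustration.
SAMPLE_LINES = [
    "Vendor Example Ltd",
    "Bank Name: Example Bank",
    "Bank Account: 000111",
    "IBAN Code: XX00 0000",
]

bank = parse_bank_details(SAMPLE_LINES)
print(format_bank_details(bank))
```

Because only the second element of the split is kept, a value that itself contains a colon would be cut at that colon. The sketch keeps this behavior to stay close to the workflow expressions.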
