Document Understanding Activities
Last updated Oct 8, 2024

Read PDF files

You can read .pdf files and extract their content by using activities that read all the characters in the document.

Depending on your needs, you can use a simple activity that recognizes the characters, or one that uses an OCR engine. The benefit of an OCR engine is that it can also read scanned, signed, or handwritten documents.

The example below presents two situations of reading a .pdf file:
  1. The first one explains how to read the .pdf file while using the Read PDF Text activity.
  2. The second one explains how to read the .pdf file while using the Read PDF With OCR activity.

    The main difference between the two scenarios is that the second one also uses an OCR engine, so the extracted information is more accurate than in the first case when the analyzed file is an image or a scan, or includes signed or handwritten fields. You can find both activities in the UiPath.PDF.Activities package.

Only one workflow is required for both scenarios, common until the point of asking the user to choose the desired reading method.

Steps
  1. Open Studio and create a new Process.
  2. Add a Flowchart container in the Workflow Designer.
    1. Create a variable named chooseOption, with the GenericValue type, and no default value.
      Note: Add your .pdf files to the project directory so that you can run the entire process from the same place, or download this example to use the given files.
  3. Add an Input Dialog activity and connect it to the Start Node.
    1. In the Properties panel, add the expression "Choose one option below:" in the Label field.
    2. Add the expression {"Read PDF Text", "Read PDF With OCR"} in the Options field.
    3. Add the value "Options" in the Title field.
    4. Add the variable chooseOption in the Result field.
  4. Add a Flow Decision activity after the Input Dialog activity and connect it to it.
    1. In the Properties panel, add the expression chooseOption = "Read PDF Text" in the Condition field.
  5. Add a Sequence container and connect it to the True branch of the Flow Decision activity. The name of the Sequence should be Read PDF Text. This sequence extracts information by using string operations (the Split method) on the text that was read.
    1. Create the variables shown in the following table:
      Table 1. Variables to be created

      Variable Name     Variable Type      Default Value
      extractedText     String             N/A
      arrayText         System.String[]    N/A
      address           GenericValue       N/A
      city              String             N/A
      phoneNumber       String             N/A
      invoiceNumber     String             N/A
      vendor            GenericValue       N/A
      bankName          String             N/A
      bankAccount       String             N/A
      ibanCode          String             N/A
  6. Add a Sequence container and connect it to the False branch of the Flow Decision activity. The name of the Sequence should be Read PDF With OCR. This sequence extracts information by using two OCR engines (Microsoft OCR and Tesseract OCR).
    1. Create the variables shown in the following table:
      Table 2. Variables to be created

      Variable Name             Variable Type    Default Value
      extractedTextTesseract    String           N/A
      extractedTextMicrosoft    String           N/A
    Figure 1. Overview of the beginning of the workflow

  7. Read a PDF File using the Read PDF Text activity:
    1. Open the Read PDF Text sequence container by double-selecting it.
    2. Add a Read PDF Text activity inside the sequence.
      1. In the Properties panel, add the expression "NPO Invoice.pdf" in the FileName field.
      2. Add the value "All" in the Range field.
      3. Add the variable extractedText in the Text field.
  8. Add an Assign activity after the Read PDF Text activity.
    1. Add the variable arrayText in the To field.
    2. Add the expression extractedText.Split(Environment.NewLine.ToArray, StringSplitOptions.RemoveEmptyEntries) in the Value field.
  9. Add an If activity below the Assign activity.
    1. Add the expression arrayText(0).Equals("Tiefland Glass AG") in the Condition field.
  10. Add an Assign activity inside the Sequence container.
    1. Add the variable address in the To field.
    2. Add the expression arrayText(2) in the Value field.
  11. Add another Assign activity and place it after the previous one.
    1. Add the variable city in the To field.
    2. Add the expression arrayText(3).Split(","c)(0) in the Value field.
  12. Add another Assign activity and place it after the previous one.
    1. Add the variable phoneNumber in the To field.
    2. Add the expression arrayText(4).Split(":"c)(1).Split({"INVOICE"},StringSplitOptions.None)(0) in the Value field.
  13. Add another Assign activity and place it after the previous one.
    1. Add the variable invoiceNumber in the To field.
    2. Add the expression arrayText(4).Split(":"c)(1).Split({"INVOICE"},StringSplitOptions.None)(1).Split("#"c)(1) in the Value field.
  14. Add another Assign activity and place it after the previous one.
    1. Add the variable vendor in the To field.
    2. Add the expression arrayText(arrayText.Count-5) in the Value field.
  15. Add an Assign activity inside the Else field.
    1. Add the variable address in the To field.
    2. Add the expression arrayText(1) in the Value field.
  16. Add another Assign activity and place it after the previous one.
    1. Add the variable city in the To field.
    2. Add the expression arrayText(2).Split(","c)(0) in the Value field.
  17. Add another Assign activity and place it after the previous one.
    1. Add the variable phoneNumber in the To field.
    2. Add the expression arrayText(3).Split(":"c)(1).Split({"INVOICE"},StringSplitOptions.None)(0) in the Value field.
  18. Add another Assign activity and place it after the previous one.
    1. Add the variable invoiceNumber in the To field.
    2. Add the expression arrayText(3).Split(":"c)(1).Split({"INVOICE"},StringSplitOptions.None)(1).Split("#"c)(1) in the Value field.
  19. Add another Assign activity and place it after the previous one.
    1. Add the variable vendor in the To field.
    2. Add the expression arrayText(arrayText.Count-5) in the Value field.
      Figure 2. Overview of the sequence containing the Assign activities
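
The positional logic in steps 8 to 19 can be easier to follow outside Studio. The following is a minimal Python sketch of the same idea, not the expressions you enter in the activities. The sample text, the function names split_lines and extract_header_fields, and the offset variable are assumptions made for illustration; the real content of NPO Invoice.pdf will differ.

```python
import re

# Hypothetical extracted text; the real "NPO Invoice.pdf" content will differ.
SAMPLE_TEXT = (
    "Tiefland Glass AG\r\n"
    "Glassworks Division\r\n"
    "\r\n"
    "12 Example Street\r\n"
    "Springfield, 00000\r\n"
    "Phone: 555-0100 INVOICE #4711\r\n"
    "Hypothetical Vendor Ltd\r\n"
    "Bank Name: Example Bank\r\n"
    "Bank Account: 0000 1111\r\n"
    "IBAN Code: XX00 0000 0000\r\n"
    "Thank you\r\n"
)


def split_lines(text):
    """Split on CR and LF characters and drop empty entries (step 8)."""
    return [line for line in re.split(r"[\r\n]", text) if line]


def extract_header_fields(lines):
    """Pick fields by position; the offset mirrors the If/Else in steps 9 to 19."""
    # When the first line is the company name, every field sits one line lower.
    offset = 1 if lines[0] == "Tiefland Glass AG" else 0
    phone_and_invoice = lines[3 + offset].split(":")[1]
    return {
        "address": lines[1 + offset],
        "city": lines[2 + offset].split(",")[0],
        "phoneNumber": phone_and_invoice.split("INVOICE")[0],
        "invoiceNumber": phone_and_invoice.split("INVOICE")[1].split("#")[1],
        # Fifth line from the end, like arrayText(arrayText.Count-5).
        "vendor": lines[len(lines) - 5],
    }


fields = extract_header_fields(split_lines(SAMPLE_TEXT))
print(fields)
```

Because every field is located by line index, this approach only works while the invoice layout stays fixed. The phone number keeps its surrounding spaces here, since the split does not trim them.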

  20. Place a For Each activity after the If container.
    1. Add the variable arrayText in the Value field.
  21. Add an If activity inside the Body container of the For Each activity.
    1. Add the expression item.Contains("Bank Name:") in the Condition field.
  22. Add an Assign activity inside the Then field.
    1. Add the variable bankName in the To field.
    2. Add the expression item.Split(":"c)(1) in the Value field.
  23. Add an If activity after the previous one.
    1. Add the expression item.Contains("Bank Account:") in the Condition field.
  24. Add an Assign activity inside the Then field.
    1. Add the variable bankAccount in the To field.
    2. Add the expression item.Split(":"c)(1) in the Value field.
  25. Add an If activity after the previous one.
    1. Add the expression item.Contains("IBAN Code:") in the Condition field.
  26. Add an Assign activity inside the Then field.
    1. Add the variable ibanCode in the To field.
    2. Add the expression item.Split(":"c)(1) in the Value field.
      Figure 3. Overview of the For Each activity
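
The For Each loop in steps 20 to 26 searches for labelled lines instead of relying on their position. As a rough illustration, the same scan could look like this in Python. The sample lines, the LABELS mapping, and the function name extract_bank_details are hypothetical and exist only for this sketch.

```python
# Hypothetical lines; in the workflow these come from arrayText.
sample_lines = [
    "Hypothetical Vendor Ltd",
    "Bank Name: Example Bank",
    "Bank Account: 0000 1111",
    "IBAN Code: XX00 0000 0000",
]

# Label text to look for, mapped to the workflow variable it fills.
LABELS = {
    "Bank Name:": "bankName",
    "Bank Account:": "bankAccount",
    "IBAN Code:": "ibanCode",
}


def extract_bank_details(lines):
    """Check every line against every label, like the three If activities."""
    details = {name: None for name in LABELS.values()}
    for item in lines:
        for label, name in LABELS.items():
            if label in item:
                # Same as item.Split(":"c)(1): the text after the first colon.
                details[name] = item.split(":")[1]
    return details


bank_details = extract_bank_details(sample_lines)
print(bank_details)
```

Taking element 1 of the split keeps only the text between the first and second colon, so a value that itself contains a colon would be cut short. The values also keep their leading space.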

  27. Return to the Read PDF Text sequence and add a Write Text File activity below the For Each activity.
    1. Add the value "InvoiceDetails.txt" in the FileName field.
    2. Add the expression "Invoice details"+Environment.NewLine+Environment.NewLine+"Vendor: "+vendor+Environment.NewLine+"Vendor address: "+address+Environment.NewLine+"City: "+city+Environment.NewLine+"Phone number:"+phoneNumber+Environment.NewLine+"Invoice number:"+invoiceNumber+Environment.NewLine+"Bank name:"+bankName+Environment.NewLine+"Bank account:"+bankAccount+Environment.NewLine+"IBAN Code:"+ibanCode in the Text field.
      Figure 4. Overview of the For Each container
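
The long concatenation in the Text field builds a small report: a title, a blank line, and one labelled line per variable. A possible Python sketch of that composition is shown below. The field values, the ROWS list, and the function name build_report are assumptions for illustration. Unlike the original expression, the sketch trims each value and always puts one space after the colon.

```python
import os

# Hypothetical values; in the workflow these come from the Assign activities.
fields = {
    "vendor": "Hypothetical Vendor Ltd",
    "address": "12 Example Street",
    "city": "Springfield",
    "phoneNumber": " 555-0100 ",
    "invoiceNumber": "4711",
    "bankName": " Example Bank",
    "bankAccount": " 0000 1111",
    "ibanCode": " XX00 0000 0000",
}

# (label, variable name) pairs in the order used by the Text expression.
ROWS = [
    ("Vendor", "vendor"),
    ("Vendor address", "address"),
    ("City", "city"),
    ("Phone number", "phoneNumber"),
    ("Invoice number", "invoiceNumber"),
    ("Bank name", "bankName"),
    ("Bank account", "bankAccount"),
    ("IBAN Code", "ibanCode"),
]


def build_report(values):
    """Return the title, a blank line, and one 'label: value' line per row."""
    lines = ["Invoice details", ""]
    for label, key in ROWS:
        lines.append(f"{label}: {values[key].strip()}")
    return os.linesep.join(lines)


report = build_report(fields)
print(report)
```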

  28. Return to the Main workflow working area.
  29. Read a PDF File using the Read PDF With OCR activity.
    1. Open the Read PDF With OCR sequence container.
    2. Drag a Read PDF With OCR activity inside the sequence.
      1. Add the value "Invoice02.pdf" in the FileName field.
      2. In the Properties panel, add the value 1 in the DegreeOfParallelism field.
    3. Drag the Tesseract OCR engine inside the Read PDF With OCR activity.
      1. In the Properties panel, add the variable extractedTextTesseract in the Text field.
    4. Drag another Read PDF With OCR activity and place it after the previous one.
      1. Add the value "Invoice02.pdf" in the FileName field.
      2. In the Properties panel, add the value 1 in the DegreeOfParallelism field.
    5. Drag the Microsoft OCR engine inside the Read PDF With OCR activity.
      1. In the Properties panel add the variable extractedTextMicrosoft in the Text field.
    6. Drag a Write Text File activity below the Read PDF With OCR activity.
      1. Add the value "OCRMicrosoft.txt" in the FileName field.
      2. Add the variable extractedTextMicrosoft in the Text field.
    7. Drag a Write Text File activity below the previous Write Text File activity.
      1. Add the value "OCRTesseract.txt" in the FileName field.
      2. Add the variable extractedTextTesseract in the Text field.
        Figure 5. Overview of the Read PDF with OCR activity

  30. Run the process. The robot extracts the data using the specified process and saves the output in a .txt file.
Download the example as a ZIP file: Example.
