Read PDF files
You can read .pdf files using activities that can read all characters included in the document.
Depending on your needs, you can use a simple activity that recognizes the characters, or one that uses an OCR engine. The benefit of an OCR engine is that it can also read scanned, signed, or handwritten documents.
This example covers two scenarios for reading a .pdf file:
- The first one explains how to read the .pdf file while using the Read PDF Text activity.
- The second one explains how to read the .pdf file while using the Read PDF With OCR activity.
The main difference between the two scenarios is that the second one also uses OCR engines, so the extracted information is more accurate than in the first case when the analyzed file is an image, is scanned, or includes signed or handwritten fields. You can find both activities in the UiPath.PDF.Activities package.
Both scenarios use a single workflow, which is shared up to the point where the user is asked to choose the desired reading method.
Steps
- Open Studio and create a new Process.
- Add a Flowchart container in the Workflow Designer.
  - Create a variable named chooseOption, with the GenericValue type, and no default value.
    Note: Add your .pdf files to the project directory so that you can run the entire process from the same place, or download this example to use the given file.
- Add an Input Dialog activity and connect it to the Start Node.
  - In the Properties panel, add the expression "Choose one option below:" in the Label field.
  - Add the expression {"Read PDF Text", "Read PDF With OCR"} in the Options field.
  - Add the value "Options" in the Title field.
  - Add the variable chooseOption in the Result field.
- Add a Flow Decision activity after the Input Dialog activity and connect the two.
  - In the Properties panel, add the expression chooseOption = "Read PDF Text" in the Condition field.
- Add a Sequence container and connect it to the True branch of the Flow Decision activity. Name the Sequence Read PDF Text. This sequence extracts information by using string operations such as Split and Contains.
  - Create the variables shown in the following table:
    Table 1. Variables to be created

    | Variable | Type | Default Value |
    | --- | --- | --- |
    | extractedText | String | N/A |
    | arrayText | System.String[] | N/A |
    | address | GenericValue | N/A |
    | city | String | N/A |
    | phoneNumber | String | N/A |
    | invoiceNumber | String | N/A |
    | vendor | GenericValue | N/A |
    | bankName | String | N/A |
    | bankAccount | String | N/A |
    | ibanCode | String | N/A |
- Add a Sequence container and connect it to the False branch of the Flow Decision activity. Name the Sequence Read PDF With OCR. This sequence extracts information by using OCR engines (Microsoft OCR and Tesseract OCR).
  - Create the variables shown in the following table:
    Table 2. Variables to be created

    | Variable | Type | Default Value |
    | --- | --- | --- |
    | extractedTextTesseract | String | N/A |
    | extractedTextMicrosoft | String | N/A |

  Figure 1. Overview of the beginning of the workflow
- Read a PDF file using the Read PDF Text activity:
  - Open the Read PDF Text sequence container by double-selecting it.
  - Add a Read PDF Text activity inside the sequence.
    - In the Properties panel, add the expression "NPO Invoice.pdf" in the FileName field.
    - Add the value "All" in the Range field.
    - Add the variable extractedText in the Text field.
  - Add an Assign activity after the Read PDF Text activity.
    - Add the variable arrayText in the To field.
    - Add the expression extractedText.Split(Environment.NewLine.ToArray, StringSplitOptions.RemoveEmptyEntries) in the Value field.
  - Add an If activity below the Assign activity.
    - Add the expression arrayText(0).Equals("Tiefland Glass AG") in the Condition field.
  - Add an Assign activity inside the Sequence container.
    - Add the variable address in the To field.
    - Add the expression arrayText(2) in the Value field.
  - Add another Assign activity and place it after the previous one.
    - Add the variable city in the To field.
    - Add the expression arrayText(3).Split(","c)(0) in the Value field.
  - Add another Assign activity and place it after the previous one.
    - Add the variable phoneNumber in the To field.
    - Add the expression arrayText(4).Split(":"c)(1).Split({"INVOICE"},StringSplitOptions.None)(0) in the Value field.
  - Add another Assign activity and place it after the previous one.
    - Add the variable invoiceNumber in the To field.
    - Add the expression arrayText(4).Split(":"c)(1).Split({"INVOICE"},StringSplitOptions.None)(1).Split("#"c)(1) in the Value field.
  - Add another Assign activity and place it after the previous one.
    - Add the variable vendor in the To field.
    - Add the expression arrayText(arrayText.Count-5) in the Value field.
  - Add an Assign activity inside the Else field.
    - Add the variable address in the To field.
    - Add the expression arrayText(1) in the Value field.
  - Add another Assign activity and place it after the previous one.
    - Add the variable city in the To field.
    - Add the expression arrayText(2).Split(","c)(0) in the Value field.
  - Add another Assign activity and place it after the previous one.
    - Add the variable phoneNumber in the To field.
    - Add the expression arrayText(3).Split(":"c)(1).Split({"INVOICE"},StringSplitOptions.None)(0) in the Value field.
  - Add another Assign activity and place it after the previous one.
    - Add the variable invoiceNumber in the To field.
    - Add the expression arrayText(3).Split(":"c)(1).Split({"INVOICE"},StringSplitOptions.None)(1).Split("#"c)(1) in the Value field.
  - Add another Assign activity and place it after the previous one.
    - Add the variable vendor in the To field.
    - Add the expression arrayText(arrayText.Count-5) in the Value field.
  Figure 2. Overview of the sequence containing the Assign activities
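The line splitting and index-based Assign steps above can be sketched outside Studio to check the string logic. The following Python sketch is only an illustration under stated assumptions: the sample text is a hypothetical invoice (the real layout of "NPO Invoice.pdf" is not reproduced here), the function names are made up for this example, and the phone and invoice numbers are additionally trimmed of whitespace, which the workflow expressions do not do.

```python
import re

# Hypothetical invoice text; the real "NPO Invoice.pdf" layout may differ.
SAMPLE_TEXT = "\r\n".join([
    "Tiefland Glass AG",
    "Glass Division",
    "12 Example Street",
    "Springfield, EX 00000",
    "Phone: 555-0100 INVOICE #1234",
    "",
    "Vendor:",
    "Example Vendor Ltd",
    "Bank Name: Example Bank",
    "Bank Account: 000123",
    "IBAN Code: XX00 0000",
    "Thank you",
])


def split_lines(text):
    """Split on CR or LF and drop empty entries, similar to
    Split(Environment.NewLine.ToArray, StringSplitOptions.RemoveEmptyEntries)."""
    return [line for line in re.split(r"[\r\n]", text) if line]


def extract_header_fields(lines):
    """Pick fields by position, shifting one line down when the first
    line is the company name (the Then branch of the If activity)."""
    offset = 1 if lines[0] == "Tiefland Glass AG" else 0
    # Text after the first colon, split around the word INVOICE.
    contact = lines[3 + offset].split(":")[1].split("INVOICE")
    return {
        "address": lines[1 + offset],
        "city": lines[2 + offset].split(",")[0],
        "phoneNumber": contact[0].strip(),      # strip() is an addition
        "invoiceNumber": contact[1].split("#")[1].strip(),
        "vendor": lines[-5],                    # arrayText(arrayText.Count-5)
    }


if __name__ == "__main__":
    print(extract_header_fields(split_lines(SAMPLE_TEXT)))
```

Because the fields are read by position, any extra or missing line in the extracted text shifts every value, which is why the workflow checks the first line before choosing the indexes.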
  - Place a For Each activity after the If container.
    - Add the variable arrayText in the Value field.
  - Add an If activity inside the Body container of the For Each activity.
    - Add the expression item.Contains("Bank Name:") in the Condition field.
  - Add an Assign activity inside the Then field.
    - Add the variable bankName in the To field.
    - Add the expression item.Split(":"c)(1) in the Value field.
  - Add an If activity after the previous one.
    - Add the expression item.Contains("Bank Account:") in the Condition field.
  - Add an Assign activity inside the Then field.
    - Add the variable bankAccount in the To field.
    - Add the expression item.Split(":"c)(1) in the Value field.
  - Add an If activity after the previous one.
    - Add the expression item.Contains("IBAN Code:") in the Condition field.
  - Add an Assign activity inside the Then field.
    - Add the variable ibanCode in the To field.
    - Add the expression item.Split(":"c)(1) in the Value field.
  Figure 3. Overview of the For Each activity
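The For Each loop with its three If activities amounts to scanning every line for a known label and keeping the text after the colon. The Python sketch below illustrates that idea; the function name, the label-to-variable mapping, and the sample lines are assumptions made for this example, and the values are trimmed of whitespace, which the workflow expression item.Split(":"c)(1) does not do.

```python
# Labels searched for in each line, mapped to the workflow variable they fill.
LABELS = {
    "Bank Name:": "bankName",
    "Bank Account:": "bankAccount",
    "IBAN Code:": "ibanCode",
}


def extract_bank_details(lines):
    """Scan every line and, when it contains a known label, keep the text
    between the first and second colon (like item.Split(":"c)(1))."""
    details = {}
    for item in lines:
        for label, variable in LABELS.items():
            if label in item:
                details[variable] = item.split(":")[1].strip()
    return details


if __name__ == "__main__":
    # Hypothetical lines, not taken from the real invoice.
    sample = [
        "Bank Name: Example Bank",
        "Bank Account: 000123",
        "IBAN Code: XX00 0000",
        "Total: 10",
    ]
    print(extract_bank_details(sample))
```

Taking only the second element of the split means that a value which itself contains a colon is cut short, so this approach suits simple "Label: value" lines.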
  - Return to the Read PDF Text sequence and add a Write Text File activity below the For Each activity.
    - Add the value "InvoiceDetails.txt" in the FileName field.
    - Add the expression "Invoice details"+Environment.NewLine+Environment.NewLine+"Vendor: "+vendor+Environment.NewLine+"Vendor address: "+address+Environment.NewLine+"City: "+city+Environment.NewLine+"Phone number:"+phoneNumber+Environment.NewLine+"Invoice number:"+invoiceNumber+Environment.NewLine+"Bank name:"+bankName+Environment.NewLine+"Bank account:"+bankAccount+Environment.NewLine+"IBAN Code:"+ibanCode in the Text field.
  Figure 4. Overview of the For Each container
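The long concatenation passed to the Write Text File activity is easier to follow as a list of label and value pairs joined by newlines. The Python sketch below rebuilds the same text under that view; the function names and the sample values are hypothetical, and the labels are copied from the expression, including their inconsistent spacing after the colon.

```python
from pathlib import Path

# (label, variable name) pairs in the order used by the Text expression.
REPORT_ROWS = [
    ("Vendor: ", "vendor"),
    ("Vendor address: ", "address"),
    ("City: ", "city"),
    ("Phone number:", "phoneNumber"),
    ("Invoice number:", "invoiceNumber"),
    ("Bank name:", "bankName"),
    ("Bank account:", "bankAccount"),
    ("IBAN Code:", "ibanCode"),
]


def build_report(fields, newline="\n"):
    """Return the report text: a title, an empty line, then one row per field."""
    lines = ["Invoice details", ""]
    lines += [label + str(fields[name]) for label, name in REPORT_ROWS]
    return newline.join(lines)


def write_report(fields, path="InvoiceDetails.txt"):
    """Write the report to a text file, as the Write Text File activity does."""
    Path(path).write_text(build_report(fields), encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    sample = {name: "example" for _, name in REPORT_ROWS}
    print(build_report(sample))
```

Keeping the rows in one list makes it simple to fix the label spacing or add a field without editing a single long expression.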
- Return to the Main workflow working area.
- Read a PDF file using the Read PDF With OCR activity:
  - Open the Read PDF With OCR sequence container.
  - Drag a Read PDF With OCR activity inside the sequence.
    - Add the value "Invoice02.pdf" in the FileName field.
    - In the Properties panel, add the value 1 in the DegreeOfParallelism field.
  - Drag the Tesseract OCR engine inside the Read PDF With OCR activity.
    - In the Properties panel, add the variable extractedTextTesseract in the Text field.
  - Drag another Read PDF With OCR activity and place it after the previous one.
    - Add the value "Invoice02.pdf" in the FileName field.
    - In the Properties panel, add the value 1 in the DegreeOfParallelism field.
  - Drag the Microsoft OCR engine inside the Read PDF With OCR activity.
    - In the Properties panel, add the variable extractedTextMicrosoft in the Text field.
  - Drag a Write Text File activity below the Read PDF With OCR activity.
    - Add the value "OCRMicrosoft.txt" in the FileName field.
    - Add the variable extractedTextMicrosoft in the Text field.
  - Drag a Write Text File activity below the previous Write Text File activity.
    - Add the value "OCRTesseract.txt" in the FileName field.
    - Add the variable extractedTextTesseract in the Text field.
  Figure 5. Overview of the Read PDF with OCR activity
- Run the process. The robot extracts the data using the selected reading method and saves the output in a .txt file.
Download the example in ZIP format: Example.