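Document Understanding Activities
Digitize Document
UiPath.IntelligentOCR.Activities.Digitization.DigitizeDocument
Description
Digitizes a document, extracting its Document Object Model (DOM) and text, and stores the extracted content in variables of the corresponding types.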
You must assign an OCR engine to this activity by dragging it into the body of the activity. The chosen OCR engine is to be used only if the incoming documents require OCR processing. Visit OCR Engines to check the available OCR engines. The input and output parameters of the selected OCR engine are automatically set by the Digitize Document activity.
Project compatibility
Windows - Legacy | Windows
Configuration
Properties panel
Common
- DisplayName - The display name of the activity.
Input
- ApplyOcrOnPdf - Establishes whether OCR should be applied to PDF documents. If set to Yes, OCR is applied to all PDF pages of the document. If set to No, only digitally typed text is extracted. The default value is Auto, which decides whether OCR is needed based on the input document.
- DegreeOfParallelism - Specifies how many pages, if any, are analyzed in parallel. The value -1 uses the number of cores on the machine minus 1, meaning the activity tries to process that many pages in parallel. A positive value uses that specific number of logical processors. By default, this property is set to -1. This property accepts any value no greater than LogicalProcessorCount - 1.
- DetectCheckboxes - Detects the available checkboxes in the document while digitizing it. The default value is True.
- DocumentPath - The file path of the document you want to digitize. This field supports only strings and String variables.
Note:
- Set the ApplyOcrOnPdf property to Yes for native PDF documents that contain logos, hidden images, or other elements that corrupt the digitization output and might lead to suboptimal extractions or classifications.
- Text extraction from PDF files has been upgraded. This results in an optimized extraction process, where both native and scanned text are retrieved at the same time. The process applies OCR only to the images identified in the PDF file. This improvement is available only when the ApplyOcrOnPdf option is set to Auto.
Note: The supported file types for this property field are .png, .jpe, .jpg, .jpeg, .tiff, .tif, and .pdf.
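The parallelism and file-type rules above can be illustrated outside of UiPath Studio. The following is a minimal Python sketch, not part of the activity package. The function names effective_parallelism and is_supported_document are hypothetical. Rejecting values above the limit (rather than clamping them), rejecting 0 and other negative values, keeping at least one page in flight on a single-core machine, and ignoring the case of the file extension are all assumptions.

```python
import os
from pathlib import Path

# File types listed as supported by the DocumentPath property.
SUPPORTED_EXTENSIONS = {".png", ".jpe", ".jpg", ".jpeg", ".tiff", ".tif", ".pdf"}


def effective_parallelism(requested: int, logical_processors: int) -> int:
    """Resolve how many pages would be processed in parallel (illustrative only)."""
    # Assumption: at least one page is processed, even on a single core.
    upper_limit = max(1, logical_processors - 1)
    if requested == -1:
        # -1 means "number of cores on the machine - 1".
        return upper_limit
    if requested < 1:
        # Assumption: 0 and other negative values are not covered by the description.
        raise ValueError(f"unsupported value: {requested}")
    if requested > upper_limit:
        # Assumption: values above the limit are rejected rather than clamped.
        raise ValueError(f"{requested} exceeds the limit of {upper_limit}")
    return requested


def is_supported_document(path: str) -> bool:
    """Check the file extension against the supported list (case-insensitive by assumption)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    print("Default (-1) resolves to:", effective_parallelism(-1, cores))
    print("invoice.PDF supported:", is_supported_document("invoice.PDF"))
    print("invoice.docx supported:", is_supported_document("invoice.docx"))
```

A check like this can run before a workflow is queued, so that unsupported files are filtered out before they reach the digitization step.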
Misc
- Private - If selected, the values of variables and arguments are no longer logged at Verbose level.
Output
- DocumentObjectModel - The Document Object Model (DOM) of the file, stored in a Document variable. This field supports only Document variables.
- DocumentText - The text extracted from the specified document. This variable can be subsequently used in the Present Validation Station activity. This field supports only String variables.
Note: Starting with UiPath.IntelligentOCR.Activities package v6.3.0-preview, the Digitize Document activity comes with a default preselected OCR engine, the UiPath® Document OCR engine.
Both output variables (which are paired because they depend on each other) can be used further throughout the document processing framework (classification, data extraction, manual validation, and so on).
Important
If the UiPath.IntelligentOCR.Activities package has been updated to v5.1.0, the ForceApplyOCR parameter has been replaced with ApplyOcrOnPdf. The old values map to the new ones as follows:
- ForceApplyOCR = True is replaced by ApplyOcrOnPdf = Yes;
- ForceApplyOCR = False is replaced by ApplyOcrOnPdf = Auto;
- ForceApplyOCR = Empty is replaced by ApplyOcrOnPdf = Auto;
- ForceApplyOCR = Your defined variable is replaced by ApplyOcrOnPdf = Auto.
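The mapping above can be written down as a small lookup, which may help when auditing older workflows. This is an illustrative Python sketch only. The function name migrate_force_apply_ocr is hypothetical, and representing a user-defined variable as a string holding its name is an assumption made for the example.

```python
from typing import Union


def migrate_force_apply_ocr(force_apply_ocr: Union[bool, str, None]) -> str:
    """Return the replacement value for a legacy ForceApplyOCR setting."""
    # Only a literal True maps to Yes. False, an empty setting (None), and a
    # user-defined variable (modelled here as a string) all map to Auto.
    if force_apply_ocr is True:
        return "Yes"
    return "Auto"


if __name__ == "__main__":
    for legacy in (True, False, None, "myOcrFlag"):
        print(f"ForceApplyOCR = {legacy!r} -> {migrate_force_apply_ocr(legacy)}")
```

Note that a workflow that previously drove ForceApplyOCR from a variable ends up on Auto, so it may be worth reviewing whether Yes is needed for those documents.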
The Digitize Document activity extracts the text from a PDF file and, for complex documents, it applies pre-processing and post-processing algorithms. This activity can be used together with other Document Understanding activities.
Document Object Model
The Document Object Model is captured in a proprietary object. Visit Document Class for more information.
To successfully digitize and process your documents, consider the following advice:
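- To digitize and process an image successfully, its width and height should each be between 50 and 10,000 pixels. Any image outside this range is rejected with an exception message. If an image meets these dimensions but has a total size greater than 14 megapixels, it is scaled down to 14 megapixels while keeping its original aspect ratio (the ratio of width to height).
- For best results, keep the skew angle within +/- 20 degrees.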
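The image sizing rule above comes down to a short calculation. The Python sketch below is one way to estimate it ahead of time and is not the activity's actual implementation. The function names target_size and is_skew_optimal are hypothetical, 14 megapixels is taken to mean exactly 14,000,000 pixels, and rounding each side down is an assumption.

```python
import math
from typing import Tuple

MIN_SIDE = 50            # minimum width or height, in pixels
MAX_SIDE = 10_000        # maximum width or height, in pixels
MAX_PIXELS = 14_000_000  # assumption: 14 megapixels = 14,000,000 pixels


def target_size(width: int, height: int) -> Tuple[int, int]:
    """Return the size an image would have after the sizing rule is applied."""
    for side in (width, height):
        if not MIN_SIDE <= side <= MAX_SIDE:
            # Images outside the accepted range are rejected.
            raise ValueError(f"side of {side}px is outside {MIN_SIDE}-{MAX_SIDE}px")
    total = width * height
    if total <= MAX_PIXELS:
        return width, height
    # Scale both sides by the same factor so the aspect ratio is preserved.
    scale = math.sqrt(MAX_PIXELS / total)
    # Assumption: round down so the result never exceeds the pixel budget.
    return math.floor(width * scale), math.floor(height * scale)


def is_skew_optimal(angle_degrees: float) -> bool:
    """True when the skew angle is within the recommended +/- 20 degrees."""
    return abs(angle_degrees) <= 20


if __name__ == "__main__":
    print(target_size(4000, 3000))    # 12 MP, below the limit, so unchanged
    print(target_size(10000, 10000))  # 100 MP, scaled down to fit 14 MP
    print(is_skew_optimal(-12.5))
```

Because both sides are multiplied by the same factor, a square image stays square and a 4:3 scan stays 4:3 after downscaling.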
Example of using the Digitize Document activity
Visit Manual validation for digitize documents to check how the Digitize Document activity is used in an example that incorporates multiple activities.