Document Understanding User Guide
Training high performing models
The power of Machine Learning models is that they are defined by training data rather than by explicit logic expressed in computer code. This means that extraordinary care is needed when preparing datasets, because a model is only as good as the dataset used to train it. In that sense, what UiPath® Studio is to RPA workflows, a Document Type session (in Document Understanding™ Cloud) is to Machine Learning capabilities. Both require some experience to be used effectively.
What does a data extraction ML model do?
An ML model can extract data from a single type of document, even though it may cover several different languages. It is essential that each field (Total Amount, Date, etc.) has a single, consistent meaning. If a human might be confused about the correct value of a field, the ML model will be confused too.
Ambiguous situations can arise. For example, is a utility bill just another type of invoice, or are these two different document types that require two different ML models? If the fields you need to extract are the same (that is, they have the same meaning), you can treat them as a single document type. However, if you need to extract different fields for different reasons (different business processes), that is an indication you should treat them as two different document types and, therefore, train two different models.
When in doubt, start by training a single model, but keep the documents in different Document Manager batches (check the Filter drop-down at the top of the view) so you can easily separate them later if needed. In this way, the labeling work is not lost. When it comes to ML Models, the more data, the better. So, having a single model with ample data is a good place to start.
Training datasets and evaluation datasets
Document Manager can be used to build two types of datasets:
- Training datasets
- Evaluation datasets
Both types of datasets are essential for building a high performing ML model, and both require time and effort to create and maintain. Obtaining a high performing ML model requires an evaluation dataset that is representative of the production document traffic.
Each type of dataset is labeled in a different way:
- Training datasets rely on the bounding boxes of the words on the page representing the different pieces of information you need to extract.
- When labeling a Training set, focus on the page itself and on the word boxes.
- Evaluation datasets rely on the values of the fields, which appear in the sidebar (for Regular fields) or in the top bar (for Column fields).
- When labeling an Evaluation set, focus on the values under the field names in the sidebar or the top bar. This does not mean you need to type the values in manually; we recommend labeling by selecting the boxes on the page and checking that the values are correct.
Data extraction components
Data extraction relies on the following components:
- Optical Character Recognition
- Word and line building
- Grouping characters into words and words into left-to-right lines of text
- Machine Learning model predictions for every word/box on the page
- Cleaning, parsing, and formatting of the text spans
- For example, grouping words on multiple lines into a single address, or formatting a date into the standard yyyy-mm-dd format
- Applying an algorithm for selecting which value to return
- For cases where the document has two or more pages and some fields appear on more than one page
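To make the "cleaning, parsing, and formatting" step more concrete, here is a minimal, hypothetical Python sketch of the kind of post-processing described above: it normalizes a date string into the yyyy-mm-dd format and a number string into a float. It is an illustration only, not UiPath's actual implementation, and the handled formats are assumptions.

```python
from datetime import datetime

def normalize_date(raw: str) -> str:
    """Try a few common date formats and return yyyy-mm-dd, or raise ValueError."""
    for fmt in ("%d/%m/%Y", "%m/%d/%Y", "%d.%m.%Y", "%Y-%m-%d", "%d %B %Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

def normalize_amount(raw: str) -> float:
    """Strip currency symbols/spaces and guess the decimal separator (comma or period)."""
    cleaned = raw.strip().replace("$", "").replace("€", "").replace(" ", "")
    if "," in cleaned and "." in cleaned:
        # If both separators appear, assume the last one is the decimal separator.
        if cleaned.rfind(",") > cleaned.rfind("."):
            cleaned = cleaned.replace(".", "").replace(",", ".")
        else:
            cleaned = cleaned.replace(",", "")
    elif "," in cleaned:
        # Assumption for this sketch: a lone comma is a decimal separator.
        cleaned = cleaned.replace(",", ".")
    return float(cleaned)

print(normalize_date("31/12/2023"))   # 2023-12-31
print(normalize_amount("1.234,56"))   # 1234.56
```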
Confidence levels
What are confidence levels?
When ML models make predictions, they are essentially making statistical guesses. The model is saying "this is probably the Total amount of this invoice". This raises the question: how probable? Confidence levels are an attempt to answer that question on a scale from 0 to 1. However, they are NOT true probability estimates. They only express how confident the model is about its guesses, so they depend on what the model has been trained on. A better way to think of them is as a measure of familiarity: how familiar is the model with this input? If it resembles something the model has seen in training, the confidence tends to be higher; otherwise it tends to be lower.
What are confidence levels good for?
Business automation needs ways to detect and handle exceptions, that is, instances where the automation makes a mistake. In traditional automation, exceptions are very visible: when an RPA workflow breaks, it simply stops, hangs, or throws an error, which the system can catch and handle accordingly. Machine Learning models, however, do not throw errors when they make incorrect predictions. So how do we determine when an ML model is wrong and the exception handling flow needs to be triggered? This usually involves a human in the loop, possibly using Action Center.
The best way to catch bad predictions, by far, is by enforcing business rules. For example, we know that on an invoice the net amount plus the tax amount must equal the total amount, or that the part numbers of the components ordered must have 9 digits. When these conditions do not hold, we know something has gone wrong and we can trigger the exception handling flow. This is the preferred and strongly recommended approach. It is worth investing significant effort in building these kinds of rules, even using complex regular expressions, or lookups in databases to validate vendor names, part numbers, and so on. In some cases you may even want to extract another document that is not of interest in itself, purely to cross-reference and validate some values on the original document of interest.
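As a minimal, hypothetical sketch of the regular-expression and lookup checks mentioned above (the field values, the vendor set, and the function names are illustrative, not product APIs):

```python
import re

KNOWN_VENDORS = {"ACME Corp", "Globex LLC"}  # stand-in for a lookup in a real vendor database

def part_number_valid(part_number: str) -> bool:
    """Example rule from above: part numbers must have exactly 9 digits."""
    return re.fullmatch(r"\d{9}", part_number) is not None

def vendor_known(vendor_name: str) -> bool:
    """Example lookup: the extracted vendor must exist in master data."""
    return vendor_name in KNOWN_VENDORS

# Any failed check can trigger the exception handling flow (e.g., human review).
print(part_number_valid("123456789"))  # True
print(vendor_known("Acme Inc."))       # False
```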
However, in some cases none of these options exist and you still want to detect predictions that might be wrong. In those cases you can fall back on confidence levels. When a prediction has a low confidence (for example, below 0.6), the risk that it is incorrect is higher than when the confidence is 0.95. However, this correlation is fairly weak. In many cases a value is extracted with low confidence and is nevertheless correct, and occasionally an incorrect value is extracted with high confidence (above 0.9). For these reasons, we strongly recommend relying on business rules wherever possible and using confidence levels only as a last resort.
What types of confidence levels are there?
Most components in the Document Understanding™ product return a confidence level. The main components of a Document Understanding™ workflow are Digitization, Classification, and Extraction, and each of them has a certain confidence for every prediction. The Digitization and Extraction confidences are both exposed visually in the Validation Station, so you can filter predictions and focus only on the low-confidence ones to save time.
Confidence score tuning (or calibration)
Confidence levels differ from model to model, depending on the model design. For example, some models return confidences that are almost always between 0.9 and 1, dropping below 0.8 only in very rare cases. Other models have confidence levels spread more evenly between 0 and 1, even if they usually cluster toward the higher end of the scale. Consequently, confidence thresholds differ from model to model: for example, the threshold you use on the OCR will differ from the one you use on an ML extractor or an ML classifier. Also, whenever a major architectural update of a model is released (for example, a model architecture based on Helix generative AI), the distribution of confidence levels changes, and the confidence thresholds need to be re-evaluated.
Build a high performing ML model
To obtain the best results in terms of automation rate (the percentage reduction of the manual work, measured in person-months per year, required to process your document flow), you need to carefully go through the following steps:
1. Choose an OCR engine
To choose an OCR engine, create separate Document Manager sessions, configure a different OCR engine in each, and import the same files into every one of them to compare the results. Focus on the areas you want to extract. For example, if you need to extract a company name that appears as part of a logo on an invoice, you may want to check which OCR engine performs better on the text inside the logo. Keep in mind that this choice affects the word and line building (which partly depends on the OCR) and everything downstream.
Your default option should be UiPath Document OCR, since it is included with Document Understanding licenses at no extra charge. However, when unsupported languages are required, or when some very hard-to-read documents are involved, you may want to try Google Cloud (Cloud only) or Microsoft Read (Cloud or On Premises), which have better language coverage. These engines come at a cost, albeit a small one; if their accuracy is higher on data fields that are critical for your business process, it is strongly recommended to use the best OCR available, since everything downstream depends on it and this saves you time later on.
Please be aware that the Digitize Document activity has its ApplyOcrOnPDF setting set to Auto by default, which determines whether the OCR algorithm needs to be applied based on the input document. To avoid missing important information (from logos, headers, footers, etc.), set ApplyOcrOnPDF to Yes so that all text is detected, even though this might slow down your process.
2. Define fields
Defining the fields is a conversation that needs to happen with the subject matter expert or domain expert who owns the business process itself. For invoices, this would be the owner of the Accounts Payable process. This conversation is critical; it needs to happen before any documents are labeled, in order to avoid wasted time, and it requires reviewing together at least 20 randomly chosen document samples. An hour-long slot needs to be set aside for it, and it often has to be repeated a few days later, as the person preparing the data runs into ambiguous situations or edge cases.
It is not unusual for the conversation to start with the assumption that you need to extract 10 fields and to end up with 15.
Some key configurations you need to be aware of:
- Content type: This is the most important setting, as it determines the postprocessing of the values, especially for dates (it detects whether the format is US-style or non-US-style and then formats them as yyyy-mm-dd) and for numbers (it detects the decimal separator, comma or period). The ID number type cleans up anything coming before a colon or hash symbol. The String content type performs no cleanup and can be used when you want to do your own parsing in the RPA workflow.
- Multi-line checkbox: This is for parsing strings, like addresses, that may appear on more than one line of text.
- Multi-valued checkbox: This is for handling multiple-choice fields or other fields that may have more than one value but are NOT represented as a table column. For example, an ethnic group question on a government form may contain multiple checkboxes where you can select all that apply.
- Hidden fields: Fields marked as Hidden can be labelled, but they are held out when data is exported, so the model cannot be trained on them. This is handy when labeling a field is a work in progress, when it is too rare, or when it is low priority.
- Scoring: This is relevant only for Evaluation pipelines and affects how the accuracy score is calculated. A field that uses Levenshtein scoring is more permissive: if a single character out of 10 is wrong, the score is 0.9. With Exact Match scoring it is stricter: a single wrong character leads to a score of zero. Only String type fields have the option to select Levenshtein scoring by default. A minimal sketch of the difference between the two scoring modes follows this list.
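Here is a small Python sketch illustrating the difference between Exact Match and Levenshtein scoring described above (the function names are illustrative, not part of the product):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def field_score(predicted: str, expected: str, scoring: str = "exact") -> float:
    if scoring == "exact":
        return 1.0 if predicted == expected else 0.0
    # Levenshtein scoring: 1 minus the normalized edit distance.
    dist = levenshtein(predicted, expected)
    return max(0.0, 1.0 - dist / max(len(expected), 1))

print(field_score("INV-001234", "INV-001239", "exact"))        # 0.0
print(field_score("INV-001234", "INV-001239", "levenshtein"))  # 0.9
```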
Amounts on utility bills
A total amount might seem simple enough, but utility bills contain many amounts. Sometimes you need the total amount due. Other times you need only the current bill amount, without the amounts carried over from previous billing periods. In the latter case you need to label accordingly, even when the current bill amount happens to equal the total amount. The concepts are different, and the amounts are often different as well.
Each field represents a different concept, and they need to be defined as cleanly as possible, so there is no confusion. If a human might be confused, the ML model will also be confused.
Moreover, the current bill amount can sometimes be composed of a few different amounts, fees, and taxes, and may not appear individualized anywhere on the bill. A possible solution is to create two fields: a previous-charges field and a total field. These two always appear as distinct, explicit values on the utility bill, and the current bill amount can then be obtained as their difference. You may even want to include all three fields (previous-charges, total, and current-charges) so you can perform consistency checks in cases where the current bill amount does appear explicitly on the document. So in some cases you might go from one field to three.
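As a minimal illustration of the three-field approach (the field and function names are hypothetical), the RPA workflow could derive and cross-check the current charges like this:

```python
def current_charges(total: float, previous_charges: float) -> float:
    """Derive the current bill amount from the two explicit values."""
    return round(total - previous_charges, 2)

def is_consistent(total: float, previous_charges: float, current: float, tol: float = 0.01) -> bool:
    """If the document also shows current charges explicitly, use them as a consistency check."""
    return abs(current_charges(total, previous_charges) - current) <= tol

print(current_charges(250.00, 80.00))        # 170.0
print(is_consistent(250.00, 80.00, 170.00))  # True
```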
Purchase Order numbers on invoices
The PO number may appear as a single value on an invoice, or it may appear as part of the table of line items, where each line item has a different PO number. In that case it might make sense to have two different fields: po-no and item-po-no. By keeping each field visually and conceptually consistent, the model is likely to perform better. However, you need to make sure both are well represented in your Training and Evaluation datasets.
Vendor name and payment address name on invoices
The company name usually appears at the top of an invoice or a utility bill, but sometimes it is not readable because there is just a logo and the company name is not explicitly written out. There could also be a stamp, handwriting, or a wrinkle over the text. In these cases, people might label the name that appears at the bottom right, in the Remit payment to section of the payment slip on utility bills. That name is often the same, but not always, since it is a different concept: payments can be made to some other parent or holding company, or another affiliated entity, and it is visually different on the document. This can lead to poor model performance. In such a case, you should create two fields, vendor-name and payment-addr-name. Then you can look both up in a vendor database and use the one that matches, or use payment-addr-name when the vendor-name is missing.
Rows in tables
There are two distinct concepts to keep in mind: table rows and lines of text. A table row includes all the values of all the column fields that belong together on that row. Sometimes they are all part of the same line of text on the page; other times they are spread over multiple lines.
If a table row consists of more than one line of text, you need to group all the values in that table row together using the "/" hotkey. When you do this, a green box appears covering the entire table row. Here is an example of a table in which the top two rows consist of multiple lines of text and need to be grouped using the "/" hotkey, while the third row is a single line of text and does not need to be grouped.

Here is an example of a table in which each table row consists of a single line of text. You do not need to group these rows using the "/" hotkey, because this is done implicitly by Document Manager.

Identifying where one row ends and the next begins while reading from top to bottom is often the main challenge for ML extraction models, especially on documents such as forms, where there are no visible horizontal lines separating the rows. Our ML packages include a special model that is trained to split tables into rows correctly. This model is trained using the groups you label with the "/" or Enter keys, which are indicated by the green transparent boxes.
3. Choose a balanced and representative training dataset
Machine Learning technology has the main benefit of being able to handle complex problems with high diversity. When estimating the size of a training dataset, one looks first at the number of fields and their types, and the number of languages. A single model can handle multiple languages as long as they are not Chinese/Japanese/Korean. Chinese/Japanese/Korean scenarios generally require separate Training datasets and separate models.
Fields come in three types:
- Regular fields (date, total amount)
- For regular fields, you need at least 20-50 document samples per field. So if you need to extract 10 regular fields, you need at least 200-500 document samples, and if you need to extract 20 regular fields, you need at least 400-1000. The number of samples you need grows with the number of fields: more fields means more document samples, roughly 20 to 50 more per field.
- Column fields (item unit price, item quantity)
- For column fields, you need at least 50-200 document samples per column field. So for 5 column fields with clean, simple layouts you might obtain good results with 300 document samples, but for highly complex and diverse layouts it may take over 1000. To cover multiple languages, you need at least 200-300 document samples per language, assuming they cover all the different fields. So for 10 header fields and 4 column fields across 2 languages, 600 document samples might be enough (400 for the columns and headers, plus 200 more for the additional language), but in some cases 1200 or more may be needed.
- Classification fields (currency)
- Classification fields generally require at least 10-20 document samples per class.
These guidelines assume you are tackling a high-diversity scenario, such as invoices or purchase orders with dozens to hundreds or thousands of layouts. If, however, you are tackling a low-diversity scenario, such as a tax form or invoices with very few layouts (fewer than 5-10), the dataset size is determined more by the number of layouts. In that case, start with 20-30 pages per layout and add more as needed, especially if the pages are very dense and there are many fields to extract. For example, creating a model to extract 10 fields from 2 layouts might require 60 pages, but if you need to extract 50 or 100 fields from 2 layouts, you might start with 100 or 200 pages and add more as needed to reach the desired accuracy. In this case the regular/column field distinction matters less.
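As a rough planning aid for the high-diversity case, the per-field guidelines above can be turned into a back-of-the-envelope estimate. This is a hypothetical sketch, not an official sizing formula; the constants simply restate the ranges quoted above.

```python
def estimate_dataset_size(regular_fields: int, column_fields: int,
                          classification_classes: int = 0, languages: int = 1) -> tuple[int, int]:
    """Return a (low, high) range of document samples based on the guidelines above."""
    low = regular_fields * 20 + column_fields * 50 + classification_classes * 10
    high = regular_fields * 50 + column_fields * 200 + classification_classes * 20
    # Each language beyond the first adds roughly 200-300 samples.
    low += (languages - 1) * 200
    high += (languages - 1) * 300
    return low, high

# 10 header fields, 4 column fields, 2 languages -> roughly 600 to 1600 samples
print(estimate_dataset_size(regular_fields=10, column_fields=4, languages=2))
```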
ML technology is designed to handle high diversity scenarios. Using it to train models on low diversity scenarios (1-10 layouts) requires special care to avoid brittle models that are sensitive to slight changes in the OCR text. Avoid this by having some deliberate variability in the training documents, by printing and then scanning or photographing them using mobile phone scanner apps. The slight distortions or changing resolutions make the model more robust.
These estimates assume that most pages contain all or most of the fields. For multi-page documents where most fields are on a single page, the relevant number of pages is the number of examples of that page, the one on which most of the fields appear.
The numbers described are general guidelines, not strict requirements. In general, you can start with a smaller dataset, and then keep adding data until you get good accuracy. This is especially useful to parallelize the RPA work with the model building. Also, a first version of the model can be used to prelabel additional data (check Settings view and Predict button in Document Manager) which can accelerate labeling additional Training data.
Deep Learning models can generalize
You do not need to have every layout represented in your training set. In fact, most layouts in your production document flow may have zero samples in the training set, or only one or two. This is desirable: you want to leverage the power of AI to understand documents and make correct predictions on documents it has not seen during training. A large number of samples per layout is not mandatory, because the model can still predict correctly based on what it has learned from other layouts.
Training out-of-the-box models
There are three main types of scenarios when training an ML model for Document Understanding:
- training a new document type from scratch, using the DocumentUnderstanding ML Package in AI Center
- retraining a pretrained out-of-the-box model in order to optimize accuracy
- retraining a pretrained out-of-the-box model in order to optimize accuracy and also add some new fields
Dataset size estimates for the first scenario are covered earlier in this section, in step 3, Choose a balanced and representative training dataset.
For the second scenario, the dataset size depends on how well the pretrained models already work on your documents. If they already work well, you may need very little data, perhaps 50-100 pages. If they fail on many important fields, you need more, but a good starting point is still four times smaller than for training the fields from scratch.
Finally, for the third scenario, start from the dataset size of the second scenario and then grow the dataset according to the number of new fields, using the same guidelines as for training from scratch: at least 20-50 pages per new regular field, or at least 50-200 pages per new column field.
In all of these cases, all documents need to be fully labeled, including the new fields that the out-of-the-box model does not recognize as well as the original fields that it does.
Unevenly distributed field occurrences
Some fields may occur on every document (for example, date, invoice number), while others may appear on only 10% of the pages (for example, handling charges, discount). In these cases you need to make a business decision. If those rare fields are not critical to the automation, you can get by with a small number of document samples (10-15) of that particular field, meaning pages that contain a value for it. However, if those fields are critical, you need to make sure the training set includes at least 30-50 samples of that field to be sure the full diversity is covered.
Balanced datasets
In the case of invoices, if a dataset includes invoices from 100 vendors, but half of the dataset consists of invoices from a single vendor, that is a very unbalanced dataset. A perfectly balanced dataset is one in which each vendor appears an equal number of times. A dataset does not need to be perfectly balanced, but you should avoid having more than 20% of your entire dataset come from any single vendor. At some point, more data no longer helps, and it may even hurt accuracy on other vendors, because the model over-optimizes (overfits) for that one vendor.
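A minimal, hypothetical sketch of the 20% guideline above, assuming each labeled document can be mapped to a vendor name (the function and names are illustrative, not a product feature):

```python
from collections import Counter

def vendors_over_threshold(vendor_per_doc: list[str], max_share: float = 0.20) -> dict[str, float]:
    """Return vendors whose share of the dataset exceeds the given threshold."""
    counts = Counter(vendor_per_doc)
    total = len(vendor_per_doc)
    return {v: n / total for v, n in counts.items() if n / total > max_share}

dataset = ["ACME"] * 50 + ["Globex"] * 15 + ["Initech"] * 15 + ["Umbrella"] * 20
print(vendors_over_threshold(dataset))  # {'ACME': 0.5}
```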
Representative datasets
Data should be chosen to cover the diversity of the documents likely to be seen in the production workflow. For example, if you get invoices in English but some of them come from the US, India and Australia, they probably look different, so you need to make sure you have document samples from all three. This is relevant not only for the model training itself, but also for labeling purposes. When you label the documents you might discover that you need to extract new, different fields from some of these regions, like GSTIN code from India, or ABN code from Australia. Check the Define fields section for more information.
4. Label the training dataset
When labeling Training data, you need to focus on the bounding boxes of the words in the document pane of Document Manager. The parsed values in the right or top sidebars are not important as they are not used for training.
Whenever a field occurs multiple times on a page, as long as the occurrences represent the same concept, all of them should be labeled.
When the OCR misses a word or gets a few characters wrong, just label the bounding box if there is one; if there is not, skip it and move on. There is no way to add a word in Document Manager, and even if there were, the word would still be missing at runtime, so adding it would not help the model at all.
As you label, remain vigilant about fields that may have multiple or overlapping meanings/concepts, in case you need to split a field into two separate fields. Also watch for fields that you do not strictly need but which, if labelled, might help you implement validation or self-consistency checks in the RPA workflow. Typical examples are quantity, unit-price, and line-amount on invoice line items: line-amount is the product of quantity and unit-price, which makes it very useful for consistency checks without the need for confidence levels.
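A small, hypothetical sketch of that self-consistency check for invoice line items (the field names are illustrative):

```python
def line_items_consistent(items: list[dict], tol: float = 0.01) -> bool:
    """Check that quantity * unit-price matches line-amount for every line item."""
    return all(abs(it["quantity"] * it["unit-price"] - it["line-amount"]) <= tol for it in items)

items = [
    {"quantity": 3, "unit-price": 12.50, "line-amount": 37.50},
    {"quantity": 2, "unit-price": 4.99,  "line-amount": 9.98},
]
print(line_items_consistent(items))  # True
```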
5. Train the extractor
To create an extractor, go to the Extractors view in Document Understanding and select the Create Extractor button at the top right. You can then select the Document Type, the ML Model and Version you would like to use. You can monitor the progress on the Extractors tab, or in the Details view of the Extractor, which contains a link to the AI Center pipeline, where you can check the detailed logs in real time.
When evaluating an ML model, the most powerful tool is the evaluation_<package_name>.xlsx file generated in the Outputs > artifacts > eval_metrics folder of the pipeline run.
In this Excel file you can see which predictions are failing and on which files, and you can tell immediately whether it is an OCR error, an ML extraction error, or a parsing error, and whether it can be fixed by simple logic in the RPA workflow, or whether it requires a different OCR engine, more training data, or better labeling or field configuration in Document Manager.
This Excel file is also very useful for identifying the most relevant business rules to apply in the RPA workflow in order to catch common mistakes before routing documents to Validation Station in Action Center for manual review. Business rules are by far the most reliable way to detect errors.
For errors that cannot be caught with business rules, you can also use confidence levels. The Excel file also contains the confidence level of each prediction, so you can use Excel features such as sorting and filtering to determine a confidence threshold appropriate for your business scenario.
Overall, the evaluation_<package_name>.xlsx Excel file is a key resource you need to focus on to get the best results from your AI automation.
GPU training is highly recommended for large and production datasets. CPU training is much slower and should be used sparingly, for small datasets for demo or testing purposes. For more information, check the Training pipelines page.
6. Define and implement business rules
In this step you should be concerned with model errors and with how to detect them. There are two main ways to detect errors:
- by enforcing business rules and lookups in the systems of record of the customer organization
- by enforcing minimum confidence level thresholds
The most effective and reliable way to detect errors is to define business rules and lookups. Confidence levels can never be 100% reliable; there will always be a small but non-zero percentage of correct predictions with low confidence, or of wrong predictions with high confidence. In addition, and perhaps most importantly, a missing field has no confidence at all, so a confidence threshold can never catch errors where a field is not extracted in the first place. Consequently, confidence level thresholds should only be used as a fallback or safety net, never as the main way to detect business-critical errors.
Examples of business rules:
- The net amount plus the tax amount must equal the total amount
- The total amount must be greater than or equal to the net amount
- Invoice number, Date, Total amount (and other fields) must be present
- The PO number, if present, must exist in the PO database
- The invoice date must be in the past and must not be older than X months
- The due date must be in the future and no more than Y days/months away
- For each line item, the quantity multiplied by the unit price must equal the line amount
- The sum of the line amounts must equal the net amount and/or the total amount
- Etc.
In the case of numbers, rounding to eight decimal places is applied.
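As a minimal sketch of how a few of the rules above could be enforced in a workflow (the field names, the eight-decimal rounding, and the X/Y thresholds are placeholders taken from the list above, not product APIs):

```python
from datetime import date

def validate_invoice(doc: dict, max_invoice_age_months: int, max_due_days: int, today: date) -> list[str]:
    """Return the business rules that fail; route to human review if the list is non-empty."""
    errors = []
    r = lambda x: round(x, 8)  # numeric comparisons use rounding, as noted above

    if r(doc["net-amount"] + doc["tax-amount"]) != r(doc["total-amount"]):
        errors.append("net + tax != total")
    if doc["total-amount"] < doc["net-amount"]:
        errors.append("total < net")
    if r(sum(it["line-amount"] for it in doc["items"])) != r(doc["net-amount"]):
        errors.append("sum of line amounts != net amount")
    if doc["invoice-date"] > today:
        errors.append("invoice date is in the future")
    if (today - doc["invoice-date"]).days > max_invoice_age_months * 30:
        errors.append("invoice older than allowed")
    if not (0 <= (doc["due-date"] - today).days <= max_due_days):
        errors.append("due date outside the allowed window")
    return errors

doc = {
    "net-amount": 100.0, "tax-amount": 19.0, "total-amount": 119.0,
    "items": [{"line-amount": 60.0}, {"line-amount": 40.0}],
    "invoice-date": date(2024, 1, 15), "due-date": date(2024, 2, 14),
}
print(validate_invoice(doc, max_invoice_age_months=12, max_due_days=60, today=date(2024, 1, 20)))  # []
```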
In particular, the confidence levels of column fields should never be used as an error-detection mechanism. Column fields (for example, line items on invoices or POs) may contain dozens of values, so setting a minimum threshold over that many values is especially unreliable: the chance that at least one value has a low confidence is high, which would cause most or all documents to be routed for human validation unnecessarily.
Business rules must be enforced as part of the RPA workflow, and business rule failures should be passed on to the human validator so that they are brought to their attention and speed up the review.
When defining Business Rules, please keep in mind that the Starts with, Ends with, and Contains values are case sensitive.
7. (Optional) Choose confidence thresholds
After the business rules have been defined, there may sometimes remain a small number of fields for which there are no business rules, or for which the business rules are unlikely to catch all errors. For these you may need to use a confidence threshold as a last resort.
The main tool to set this threshold is the Excel spreadsheet which is output by the Training pipeline in the Outputs > artifacts > eval_metrics folder.
This evaluation_<package_name>.xlsx file contains the confidence level of each prediction along with whether it was correct, so you can sort and filter by confidence and see which threshold gives you the trade-off, between errors caught and correct values needlessly sent to review, that fits your business scenario.
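Assuming you export the relevant columns of that spreadsheet as (confidence, is_correct) pairs, a simple sketch like the following can show the trade-off at each candidate threshold (this is an illustration of the analysis, not a product feature):

```python
def threshold_tradeoff(predictions: list[tuple[float, bool]], thresholds: list[float]) -> None:
    """For each threshold, report how many errors it catches and how many correct
    predictions it would needlessly send to human review."""
    total_errors = sum(1 for _, ok in predictions if not ok)
    for t in thresholds:
        flagged = [(c, ok) for c, ok in predictions if c < t]
        caught_errors = sum(1 for _, ok in flagged if not ok)
        false_alarms = sum(1 for _, ok in flagged if ok)
        print(f"threshold={t:.2f}: catches {caught_errors}/{total_errors} errors, "
              f"{false_alarms} correct values routed to review")

preds = [(0.97, True), (0.92, True), (0.85, False), (0.64, True), (0.55, False), (0.40, False)]
threshold_tradeoff(preds, [0.6, 0.7, 0.9])
```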
8. Fine-tune using data from Validation Station
Validation Station data can help improve the model predictions, yet, in many cases, it turns out that most errors are NOT due to the model itself but to the OCR, labelling errors or inconsistencies, or to postprocessing issues (e.g., date or number formatting). So, the first key aspect is that Validation Station data should be used only after the other Data extraction components have been verified and optimized to ensure good accuracy, and the only remaining area of improvement is the model prediction itself.
The second key aspect is that Validation Station data has a lower information density than data labeled in Document Manager. Fundamentally, the Validation Station user only cares about getting the correct value once. If an invoice has 5 pages and the invoice number appears on every page, the Validation Station user only validates it on the first page, so 80% of the values remain unlabeled. In Document Manager, all the values are labeled.
Finally, keep in mind that Validation Station data needs to be added to the original, manually labeled dataset, so that you always maintain a single training dataset that grows over time. You always need to train on the ML Package with minor version 0 (zero), which is the out-of-the-box version released by UiPath.
It is often wrongly assumed that the way to use Validation Station data is to iteratively retrain the previous model version, so that the current batch is used to train package X.1 and obtain X.2, the next batch trains on X.2 to obtain X.3, and so on. This is the wrong way to use the product. Each Validation Station batch needs to be imported into the same Document Manager session as the original manually labeled data, forming a larger dataset, which must always be used to train on the X.0 ML Package version.
Caveats on using Validation Station data
Since Validation Station data is generated by the production workflow, its volume can grow significantly, and you do not want the dataset to become dominated by Validation Station data, because the information density issue described above can degrade the quality of the model.
The recommendation is to limit the amount of added data to 2-3 times the number of pages of Document Manager data and, beyond that, to pick only those vendors or samples where you have identified major failures. If there are known major changes to the production data, such as a new language or a new geographic region being onboarded to the business process (say, expanding from the US to Europe or South Asia), representative data for those languages and regions should be added to Document Manager for manual labeling. Validation Station data is not appropriate for that kind of major scope expansion.
Another potential issue with Validation Station data is balance. In production it is common for most of the traffic to come from a small subset of vendors, customers, or world regions. If allowed into the training set as-is, this can lead to a heavily biased model that performs well on a small subset of the data but poorly on most of the rest. Therefore it is important to take special care when adding Validation Station data to the training set.
Here is a sample scenario. You have chosen a good OCR engine and labeled 500 pages in Document Manager, resulting in good performance, and you have deployed the model in a production RPA workflow. Validation Station is starting to generate data. You should randomly select up to a maximum of 1000-1500 pages from Validation Station, import them into Document Manager together with the first 500 pages, and train your ML model again. After that, you should look very carefully at the evaluation_<package_name>.xlsx file to make sure the model has actually improved.
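A hypothetical sketch of the sampling step in that scenario: cap the Validation Station pages added at a small multiple of the manually labeled pages and draw them at random, so no single vendor dominates (the cap and names come from the guidance above, not from the product):

```python
import random

def sample_validation_station(pages: list[dict], manually_labeled_pages: int,
                              multiplier: int = 2, seed: int = 0) -> list[dict]:
    """Randomly sample at most `multiplier` times the manually labeled page count."""
    cap = multiplier * manually_labeled_pages
    rng = random.Random(seed)
    return pages if len(pages) <= cap else rng.sample(pages, cap)

# e.g., 500 pages labeled in Document Manager -> add at most 1000 randomly chosen pages
subset = sample_validation_station(pages=[{"id": i} for i in range(5000)], manually_labeled_pages=500)
print(len(subset))  # 1000
```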
9. Deploy your automation
Make sure to use the Document Understanding™ Process Studio template from the Templates section of the Studio start screen, so that you apply best practices in enterprise RPA architecture.