
Unstructured and Complex Documents User Guide

Last updated March 9, 2026

Evaluating model performance

You can evaluate the model's performance in the following places:

  • The Build tab, which displays the overall project score and the error rate for each document.
  • The Measure tab, which displays field-group-level and field-level performance.

Evaluating model performance in Build

You can view the overall score under Project score in the Build tab.

Note:
  • A healthy model has a project score of Good or Excellent, with no field performance warnings.
  • The project score is calculated from the F1 scores of all fields.

This image shows an example of the Project score and describes the performance levels and their score ranges.

Additionally, you can view the error rate of each document in the Error rate column of the Documents section in Build.

Note:

The error rate applies only to annotated documents and represents the number of errors the model makes on each document, that is, the differences between the model's predictions and the user's annotations.

This image shows the Build tab with the Error rate column for documents highlighted.

Evaluating model performance in Measure

The Measure page helps you evaluate how well a model performs on annotated documents before you publish it. The page includes:

  • A field performance table that surfaces key performance metrics per field and field group.
  • Support for comparing performance differences between model versions, highlighting improvements or regressions.
  • Visibility into the distribution of error types for each taxonomy field.
  • Data export capabilities for custom offline analysis.

The following sections describe the main components in Measure and explain how to use them effectively when you analyze model performance.

Project summary

The summary section provides a quick, high-level view of how your current model version performs across the project. You can use it to:

  • Select the model version you want to evaluate.
  • Get an at-a-glance read on overall performance using Project score and Avg. doc error rate.
  • Quickly spot whether overall project performance is trending up or down when comparing against a previous version.
Project score

The Project score summarizes overall model performance.

Why it is useful

  • Provides a single, consistent way to track overall progress as you iterate on the taxonomy, instructions, and annotations.
  • Helps you quickly determine whether a model version is generally improving or regressing before drilling into specific fields.

How it is calculated

  • Project score is computed as the simple average of F1 scores across all fields in the taxonomy.
  • F1 score is a standard model performance metric that balances precision and recall, that is, the harmonic mean of the two.
  • At a high level:
    • Precision answers: How often were the model's predicted values correct?
    • Recall answers: How much of the annotated data did the model successfully find?
Note:

The Project score is an average. Specific field-level regressions or limitations can be reviewed with the Field performance table.
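As a rough illustration of the calculation described above, the project score can be sketched as the simple average of per-field F1 scores, where each F1 score is the harmonic mean of precision and recall. The field names and numbers below are invented for demonstration and are not real IXP data.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical per-field precision/recall values, for illustration only.
field_scores = {
    "invoice-number": f1(precision=0.95, recall=0.90),
    "total-amount": f1(precision=0.80, recall=0.70),
    "due-date": f1(precision=0.60, recall=0.50),
}

# Project score: simple average of F1 across all fields in the taxonomy.
project_score = sum(field_scores.values()) / len(field_scores)
```

Because the average weights all fields equally, a single weak field can pull the project score down even when most fields perform well, which is why the field-level table is worth reviewing alongside it.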

Avg. doc error rate

The Avg. doc error rate is the average of the error rates for each annotated document in the project.

Why it is useful

The Avg. doc error rate provides a quick indicator of how error-prone documents are when the selected model version processes them, which helps you evaluate readiness to publish.

How it is calculated

The value is computed as the simple average of the error rates of each fully annotated document in the project.
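A minimal sketch of this calculation, assuming each fully annotated document's error rate is its error count divided by its annotation count. The counts below are made up for demonstration.

```python
# Hypothetical per-document counts, for illustration only.
docs = [
    {"errors": 2, "annotations": 10},   # error rate 0.20
    {"errors": 0, "annotations": 8},    # error rate 0.00
    {"errors": 3, "annotations": 12},   # error rate 0.25
]

# Avg. doc error rate: simple mean of the per-document error rates.
doc_error_rates = [d["errors"] / d["annotations"] for d in docs]
avg_doc_error_rate = sum(doc_error_rates) / len(doc_error_rates)  # 0.15
```

Note that this is a mean of ratios, not a ratio of totals: a small document with many errors affects the average as much as a large one.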

Field performance table

The Field performance table is the primary way to analyze model performance in the Measure page. It displays one row per field or field group, along with performance and error metrics calculated across the annotated documents in the project. The table does not take unannotated or partially annotated documents into account when calculating metrics.

The table helps answer questions such as:

  • Which fields limit the overall model performance?
  • Are errors concentrated in a few fields or spread broadly?
  • Did a recent model change improve or degrade specific fields?

The Field performance table includes several categories of metrics that help you analyze model performance from different perspectives. Each category answers a specific diagnostic question about how your model behaves across fields and documents.

Note:

Validation status and partial results

To reduce waiting time:

  • Field performance metrics become visible once validation reaches a minimum completion threshold.
  • Warnings indicate when validation is still in progress and that the displayed results may change.
Performance metrics

The purpose of the performance metrics is to evaluate the overall quality of extraction for each field or field group. The performance metrics are described as follows:

  • F1 score — The harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). F1 score only remains high when both precision and recall are high. In practice, this makes F1 a strong overall quality indicator for extraction tasks where you care about avoiding incorrect values and avoiding missed values. Therefore, F1 is a useful first metric to review when analyzing field performance changes across model versions.
  • Precision — Measures how often predicted values are correct: Precision = True positives / (True positives + False positives). True positives are predictions that match the annotated value, excluding values annotated as missing.
  • Recall — Measures how often the model finds a value when it exists: Recall = True positives / (True positives + False negatives). False negatives are annotated values that the model did not predict, excluding values annotated as missing.
  • Error rate — Total errors / Total annotations. Values marked as missing are included in the count of errors and annotations.
  • Error rate (excluding missing) — (Total errors – Extra predictions) / Annotated values. Annotated values marked as missing are excluded.
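The formulas above can be sketched directly from raw counts. The helper below is an illustrative restatement of these definitions, not the actual IXP implementation; the example counts are invented.

```python
def field_metrics(tp: int, fp: int, fn: int,
                  total_errors: int, total_annotations: int) -> dict:
    """Compute per-field metrics from raw counts, per the formulas above."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    error_rate = total_errors / total_annotations if total_annotations else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "error_rate": error_rate}

# Hypothetical counts: 8 true positives, 1 false positive, 1 false
# negative, 2 total errors across 10 total annotations.
metrics = field_metrics(tp=8, fp=1, fn=1, total_errors=2, total_annotations=10)
```

With equal precision and recall (8/9 each in this example), F1 equals that shared value, which illustrates why F1 only drops when one of the two components drops.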
Predictions and errors

The purpose of the predictions and errors metrics is to understand the volume and composition of errors that contribute to poor performance. The metrics are described as follows:

  • Total errors — Total number of errors for a field across all error classes: Total errors = Incorrect predictions + Missed predictions + Extra predictions.
  • Total predictions — Total number of predicted values for a field: Total predictions = Correct values + Correct missing + Incorrect predictions.
  • Incorrect predictions — Number of predictions where the extracted value does not match the annotation. Excludes predictions and annotated values marked as missing.
  • Extra predictions — Number of predicted values that the model should not have extracted, that is, predictions with no corresponding annotation or whose annotation is marked as missing.
  • Missed predictions — Number of annotated values that the model failed to extract.
  • Correct values — Number of predicted values that exactly match the annotation.
  • Correct missing — Number of instances where the model correctly predicted that a value is missing.
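The two identities above (Total errors = Incorrect + Missed + Extra, and Total predictions = Correct values + Correct missing + Incorrect) can serve as a quick consistency check when reviewing exported counts. The counts below are made up for demonstration.

```python
# Hypothetical error and prediction counts, for illustration only.
counts = {
    "correct_values": 40,
    "correct_missing": 5,
    "incorrect_predictions": 3,
    "extra_predictions": 2,
    "missed_predictions": 4,
}

# Total errors = Incorrect predictions + Missed predictions + Extra predictions
total_errors = (counts["incorrect_predictions"]
                + counts["missed_predictions"]
                + counts["extra_predictions"])

# Total predictions = Correct values + Correct missing + Incorrect predictions
total_predictions = (counts["correct_values"]
                     + counts["correct_missing"]
                     + counts["incorrect_predictions"])
```

Missed predictions count toward errors but not toward predictions, while correct values count toward predictions but not toward errors, so the two totals are built from overlapping but distinct subsets.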
Annotations

The purpose of the annotations metrics is to provide context for how much labeled data supports each metric and how reliable performance scores are. The metrics are described as follows:

  • Total annotations — Total number of annotations, including values marked as missing: Total annotations = Annotated values + Annotated values marked as missing.
  • Annotated values — Total number of annotated field values, excluding those marked as missing.
  • Annotated as missing — Total number of times a field was explicitly labeled as missing.
Document-level metrics

The purpose of document-level metrics is to understand how errors are distributed across documents rather than just across predictions. The metrics are described as follows:

  • Documents with errors — Total number of documents where the field has at least one error.
  • Documents annotated — Total number of documents in which the field has at least one annotated field value.
  • Percentage of documents with errors — Percentage of annotated documents that contain at least one error for the field: Documents with errors / Documents annotated.
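The percentage above is a straightforward ratio; a small sketch with invented counts:

```python
# Hypothetical document-level counts, for illustration only.
documents_with_errors = 3
documents_annotated = 20

# Percentage of annotated documents that contain at least one error.
pct_docs_with_errors = 100 * documents_with_errors / documents_annotated  # 15.0
```

A high error rate combined with a low percentage here suggests errors concentrated in a few documents, which is exactly the pattern Scenario 3 below describes.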
Example scenarios

Scenario 1: Low F1 + Low Precision, but Recall is moderate or high

What you observe

F1 is low, Precision is low, and Recall is moderate or high.

What it usually means

  • The model is extracting values for the field, but many of the predicted values are incorrect or should not have been extracted at all.
  • Common root causes:
    • Field instruction is too broad or ambiguous. For example, the field instruction is "capture the amount", but it does not specify which amount.
    • The document has similar values that can be confused for one another, for example, subtotal versus total, ship-to versus bill-to.

What to do next

Compare the incorrect and extra predictions to identify whether the issue is tied to extracting the wrong value (non-zero incorrect predictions count) or to extracting a value that should not have been extracted at all (non-zero extra predictions count). Tighten field instructions with disambiguators, such as labels, keywords, and formatting constraints.

Scenario 2: High Missed Predictions (Recall is low), Precision is moderate or high

What you observe

  • Recall is low and Precision is moderate or high (F1 is typically low or moderate).
  • Missed predictions is high, often more than incorrect or extra.

What it usually means

  • The model is failing to extract values that are present.
  • Common root causes:
    • Field instruction is too narrow, which means over-constrained examples or too-specific label requirements.
    • The value appears in multiple formats, such as dates and IDs, and the instruction does not cover variants.

What to do next

  • Use Missed predictions + Annotated values to confirm this is a recall problem, that is, that the values exist but are not found. Check Annotated values to confirm there is a reasonable number of annotated datapoints for the field, and Missed predictions to confirm that the model is struggling to find values as opposed to predicting them incorrectly.
  • Expand instructions to include acceptable variants: alternative labels or synonyms, multiple formatting patterns, location hints (for example, near applicant details or under the borrower section).

Scenario 3: High Error rate but Low Docs with errors (errors concentrated in a few documents)

What you observe

  • Error rate is high or Total errors is high.
  • Docs with errors is low relative to documents annotated.
  • Often one field looks bad but only fails on a small subset of documents.

What it usually means

  • Errors are driven by outlier documents, not systemic field behavior.
  • Common root causes:
    • A specific document or format behaves differently than the rest.
    • OCR or quality issues in a small number of documents, such as blurry scans, skew, and handwritten overlays.
    • The field is present in most documents but formatted unusually in a few, for example, multi-line versus single-line.

What to do next

  • Compare Docs with errors and Docs annotated, and optionally % of Docs with errors, to confirm concentration.
  • Sort documents by Error rate in the Build page and inspect documents with the highest error rate to identify if the field is performing poorly on a specific subset.

Scenario 4: Large swings in performance between versions with few annotations

What you observe

  • Large differences in F1 or error rate between model versions (up or down), but Annotated values is low, Docs annotated is low, or both.

What it usually means

  • The field metrics are not stable yet due to small sample size.
  • Common root causes:
    • Too few examples — 1–2 documents can significantly change rates.
    • Field is rarely present, that is, many missing cases and few true values.
    • A handful of difficult documents dominate the metric.

What to do next

  • Check Annotated values, Docs annotated, and Annotated as missing to validate low coverage.
  • Treat the metrics as directional, not definitive, until coverage increases.
  • Add more labeled data specifically for that field: prioritize documents where the field is present, and include a diverse set of samples or variants.
  • Use version comparisons only after coverage is sufficient to reduce variability-driven noise.
Filtering and sorting

To filter rows in the table, select one or more of the available quick filters:

  • Annotated Values <10
  • Field F1 score < 50
  • Field F1 score within 50–70

You can also sort the Field performance table by any metric in the table. When a sort is applied, values are sorted within their respective field group. For example, sorting by F1 score sorts the fields within each field group relative to one another.
Visibility settings

By default, Measure shows differences for performance metrics, for example, F1 score and error rate.

To view differences across all metrics, proceed as follows:

  1. Enable the Show differences in scores from: Version toggle.
  2. Select the Show differences in scores from: Version dropdown.
  3. Select Visibility settings.
  4. In the Version changes - visibility settings pop-up, select All Metrics. The available options are:
    • Performance metrics only — Performance metrics are determined by model predictions being compared to annotations, such as F1 score and error rate.
    • All metrics
    • Show changes inside model variability — By default, changes within the current version's variability ranges are not considered significant and are hidden. Enable to display them. When selected, the following option becomes available:
      • Show colors for all changes — By default, changes within the variability range appear in gray. Enable to color all changes green or red.
  5. Select Save.

Model versions

Model versions capture the current state of the project at the time the version was created. You can publish model versions to save them and use them in an automation. In addition, you can star versions in the Measure page to save their performance statistics. You can compare the current performance against previous versions to ensure continued performance improvement during iteration on instructions.

Selecting a model version

Use the Version dropdown to choose which validation results of a specific model version are displayed throughout the Measure page, such as Field performance, Document performance, and associated metrics. When you switch the model version, all metrics on the page are updated to reflect the validation results of the selected version.

Comparing different model versions using score differences

When multiple model versions are available, the Measure page allows you to compare the current model against a previous version. This way, you can better understand the impact of changes to field instructions, changes in annotations, or model configuration updates.

How it works

  • Measure allows you to view score differences from another model version.
  • Positive or negative changes highlight improvements or regressions.

By default, Measure compares against the version immediately preceding the most recently created model version. To compare a different model version, select an available version using the Show differences in scores from: Version dropdown.
Understanding model variability and impact on score differences

Some models in IXP are non-deterministic, which means that the set of predictions of a field between model versions can vary slightly even when the instructions of that field are unchanged.

The Measure page allows you to take model variability into account during performance analysis. This helps you:

  • Understand whether a performance change is meaningful.
  • Avoid overinterpreting small metric fluctuations.

By default:

  • Score differences that fall within the variability range of a metric are hidden when comparing two model versions.
  • You can select to show all score differences or only differences that are greater than or equal to the variability of a metric.

These defaults ensure attention is focused on significant changes in model performance, not on noise.

To show differences between model versions irrespective of model variability, proceed as follows:

  1. Enable the Show differences in scores from: Version toggle.
  2. Select the Show differences in scores from: Version dropdown.
  3. Select Visibility settings.
  4. In the pop-up window, select Show changes inside model variability. The available options are:
    • Performance metrics only — Performance metrics are determined by model predictions being compared to annotations, such as F1 score and error rate.
    • All metrics
    • Show changes inside model variability — By default, changes within the current version's variability ranges are not considered significant and are hidden. Enable to display them. When selected, the following option becomes available:
      • Show colors for all changes — By default, changes within the variability range appear in gray. Enable to color all changes green or red.
  5. Optionally, select Show colors for all changes if you want all score differences to appear in green or red. By default, differences within the variability range are displayed in gray.
  6. Select Save.

Starring a model version

A new model version is created each time you make changes to your taxonomy, including instructions, or to the model settings. The latest version of the model is always available, but you can also star a specific model version, that is, lock it in place, so that its performance statistics are always shown in the dashboard.

To star a model version, proceed as follows:

  1. Expand the Model versions dropdown to view the list of all available versions.
  2. Select the star icon next to a model version to pin it to the top of the list and to the dashboard.
Note:

Starring a model version does not save the model version itself, only its performance statistics. To save a model version, you must publish it from the Publish tab.

Exporting Measure data

You can export data from the Measure page for:

  • Offline analysis.
  • Custom filtering.
  • Sharing results with stakeholders.

Exports include field-level predictions, annotations, and performance metrics visible in the Measure page. To export data, proceed as follows:

  1. Navigate to the Measure page.
  2. Select the vertical ellipsis.
  3. Select Export as Excel file.
