Building custom regex general fields
Permissions required: "Modify datasets".
A Custom Regex General Field can be used to extract and format spans of text that have a known repetitive structure, such as IDs or reference numbers.
This is a useful option for simple, structured general fields with little variation. For general fields with significant variation, where context strongly influences predictions, a machine-learning-based general field is the better choice. The two kinds can be combined in any dataset within Communications Mining.
A broader regex (that is, a set of rules defining the general field) can also be used as the base of a custom general field. This combines the rules with contextual, machine-learning-based refinement through training within Communications Mining to create sophisticated custom general fields. This approach gives the best performance while still enforcing the restrictions on extracted values that automation needs.
A Custom Regex General Field is made up of one or more Custom Regex Templates. Each template expresses one way to extract (and format) the general field.
Combined together, these templates offer a flexible and powerful way to cover multiple representations of the same general field type.
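A template is made up of two parts:
- The regex (regular expression), which describes the constraints that a span of text must meet to be extracted as a general field
- The formatting, which expresses how to normalize the extracted string into a more standard format
For example, if your customer IDs can be either the word "ID" followed by 7 digits or an alphanumeric string of 9 characters, you would define two templates, one for each representation.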
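As a rough illustration of how two templates can cover two representations of the same general field, the sketch below uses Python's `re` module as a stand-in for the platform. The two patterns and the helper name `extract_customer_ids` are assumptions made for this example, not the platform's own templates.

```python
import re

# Hypothetical patterns for the two ID shapes described above (assumptions,
# not the platform's own templates).
TEMPLATE_REGEXES = [
    r"\bID \d{7}\b",     # the word "ID", a space, then 7 digits
    r"\b[A-Z0-9]{9}\b",  # a 9-character upper-case alphanumeric string
]

def extract_customer_ids(text):
    """Return every span matched by any template, in template order."""
    found = []
    for pattern in TEMPLATE_REGEXES:
        found.extend(m.group(0) for m in re.finditer(pattern, text))
    return found

print(extract_customer_ids("Customer ID 1234567 asked about account AB12CD34E."))
```

Each pattern is tried independently, so adding a third representation only means adding a third entry to the list.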
The Custom Regex Template can be tested on text to ensure that it behaves as expected. Any general field that would be extracted with the Template will be shown in a list, with its value, as well as the position of the start and end characters.
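For example, with the regex `\d{4}` and the formatting `ID-{$}`, a test string containing one four-digit number will show one extraction.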
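The test view can be approximated outside the platform. The following is a minimal sketch in Python, assuming that Python's `re` behaves like the platform's regex engine for this pattern; the helper `preview_template` and the test string are hypothetical, and the `ID-{$}` formatting is emulated by simply prepending a prefix to the full match.

```python
import re

def preview_template(regex, prefix, text):
    """List (value, start, end) for each span the regex matches in text."""
    return [(prefix + m.group(0), m.start(), m.end())
            for m in re.finditer(regex, text)]

# One four-digit number, so one extraction, reported with its character offsets.
print(preview_template(r"\d{4}", "ID-", "Ref 1234 received"))
```

The start and end values are offsets of the matched span in the original text, not of the formatted value.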
The regex is the pattern used to extract general fields in the text. See here for the syntax documentation.
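Named capture groups can be used to identify a specific part of the extracted string for later formatting. Capture group names should be unique across all templates and should contain only lowercase letters or digits.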
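The naming rules can be checked before templates are entered. This is a small sketch, assuming the templates use the `(?P<name>...)` syntax that Python's `re` also accepts; the function `check_group_names` is hypothetical and is not part of the platform.

```python
import re

def check_group_names(patterns):
    """Report capture group names that are reused or not lowercase alphanumeric."""
    seen = set()
    problems = []
    for pattern in patterns:
        # groupindex maps each named group in the compiled pattern to its index.
        for name in re.compile(pattern).groupindex:
            if not re.fullmatch(r"[a-z0-9]+", name):
                problems.append(f"invalid name: {name}")
            if name in seen:
                problems.append(f"duplicate name: {name}")
            seen.add(name)
    return problems

print(check_group_names([r"(?P<id1>\d{3})", r"(?P<id1>\d{4})"]))
```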
Formatting can be provided to post-process the extracted general field.
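By default, no formatting is applied, and the string returned by the platform is the string extracted by the regex. If needed, more complex transformations can be defined using the following rules.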
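In the formatting string, a named capture group is referenced by prefixing its name with the `$` symbol. Note that the `$` symbol on its own represents the full regex match.
Variables and function calls must be enclosed in `{` and `}` braces.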
For example, a template can add an `ID-` prefix to an extracted ID. With the text "My identification number is 1234567", it returns one general field: ID-1234567
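The variable rules can be emulated in a few lines of Python. This is only a sketch under stated assumptions: the regex `\b\d{7}\b` is a guess at a suitable pattern for the example text, the helper `render` is hypothetical, and it handles plain `{$}` and `{$name}` substitutions only, not functions or concatenation.

```python
import re

def render(formatting, match):
    """Replace {$} with the full match and {$name} with that named group."""
    def resolve(expr):
        name = expr.group(1)
        value = match.group(name) if name else match.group(0)
        return value or ""  # a group that did not take part yields an empty string
    return re.sub(r"\{\$([a-z0-9]*)\}", resolve, formatting)

text = "My identification number is 1234567"
print(render("ID-{$}", re.search(r"\b\d{7}\b", text)))
print(render("ID-{$num}", re.search(r"(?P<num>\d{7})", text)))
```

Both calls produce the same value, since in the second pattern the named group spans the whole match.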
Strings can be concatenated using the `&` symbol.
Regex | (?P<id1>\b\d{3}\b)|(?P<id2>\b\d{4}\b) |
Formatting | {$id1 & "-" & $id2} |
Text | The first ID is 123 and the second ID is 4567 |
General Field returned by the platform | 123-4567 |
`proper` capitalizes the extracted span:
Regex | \w+\s\w+ |
Formatting | {proper($)} |
Text | albert einstein |
General Field returned by the platform | Albert Einstein |
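For readers who want to predict the result before testing in the platform, Python's `str.title()` is a close analogue. This is an assumption about equivalent behaviour for simple ASCII words, not a description of how the platform implements `proper`.

```python
import re

def proper(text):
    """Approximate proper(): upper-case the first letter of each word."""
    return text.title()

match = re.search(r"\w+\s\w+", "albert einstein")
print(proper(match.group(0)))
```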
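`pad` pads the extracted span to a given size using a given character.
Function arguments:
- The text containing the characters to pad
- The size of the padded string
- The character to use for padding
Regex | \d{2,5} |
Formatting | {pad($, 5, "0")} |
Text | 123 |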
General Field returned by the platform | 00123 |
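The example suggests that padding is added on the left. Under that assumption, an equivalent in Python is `str.rjust`; the function below is a hypothetical stand-in, not the platform's implementation.

```python
def pad(text, size, char):
    """Left-pad text with char until it is size characters long."""
    return text.rjust(size, char)

print(pad("123", 5, "0"))
```

A span that is already at least the requested size is returned unchanged by `rjust`; whether the platform behaves the same way is not stated in the source.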
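`substitute` replaces a character with other characters.
Function arguments:
- The text containing the characters to replace
- The character to replace
- What the old character should be replaced with
Regex | ab |
Formatting | {substitute($, "a", "12")} |
Text | ab |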
General Field returned by the platform | 12b |
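A comparable operation in Python is `str.replace`. The sketch assumes that every occurrence of the old character is replaced, which the single-occurrence example in the source neither confirms nor rules out.

```python
def substitute(text, old, new):
    """Replace each occurrence of old in text with new."""
    return text.replace(old, new)

print(substitute("ab", "a", "12"))
```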
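`left` returns the first n characters of the span.
Function arguments:
- The text containing the characters to extract
- The number of characters to return
Regex | \w{4} |
Formatting | {left($, 2)} |
Text | ABCD |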
General Field returned by the platform | AB |
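`right` returns the last n characters of the span.
Function arguments:
- The text containing the characters to extract
- The number of characters to return
Regex | \w{4} |
Formatting | {right($, 2)} |
Text | ABCD |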
General Field returned by the platform | CD |
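Both `left` and `right` correspond to string slicing in Python. The functions below are hypothetical equivalents for checking expected values; treating a count of zero as an empty result is an assumption.

```python
def left(text, n):
    """Return the first n characters of text."""
    return text[:n]

def right(text, n):
    """Return the last n characters of text (empty when n is 0)."""
    return text[-n:] if n > 0 else ""

print(left("ABCD", 2), right("ABCD", 2))
```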