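Custom Named Entity Recognition
Out of the Box Packages > UiPath Language Analysis > Custom Named Entity Recognition
This model allows you to bring your own dataset tagged with the entities you want to extract. The training and evaluation datasets must be in CoNLL or JSON format. You can also export the data from the AI Center Data Labeling tool or from Label Studio. This ML package must be retrained: if it is deployed without being trained first, the deployment fails with an error stating that the model is not trained.
For an example of how to use this model, see the use case Extracting chemicals by category from research papers.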
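When to use the Custom Named Entity Recognition (NER) model
Use the Custom NER model to extract:
- Special information from text. This information is called an entity.
- Names of people, places, organizations, locations, dates, numerical values, and so on. The extracted entities are mutually exclusive. Entities are at the single-word or multi-word level, not at the sub-word level. For example, in the sentence "I live in New York", an entity can be "New York", but not in the sentence "I read the New Yorker".
You can use the extracted entities directly in information extraction processes, or as inputs to downstream tasks such as classification of the source text, sentiment analysis of the source text, PHI, and so on.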
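Because the extracted entities are mutually exclusive, annotated spans in the training data should not overlap either. The following is a minimal sketch of such a check, assuming annotations carry `start_index` and `end_index` fields as in the JSON samples on this page and that `end_index` is exclusive. The function name `find_overlaps` is hypothetical and not part of the package.

```python
from typing import Dict, List, Tuple


def find_overlaps(entities: List[Dict]) -> List[Tuple[Dict, Dict]]:
    """Return pairs of annotations whose character spans overlap.

    Assumption: spans are half-open, [start_index, end_index).
    """
    ordered = sorted(entities, key=lambda e: (e["start_index"], e["end_index"]))
    overlaps = []
    for i, current in enumerate(ordered):
        for later in ordered[i + 1:]:
            if later["start_index"] >= current["end_index"]:
                break  # sorted by start, so nothing further can overlap `current`
            overlaps.append((current, later))
    return overlaps


# "New York" (10-18) and a sub-span "York" (14-18) overlap and would be flagged.
annotations = [
    {"entity": "LOC", "value": "New York", "start_index": 10, "end_index": 18},
    {"entity": "LOC", "value": "York", "start_index": 14, "end_index": 18},
]
conflicts = find_overlaps(annotations)
```

Running a check like this before training is one way to catch labeling mistakes early; adjacent spans that merely touch are not reported.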
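Training dataset recommendations
- Have at least 200 samples per entity if the entities are dense in the samples, meaning that most samples (more than 75%) contain 3-5 of these entities.
- If the entities are sparse (each sample has fewer than three entities), that is, only a few entities appear in most documents, have at least 400 samples per entity. This helps the model better understand the discriminative features.
- If there are more than 10 entities, add 100 more samples incrementally until you reach the desired performance metric.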
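The recommendations above can be turned into a rough pre-training check. The sketch below is one possible reading, not an official rule: it treats a dataset as dense when more than 75% of the samples contain at least three annotated entities, and it represents each sample simply as the list of entity labels annotated in it. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def samples_per_entity(samples: List[List[str]]) -> Counter:
    """Count, for each entity label, how many samples contain it at least once."""
    counts: Counter = Counter()
    for labels in samples:
        counts.update(set(labels))
    return counts


def recommended_minimum(samples: List[List[str]]) -> int:
    """Return 200 for dense datasets and 400 for sparse ones (assumed reading)."""
    if not samples:
        return 400
    dense = sum(1 for labels in samples if len(labels) >= 3)
    return 200 if dense / len(samples) > 0.75 else 400


def under_represented(samples: List[List[str]]) -> Dict[str, int]:
    """Entity labels whose sample count is below the recommended minimum."""
    minimum = recommended_minimum(samples)
    return {label: n for label, n in samples_per_entity(samples).items() if n < minimum}


# Tiny illustrative dataset: every sample carries three entities, so it is dense.
toy_samples = [["PER", "LOC", "ORG"]] * 4
shortfall = under_represented(toy_samples)
```

With only four samples, every label in the toy dataset is reported as under-represented, which is the expected outcome for such a small set.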
Best practices
- Have meaningful entities; if a human cannot identify an entity, neither can the model.
- Have simple entities. Instead of a single address entity, break it down into multiple entities: street name, state name, city name, zip code, and so on.
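- Create both train and test datasets, and use a full pipeline for training.
- Start with the minimum number of annotated samples, covering all the entities.
- Make sure all entities are represented in both the train and the test split.
- Run a full pipeline and check the test metrics. If the test metrics are not satisfactory, check the classification report and identify the poorly performing entities. Add more samples that cover these entities and repeat the training process until the desired metrics are reached.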
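This multilingual model supports the languages listed below. These languages were chosen because they are the top 100 languages with the largest Wikipedias:
- Afrikaans
- Albanian
- Arabic
- Aragonese
- Armenian
- Asturian
- Azerbaijani
- Bashkir
- Basque
- Bavarian
- Belarusian
- Bengali
- Bishnupriya Manipuri
- Bosnian
- Breton
- Bulgarian
- Burmese
- Catalan
- Cebuano
- Chechen
- Chinese (Simplified)
- Chinese (Traditional)
- Chuvash
- Croatian
- Czech
- Danish
- Dutch
- English
- Estonian
- Finnish
- French
- Galician
- Georgian
- German
- Greek
- Gujarati
- Haitian
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Ido
- Indonesian
- Irish
- Italian
- Japanese
- Javanese
- Kannada
- Kazakh
- Kyrgyz
- Korean
- Latin
- Latvian
- Lithuanian
- Lombard
- Low Saxon
- Luxembourgish
- Macedonian
- Malagasy
- Malay
- Malayalam
- Marathi
- Minangkabau
- Mongolian
- Nepali
- Newar
- Norwegian (Bokmål)
- Norwegian (Nynorsk)
- Occitan
- Persian (Farsi)
- Piedmontese
- Polish
- Portuguese
- Punjabi
- Romanian
- Russian
- Scots
- Serbian
- Serbo-Croatian
- Sicilian
- Slovak
- Slovenian
- South Azerbaijani
- Spanish
- Sundanese
- Swahili
- Swedish
- Tagalog
- Tajik
- Tamil
- Tatar
- Telugu
- Thai
- Turkish
- Ukrainian
- Urdu
- Uzbek
- Vietnamese
- Volapük
- Waray-Waray
- Welsh
- West Frisian
- Western Punjabi
- Yoruba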
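A list of the named entities in the text. Each element of the list has the following items in the prediction:
- The text that was recognized
- The start and end positions of the text, character-wise
- The type of the named entity
- The confidence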
{ "response" : [{ "value": "George Washington", "start_index": 0, "end_index": 17, "entity": "PER", "confidence": 0.96469810605049133 }] }
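As a minimal sketch of how a consumer might handle this output, the snippet below parses the response shown above and keeps only the entities at or above a confidence threshold. The function name `entities_above` and the threshold values are illustrative assumptions, not part of the package.

```python
import json
from typing import Dict, List


def entities_above(raw_response: str, threshold: float) -> List[Dict]:
    """Parse the prediction JSON and keep entities with confidence >= threshold."""
    payload = json.loads(raw_response)
    return [item for item in payload["response"] if item["confidence"] >= threshold]


# The sample response from this page, passed in as a JSON string.
raw = (
    '{ "response" : [{ "value": "George Washington", "start_index": 0, '
    '"end_index": 17, "entity": "PER", "confidence": 0.96469810605049133 }] }'
)
kept = entities_above(raw, 0.9)
for item in kept:
    print(item["entity"], item["value"], item["start_index"], item["end_index"])
```

In the sample, `end_index` minus `start_index` equals the length of the value, which suggests the end position is exclusive.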
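This package supports all three types of pipelines (Full Pipeline, Training, and Evaluation). For most use cases, no parameters need to be specified; the model uses advanced techniques to find a performant model. In trainings after the first one, the model uses incremental learning (that is, the previously trained version is used at the end of a training run).
This model supports reading all files in a given directory during all pipeline runs (training, evaluation, and full pipeline).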
Write names without spaces, for example SetDate instead of Set Date.
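CoNLL file format
The model reads all files with the .txt and/or .conll extension.
The CoNLL file format represents a body of text with one word per line, each word containing 10 tab-separated columns with information about the word (for example, surface and syntax).
The trainable named entity recognition supports two CoNLL formats:
- With just two columns in the text.
- With four columns in the text.
To use this format, set dataset.input_format to conll or label_studio.
The label_studio format is the same as the CoNLL format, with a new empty line as the separator between two data points. To support separating two data points with -DOCSTART- -X- O O, add dataset.input_format as an environment variable and set its value to conll.
For more information, see the examples below.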
Japan NNP B-NP B-LOC
began VBD B-VP O
the DT B-NP O
defence NN I-NP O
of IN B-PP O
their PRP$ B-NP O
Asian JJ I-NP B-MISC
Cup NNP I-NP I-MISC
title NN I-NP O
with IN B-PP O
a DT B-NP O
lucky JJ I-NP O
2-1 CD I-NP O
win VBP B-VP O
against IN B-PP O
Syria NNP B-NP B-LOC
in IN B-PP O
a DT B-NP O
Group NNP I-NP O
C NNP I-NP O
championship NN I-NP O
match NN I-NP O
on IN B-PP O
Friday NNP B-NP O
. . O O

Founding O
member O
Kojima B-PER
Minoru I-PER
played O
guitar O
on O
Good B-MISC
Day I-MISC
, O
and O
Wardanceis I-MISC
cover O
of O
a O
song O
by O
UK I-LOC
post O
punk O
industrial O
band O
Killing B-ORG
Joke I-ORG
. O
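To make the two layouts concrete, here is a minimal parsing sketch. It assumes the token is always the first column and the entity tag the last one, which holds for both the two-column and the four-column samples, and for simplicity it treats both empty lines and `-DOCSTART-` lines as data point separators. The functions `read_conll` and `extract_entities` are hypothetical helpers, not the package's own reader.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, entity tag)


def read_conll(text: str) -> List[List[Token]]:
    """Split CoNLL text into data points of (word, tag) pairs."""
    data_points: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        columns = line.split()
        if not columns or columns[0] == "-DOCSTART-":
            if current:  # a separator closes the data point collected so far
                data_points.append(current)
                current = []
            continue
        current.append((columns[0], columns[-1]))
    if current:
        data_points.append(current)
    return data_points


def extract_entities(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged words into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    label = None
    for word, tag in tokens:
        prefix, _, name = tag.partition("-")
        if prefix == "I" and name == label:
            words.append(word)  # continuation of the open entity
            continue
        if words:
            entities.append((" ".join(words), label))
        if prefix in ("B", "I"):
            words, label = [word], name  # a stray I- tag opens a new entity
        else:
            words, label = [], None
    if words:
        entities.append((" ".join(words), label))
    return entities


sample = "Japan NNP B-NP B-LOC\nbegan VBD B-VP O\n\nKojima B-PER\nMinoru I-PER\nplayed O\n"
points = read_conll(sample)
```

Reading only the first and last columns lets the same function handle both supported layouts without a format switch.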
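JSON file format
The model reads all files with the .json extension.
See the following sample and environment variables for an example of the JSON file format.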
{
"text": "Serotonin receptor 2A ( HTR2A ) gene polymorphism predicts treatment response to venlafaxine XR in generalized anxiety disorder . anxiety disorder ( GAD ) is a chronic psychiatric disorder with significant morbidity and mortality .\nAntidepressant drugs are the preferred choice for treatment ; however , treatment response is often variable .\nSeveral studies in major depression have implicated a role of the serotonin receptor gene ( HTR2A ) in treatment response to antidepressants .\nWe tested the hypothesis that the genetic polymorphism rs7997012 in the HTR2A gene predicts treatment outcome in GAD patients treated with venlafaxine XR . Treatment response was assessed in 156 patients that participated in a 6-month open - label clinical trial of venlafaxine XR for GAD . Primary analysis included Hamilton Anxiety Scale ( HAM-A ) reduction at 6 months .\nSecondary outcome measure was the Clinical Global Impression of Improvement ( CGI-I ) score at 6 months .\nGenotype and allele frequencies were compared between groups using χ(2) contingency analysis .\nThe frequency of the G-allele differed significantly between responders ( 70% ) and nonresponders ( 56% ) at 6 months ( P=0.05 ) using the HAM-A scale as outcome measure .\nSimilarly , using the CGI-I as outcome , the G-allele was significantly associated with improvement ( P=0.01 ) .\nAssuming a dominant effect of the G-allele , improvement differed significantly between groups ( P=0.001 , odds ratio=4.72 ) .\nSimilar trends were observed for remission although not statistically significant .\nWe show for the first time a pharmacogenetic effect of the HTR2A rs7997012 variant in anxiety disorders , suggesting that pharmacogenetic effects cross diagnostic categories .\nOur data document that individuals with the HTR2A rs7997012 single nucleotide polymorphism G-allele have better treatment outcome over time .\nFuture studies with larger sample sizes are necessary to further characterize this effect in treatment response to antidepressants in GAD .",
"entities": [{
"entity": "TRIVIAL",
"value": "Serotonin",
"start_index": 0,
"end_index": 9
}, {
"entity": "TRIVIAL",
"value": "venlafaxine",
"start_index": 81,
"end_index": 92
}, {
"entity": "TRIVIAL",
"value": "serotonin",
"start_index": 409,
"end_index": 418
}, {
"entity": "TRIVIAL",
"value": "venlafaxine",
"start_index": 625,
"end_index": 636
}, {
"entity": "TRIVIAL",
"value": "venlafaxine",
"start_index": 752,
"end_index": 763
}, {
"entity": "FAMILY",
"value": "nucleotide",
"start_index": 1800,
"end_index": 1810
}]
}
The environment variables for the previous example are as follows:
- dataset.input_format: json
- dataset.input_column_name: text
- dataset.output_column_name: entities
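In this format each entity is located by character offsets into the text, so a quick consistency check before training can catch shifted annotations. The sketch below assumes `end_index` is exclusive, as the sample values suggest (Serotonin spans 0 to 9), and uses the column names from the example above as defaults. The helper name `misaligned_entities` is hypothetical.

```python
from typing import Dict, List


def misaligned_entities(
    sample: Dict, text_key: str = "text", entities_key: str = "entities"
) -> List[Dict]:
    """Return the annotations whose offsets do not slice out their own value."""
    text = sample[text_key]
    return [
        entity
        for entity in sample[entities_key]
        if text[entity["start_index"]:entity["end_index"]] != entity["value"]
    ]


# Shortened version of the sample above: the first sentence and two entities.
record = {
    "text": "Serotonin receptor 2A ( HTR2A ) gene polymorphism predicts "
            "treatment response to venlafaxine XR",
    "entities": [
        {"entity": "TRIVIAL", "value": "Serotonin", "start_index": 0, "end_index": 9},
        {"entity": "TRIVIAL", "value": "venlafaxine", "start_index": 81, "end_index": 92},
    ],
}
problems = misaligned_entities(record)
```

An empty result means every annotation matches the text it points at.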
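ai_center file format
The model reads all files with the .json extension.
See the following sample and environment variables for an example of the ai_center file format.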
{
"annotations": {
"intent": {
"to_name": "text",
"choices": [
"TransactionIssue",
"LoanIssue"
]
},
"sentiment": {
"to_name": "text",
"choices": [
"Very Positive"
]
},
"ner": {
"to_name": "text",
"labels": [
{
"start_index": 37,
"end_index": 47,
"entity": "Stakeholder",
"value": " Citi Bank"
},
{
"start_index": 51,
"end_index": 61,
"entity": "Date",
"value": "07/19/2018"
},
{
"start_index": 114,
"end_index": 118,
"entity": "Amount",
"value": "$500"
},
{
"start_index": 288,
"end_index": 293,
"entity": "Stakeholder",
"value": " Citi"
}
]
}
},
"data": {
"cc": "",
"to": "xyz@abc.com",
"date": "1/29/2020 12:39:01 PM",
"from": "abc@xyz.com",
"text": "I opened my new checking account with Citi Bank in 07/19/2018 and met the requirements for the promotion offer of $500 . It has been more than 6 months and I have not received any bonus. I called the customer service several times in the past few months but no any response. I request the Citi honor its promotion offer as advertised."
}
}
To use the previous sample JSON, set the environment variables as follows:
- dataset.input_format to ai_center
- dataset.input_column_name to data.text
- dataset.output_column_name to annotations.ner.labels
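The column names in this format are dotted paths into the nested JSON. As an illustration only, and assuming each dot separates one level of nesting, a lookup could be sketched as follows; the `resolve` helper is hypothetical and is not the package's implementation.

```python
from functools import reduce
from typing import Any, Dict


def resolve(record: Dict[str, Any], dotted_name: str) -> Any:
    """Follow a dotted column name such as 'annotations.ner.labels' into a dict."""
    return reduce(lambda node, key: node[key], dotted_name.split("."), record)


# Reduced version of the ai_center sample above.
record = {
    "annotations": {
        "ner": {
            "to_name": "text",
            "labels": [
                {"start_index": 51, "end_index": 61, "entity": "Date", "value": "07/19/2018"}
            ],
        }
    },
    "data": {
        "text": "I opened my new checking account with Citi Bank in 07/19/2018 and met the requirements"
    },
}
text = resolve(record, "data.text")
labels = resolve(record, "annotations.ner.labels")
```

The same helper also works for the flat JSON format, where the column names (`text`, `entities`) contain no dots.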
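- dataset.input_column_name
  - The name of the column containing the text.
  - The default value is data.text.
  - This variable is only needed if the input file format is ai_center or JSON.
- dataset.target_column_name
  - The name of the column containing the labels.
  - The default value is annotations.ner.labels.
  - This variable is only needed if the input file format is ai_center or JSON.
- model.epochs
  - The number of epochs.
  - The default value is 5.
- dataset.input_format
  - The input format of the training data.
  - The default value is ai_center.
  - Supported values are ai_center, conll, label_studio, or json.
  - Note: The label_studio format is the same as the CoNLL format, with a new empty line as the separator between two data points. To support separating two data points with -DOCSTART- -X- O O, add dataset.input_format as an environment variable and set its value to conll.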
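The defaults and supported values listed above can be summarized in a small lookup. This is only a sketch of how such settings might be resolved, assuming they are supplied as a name-to-string mapping; it does not describe how the package reads them internally, and `load_settings` is a hypothetical name.

```python
from typing import Dict, Mapping, Optional

# Defaults as documented in the list above.
DEFAULTS = {
    "dataset.input_column_name": "data.text",
    "dataset.target_column_name": "annotations.ner.labels",
    "model.epochs": "5",
    "dataset.input_format": "ai_center",
}
SUPPORTED_FORMATS = {"ai_center", "conll", "label_studio", "json"}


def load_settings(overrides: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Merge user-supplied values over the documented defaults and validate them."""
    overrides = overrides or {}
    settings: Dict[str, object] = {
        name: overrides.get(name, default) for name, default in DEFAULTS.items()
    }
    if settings["dataset.input_format"] not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported input format: {settings['dataset.input_format']}")
    settings["model.epochs"] = int(settings["model.epochs"])
    return settings


defaults = load_settings()
conll_run = load_settings({"dataset.input_format": "conll", "model.epochs": "10"})
```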
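- Evaluation report, containing the following files:
  - Classification report
  - Confusion matrix
  - Precision-recall information
- JSON files: separate JSON files corresponding to each section of the evaluation report PDF file. These JSON files are machine-readable, so you can use them to pipe the model evaluation into Insights through a workflow.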
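Classification report
The classification report is derived from the test dataset when running the full pipeline or the evaluation pipeline. It contains the following information for each entity, in the form of a diagram:
- Entity - The name of the entity.
- Precision - The precision metric for correctly predicting the entity over the test set.
- Recall - The recall metric for correctly predicting the entity over the test set.
- F1 score - The F1 score metric for correctly predicting the entity over the test set; you can use this score to compare the entity-based performance of two differently trained versions of this model.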
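For reference, the F1 score is the harmonic mean of precision and recall. The snippet below shows the standard formula; the example values are made up for illustration and are not taken from any report.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Comparing one entity across two hypothetical trained versions.
version_a = f1_score(0.90, 0.80)
version_b = f1_score(0.85, 0.88)
better = "B" if version_b > version_a else "A"
```

Because the harmonic mean penalizes imbalance, a version with slightly lower precision but clearly higher recall can still score higher overall.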
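Confusion matrix
A table explaining the different error categories is also provided below the confusion matrix. For each entity, the table lists the categories correct, incorrect, missed, and spurious.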
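Precision-recall information
You can use this information to check the precision and recall of the model. The table above the diagram also provides the thresholds and the corresponding precision and recall values for each entity. This table lets you choose the threshold to configure in your workflow in order to decide when to send data to Action Center for human in the loop. Note that the higher the chosen threshold, the more data is routed to Action Center for human in the loop.
A precision-recall diagram and table are provided for each entity.
For an example of a precision-recall table per entity, see the table below.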
| Threshold | Precision | Recall |
|---|---|---|
| 0.5 | 0.9193 | 0.979 |
| 0.55 | 0.9224 | 0.9777 |
| 0.6 | 0.9234 | 0.9771 |
| 0.65 | 0.9256 | 0.9771 |
| 0.7 | 0.9277 | 0.9759 |
| 0.75 | 0.9319 | 0.9728 |
| 0.8 | 0.9356 | 0.9697 |
| 0.85 | 0.9412 | 0.9697 |
| 0.9 | 0.9484 | 0.9666 |
| 0.95 | 0.957 | 0.9629 |
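One way to use such a table is to pick the lowest threshold that meets a target precision and then route predictions below it to human review. The sketch below uses the sample values from the table; the target precision of 0.93 and the function names are illustrative assumptions.

```python
from typing import List, Optional, Tuple

# (threshold, precision, recall) rows copied from the sample table.
ROWS: List[Tuple[float, float, float]] = [
    (0.5, 0.9193, 0.979), (0.55, 0.9224, 0.9777), (0.6, 0.9234, 0.9771),
    (0.65, 0.9256, 0.9771), (0.7, 0.9277, 0.9759), (0.75, 0.9319, 0.9728),
    (0.8, 0.9356, 0.9697), (0.85, 0.9412, 0.9697), (0.9, 0.9484, 0.9666),
    (0.95, 0.957, 0.9629),
]


def lowest_threshold_meeting(
    rows: List[Tuple[float, float, float]], min_precision: float
) -> Optional[float]:
    """Return the smallest threshold whose precision reaches min_precision."""
    for threshold, precision, _recall in sorted(rows):
        if precision >= min_precision:
            return threshold
    return None  # no listed threshold reaches the target


def needs_review(confidence: float, threshold: float) -> bool:
    """Predictions below the threshold go to human review."""
    return confidence < threshold


chosen = lowest_threshold_meeting(ROWS, 0.93)
```

Choosing the lowest qualifying threshold keeps recall as high as the table allows and limits how many predictions are sent for review.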
For an example of a precision-recall diagram per entity, see the figure below.