AI Center - Custom Named Entity Recognition


Custom Named Entity Recognition

Out of the Box Packages > UiPath Language Analysis > CustomNamedEntityRecognition

This model lets you bring your own dataset tagged with the entities you want to extract. The training and evaluation datasets must be in either CoNLL or JSON format. The data can also be exported from the AI Center Data Labelling tool or from Label Studio. This ML Package must be trained before it is deployed: if you deploy it without training first, the deployment fails with an error stating that the model is not trained.

For an example of how to use this model, see the use case Extracting chemicals from research paper by category.

Languages

This multilingual model supports the languages listed below. These languages were chosen because they are the top 100 languages with the largest Wikipedias:

Afrikaans
Albanian
Arabic
Aragonese
Armenian
Asturian
Azerbaijani
Bashkir
Basque
Bavarian
Belarusian
Bengali
Bishnupriya Manipuri
Bosnian
Breton
Bulgarian
Burmese
Catalan
Cebuano
Chechen
Chinese (Simplified)
Chinese (Traditional)
Chuvash
Croatian
Czech
Danish
Dutch
English
Estonian
Finnish
French
Galician
Georgian
German
Greek
Gujarati
Haitian
Hebrew
Hindi
Hungarian
Icelandic
Ido
Indonesian
Irish
Italian
Japanese
Javanese
Kannada
Kazakh
Kirghiz
Korean
Latin
Latvian
Lithuanian
Lombard
Low Saxon
Luxembourgish
Macedonian
Malagasy
Malay
Malayalam
Marathi
Minangkabau
Mongolian
Nepali
Newar
Norwegian (Bokmal)
Norwegian (Nynorsk)
Occitan
Persian (Farsi)
Piedmontese
Polish
Portuguese
Punjabi
Romanian
Russian
Scots
Serbian
Serbo-Croatian
Sicilian
Slovak
Slovenian
South Azerbaijani
Spanish
Sundanese
Swahili
Swedish
Tagalog
Tajik
Tamil
Tatar
Telugu
Thai
Turkish
Ukrainian
Urdu
Uzbek
Vietnamese
Volapük
Waray-Waray
Welsh
West Frisian
Western Punjabi
Yoruba

Model details

Input description

Text in one of the above languages from which entities will be extracted.

Output description

List of named entities in the text. Each element in the list has the following items in the prediction:

Text that was recognized
Starting and ending positions of the text, character-wise
Type of the named entity
Confidence
```
{
 "response" : [{
   "value": "George Washington",
   "start_index": 0,
   "end_index": 17,
   "entity": "PER",
   "confidence": 0.96469810605049133 
  }]
}
```
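
As an illustration of consuming this output, the following minimal sketch parses the sample response and keeps only the entities above a confidence cut-off. The function name `confident_entities` and the `min_confidence` parameter are hypothetical and not part of the package.

```python
import json

# Sample response copied from the output description above.
raw = """{
 "response": [{
   "value": "George Washington",
   "start_index": 0,
   "end_index": 17,
   "entity": "PER",
   "confidence": 0.96469810605049133
 }]
}"""


def confident_entities(payload, min_confidence=0.5):
    """Return the entities whose confidence is at least min_confidence.

    min_confidence is a hypothetical cut-off chosen for this sketch.
    """
    entities = json.loads(payload)["response"]
    return [e for e in entities if e["confidence"] >= min_confidence]


for ent in confident_entities(raw, min_confidence=0.9):
    print(ent["entity"], ent["value"], ent["start_index"], ent["end_index"])
```

In the sample, `end_index` minus `start_index` equals the length of `value`, which suggests that the end position is exclusive.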

Recommend GPU

By default, a GPU is recommended.

Pipelines

All three types of pipelines (Full Training, Training, and Evaluation) are supported by this package. For most use cases, no parameters need to be specified; the model uses advanced techniques to find a performant model. In training runs after the first, the model uses incremental learning (that is, the version produced at the end of the previous Training Run is used as the starting point).

Fine-tuning using data from Validation Station

You can use the Label Studio APIs to write back the data and predictions with weak confidence. Your data can then be re-labeled and exported in CoNLL format.

For more information on how to use Label Studio, see Getting started with Label Studio. Also, you can download the UiPath Studio activity for Label Studio Integration here.

Training on GPU or CPU

You can use either GPU or CPU for training. We recommend using GPU since it's faster.

Dataset format

This model supports reading all files in a given directory during all pipeline runs (training, evaluation, and full pipeline).

Note: Make sure that label names do not contain any spaces or special characters. For example, instead of Set Date, use SetDate.
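
A quick pre-flight check of label names could look like the sketch below. It assumes that "special characters" means anything other than ASCII letters, digits, and underscores; the documentation does not define the exact set, so adjust the pattern if your labels need other characters. The helper names are hypothetical.

```python
import re

# Assumption: only ASCII letters, digits, and underscores are safe in label names.
_VALID_LABEL = re.compile(r"^[A-Za-z0-9_]+$")


def invalid_labels(labels):
    """Return the label names that contain spaces or other disallowed characters."""
    return [label for label in labels if not _VALID_LABEL.match(label)]


def to_safe_label(label):
    """Drop every disallowed character, for example "Set Date" becomes "SetDate"."""
    return re.sub(r"[^A-Za-z0-9_]", "", label)


print(invalid_labels(["SetDate", "Set Date", "Amount"]))
print(to_safe_label("Set Date"))
```

The check applies to the entity name itself (for example LOC), not to the B- and I- prefixes used in CoNLL tags.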

CoNLL file format

This model can read all files with a .txt and/or .conll extension using the CoNLL file format in the provided directory.

The CoNLL file format represents a body of text with one word per line; in its full form, each line contains 10 tab-separated columns with information about the word (for example, surface form and syntax).

The trainable named entity recognition supports two CoNLL formats:

With just two columns in the text.
With four columns in the text.

To use this format, set the dataset.input_format environment variable to either conll or label_studio.

Note: The label_studio format is the same as the CoNLL format, with separation between two data points being a new empty line. To support separation between two data points with -DOCSTART- -X- O O, add dataset.input_format as an environment variable and set its value to conll.

For more information, see the examples below.

Four-column format:

Japan NNP B-NP B-LOC
began VBD B-VP O
the DT B-NP O
defence NN I-NP O
of IN B-PP O
their PRP$ B-NP O
Asian JJ I-NP B-MISC
Cup NNP I-NP I-MISC
title NN I-NP O
with IN B-PP O
a DT B-NP O
lucky JJ I-NP O
2-1 CD I-NP O
win VBP B-VP O
against IN B-PP O
Syria NNP B-NP B-LOC
in IN B-PP O
a DT B-NP O
Group NNP I-NP O
C NNP I-NP O
championship NN I-NP O
match NN I-NP O
on IN B-PP O
Friday NNP B-NP O
. . O O

Two-column format:

Founding O
member O
Kojima B-PER
Minoru I-PER
played O
guitar O
on O
Good B-MISC
Day I-MISC
, O
and O
Wardanceis I-MISC
cover O
of O
a O
song O
by O
UK I-LOC
post O
punk O
industrial O
band O
Killing B-ORG
Joke I-ORG
. O
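
To make the two layouts concrete, here is a minimal reader sketch. It assumes that the token is always in the first column and the entity tag in the last, which holds for both examples above, and that a data point ends at an empty line or at a -DOCSTART- line. This is an illustration only, not the parser used by the package; `read_conll` is a hypothetical name.

```python
def read_conll(lines):
    """Group token lines into data points made of (token, tag) pairs.

    The token is read from the first column and the entity tag from the
    last one, so two-column and four-column files are handled alike.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # An empty line or a -DOCSTART- marker closes the current data point.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


sample = """Japan NNP B-NP B-LOC
began VBD B-VP O

Kojima B-PER
Minoru I-PER
played O
""".splitlines()

for sentence in read_conll(sample):
    print(sentence)
```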

JSON file format

When the environment variables are set accordingly, this model reads all files with a .json extension in the provided directory using the JSON format.

Check the following sample and environment variables for a JSON file format example.

{
    "text": "Serotonin receptor 2A ( HTR2A ) gene polymorphism predicts treatment response to venlafaxine XR in generalized anxiety disorder . anxiety disorder ( GAD ) is a chronic psychiatric disorder with significant morbidity and mortality .\nAntidepressant drugs are the preferred choice for treatment ; however , treatment response is often variable .\nSeveral studies in major depression have implicated a role of the serotonin receptor gene ( HTR2A ) in treatment response to antidepressants .\nWe tested the hypothesis that the genetic polymorphism rs7997012 in the HTR2A gene predicts treatment outcome in GAD patients treated with venlafaxine XR . Treatment response was assessed in 156 patients that participated in a 6-month open - label clinical trial of venlafaxine XR for GAD . Primary analysis included Hamilton Anxiety Scale ( HAM-A ) reduction at 6 months .\nSecondary outcome measure was the Clinical Global Impression of Improvement ( CGI-I ) score at 6 months .\nGenotype and allele frequencies were compared between groups using χ(2) contingency analysis .\nThe frequency of the G-allele differed significantly between responders ( 70% ) and nonresponders ( 56% ) at 6 months ( P=0.05 ) using the HAM-A scale as outcome measure .\nSimilarly , using the CGI-I as outcome , the G-allele was significantly associated with improvement ( P=0.01 ) .\nAssuming a dominant effect of the G-allele , improvement differed significantly between groups ( P=0.001 , odds ratio=4.72 ) .\nSimilar trends were observed for remission although not statistically significant .\nWe show for the first time a pharmacogenetic effect of the HTR2A rs7997012 variant in anxiety disorders , suggesting that pharmacogenetic effects cross diagnostic categories .\nOur data document that individuals with the HTR2A rs7997012 single nucleotide polymorphism G-allele have better treatment outcome over time .\nFuture studies with larger sample sizes are necessary to further characterize this effect in treatment response to antidepressants in GAD .",
    "entities": [{
        "entity": "TRIVIAL",
        "value": "Serotonin",
        "start_index": 0,
        "end_index": 9
    }, {
        "entity": "TRIVIAL",
        "value": "venlafaxine",
        "start_index": 81,
        "end_index": 92
    }, {
        "entity": "TRIVIAL",
        "value": "serotonin",
        "start_index": 409,
        "end_index": 418
    }, {
        "entity": "TRIVIAL",
        "value": "venlafaxine",
        "start_index": 625,
        "end_index": 636
    }, {
        "entity": "TRIVIAL",
        "value": "venlafaxine",
        "start_index": 752,
        "end_index": 763
    }, {
        "entity": "FAMILY",
        "value": "nucleotide",
        "start_index": 1800,
        "end_index": 1810
    }]
}

The environment variables for the previous example would be as follows:

dataset.input_format: json
dataset.input_column_name: text
dataset.output_column_name: entities
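
Before training on JSON data, it can help to verify that every entity's offsets actually reproduce its value. The sketch below does this for a shortened copy of the sample record; it assumes that end_index is exclusive, as the sample offsets suggest. The function name `check_offsets` and its parameters are hypothetical.

```python
def check_offsets(record, text_key="text", entities_key="entities"):
    """Return the entities whose offsets do not reproduce their value.

    Assumes end_index is exclusive, so text[start_index:end_index] == value.
    """
    text = record[text_key]
    return [
        entity
        for entity in record[entities_key]
        if text[entity["start_index"]:entity["end_index"]] != entity["value"]
    ]


# Shortened version of the sample record above (first sentence only).
record = {
    "text": "Serotonin receptor 2A ( HTR2A ) gene polymorphism predicts "
            "treatment response to venlafaxine XR",
    "entities": [
        {"entity": "TRIVIAL", "value": "Serotonin", "start_index": 0, "end_index": 9},
        {"entity": "TRIVIAL", "value": "venlafaxine", "start_index": 81, "end_index": 92},
    ],
}

print(check_offsets(record))
```

An empty result means that all offsets are consistent with the text.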

ai_center file format

This is the default format, and also the export format of the data labelling tool in AI Center. The model reads all files with a .json extension in the provided directory.

Check the following sample and environment variables for an ai_center file format example.

{
    "annotations": {
        "intent": {
            "to_name": "text",
            "choices": [
                "TransactionIssue",
                "LoanIssue"
            ]
        },
        "sentiment": {
            "to_name": "text",
            "choices": [
                "Very Positive"
            ]
        },
        "ner": {
            "to_name": "text",
            "labels": [
                {
                    "start_index": 37,
                    "end_index": 47,
                    "entity": "Stakeholder",
                    "value": " Citi Bank"
                },
                {
                    "start_index": 51,
                    "end_index": 61,
                    "entity": "Date",
                    "value": "07/19/2018"
                },
                {
                    "start_index": 114,
                    "end_index": 118,
                    "entity": "Amount",
                    "value": "$500"
                },
                {
                    "start_index": 288,
                    "end_index": 293,
                    "entity": "Stakeholder",
                    "value": " Citi"
                }
            ]
        }
    },
    "data": {
        "cc": "",
        "to": "xyz@abc.com",
        "date": "1/29/2020 12:39:01 PM",
        "from": "abc@xyz.com",
        "text": "I opened my new checking account with Citi Bank in 07/19/2018 and met the requirements for the promotion offer of $500 . It has been more than 6 months and I have not received any bonus. I called the customer service several times in the past few months but no any response. I request the Citi honor its promotion offer as advertised."
    }
}

To use the previous sample JSON, set the environment variables as follows:

dataset.input_format to ai_center
dataset.input_column_name to data.text
dataset.output_column_name to annotations.ner.labels
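
The column names above read as dotted paths into the nested JSON. The following sketch illustrates that interpretation on a shortened copy of the sample; it is an assumption about how the names map to the file, not a description of the package internals, and `resolve` is a hypothetical helper.

```python
from functools import reduce


def resolve(record, dotted_name):
    """Follow a dotted name such as "data.text" through nested dictionaries."""
    return reduce(lambda node, key: node[key], dotted_name.split("."), record)


# Shortened version of the ai_center sample above.
record = {
    "annotations": {
        "ner": {
            "to_name": "text",
            "labels": [
                {"start_index": 51, "end_index": 61, "entity": "Date", "value": "07/19/2018"},
            ],
        }
    },
    "data": {
        "text": "I opened my new checking account with Citi Bank in 07/19/2018 and met the requirements",
    },
}

text = resolve(record, "data.text")
labels = resolve(record, "annotations.ner.labels")
for label in labels:
    print(label["entity"], repr(text[label["start_index"]:label["end_index"]]))
```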

Environment variables

dataset.input_column_name
- The name of the column containing text.
- Default value is data.text.
- This variable is only needed if the input file format is ai_center or JSON.
dataset.target_column_name
- The name of the column containing labels.
- Default value is annotations.ner.labels.
- This variable is only needed if the input file format is ai_center or JSON.
model.epochs
- The number of epochs.
- Default value is 5.
dataset.input_format
- The input format of the training data.
- Default value is ai_center.
- Supported values are: ai_center, conll, label_studio or json.
  Note: The label_studio format is the same as the CoNLL format, with separation between two data points being a new empty line. To support separation between two data points with -DOCSTART- -X- O O, add dataset.input_format as an environment variable and set its value to conll.
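
The defaults listed above can be summarized as a small settings table. The sketch below merges user overrides with those defaults and rejects an unsupported input format. It only restates the list in code form; the names `DEFAULTS`, `SUPPORTED_FORMATS`, and `effective_settings` are hypothetical, and the package itself may validate its settings differently.

```python
# Default values as listed in the Environment variables section.
DEFAULTS = {
    "dataset.input_format": "ai_center",
    "dataset.input_column_name": "data.text",
    "dataset.target_column_name": "annotations.ner.labels",
    "model.epochs": 5,
}

SUPPORTED_FORMATS = {"ai_center", "conll", "label_studio", "json"}


def effective_settings(overrides):
    """Merge overrides into the defaults and check the input format."""
    settings = {**DEFAULTS, **overrides}
    if settings["dataset.input_format"] not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported dataset.input_format: {settings['dataset.input_format']!r}"
        )
    return settings


print(effective_settings({"dataset.input_format": "conll", "model.epochs": 10}))
```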

Artifacts

Evaluation report: a PDF file containing the following information in a human-readable format:
- Classification report
- Confusion matrix
- Precision Recall Information
Separate JSON files corresponding to each section of the Evaluation Report PDF file. These JSON files are machine-readable and you can use them to pipe the model evaluation into Insights using the workflow.

Classification report

The classification report is derived from the test dataset when running a full or evaluation pipeline. It contains the following information for every entity, in the form of a diagram:

Entity - The name of the entity.
Precision - The precision metric for correctly predicting the entity over the test set.
Recall - The recall metric for correctly predicting the entity over the test set.
F1 score - The F1 score metric for correctly predicting the entity over the test set; you can use this score to compare the entity-based performance of two differently trained versions of this model.
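
For reference, the three metrics are related by the standard definitions sketched below. How the package counts a match (for example, exact or partial span overlap) is not stated here, so the counts passed in are purely illustrative.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 from per-entity counts.

    Standard definitions; zero denominators yield 0.0 instead of an error.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative counts: 8 correct predictions, 2 spurious, 2 missed.
print(precision_recall_f1(8, 2, 2))
```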

Confusion matrix

A table explaining the different categories of error is also provided under the confusion matrix. The error categories per entity are correct, incorrect, missed, and spurious.

Precision recall information

You can use this information to check the precision-recall trade-off of the model. The thresholds and the corresponding precision and recall values are also provided in a table above the diagram for every entity. This table lets you choose the threshold to configure in your workflow, which decides when data is sent to Action Center for human-in-the-loop validation. Note that the higher the chosen threshold, the more data is routed to Action Center.

There is a precision-recall diagram and table for each entity.

For an example of a precision-recall table per entity, see the table below.

threshold	precision	recall
0.5	0.9193	0.979
0.55	0.9224	0.9777
0.6	0.9234	0.9771
0.65	0.9256	0.9771
0.7	0.9277	0.9759
0.75	0.9319	0.9728
0.8	0.9356	0.9697
0.85	0.9412	0.9697
0.9	0.9484	0.9666
0.95	0.957	0.9629
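
One way to pick a threshold from such a table is to take the lowest threshold that reaches a target precision, which keeps recall as high as possible. The sketch below applies this rule to the sample rows above; the target value and the function name are hypothetical, and other selection rules may suit your workflow better.

```python
# (threshold, precision, recall) rows copied from the sample table above.
PR_TABLE = [
    (0.5, 0.9193, 0.979),
    (0.55, 0.9224, 0.9777),
    (0.6, 0.9234, 0.9771),
    (0.65, 0.9256, 0.9771),
    (0.7, 0.9277, 0.9759),
    (0.75, 0.9319, 0.9728),
    (0.8, 0.9356, 0.9697),
    (0.85, 0.9412, 0.9697),
    (0.9, 0.9484, 0.9666),
    (0.95, 0.957, 0.9629),
]


def lowest_threshold_for_precision(rows, target_precision):
    """Return the row with the lowest threshold whose precision meets the target.

    Returns None when no row reaches the target precision.
    """
    for threshold, precision, recall in sorted(rows):
        if precision >= target_precision:
            return threshold, precision, recall
    return None


print(lowest_threshold_for_precision(PR_TABLE, 0.94))
```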

For an example of a precision-recall diagram per entity, see the figure below.

Data

Evaluation CSV file

This is a CSV file with predictions on the test set used for evaluation. The file contains the columns:

Text - The text used for evaluation.
Actual_entities - The entities that were provided as labeled data in the evaluation dataset.
Predicted_entities - The entities that the trained model predicted.
Error_type_counts - The difference between the actual entities and predicted entities categorized by error types.
