AI Center - Custom Named Entity Recognition

Custom Named Entity Recognition

Train and deploy the Custom Named Entity Recognition ML package in AI Center using CoNLL or JSON annotated datasets.

Out of the Box Packages > UiPath Language Analysis > CustomNamedEntityRecognition

Note:

This ML package is deprecated. For more information, check the Deprecation timeline page from the Overview guide.

This model allows you to bring your own dataset tagged with the entities you want to extract. The training and evaluation datasets must be in either CoNLL or JSON format. The data can also be exported from the AI Center Data Labelling tool or from Label Studio. This ML package must be trained before it is deployed: if you deploy it without training it first, the deployment fails with an error stating that the model is not trained.

For an example of how to use this model, check the Extracting chemicals from research paper by category page.

Recommendations

When to use the Custom Named Entity Recognition (NER) model

Use the Custom NER model to extract:

special information from the text. This information is called an entity.
the names of people, places, organisations, locations, dates, numerical values, etc. The extracted entities are mutually exclusive. Entities are at the single-word or multi-word level, not at the sub-word level. For example, in the sentence "I live in New York", New York can be an entity, but in the sentence "I read the New Yorker" it cannot.

You can use the extracted entities directly in the information extraction processes or as inputs to the downstream tasks like classification of the source text, sentiment analysis of the source text, PHI, etc.

Training dataset recommendations

Have at least 200 samples per entity if the entities are dense in the samples, meaning that most of the samples (more than 75%) contain 3-5 of these entities.
If the entities are sparse (every sample has fewer than three entities), that is, only a few of all the entities appear in most of the documents, have at least 400 samples per entity. This helps the model better understand the discriminative features.
If there are more than 10 entities, add 100 more samples in an incremental way until you reach the desired performance metric.
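
The sample counts above can be checked before training. The following is a minimal sketch, assuming records shaped like the JSON file format described later on this page (an `entities` list whose items carry an `entity` label). The function names and the tiny dataset are illustrative and are not part of the ML package.

```python
from collections import Counter

# Minimum samples per entity, taken from the recommendations above.
MIN_SAMPLES_DENSE = 200
MIN_SAMPLES_SPARSE = 400


def samples_per_entity(records):
    """Count how many samples mention each entity label at least once."""
    counts = Counter()
    for record in records:
        labels = {item["entity"] for item in record.get("entities", [])}
        counts.update(labels)
    return counts


def underrepresented(records, minimum):
    """Return the labels that appear in fewer than `minimum` samples."""
    counts = samples_per_entity(records)
    return sorted(label for label, n in counts.items() if n < minimum)


# Tiny hypothetical dataset, only to show the call shape.
records = [
    {"text": "Alice moved to Paris.", "entities": [{"entity": "PER"}, {"entity": "LOC"}]},
    {"text": "Bob stayed home.", "entities": [{"entity": "PER"}]},
]
print(samples_per_entity(records))
print(underrepresented(records, MIN_SAMPLES_DENSE))
```

Pass `MIN_SAMPLES_SPARSE` instead when most samples contain fewer than three entities.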

Best practices

Have meaningful entities; if a human cannot identify an entity, neither can a model.
Have simple entities. Instead of a single address entity, break it down into multiple entities: street name, state name, city name, zip code, etc.
Create both train and test datasets, and use a full pipeline for training.
Start with a minimum number of samples for annotation, covering all the entities.
Make sure all the entities are represented in both the train and test split.
Run a full pipeline and check the test metrics. If the test metric is not satisfactory, check the classification report and identify the ill-performing entities. Add more samples that cover the ill-performing entities and repeat the training process, until the desired metric is reached.
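
The requirement that every entity appears in both splits can be verified with a short script. This is a sketch under the same assumption as the JSON file format on this page (an `entities` list with an `entity` label per item). The helper names and sample records are hypothetical.

```python
def entity_labels(records):
    """Collect the set of entity labels used anywhere in `records`."""
    return {
        item["entity"]
        for record in records
        for item in record.get("entities", [])
    }


def missing_from_split(train_records, test_records):
    """Return the labels that occur in only one of the two splits."""
    train_labels = entity_labels(train_records)
    test_labels = entity_labels(test_records)
    return {
        "missing_in_train": sorted(test_labels - train_labels),
        "missing_in_test": sorted(train_labels - test_labels),
    }


# Hypothetical splits: the test split has no DATE sample.
train = [{"entities": [{"entity": "PER"}, {"entity": "DATE"}]}]
test = [{"entities": [{"entity": "PER"}]}]
print(missing_from_split(train, test))
```

An empty list under both keys means the splits cover the same set of entities.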

Languages

This multilingual model supports the languages from the following list. These languages were chosen because they are the top 100 languages with the largest Wikipedias:

Afrikaans
Albanian
Arabic
Aragonese
Armenian
Asturian
Azerbaijani
Bashkir
Basque
Bavarian
Belarusian
Bengali
Bishnupriya Manipuri
Bosnian
Breton
Bulgarian
Burmese
Catalan
Cebuano
Chechen
Chinese (Simplified)
Chinese (Traditional)
Chuvash
Croatian
Czech
Danish
Dutch
English
Estonian
Finnish
French
Galician
Georgian
German
Greek
Gujarati
Haitian
Hebrew
Hindi
Hungarian
Icelandic
Ido
Indonesian
Irish
Italian
Japanese
Javanese
Kannada
Kazakh
Kirghiz
Korean
Latin
Latvian
Lithuanian
Lombard
Low Saxon
Luxembourgish
Macedonian
Malagasy
Malay
Malayalam
Marathi
Minangkabau
Mongolian
Nepali
Newar
Norwegian (Bokmal)
Norwegian (Nynorsk)
Occitan
Persian (Farsi)
Piedmontese
Polish
Portuguese
Punjabi
Romanian
Russian
Scots
Serbian
Serbo-Croatian
Sicilian
Slovak
Slovenian
South Azerbaijani
Spanish
Sundanese
Swahili
Swedish
Tagalog
Tajik
Tamil
Tatar
Telugu
Thai
Turkish
Ukrainian
Urdu
Uzbek
Vietnamese
Volapük
Waray-Waray
Welsh
West Frisian
Western Punjabi
Yoruba

Model details

Input description

Text in one of the supported languages from which entities will be extracted.

Output description

List of named entities in the text. Each element in the list has the following items in the prediction:

Text that was recognized
Starting and ending positions of the text, character-wise
Type of the named entity
Confidence
```
{
 "response" : [{
   "value": "George Washington",
   "start_index": 0,
   "end_index": 17,
   "entity": "PER",
   "confidence": 0.96469810605049133 
  }]
}
```
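
A response of this shape can be consumed with a few lines of Python. The sketch below parses the sample response and separates entities by confidence. The threshold value and the function name are illustrative choices, not settings of the ML package.

```python
import json

# The sample response from this page, as a JSON string.
raw = (
    '{"response": [{"value": "George Washington", "start_index": 0, '
    '"end_index": 17, "entity": "PER", "confidence": 0.9646981060504913}]}'
)


def split_by_confidence(raw_json, threshold):
    """Split predicted entities into accepted and needs-review lists."""
    entities = json.loads(raw_json)["response"]
    accepted = [e for e in entities if e["confidence"] >= threshold]
    review = [e for e in entities if e["confidence"] < threshold]
    return accepted, review


accepted, review = split_by_confidence(raw, threshold=0.9)
for entity in accepted:
    print(entity["entity"], entity["value"], entity["start_index"], entity["end_index"])
```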

Recommend GPU

By default, a GPU is recommended.

Pipelines

All three types of pipelines (Full Training, Training, and Evaluation) are supported by this package. For most use cases, no parameters need to be specified; the model uses advanced techniques to find a performant model. In trainings after the first one, the model uses incremental learning (that is, the previously trained version, as it was at the end of a training run, is used).

Fine-tuning using data From Validation Station

You can use the Label Studio APIs to write back the data and predictions with weak confidence. Your data can then be re-labeled and exported in CoNLL format.

For more information on how to use Label Studio, check the Getting started with Label Studio page. Also, you can download the UiPath® Studio activity for Label Studio Integration from the following URL UiPath Studio activity.

Alternatively, you can leverage the data labelling feature in AI Center.

Training on GPU or CPU

You can use either GPU or CPU for training. We recommend using GPU since it's faster.

Dataset format

This model supports reading all files in a given directory during all pipeline runs (training, evaluation, and full pipeline).

Note:

Make sure that label names do not contain any spaces or special characters. For example, instead of Set Date, use SetDate.
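
Label names can be checked before training. The sketch below assumes that only ASCII letters and digits are safe, which is a conservative reading of "no spaces or special characters". The exact character set the package accepts is not stated here, and the function names are hypothetical.

```python
import re


def is_safe_label(name):
    """Return True if the label contains only ASCII letters and digits."""
    return re.fullmatch(r"[A-Za-z0-9]+", name) is not None


def to_safe_label(name):
    """Drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^A-Za-z0-9]", "", name)


print(is_safe_label("Set Date"))  # False
print(to_safe_label("Set Date"))  # SetDate
```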

CoNLL file format

This model can read all files with a .txt and/or .conll extension using the CoNLL file format in the provided directory.

The CoNLL file format represents a body of text with one word per line, each word containing 10 tab-separated columns with information about the word (for example, surface and syntax).

The trainable named entity recognition supports two CoNLL formats:

With just two columns in the text.
With four columns in the text.

To use this format, set the dataset.input_format environment variable to either conll or label_studio.

Note:

The label_studio format is the same as the CoNLL format, with separation between two data points being a new empty line. To support separation between two data points with -DOCSTART- -X- O O, add dataset.input_format as an environment variable and set its value to conll.

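For more information, check the following examples. The first uses four columns (token, part-of-speech tag, chunk tag, entity tag), and the second uses two columns (token, entity tag):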

Japan NNP B-NP B-LOC
began VBD B-VP O
the DT B-NP O
defence NN I-NP O
of IN B-PP O
their PRP$ B-NP O
Asian JJ I-NP B-MISC
Cup NNP I-NP I-MISC
title NN I-NP O
with IN B-PP O
a DT B-NP O
lucky JJ I-NP O
2-1 CD I-NP O
win VBP B-VP O
against IN B-PP O
Syria NNP B-NP B-LOC
in IN B-PP O
a DT B-NP O
Group NNP I-NP O
C NNP I-NP O
championship NN I-NP O
match NN I-NP O
on IN B-PP O
Friday NNP B-NP O
. . O O

Founding O
member O
Kojima B-PER
Minoru I-PER
played O
guitar O
on O
Good B-MISC
Day I-MISC
, O
and O
Wardanceis I-MISC
cover O
of O
a O
song O
by O
UK I-LOC
post O
punk O
industrial O
band O
Killing B-ORG
Joke I-ORG
. O
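
To inspect CoNLL files before training, a small reader is enough. The following sketch is not the package's own loader. It assumes whitespace-separated columns with the entity tag in the last column, and for simplicity it treats both an empty line and a -DOCSTART- line as a separator between data points.

```python
def parse_conll(text):
    """Parse two- or four-column CoNLL text into a list of data points.

    Each data point is a list of (token, entity_tag) pairs. The entity tag
    is read from the last column, so both layouts are handled the same way.
    """
    documents, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # An empty line or a -DOCSTART- line closes the current data point.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                documents.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        documents.append(current)
    return documents


sample = (
    "Japan NNP B-NP B-LOC\n"
    "began VBD B-VP O\n"
    "\n"
    "Kojima B-PER\n"
    "Minoru I-PER\n"
)
for document in parse_conll(sample):
    print(document)
```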

JSON file format

When the environment variables are set for the JSON format, this model reads all files with a .json extension in the provided directory.

Check the following sample and environment variables for a JSON file format example.

{
    "text": "Serotonin receptor 2A ( HTR2A ) gene polymorphism predicts treatment response to venlafaxine XR in generalized anxiety disorder . anxiety disorder ( GAD ) is a chronic psychiatric disorder with significant morbidity and mortality .\nAntidepressant drugs are the preferred choice for treatment ; however , treatment response is often variable .\nSeveral studies in major depression have implicated a role of the serotonin receptor gene ( HTR2A ) in treatment response to antidepressants .\nWe tested the hypothesis that the genetic polymorphism rs7997012 in the HTR2A gene predicts treatment outcome in GAD patients treated with venlafaxine XR . Treatment response was assessed in 156 patients that participated in a 6-month open - label clinical trial of venlafaxine XR for GAD . Primary analysis included Hamilton Anxiety Scale ( HAM-A ) reduction at 6 months .\nSecondary outcome measure was the Clinical Global Impression of Improvement ( CGI-I ) score at 6 months .\nGenotype and allele frequencies were compared between groups using χ(2) contingency analysis .\nThe frequency of the G-allele differed significantly between responders ( 70% ) and nonresponders ( 56% ) at 6 months ( P=0.05 ) using the HAM-A scale as outcome measure .\nSimilarly , using the CGI-I as outcome , the G-allele was significantly associated with improvement ( P=0.01 ) .\nAssuming a dominant effect of the G-allele , improvement differed significantly between groups ( P=0.001 , odds ratio=4.72 ) .\nSimilar trends were observed for remission although not statistically significant .\nWe show for the first time a pharmacogenetic effect of the HTR2A rs7997012 variant in anxiety disorders , suggesting that pharmacogenetic effects cross diagnostic categories .\nOur data document that individuals with the HTR2A rs7997012 single nucleotide polymorphism G-allele have better treatment outcome over time .\nFuture studies with larger sample sizes are necessary to further characterize this effect in treatment response to antidepressants in GAD .",
    "entities": [{
        "entity": "TRIVIAL",
        "value": "Serotonin",
        "start_index": 0,
        "end_index": 9
    }, {
        "entity": "TRIVIAL",
        "value": "venlafaxine",
        "start_index": 81,
        "end_index": 92
    }, {
        "entity": "TRIVIAL",
        "value": "serotonin",
        "start_index": 409,
        "end_index": 418
    }, {
        "entity": "TRIVIAL",
        "value": "venlafaxine",
        "start_index": 625,
        "end_index": 636
    }, {
        "entity": "TRIVIAL",
        "value": "venlafaxine",
        "start_index": 752,
        "end_index": 763
    }, {
        "entity": "FAMILY",
        "value": "nucleotide",
        "start_index": 1800,
        "end_index": 1810
    }]
}

The environment variables for the previous example would be as follows:

dataset.input_format: json
dataset.input_column_name: text
dataset.output_column_name: entities
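
Because the JSON format relies on character offsets, it can help to confirm that each annotation's offsets actually select its value. The sketch below does this for one record. The function name is hypothetical, and the record is a shortened excerpt of the sample above.

```python
def offset_errors(record, text_key="text", entities_key="entities"):
    """Return (expected, found) pairs for annotations whose offsets are off."""
    text = record[text_key]
    errors = []
    for item in record[entities_key]:
        # end_index is exclusive, as in the sample (0-9 selects "Serotonin").
        span = text[item["start_index"]:item["end_index"]]
        if span != item["value"]:
            errors.append((item["value"], span))
    return errors


record = {
    "text": (
        "Serotonin receptor 2A ( HTR2A ) gene polymorphism predicts "
        "treatment response to venlafaxine XR"
    ),
    "entities": [
        {"entity": "TRIVIAL", "value": "Serotonin", "start_index": 0, "end_index": 9},
        {"entity": "TRIVIAL", "value": "venlafaxine", "start_index": 81, "end_index": 92},
    ],
}
print(offset_errors(record))  # []
```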

ai_center file format

This is the default format and also the export format of the data labelling tool in AI Center. The model reads all files with a .json extension in the provided directory.

Check the following sample and environment variables for an ai_center file format example.

{
    "annotations": {
        "intent": {
            "to_name": "text",
            "choices": [
                "TransactionIssue",
                "LoanIssue"
            ]
        },
        "sentiment": {
            "to_name": "text",
            "choices": [
                "Very Positive"
            ]
        },
        "ner": {
            "to_name": "text",
            "labels": [
                {
                    "start_index": 37,
                    "end_index": 47,
                    "entity": "Stakeholder",
                    "value": " Citi Bank"
                },
                {
                    "start_index": 51,
                    "end_index": 61,
                    "entity": "Date",
                    "value": "07/19/2018"
                },
                {
                    "start_index": 114,
                    "end_index": 118,
                    "entity": "Amount",
                    "value": "$500"
                },
                {
                    "start_index": 288,
                    "end_index": 293,
                    "entity": "Stakeholder",
                    "value": " Citi"
                }
            ]
        }
    },
    "data": {
        "cc": "",
        "to": "xyz@abc.com",
        "date": "1/29/2020 12:39:01 PM",
        "from": "abc@xyz.com",
        "text": "I opened my new checking account with Citi Bank in 07/19/2018 and met the requirements for the promotion offer of $500 . It has been more than 6 months and I have not received any bonus. I called the customer service several times in the past few months but no any response. I request the Citi honor its promotion offer as advertised."
    }
}

To use the previous sample JSON, set the environment variables as follows:

dataset.input_format to ai_center
dataset.input_column_name to data.text
dataset.output_column_name to annotations.ner.labels
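
The dotted names data.text and annotations.ner.labels describe a path through the nested JSON. How the package resolves them internally is not documented here. The sketch below shows one plausible reading, a key-by-key lookup, using a shortened version of the sample record and a hypothetical helper name.

```python
def get_by_path(record, dotted_path):
    """Follow a dotted path such as 'annotations.ner.labels' through nested dicts."""
    value = record
    for key in dotted_path.split("."):
        value = value[key]
    return value


# Shortened version of the ai_center sample above.
record = {
    "annotations": {
        "ner": {
            "to_name": "text",
            "labels": [
                {"start_index": 37, "end_index": 47, "entity": "Stakeholder", "value": " Citi Bank"},
                {"start_index": 51, "end_index": 61, "entity": "Date", "value": "07/19/2018"},
            ],
        }
    },
    "data": {"text": "I opened my new checking account with Citi Bank in 07/19/2018"},
}

text = get_by_path(record, "data.text")
for label in get_by_path(record, "annotations.ner.labels"):
    print(label["entity"], repr(text[label["start_index"]:label["end_index"]]))
```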

Environment variables

dataset.input_column_name
- The name of the column containing text.
- Default value is data.text.
- This variable is only needed if the input file format is ai_center or JSON.
dataset.target_column_name
- The name of the column containing labels.
- Default value is annotations.ner.labels.
- This variable is only needed if the input file format is ai_center or JSON.
model.epochs
- The number of epochs.
- Default value is 5.
dataset.input_format
- The input format of the training data.
- Default value is ai_center.
- Supported values are: ai_center, conll, label_studio or json.
  Note:
  The label_studio format is the same as the CoNLL format, with separation between two data points being a new empty line. To support separation between two data points with -DOCSTART- -X- O O, add dataset.input_format as an environment variable and set its value to conll.
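
The defaults and supported values listed above can be mirrored in a small helper, for example to validate pipeline parameters before starting a run. This is only a sketch: the dictionary restates the list above, and the function name and validation behavior are assumptions rather than part of AI Center.

```python
# Defaults as listed in this section.
DEFAULTS = {
    "dataset.input_column_name": "data.text",
    "dataset.target_column_name": "annotations.ner.labels",
    "model.epochs": 5,
    "dataset.input_format": "ai_center",
}
SUPPORTED_FORMATS = {"ai_center", "conll", "label_studio", "json"}


def resolve_settings(overrides=None):
    """Merge user overrides into the defaults and validate the input format."""
    settings = {**DEFAULTS, **(overrides or {})}
    if settings["dataset.input_format"] not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported dataset.input_format: {settings['dataset.input_format']!r}"
        )
    # Values often arrive as strings, so normalize the epoch count.
    settings["model.epochs"] = int(settings["model.epochs"])
    return settings


print(resolve_settings({"dataset.input_format": "conll", "model.epochs": "10"}))
```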

Artifacts

Artifacts contain the following:

Evaluation report, containing the following files:
- Classification report
- Confusion matrix
- Precision Recall Information
JSON files: separate JSON files corresponding to each section of the Evaluation Report PDF file. These JSON files are machine-readable and you can use them to pipe the model evaluation into Insights using the workflow.

Classification report

The classification report is derived from the test dataset when running a full or evaluation pipeline. It contains the following information for every entity in the form of a diagram:

Entity - The name of the entity.
Precision - The precision metric for correctly predicting the entity over the test set.
Recall - The recall metric for correctly predicting the entity over the test set.
F1 score - The F1 score metric for correctly predicting the entity over the test set; you can use this score to compare the entity-based performance of two differently trained versions of this model.

Confusion matrix

A table explaining the different error categories is also provided under the confusion matrix. The error categories per entity are correct, incorrect, missed, and spurious.
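
The four error categories relate to precision and recall in a simple way. The sketch below uses a common convention for entity-level scoring, in which incorrect predictions count against both precision and recall. Whether this package computes its metrics exactly this way is an assumption, and the counts are made up for illustration.

```python
def scores(correct, incorrect, missed, spurious):
    """Compute precision, recall, and F1 from per-entity error counts."""
    predicted = correct + incorrect + spurious  # everything the model output
    actual = correct + incorrect + missed       # everything in the labeled data
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts for one entity.
precision, recall, f1 = scores(correct=8, incorrect=1, missed=1, spurious=1)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```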

Precision recall information

You can use this information to check the precision-recall trade-off of the model. The thresholds and the corresponding precision and recall values are also provided in a table above the diagram for every entity. This table lets you choose the threshold to configure in your workflow to decide when to send data to Action Center for human in the loop. Note that the higher the chosen threshold, the more data gets routed to Action Center for human in the loop.

There is a precision-recall diagram and table for each entity.

For an example of a precision-recall table per entity, check the following table:

threshold	precision	recall
0.5	0.9193	0.979
0.55	0.9224	0.9777
0.6	0.9234	0.9771
0.65	0.9256	0.9771
0.7	0.9277	0.9759
0.75	0.9319	0.9728
0.8	0.9356	0.9697
0.85	0.9412	0.9697
0.9	0.9484	0.9666
0.95	0.957	0.9629
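
A table like this can be used to pick a threshold programmatically. The sketch below copies the rows above and returns the lowest threshold that reaches a target precision, which keeps the amount of data routed for review as small as possible. The function name and the target value are illustrative.

```python
# (threshold, precision, recall) rows copied from the table above.
PR_TABLE = [
    (0.5, 0.9193, 0.979),
    (0.55, 0.9224, 0.9777),
    (0.6, 0.9234, 0.9771),
    (0.65, 0.9256, 0.9771),
    (0.7, 0.9277, 0.9759),
    (0.75, 0.9319, 0.9728),
    (0.8, 0.9356, 0.9697),
    (0.85, 0.9412, 0.9697),
    (0.9, 0.9484, 0.9666),
    (0.95, 0.957, 0.9629),
]


def lowest_threshold_for_precision(table, target_precision):
    """Return the first row, by ascending threshold, that meets the target."""
    for threshold, precision, recall in sorted(table):
        if precision >= target_precision:
            return threshold, precision, recall
    return None  # no threshold in the table reaches the target


print(lowest_threshold_for_precision(PR_TABLE, 0.94))  # (0.85, 0.9412, 0.9697)
```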

For an example of a precision-recall diagram per entity, check the following figure:

Data

Evaluation CSV file

This is a CSV file with predictions on the test set used for evaluation. The file contains the columns:

Text - The text used for evaluation.
Actual_entities - The entities that were provided as labeled data in the evaluation dataset.
Predicted_entities - The entities that the trained model predicted.
Error_type_counts - The difference between the actual entities and predicted entities categorized by error types.
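
The evaluation CSV can be filtered to find the samples worth reviewing. The sketch below uses the column names listed above. How the entity cells are serialized is not described on this page, so the sample rows are hypothetical and the comparison is done on the raw cell strings.

```python
import csv
import io

# Hypothetical rows; real cells may hold richer structures than plain labels.
SAMPLE_CSV = (
    "Text,Actual_entities,Predicted_entities,Error_type_counts\n"
    "George Washington was born in 1732,PER,PER,\n"
    "I live in New York,LOC,,missed\n"
)


def rows_with_differences(csv_file):
    """Return the Text of every row whose prediction differs from the label."""
    reader = csv.DictReader(csv_file)
    return [
        row["Text"]
        for row in reader
        if row["Actual_entities"] != row["Predicted_entities"]
    ]


print(rows_with_differences(io.StringIO(SAMPLE_CSV)))
```

To read a real file, pass an open file handle, for example `open(path, newline="", encoding="utf-8")`, instead of the `io.StringIO` object.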
