AI Center User Guide
Use Custom NER with continuous learning
This example shows how to extract chemicals mentioned in research papers and categorize them by type. By following the procedure below, you will extract the chemicals and classify them as ABBREVIATION, FAMILY, FORMULA, IDENTIFIER, MULTIPLE, SYSTEMATIC, TRIVIAL, or NO_CLASS.
When to use the Custom Named Entity Recognition (NER) model
Use the Custom NER model to extract:
- special information from the text; this information is called an entity.
- the names of people, places, organisations, locations, dates, numerical values, etc.

The extracted entities are mutually exclusive. Entities are at the single-word or multi-word level, not at the sub-word level. For example, in the sentence "I live in New York", New York can be an entity, but it cannot be one in the sentence "I read the New Yorker", where it is only part of the name New Yorker.
You can use the extracted entities directly in information extraction processes or as inputs to downstream tasks such as classification of the source text, sentiment analysis of the source text, PHI, etc.
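For the chemicals use case in this example, an extraction could look like the following (illustrative only; the exact output format depends on the ML Skill):

Text: Aspirin is the trivial name for acetylsalicylic acid (C9H8O4).
Entities: Aspirin - TRIVIAL, acetylsalicylic acid - SYSTEMATIC, C9H8O4 - FORMULA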
Training dataset recommendations
- Have at least 200 samples per entity if the entities are dense in the samples, meaning that most of the samples (more than 75%) contain 3-5 of these entities (the sketch after this list shows a quick way to count mentions per entity).
- If the entities are sparse (every sample has fewer than three entities), that is, only a few of all the entities appear in most of the documents, it is recommended to have at least 400 samples per entity. This helps the model better learn the discriminative features.
- If there are more than 10 entities, add samples in increments of 100 until you reach the desired performance metric.
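If your data is in CoNLL format, such as the sample datasets used in this procedure, a rough way to check how well each entity is covered is to count the B- tags. This is only a sketch: it assumes BIO tags in the last column, uses train.conll as a placeholder file name, and counts entity mentions rather than samples, so treat the numbers as an approximation:

# Count entity mentions per label in a CoNLL-style file (BIO tags assumed in the last column).
awk '$NF ~ /^B-/ {sub(/^B-/, "", $NF); print $NF}' train.conll | sort | uniq -c | sort -rn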
Best practices
- Have meaningful entities; if a human cannot identify an entity, neither can a model.
- Have simple entities. Instead of a single address entity, break it down into multiple entities: street name, state name, city name, zip code, etc.
- Create both train and test datasets, and use a full pipeline for training.
- Start with a minimum number of samples for annotation, covering all the entities.
- Make sure all the entities are represented in both the train and test splits (the sketch after this list shows a quick way to check this).
- Run a full pipeline and check the test metrics. If the test metrics are not satisfactory, check the classification report, identify the ill-performing entities, add more samples that cover them, and repeat the training process until the desired metric is reached.
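To check that every entity is represented in both splits, you can compare the label sets of the two files. The sketch below again assumes BIO tags in the last column and uses train.conll and test.conll as placeholder file names:

# List the entity labels found in each split.
awk '$NF ~ /^B-/ {sub(/^B-/, "", $NF); print $NF}' train.conll | sort -u > train_labels.txt
awk '$NF ~ /^B-/ {sub(/^B-/, "", $NF); print $NF}' test.conll | sort -u > test_labels.txt
# Any label printed here appears in only one of the two splits.
comm -3 train_labels.txt test_labels.txt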
This procedure uses the Custom Named Entity Recognition package. For more information on how this package works and what it can be used for, see Custom Named Entity Recognition.
For this procedure, we have provided sample files as follows:
- Pre-labeled training dataset in CoNLL format. You can download it from here.
- Pre-labeled test dataset. You can download it from here.
- Sample workflow for extracting categories of chemicals mentioned in research papers. You can download it from here.
Note: Make sure that the following variables are filled in within the sample file:
- in_emailAdress - the email address to which the Action Center task will be assigned
- in_MLSkillEndpoint - the public endpoint of the ML Skill
- in_MLSkillAPIKey - the API key of the ML Skill
- in_labelStudioEndpoint - optional, to enable continuous labeling: provide the import URL of a Label Studio project
To get started with Label Studio and export data to AI Center, follow the instructions below.
- Install Label Studio on your local machine or cloud instance. To do so, follow the instructions from here.
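For example, one common way to install and start Label Studio locally is through pip (this assumes a recent Python 3 environment; see the installation guide linked above for other options, such as Docker):

pip install label-studio
label-studio start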
- Create a new project from the Named Entity Recognition Template and define your Label Names.
- Make sure that the label names have no special characters or spaces. For example, instead of Set Date, use SetDate.
- Make sure that the value of the <Text> tag is "$text".
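As a reference, a minimal labeling configuration for this use case could look like the following. The label names below are the chemical categories used in this example, written without spaces; the exact configuration generated by the Named Entity Recognition Template may differ:

<View>
  <Labels name="label" toName="text">
    <Label value="ABBREVIATION"/>
    <Label value="FAMILY"/>
    <Label value="FORMULA"/>
    <Label value="IDENTIFIER"/>
    <Label value="MULTIPLE"/>
    <Label value="SYSTEMATIC"/>
    <Label value="TRIVIAL"/>
    <Label value="NO_CLASS"/>
  </Labels>
  <Text name="text" value="$text"/>
</View>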
- Upload the data using the API from here.
cURL request example:
curl --location --request POST 'https://<label-studio-instance>/api/projects/<id>/import' \
--header 'Content-Type: application/json' \
--header 'Authorization: Token <Token>' \
--data-raw '[ { "data": { "text": "<Text1>" } }, { "data": { "text": "<Text2>" } } ]'
- Annotate your data.
- Export the data in CoNLL 2003 format and upload it to AI Center.
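For reference, a CoNLL 2003 file contains one token per line together with its tags, with sentences separated by blank lines. The snippet below only illustrates the layout using the chemical categories from this example; the middle columns (part-of-speech and chunk tags) are placeholders, and the exact columns written by the exporter may differ:

Aspirin -X- _ B-TRIVIAL
is -X- _ O
acetylsalicylic -X- _ B-SYSTEMATIC
acid -X- _ I-SYSTEMATIC
( -X- _ O
C9H8O4 -X- _ B-FORMULA
) -X- _ O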
- Provide the Label Studio instance URL and API key in the provided sample workflow to capture incorrect and low-confidence predictions.