AI Center - Light Text Classification

2022.4

AI Center User Guide

Light Text Classification

Out of the Box Packages > UiPath Language Analysis > LightTextClassification

This is a generic, retrainable model for text classification. It supports all languages based on Latin characters, such as English, French, and Spanish. This ML Package must be trained before it is deployed; if you deploy it without training it first, the deployment fails with an error stating that the model is not trained. The model uses a Bag of Words representation and provides explainability based on n-grams.

Model details

Input type

JSON and CSV

Input description

Text to be classified as String: 'I loved this movie.'

Output description

JSON with the predicted class, its confidence (between 0 and 1), and, when BOW.explain_inference is set to True, the most important n-grams.

{
    "class": "7",
    "confidence": 0.1259827300369445,
    "ngrams": [
        [
            "like",
            1.3752658445706787
        ],
        [
            "like this",
            0.032029048484416685
        ]
    ]
}
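
As a minimal sketch of how a caller might consume this output, the following Python snippet parses the sample response and extracts the predicted class, the confidence, and the highest-weighted n-gram. The helper name summarize_prediction is hypothetical, and the handling of a missing "ngrams" property is an assumption rather than documented behavior.

```python
import json

# Sample response copied from the output description above.
RESPONSE = """
{
    "class": "7",
    "confidence": 0.1259827300369445,
    "ngrams": [
        ["like", 1.3752658445706787],
        ["like this", 0.032029048484416685]
    ]
}
"""


def summarize_prediction(raw: str) -> dict:
    """Parse a prediction and return the class, confidence, and top n-gram."""
    payload = json.loads(raw)
    # Assumption: "ngrams" may be missing when BOW.explain_inference is False,
    # so fall back to an empty list instead of failing.
    ngrams = payload.get("ngrams", [])
    top = max(ngrams, key=lambda pair: pair[1])[0] if ngrams else None
    return {
        "class": payload["class"],
        "confidence": float(payload["confidence"]),
        "top_ngram": top,
    }


summary = summarize_prediction(RESPONSE)
print(summary)
```

Each entry in "ngrams" is treated here as an [n-gram, weight] pair, which matches the shape of the sample.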

Recommended GPU

GPU is not required.

Training enabled

By default, training is enabled.

Pipelines

This package supports all three types of pipelines (Full Training, Training, and Evaluation). During training, the model runs a hyperparameter search to find the best-performing configuration. The search is enabled by default (through the BOW.hyperparameter_search.enable variable), and the parameters of the best-performing model are available in the Evaluation Report.
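
To illustrate the general idea of a time-limited hyperparameter search, here is a small Python sketch. It is not the package's algorithm, which this page does not describe: it swaps in a plain exhaustive grid search, and the function timed_search, the grid keys, and toy_score are all hypothetical stand-ins.

```python
import itertools
import time


def timed_search(grid, score, timeout_seconds):
    """Try parameter combinations until the grid is exhausted or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        if time.monotonic() >= deadline:
            break  # stop early, keeping the best result found so far
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Hypothetical search space, for illustration only.
GRID = {"ngram_max": [1, 2, 3], "min_df": [1, 2, 5]}


def toy_score(params):
    # Stand-in for "train and evaluate a model"; peaks at ngram_max=2, min_df=2.
    return -abs(params["ngram_max"] - 2) - abs(params["min_df"] - 2)


best_params, best_score = timed_search(GRID, toy_score, timeout_seconds=5)
print(best_params, best_score)
```

In a real search, the scoring function would train a candidate model and return an evaluation metric, and the time limit would bound the total search duration.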

Dataset format

Three options are available to structure your dataset for this model: JSON, CSV, and AI Center JSON format. The model reads all CSV and JSON files in the specified directory. For every format, the model expects two columns or two properties, whose names are given by dataset.input_column_name and dataset.target_column_name. The names of these two columns or properties are configurable using environment variables.

CSV file format

Each CSV file can have any number of columns, but only two will be used by the model. Those columns are specified by the dataset.input_column_name and dataset.target_column_name parameters.

Check the following sample and environment variables for a CSV file format example.

text, label
I like this movie, 7
I hated the acting, 9

The environment variables for the previous example would be as follows:

dataset.input_format: auto
dataset.input_column_name: text
dataset.target_column_name: label
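
Before uploading a CSV file, it can help to confirm that the two configured columns are readable. The following Python sketch reads the sample with the standard csv module and keeps only those two columns. The helper name read_examples is hypothetical, and this is a local sanity check, not the package's own loader.

```python
import csv
import io

# Sample data from the CSV example above (note the space after each comma).
CSV_TEXT = """text, label
I like this movie, 7
I hated the acting, 9
"""

INPUT_COLUMN = "text"    # value of dataset.input_column_name
TARGET_COLUMN = "label"  # value of dataset.target_column_name


def read_examples(handle, input_column, target_column):
    """Return (text, label) pairs, ignoring any other columns."""
    # skipinitialspace drops the blank that follows each comma in the sample.
    reader = csv.DictReader(handle, skipinitialspace=True)
    return [(row[input_column], row[target_column]) for row in reader]


examples = read_examples(io.StringIO(CSV_TEXT), INPUT_COLUMN, TARGET_COLUMN)
print(examples)
```

A KeyError from this check indicates that a configured column name does not match the file header.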

JSON file format

A single JSON file can contain multiple data points.

Check the following sample and environment variables for a JSON file format example.

[
  {
    "text": "I like this movie",
    "label": "7"
  },
  {
    "text": "I hated the acting",
    "label": "9"
  }
]

The environment variables for the previous example would be as follows:

dataset.input_format: auto
dataset.input_column_name: text
dataset.target_column_name: label
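
A similar local check can be sketched for the JSON format: parse the file and confirm that every data point carries both configured properties. The function name load_datapoints is hypothetical, and the validation rule is an assumption about what makes a usable record.

```python
import json

# Sample data from the JSON example above.
JSON_TEXT = """
[
  {"text": "I like this movie", "label": "7"},
  {"text": "I hated the acting", "label": "9"}
]
"""


def load_datapoints(raw, input_name="text", target_name="label"):
    """Return (text, label) pairs, raising if a record lacks either property."""
    records = json.loads(raw)
    missing = [
        index
        for index, record in enumerate(records)
        if input_name not in record or target_name not in record
    ]
    if missing:
        raise ValueError(f"records missing required properties: {missing}")
    return [(record[input_name], record[target_name]) for record in records]


datapoints = load_datapoints(JSON_TEXT)
print(datapoints)
```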

ai_center file format

ai_center is the default value of the dataset.input_format environment variable. With this format, the model reads all files with a .json extension in the provided directory.

Check the following sample and environment variables for an ai_center file format example.

{
    "annotations": {
        "intent": {
            "to_name": "text",
            "choices": [
                "TransactionIssue",
                "LoanIssue"
            ]
        },
        "sentiment": {
            "to_name": "text",
            "choices": [
                "Very Positive"
            ]
        },
        "ner": {
            "to_name": "text",
            "labels": [
                {
                    "start_index": 37,
                    "end_index": 47,
                    "entity": "Stakeholder",
                    "value": " Citi Bank"
                },
                {
                    "start_index": 51,
                    "end_index": 61,
                    "entity": "Date",
                    "value": "07/19/2018"
                },
                {
                    "start_index": 114,
                    "end_index": 118,
                    "entity": "Amount",
                    "value": "$500"
                },
                {
                    "start_index": 288,
                    "end_index": 293,
                    "entity": "Stakeholder",
                    "value": " Citi"
                }
            ]
        }
    },
    "data": {
        "cc": "",
        "to": "xyz@abc.com",
        "date": "1/29/2020 12:39:01 PM",
        "from": "abc@xyz.com",
        "text": "I opened my new checking account with Citi Bank in 07/19/2018 and met the requirements for the promotion offer of $500 . It has been more than 6 months and I have not received any bonus. I called the customer service several times in the past few months but no any response. I request the Citi honor its promotion offer as advertised."
    }
}

To use the previous sample JSON, set the environment variables as follows:

dataset.input_format: ai_center
dataset.input_column_name: data.text
dataset.target_column_name: annotations.intent.choices
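
The dotted names above appear to describe a path through the nested JSON structure. The following Python sketch shows that reading on a shortened copy of the sample. The helper resolve is hypothetical, and the path interpretation is an assumption based on the sample, not a description of the package's implementation.

```python
from functools import reduce

# Shortened copy of the sample above; only the fields used here are kept,
# and the text holds just the last sentence of the original message.
DOCUMENT = {
    "annotations": {
        "intent": {"to_name": "text", "choices": ["TransactionIssue", "LoanIssue"]},
        "sentiment": {"to_name": "text", "choices": ["Very Positive"]},
    },
    "data": {
        "text": "I request the Citi honor its promotion offer as advertised.",
    },
}


def resolve(document, dotted_name):
    """Follow a dotted name such as 'data.text' through nested dictionaries."""
    return reduce(lambda node, key: node[key], dotted_name.split("."), document)


text = resolve(DOCUMENT, "data.text")
labels = resolve(DOCUMENT, "annotations.intent.choices")
print(text)
print(labels)
```

Pointing the target name at annotations.sentiment.choices instead would select the sentiment labels from the same file.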

Training on GPU or CPU

GPU is not required for training.

Environment variables

dataset.input_column_name
- The name of the input column containing the text.
- Default value is data.text.
- Make sure that this variable is configured according to your input JSON or CSV file.
dataset.target_column_name
- The name of the target column containing the labels (classes).
- Default value is annotations.intent.choices.
- Make sure that this variable is configured according to your input JSON or CSV file.
dataset.input_format
- The input format of the training data.
- Default value is ai_center.
- Supported values are: ai_center or auto.
- If ai_center is selected, only JSON files are supported. Make sure that dataset.target_column_name points to the annotation that holds your labels, for example annotations.sentiment.choices.
- If auto is selected, both CSV and JSON files are supported.
BOW.hyperparameter_search.enable
- The default value for this parameter is True. If left enabled, the search finds the best-performing model within the given time limit and available compute resources.
- This will also generate a HyperparameterSearch_report PDF file to showcase variations of parameters that were tried.
BOW.hyperparameter_search.timeout
- The maximum time the hyperparameter search is allowed to run in seconds.
- Default value is 1800.
BOW.explain_inference
- When this is set to True and the model is served as an ML Skill, some of the most important n-grams are also returned along with the prediction at inference time.
- Default value is False.
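
As a compact summary of the variables above, the defaults can be written out as a mapping and combined with overrides. This Python sketch is only illustrative: the helper build_settings is hypothetical, and representing every value as a string is an assumption.

```python
# Default values as listed in the section above (values kept as strings,
# which is an assumption about how they are entered).
DEFAULTS = {
    "dataset.input_column_name": "data.text",
    "dataset.target_column_name": "annotations.intent.choices",
    "dataset.input_format": "ai_center",
    "BOW.hyperparameter_search.enable": "True",
    "BOW.hyperparameter_search.timeout": "1800",
    "BOW.explain_inference": "False",
}

SUPPORTED_FORMATS = {"ai_center", "auto"}


def build_settings(overrides):
    """Merge overrides into the defaults and check the input format."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown variables: {sorted(unknown)}")
    settings = {**DEFAULTS, **overrides}
    if settings["dataset.input_format"] not in SUPPORTED_FORMATS:
        raise ValueError("dataset.input_format must be ai_center or auto")
    return settings


# Settings matching the CSV and JSON examples earlier on this page.
csv_settings = build_settings({
    "dataset.input_format": "auto",
    "dataset.input_column_name": "text",
    "dataset.target_column_name": "label",
})
print(csv_settings)
```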

Optional variables

You can add other optional variables by clicking the Add new button. However, if the BOW.hyperparameter_search.enable variable is set to True, the optimal values of these variables are searched for automatically. For the model to use the following optional parameters, set the BOW.hyperparameter_search.enable variable to False:

BOW.lr_kwargs.class_weight
- Supported values are: balanced or None.
BOW.ngram_range
- The range of lengths of consecutive word sequences (n-grams) that can be considered as features for the model.
- Make sure to follow this format: (1, x), where x is the maximum sequence length you want to allow.
BOW.min_df
- Used to set the minimum number of occurrences of the n-gram in the dataset to be considered as a feature.
- Recommended values are between 0 and 10.
dataset.text_pp_remove_stop_words
- Used to configure whether stop words (for example, words like the, or) are removed from the text.
- Supported values are: True or False.
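
To make the roles of BOW.ngram_range and BOW.min_df more concrete, the following Python sketch extracts word n-grams and keeps only those that appear often enough. It is a simplified illustration, not the package's implementation: the tokenization (lowercasing and splitting on whitespace) is an assumption, and occurrences are counted here as the number of texts containing the n-gram, which is one possible reading of min_df.

```python
from collections import Counter


def extract_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams whose length falls inside ngram_range."""
    tokens = text.lower().split()
    low, high = ngram_range
    return [
        " ".join(tokens[start:start + size])
        for size in range(low, high + 1)
        for start in range(len(tokens) - size + 1)
    ]


def build_vocabulary(texts, ngram_range=(1, 2), min_df=2):
    """Keep the n-grams that occur in at least min_df of the texts."""
    document_frequency = Counter()
    for text in texts:
        # A set ensures each n-gram is counted once per text.
        document_frequency.update(set(extract_ngrams(text, ngram_range)))
    return sorted(
        ngram for ngram, count in document_frequency.items() if count >= min_df
    )


TEXTS = ["I like this movie", "I like this actor", "I hated the acting"]
vocabulary = build_vocabulary(TEXTS, ngram_range=(1, 2), min_df=2)
print(vocabulary)
```

Raising min_df shrinks the vocabulary to more frequent n-grams, while raising the upper bound of the range adds longer word sequences as candidate features.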

Artifacts

The evaluation report is a PDF file containing the following information in a human-readable format:

ngrams per class
Precision-recall diagram
Classification report
Confusion matrix
Best Model Parameters for Hyperparameter search

ngrams per class

This section contains the top 10 n-grams that affect the model prediction for each class. There is a separate table for each class that the model was trained on.

Precision recall diagram

You can use this diagram and the accompanying table to check the trade-off between precision and recall, along with the F1 scores of the model. The thresholds and their corresponding precision and recall values are provided in a table below the diagram. Use this table to choose the threshold to configure in your workflow for deciding when to send data to Action Center for human in the loop. Note that the higher the chosen threshold, the more data is routed to Action Center for human in the loop.

There is a precision-recall diagram for each class.

For an example of a precision-recall diagram, see the figure below.

For an example of a precision-recall table, see the table below.

Precision	Recall	Threshold
0.8012232415902141	0.6735218508997429	0.30539842728983285
0.8505338078291815	0.6143958868894601	0.37825683923133907
0.9005524861878453	0.4190231362467866	0.6121292357073038
0.9514563106796117	0.2519280205655527	0.7916427288647211
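
As a sketch of how such a table could drive the routing decision, the following Python snippet picks the lowest threshold that reaches a desired precision and flags predictions below it for review. The helper names are hypothetical, and the rule "send to a human when confidence is below the threshold" is an assumption about how the workflow is configured.

```python
# (precision, recall, threshold) rows copied from the example table above.
PR_TABLE = [
    (0.8012232415902141, 0.6735218508997429, 0.30539842728983285),
    (0.8505338078291815, 0.6143958868894601, 0.37825683923133907),
    (0.9005524861878453, 0.4190231362467866, 0.6121292357073038),
    (0.9514563106796117, 0.2519280205655527, 0.7916427288647211),
]


def pick_threshold(table, min_precision):
    """Return the lowest threshold whose precision meets min_precision."""
    candidates = [thr for prec, _recall, thr in table if prec >= min_precision]
    if not candidates:
        raise ValueError("no threshold reaches the requested precision")
    return min(candidates)


def needs_human_review(confidence, threshold):
    """Assumed routing rule: low-confidence predictions go to a human."""
    return confidence < threshold


threshold = pick_threshold(PR_TABLE, min_precision=0.90)
print(threshold)
print(needs_human_review(0.1259827300369445, threshold))
```

Choosing the lowest qualifying threshold keeps recall as high as possible for the requested precision.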

Classification report

The classification report contains the following information:

Label - the label in the test set that the row refers to
Precision - the fraction of predictions of this label that are correct
Recall - the fraction of instances of this label in the test set that the model retrieved
F1 score - the harmonic mean of precision and recall; you can use this score to compare two models
Support - the number of times a certain label appears in the test set

For an example of a classification report, see the table below.

Label	Precision	Recall	F1 Score	Support
0.0	0.805	0.737	0.769	319
1.0	0.731	0.812	0.77	389
2.0	0.778	0.731	0.754	394
3.0	0.721	0.778	0.748	392
4.0	0.855	0.844	0.85	385
5.0	0.901	0.803	0.849	395
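
The F1 score is conventionally computed as the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short Python check below applies that formula to the rows of the example table. Small differences from the reported values are expected, because the precision and recall shown in the table are themselves rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# label -> (precision, recall, reported F1), copied from the example table.
REPORT = {
    "0.0": (0.805, 0.737, 0.769),
    "1.0": (0.731, 0.812, 0.77),
    "2.0": (0.778, 0.731, 0.754),
    "3.0": (0.721, 0.778, 0.748),
    "4.0": (0.855, 0.844, 0.85),
    "5.0": (0.901, 0.803, 0.849),
}

computed = {label: f1_score(p, r) for label, (p, r, _reported) in REPORT.items()}
for label, value in computed.items():
    print(label, round(value, 4), REPORT[label][2])
```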

Confusion Matrix

Best Model Parameters for Hyperparameter search

When the BOW.hyperparameter_search.enable variable is set to True, the best model parameters picked by the algorithm are displayed in this table. To retrain the model with different parameters not covered by the hyperparameter search, you can also set these parameters manually in the environment variables. For more information, check the Environment variables section.

For an example of this report, see the table below.

Name	Value
BOW.ngram_range	(1, 2)
BOW.min_df	2
BOW.lr_kwargs.class_weight	balanced
dataset.text_pp_remove_stop_words	True

Hyperparameter search report

This report is a PDF file generated only if the BOW.hyperparameter_search.enable parameter is set to True. The report contains the best values for the optional variables and a diagram to display the results.

JSON files

You can find separate JSON files corresponding to each section of the Evaluation Report PDF file. These JSON files are machine-readable, and you can use them to pipe the model evaluation into Insights using a workflow.

Data

Evaluation CSV file

This is a CSV file with predictions on the test set used for evaluation. This file also contains the n-grams that impacted the prediction (irrespective of the BOW.explain_inference variable value).
