AI Center
Last updated Feb 22, 2024

Multilingual Text Classification

Out of the Box Packages > UiPath Language Analysis > MultiLingualTextClassification

This is a generic, retrainable model for text classification. This ML Package must be trained before it is deployed; if it is deployed without training first, the deployment fails with an error stating that the model is not trained. It is based on BERT, a self-supervised method for pretraining natural language processing systems. A GPU is recommended, especially during training, and delivers a roughly 5-10x improvement in speed.

Languages

This multilingual model supports the languages listed below. These languages were chosen because they are the 100 languages with the largest Wikipedias:

  • Afrikaans
  • Albanian
  • Arabic
  • Aragonese
  • Armenian
  • Asturian
  • Azerbaijani
  • Bashkir
  • Basque
  • Bavarian
  • Belarusian
  • Bengali
  • Bishnupriya Manipuri
  • Bosnian
  • Breton
  • Bulgarian
  • Burmese
  • Catalan
  • Cebuano
  • Chechen
  • Chinese (Simplified)
  • Chinese (Traditional)
  • Chuvash
  • Croatian
  • Czech
  • Danish
  • Dutch
  • English
  • Estonian
  • Finnish
  • French
  • Galician
  • Georgian
  • German
  • Greek
  • Gujarati
  • Haitian
  • Hebrew
  • Hindi
  • Hungarian
  • Icelandic
  • Ido
  • Indonesian
  • Irish
  • Italian
  • Japanese
  • Javanese
  • Kannada
  • Kazakh
  • Kirghiz
  • Korean
  • Latin
  • Latvian
  • Lithuanian
  • Lombard
  • Low Saxon
  • Luxembourgish
  • Macedonian
  • Malagasy
  • Malay
  • Malayalam
  • Marathi
  • Minangkabau
  • Nepali
  • Newar
  • Norwegian (Bokmal)
  • Norwegian (Nynorsk)
  • Occitan
  • Persian (Farsi)
  • Piedmontese
  • Polish
  • Portuguese
  • Punjabi
  • Romanian
  • Russian
  • Scots
  • Serbian
  • Serbo-Croatian
  • Sicilian
  • Slovak
  • Slovenian
  • South Azerbaijani
  • Spanish
  • Sundanese
  • Swahili
  • Swedish
  • Tagalog
  • Tajik
  • Tamil
  • Tatar
  • Telugu
  • Thai
  • Turkish
  • Ukrainian
  • Urdu
  • Uzbek
  • Vietnamese
  • Volapük
  • Waray-Waray
  • Welsh
  • West Frisian
  • Western Punjabi
  • Yoruba

Model Details

Input Type

JSON

Input Description

Text to be classified as String: 'I loved this movie.'

Output Description

JSON with the predicted class name and the associated confidence of that prediction (between 0 and 1).

Example:

{
  "prediction": "Positive",
  "confidence": 0.9422031841278076
}
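A minimal sketch of how a caller might consume this output, assuming the response arrives as a JSON string shaped like the example above. The function name `accept_prediction` and the 0.8 threshold are illustrative assumptions, not part of the package.

```python
import json


def accept_prediction(raw_response: str, threshold: float = 0.8):
    """Return the predicted class, or None if the confidence is below threshold.

    The threshold is a hypothetical business rule, not a package setting.
    """
    result = json.loads(raw_response)
    if result["confidence"] >= threshold:
        return result["prediction"]
    return None


# Sample response copied from the output example.
response = '{"prediction": "Positive", "confidence": 0.9422031841278076}'
print(accept_prediction(response))
```

Routing low-confidence predictions to a human reviewer instead of returning None is another option, depending on the workflow.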

Recommend GPU

By default, a GPU is recommended.

Training Enabled

By default, training is enabled.

Pipelines

All three types of pipelines (Full Training, Training, and Evaluation) are supported by this package. For most use cases, no parameters need to be specified; the model uses advanced techniques to find a performant model. In trainings after the first, the model uses incremental learning (that is, the version produced by the previous training run is reused).

Dataset Format

Three options are available to structure your dataset for this model: JSON, CSV, and AI Center™ JSON format (this is also the export format of the labelling tool). The model reads all CSV and JSON files in the specified directory. For every format, the model expects two columns or two properties, identified by dataset.input_column_name and dataset.target_column_name. The names of these two columns and/or directories are configurable using environment variables.

CSV file format

Each CSV file can have any number of columns, but only two will be used by the model. Those columns are specified by the dataset.input_column_name and dataset.target_column_name parameters.

Check the following sample and environment variables for a CSV file format example.

text, label
I like this movie, 7
I hated the acting, 9

The environment variables for the previous example would be as follows:

  • dataset.input_format: auto
  • dataset.input_column_name: text
  • dataset.target_column_name: label
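As a rough illustration of how the two configured columns select data from a CSV file, the sketch below reads the sample with Python's standard csv module. The helper `read_pairs` is hypothetical, and the way the package itself parses CSV files is not documented here; `skipinitialspace=True` is used only because the sample has a space after each comma.

```python
import csv
import io

# The sample CSV from above, inlined so the sketch is self-contained.
SAMPLE = "text, label\nI like this movie, 7\nI hated the acting, 9\n"


def read_pairs(handle, input_column="text", target_column="label"):
    """Return (text, label) pairs, ignoring any other columns in the file."""
    reader = csv.DictReader(handle, skipinitialspace=True)
    return [(row[input_column], row[target_column]) for row in reader]


pairs = read_pairs(io.StringIO(SAMPLE))
print(pairs)
```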

JSON file format

A single JSON file can contain multiple data points.

Check the following sample and environment variables for a JSON file format example.

[
  {
    "text": "I like this movie",
    "label": "7"
  },
  {
    "text": "I hated the acting",
    "label": "9"
  }
]

The environment variables for the previous example would be as follows:

  • dataset.input_format: auto
  • dataset.input_column_name: text
  • dataset.target_column_name: label
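A small sketch of producing a file in this shape from labelled pairs and checking that every record carries both properties. The helper names `to_json_dataset` and `find_incomplete` are hypothetical, and the property names `text` and `label` simply mirror the sample.

```python
import json


def to_json_dataset(pairs, input_key="text", target_key="label"):
    """Serialize (text, label) pairs as a JSON array of records."""
    records = [{input_key: text, target_key: str(label)} for text, label in pairs]
    return json.dumps(records, indent=2)


def find_incomplete(raw, input_key="text", target_key="label"):
    """Return the indexes of records missing the input or target property."""
    records = json.loads(raw)
    return [
        index
        for index, record in enumerate(records)
        if input_key not in record or target_key not in record
    ]


raw = to_json_dataset([("I like this movie", 7), ("I hated the acting", 9)])
print(raw)
print(find_incomplete(raw))
```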

ai_center file format

This is the default value of the dataset.input_format environment variable. With this format, the model reads all files with a .json extension in the provided directory.

Check the following sample and environment variables for an ai_center file format example.

{
    "annotations": {
        "intent": {
            "to_name": "text",
            "choices": [
                "TransactionIssue",
                "LoanIssue"
            ]
        },
        "sentiment": {
            "to_name": "text",
            "choices": [
                "Very Positive"
            ]
        },
        "ner": {
            "to_name": "text",
            "labels": [
                {
                    "start_index": 37,
                    "end_index": 47,
                    "entity": "Stakeholder",
                    "value": " Citi Bank"
                },
                {
                    "start_index": 51,
                    "end_index": 61,
                    "entity": "Date",
                    "value": "07/19/2018"
                },
                {
                    "start_index": 114,
                    "end_index": 118,
                    "entity": "Amount",
                    "value": "$500"
                },
                {
                    "start_index": 288,
                    "end_index": 293,
                    "entity": "Stakeholder",
                    "value": " Citi"
                }
            ]
        }
    },
    "data": {
        "cc": "",
        "to": "xyz@abc.com",
        "date": "1/29/2020 12:39:01 PM",
        "from": "abc@xyz.com",
        "text": "I opened my new checking account with Citi Bank in 07/19/2018 and met the requirements for the promotion offer of $500 . It has been more than 6 months and I have not received any bonus. I called the customer service several times in the past few months but no any response. I request the Citi honor its promotion offer as advertised."
    }
}

To use the previous sample JSON, set the environment variables as follows:

  • dataset.input_format: ai_center
  • dataset.input_column_name: data.text
  • dataset.target_column_name: annotations.intent.choices
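The dotted values above read naturally as paths into the nested JSON. The sketch below shows that interpretation on a shortened copy of the sample record; the `resolve` helper is hypothetical, and whether the package resolves these names in exactly this way is an assumption.

```python
from functools import reduce


def resolve(record, dotted_path):
    """Walk a nested dict along a dotted path such as 'data.text'."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), record)


# A shortened copy of the sample record; the text value is abbreviated.
record = {
    "annotations": {
        "intent": {
            "to_name": "text",
            "choices": ["TransactionIssue", "LoanIssue"],
        }
    },
    "data": {"text": "I opened my new checking account with Citi Bank"},
}

print(resolve(record, "data.text"))
print(resolve(record, "annotations.intent.choices"))
```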

Training on GPU or CPU

You can use either GPU or CPU for training. We recommend using GPU since it's faster.

Environment Variables

  • dataset.input_column_name
    • The name of the input column containing the text.
    • Default value is data.text.
    • Make sure that this variable is configured according to your input JSON or CSV file.
  • dataset.target_column_name
    • The name of the target column containing the labels.
    • Default value is annotations.intent.choices.
    • Make sure that this variable is configured according to your input JSON or CSV file.
  • dataset.input_format
    • The input format of the training data.
    • Default value is ai_center.
    • Supported values are: ai_center or auto.
    • If ai_center is selected, only JSON files are supported. Make sure to also change the value of the dataset.target_column_name to annotations.sentiment.choices if ai_center is selected.
    • If auto is selected, both CSV and JSON files are supported.
  • model.epochs
    • The number of epochs.
    • Default value: 100.

Artifacts

Confusion Matrix



Classification Report

              precision    recall  f1-score   support
    positive       0.94      0.94      0.94     10408
    negative       0.93      0.93      0.93      9592
    accuracy                           0.94     20000
   macro avg       0.94      0.94      0.94     20000
weighted avg       0.94      0.94      0.94     20000
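To make the last two rows concrete, the sketch below recomputes a macro average (every class counts equally) and a weighted average (each class counts in proportion to its support) from the per-class f1-scores in the report. The inputs are already rounded to two decimals, so the results only approximate the reported 0.94. The function names are illustrative.

```python
# Per-class (f1-score, support) values copied from the report above.
rows = {"positive": (0.94, 10408), "negative": (0.93, 9592)}


def macro_average(rows):
    """Unweighted mean of the per-class scores."""
    return sum(score for score, _ in rows.values()) / len(rows)


def weighted_average(rows):
    """Mean of the per-class scores weighted by class support."""
    total = sum(support for _, support in rows.values())
    return sum(score * support for score, support in rows.values()) / total


print(macro_average(rows))
print(weighted_average(rows))
```

The two averages stay close here because the classes are nearly balanced; they diverge as the class sizes become more uneven.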

Data

Evaluation CSV file

This is a CSV file with predictions on the test set used for evaluation.

text,label,predict,confidence
I like this movie, positive, positive, 0.99
I hated the acting, negative, negative, 0.98
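A short sketch of how this file could be inspected after an evaluation run, assuming the column names shown in the sample. The helpers `load_rows`, `accuracy`, and `misclassified` are hypothetical, and the sample is inlined so the snippet runs on its own.

```python
import csv
import io

# The evaluation sample from above, inlined for a self-contained sketch.
SAMPLE = (
    "text,label,predict,confidence\n"
    "I like this movie, positive, positive, 0.99\n"
    "I hated the acting, negative, negative, 0.98\n"
)


def load_rows(handle):
    """Read evaluation rows, trimming the space that follows each comma."""
    return list(csv.DictReader(handle, skipinitialspace=True))


def accuracy(rows):
    """Fraction of rows where the prediction matches the label."""
    return sum(row["label"] == row["predict"] for row in rows) / len(rows)


def misclassified(rows):
    """Rows where the prediction differs from the label."""
    return [row for row in rows if row["label"] != row["predict"]]


rows = load_rows(io.StringIO(SAMPLE))
print(accuracy(rows))
print(misclassified(rows))
```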
