AI Center User Guide

Last updated Oct 17, 2024

English Text Classification

OS Packages > Language Analysis > EnglishTextClassification

This is a generic, retrainable model for English text classification. This ML Package must be trained before it is deployed; if it is deployed without training first, deployment fails with an error stating that the model is not trained.

This model is a deep learning architecture for language classification. It is based on RoBERTa, a self-supervised method for pretraining natural language processing systems that was open-sourced by Facebook AI Research. A GPU can be used at both serving time and training time, and delivers a roughly 5-10x improvement in speed.

Input type

JSON

Input description

Text to be classified as String: "I loved this movie."

Output description

JSON with the predicted class name and the associated confidence for that prediction (between 0 and 1).

Example:

{
  "class": "Positive",
  "confidence": 0.9422031841278076
}
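A minimal Python sketch of consuming this output, assuming the response arrives as a JSON string in the shape shown above. The function name accept_prediction and the 0.8 threshold are hypothetical and chosen only for illustration; they are not part of the package.

```python
import json


def accept_prediction(response_text, threshold=0.8):
    """Return the predicted class if its confidence meets the threshold, else None.

    Both the function name and the default threshold are illustrative assumptions.
    """
    result = json.loads(response_text)
    if result["confidence"] >= threshold:
        return result["class"]
    return None


response = '{"class": "Positive", "confidence": 0.9422031841278076}'
print(accept_prediction(response))
```

Routing low-confidence predictions to a fallback (for example, human review) is one common way to use the confidence value.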

Pipelines

All three types of pipelines (Full Training, Training, and Evaluation) are supported by this package.

For most use cases, no parameters need to be specified; the model uses advanced techniques to find a performant model. In training runs after the first, the model uses incremental learning (that is, the version produced at the end of the previous Training run is used as the starting point).

Dataset format

Reading multiple files

By default, this model reads all files with a .csv or .json extension (recursively) in the provided directory.

CSV File Format:

Each CSV file can have any number of columns, but only two are used by the model. Those columns are specified by the parameters input_column (if not set, defaults to "input") and target_column (if not set, defaults to "target").

For example, a single csv file may look like this:

input,target
I like this movie,positive
I hated the acting,negative

With the example file above, any type of pipeline can be triggered without adding any extra parameters. In the following example, the columns need to be specified explicitly (input_column set to review and target_column set to sentiment):

review,sentiment
I like this movie,positive
I hated the acting,negative

Any file that does not have the columns specified by input_column and target_column is skipped. The delimiter used to parse the file can be set with the csv_delimiter parameter. For example, if your file is actually tab-separated, save it with the extension .csv and set the parameter csv_delimiter to the tab character.
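The column and delimiter rules above can be illustrated with a short Python sketch. This is not the package's actual loader; the helper name read_labelled_rows is hypothetical, and the code only mimics the described behaviour: it picks the two configured columns and returns nothing for a file that lacks them.

```python
import csv
import io


def read_labelled_rows(text, input_column="input", target_column="target",
                       csv_delimiter=","):
    """Return (input, target) pairs, or an empty list if a column is missing.

    Hypothetical helper: the parameter names mirror the documented ones,
    but the implementation is an assumption made for illustration.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=csv_delimiter)
    header = reader.fieldnames or []
    if input_column not in header or target_column not in header:
        return []  # mimics "the file is skipped"
    return [(row[input_column], row[target_column]) for row in reader]


tab_separated = (
    "review\tsentiment\n"
    "I like this movie\tpositive\n"
    "I hated the acting\tnegative\n"
)

# Columns and delimiter given explicitly: both rows are read.
rows = read_labelled_rows(tab_separated, "review", "sentiment", csv_delimiter="\t")
print(rows)

# Default column names do not match this file, so nothing is returned.
skipped = read_labelled_rows(tab_separated, csv_delimiter="\t")
print(skipped)
```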

JSON File Format:

Each JSON file can contain a single data point or a list of data points. That is, each JSON file can have one of two formats.

Single data point in one JSON file:

{
  "input": "I like this movie",
  "target": "positive"
}

Multiple data points in one json file:

[
  {
    "input": "I like this movie",
    "target": "positive"
  },
  {
    "input": "I hated the acting",
    "target": "negative"
  }
]

As with CSV files, if the input_column and target_column parameters are set, the expected keys change from "input" to the value of input_column and from "target" to the value of target_column.

All valid files (all csv files and json files that conform to the format above) will be coalesced.
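As a sketch of how the two JSON layouts can be normalized into one dataset, the Python snippet below accepts either a single object or a list of objects. The helper name load_data_points is hypothetical, and the package's real loading code may differ.

```python
import json


def load_data_points(json_text, input_column="input", target_column="target"):
    """Accept one object or a list of objects; return (input, target) pairs.

    Hypothetical helper written for illustration only.
    """
    parsed = json.loads(json_text)
    records = parsed if isinstance(parsed, list) else [parsed]
    return [(record[input_column], record[target_column]) for record in records]


single = '{"input": "I like this movie", "target": "positive"}'
many = (
    '[{"input": "I like this movie", "target": "positive"},'
    ' {"input": "I hated the acting", "target": "negative"}]'
)

# Coalesce the data points from both files into one list.
dataset = load_data_points(single) + load_data_points(many)
print(len(dataset))
```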

Reading a single file

In some cases, it may be useful to use a single file (even if your directory has many files). In this case, the parameter csv_name can be used. If it is set, the pipeline reads only that file. Setting this parameter also enables two additional parameters:

csv_start_index which allows the user to specify the row where to start reading.
csv_end_index which allows the user to specify the row to end reading.

For example, you may have a large file with 20K rows, but want to quickly see what a Training run on a subset of the data would look like. In this case, you can specify the file name and set csv_end_index to a value much lower than 20K.
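The effect of a start and end index can be sketched in Python as follows. The indexing convention here is an assumption, since this page does not state it: rows are counted from zero, the header is not counted, the start index is inclusive, and the end index is exclusive. The helper name read_row_range is hypothetical.

```python
import csv
import io
from itertools import islice


def read_row_range(text, start_index=0, end_index=None):
    """Return data rows in [start_index, end_index).

    Assumed convention (not confirmed by the documentation): zero-based,
    header row excluded, end index exclusive.
    """
    reader = csv.DictReader(io.StringIO(text))
    return list(islice(reader, start_index, end_index))


# A made-up file with 20 data rows.
sample = "input,target\n" + "".join(f"text {i},label\n" for i in range(20))

subset = read_row_range(sample, start_index=5, end_index=10)
print(len(subset), subset[0]["input"])
```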

Environment variables

input_column: change this value to match the name of your dataset's input column (default "input").
target_column: change this value to match the name of your dataset's target column (default "target").
evaluation_metric: set this value to change the metric returned by the evaluation function and surfaced in the UI. This parameter can be set to one of the following values: "accuracy" (default), "auroc" (area under the ROC curve), "precision", "recall", "matthews correlation" (Matthews correlation coefficient), "fscore".
csv_name: use this variable to specify a single CSV file to be read from the dataset.
csv_start_index: specifies the row at which to start reading. To be used in combination with csv_name.
csv_end_index: specifies the row at which to stop reading. To be used in combination with csv_name.

Artifacts

The Train function produces the following artifacts:

train.csv - The data that was used to train the model, saved here for governance and traceability.
validation.csv - The data that was used to validate the model.
learning-rate-finder.png - Most users will never need to worry about this. Advanced users may find this helpful (see advanced section).
train-report.pdf - A report containing summary information of this run. The first section includes all the parameters that were specified by the user. The second section includes statistics about the data (the number of data points for training and validation, and the checksum of each file). The last section includes two plots:
- Loss Plot – This plots the training and validation loss as a function of the number of epochs. The output ML Package version is always the version that had the minimum validation loss (not the model at the last epoch).
- Metrics Plot – This plots a number of metrics computed on the validation set at the end of each epoch.
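The selection rule described for the Loss Plot (keep the version with the minimum validation loss rather than the last epoch) can be sketched in a few lines of Python. The function name best_epoch and the loss values are made up for illustration.

```python
def best_epoch(validation_losses):
    """Return the zero-based index of the epoch with the lowest validation loss."""
    return min(range(len(validation_losses)), key=validation_losses.__getitem__)


# Made-up validation losses: the loss rises again after the third epoch.
val_loss = [0.69, 0.52, 0.41, 0.44, 0.47]
chosen = best_epoch(val_loss)
print(chosen, val_loss[chosen])
```

In this made-up run, the last epoch is not the one selected, because its validation loss is higher than that of an earlier epoch.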

The Evaluate function produces two artifacts:

evaluation.csv - The data that was used to evaluate the model.
evaluation-report.pdf - A report containing summary information of this run. The first section includes all the parameters that were specified by the user. The second section includes statistics about the data (the number of data points for evaluation and the file checksum). The third section includes statistics of that evaluation (for multi-class, the metrics are weighted). The last section includes a plot of the confusion matrix and a per-class computation of accuracy, precision, recall, and support, as well as their averaged values.
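To make the per-class and weighted figures concrete, here is a Python sketch that derives precision, recall, and support from a confusion matrix. It assumes that "weighted" means averaging each per-class metric weighted by that class's support, which is a common convention but is not confirmed by this page. The function name per_class_report and the counts are made up.

```python
def per_class_report(confusion, labels):
    """Compute per-class precision, recall, and support, plus weighted averages.

    confusion[i][j] is the number of items whose true class is labels[i]
    and whose predicted class is labels[j].
    Assumption: weighted averages use class support as the weight.
    """
    total = sum(sum(row) for row in confusion)
    per_class = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted = sum(row[k] for row in confusion)  # column sum
        support = sum(confusion[k])                   # row sum
        per_class[label] = {
            "precision": true_positive / predicted if predicted else 0.0,
            "recall": true_positive / support if support else 0.0,
            "support": support,
        }
    weighted = {
        name: sum(m[name] * m["support"] for m in per_class.values()) / total
        for name in ("precision", "recall")
    }
    return per_class, weighted


labels = ["negative", "positive"]
confusion = [[8, 2], [1, 9]]  # made-up counts
per_class, weighted = per_class_report(confusion, labels)
print(per_class["negative"], weighted)
```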

Paper

RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, et al.
