# English Text Classification

> :::note
Out-of-the-box ML packages are deprecated. For more information, check the [Deprecation timeline](https://docs.uipath.com/overview/other/latest/overview/deprecation-timeline#ai-center) page from the **Overview** guide.
:::

*OS Packages > Language Analysis > EnglishTextClassification*

This is a generic, retrainable model for English text classification. This ML Package must be trained before it is deployed; if it is deployed without training first, the deployment fails with an error stating that the model is not trained.

This model is a deep learning architecture for language classification. It is based on RoBERTa, a self-supervised method for pretraining natural language processing systems, which was open-sourced by Facebook AI Research. A GPU can be used at both serving time and training time, and delivers roughly a 5-10x improvement in speed.

## Model details

### Input type

JSON

### Input description

Text to be classified as String: "I loved this movie."

### Output description

JSON with the predicted class name and the associated confidence of that class prediction (between 0 and 1).

Example:

```
{
  "class": "Positive",
  "confidence": 0.9422031841278076
}
```
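A caller will usually parse this response and decide what to do based on the confidence. The following is a minimal sketch, assuming the response arrives as a JSON string shaped like the example above. The `accept_prediction` helper and the 0.8 threshold are hypothetical and are not part of the package.

```python
import json


def accept_prediction(response_text, threshold=0.8):
    """Return the predicted class if its confidence meets the threshold, else None.

    Hypothetical helper: the threshold is an illustrative value, not a package default.
    """
    result = json.loads(response_text)
    label = result["class"]
    confidence = float(result["confidence"])
    # The output description states that confidence lies between 0 and 1.
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return label if confidence >= threshold else None


response = '{"class": "Positive", "confidence": 0.9422031841278076}'
print(accept_prediction(response))
```

Low-confidence predictions return `None` here so that the caller can route them to manual review or a fallback rule.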

## Pipelines

All three types of pipelines (Full Training, Training, and Evaluation) are supported by this package.

For most use cases, no parameters need to be specified; the model uses advanced techniques to find a performant model. In subsequent trainings after the first, the model uses incremental learning (that is, the version produced at the end of the previous Training run is used as the starting point).

### Dataset format

*Reading multiple files*

By default, this model reads all files with a .csv or .json extension (recursively) in the provided directory.
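To preview which files in a local copy of a dataset would qualify, a recursive search along these lines can help. This is a sketch of the idea only, not the package's own loader; the function name `find_dataset_files` is hypothetical, and matching extensions case-insensitively is an assumption.

```python
import tempfile
from pathlib import Path


def find_dataset_files(root):
    """Return all .csv and .json files under root, searched recursively."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        # Assumption: extensions are matched case-insensitively.
        if p.is_file() and p.suffix.lower() in {".csv", ".json"}
    )


# Build a small throwaway directory to demonstrate the search.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "nested").mkdir()
    (base / "reviews.csv").write_text("input,target\nI like this movie,positive\n")
    (base / "nested" / "more.json").write_text(
        '{"input": "I hated the acting", "target": "negative"}'
    )
    (base / "notes.txt").write_text("not a dataset file")
    found = [p.relative_to(base).as_posix() for p in find_dataset_files(base)]

print(found)
```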

#### CSV File Format

Each CSV file can have any number of columns, but only two are used by the model. These columns are specified by the parameters **input_column** (defaults to “input” if not set) and **target_column** (defaults to “target” if not set).

For example, a single CSV file may look like this:

```
input,target
I like this movie,positive
I hated the acting,negative
```

With the previous example file, any type of pipeline can be triggered without adding any extra parameters. In the following example, the columns need to be specified explicitly (**input_column** set to “review” and **target_column** set to “sentiment”):

```
review,sentiment
I like this movie,positive
I hated the acting,negative
```

Any files that do not have the columns specified by **input_column** and **target_column** are skipped. The delimiter used to parse the file can be set with the **csv_delimiter** parameter. For example, if your file is tab-separated, save it with the extension .csv and set **csv_delimiter** to the tab character.
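The column and delimiter rules can be checked locally before uploading a dataset. The sketch below mimics the described behavior with Python's standard `csv` module; it is not the package's implementation, and `read_csv_examples` is a hypothetical helper whose arguments simply reuse the parameter names for readability.

```python
import csv
import io


def read_csv_examples(text, input_column="input", target_column="target",
                      csv_delimiter=","):
    """Return (input, target) pairs, or None when a required column is missing."""
    reader = csv.DictReader(io.StringIO(text), delimiter=csv_delimiter)
    header = reader.fieldnames or []
    if input_column not in header or target_column not in header:
        return None  # stands in for "the file is skipped"
    return [(row[input_column], row[target_column]) for row in reader]


default_file = "input,target\nI like this movie,positive\nI hated the acting,negative\n"
tab_file = "review\tsentiment\nI like this movie\tpositive\n"

# Default column names and delimiter: no extra parameters needed.
print(read_csv_examples(default_file))
# Same call on the tab-separated file finds no matching columns.
print(read_csv_examples(tab_file))
# Naming the columns and the tab delimiter makes the file readable.
print(read_csv_examples(tab_file, "review", "sentiment", "\t"))
```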

#### JSON File Format

Each JSON file can contain a single data point or a list of data points. That is, each JSON file can have one of two formats.

Single data point in one JSON file:

```
{
  "input": "I like this movie",
  "target": "positive"
}
```

Multiple data points in one JSON file:

```
[
  {
    "input": "I like this movie",
    "target": "positive"
  },
  {
    "input": "I hated the acting",
    "target": "negative"
  }
]
```

As with CSV files, if the **input_column** and **target_column** parameters are set, the keys “input” and “target” are replaced by the values of **input_column** and **target_column**.

All valid files will be coalesced.
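Both JSON layouts can be handled by treating a single object as a one-element list. The following sketch shows that normalization and then combines the results into one dataset; it illustrates the idea only, and `load_json_examples` is a hypothetical name rather than a function of the package.

```python
import json


def load_json_examples(text, input_column="input", target_column="target"):
    """Return (input, target) pairs from a single data point or a list of them."""
    data = json.loads(text)
    # A single object is wrapped so both layouts share one code path.
    records = data if isinstance(data, list) else [data]
    return [(r[input_column], r[target_column]) for r in records]


single = '{"input": "I like this movie", "target": "positive"}'
many = """[
  {"input": "I like this movie", "target": "positive"},
  {"input": "I hated the acting", "target": "negative"}
]"""

# Coalesce the examples from every file into one list.
dataset = load_json_examples(single) + load_json_examples(many)
print(len(dataset))
```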

*Reading a single file*

In some cases, it may be useful to use a single file (even if your directory has many files). In this case, the parameter **csv_name** can be used. If set, the pipeline reads only that file. When this parameter is set, two additional parameters are enabled:

* **csv_start_index** which allows the user to specify the row where to start reading.
* **csv_end_index** which allows the user to specify the row to end reading.

For example, you may have a large file with 20K rows, but may want to quickly see what a Training run on a subset of the data would look like. In this case, you may specify the file name and set **csv_end_index** to a value much lower than 20K.
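To preview which rows a given index range would select, the slicing can be reproduced locally. This sketch assumes zero-based indices counted over data rows (header excluded), with the start inclusive and the end exclusive; the package's exact indexing convention is not stated here, so verify it on a small file first. `read_csv_slice` is a hypothetical helper.

```python
import csv
import io
from itertools import islice


def read_csv_slice(text, csv_start_index=0, csv_end_index=None):
    """Return the data rows in [csv_start_index, csv_end_index).

    Assumption: zero-based, header excluded, end exclusive.
    """
    reader = csv.DictReader(io.StringIO(text))
    return list(islice(reader, csv_start_index, csv_end_index))


# Generate a small stand-in for a large file: 20 data rows plus a header.
lines = ["input,target"] + [f"review number {i},positive" for i in range(20)]
big_file = "\n".join(lines) + "\n"

subset = read_csv_slice(big_file, csv_start_index=5, csv_end_index=10)
print(len(subset), subset[0]["input"])
```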

### Environment variables

* **input_column**: change this value to match the name of your dataset's input column (default “input”).
* **target_column**: change this value to match the name of your dataset's target column (default “target”).
* **evaluation_metric**: set this value to change the metric returned by the evaluation function and surfaced in the UI. This parameter can be set to one of the following values: “accuracy” (default), “auroc” (area under the ROC curve), “precision”, “recall”, “matthews correlation” (Matthews correlation coefficient), “fscore”.
* **csv_name**: use this variable if you want to specify a single CSV file to be read from the dataset.
* **csv_start_index**: specifies the row where reading starts. To be used in combination with **csv_name**.
* **csv_end_index**: specifies the row where reading ends. To be used in combination with **csv_name**.
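Before launching a pipeline, it can be useful to sanity-check the chosen values against the rules listed above. The following is a sketch of such a pre-flight check that a user could run locally; `resolve_parameters` is hypothetical, and how the package itself reacts to invalid values is not described here.

```python
ALLOWED_METRICS = {
    "accuracy", "auroc", "precision", "recall", "matthews correlation", "fscore",
}

# Defaults as documented in the list above.
DEFAULTS = {
    "input_column": "input",
    "target_column": "target",
    "evaluation_metric": "accuracy",
}


def resolve_parameters(overrides):
    """Merge user overrides with the documented defaults and check them."""
    params = {**DEFAULTS, **overrides}
    if params["evaluation_metric"] not in ALLOWED_METRICS:
        raise ValueError(f"unknown evaluation_metric: {params['evaluation_metric']!r}")
    uses_index = "csv_start_index" in params or "csv_end_index" in params
    if uses_index and "csv_name" not in params:
        raise ValueError("csv_start_index and csv_end_index require csv_name")
    return params


params = resolve_parameters({
    "input_column": "review",
    "target_column": "sentiment",
    "evaluation_metric": "fscore",
})
print(params)
```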

### Artifacts

The Train function produces the following artifacts:

* train.csv - The data that was used to train the model, saved here for governance and traceability.
* validation.csv - The data that was used to validate the model.
* learning-rate-finder.png - Most users will never need to worry about this. Advanced users may find this helpful.
* train-report.pdf - A report containing summary information of this run. The first section includes all the parameters that were specified by the user. The second section includes statistics about the data (the number of data points for training, validation, and the checksum of each file). The last section includes two plots:
  + Loss Plot – This plots the training and validation loss as a function of the number of epochs. The output ML Package version will always be the version that had the minimum validation loss (not the model at the last epoch).
  + Metrics Plot – This plots a number of metrics computed on the validation set at the end of each epoch.
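The rule described for the Loss Plot, keeping the version with the lowest validation loss rather than the last epoch, amounts to a simple selection. The sketch below illustrates it with made-up loss values; it is not the package's training code, and `best_epoch` is a hypothetical name.

```python
def best_epoch(validation_losses):
    """Return the zero-based index of the epoch with the lowest validation loss."""
    return min(range(len(validation_losses)), key=validation_losses.__getitem__)


# Made-up values for illustration: the loss improves, then starts rising again.
val_losses = [0.69, 0.52, 0.41, 0.44, 0.47]

chosen = best_epoch(val_losses)
print(chosen, val_losses[chosen])
```

In this made-up run the last epoch is not the best one, which is why selecting on validation loss matters.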

Evaluate function produces two artifacts:

* evaluation.csv - The data that was used to evaluate the model.
* evaluation-report.pdf - A report containing summary information of this run. The first section includes all the parameters that were specified by the user. The second section includes statistics about the data (the number of data points for evaluation and the file checksum). The third section includes statistics of that evaluation (for multi-class, the metrics are weighted). The last section includes a plot of the confusion matrix, and a per-class computation of each of accuracy, precision, recall, and support, as well as their averaged values.
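The per-class figures in the report can be reproduced by hand from the contents of evaluation.csv and the model's predictions. The sketch below computes per-class precision, recall, and support from a confusion count, then averages them weighted by support. Weighting by support is an assumption about what "weighted" means here, based on common convention; the labels and predictions are made up.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return precision, recall, and support for each class."""
    labels = sorted(set(y_true) | set(y_pred))
    confusion = Counter(zip(y_true, y_pred))  # (actual, predicted) -> count
    report = {}
    for label in labels:
        true_positives = confusion[(label, label)]
        predicted = sum(confusion[(other, label)] for other in labels)
        support = sum(confusion[(label, other)] for other in labels)
        report[label] = {
            "precision": true_positives / predicted if predicted else 0.0,
            "recall": true_positives / support if support else 0.0,
            "support": support,
        }
    return report


def weighted_average(report, metric):
    """Average a per-class metric, weighting each class by its support."""
    total = sum(r["support"] for r in report.values())
    return sum(r[metric] * r["support"] for r in report.values()) / total


# Made-up labels and predictions for illustration.
y_true = ["positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]

report = per_class_report(y_true, y_pred)
print(report)
print(weighted_average(report, "recall"))
```

With support weighting, the averaged recall equals overall accuracy, which makes a convenient cross-check against the report.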

## Paper

[RoBERTa](https://arxiv.org/abs/1907.11692): A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, et al.
