French Text Classification

This is a generic text classification model for French that uses transfer learning. It must be trained before you can use it for prediction. It is based on CamemBERT embeddings, on top of which a three-layer fully connected neural network is added to classify the data. CamemBERT is a state-of-the-art French language model based on the RoBERTa architecture, pretrained on the French subcorpus of the multilingual OSCAR corpus and available through HuggingFace.

Model Details

Input Type

JSON

Input Description

Text to be classified as String: "Mon séjour dans cet hôtel s’est très bien passé"

Output Description

JSON string with the predicted class name, the confidence associated with that prediction (between 0 and 1), and a list of all classes with their confidences in the “all_predictions” field.

Example:

{
  "class": "Positive",
  "confidence": 0.9996203184127808,
  "all_predictions": [
    {
      "class": "Negative",
      "confidence": 0.0003796691307798028
    },
    {
      "class": "Positive",
      "confidence": 0.9996203184127808
    }
  ]
}
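As a minimal sketch of how a caller might consume this output, the snippet below parses the example response with Python's standard json module and picks the entry with the highest confidence from all_predictions. The helper name top_prediction is hypothetical and is not part of the package.

```python
import json

# Example response copied from the output description above.
response = """
{
  "class": "Positive",
  "confidence": 0.9996203184127808,
  "all_predictions": [
    {"class": "Negative", "confidence": 0.0003796691307798028},
    {"class": "Positive", "confidence": 0.9996203184127808}
  ]
}
"""


def top_prediction(raw):
    """Return (class, confidence) of the highest-confidence entry (hypothetical helper)."""
    parsed = json.loads(raw)
    best = max(parsed["all_predictions"], key=lambda p: p["confidence"])
    return best["class"], best["confidence"]


result = json.loads(response)
best_class, best_confidence = top_prediction(response)
print(best_class, best_confidence)
```

In this example the top-level "class" and "confidence" fields match the best entry of all_predictions, so a caller that only needs the winning class can read the top-level fields directly.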

Pipelines

All three types of pipelines (Full Training, Training and Evaluation) are supported by this package.

When you train the model for the first time, the classes are inferred from the entire dataset provided. Once the model is trained, the same classes are used for predictions and for future retraining. To reset the classes (or add new ones), retrain the model with the reset environment variable set (see below).

Using a GPU makes pipeline execution much faster and is recommended for training on large datasets.

Dataset Format

This ML Package looks for json and csv files at the top level of your dataset directory (subdirectories are not searched).

csv files: each file is expected to have a header row containing the columns named by input_column (default “text”) and target_column (default “class”), followed by one data point per line.
json files: each file is expected to contain a single data point with the fields named by input_column (default “text”) and target_column (default “class”).
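The layout described above can be sketched with a small reader. This is only an illustration of the expected file format, not the package's own loading code: the functions load_dataset and infer_classes, the file names, and every sample row except the first are made up for the example, and sorting the class names is an assumption.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_dataset(directory, input_column="text", target_column="class"):
    """Collect (text, label) pairs from top-level csv and json files (illustrative only)."""
    samples = []
    for path in sorted(Path(directory).iterdir()):  # iterdir() does not recurse
        if not path.is_file():
            continue  # subdirectories are skipped
        if path.suffix == ".csv":
            with path.open(newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):  # one data point per line
                    samples.append((row[input_column], row[target_column]))
        elif path.suffix == ".json":
            with path.open(encoding="utf-8") as f:
                point = json.load(f)  # a single data point per file
            samples.append((point[input_column], point[target_column]))
    return samples


def infer_classes(samples):
    """Gather every label seen in the dataset; the sorted order is an assumption."""
    return sorted({label for _, label in samples})


# Build a tiny throwaway dataset to exercise the reader.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "reviews.csv").write_text(
        "text,class\n"
        "Mon séjour dans cet hôtel s’est très bien passé,Positive\n"
        "La chambre était sale,Negative\n",
        encoding="utf-8",
    )
    (root / "single.json").write_text(
        json.dumps({"text": "Service impeccable", "class": "Positive"}),
        encoding="utf-8",
    )
    (root / "nested").mkdir()
    (root / "nested" / "ignored.csv").write_text(
        "text,class\nIgnoré,Negative\n", encoding="utf-8"
    )
    samples = load_dataset(root)
    classes = infer_classes(samples)
```

If your columns have different names, the same reader would be called with the input_column and target_column values that match your files, mirroring the environment variables of the same names.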

Environment Variables

epochs: number of epochs for the Training or Full Training pipeline (default 10)
input_column: change this value to match the name of your dataset's input column (default “text”)
target_column: change this value to match the name of your dataset's target column (default “class”)
reset: add this environment variable to retrain the three-layer neural network from scratch and/or change the classes. By default, the model uses transfer learning and keeps the same classes as in the previous training.
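To make the defaults concrete, here is a hedged sketch of how these four variables could be read in Python. The helper read_settings is hypothetical, and treating reset as a presence-only flag is an assumption drawn from the wording "add this environment variable"; the package's actual parsing may differ.

```python
import os


def read_settings(environ=os.environ):
    """Read the documented variables with their documented defaults (hypothetical helper)."""
    return {
        "epochs": int(environ.get("epochs", "10")),
        "input_column": environ.get("input_column", "text"),
        "target_column": environ.get("target_column", "class"),
        # Assumption: only the presence of `reset` matters, not its value.
        "reset": "reset" in environ,
    }


# No variables set: every documented default applies.
defaults = read_settings({})

# A retraining run with a custom target column and a reset of the classes.
custom = read_settings({"epochs": "5", "target_column": "label", "reset": "True"})
```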

Artifacts

The evaluate function produces two artifacts:

predictions.csv: CSV file with 4 columns:
- text: input text being classified.
- class: ground truth class from dataset.
- predicted_class: class predicted by the model.
- confidence: confidence score associated with prediction.

metrics.json: JSON file grouping accuracy and macro-averaged f1, precision and recall, along with f1, precision and recall for each class. Example:

{
  "accuracy": 0.7572500109672546,
  "f1_macro": 0.756912701179931,
  "precision_macro": 0.7594798901045778,
  "recall_macro": 0.7576722549210066,
  "details": [
    {
      "class": "Negative",
      "f1": 0.7659677030609786,
      "precision": 0.7329335793357934,
      "recall": 0.8021201413427562
    },
    {
      "class": "Positive",
      "f1": 0.7478576992988835,
      "precision": 0.7860262008733624,
      "recall": 0.7132243684992571
    }
  ]
}
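A macro average is the unweighted mean of the per-class scores, so the *_macro fields can be recomputed from the "details" list. The sketch below does this for the example metrics above using only the standard library; the helper macro_average is a hypothetical name, not a function exposed by the package.

```python
import json

# Example metrics.json content copied from the Artifacts section above.
metrics = json.loads("""
{
  "accuracy": 0.7572500109672546,
  "f1_macro": 0.756912701179931,
  "precision_macro": 0.7594798901045778,
  "recall_macro": 0.7576722549210066,
  "details": [
    {"class": "Negative", "f1": 0.7659677030609786,
     "precision": 0.7329335793357934, "recall": 0.8021201413427562},
    {"class": "Positive", "f1": 0.7478576992988835,
     "precision": 0.7860262008733624, "recall": 0.7132243684992571}
  ]
}
""")


def macro_average(details, key):
    """Unweighted mean of one per-class metric (hypothetical helper)."""
    return sum(entry[key] for entry in details) / len(details)


f1_macro = macro_average(metrics["details"], "f1")
precision_macro = macro_average(metrics["details"], "precision")
recall_macro = macro_average(metrics["details"], "recall")
```

Because each class counts equally regardless of how many samples it has, macro scores can differ noticeably from accuracy on imbalanced datasets.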

Paper

CamemBERT: a Tasty French Language Model by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.

https://camembert-model.fr/
