AI Center - French Text Classification


French Text Classification

Note:

Out of the box ML packages are deprecated. For more information, check the Deprecation timeline page from the Overview guide.

OS Packages > Language Analysis > FrenchTextClassification

This is a generic text classification model for the French language that uses transfer learning. It must be trained before you can use it for prediction. It is based on CamemBERT embeddings, on top of which a three-layer fully connected neural network is added to classify data. CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture pretrained on the French subcorpus of the multilingual corpus OSCAR by HuggingFace.

Model details

Input type

JSON

Input description

Text to be classified as String: "Mon séjour dans cet hôtel s’est très bien passé"

Output description

JSON string with the predicted class name, the confidence associated with that prediction (between 0 and 1), and a list of all classes with their associated confidence in the “all_predictions” field.

Example:

{
  "class": "Positive",
  "confidence": 0.9996203184127808,
  "all_predictions": [
    {
      "class": "Negative",
      "confidence": 0.0003796691307798028
    },
    {
      "class": "Positive",
      "confidence": 0.9996203184127808
    }
  ]
}
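As an illustration of how a caller might consume this output, the following minimal Python sketch parses the sample response above, ranks the entries in "all_predictions" by confidence, and applies a cutoff. The MIN_CONFIDENCE threshold is a hypothetical value chosen for this example and is not part of the package.

```python
import json

# Sample response copied from the example above.
raw = """
{
  "class": "Positive",
  "confidence": 0.9996203184127808,
  "all_predictions": [
    {"class": "Negative", "confidence": 0.0003796691307798028},
    {"class": "Positive", "confidence": 0.9996203184127808}
  ]
}
"""

result = json.loads(raw)

# Rank every class by confidence, highest first.
ranked = sorted(
    result["all_predictions"],
    key=lambda p: p["confidence"],
    reverse=True,
)
top = ranked[0]

# Hypothetical cutoff: predictions below it could be routed to human review.
MIN_CONFIDENCE = 0.8
accepted = top["confidence"] >= MIN_CONFIDENCE

print(top["class"], top["confidence"], accepted)
```

The top-ranked entry of "all_predictions" should agree with the top-level "class" and "confidence" fields, so either can be used when only the best prediction is needed.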

Pipelines

All three types of pipelines (Full Training, Training and Evaluation) are supported by this package.

When you train the model for the first time, the classes are inferred from the entire dataset provided. Once the model is trained, the same classes are used for predictions and future retraining. To reset the classes (or add new classes), retrain the model with the reset environment variable set.

Using a GPU makes pipeline execution much faster and is recommended for training on large datasets.

Dataset format

This ML Package looks for JSON and CSV files in your dataset (not in subdirectories).

CSV files: each file is expected to have a header row with columns named input_column (default “text”) and target_column (default “class”), followed by one data point per line.
JSON files: each file is expected to contain a single data point with the fields input_column (default “text”) and target_column (default “class”).
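The following Python sketch shows one way to produce files in both formats with the default column names. The file names, the write_sample_dataset helper, and the sample sentences other than the one from the input description are made up for this illustration.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_sample_dataset(folder, input_column="text", target_column="class"):
    """Write one CSV file and one JSON file at the top level of `folder`."""
    folder = Path(folder)

    # Illustrative rows; the second sentence is invented for this example.
    rows = [
        ("Mon séjour dans cet hôtel s’est très bien passé", "Positive"),
        ("Le service était très décevant", "Negative"),
    ]

    # CSV: header row, then one data point per line.
    csv_path = folder / "reviews.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([input_column, target_column])
        writer.writerows(rows)

    # JSON: a single data point per file.
    json_path = folder / "review_001.json"
    point = {
        input_column: "La chambre était propre et calme",
        target_column: "Positive",
    }
    json_path.write_text(json.dumps(point, ensure_ascii=False), encoding="utf-8")
    return csv_path, json_path


with tempfile.TemporaryDirectory() as tmp:
    csv_path, json_path = write_sample_dataset(tmp)
    with csv_path.open(newline="", encoding="utf-8") as f:
        loaded = list(csv.DictReader(f))
    single = json.loads(json_path.read_text(encoding="utf-8"))
```

Both files are written directly into the dataset folder, because files in subdirectories are not picked up.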

Environment variables

epochs: number of epochs for the Training or Full Training pipeline (default 10)
input_column: change this value to match the name of your dataset's input column (default “text”)
target_column: change this value to match the name of your dataset's target column (default “class”)
reset: add this environment variable if you want to retrain the three-layer neural network from scratch and/or change the classes. By default, the model uses transfer learning and keeps the same classes as the previous training.
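To make the defaults concrete, here is a small Python sketch of how these settings could be resolved from a set of environment variables. How the package actually reads them is not documented here: the resolve_settings helper is hypothetical, and treating reset as enabled whenever the variable is present is an assumption based on the wording above.

```python
import os


def resolve_settings(env=None):
    """Resolve pipeline settings from environment variables (illustrative only)."""
    env = os.environ if env is None else env
    return {
        # Documented defaults: 10 epochs, "text" input column, "class" target column.
        "epochs": int(env.get("epochs", 10)),
        "input_column": env.get("input_column", "text"),
        "target_column": env.get("target_column", "class"),
        # Assumption: the mere presence of the variable requests a reset.
        "reset": "reset" in env,
    }


defaults = resolve_settings({})
custom = resolve_settings({"epochs": "3", "target_column": "label", "reset": "true"})
```

In this sketch, an empty environment yields the documented defaults, while the second call overrides the epoch count and target column and requests a reset.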

Artifacts

The Evaluate function produces two artifacts:

predictions.csv: CSV file with 4 columns:
- text: input text being classified.
- class: ground truth class from dataset.
- predicted_class: class predicted by the model.
- confidence: confidence score associated with prediction.

metrics.json: JSON file that groups accuracy, macro-averaged F1, precision, and recall, along with the F1, precision, and recall for each class. Example:

{
  "accuracy": 0.7572500109672546,
  "f1_macro": 0.756912701179931,
  "precision_macro": 0.7594798901045778,
  "recall_macro": 0.7576722549210066,
  "details": [
    {
      "class": "Negative",
      "f1": 0.7659677030609786,
      "precision": 0.7329335793357934,
      "recall": 0.8021201413427562
    },
    {
      "class": "Positive",
      "f1": 0.7478576992988835,
      "precision": 0.7860262008733624,
      "recall": 0.7132243684992571
    }
  ]
}
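The relationship between the macro-averaged values and the per-class entries can be checked with a short Python sketch. It assumes the standard definitions (macro average as the unweighted mean over classes, F1 as the harmonic mean of precision and recall) and uses the sample values above; the helper names are illustrative.

```python
# Sample metrics copied from the example above.
metrics = {
    "accuracy": 0.7572500109672546,
    "f1_macro": 0.756912701179931,
    "precision_macro": 0.7594798901045778,
    "recall_macro": 0.7576722549210066,
    "details": [
        {"class": "Negative", "f1": 0.7659677030609786,
         "precision": 0.7329335793357934, "recall": 0.8021201413427562},
        {"class": "Positive", "f1": 0.7478576992988835,
         "precision": 0.7860262008733624, "recall": 0.7132243684992571},
    ],
}


def macro_average(name):
    """Unweighted mean of a per-class metric."""
    values = [entry[name] for entry in metrics["details"]]
    return sum(values) / len(values)


def f1_from(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed_macro = {name: macro_average(name) for name in ("f1", "precision", "recall")}
recomputed_f1 = {
    entry["class"]: f1_from(entry["precision"], entry["recall"])
    for entry in metrics["details"]
}
```

For the sample file, the recomputed values agree with the reported ones up to floating-point rounding, which makes this a convenient sanity check when post-processing metrics.json.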

Paper

CamemBERT: a Tasty French Language Model by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.

https://camembert-model.fr/
