AI Center - TPOT XGBoost Classification

TPOT XGBoost Classification

OS Packages > Tabular Data > TPOTXGBoostClassification

This model is a generic classification model for tabular data (numerical values only) that must be retrained before it can be used for predictions. It relies on TPOT to automatically find the best model.

TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. TPOT automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data. Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there. TPOT is built on top of scikit-learn, so all the code it generates should look familiar to scikit-learn users.

This version of TPOT uses only XGBoost and the standard set of pre-processing methods to optimize a machine learning pipeline.
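As a rough illustration of what restricting TPOT to XGBoost could look like, the sketch below builds a custom `config_dict` and passes it to `TPOTClassifier`. It assumes the classic TPOT API (`TPOTClassifier`, `fit`, `export`); the helper names, the hyperparameter grids, and the small sample of preprocessors are assumptions chosen for illustration, not the configuration shipped with this ML Package.

```python
def build_xgboost_only_config():
    """Return a TPOT config_dict limited to XGBoost plus a few preprocessors.

    The grids below are illustrative assumptions, not the package's values.
    """
    return {
        # The only estimator TPOT may place at the end of a pipeline.
        "xgboost.XGBClassifier": {
            "n_estimators": [100],
            "max_depth": range(1, 11),
            "learning_rate": [1e-3, 1e-2, 1e-1, 0.5, 1.0],
            "subsample": [0.5, 0.75, 1.0],
        },
        # A small sample of scikit-learn preprocessing steps.
        "sklearn.preprocessing.StandardScaler": {},
        "sklearn.preprocessing.MinMaxScaler": {},
        "sklearn.decomposition.PCA": {
            "svd_solver": ["randomized"],
            "iterated_power": range(1, 11),
        },
    }


def search_pipeline(X, y, max_time_mins=2, scoring="accuracy"):
    """Run a time-boxed TPOT search and export the best pipeline as code."""
    # Imported lazily so the config helper works without TPOT installed.
    from tpot import TPOTClassifier

    tpot = TPOTClassifier(
        config_dict=build_xgboost_only_config(),
        max_time_mins=max_time_mins,
        scoring=scoring,
        verbosity=2,
    )
    tpot.fit(X, y)
    tpot.export("TPOT_pipeline.py")
    return tpot
```

Calling `search_pipeline(X, y)` with a numeric feature matrix and a target vector would run the search for about two minutes and write the best pipeline found to `TPOT_pipeline.py`.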

Input Type

JSON

Input Description

Features used by the model to make predictions. For example: {"Feature1": 12, "Feature2": 222, ..., "FeatureN": 110}
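A minimal sketch of building such an input in Python is shown here. The helper name `build_request` is hypothetical, and the numeric check simply mirrors the "numerical values only" constraint stated for this model.

```python
import json


def build_request(features):
    """Serialize a feature-name -> numeric-value mapping as a JSON string."""
    for name, value in features.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(
                f"feature {name!r} must be numeric, got {type(value).__name__}"
            )
    return json.dumps(features)


payload = build_request({"Feature1": 12, "Feature2": 222, "FeatureN": 110})
print(payload)
```

Validating on the client side is a design choice of this sketch: it surfaces a non-numeric feature before the request reaches the deployed model.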

Output Description

JSON with the predicted class, the confidence associated with that prediction (between 0 and 1), and the label name. Label names are returned only if label encoding was performed by the pipeline, within AI Fabric. Some scikit-learn models do not support confidence scores. If the optimization pipeline outputs such a model, the output contains only the predicted class.

Example:

{
  "predictions": 0,
  "confidences": 0.6,
  "labels": "yes"
}

Or if label encoding was done outside of the model:

{
  "predictions": 0,
  "confidences": 0.6
}
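Because "confidences" and "labels" may be absent, client code should treat them as optional. The following is a minimal sketch of one way to parse the response; the `Prediction` class and `parse_prediction` function are hypothetical names, and only the three JSON keys come from the examples above.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Prediction:
    predicted_class: Any
    confidence: Optional[float] = None  # missing if the model has no scores
    label: Optional[str] = None         # missing if encoding was external


def parse_prediction(raw: str) -> Prediction:
    """Parse the model's JSON output, tolerating the optional fields."""
    data = json.loads(raw)
    return Prediction(
        predicted_class=data["predictions"],
        confidence=data.get("confidences"),
        label=data.get("labels"),
    )


full = parse_prediction('{"predictions": 0, "confidences": 0.6, "labels": "yes"}')
bare = parse_prediction('{"predictions": 0}')
```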

Pipelines

All three types of pipelines (Full Training, Training and Evaluation) are supported by this package.

When you train the model for the first time, the classes are inferred from the entire dataset provided.
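To make the idea of inferring classes concrete, here is a small sketch that collects the distinct target values and assigns each an integer code. The sorted ordering is an assumption borrowed from how scikit-learn's `LabelEncoder` behaves; the encoding actually used by this package is not documented here, and `infer_classes` is a hypothetical helper.

```python
def infer_classes(target_values):
    """Return (classes, encoding) for an iterable of raw target values."""
    # Distinct values in sorted order (an assumption, see the note above).
    classes = sorted(set(target_values))
    encoding = {label: index for index, label in enumerate(classes)}
    return classes, encoding


classes, encoding = infer_classes(["dog", "cat", "dog", "bird", "cat"])
```

A class that never appears in the first training dataset cannot be inferred this way, which is why the whole dataset is inspected.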

Dataset Format

This ML Package looks for CSV files directly in your dataset (not in subdirectories).

The CSV files must follow these two rules:

The first row of the data must contain the header/column names.
All columns, except for the target_column, must be numerical (int, float). The model is not able to perform feature encoding; however, it is able to perform target encoding. If target encoding is performed by the model, the model also returns the label of the target variable at prediction time.

Environment Variables

max_time_mins: time to run the pipeline, in minutes. The longer the training time, the better TPOT's chances of finding a good model (default: 2)
target_column: name of the target column (default: “target”)
scoring: TPOT uses sklearn.model_selection.cross_val_score to evaluate pipelines, and so supports the same scoring functions, that is, the standard scikit-learn scoring metrics (https://scikit-learn.org/stable/modules/model_evaluation.html) (default: “accuracy”)
keep_training: Typical TPOT runs take hours to days to finish (unless the dataset is small), but you can always interrupt the run partway through and see the best results so far. If keep_training is set to True, TPOT continues the training where it left off.
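A minimal sketch of how a training script might read these variables is given below. The function name `read_settings` is hypothetical. The defaults for max_time_mins, target_column, and scoring come from the list above; no default is stated for keep_training, so the `False` used here is an assumption.

```python
import os


def read_settings(environ=None):
    """Read the pipeline settings from environment variables."""
    env = os.environ if environ is None else environ
    return {
        "max_time_mins": int(env.get("max_time_mins", "2")),
        "target_column": env.get("target_column", "target"),
        "scoring": env.get("scoring", "accuracy"),
        # No default is documented for keep_training; False is an assumption.
        "keep_training": env.get("keep_training", "False").strip().lower()
        == "true",
    }


defaults = read_settings({})
custom = read_settings({"max_time_mins": "30", "keep_training": "True"})
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to check without touching the real environment.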

Artifacts

When the run finishes, TPOT exports the Python code for the optimized pipeline to a Python file called “TPOT_pipeline.py”.

Paper

The model is based on two publications:
