# Training pipelines

A Training Pipeline is used to train a new machine learning model. To use this pipeline, the package must contain code to train a model (the `train()` function in the **train.py** file) and code to persist a newly trained model (the `save()` function in the **train.py** file). These, together with a dataset or sub-folder within a dataset, produce a new package version.
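
As an illustration of the two required functions, here is a minimal sketch of what a **train.py** file could look like. The class name `Main` matches the one used later on this page. The toy "model" (a mean over numbers read from `.txt` files) and the `model_path` attribute are assumptions made purely for the example and are not part of the platform contract.

```python
import json
import os


class Main:
    """Hypothetical package entry point exposing train() and save()."""

    def __init__(self):
        self.model = None
        # Assumed location for the persisted model, relative to the package root.
        self.model_path = "model.json"

    def train(self, data_directory):
        # data_directory is the path to the mounted input dataset or folder.
        values = []
        for name in sorted(os.listdir(data_directory)):
            path = os.path.join(data_directory, name)
            if name.endswith(".txt") and os.path.isfile(path):
                with open(path, encoding="utf-8") as f:
                    values.extend(float(line) for line in f if line.strip())
        # Toy "model": the mean of all numbers found in the dataset.
        self.model = {"mean": sum(values) / len(values) if values else 0.0}

    def save(self):
        # Persist the newly trained model so it becomes part of the new package version.
        with open(self.model_path, "w", encoding="utf-8") as f:
            json.dump(self.model, f)
```

A real package would replace the toy mean with actual model fitting and serialization, keeping the same `train()` and `save()` signatures.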

## Creating training pipelines

Create a new pipeline and provide the following information, which is specific to training pipelines:

* In the **Pipeline type** field, select **Training run**.
* In the **Choose input dataset** field, select the dataset or folder that contains your training data. All files in this dataset or folder are available locally while the pipeline runs, and the path to the mounted data is passed as the first argument to your `train()` function (that is, as the `data_directory` variable in the definition `train(self, data_directory)`).
* In the **Enter parameters** section, enter any environment variables defined and used by your pipeline. The environment variables that are set by default are:
  + `artifacts_directory`, with default value **artifacts**: This defines the path to a directory that is persisted as ancillary data related to this pipeline. Most users never need to override this through the UI. Anything can be saved during pipeline execution, including images, PDFs, and subfolders. Specifically, any data your code writes to the directory specified by `os.environ['artifacts_directory']` is uploaded at the end of the pipeline run and is viewable from the **Pipeline details** page.
  + `save_training_data`, with default value **false**: If set to **true**, the folder chosen in **Choose input dataset** is uploaded at the end of the pipeline run as an output of the pipeline, under the `data_directory` directory.

:::note
The pipeline execution might take some time. Check back after a while to see its status.
:::

After the pipeline has run, a new minor version of the package is available and displayed on the **ML Packages > [Package Name]** page. In our example, this is package version 1.1.

On the **Pipelines** page, the pipeline's status changes to **Successful**. The **Pipeline Details** page displays the arbitrary files and folders related to the pipeline run. In our example, the run created a file called `my-training-artifact.txt`.
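
A file such as `my-training-artifact.txt` could be produced by training code along the following lines. This is a minimal sketch: the helper name `write_training_artifact` is hypothetical, and the only platform behavior it relies on is the `artifacts_directory` environment variable described above, falling back to its documented default value of `artifacts`.

```python
import os


def write_training_artifact(text, filename="my-training-artifact.txt"):
    """Write a text file into the artifacts directory and return its path."""
    # Use the environment variable if set, otherwise the documented default.
    artifacts_dir = os.environ.get("artifacts_directory", "artifacts")
    os.makedirs(artifacts_dir, exist_ok=True)
    path = os.path.join(artifacts_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path
```

Calling this helper from within `train()` would leave the file in the directory that is uploaded at the end of the run.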

## Conceptual analogy for building your own training pipeline

:::note
This is a simplified example. Its purpose is to illustrate how datasets and packages interact in a training pipeline. The steps are merely conceptual and do not represent how the platform works.
:::

1. Copy package version 1.0 into `~/mlpackage`.
2. Copy the **input dataset** or the **dataset subfolder** selected from the UI to `~/mlpackage/data`.
3. Execute the following Python code:
   ```
   from train import Main

   m = Main()
   m.train('./data')  # path to the copied dataset
   m.save()           # persist the newly trained model
   ```
4. Persist the contents of `~/mlpackage` as package version 1.1. Persist the artifacts, if any were written, and snapshot the data if `save_training_data` is set to **true**.
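
The four conceptual steps above can also be sketched as a single local script. Like the steps themselves, this is only an analogy and does not represent how the platform works. The function name `run_training_analogy` and its three directory arguments are hypothetical, and a scratch directory stands in for `~/mlpackage`.

```python
import importlib.util
import os
import shutil
import sys
import tempfile


def run_training_analogy(package_dir, dataset_dir, new_version_dir):
    """Mimic the conceptual training steps with plain directory copies."""
    scratch = tempfile.mkdtemp(prefix="training-analogy-")
    work_dir = os.path.join(scratch, "mlpackage")
    previous_cwd = os.getcwd()
    try:
        # Step 1: copy the current package version into the working directory.
        shutil.copytree(package_dir, work_dir)
        # Step 2: copy the selected dataset (or subfolder) to <work_dir>/data.
        shutil.copytree(dataset_dir, os.path.join(work_dir, "data"))

        # Step 3: load train.py from the copied package and run train() and save().
        os.chdir(work_dir)
        sys.path.insert(0, work_dir)  # lets train.py import sibling modules
        spec = importlib.util.spec_from_file_location(
            "train", os.path.join(work_dir, "train.py")
        )
        train_module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(train_module)
        m = train_module.Main()
        m.train("./data")
        m.save()

        # Step 4: persist the working directory as the new package version.
        os.chdir(previous_cwd)
        shutil.copytree(work_dir, new_version_dir)
    finally:
        # Restore the process state and clean up the scratch directory.
        os.chdir(previous_cwd)
        if work_dir in sys.path:
            sys.path.remove(work_dir)
        shutil.rmtree(scratch, ignore_errors=True)
    return new_version_dir
```

In this sketch `new_version_dir` must not exist yet, and the copied `data` folder ends up inside the new version because the whole working directory is persisted as-is.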

## Pipelines output

The `_results.json` file contains a summary of the pipeline run execution, exposing all inputs/outputs and execution times for a training pipeline.

```
{
    "parameters": {
        "pipeline": "<Pipeline_name>",
        "inputs": {
            "package": "<Package_name>",
            "version": "<version_number>",
            "train_data": "<storage_directory>",
            "gpu": "True/False"
        },
        "env": {
            "key": "value",
            ...
        }
    },
    "run_summary": {
        "execution_time": <time>, #in seconds
        "start_at": <timestamp>, #in seconds
        "end_at": <timestamp>, #in seconds
        "outputs": {
            "train_data": "<test_storage_directory>",
            "artifacts_data": "<artifacts_storage_directory>",
            "package": "<Package_name>",
            "version": "<new_version>"
        }
    }
}
```

The **ML Package zip file** is the new package version automatically generated by the training pipeline.

The **Artifacts** folder, visible only if it is not empty, groups all the artifacts generated by the pipeline and saved under the `artifacts_directory` folder.

The **Dataset** folder, which exists only if `save_training_data` was set to **true** (the default value is **false**), is a copy of the input dataset folder.
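
Once downloaded, the `_results.json` summary can be inspected with a few lines of Python. The sketch below assumes the actual file is valid JSON that follows the key layout of the template above (the template itself contains placeholders and comments, so it is not parseable as shown). The function name `summarize_results` is hypothetical.

```python
import json


def summarize_results(path):
    """Extract the main facts about a training run from a _results.json file."""
    with open(path, encoding="utf-8") as f:
        results = json.load(f)

    inputs = results["parameters"]["inputs"]
    summary = results["run_summary"]
    outputs = summary["outputs"]

    return {
        "pipeline": results["parameters"]["pipeline"],
        "package": inputs["package"],
        "input_version": inputs["version"],
        "output_version": outputs["version"],
        "execution_time_seconds": summary["execution_time"],
        # Optional key lookup: returns None if the run recorded no artifacts entry.
        "artifacts_data": outputs.get("artifacts_data"),
    }
```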

## Model governance

Governance in machine learning is something very few companies are equipped to handle. By allowing each model to take a snapshot of the data it was trained on, **AI Center** gives companies data traceability.

In practice, you can take a snapshot of the input data by passing the parameter `save_training_data` = `true`. Afterwards, a user can always navigate to the corresponding **Pipeline Details** page to see exactly what data was used at training time.