Training pipelines
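A training pipeline is used to train a new machine learning model. To use this pipeline, the package must contain code to train a model (the train() function in the train.py file) and code to persist a newly trained model (the save() function in the train.py file). These, together with a dataset or a sub-folder within a dataset, produce a new package version.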
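As a rough illustration of the two functions described above, here is a minimal sketch of what a train.py file might look like. The class name Main, the model_path attribute, the model.json file name, and the file-counting "model" are all assumptions made for this example; only the train(self, data_directory) and save() names come from this page.

```python
import json
import os


class Main(object):
    """Hypothetical package entry point; the class name is an assumption."""

    # Hypothetical location where save() persists the trained model.
    model_path = "model.json"

    def __init__(self):
        # Placeholder "model": a dictionary of learned values.
        self.model = {"file_count": 0}

    def train(self, data_directory):
        # data_directory is the path to the mounted input dataset or folder.
        files = [
            name
            for name in os.listdir(data_directory)
            if os.path.isfile(os.path.join(data_directory, name))
        ]
        # Stand-in for real training: record how many files were seen.
        self.model = {"file_count": len(files)}

    def save(self):
        # Persist the newly trained model so it can be packaged.
        with open(self.model_path, "w") as handle:
            json.dump(self.model, handle)
```

A real package would replace the file-counting step with actual model training and would persist the model in whatever format its prediction code expects.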
Create a new training pipeline as described here. Make sure to provide the following information, which is specific to training pipelines:
- In the Pipeline type field, select Training run.
- In the Choose input dataset field, select the dataset or folder from which you want to import training data. All files in this dataset or folder are available locally while the pipeline runs, and the path to the mounted data is passed as the first argument of your train() function (that is, as the data_directory variable in the definition train(self, data_directory)).
- In the Enter parameters section, enter any environment variables defined and used by your pipeline. The following environment variables are set by default:
  - artifacts_directory, with default value artifacts: Defines the path to a directory that is persisted as ancillary data related to this pipeline. Most users never need to override this through the UI. Anything can be saved during pipeline execution, including images, PDFs, and subfolders. Specifically, any data your code writes to the directory specified by the path os.environ['artifacts_directory'] is uploaded at the end of the pipeline run and can be viewed from the Pipeline details page.
  - save_training_data, with default value false: If set to true, the folder chosen in Choose input dataset is uploaded at the end of the pipeline run as an output of the pipeline, under the directory data_directory.
Note: The pipeline execution might take some time. Check back after a while to see its status.
After the pipeline has run, a new minor version of the package is available and displayed on the ML Packages > [Package Name] page. In our example, this is package version 1.1.
On the Pipelines page, the pipeline's status changes to Successful. The Pipeline Details page displays the arbitrary files and folders related to the pipeline run. In our example, the run created a file called my-training-artifact.txt.
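A file such as my-training-artifact.txt could be produced by training code along the following lines. This is only a sketch: the helper name write_training_artifact and its arguments are hypothetical, and the fallback to "artifacts" simply mirrors the default value documented above.

```python
import os


def write_training_artifact(message, filename="my-training-artifact.txt"):
    """Write a text artifact into the pipeline's artifacts directory (hypothetical helper)."""
    # artifacts_directory is set by the pipeline; fall back to its documented default.
    artifacts_dir = os.environ.get("artifacts_directory", "artifacts")
    # Create the directory if it does not exist yet.
    os.makedirs(artifacts_dir, exist_ok=True)
    artifact_path = os.path.join(artifacts_dir, filename)
    with open(artifact_path, "w") as handle:
        handle.write(message)
    return artifact_path
```

Calling write_training_artifact("training finished") from inside train() would leave the file in the artifacts directory, where it is picked up at the end of the run.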
The _results.json file contains a summary of the pipeline run execution, exposing all inputs and outputs and the execution times for a training pipeline.
{
    "parameters": {
        "pipeline": "<Pipeline_name>",
        "inputs": {
            "package": "<Package_name>",
            "version": "<version_number>",
            "train_data": "<storage_directory>",
            "gpu": "True/False"
        },
        "env": {
            "key": "value",
            ...
        }
    },
    "run_summary": {
        "execution_time": <time>, #in seconds
        "start_at": <timestamp>, #in seconds
        "end_at": <timestamp>, #in seconds
        "outputs": {
            "train_data": "<test_storage_directory>",
            "artifacts_data": "<artifacts_storage_directory>",
            "package": "<Package_name>",
            "version": "<new_version>"
        }
    }
}
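To inspect a downloaded _results.json file programmatically, something like the following sketch may help. It assumes the actual file is valid JSON (without the placeholder comments shown in the template) and uses only the keys listed there; the function name summarize_run is hypothetical.

```python
import json


def summarize_run(results_path):
    """Extract a few fields from a _results.json file (hypothetical helper)."""
    with open(results_path) as handle:
        results = json.load(handle)
    inputs = results["parameters"]["inputs"]
    summary = results["run_summary"]
    outputs = summary["outputs"]
    return {
        "pipeline": results["parameters"]["pipeline"],
        "input_version": inputs["version"],
        "output_version": outputs["version"],
        # The template marks execution_time as being in seconds.
        "execution_time_seconds": summary["execution_time"],
    }
```

Comparing input_version with output_version gives a quick check of which package version a run started from and which one it produced.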
The ML Package zip file is the new package version automatically generated by the training pipeline.
Any artifacts generated by the pipeline are saved under the artifacts_directory folder.
If save_training_data was set to true (the default is false), the outputs also include a copy of the input dataset folder.
Governance in machine learning is something very few companies are equipped to handle. By letting each model take a snapshot of the data it was trained on, AI Center gives companies data traceability.
In practice, you can do this by creating training pipelines with the parameter save_training_data = true, which takes a snapshot of the data passed in as input. Afterward, a user can always navigate to the corresponding Pipeline Details page to see exactly what data was used at training time.