AI Center

Last updated Nov 19, 2024

Training pipelines

A Training Pipeline is used to train a new machine learning model. To use this pipeline, the package must contain code to train a model (the train() function in the train.py file) and code to persist a newly trained model (the save() function in the train.py file). These, together with a dataset or sub-folder within a dataset, produce a new package version.
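As a rough illustration of the two functions mentioned above, a train.py file might look like the following sketch. Only the Main class name and the train(self, data_directory) and save() entry points come from this page; the toy "model" (a file count), the model.json file name, and the optional path argument of save() are assumptions made purely for the example.

```python
import json
import os


class Main:
    """Hypothetical minimal package entry point exposing train() and save()."""

    def __init__(self):
        self.model = None  # placeholder for the trained model state

    def train(self, data_directory):
        # Toy "training": record which files were mounted under data_directory.
        files = sorted(
            name for name in os.listdir(data_directory)
            if os.path.isfile(os.path.join(data_directory, name))
        )
        self.model = {"num_training_files": len(files), "files": files}

    def save(self, path="model.json"):
        # Persist the trained state so it can become part of the new package version.
        # The default file name is an assumption, not a platform requirement.
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(self.model, fh)
```

A real package would replace the file count with actual model fitting and serialization; the sketch only shows where that logic would live.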

Creating training pipelines

Create a new training pipeline as described here. Make sure to provide the following information, which is specific to training pipelines:

  • In the Pipeline type field, select Training run.
  • In the Choose input dataset field, select the dataset or folder from which to import training data. All files in this dataset or folder are available locally while the pipeline runs, and the path to the mounted data is passed as the first argument of your train() function (that is, as the data_directory variable in the definition train(self, data_directory)).
  • In the Enter parameters section, enter any environment variables defined and used by your pipeline. The environment variables that are set by default are:
    • artifacts_directory, with default value artifacts: This defines the path to a directory that is persisted as ancillary data related to this pipeline. Most users never need to override this through the UI. Anything can be saved during pipeline execution, including images, PDFs, and subfolders. Specifically, any data your code writes to the directory specified by os.environ['artifacts_directory'] is uploaded at the end of the pipeline run and can be viewed on the Pipeline details page.
    • save_training_data, with default value false: If set to true, the folder chosen in Choose input dataset is uploaded at the end of the pipeline run as an output of the pipeline, under the data_directory directory.
Note: Pipeline execution might take some time. Check back after a while to see its status.

After the pipeline has run, a new minor version of the package is available and displayed on the ML Packages > [Package Name] page. In our example, this is package version 1.1.

On the Pipelines page, the pipeline's status changes to Successful. The Pipeline Details page displays the arbitrary files and folders related to the pipeline run. In our example, the run created a file called my-training-artifact.txt.
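As a minimal sketch of how training code could produce a file such as my-training-artifact.txt, the helper below writes into the directory named by the artifacts_directory environment variable. The write_artifact function is hypothetical and introduced only for illustration; the fallback to "artifacts" mirrors the default value described above.

```python
import os


def write_artifact(filename, text):
    """Hypothetical helper: write a text file into the pipeline's artifacts directory."""
    # artifacts_directory is set by the pipeline; fall back to its documented default.
    artifacts_dir = os.environ.get("artifacts_directory", "artifacts")
    os.makedirs(artifacts_dir, exist_ok=True)
    path = os.path.join(artifacts_dir, filename)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    return path
```

Calling write_artifact("my-training-artifact.txt", "training notes") from inside train() would, under these assumptions, leave a file for the pipeline to upload at the end of the run.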

Conceptual analogy for building your own training pipeline

Note: This is a simplified example. Its purpose is to illustrate how datasets and packages interact in a training pipeline. The steps are merely conceptual and do not represent how the platform works.
  1. Copy package version 1.0 into ~/mlpackage.
  2. Copy the input dataset or the dataset subfolder selected from the UI to ~/mlpackage/data.
  3. Execute the following Python code:
    from train import Main
    m = Main()
    m.train('./data')
    m.save()
  4. Persist the contents of ~/mlpackage as package version 1.1. Persist artifacts if any were written, and snapshot the data if save_training_data is set to true.
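The four conceptual steps above can be tried locally with a sketch like the one below. It is only an illustration of the analogy, not a description of how the platform works: the simulate_training_run function and its arguments are hypothetical, and "persisting" is reduced to listing the files left in the working directory.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# The same calls as in step 3, run as a one-liner inside the working directory.
TRAIN_SNIPPET = "from train import Main; m = Main(); m.train('./data'); m.save()"


def simulate_training_run(package_dir, dataset_dir, workdir):
    """Hypothetical local stand-in for the conceptual training steps."""
    workdir = Path(workdir)
    # Step 1: copy the current package version into the working directory.
    shutil.copytree(package_dir, workdir, dirs_exist_ok=True)
    # Step 2: copy the selected dataset (or subfolder) to <workdir>/data.
    shutil.copytree(dataset_dir, workdir / "data", dirs_exist_ok=True)
    # Step 3: run train() and save() with the working directory as current directory.
    subprocess.run([sys.executable, "-c", TRAIN_SNIPPET], cwd=workdir, check=True)
    # Step 4: the files now in workdir are what would be persisted as the new version.
    return sorted(
        p.relative_to(workdir).as_posix() for p in workdir.rglob("*") if p.is_file()
    )
```

The returned list may also include interpreter by-products such as compiled bytecode files, which a real packaging step would presumably exclude.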

Pipelines output

The _results.json file contains a summary of the pipeline run execution, exposing all inputs/outputs and execution times for a training pipeline.
{
    "parameters": {
        "pipeline": "<Pipeline_name>",
        "inputs": {
            "package": "<Package_name>",
            "version": "<version_number>",
            "train_data": "<storage_directory>",
            "gpu": "True/False"
        },
        "env": {
            "key": "value",
            ...
        }
    },
    "run_summary": {
        "execution_time": <time>, # in seconds
        "start_at": <timestamp>, # in seconds
        "end_at": <timestamp>, # in seconds
        "outputs": {
            "train_data": "<test_storage_directory>",
            "artifacts_data": "<artifacts_storage_directory>",
            "package": "<Package_name>",
            "version": "<new_version>"
        }
    }
}
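To show how this summary might be consumed, here is a minimal sketch that reads a downloaded _results.json file and extracts a few fields. The key names follow the template above; the summarize_results function is hypothetical, and the sketch assumes the actual file is valid JSON (the template itself contains placeholders and comments, so it is not).

```python
import json


def summarize_results(path):
    """Hypothetical helper: pull key facts out of a training pipeline's _results.json."""
    with open(path, encoding="utf-8") as fh:
        results = json.load(fh)
    inputs = results["parameters"]["inputs"]
    summary = results["run_summary"]
    outputs = summary["outputs"]
    return {
        "package": inputs["package"],
        "input_version": inputs["version"],
        "output_version": outputs["version"],
        "execution_time_s": summary["execution_time"],
    }
```

Comparing input_version with output_version is a quick way to confirm which package version a given run produced.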

The ML Package zip file is the new package version automatically generated by the training pipeline.

The Artifacts folder, visible only if it is not empty, groups all the artifacts generated by the pipeline and saved under the artifacts_directory folder.
The Dataset folder, which exists only if save_training_data was set to true, is a copy of the input dataset folder.

Model governance

Governance in machine learning is something very few companies are equipped to handle. By allowing each model to take a snapshot of the data it was trained on, AI Center gives companies data traceability.

In practice, you can take a snapshot of the input data by passing the parameter save_training_data = true. Afterwards, a user can always navigate to the corresponding Pipeline Details page to see exactly what data was used at training time.
