AI Center User Guide
Last updated Oct 22, 2024

Full pipelines

A Full Pipeline trains a new machine learning model and evaluates the performance of that model in a single run. A preprocessing step also runs before training, allowing data manipulation before an already trained machine learning model is trained further.

To use this pipeline, the package must contain code to process data, train, evaluate, and save a model (the process_data(), train(), evaluate(), and save() functions in the train.py file). This code, together with a dataset or a sub-folder within a dataset, and optionally an evaluation set, produces a new package version, a score (the return value of the evaluate() function for the new version of the model), and any arbitrary outputs the user would like to persist in addition to the score.
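As a rough illustration of the four functions named above, here is a minimal sketch of what a train.py could look like. The "model" (a simple mean), the assumed data format (.csv files with one number per line), the _read_values helper, and the model_path attribute are all hypothetical choices made for this example; only the class name Main and the function names process_data, train, evaluate, and save come from this guide.

```python
import json
import os


class Main:
    """Hypothetical package entry point. The 'model' is just the mean of the training values."""

    def __init__(self):
        self.mean = 0.0                 # state before any training
        self.model_path = "model.json"  # assumed location for the saved model

    def _read_values(self, data_directory):
        # Assumed data format: .csv files holding one number per line.
        values = []
        for name in sorted(os.listdir(data_directory)):
            if name.endswith(".csv"):
                with open(os.path.join(data_directory, name)) as f:
                    values.extend(float(line) for line in f if line.strip())
        return values

    def process_data(self, data_directory):
        # Preprocessing hook, run before training. A no-op in this sketch.
        pass

    def train(self, data_directory):
        values = self._read_values(data_directory)
        if values:
            self.mean = sum(values) / len(values)

    def evaluate(self, data_directory):
        # Score: negative mean absolute error, so a higher score is better.
        values = self._read_values(data_directory)
        if not values:
            return 0.0
        return -sum(abs(v - self.mean) for v in values) / len(values)

    def save(self):
        # Persist the trained state so it can be packaged with the code.
        with open(self.model_path, "w") as f:
            json.dump({"mean": self.mean}, f)
```

The value returned by evaluate() is what the pipeline reports as the score, so it should be a single number.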

Creating full pipelines

Create a new full pipeline as described here. Make sure to provide the following full pipeline specific information:
  • In the Pipeline type field, select Full Pipeline run.
  • In the Choose input dataset field, select a dataset or folder from which you want to import data for full training. All files in this dataset/folder should be available locally during the runtime of the pipeline at the path stored in the data_directory variable.
  • Optionally, in the Choose evaluation dataset field, select a dataset or folder from which you want to import data for evaluation. All files in this dataset/folder should be available locally during the runtime of the pipeline at the path stored in the test_data_directory variable. If you do not select a folder, your pipeline is expected to write data to the directory stored in test_data_directory from within the process_data function. If you do not select a folder and process_data does not write to test_data_directory, the directory passed to the evaluate function will be empty.
  • In the Enter parameters section, enter the environment variables defined and used by your pipeline, if any. The environment variables are:
    • training_data_directory, with default value dataset/training: Defines where the training data is accessible locally for the pipeline. This directory is used as input for the train() function. Most users will never have to override this through the UI. They can write data into os.environ['training_data_directory'] in the process_data function and expect train(self, data_directory) to be called with data_directory set to os.environ['training_data_directory'].
    • test_data_directory, with default value dataset/test: Defines where the test data is accessible locally for the pipeline. This directory is used as input to the evaluate() function. Most users will never have to override this through the UI. They can write data into os.environ['test_data_directory'] in the process_data function and expect evaluate(self, data_directory) to be called with data_directory set to os.environ['test_data_directory'].
    • artifacts_directory, with default value artifacts: Defines the path to a directory that is persisted as ancillary data related to this pipeline. Most, if not all, users will never need to override this through the UI. Anything can be saved during pipeline execution, including images, PDFs, and subfolders. Concretely, any data your code writes to the directory specified by os.environ['artifacts_directory'] is uploaded at the end of the pipeline run and can be viewed from the Pipeline details page.
    • save_training_data, with default value true: If set to true, the training_data_directory folder is uploaded at the end of the pipeline run as an output of the pipeline under the directory training_data_directory.
    • save_test_data, with default value true: If set to true, the test_data_directory folder is uploaded at the end of the pipeline run as an output of the pipeline under the directory test_data_directory.
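To make the role of training_data_directory and test_data_directory more concrete, the sketch below shows one way a process_data step could split the input files between the two directories. The function name split_dataset, the test_fraction and seed parameters, and the file-level random split are assumptions made for this example; only the environment variable names and their default values come from the list above.

```python
import os
import random
import shutil


def split_dataset(data_directory, test_fraction=0.2, seed=0):
    """Copy each file in data_directory into the training or the test directory."""
    # Fall back to the documented defaults if the variables are not set.
    train_dir = os.environ.get("training_data_directory", "dataset/training")
    test_dir = os.environ.get("test_data_directory", "dataset/test")
    os.makedirs(train_dir, exist_ok=True)
    os.makedirs(test_dir, exist_ok=True)

    # Only regular files are split; sub-folders (including the two target
    # directories, if they live inside data_directory) are skipped.
    files = sorted(
        name for name in os.listdir(data_directory)
        if os.path.isfile(os.path.join(data_directory, name))
    )
    random.Random(seed).shuffle(files)  # seeded, so the split is reproducible

    n_test = int(len(files) * test_fraction)
    for index, name in enumerate(files):
        target = test_dir if index < n_test else train_dir
        shutil.copy(os.path.join(data_directory, name), os.path.join(target, name))
    return train_dir, test_dir
```

A process_data(self, data_directory) implementation could simply call split_dataset(data_directory). If an evaluation dataset was selected explicitly, the test directory may already contain files, so a split like this would add to them rather than replace them.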
You can create a full pipeline in two ways, depending on whether you select an evaluation dataset:
  • Explicitly selecting evaluation data

    Watch the following video to learn how to create a full pipeline with the newly trained package version 1.1. Be sure to select the same dataset (in our example, tutorialdataset) both as the input dataset and as the evaluation dataset.

  • Without explicitly selecting evaluation data

    Watch the following video to learn how to create a full pipeline with the newly trained package version 1.1. Select the input dataset, but leave the evaluation dataset unselected.

Results of a full pipeline run execution

Note: The pipeline execution might take some time. Check back on it after a while to see its status.

After the pipeline has executed, its status on the Pipelines page changes to Successful. The Pipeline Details page displays the arbitrary files and folders related to the pipeline run.

  • The train() function trains on train.csv, and not on the unaltered contents of the data folder (example1.txt and example2.txt). The process_data function can be used to dynamically split data based on any user-defined parameters.
  • The first full pipeline runs evaluations on a directory with example1.txt, example2.txt, and test.csv. The second full pipeline runs evaluations on a directory with only test.csv. This difference results from not explicitly selecting an evaluation set when creating the second full pipeline run. This way you can run evaluations on new data from UiPath Robots, as well as dynamically split data already in your project.
  • Each individual component can write arbitrary artifacts as part of a pipeline (histograms, tensorboard logs, distribution plots, etc.).
  • The ML Package zip file is the new package version automatically generated by the training pipeline.
  • The Artifacts folder, only visible if not empty, groups all the artifacts generated by the pipeline, and is saved under the artifacts_directory folder.
  • The Training folder, only visible if save_training_data was set to true, is a copy of the training_data_directory folder.
  • The Test folder, only visible if save_test_data was set to true, is a copy of the test_data_directory folder.
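As an illustration of how a component might write an artifact so that it shows up in the Artifacts folder, here is a small sketch. The helper name write_artifact and the choice of a JSON file are assumptions for this example; the artifacts_directory environment variable and its default value artifacts are taken from the parameter list earlier on this page.

```python
import json
import os


def write_artifact(name, payload):
    """Write a JSON-serializable payload into the artifacts directory."""
    # Fall back to the documented default if the variable is not set.
    artifacts_dir = os.environ.get("artifacts_directory", "artifacts")
    os.makedirs(artifacts_dir, exist_ok=True)
    path = os.path.join(artifacts_dir, name)
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
    return path
```

For example, evaluate() could call write_artifact("metrics.json", {"score": score}) before returning the score. Any other file type (plots, logs, subfolders) can be written to the same directory in the same way.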

Analogy for building your own full pipelines

Here is a conceptually analogous execution of a full pipeline on some package, for example version 1.1, the output of a training pipeline on version 1.0.

Important: This is a simplified example. Its purpose is to illustrate how datasets and packages interact in a full pipeline. The steps are merely conceptual and do not represent how the platform works.
  1. Copy package version 1.1 into ~/mlpackage.
  2. Create a directory called ./dataset.
  3. Copy the contents of the input dataset into ./dataset.
  4. If the user set something in the Choose evaluation dataset field, copy that evaluation dataset and put it in ./dataset/test.
  5. Set environment variables training_data_directory=./dataset/training and test_data_directory=./dataset/test.
  6. Execute the following python code:
    import os

    from train import Main

    m = Main()
    m.process_data('./dataset')                       # preprocess/split the data
    m.evaluate(os.environ['test_data_directory'])     # score the model before training
    m.train(os.environ['training_data_directory'])    # train on the prepared data
    m.evaluate(os.environ['test_data_directory'])     # score the newly trained model
  7. Persist the contents of ~/mlpackage as package version 1.2. Persist artifacts if any were written, and snapshot the data if save_training_data or save_test_data is set to true.
    Note: Because the environment variables training_data_directory and test_data_directory exist, process_data can use them to split data dynamically.
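The conceptual steps above can also be rehearsed locally before uploading a package. The following is a sketch under stated assumptions, not a reproduction of what the platform does: the function name dry_run and its workdir parameter are hypothetical, and creating the training and test directories up front is a convenience added for this example.

```python
import os
import shutil


def dry_run(main, input_dataset, workdir, evaluation_dataset=None):
    """Mimic steps 2 to 6 locally and return (previous_score, current_score)."""
    dataset = os.path.join(workdir, "dataset")
    train_dir = os.path.join(dataset, "training")
    test_dir = os.path.join(dataset, "test")

    # Steps 2 and 3: create ./dataset and copy the input dataset into it.
    shutil.copytree(input_dataset, dataset, dirs_exist_ok=True)  # Python 3.8+
    os.makedirs(train_dir, exist_ok=True)
    os.makedirs(test_dir, exist_ok=True)

    # Step 4: copy the evaluation dataset, if one was chosen.
    if evaluation_dataset is not None:
        shutil.copytree(evaluation_dataset, test_dir, dirs_exist_ok=True)

    # Step 5: expose the directories through environment variables.
    os.environ["training_data_directory"] = train_dir
    os.environ["test_data_directory"] = test_dir

    # Step 6: preprocess, score the existing model, train, score again.
    main.process_data(dataset)
    previous_score = main.evaluate(test_dir)
    main.train(train_dir)
    current_score = main.evaluate(test_dir)
    return previous_score, current_score
```

Calling dry_run(Main(), "path/to/input", "path/to/scratch") with the Main class from your own train.py gives a quick check that the four functions work together before a real pipeline run.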

Pipeline outputs

The _results.json file contains a summary of the pipeline run execution, exposing all inputs/outputs and execution times for a full pipeline.
{
    "parameters": {
        "pipeline": "<Pipeline_name>",
        "inputs": {
            "package": "<Package_name>",
            "version": "<version_number>",
            "input_data": "<storage_directory>",
            "evaluation_data": "<storage_directory>/None",
            "gpu": "True/False"
        },
        "env": {
            "key": "value",
            ...
        }
    },
    "run_summary": {
        "execution_time": <time>, #in seconds
        "start_at": <timestamp>, #in seconds
        "end_at": <timestamp>, #in seconds
        "outputs": {
            "previous_score": <previous_score>, #float
            "current_score": <current_score>, #float
            "training_data": "<training_storage_directory>/None",
            "test_data": "<test_storage_directory>/None",
            "artifacts_data": "<artifacts_storage_directory>",
            "package": "<Package_name>",
            "version": "<new_version>"
        }
    }
}
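Once a run has finished, the two scores in the summary can be compared programmatically. The sketch below assumes the downloaded _results.json is valid JSON with the keys shown in the template, and that a higher score is better; both are assumptions, since the meaning of the score depends on what your evaluate() function returns. The function names load_results and summarize_run are hypothetical.

```python
import json


def load_results(path):
    """Read a _results.json file from disk."""
    with open(path) as f:
        return json.load(f)


def summarize_run(results):
    """Compare the score before and after training for one pipeline run."""
    outputs = results["run_summary"]["outputs"]
    previous = outputs["previous_score"]
    current = outputs["current_score"]
    return {
        "package": outputs["package"],
        "version": outputs["version"],
        "delta": current - previous,
        # Assumes a higher score is better; flip the comparison for loss-like scores.
        "improved": current > previous,
    }
```

For example, summarize_run(load_results("_results.json")) returns a small dictionary that a script could use to decide whether the new package version is worth deploying.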

Model governance

As in other pipeline types, training and evaluation data can be preserved in a snapshot if you set the parameters save_training_data = true and save_test_data = true.
