# Upload file

## Building ML packages

Data scientists build pre-trained models using Python or an AutoML platform. RPA developers then consume those models within a workflow.

### Structuring ML packages

A package must adhere to a small set of requirements. These requirements are separated into components needed for serving a model and components needed for training a model.

#### Serving component

A package must provide at least the following:

* A folder containing a `main.py` file at the root of this folder.
* In this file, a class called `Main` that implements at least two functions:
  + `__init__(self)`: takes no argument and loads your model and/or local data for the model (e.g. word embeddings).
  + `predict(self, input)`: a function to be called at model serving time and **returning a String**.
* A file named `requirements.txt` with dependencies needed to run the model.

Think of the serving component of a package as the model at inference time. At serving time, a container image is created using the provided `requirements.txt` file, and the `predict` function is used as the endpoint to the model.
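The requirements above can be sketched as a minimal `main.py`. This is only an illustration: the word-counting "model" and the `word_count` output key are hypothetical stand-ins for a real trained model and its output format.

```python
import json


class Main(object):
    """Hypothetical serving class; a word counter stands in for a real model."""

    def __init__(self):
        # Load the model and any local data here. A real package would
        # typically deserialize a trained model stored in this folder.
        self.model = lambda text: len(text.split())

    def predict(self, skill_input):
        # Called at serving time; the return value must be a string.
        result = {"word_count": self.model(skill_input)}
        return json.dumps(result)


if __name__ == "__main__":
    # Local smoke test, run outside AI Center.
    print(Main().predict("a customer complaint"))
```

Returning a JSON string keeps the output easy to deserialize on the workflow side, although any string is acceptable.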

#### Training and evaluation component

In addition to inference, a package can optionally be used to train a machine learning model. This is done by providing the following:

* In the same root folder with the `main.py` file, provide a file named `train.py`.
* In this file, provide a class called `Main` that implements the following functions. All of them except `__init__` are optional, but omitting a function limits the types of pipelines that can be run with the package.
  + `__init__(self)`: takes no argument and loads your model and/or data for the model (e.g. word embeddings).
  + `train(self, training_directory)`: takes as input a directory with arbitrarily structured data, runs all the code necessary to train a model. This function is called whenever a **training pipeline** is executed.
  + `evaluate(self, evaluation_directory)`: takes as input a directory with arbitrarily structured data, runs all the code necessary to evaluate a model, and returns a single score for that evaluation. This function is called whenever an **evaluation pipeline** is executed.
  + `save(self)`: takes no argument. This function is called after each call of the `train` function to persist your model.
  + `process_data(self, input_directory)`: takes an `input_directory` input with arbitrarily structured data. This function is only called when a **full pipeline** is executed. In the execution of a full pipeline, this function can perform arbitrary data transformations and can split the data. Specifically, any data saved to the path pointed to by the environment variable `training_data_directory` is the input to the `train` function, and any data saved to the path pointed to by the environment variable `evaluation_data_directory` is the input to the `evaluate` function above.
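As a rough sketch of how these functions fit together, the following hypothetical `train.py` uses the mean of a list of numbers as its "model". The file names `data.json` and `model.json`, the helper methods, and the scoring formula are assumptions made for illustration; only the function names and the two environment variables come from the description above.

```python
import json
import os


class Main(object):
    """Hypothetical train.py: the 'model' is the mean of a list of numbers."""

    def __init__(self):
        self.model_path = "model.json"  # assumed file name
        self.mean = 0.0

    def process_data(self, input_directory):
        # Full pipeline only: split the raw data into two thirds for
        # training and one third for evaluation.
        values = self._read(input_directory)
        split = (2 * len(values)) // 3
        self._write(os.environ["training_data_directory"], values[:split])
        self._write(os.environ["evaluation_data_directory"], values[split:])

    def train(self, training_directory):
        values = self._read(training_directory)
        self.mean = sum(values) / len(values)

    def evaluate(self, evaluation_directory):
        # Returns a single score: the negative mean absolute error.
        values = self._read(evaluation_directory)
        return -sum(abs(v - self.mean) for v in values) / len(values)

    def save(self):
        # Called after train to persist the model.
        with open(self.model_path, "w") as f:
            json.dump({"mean": self.mean}, f)

    @staticmethod
    def _read(directory):
        with open(os.path.join(directory, "data.json")) as f:
            return json.load(f)

    @staticmethod
    def _write(directory, values):
        os.makedirs(directory, exist_ok=True)
        with open(os.path.join(directory, "data.json"), "w") as f:
            json.dump(values, f)
```

To try this locally, set the two environment variables to scratch directories, then call `process_data`, `train`, `evaluate`, and `save` in that order.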

### Handling data types

To make **UiPath® AI Center** easier to use within an RPA workflow, a package is declared with one of three input types: **String**, **File**, or **Files**. The input type is set when the package is uploaded.

#### String data

This is a sequence of characters. Any data that can be serialized can be used with a package. If used within an RPA workflow, the data can be serialized by the Robot (for example using a custom activity) and sent as a string. The package uploader must have selected `JSON` as the package’s input type.

The deserialization of data is done in the `predict` function. Check the following examples for deserializing data in Python:

```
import io
import json

import numpy as np
import pandas as pd

# Robot sends raw string to ML Skill Activity
# E.g. skill_input='a customer complaint'
def predict(self, skill_input):
  example = skill_input  # No extra processing

# Robot sends JSON-formatted string to ML Skill Activity
# E.g. skill_input='{"email": "a customer complaint", "date": "mm:dd:yy"}'
def predict(self, skill_input):
  example = json.loads(skill_input)

# Robot sends JSON-formatted string with number array to ML Skill Activity
# E.g. skill_input='[10, 15, 20]'
def predict(self, skill_input):
  example = np.array(json.loads(skill_input))

# Robot sends JSON-formatted pandas dataframe
# E.g. skill_input='{"row 1":{"col 1":"a","col 2":"b"},
#                    "row 2":{"col 1":"c","col 2":"d"}}'
def predict(self, skill_input):
  # StringIO avoids the literal-string deprecation in newer pandas versions
  example = pd.read_json(io.StringIO(skill_input))
```

#### File data

This informs the [ML Skill Activity](https://docs.uipath.com/activities/other/latest/document-understanding/ml-skills) making calls to this model to expect a path to a file. Specifically, the activity reads the file from the file system and sends it to the `predict` function as a serialized byte string. Thus the RPA developer can pass a path to a file, instead of having to read and serialize the file in the workflow itself.

Within the workflow, the input to the activity is just the path to the file. The activity reads the file, serializes it, and sends the file bytes to the `predict` function. Deserialization is also done in the `predict` function. In the general case, the bytes are read directly into a file-like object as follows:

```
import io

# ML Package has been uploaded with *file* as input type. The ML Skill Activity
# expects a file path. Any file type can be passed as input and it will be serialized.
def predict(self, skill_input):
  file_like = io.BytesIO(skill_input)
```

Reading the serialized bytes as above is equivalent to opening a file with the read binary flag turned on. To test the model locally, read a file as a binary file. The following shows an example of reading an image file and testing it locally:

```
# main.py where model input is an image
import io

from PIL import Image

class Main(object):
   ...

   def predict(self, skill_input):
      image = Image.open(io.BytesIO(skill_input))
      ...

if __name__ == '__main__':
   # Test the ML Package locally
   with open('./image-to-test-locally.png', 'rb') as input_file:
      file_bytes = input_file.read()
   m = Main()
   print(m.predict(file_bytes))
```

The following shows an example of reading a `csv` file and using a pandas dataframe in the `predict` function:

```
# main.py where model input is a csv file
import io

import pandas as pd

class Main(object):
   ...

   def predict(self, skill_input):
      data_frame = pd.read_csv(io.BytesIO(skill_input))
      ...

if __name__ == '__main__':
   # Test the ML Package locally
   with open('./csv-to-test-locally.csv', 'rb') as input_file:
      file_bytes = input_file.read()
   m = Main()
   print(m.predict(file_bytes))
```

#### Files data

This informs AI Center that the [ML Skill Activity](https://docs.uipath.com/activities/other/latest/document-understanding/ml-skills) making calls to this model expects a list of file paths. As in the previous case, the activity reads and serializes each file, and sends a list of byte strings to the `predict` function.

A list of files can be sent to a skill. Within the workflow, the input to the activity is a string with paths to the files, separated by a comma.

When uploading a package, the data scientist selects **list of files** as the input type. The data scientist then has to deserialize each of the sent files. The input to the `predict` function is a list of byte strings, where each element is the serialized content of one file.
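A minimal sketch of a `predict` function for the **files** input type might look like the following. It assumes `skill_input` arrives as a Python list of byte strings, as described above; the `file_sizes` output key is hypothetical and stands in for real per-file processing.

```python
import io
import json


class Main(object):
    def __init__(self):
        pass  # a real package would load its model here

    def predict(self, skill_input):
        # skill_input is assumed to be a list of byte strings, one per file.
        sizes = []
        for file_bytes in skill_input:
            # Same approach as for a single file: wrap the bytes in a
            # file-like object, then hand it to whatever reader is needed.
            file_like = io.BytesIO(file_bytes)
            sizes.append(len(file_like.read()))
        return json.dumps({"file_sizes": sizes})
```

To test locally, read each file with the `'rb'` flag and pass the resulting list of byte strings to `predict`.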

## Persisting arbitrary data

In `train.py`, any executed pipeline can persist arbitrary data, called pipeline output. Any data written to the directory whose path is given by the `artifacts` environment variable is persisted and can be viewed at any point on the **Pipeline Details Page**. Typically, graphs and statistics from training or evaluation jobs are saved in the `artifacts` directory and are accessible from the UI at the end of the pipeline run.

```
# train.py where some historical plots are saved in the ./artifacts directory during Full Pipeline execution
# Full pipeline (using process_data) will automatically split data.csv in 2/3 train.csv (which will be in the directory passed to the train function) and 1/3 test.csv
import os

import pandas as pd
from sklearn.model_selection import train_test_split

class Main(object):
   ...

   def process_data(self, data_directory):
      d = pd.read_csv(os.path.join(data_directory, 'data.csv'))
      d = self.clean_data(d)
      d_train, d_test = train_test_split(d, test_size=0.33, random_state=42)
      d_train.to_csv(os.path.join(data_directory, 'training', 'train.csv'), index=False)
      d_test.to_csv(os.path.join(data_directory, 'test', 'test.csv'), index=False)
      self.save_artifacts(d_train, 'train_hist.png', os.environ["artifacts"])
      self.save_artifacts(d_test, 'test_hist.png', os.environ["artifacts"])

   ...

   def save_artifacts(self, data, file_name, artifact_directory):
      plot = data.hist()
      fig = plot[0][0].get_figure()
      fig.savefig(os.path.join(artifact_directory, file_name))

   ...
```

## Using TensorFlow

During model development, the TensorFlow graph must be loaded on the same thread as the one used for serving. To do so, the default graph must be used.

Check the following example with the necessary modifications:

```
import tensorflow as tf

class Main(object):
  def __init__(self):
    self.graph = tf.get_default_graph() # Add this line (TensorFlow 1.x API)
    ...

  def predict(self, skill_input):
    with self.graph.as_default():
      ...
```

## Information on GPU usage

When GPU is enabled at skill creation time, the skill is deployed on an [image](https://hub.docker.com/r/nvidia/cuda/tags) with NVIDIA GPU driver 418, CUDA Toolkit 10.0, and the CUDA Deep Neural Network library (cuDNN) 7.6.5 runtime library.

## Examples

### Simple ready-to-serve ML model with no training

In this example, the business problem does not require model retraining, thus the package must contain the serialized model `IrisClassifier.sav` that will be served.

1. Initial project tree (without **main.py** and **requirements.txt**):
   ```
   IrisClassifier/
     - IrisClassifier.sav
   ```
2. Sample **main.py** to be added to the root folder:
   ```
   import json

   # Available in the pinned scikit-learn 0.19.0; newer versions use `import joblib`
   from sklearn.externals import joblib

   class Main(object):
      def __init__(self):
         self.model = joblib.load('IrisClassifier.sav')

      def predict(self, X):
         X = json.loads(X)
         result = self.model.predict_proba(X)
         return json.dumps(result.tolist())
   ```
3. Add **requirements.txt**:
   ```
   scikit-learn==0.19.0
   ```

:::note
Some constraints on pip libraries must be respected. Make sure that your libraries can be installed under the following constraints file:
:::

```
itsdangerous<2.1.0
Jinja2<3.0.5
Werkzeug<2.1.0
click<8.0.0
```

To test this, save the constraints above as `constraints.txt`, run the following command in a fresh environment, and make sure that all libraries install properly:

```
pip install -r requirements.txt -c constraints.txt
```

4. Final folder structure:
   ```
   IrisClassifier/
     - IrisClassifier.sav
     - main.py
     - requirements.txt
   ```

### Simple ready-to-serve model with training enabled

In this example, the business problem requires the model to be retrained. Building on the package described above, you may have the following:

1. Initial project tree (serving-only package):
   ```
   IrisClassifier/
     - IrisClassifier.sav
     - main.py
     - requirements.txt
   ```
2. Sample **train.py** to be added to the root folder:
   ```
   import os

   import pandas as pd
   from sklearn.externals import joblib

   class Main(object):
      def __init__(self):
          self.model_path = './IrisClassifier.sav'
          self.model = joblib.load(self.model_path)

      def train(self, training_directory):
          (X,y) = self.load_data(os.path.join(training_directory, 'train.csv'))
          self.model.fit(X,y)

      def evaluate(self, evaluation_directory):
          (X,y) = self.load_data(os.path.join(evaluation_directory, 'evaluate.csv'))
          return self.model.score(X,y)

      def save(self):
          joblib.dump(self.model, self.model_path)

      def load_data(self, path):
          # The last column in csv file is the target column for prediction.
          df = pd.read_csv(path)
          X = df.iloc[:, :-1].values
          y = df.iloc[:, -1].values
          return X,y
   ```
3. Edit **requirements.txt** if needed:
   ```
   pandas==1.0.1
   scikit-learn==0.19.0
   ```
4. Final folder (package) structure:
   ```
   IrisClassifier/
     - IrisClassifier.sav
     - main.py
     - requirements.txt
     - train.py
   ```
   :::note
   This model can now be served first. Then, as new data points come into the system via Robot or Human-in-the-Loop, training and evaluation pipelines can be created using **train.py**.
   :::

## Upload Zip File

:::important
Before uploading packages, make sure they are built as specified [here](https://docs.uipath.com/ai-center/automation-suite/2023.10/user-guide/upload-file#building-ml-packages). An ML Package in AI Center cannot be named using a Python reserved keyword, such as `class`, `break`, `from`, `finally`, `global`, or `None`. Make sure to choose another name. The listed examples are not exhaustive. The restriction applies because the package name is used in `class <pkg-name>` and `import <pkg-name>`.
:::
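One way to check a candidate name locally is with Python's standard `keyword` module, as in the sketch below. The helper `is_safe_package_name` is hypothetical, and the additional identifier check is an assumption inferred from the `class <pkg-name>` usage; AI Center's own validation may apply different or additional rules.

```python
import keyword


def is_safe_package_name(name):
    """Return True if name is a valid Python identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


if __name__ == "__main__":
    for candidate in ("IrisClassifier", "class", "None"):
        print(candidate, is_safe_package_name(candidate))
```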

Follow these steps to upload an already created package:

1. In the **ML Packages** page, click the **Upload zip file** button. The **Create New Package** page is displayed.
2. In the **Create New Package** page, enter a name for your package.
3. Click **Upload Package** to select the desired `.zip` file, or drag and drop the package `.zip` file into the **Upload package** field.
4. **Optional:** Provide a clear description of the model.

   The description is displayed while deploying a new skill based on this model, as well as on the **ML Packages** page.
5. Select the input type from the drop-down. The possible options are:
   * json
   * file
   * files
6. **Optional:** Enter a clear description of the input expected by the model.
7. **Optional:** Enter a clear description of the output returned by the model.

   These descriptions are visible to RPA developers using the ML Skill Activity in UiPath Studio. As a good practice, we recommend showing an example of the input and output formats to facilitate communication between data scientists and developers.

8. Select the development language of the model from the drop-down. The possible options are:
   * Python 3.7
   * Python 3.8
   * Python 3.8 OpenCV

9. Select whether the machine learning model requires a **GPU**. By default, this is set to **No**. This information is surfaced as a suggestion when a skill is created from this package.

10. Select whether to **enable training** for your model. If you enable it:
    * The package can be used in any pipeline.
    * The validation step checks whether the **train.py** file is implemented in the package. If it is not, the validation fails.

11. Click **Create** to upload the package or **Cancel** to abort the process. The **Create New Package** window is closed and the package is uploaded and displayed along with its details in the **ML Packages > [ML Package Name]** page. It may take a few minutes before your upload is propagated.
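Because validation fails when training is enabled but **train.py** is missing, a quick local check of the package folder before zipping it may save an upload round trip. The following is a minimal sketch: the helper `check_package` is hypothetical, and it only looks for the files named in the packaging requirements, so AI Center's own validation may check more than this.

```python
import os


def check_package(folder, training_enabled=False):
    """Return the required files that are missing from the package folder."""
    required = ["main.py", "requirements.txt"]
    if training_enabled:
        required.append("train.py")
    return [name for name in required
            if not os.path.isfile(os.path.join(folder, name))]
```

An empty list means that all expected files are present, for example `check_package("IrisClassifier", training_enabled=True)` for the package built earlier.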