AI Center

2021.10

AI Center User Guide

Last updated: March 11, 2024

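Building ML packages

Data scientists build pre-trained models using Python or an AutoML platform. RPA developers then consume those models within a workflow.

Structuring ML packages

A package must adhere to a small set of requirements. These requirements are separated into the components needed for serving a model and the components needed for training a model.

Serving component

A package must provide at least the following:

A folder containing a main.py file at the root of this folder.
In this file, a class called Main that implements at least two functions:
- __init__(self): takes no arguments and loads your model and/or local data for the model (for example, word embeddings).
- predict(self, input): the function called at model serving time; it returns a String.
A file named requirements.txt listing the dependencies needed to run the model.

Think of the serving component of a package as the model at inference time. At serving time, a container image is created using the provided requirements.txt file, and the predict function is used as the endpoint to the model.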
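
As a rough illustration of the serving contract described above, the following minimal main.py sketch defines a Main class whose predict function returns a String. The word-count "model" and the shape of the JSON response are placeholders invented for this example; they are not part of AI Center.

```python
import json


class Main(object):
    def __init__(self):
        # Takes no arguments. A real package would load its model or
        # local data here; this placeholder "model" just counts words.
        self.model = lambda text: len(text.split())

    def predict(self, skill_input):
        # Called at serving time; the return value must be a string.
        word_count = self.model(skill_input)
        return json.dumps({"word_count": word_count})


if __name__ == "__main__":
    # Local smoke test, run outside of AI Center.
    print(Main().predict("a customer complaint"))
```

Returning a JSON string rather than a bare number is one possible design choice: it keeps the response a String while leaving room for more fields later.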

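Training and evaluation component

In addition to inference, a package can optionally be used to train a machine learning model. To do so, provide the following:

In the same root folder as the main.py file, provide a file named train.py.
In this file, provide a class called Main that implements at least four functions. All of the functions below except __init__ are optional, but omitting one limits the types of pipelines that can be run with the corresponding package.
- __init__(self): takes no arguments and loads your model and/or data for the model (for example, word embeddings).
- train(self, training_directory): takes as input a directory containing arbitrarily structured data and runs all the code needed to train the model. This function is called whenever a training pipeline is executed.
- evaluate(self, evaluation_directory): takes as input a directory containing arbitrarily structured data, runs all the code needed to evaluate the model, and returns a single score for that evaluation. This function is called whenever an evaluation pipeline is executed.
- save(self): takes no arguments. This function is called after each call to the train function to persist your model.
- process_data(self, input_directory): takes as input a directory, input_directory, containing arbitrarily structured data. This function is called only when a full pipeline is executed. During a full pipeline run, it can perform arbitrary data transformations and can split the data. Specifically, any data saved to the path that the environment variable training_data_directory points to is the input of the train function, and any data saved to the path that the environment variable evaluation_data_directory points to is the input of the evaluate function above.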
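
The five functions listed above can be sketched as a skeleton train.py. This is only an illustration under stated assumptions: the file names data.txt, train.txt, evaluate.txt, and model.json, the helper functions _read and _write, and the toy "model" (a mean of numbers) are all hypothetical. Only the function names and the two environment variables come from the description above.

```python
import json
import os


class Main(object):
    def __init__(self):
        # Takes no arguments. The "model" here is a single number (a mean),
        # standing in for a real estimator.
        self.model_path = "model.json"
        self.mean = 0.0

    def process_data(self, input_directory):
        # Full pipeline only: split the raw data roughly 2/3 and 1/3.
        values = self._read(os.path.join(input_directory, "data.txt"))
        cut = max(1, (2 * len(values)) // 3)
        # Files written here become the input of train ...
        self._write(os.environ["training_data_directory"], "train.txt", values[:cut])
        # ... and files written here become the input of evaluate.
        self._write(os.environ["evaluation_data_directory"], "evaluate.txt", values[cut:])

    def train(self, training_directory):
        values = self._read(os.path.join(training_directory, "train.txt"))
        self.mean = sum(values) / len(values)

    def evaluate(self, evaluation_directory):
        # Returns a single score: the negative mean absolute error.
        values = self._read(os.path.join(evaluation_directory, "evaluate.txt"))
        return -sum(abs(v - self.mean) for v in values) / len(values)

    def save(self):
        # Persists the trained model to disk.
        with open(self.model_path, "w") as f:
            json.dump({"mean": self.mean}, f)

    @staticmethod
    def _read(path):
        # Hypothetical helper: one number per non-empty line.
        with open(path) as f:
            return [float(line) for line in f if line.strip()]

    @staticmethod
    def _write(directory, name, values):
        # Hypothetical helper: writes one number per line.
        with open(os.path.join(directory, name), "w") as f:
            f.write("\n".join(str(v) for v in values))
```

To try the skeleton locally, set the two environment variables to existing directories, then call process_data, train, save, and evaluate in that order on a directory containing data.txt.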

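Handling data types

To make AI Center easier to use within an RPA workflow, a package can be denoted as having one of three input types: String, File, and Files (set when the package is uploaded).

String data

This is a sequence of characters. Any data that can be serialized can be used with a package. When used within an RPA workflow, the data can be serialized by the Robot (for example, using a custom activity) and sent as a string. The package uploader must have selected json as the package's input type.

The deserialization of data is done in the predict function. Below are a handful of examples for deserializing data in Python: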

# Robot sends raw string to ML Skill Activity
# E.g. skill_input='a customer complaint'
def predict(self, skill_input):
  example = skill_input  # No extra processing

# Robot sends JSON-formatted string to ML Skill Activity
# E.g. skill_input='{"email": "a customer complaint", "date": "mm:dd:yy"}'
def predict(self, skill_input):
  import json
  example = json.loads(skill_input)

# Robot sends JSON-formatted string with number array to ML Skill Activity
# E.g. skill_input='[10, 15, 20]'
def predict(self, skill_input):
  import json
  import numpy as np
  example = np.array(json.loads(skill_input))

# Robot sends JSON-formatted pandas dataframe
# E.g. skill_input='{"row 1":{"col 1":"a","col 2":"b"},
#                    "row 2":{"col 1":"c","col 2":"d"}}'
def predict(self, skill_input):
  import io
  import pandas as pd
  # Wrapping the string in StringIO avoids the deprecation of passing
  # literal JSON strings to read_json in recent pandas releases.
  example = pd.read_json(io.StringIO(skill_input))
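
The round trip for String data can also be sketched end to end. In the example below, json.dumps stands in for whatever serialization the Robot performs, and the email_length "prediction" and the error response are hypothetical. The point is that the payload must be valid JSON (double-quoted keys and values) for json.loads to accept it.

```python
import json


class Main(object):
    def __init__(self):
        pass  # Nothing to load in this illustration.

    def predict(self, skill_input):
        # skill_input arrives as a string; decode it before use.
        try:
            example = json.loads(skill_input)
        except ValueError:
            # Not valid JSON: report the problem as a string instead of crashing.
            return json.dumps({"error": "input is not valid JSON"})
        if not isinstance(example, dict):
            return json.dumps({"error": "expected a JSON object"})
        # Hypothetical "prediction": the length of the email field.
        return json.dumps({"email_length": len(example.get("email", ""))})
```

Calling predict with json.dumps({"email": "a customer complaint", "date": "mm:dd:yy"}) returns a JSON string, while a single-quoted payload such as "{'email': 'text'}" takes the error branch.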

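File data

This informs the ML Skill Activity making calls to this model to expect a path to a file. Specifically, the activity reads the file from the file system and sends it to the predict function as a serialized byte string. RPA developers can therefore pass a path to a file instead of having to read and serialize the file in the workflow itself.

Within the workflow, the input to the activity is just the path to the file. The activity reads the file, serializes it, and sends the file bytes to the predict function. The deserialization of data is again done in the predict function; in the general case, the bytes are read directly into a file-like object, as follows: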

# ML Package has been uploaded with *file* as input type. The ML Skill Activity
# expects a file path. Any file type can be passed as input and it will be serialized.
def predict(self, skill_input):
  import io
  file_like = io.BytesIO(skill_input)

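Reading serialized bytes as shown above is equivalent to opening a file with the read-binary flag turned on. To test the model locally, read a file as a binary file. The following shows an example of reading an image file and testing it locally: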

# main.py where model input is an image
class Main(object):
   ...

   def predict(self, skill_input):
      import io
      from PIL import Image
      image = Image.open(io.BytesIO(skill_input))
   ...

if __name__ == '__main__':
   # Test the ML Package locally
   with open('./image-to-test-locally.png', 'rb') as input_file:
      file_bytes = input_file.read()
   m = Main()
   print(m.predict(file_bytes))

The following shows an example of reading a csv file and using a pandas dataframe in the predict function:

# main.py where model input is a csv file
class Main(object):
   ...
   def predict(self, skill_input):
      import io
      import pandas as pd
      data_frame = pd.read_csv(io.BytesIO(skill_input))
      ...

if __name__ == '__main__':
   # Test the ML Package locally
   with open('./csv-to-test-locally.csv', 'rb') as input_file:
      file_bytes = input_file.read()
   m = Main()
   print(m.predict(file_bytes))

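Files data

This informs AI Center that the ML Skill Activity making calls to this model expects a list of file paths. As in the previous case, the activity reads and serializes each file and sends a list of byte strings to the predict function.

A list of files can be sent to a skill. Within the workflow, the input to the activity is a string containing the file paths, separated by commas.

When uploading the package, the data scientist selects list of files as the input type. The data scientist then has to deserialize each of the sent files (as explained above). The input to the predict function is a list of bytes in which each element is the byte string of one file.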
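
A predict function for the list-of-files input type might look like the sketch below. The line-counting logic is a placeholder, and read_files_locally is a hypothetical helper that imitates, for local testing only, what the activity is described as doing: splitting the comma-separated string of paths and reading each file in binary mode.

```python
import io
import json


class Main(object):
    def __init__(self):
        pass  # No model is needed for this illustration.

    def predict(self, skill_input):
        # skill_input is a list; each element is the byte string of one file.
        line_counts = []
        for file_bytes in skill_input:
            file_like = io.BytesIO(file_bytes)
            line_counts.append(sum(1 for _ in file_like))
        return json.dumps(line_counts)


def read_files_locally(comma_separated_paths):
    # Hypothetical local stand-in for the activity: split the paths on
    # commas and read each file as binary.
    file_bytes_list = []
    for path in comma_separated_paths.split(","):
        with open(path.strip(), "rb") as input_file:
            file_bytes_list.append(input_file.read())
    return file_bytes_list
```

For example, Main().predict(read_files_locally("a.txt,b.txt")) returns a JSON list with one line count per file. Note that this simple split does not handle paths that themselves contain commas.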

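Persisting arbitrary data

In train.py, any executed pipeline can persist arbitrary data, called pipeline output. Any data written to the directory path given by the environment variable artifacts is persisted and can be viewed at any time on the Pipeline Details page. Typically, any kind of graph or statistic from the training/evaluation jobs can be saved in the artifacts directory and accessed from the UI at the end of the pipeline run.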

# train.py where some histogram plots are saved in the artifacts directory during Full Pipeline execution
# Full pipeline (using process_data) will automatically split data.csv in 2/3 train.csv (which will be in the directory passed to the train function) and 1/3 test.csv
import os
import pandas as pd
from sklearn.model_selection import train_test_split
class Main(object):
   ...
   def process_data(self, data_directory):
      d = pd.read_csv(os.path.join(data_directory, 'data.csv'))
      d = self.clean_data(d)
      d_train, d_test = train_test_split(d, test_size=0.33, random_state=42)
      d_train.to_csv(os.path.join(data_directory, 'training', 'train.csv'), index=False)
      d_test.to_csv(os.path.join(data_directory, 'test', 'test.csv'), index=False)
      self.save_artifacts(d_train, 'train_hist.png', os.environ["artifacts"])
      self.save_artifacts(d_test, 'test_hist.png', os.environ["artifacts"])
   ...

   def save_artifacts(self, data, file_name, artifact_directory):
      plot = data.hist()
      fig = plot[0][0].get_figure()
      fig.savefig(os.path.join(artifact_directory, file_name))
...
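
Statistics can be persisted in the same way as plots. The sketch below writes a dictionary of metrics as JSON into the directory named by the artifacts environment variable. The helper name save_metrics and the file name metrics.json are hypothetical, chosen only for this example.

```python
import json
import os


def save_metrics(metrics, file_name="metrics.json"):
    # The artifacts environment variable names the directory whose
    # contents are persisted as pipeline output.
    artifact_directory = os.environ["artifacts"]
    path = os.path.join(artifact_directory, file_name)
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)
    return path
```

A train or evaluate function could call save_metrics({"accuracy": score}) before returning, so that the value is available alongside any saved plots.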

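Using TensorFlow

During model development, the TensorFlow graph must be loaded on the same thread as the one used for serving. To do so, the default graph must be used.

Below is an example with the necessary modifications: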

import tensorflow as tf
class Main(object):
  def __init__(self):
    # TensorFlow 1.x API; under TensorFlow 2.x the equivalent call is
    # tf.compat.v1.get_default_graph().
    self.graph = tf.get_default_graph() # Add this line
    ...

  def predict(self, skill_input):
    with self.graph.as_default():
      ...

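Information on GPU usage

When GPU is enabled at skill creation time, the skill is deployed on an image with NVIDIA GPU driver 418, CUDA Toolkit 10.0, and the CUDA Deep Neural Network library (cuDNN) 7.6.5 runtime library.

Examples

Simple ready-to-serve ML model with no training

In this example, the business problem does not require model retraining, so the package must contain the serialized model IrisClassifier.sav that will be served.

Initial project tree (without main.py and requirements.txt):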

IrisClassifier/
  - IrisClassifier.sav

Sample main.py to be added to the root folder:

# sklearn.externals.joblib is available in scikit-learn 0.19.0 (the version
# pinned in requirements.txt); newer releases require `import joblib` instead.
from sklearn.externals import joblib
import json
class Main(object):
   def __init__(self):
      self.model = joblib.load('IrisClassifier.sav')
   def predict(self, X):
      X = json.loads(X)
      result = self.model.predict_proba(X)
      return json.dumps(result.tolist())

Add a requirements.txt:
```
scikit-learn==0.19.0
```
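Note: Some constraints on pip libraries must be respected. Make sure that you can install your libraries under the following constraints file: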
```
itsdangerous<2.1.0
Jinja2<3.0.5
Werkzeug<2.1.0
click<8.0.0
```
To test this, you can run the following command in a fresh environment and make sure that all libraries install properly:
```
pip install -r requirements.txt -c constraints.txt
```

Final folder structure:

IrisClassifier/
  - IrisClassifier.sav
  - main.py
  - requirements.txt

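Simple ready-to-serve model with training enabled

In this example, the business problem requires the model to be retrained. Building on the package described above, you may have the following:

Initial project tree (serving-only package):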

IrisClassifier/
  - IrisClassifier.sav
  - main.py
  - requirements.txt

Sample train.py to be added to the root folder:

import os
import pandas as pd
# Same joblib import as main.py, matching the pinned scikit-learn 0.19.0;
# newer scikit-learn releases require `import joblib` instead.
from sklearn.externals import joblib
class Main(object):
   def __init__(self):
       self.model_path = './IrisClassifier.sav'
       self.model = joblib.load(self.model_path)

   def train(self, training_directory):
       (X,y) = self.load_data(os.path.join(training_directory, 'train.csv'))
       self.model.fit(X,y)
   def evaluate(self, evaluation_directory):
       (X,y) = self.load_data(os.path.join(evaluation_directory, 'evaluate.csv'))
       return self.model.score(X,y)
   def save(self):
       joblib.dump(self.model, self.model_path)
   def load_data(self, path):
       # The last column in csv file is the target column for prediction.
       df = pd.read_csv(path)
       # get_values() was removed in pandas 1.0; to_numpy() replaces it.
       X = df.iloc[:, :-1].to_numpy()
       y = df.iloc[:, -1].to_numpy()
       return X,y

Edit the requirements.txt if needed:

pandas==1.0.1
scikit-learn==0.19.0

Final folder (package) structure:
```
IrisClassifier/
  - IrisClassifier.sav
  - main.py
  - requirements.txt
  - train.py
```
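Note: This model can now be served first. As new data points come into the system through Robots or a human in the loop, training and evaluation pipelines can be created using train.py.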