AI Center Installation Guide
Last updated Nov 11, 2024

Provisioning a GPU

Note: A GPU can be installed only on an agent node, not on a server node. Do not use or modify the gpu_support flag in cluster_config.json. Instead, follow the instructions below to add a dedicated agent node with GPU support to the cluster.

Currently, Automation Suite supports only NVIDIA GPU drivers. See the list of GPU-supported operating systems.

You can find cloud-specific instance types for the GPU nodes here.

Follow the steps from Adding a new node to the cluster to ensure the agent node is added correctly.

For more examples of how to deploy NVIDIA CUDA on a GPU, check this page.
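Before installing the driver, you may want to confirm that the GPU hardware is actually visible to the operating system on the agent node. The check below is a minimal sketch, not part of the official procedure; it assumes the standard pciutils tooling is available on the node.

# Optional sanity check: list PCI devices and look for the NVIDIA GPU.
# If nothing is returned, the VM/instance type does not expose a GPU.
lspci | grep -i nvidia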

Installing a GPU driver

  1. Run the following commands to install the GPU driver on the agent node:
    sudo yum install kernel kernel-tools kernel-headers kernel-devel
    sudo reboot
    sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
    sudo sed 's/$releasever/8/g' -i /etc/yum.repos.d/epel.repo
    sudo sed 's/$releasever/8/g' -i /etc/yum.repos.d/epel-modular.repo
    sudo yum config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    sudo yum install cuda
  2. Run the following commands to install the container toolkit:
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
              && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
    sudo dnf clean expire-cache
    sudo yum install -y nvidia-container-runtime.x86_64
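After completing both steps, you can optionally confirm that the pieces are in place before moving on. This quick check is only a sketch, not part of the official procedure; it assumes the packages above placed the nvidia-container-runtime binary on the PATH.

# The CUDA driver build needs kernel-devel matching the running kernel
uname -r
rpm -q kernel-devel

# Confirm the NVIDIA container runtime binary is available
which nvidia-container-runtime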

Verifying that the drivers are installed properly

Run the sudo nvidia-smi command on the node to verify that the drivers were installed properly.
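For example, the following commands (an illustration, not an additional required step) print the driver and CUDA versions and list the GPUs detected on the node; if the driver is not loaded, nvidia-smi exits with an error instead.

# Show driver version, CUDA version, and utilization for each detected GPU
sudo nvidia-smi

# List the detected GPUs by name and UUID
sudo nvidia-smi -L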


Note: After the cluster has been provisioned, additional steps are required to configure the provisioned GPUs.

At this point, the GPU drivers have been installed and the GPU nodes have been added to the cluster.
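If you want to double-check that the GPU agent node has joined the cluster before continuing, a standard kubectl query from a server node is enough; this is an optional check, not part of the documented procedure.

# List cluster nodes and their roles; the new GPU node should appear as an agent/worker node
kubectl get nodes -o wide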

Adding the GPU to the agent node

Run the following two commands on the agent node to update its containerd configuration.
cat <<EOF > gpu_containerd.sh
if ! nvidia-smi &>/dev/null;
then
  echo "GPU Drivers are not installed on the VM. Please refer the documentation."
  exit 0
fi
if ! which nvidia-container-runtime &>/dev/null;
then
  echo "Nvidia container runtime is not installed on the VM. Please refer the documentation."
  exit 0
fi
grep "nvidia-container-runtime" /var/lib/rancher/rke2/agent/etc/containerd/config.toml &>/dev/null && info "GPU containerd changes already applied" && exit 0
awk '1;/plugins.cri.containerd]/{print "  default_runtime_name = \\"nvidia-container-runtime\\""}' /var/lib/rancher/rke2/agent/etc/containerd/config.toml > /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
echo -e '\
[plugins.linux]\
  runtime = "nvidia-container-runtime"' >> /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
echo -e '\
[plugins.cri.containerd.runtimes.nvidia-container-runtime]\
  runtime_type = "io.containerd.runc.v2"\
  [plugins.cri.containerd.runtimes.nvidia-container-runtime.options]\
    BinaryName = "nvidia-container-runtime"' >> /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
EOF
sudo bash gpu_containerd.sh
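To see what the script changed, you can inspect the generated containerd template before restarting the services. This check is purely illustrative and uses only the paths that appear in the script above.

# The nvidia-container-runtime entries should now be present in the containerd template
grep -A 3 "nvidia-container-runtime" /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl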
Now run the following commands to restart the rke2 service on the node:
[[ "$(sudo systemctl is-enabled rke2-server 2>/dev/null)" == "enabled" ]] && systemctl restart rke2-server
[[ "$(sudo systemctl is-enabled rke2-agent 2>/dev/null)" == "enabled" ]] && systemctl restart rke2-agent[[ "$(sudo systemctl is-enabled rke2-server 2>/dev/null)" == "enabled" ]] && systemctl restart rke2-server
[[ "$(sudo systemctl is-enabled rke2-agent 2>/dev/null)" == "enabled" ]] && systemctl restart rke2-agent

Enabling the GPU driver post-installation

Run the following commands from any of the primary server nodes.

Navigate to the UiPathAutomationSuite folder.
cd /opt/UiPathAutomationSuite

Enabling in an online installation

DOCKER_REGISTRY_URL=$(cat defaults.json | jq -er ".registries.docker.url")
sed -i "s/REGISTRY_PLACEHOLDER/${DOCKER_REGISTRY_URL}/g" ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
kubectl apply -f ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonset

Enabling in an offline installation

DOCKER_REGISTRY_URL=localhost:30071
sed -i "s/REGISTRY_PLACEHOLDER/${DOCKER_REGISTRY_URL}/g" ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
kubectl apply -f ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonset
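In either case, you can verify that the device plugin rolled out before moving on. This verification is a sketch; it uses only the daemonset name from the commands above and standard kubectl output.

# Wait for the device plugin daemonset to finish rolling out
kubectl -n kube-system rollout status daemonset nvidia-device-plugin-daemonset

# Check that the daemonset reports the expected number of ready pods
kubectl -n kube-system get daemonset nvidia-device-plugin-daemonset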

GPU taints

GPU workloads are scheduled on GPU nodes automatically when a workload requests a GPU. However, normal CPU workloads might also get scheduled on these nodes, reserving their capacity. If you want only GPU workloads to be scheduled on these nodes, you can add a taint to them using the following command, run from the first node.

  • nvidia.com/gpu=present:NoSchedule - non-GPU workloads do not get scheduled on this node unless explicitly specified
  • nvidia.com/gpu=present:PreferNoSchedule - this makes it a preferred condition rather than a hard one like the first option
Replace <node-name> with the corresponding GPU node name in your cluster, and <taint-name> with one of the two options above, in the following command:
kubectl taint node <node-name> <taint-name>
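If you apply the NoSchedule taint, a workload that needs a GPU must both tolerate the taint and request the nvidia.com/gpu resource. The snippet below is a generic Kubernetes illustration of that pairing, not something shipped with Automation Suite; the pod name and the <your-cuda-image> image are placeholders.

# Hypothetical example pod: tolerates the GPU taint and requests one GPU
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-toleration-example
spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "present"
    effect: "NoSchedule"
  containers:
  - name: cuda
    image: <your-cuda-image>
    resources:
      limits:
        nvidia.com/gpu: 1
EOF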

Validating GPU node provisioning

To ensure you have added the GPU nodes successfully, run the following command in the terminal. Its output should include nvidia.com/gpu alongside the CPU and RAM resources.
kubectl describe node <node-name>
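If you prefer a compact view across all nodes, a custom-columns query like the one below (an optional convenience, not part of the official validation) shows the allocatable GPU count per node; nodes without GPUs show <none>.

# Per-node allocatable GPU count; non-GPU nodes show <none>
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"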
