Provisioning a GPU
Do not use the gpu_support flag from cluster_config.json. Instead, follow the instructions below to add a dedicated agent node with GPU support to the cluster.
Currently, Automation Suite only supports Nvidia GPU Drivers. See the list of GPU-supported operating systems.
You can find cloud-specific instance types for the nodes here:
Follow the steps from Adding a new node to the cluster to ensure the agent node is added correctly.
For more examples on how to deploy NVIDIA CUDA on a GPU, check this page.
- Run the following command to install the GPU driver on the agent node:
sudo yum install kernel kernel-tools kernel-headers kernel-devel
sudo reboot
sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
sudo sed 's/$releasever/8/g' -i /etc/yum.repos.d/epel.repo
sudo sed 's/$releasever/8/g' -i /etc/yum.repos.d/epel-modular.repo
sudo yum config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo yum install cuda
- Run the following command to install the container toolkits:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
  && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo dnf clean expire-cache
sudo yum install -y nvidia-container-runtime.x86_64
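The distribution value in the container toolkit command is built by sourcing /etc/os-release and joining its ID and VERSION_ID fields; that string then selects the repository path in the URL. The sketch below isolates that step so it can be inspected on its own. The distribution_of helper is a hypothetical name used only for illustration, not part of Automation Suite or the NVIDIA tooling.

```shell
# Hypothetical helper: print the "$ID$VERSION_ID" string from an
# os-release style file, the same value the repo URL is built from.
# The subshell keeps the sourced variables out of the caller's environment.
distribution_of() {
  ( . "$1" && printf '%s\n' "$ID$VERSION_ID" )
}

# Example (commented out so the sketch has no side effects when sourced):
# distribution_of /etc/os-release
```

For instance, a file containing ID="rhel" and VERSION_ID="8.6" yields rhel8.6.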
Verify if drivers are installed properly
Run the sudo nvidia-smi command on the node to verify that the drivers were installed properly.
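This verification can be preceded by a small pre-flight check that confirms the required tools are on the PATH. A minimal sketch, under the assumption that a missing binary is the most common failure; require_cmd is a hypothetical helper name, and it only checks presence, not whether the driver actually works.

```shell
# Hypothetical helper (not part of Automation Suite): succeed only if the
# named command is available on PATH; otherwise print a message to stderr.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "Missing required command: $1" >&2
  return 1
}

# Example pre-flight check on the agent node (commented out so the sketch
# has no side effects when sourced):
# require_cmd nvidia-smi && require_cmd nvidia-container-runtime && sudo nvidia-smi
```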
At this point, the GPU drivers have been installed and the GPU nodes have been added to the cluster.
Update the containerd configuration of the agent node:
cat <<'EOF' > gpu_containerd.sh
# Abort if the GPU driver or the NVIDIA container runtime is missing.
if ! nvidia-smi &>/dev/null; then
  echo "GPU drivers are not installed on the VM. Please refer to the documentation."
  exit 1
fi
if ! which nvidia-container-runtime &>/dev/null; then
  echo "NVIDIA container runtime is not installed on the VM. Please refer to the documentation."
  exit 1
fi
CONFIG=/var/lib/rancher/rke2/agent/etc/containerd/config.toml
# Skip if the changes were already applied.
grep "nvidia-container-runtime" "$CONFIG" &>/dev/null && echo "GPU containerd changes already applied" && exit 0
# Copy the config to a template, setting the NVIDIA runtime as the default.
awk '1;/plugins.cri.containerd]/{print "  default_runtime_name = \"nvidia-container-runtime\""}' "$CONFIG" > "$CONFIG.tmpl"
# Append the runtime definitions to the template.
cat <<'TOML' >> "$CONFIG.tmpl"

[plugins.linux]
  runtime = "nvidia-container-runtime"

[plugins.cri.containerd.runtimes.nvidia-container-runtime]
  runtime_type = "io.containerd.runc.v2"
  [plugins.cri.containerd.runtimes.nvidia-container-runtime.options]
    BinaryName = "nvidia-container-runtime"
TOML
EOF
sudo bash gpu_containerd.sh
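Before restarting the node service, it can help to confirm that the generated template actually contains the NVIDIA runtime entries. A minimal sketch, assuming the config.toml.tmpl path written by the script above; has_nvidia_runtime is a hypothetical helper name, and reading the file may require root privileges.

```shell
# Hypothetical helper: succeed only if the given containerd config file
# sets the NVIDIA runtime as default and defines its BinaryName option.
has_nvidia_runtime() {
  grep -q 'default_runtime_name = "nvidia-container-runtime"' "$1" &&
    grep -q 'BinaryName = "nvidia-container-runtime"' "$1"
}

# Example (on the agent node, commented out so sourcing has no side effects):
# has_nvidia_runtime /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
#   && echo "template contains the NVIDIA runtime entries"
```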
Restart the rke2 service on the node. The commands below restart whichever of rke2-server or rke2-agent is enabled:
[[ "$(sudo systemctl is-enabled rke2-server 2>/dev/null)" == "enabled" ]] && sudo systemctl restart rke2-server
[[ "$(sudo systemctl is-enabled rke2-agent 2>/dev/null)" == "enabled" ]] && sudo systemctl restart rke2-agent
Run the below commands from any of the primary server nodes.
Navigate to the UiPathAutomationSuite folder:
cd /opt/UiPathAutomationSuite
Online installation (registry URL read from defaults.json):
DOCKER_REGISTRY_URL=$(jq -er ".registries.docker.url" defaults.json)
sed -i "s/REGISTRY_PLACEHOLDER/${DOCKER_REGISTRY_URL}/g" ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
kubectl apply -f ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonset
Offline installation (local registry at localhost:30071):
DOCKER_REGISTRY_URL=localhost:30071
sed -i "s/REGISTRY_PLACEHOLDER/${DOCKER_REGISTRY_URL}/g" ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
kubectl apply -f ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonset
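To preview what the sed step does before the manifest is edited in place, the same substitution can be run on a stream. A minimal sketch: render_registry is a hypothetical helper, not part of the installer. It uses | as the sed delimiter, on the assumption that a registry URL might contain a / but not a |.

```shell
# Hypothetical helper: read a manifest on stdin and print it with every
# REGISTRY_PLACEHOLDER replaced by the registry URL given as $1.
# Assumes the URL contains no "|", "&" or "\" characters.
render_registry() {
  sed "s|REGISTRY_PLACEHOLDER|$1|g"
}

# Example: preview the rendered manifest without modifying the file
# (commented out so sourcing the sketch has no side effects).
# render_registry localhost:30071 < ./Infra_Installer/gpu_plugin/nvidia-device-plugin.yaml
```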
GPU workloads are scheduled on GPU nodes automatically when a workload requests a GPU. However, normal CPU workloads might also be scheduled on these nodes and reserve their capacity. If you want only GPU workloads to be scheduled on these nodes, add taints to them by running the following commands from the first node.
- nvidia.com/gpu=present:NoSchedule - non-GPU workloads do not get scheduled on this node unless explicitly specified
- nvidia.com/gpu=present:PreferNoSchedule - this makes it a preferred condition rather than a hard one like the first option
Replace <node-name> with the corresponding GPU node name in your cluster and <taint-name> with one of the two options above in the following command:
kubectl taint node <node-name> <taint-name>
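Because only two taint values are valid here, it can be useful to validate the effect before running kubectl. The sketch below only prints the resulting command so it can be reviewed first. The gpu_taint_cmd helper and the gpu-node-1 node name are hypothetical and used purely for illustration.

```shell
# Hypothetical helper: print (not run) the kubectl command that applies
# one of the two GPU taints to a node.
#   $1 = node name, $2 = effect (NoSchedule or PreferNoSchedule)
gpu_taint_cmd() {
  case "$2" in
    NoSchedule|PreferNoSchedule)
      echo "kubectl taint node $1 nvidia.com/gpu=present:$2"
      ;;
    *)
      echo "Unsupported effect: $2" >&2
      return 1
      ;;
  esac
}

# Example (commented out so sourcing the sketch has no side effects):
# gpu_taint_cmd gpu-node-1 NoSchedule
```

Appending a hyphen to the taint, as in nvidia.com/gpu=present:NoSchedule-, removes it from the node again with the same kubectl taint node command.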