Automation Suite - GPU node affected by resource unavailability

Description

When configuring a GPU node in Automation Suite 2023.4.0 or 2023.4.1, the node might not expose the GPU as an allocatable resource.

To check if the GPU node is affected by this issue, run the following command:

kubectl describe node <GPU>

If the Allocatable section of the output does not list nvidia.com/gpu, as in the following sample, the node is affected by this issue.

Allocatable:
  cpu:                5400m
  ephemeral-storage:  51938908890
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             113173836Ki
  pods:               500
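
The manual check can also be scripted. The following is a minimal sketch, assuming a POSIX shell with awk available; the function name has_gpu_allocatable is hypothetical and not part of Automation Suite. It reads kubectl describe node output on standard input and succeeds only if nvidia.com/gpu is listed inside the Allocatable section, so an entry under Capacity alone does not count.

```shell
# Hypothetical helper: reads `kubectl describe node` output on stdin.
# Exits 0 if nvidia.com/gpu is listed under Allocatable, 1 otherwise.
has_gpu_allocatable() {
  awk '
    /^Allocatable:/ { in_alloc = 1; next }   # start of the Allocatable section
    /^[^ \t]/       { in_alloc = 0 }         # any unindented line ends the section
    in_alloc && $1 == "nvidia.com/gpu:" { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Usage (replace <GPU> with the node name):
#   kubectl describe node <GPU> | has_gpu_allocatable && echo "GPU available" || echo "GPU missing"
```

Matching only inside the Allocatable section mirrors the check described above, which looks at Allocatable rather than at the whole node description.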

Solution

To fix this issue, run the following commands on the GPU node:

awk '1;/plugins."io.containerd.grpc.v1.cri".containerd]/{print " default_runtime_name = \"nvidia\""}' /var/lib/rancher/rke2/agent/etc/containerd/config.toml > /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
systemctl stop rke2-agent
rke2-killall.sh
systemctl start rke2-agent
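
To see what the awk command changes before touching the node, the same awk program can be tried on a scratch file. This is a sketch only: the sample file content is illustrative rather than a real RKE2 configuration, and the function name inject_default_runtime is hypothetical.

```shell
# Hypothetical wrapper around the same awk program used in the fix.
# It prints every input line and, directly after the containerd CRI
# section header, an extra line that sets nvidia as the default runtime.
inject_default_runtime() {
  awk '1;/plugins."io.containerd.grpc.v1.cri".containerd]/{print " default_runtime_name = \"nvidia\""}' "$1"
}

# Illustrative sample, not a real RKE2 config.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
EOF

result=$(inject_default_runtime "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

In the sample, the default_runtime_name line lands immediately below the section header and the other lines pass through unchanged. The fix redirects this output to config.toml.tmpl rather than to config.toml itself; as far as the RKE2 documentation describes it, a config.toml.tmpl in that directory is used as the template for the containerd configuration, which would explain why the agent restart is needed afterwards.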

To verify that the GPU resource now shows up, run the following command:

kubectl describe node <GPU>

In the following sample, you can see that nvidia.com/gpu is present, so the GPU issue no longer occurs.

Allocatable:
  cpu:                5400m
  ephemeral-storage:  51938908890
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             113173836Ki
  nvidia.com/gpu:     1
  pods:               500
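
As an alternative to reading the full node description, the allocatable GPU count can be queried directly. The following is a sketch, assuming kubectl is configured for the cluster; the function name gpu_allocatable_count is hypothetical.

```shell
# Hypothetical helper: prints the allocatable GPU count for a node,
# or nothing if the node does not expose nvidia.com/gpu.
# The backslashes escape the dots in the resource name for JSONPath.
gpu_allocatable_count() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# Usage (replace <GPU> with the node name):
#   gpu_allocatable_count <GPU>
```

An empty result corresponds to the affected state shown in the Description section, while a value such as 1 corresponds to the sample above.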
