k8s GPU Device Plugin
# Installing the NVIDIA Container Toolkit
Documentation: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
- Configure the package repository

      curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
        sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

- Install the NVIDIA Container Toolkit packages

      export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
      sudo dnf install -y \
          nvidia-container-toolkit-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
          nvidia-container-toolkit-base-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
          libnvidia-container-tools-${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
          libnvidia-container1-${NVIDIA_CONTAINER_TOOLKIT_VERSION}

- Configure containerd (for Kubernetes) with the nvidia-ctk command

      sudo nvidia-ctk runtime configure --runtime=containerd

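
The pinned install step above repeats the same version suffix four times. As a minimal sketch, assuming a POSIX shell, the package list can be built from one variable. The helper name `toolkit_packages` is illustrative and not part of any NVIDIA tooling, and the script only prints the `dnf` command so it can be reviewed before being run:

```shell
#!/bin/sh
# Version pinned in the install step above.
NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1

# toolkit_packages (hypothetical helper): print the four package names
# from the install step, each suffixed with the given version.
toolkit_packages() {
    version="$1"
    for pkg in nvidia-container-toolkit nvidia-container-toolkit-base \
               libnvidia-container-tools libnvidia-container1; do
        printf '%s-%s\n' "$pkg" "$version"
    done
}

# Print the command instead of running it, so nothing is installed here.
echo "sudo dnf install -y $(toolkit_packages "$NVIDIA_CONTAINER_TOOLKIT_VERSION" | tr '\n' ' ')"
```

Changing the version in one place keeps the four packages in lockstep.
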
# containerd configuration
In /etc/containerd/config.toml, change the default runtime from runc to nvidia:

    [plugins."io.containerd.grpc.v1.cri".containerd]
      # previously: default_runtime_name = "runc"
      default_runtime_name = "nvidia"

Restart containerd so the change takes effect:

    systemctl restart containerd

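
Before moving on to the device plugin, it can help to confirm that the edit actually landed. The following is a minimal sketch, assuming the config file path used above. The helper `default_runtime` is hypothetical and simply reads the first uncommented `default_runtime_name` entry. The k8s-device-plugin README also shows `nvidia-ctk runtime configure` with a `--set-as-default` flag, which may make the manual edit unnecessary; treat that as something to verify against your toolkit version.

```shell
#!/bin/sh
# default_runtime (hypothetical helper): print the value of the first
# uncommented default_runtime_name entry in a containerd config file.
default_runtime() {
    sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Path from the section above; override with CONFIG=... if yours differs.
CONFIG="${CONFIG:-/etc/containerd/config.toml}"

if [ -r "$CONFIG" ]; then
    if [ "$(default_runtime "$CONFIG")" = "nvidia" ]; then
        echo "default runtime is nvidia"
    else
        echo "default runtime is not nvidia; edit $CONFIG and restart containerd"
    fi
else
    echo "cannot read $CONFIG"
fi
```

Commented-out lines such as the old `runc` entry are ignored because the pattern only allows whitespace before the key.
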
# k8s GPU device plugin on GitHub
https://github.com/NVIDIA/k8s-device-plugin#quick-start
- Plugin configuration using time-slicing sharing and GFD

      version: v1
      flags:
        migStrategy: "none"
        failOnInitError: true
        nvidiaDriverRoot: "/"
        plugin:
          passDeviceSpecs: false
          deviceListStrategy: envvar
          deviceIDStrategy: uuid
        gfd:
          oneshot: false
          noTimestamp: false
          outputFile: /etc/kubernetes/node-feature-discovery/features.d/gfd
          sleepInterval: 60s
      sharing:
        timeSlicing:
          renameByDefault: true
          resources:
          - name: nvidia.com/gpu
            replicas: 10

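
Two settings in the config above decide what the node advertises. According to the device plugin README, each physical GPU is exposed as `replicas` schedulable resources, and with `renameByDefault: true` the resource is advertised as `<name>.shared` rather than `<name>`. A minimal sketch of that arithmetic follows; the helper names are hypothetical and the single physical GPU is an assumption:

```shell
#!/bin/sh
# advertised_name (hypothetical helper): resource name the node is expected
# to expose for a given renameByDefault setting.
advertised_name() {
    name="$1"
    rename_by_default="$2"
    if [ "$rename_by_default" = "true" ]; then
        echo "${name}.shared"
    else
        echo "$name"
    fi
}

# advertised_count (hypothetical helper): physical GPUs times replicas.
advertised_count() {
    echo $(( $1 * $2 ))
}

# renameByDefault and replicas come from the config above;
# one physical GPU on the node is an assumption.
echo "$(advertised_name nvidia.com/gpu true): $(advertised_count 1 10)"
```

When checking a node afterwards, look for the renamed resource if `renameByDefault` is `true`; a grep for plain `nvidia.com/gpu` matches both names.
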
- Install the plugin with helm (did not take effect)
- Add the plugin's helm repository

      helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
      helm repo update

- Install command

      helm upgrade -i nvdp nvdp/nvidia-device-plugin \
        --namespace nvidia-device-plugin \
        --create-namespace \
        --version 0.17.1 \
        --set gfd.enabled=true \
        --set-file config.map.config=/root/dp-example-config0.yaml

- Check whether the configuration took effect

      kubectl describe node k8s-node-03 | grep -A 10 Capacity
      #Capacity:
      #  cpu:                4
      #  ephemeral-storage:  62232Mi
      #  hugepages-1Gi:      0
      #  hugepages-2Mi:      0
      #  memory:             16104860Ki
      #  nvidia.com/gpu:     1
      #  pods:               110
      #Allocatable:
      #  cpu:                4
      #  ephemeral-storage:  58729483372

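
Reading the GPU count out of the `describe` output by eye gets tedious when repeating the check. The sketch below, under the assumption that the output keeps the `name: value` layout shown above, extracts the first value for a resource name. `gpu_capacity` is a hypothetical helper, and the sample input is an abbreviated copy of the output in this section:

```shell
#!/bin/sh
# gpu_capacity (hypothetical helper): read `kubectl describe node` output on
# stdin and print the first count listed for the given resource name.
# The first match is the Capacity entry, since Capacity precedes Allocatable.
gpu_capacity() {
    awk -v res="$1:" '$1 == res { print $2; exit }'
}

# Demo on a sample abbreviated from the output in this section.
gpu_capacity nvidia.com/gpu <<'EOF'
Capacity:
  cpu:                4
  nvidia.com/gpu:     1
  pods:               110
EOF

# Against a live cluster (node name taken from the check above):
#   kubectl describe node k8s-node-03 | gpu_capacity nvidia.com/gpu
```

The exact-match comparison means `nvidia.com/gpu` and `nvidia.com/gpu.shared` are counted separately, which matters when `renameByDefault` is enabled.
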
- Install the plugin with helm and a ConfigMap (worked)
- ConfigMap configuration

      version: v1
      sharing:
        timeSlicing:
          renameByDefault: false
          resources:
          - name: nvidia.com/gpu
            replicas: 10

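
The configuration above is only the payload; it still has to be wrapped in a ConfigMap object that helm can reference by name. The following is a minimal sketch of one way to do that. The ConfigMap name `nvidia-plugin-config` and the namespace are taken from the helm command in this section, while the data key `config` and the output path are assumptions made for illustration:

```shell
#!/bin/sh
# Assumptions: the ConfigMap name and namespace match the helm command in
# this section; the data key "config" and the output path are illustrative.
MANIFEST="${MANIFEST:-/tmp/nvidia-plugin-config.yaml}"

# Write the manifest; the quoted heredoc keeps the YAML indentation intact.
cat > "$MANIFEST" <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-plugin-config
  namespace: nvidia-device-plugin
data:
  config: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
        - name: nvidia.com/gpu
          replicas: 10
EOF

echo "wrote $MANIFEST"

# Apply it before running helm (requires cluster access; the namespace
# must already exist because the helm command omits --create-namespace):
#   kubectl apply -f "$MANIFEST"
```

Keeping the manifest in a file makes the sharing settings reviewable and lets the same ConfigMap be re-applied after edits.
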
- Associate the ConfigMap with the device plugin

      helm upgrade -i nvdp nvdp/nvidia-device-plugin \
        --namespace nvidia-device-plugin \
        --version 0.17.1 \
        --set gfd.enabled=true \
        --set config.name=nvidia-plugin-config