
A pod failed to use 2 GPUs #1462

@Mrpingdan

Description

What happened:
https://project-hami.io/zh/docs/userguide/NVIDIA-device/examples/use-exclusive-card
Following this document, I was unable to allocate two GPUs with their full memory to a single pod.
However, when I requested two GPUs with 71 GB of memory each, the pod started up.

What you expected to happen:
The pod should be allocated two GPUs normally.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
MIG is enabled on the node.
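
For contrast with the failing spec quoted at the end of this report, the request that does start presumably caps each card at a 3g.71gb-sized slice. A minimal sketch, assuming HAMi's nvidia.com/gpumem resource (value in MiB) was how the 71 GB limit was expressed:

    # Hypothetical working variant: two cards, each capped at ~71 GB (72704 MiB),
    # matching the 3g.71gb MIG profile listed by nvidia-smi -L below.
    resources:
      limits:
        nvidia.com/gpu: '2'          # number of cards requested
        nvidia.com/gpumem: '72704'   # assumed per-card memory cap in MiB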

  • The output of nvidia-smi -a on your host
  • nvidia-smi -L
    GPU 0: NVIDIA H20-3e (UUID: GPU-c6a9ed1a-e4f8-f17b-845b-7c76ba095a82)
    MIG 1g.18gb Device 0: (UUID: MIG-5df9711e-b21e-5e42-b2d1-b32f2c4dd9a3)
    MIG 1g.18gb Device 1: (UUID: MIG-4e6281b4-5560-5e7c-9ebe-4cab135daf78)
    MIG 1g.18gb Device 2: (UUID: MIG-1123f5af-d118-539e-8a18-9759ae0d7de4)
    MIG 1g.18gb Device 3: (UUID: MIG-5472b252-b363-5c94-b478-92fb1645b945)
    MIG 1g.18gb Device 4: (UUID: MIG-e33a4903-a544-584e-b8e1-f8d561fe183d)
    MIG 1g.18gb Device 5: (UUID: MIG-439f2397-1ded-5e53-96c1-a174f0646058)
    MIG 1g.18gb Device 6: (UUID: MIG-cb81df20-f6df-5879-901c-b1cfca4fa991)
    GPU 1: NVIDIA H20-3e (UUID: GPU-40c0923a-6172-5f3c-d299-4f6eaf218ad5)
    GPU 2: NVIDIA H20-3e (UUID: GPU-22e5bae0-3938-7cc7-cab3-3f2b91e20774)
    MIG 7g.141gb Device 0: (UUID: MIG-46319f0e-8179-59d8-9038-f82d3a98261e)
    GPU 3: NVIDIA H20-3e (UUID: GPU-edb2e09b-6ea4-8d02-9bf9-ab6a1bb06943)
    MIG 7g.141gb Device 0: (UUID: MIG-f1ab13f4-0542-561d-8ccd-f8418c71e705)
    GPU 4: NVIDIA H20-3e (UUID: GPU-c53c0221-f097-5b48-427f-1a4f0f0ab91d)
    MIG 2g.35gb Device 0: (UUID: MIG-7779ab9c-357c-5061-8c12-56ba239f2f08)
    MIG 2g.35gb Device 1: (UUID: MIG-635e4612-419d-5a03-95c1-14876e2aec78)
    MIG 2g.35gb Device 2: (UUID: MIG-900a6c00-88bb-505e-8392-412541775cd2)
    GPU 5: NVIDIA H20-3e (UUID: GPU-efc45a1c-c634-04ed-396a-2984531a62fc)
    MIG 3g.71gb Device 0: (UUID: MIG-cd88fdd7-848b-5207-8108-cf2f1d4a61a9)
    MIG 3g.71gb Device 1: (UUID: MIG-69cb2a80-bca0-506d-ae51-c6396b339018)
    GPU 6: NVIDIA H20-3e (UUID: GPU-9518f90b-6b41-872c-242e-5863bd6c1150)
    MIG 2g.35gb Device 0: (UUID: MIG-2a288840-d2f8-5473-a01e-56436cfc97c8)
    MIG 2g.35gb Device 1: (UUID: MIG-f830007b-01be-57ca-9698-583592f33bb8)
    MIG 2g.35gb Device 2: (UUID: MIG-e1ee6223-ea9c-5a58-a2a6-71228832986b)
    GPU 7: NVIDIA H20-3e (UUID: GPU-c912dac5-bd93-2bd0-57ea-704bbbc13464)
    MIG 3g.71gb Device 0: (UUID: MIG-b8492b01-5042-5791-a8a0-4b8ddf4e9ec7)
    MIG 3g.71gb Device 1: (UUID: MIG-9af8b2d4-97ee-5bf3-abd7-ed5019b8442c)
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The hami-device-plugin container logs
  • The hami-scheduler container logs
  • I1103 03:38:55.537815 1 device.go:680] Allocating... 287542 cores 0
    I1103 03:38:55.537820 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
    I1103 03:38:55.537828 1 device.go:598] MIG entry not found [{1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false}]
    I1103 03:38:55.537843 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-40c0923a-6172-5f3c-d299-4f6eaf218ad5" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=2
    I1103 03:38:55.537848 1 device.go:680] Allocating... 287542 cores 0
    I1103 03:38:55.537854 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
    I1103 03:38:55.537869 1 device.go:598] MIG entry not found [{1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false}]
    I1103 03:38:55.537899 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-c6a9ed1a-e4f8-f17b-845b-7c76ba095a82" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=1
    I1103 03:38:55.537905 1 device.go:680] Allocating... 287542 cores 0
    I1103 03:38:55.537910 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
    I1103 03:38:55.537918 1 device.go:598] MIG entry not found [{1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false}]
    I1103 03:38:55.537931 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-edb2e09b-6ea4-8d02-9bf9-ab6a1bb06943" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=0
    I1103 03:38:55.537936 1 device.go:680] Allocating... 287542 cores 0
    I1103 03:38:55.537941 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
    I1103 03:38:55.537962 1 score.go:196] "NodeUnfitPod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" node="rke2-agent07-ai" reason="node:rke2-agent07-ai resaon:4/8 CardInsufficientMemory, 1/8 AllocatedCardsInsufficientRequest, 4/8 CardNotFoundCustomFilterRule"
    I1103 03:38:55.538010 1 scheduler.go:517] "No available nodes meet the required scores" pod="gputest3-5db8879dd5-dlw7k"
    I1103 03:38:55.538161 1 event.go:389] "Event occurred" object="jingyu-test/gputest3-5db8879dd5-dlw7k" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="1 nodes CardInsufficientMemory(rke2-agent07-ai)"
    I1103 03:38:55.538199 1 event.go:389] "Event occurred" object="jingyu-test/gputest3-5db8879dd5-dlw7k" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="1 nodes CardNotFoundCustomFilterRule(rke2-agent07-ai)"
    I1103 03:38:55.538223 1 event.go:389] "Event occurred" object="jingyu-test/gputest3-5db8879dd5-dlw7k" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="1 nodes AllocatedCardsInsufficientRequest(rke2-agent07-ai)"
    I1103 03:38:55.538233 1 event.go:389] "Event occurred" object="jingyu-test/gputest3-5db8879dd5-dlw7k" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, 7 nodes do not meet"
    2025/11/03 03:39:07 http: TLS handshake error from 192.178.4.0:44452: remote error: tls: bad certificate
    I1103 03:39:08.102139 1 route.go:44] Entering Predicate Route handler
    I1103 03:39:08.102581 1 scheduler.go:482] "Starting schedule filter process" pod="gputest3-5db8879dd5-dlw7k" uuid="b41dc013-2b18-44b0-8a10-6632c684edff" namespace="jingyu-test"
    I1103 03:39:08.102618 1 devices.go:455] "Processing resource requirements" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" containerCount=1
    I1103 03:39:08.102631 1 gcu.go:90] Start to count enflame devices for container container-0
    I1103 03:39:08.102639 1 device.go:205] Start to count mthreads devices for container container-0
    I1103 03:39:08.102646 1 device.go:133] Start to count metax devices for container container-0
    I1103 03:39:08.102655 1 device.go:151] Start to count kunlun devices for container container-0
    I1103 03:39:08.102665 1 device.go:180] Start to count iluvatar devices for container container-0
    I1103 03:39:08.102673 1 device.go:195] Start to count enflame devices for container container-0
    I1103 03:39:08.102683 1 device.go:226] Start to count awsNeuron devices for container container-0
    I1103 03:39:08.102692 1 device.go:247] Start to count mlu devices for container container-0
    I1103 03:39:08.102699 1 device.go:252] idx= nvidia.com/gpu val= {{2 0} {} 2 DecimalSI} {{0 0} {} }
    I1103 03:39:08.102720 1 device.go:209] Start to count dcu devices for container container-0
    I1103 03:39:08.102741 1 devices.go:478] "Resource requirements collected" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" requests=[{"NVIDIA":{"Nums":2,"Type":"NVIDIA","Memreq":0,"MemPercentagereq":100,"Coresreq":0}}]
    I1103 03:39:08.102754 1 pods.go:100] "Pod not found for deletion" pod="jingyu-test/gputest3-5db8879dd5-dlw7k"
    I1103 03:39:08.102833 1 node_policy.go:82] node rke2-agent07-ai used 5, usedCore 0, usedMem 361472,
    I1103 03:39:08.102842 1 node_policy.go:94] node rke2-agent07-ai computer default score is 4.035633
    I1103 03:39:08.102872 1 gpu_policy.go:71] device GPU-c53c0221-f097-5b48-427f-1a4f0f0ab91d user 1, userCore 0, userMem 35840,
    I1103 03:39:08.102878 1 gpu_policy.go:77] device GPU-c53c0221-f097-5b48-427f-1a4f0f0ab91d computer score is 16.778568
    I1103 03:39:08.102883 1 gpu_policy.go:71] device GPU-efc45a1c-c634-04ed-396a-2984531a62fc user 0, userCore 0, userMem 0,
    I1103 03:39:08.102888 1 gpu_policy.go:77] device GPU-efc45a1c-c634-04ed-396a-2984531a62fc computer score is 12.857142
    I1103 03:39:08.102892 1 gpu_policy.go:71] device GPU-9518f90b-6b41-872c-242e-5863bd6c1150 user 1, userCore 0, userMem 35840,
    I1103 03:39:08.102897 1 gpu_policy.go:77] device GPU-9518f90b-6b41-872c-242e-5863bd6c1150 computer score is 16.778568
    I1103 03:39:08.102902 1 gpu_policy.go:71] device GPU-c912dac5-bd93-2bd0-57ea-704bbbc13464 user 2, userCore 0, userMem 145408,
    I1103 03:39:08.102906 1 gpu_policy.go:77] device GPU-c912dac5-bd93-2bd0-57ea-704bbbc13464 computer score is 25.828148
    I1103 03:39:08.102913 1 gpu_policy.go:71] device GPU-c6a9ed1a-e4f8-f17b-845b-7c76ba095a82 user 0, userCore 0, userMem 0,
    I1103 03:39:08.102918 1 gpu_policy.go:77] device GPU-c6a9ed1a-e4f8-f17b-845b-7c76ba095a82 computer score is 12.857142
    I1103 03:39:08.102922 1 gpu_policy.go:71] device GPU-40c0923a-6172-5f3c-d299-4f6eaf218ad5 user 0, userCore 0, userMem 0,
    I1103 03:39:08.102926 1 gpu_policy.go:77] device GPU-40c0923a-6172-5f3c-d299-4f6eaf218ad5 computer score is 12.857142
    I1103 03:39:08.102930 1 gpu_policy.go:71] device GPU-22e5bae0-3938-7cc7-cab3-3f2b91e20774 user 0, userCore 0, userMem 0,
    I1103 03:39:08.102934 1 gpu_policy.go:77] device GPU-22e5bae0-3938-7cc7-cab3-3f2b91e20774 computer score is 12.857142
    I1103 03:39:08.102938 1 gpu_policy.go:71] device GPU-edb2e09b-6ea4-8d02-9bf9-ab6a1bb06943 user 1, userCore 0, userMem 144384,
    I1103 03:39:08.102942 1 gpu_policy.go:77] device GPU-edb2e09b-6ea4-8d02-9bf9-ab6a1bb06943 computer score is 24.328350
    I1103 03:39:08.102959 1 device.go:688] "Allocating device for container request" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" card request={"Nums":2,"Type":"NVIDIA","Memreq":0,"MemPercentagereq":100,"Coresreq":0}
    I1103 03:39:08.102975 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-efc45a1c-c634-04ed-396a-2984531a62fc" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=2 device index=7
    I1103 03:39:08.102986 1 device.go:680] Allocating... 143771 cores 0
    I1103 03:39:08.102997 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=143771
    I1103 03:39:08.103009 1 device.go:605] MIG entry device usage true= [{1g.18gb 18432 true} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false}] request {2 NVIDIA 0 100 0} toAllocate []
    I1103 03:39:08.103035 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-efc45a1c-c634-04ed-396a-2984531a62fc" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=7
    I1103 03:39:08.103045 1 device.go:680] Allocating... 287542 cores 0
    I1103 03:39:08.103052 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
    I1103 03:39:08.103061 1 device.go:598] MIG entry not found [{1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false}]
    I1103 03:39:08.103077 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-9518f90b-6b41-872c-242e-5863bd6c1150" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=6
    I1103 03:39:08.103083 1 device.go:680] Allocating... 287542 cores 0
    I1103 03:39:08.103089 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
    I1103 03:39:08.103100 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-c53c0221-f097-5b48-427f-1a4f0f0ab91d" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=5
    I1103 03:39:08.103106 1 device.go:680] Allocating... 287542 cores 0
    I1103 03:39:08.103112 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
    I1103 03:39:08.103123 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-c912dac5-bd93-2bd0-57ea-704bbbc13464" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=4
    I1103 03:39:08.103129 1 device.go:680] Allocating... 287542 cores 0
    I1103 03:39:08.103136 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
    I1103 03:39:08.103146 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-22e5bae0-3938-7cc7-cab3-3f2b91e20774" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=3
    I1103 03:39:08.103151 1 device.go:680] Allocating... 287542 cores 0
    I1103 03:39:08.103158 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
    I1103 03:39:08.103165 1 device.go:598] MIG entry not found [{1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false}]
    I1103 03:39:08.103182 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-40c0923a-6172-5f3c-d299-4f6eaf218ad5" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=2
    I1103 03:39:08.103188 1 device.go:680] Allocating... 287542 cores 0
    I1103 03:39:08.103195 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
    I1103 03:39:08.103202 1 device.go:598] MIG entry not found [{1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false}]
    I1103 03:39:08.103219 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-c6a9ed1a-e4f8-f17b-845b-7c76ba095a82" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=1
    I1103 03:39:08.103225 1 device.go:680] Allocating... 287542 cores 0
    I1103 03:39:08.103232 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
    I1103 03:39:08.103239 1 device.go:598] MIG entry not found [{1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false}]
    I1103 03:39:08.103253 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-edb2e09b-6ea4-8d02-9bf9-ab6a1bb06943" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=0
    I1103 03:39:08.103258 1 device.go:680] Allocating... 287542 cores 0
    I1103 03:39:08.103264 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
    I1103 03:39:08.103287 1 score.go:196] "NodeUnfitPod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" node="rke2-agent07-ai" reason="node:rke2-agent07-ai resaon:4/8 CardNotFoundCustomFilterRule, 4/8 CardInsufficientMemory, 1/8 AllocatedCardsInsufficientRequest"
    I1103 03:39:08.103326 1 scheduler.go:517] "No available nodes meet the required scores" pod="gputest3-5db8879dd5-dlw7k"
    I1103 03:39:08.103480 1 event.go:389] "Event occurred" object="jingyu-test/gputest3-5db8879dd5-dlw7k" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="1 nodes CardInsufficientMemory(rke2-agent07-ai)"
    I1103 03:39:08.103493 1 event.go:389] "Event occurred" object="jingyu-test/gputest3-5db8879dd5-dlw7k" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="1 nodes CardNotFoundCustomFilterRule(rke2-agent07-ai)"
    I1103 03:39:08.103511 1 event.go:389] "Event occurred" object="jingyu-test/gputest3-5db8879dd5-dlw7k" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="1 nodes AllocatedCardsInsufficientRequest(rke2-agent07-ai)"
    I1103 03:39:08.103522 1 event.go:389] "Event occurred" object="jingyu-test/gputest3-5db8879dd5-dlw7k" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, 7 nodes do not meet"
    I1103 03:39:23.112623 1 route.go:44] Entering Predicate Route handler
    I1103 03:39:23.113063 1 scheduler.go:482] "Starting schedule filter process" pod="g
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version:
    2.7.0
  • nvidia driver or other AI device driver version:
    NVIDIA-SMI 575.57.08  Driver Version: 575.57.08  CUDA Version: 12.9
  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Others:
  • pod yaml
    name: gputest3
    namespace: jingyu-test
    spec:
      containers:
        - command:
            - sh
          image: >-
            harbor.xx.local/busybox/cuda-sample@sha256:72cf391aba0795ec2141a15a83185e2dac901eb01d9e60d05477c1f43bc272e9
          imagePullPolicy: IfNotPresent
          name: container-0
          resources:
            limits:
              nvidia.com/gpu: '2'
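
The scheduler log above records this spec as MemPercentagereq=100 for each of the two requested cards, i.e. a full-memory request per card. A minimal sketch of stating that intent explicitly, assuming a nvidia.com/gpumem-percentage resource name is accepted here:

    # Hypothetical explicit form of the same request: two whole cards,
    # each with 100% of its memory (what the spec above already implies).
    resources:
      limits:
        nvidia.com/gpu: '2'                  # number of cards requested
        nvidia.com/gpumem-percentage: '100'  # assumed resource name; full memory per card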
