What happened:
https://project-hami.io/zh/docs/userguide/NVIDIA-device/examples/use-exclusive-card
Following this document, I was unable to allocate two GPUs with their full memory to a single pod.
However, when I allocate two GPUs with 71 GB of GPU memory each, the pod starts up normally.
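For reference, a minimal sketch of the two resource-limit variants tried. Only nvidia.com/gpu appears in the pod spec at the bottom of this issue; the nvidia.com/gpumem resource name and the 72704 MiB (~71 GiB) value for the working case are assumptions based on the description above, not taken from the actual manifests.

# Failing variant: two whole GPUs with no explicit memory request,
# which the scheduler treats as 100% of each card's memory (MemPercentagereq=100 in the logs below).
resources:
  limits:
    nvidia.com/gpu: '2'

# Working variant (assumed): two GPUs with an explicit ~71 GiB memory limit each.
resources:
  limits:
    nvidia.com/gpu: '2'
    nvidia.com/gpumem: '72704'   # MiB; assumed value, matching the 3g.71gb MIG profile on this node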
What you expected to happen:
The pod should be allocated two GPUs normally.
How to reproduce it (as minimally and precisely as possible):
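A minimal standalone Pod sketch for reproducing this, reconstructed from the deployment spec under Environment below; apiVersion, kind, the metadata wrapper, the pod name, and the sleep command are filled in here and are not part of the original manifest.

apiVersion: v1
kind: Pod
metadata:
  name: gputest3-repro          # hypothetical name
  namespace: jingyu-test
spec:
  containers:
    - name: container-0
      image: harbor.xx.local/busybox/cuda-sample@sha256:72cf391aba0795ec2141a15a83185e2dac901eb01d9e60d05477c1f43bc272e9
      imagePullPolicy: IfNotPresent
      command: ["sh", "-c", "sleep infinity"]   # keeps the container running; the original spec only sets "sh"
      resources:
        limits:
          nvidia.com/gpu: '2'   # two whole GPUs, no explicit memory => 100% of each card

On this node the scheduler rejects it with the FilteringFailed events shown in the hami-scheduler logs below.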
Anything else we need to know?:
MIG is enabled on some of the GPUs (see the nvidia-smi -L output below).
- The output of nvidia-smi -a on your host:
nvidia-smi -L
GPU 0: NVIDIA H20-3e (UUID: GPU-c6a9ed1a-e4f8-f17b-845b-7c76ba095a82)
MIG 1g.18gb Device 0: (UUID: MIG-5df9711e-b21e-5e42-b2d1-b32f2c4dd9a3)
MIG 1g.18gb Device 1: (UUID: MIG-4e6281b4-5560-5e7c-9ebe-4cab135daf78)
MIG 1g.18gb Device 2: (UUID: MIG-1123f5af-d118-539e-8a18-9759ae0d7de4)
MIG 1g.18gb Device 3: (UUID: MIG-5472b252-b363-5c94-b478-92fb1645b945)
MIG 1g.18gb Device 4: (UUID: MIG-e33a4903-a544-584e-b8e1-f8d561fe183d)
MIG 1g.18gb Device 5: (UUID: MIG-439f2397-1ded-5e53-96c1-a174f0646058)
MIG 1g.18gb Device 6: (UUID: MIG-cb81df20-f6df-5879-901c-b1cfca4fa991)
GPU 1: NVIDIA H20-3e (UUID: GPU-40c0923a-6172-5f3c-d299-4f6eaf218ad5)
GPU 2: NVIDIA H20-3e (UUID: GPU-22e5bae0-3938-7cc7-cab3-3f2b91e20774)
MIG 7g.141gb Device 0: (UUID: MIG-46319f0e-8179-59d8-9038-f82d3a98261e)
GPU 3: NVIDIA H20-3e (UUID: GPU-edb2e09b-6ea4-8d02-9bf9-ab6a1bb06943)
MIG 7g.141gb Device 0: (UUID: MIG-f1ab13f4-0542-561d-8ccd-f8418c71e705)
GPU 4: NVIDIA H20-3e (UUID: GPU-c53c0221-f097-5b48-427f-1a4f0f0ab91d)
MIG 2g.35gb Device 0: (UUID: MIG-7779ab9c-357c-5061-8c12-56ba239f2f08)
MIG 2g.35gb Device 1: (UUID: MIG-635e4612-419d-5a03-95c1-14876e2aec78)
MIG 2g.35gb Device 2: (UUID: MIG-900a6c00-88bb-505e-8392-412541775cd2)
GPU 5: NVIDIA H20-3e (UUID: GPU-efc45a1c-c634-04ed-396a-2984531a62fc)
MIG 3g.71gb Device 0: (UUID: MIG-cd88fdd7-848b-5207-8108-cf2f1d4a61a9)
MIG 3g.71gb Device 1: (UUID: MIG-69cb2a80-bca0-506d-ae51-c6396b339018)
GPU 6: NVIDIA H20-3e (UUID: GPU-9518f90b-6b41-872c-242e-5863bd6c1150)
MIG 2g.35gb Device 0: (UUID: MIG-2a288840-d2f8-5473-a01e-56436cfc97c8)
MIG 2g.35gb Device 1: (UUID: MIG-f830007b-01be-57ca-9698-583592f33bb8)
MIG 2g.35gb Device 2: (UUID: MIG-e1ee6223-ea9c-5a58-a2a6-71228832986b)
GPU 7: NVIDIA H20-3e (UUID: GPU-c912dac5-bd93-2bd0-57ea-704bbbc13464)
MIG 3g.71gb Device 0: (UUID: MIG-b8492b01-5042-5791-a8a0-4b8ddf4e9ec7)
MIG 3g.71gb Device 1: (UUID: MIG-9af8b2d4-97ee-5bf3-abd7-ed5019b8442c)
- Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
- The hami-device-plugin container logs
- The hami-scheduler container logs
I1103 03:38:55.537815 1 device.go:680] Allocating... 287542 cores 0
I1103 03:38:55.537820 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
I1103 03:38:55.537828 1 device.go:598] MIG entry not found [{1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false}]
I1103 03:38:55.537843 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-40c0923a-6172-5f3c-d299-4f6eaf218ad5" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=2
I1103 03:38:55.537848 1 device.go:680] Allocating... 287542 cores 0
I1103 03:38:55.537854 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
I1103 03:38:55.537869 1 device.go:598] MIG entry not found [{1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false}]
I1103 03:38:55.537899 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-c6a9ed1a-e4f8-f17b-845b-7c76ba095a82" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=1
I1103 03:38:55.537905 1 device.go:680] Allocating... 287542 cores 0
I1103 03:38:55.537910 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
I1103 03:38:55.537918 1 device.go:598] MIG entry not found [{1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false}]
I1103 03:38:55.537931 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-edb2e09b-6ea4-8d02-9bf9-ab6a1bb06943" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=0
I1103 03:38:55.537936 1 device.go:680] Allocating... 287542 cores 0
I1103 03:38:55.537941 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
I1103 03:38:55.537962 1 score.go:196] "NodeUnfitPod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" node="rke2-agent07-ai" reason="node:rke2-agent07-ai resaon:4/8 CardInsufficientMemory, 1/8 AllocatedCardsInsufficientRequest, 4/8 CardNotFoundCustomFilterRule"
I1103 03:38:55.538010 1 scheduler.go:517] "No available nodes meet the required scores" pod="gputest3-5db8879dd5-dlw7k"
I1103 03:38:55.538161 1 event.go:389] "Event occurred" object="jingyu-test/gputest3-5db8879dd5-dlw7k" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="1 nodes CardInsufficientMemory(rke2-agent07-ai)"
I1103 03:38:55.538199 1 event.go:389] "Event occurred" object="jingyu-test/gputest3-5db8879dd5-dlw7k" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="1 nodes CardNotFoundCustomFilterRule(rke2-agent07-ai)"
I1103 03:38:55.538223 1 event.go:389] "Event occurred" object="jingyu-test/gputest3-5db8879dd5-dlw7k" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="1 nodes AllocatedCardsInsufficientRequest(rke2-agent07-ai)"
I1103 03:38:55.538233 1 event.go:389] "Event occurred" object="jingyu-test/gputest3-5db8879dd5-dlw7k" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, 7 nodes do not meet"
2025/11/03 03:39:07 http: TLS handshake error from 192.178.4.0:44452: remote error: tls: bad certificate
I1103 03:39:08.102139 1 route.go:44] Entering Predicate Route handler
I1103 03:39:08.102581 1 scheduler.go:482] "Starting schedule filter process" pod="gputest3-5db8879dd5-dlw7k" uuid="b41dc013-2b18-44b0-8a10-6632c684edff" namespace="jingyu-test"
I1103 03:39:08.102618 1 devices.go:455] "Processing resource requirements" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" containerCount=1
I1103 03:39:08.102631 1 gcu.go:90] Start to count enflame devices for container container-0
I1103 03:39:08.102639 1 device.go:205] Start to count mthreads devices for container container-0
I1103 03:39:08.102646 1 device.go:133] Start to count metax devices for container container-0
I1103 03:39:08.102655 1 device.go:151] Start to count kunlun devices for container container-0
I1103 03:39:08.102665 1 device.go:180] Start to count iluvatar devices for container container-0
I1103 03:39:08.102673 1 device.go:195] Start to count enflame devices for container container-0
I1103 03:39:08.102683 1 device.go:226] Start to count awsNeuron devices for container container-0
I1103 03:39:08.102692 1 device.go:247] Start to count mlu devices for container container-0
I1103 03:39:08.102699 1 device.go:252] idx= nvidia.com/gpu val= {{2 0} {} 2 DecimalSI} {{0 0} {} }
I1103 03:39:08.102720 1 device.go:209] Start to count dcu devices for container container-0
I1103 03:39:08.102741 1 devices.go:478] "Resource requirements collected" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" requests=[{"NVIDIA":{"Nums":2,"Type":"NVIDIA","Memreq":0,"MemPercentagereq":100,"Coresreq":0}}]
I1103 03:39:08.102754 1 pods.go:100] "Pod not found for deletion" pod="jingyu-test/gputest3-5db8879dd5-dlw7k"
I1103 03:39:08.102833 1 node_policy.go:82] node rke2-agent07-ai used 5, usedCore 0, usedMem 361472,
I1103 03:39:08.102842 1 node_policy.go:94] node rke2-agent07-ai computer default score is 4.035633
I1103 03:39:08.102872 1 gpu_policy.go:71] device GPU-c53c0221-f097-5b48-427f-1a4f0f0ab91d user 1, userCore 0, userMem 35840,
I1103 03:39:08.102878 1 gpu_policy.go:77] device GPU-c53c0221-f097-5b48-427f-1a4f0f0ab91d computer score is 16.778568
I1103 03:39:08.102883 1 gpu_policy.go:71] device GPU-efc45a1c-c634-04ed-396a-2984531a62fc user 0, userCore 0, userMem 0,
I1103 03:39:08.102888 1 gpu_policy.go:77] device GPU-efc45a1c-c634-04ed-396a-2984531a62fc computer score is 12.857142
I1103 03:39:08.102892 1 gpu_policy.go:71] device GPU-9518f90b-6b41-872c-242e-5863bd6c1150 user 1, userCore 0, userMem 35840,
I1103 03:39:08.102897 1 gpu_policy.go:77] device GPU-9518f90b-6b41-872c-242e-5863bd6c1150 computer score is 16.778568
I1103 03:39:08.102902 1 gpu_policy.go:71] device GPU-c912dac5-bd93-2bd0-57ea-704bbbc13464 user 2, userCore 0, userMem 145408,
I1103 03:39:08.102906 1 gpu_policy.go:77] device GPU-c912dac5-bd93-2bd0-57ea-704bbbc13464 computer score is 25.828148
I1103 03:39:08.102913 1 gpu_policy.go:71] device GPU-c6a9ed1a-e4f8-f17b-845b-7c76ba095a82 user 0, userCore 0, userMem 0,
I1103 03:39:08.102918 1 gpu_policy.go:77] device GPU-c6a9ed1a-e4f8-f17b-845b-7c76ba095a82 computer score is 12.857142
I1103 03:39:08.102922 1 gpu_policy.go:71] device GPU-40c0923a-6172-5f3c-d299-4f6eaf218ad5 user 0, userCore 0, userMem 0,
I1103 03:39:08.102926 1 gpu_policy.go:77] device GPU-40c0923a-6172-5f3c-d299-4f6eaf218ad5 computer score is 12.857142
I1103 03:39:08.102930 1 gpu_policy.go:71] device GPU-22e5bae0-3938-7cc7-cab3-3f2b91e20774 user 0, userCore 0, userMem 0,
I1103 03:39:08.102934 1 gpu_policy.go:77] device GPU-22e5bae0-3938-7cc7-cab3-3f2b91e20774 computer score is 12.857142
I1103 03:39:08.102938 1 gpu_policy.go:71] device GPU-edb2e09b-6ea4-8d02-9bf9-ab6a1bb06943 user 1, userCore 0, userMem 144384,
I1103 03:39:08.102942 1 gpu_policy.go:77] device GPU-edb2e09b-6ea4-8d02-9bf9-ab6a1bb06943 computer score is 24.328350
I1103 03:39:08.102959 1 device.go:688] "Allocating device for container request" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" card request={"Nums":2,"Type":"NVIDIA","Memreq":0,"MemPercentagereq":100,"Coresreq":0}
I1103 03:39:08.102975 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-efc45a1c-c634-04ed-396a-2984531a62fc" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=2 device index=7
I1103 03:39:08.102986 1 device.go:680] Allocating... 143771 cores 0
I1103 03:39:08.102997 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=143771
I1103 03:39:08.103009 1 device.go:605] MIG entry device usage true= [{1g.18gb 18432 true} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false}] request {2 NVIDIA 0 100 0} toAllocate []
I1103 03:39:08.103035 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-efc45a1c-c634-04ed-396a-2984531a62fc" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=7
I1103 03:39:08.103045 1 device.go:680] Allocating... 287542 cores 0
I1103 03:39:08.103052 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
I1103 03:39:08.103061 1 device.go:598] MIG entry not found [{1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false}]
I1103 03:39:08.103077 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-9518f90b-6b41-872c-242e-5863bd6c1150" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=6
I1103 03:39:08.103083 1 device.go:680] Allocating... 287542 cores 0
I1103 03:39:08.103089 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
I1103 03:39:08.103100 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-c53c0221-f097-5b48-427f-1a4f0f0ab91d" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=5
I1103 03:39:08.103106 1 device.go:680] Allocating... 287542 cores 0
I1103 03:39:08.103112 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
I1103 03:39:08.103123 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-c912dac5-bd93-2bd0-57ea-704bbbc13464" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=4
I1103 03:39:08.103129 1 device.go:680] Allocating... 287542 cores 0
I1103 03:39:08.103136 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
I1103 03:39:08.103146 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-22e5bae0-3938-7cc7-cab3-3f2b91e20774" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=3
I1103 03:39:08.103151 1 device.go:680] Allocating... 287542 cores 0
I1103 03:39:08.103158 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
I1103 03:39:08.103165 1 device.go:598] MIG entry not found [{1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false}]
I1103 03:39:08.103182 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-40c0923a-6172-5f3c-d299-4f6eaf218ad5" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=2
I1103 03:39:08.103188 1 device.go:680] Allocating... 287542 cores 0
I1103 03:39:08.103195 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
I1103 03:39:08.103202 1 device.go:598] MIG entry not found [{1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false}]
I1103 03:39:08.103219 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-c6a9ed1a-e4f8-f17b-845b-7c76ba095a82" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=1
I1103 03:39:08.103225 1 device.go:680] Allocating... 287542 cores 0
I1103 03:39:08.103232 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
I1103 03:39:08.103239 1 device.go:598] MIG entry not found [{1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false} {1g.18gb 18432 false}]
I1103 03:39:08.103253 1 device.go:695] "scoring pod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" device="GPU-edb2e09b-6ea4-8d02-9bf9-ab6a1bb06943" Memreq=0 MemPercentagereq=100 Coresreq=0 Nums=1 device index=0
I1103 03:39:08.103258 1 device.go:680] Allocating... 287542 cores 0
I1103 03:39:08.103264 1 quota.go:72] "resourceMem quota judging" limit=0 used=72704 alloc=287542
I1103 03:39:08.103287 1 score.go:196] "NodeUnfitPod" pod="jingyu-test/gputest3-5db8879dd5-dlw7k" node="rke2-agent07-ai" reason="node:rke2-agent07-ai resaon:4/8 CardNotFoundCustomFilterRule, 4/8 CardInsufficientMemory, 1/8 AllocatedCardsInsufficientRequest"
I1103 03:39:08.103326 1 scheduler.go:517] "No available nodes meet the required scores" pod="gputest3-5db8879dd5-dlw7k"
I1103 03:39:08.103480 1 event.go:389] "Event occurred" object="jingyu-test/gputest3-5db8879dd5-dlw7k" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="1 nodes CardInsufficientMemory(rke2-agent07-ai)"
I1103 03:39:08.103493 1 event.go:389] "Event occurred" object="jingyu-test/gputest3-5db8879dd5-dlw7k" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="1 nodes CardNotFoundCustomFilterRule(rke2-agent07-ai)"
I1103 03:39:08.103511 1 event.go:389] "Event occurred" object="jingyu-test/gputest3-5db8879dd5-dlw7k" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="1 nodes AllocatedCardsInsufficientRequest(rke2-agent07-ai)"
I1103 03:39:08.103522 1 event.go:389] "Event occurred" object="jingyu-test/gputest3-5db8879dd5-dlw7k" fieldPath="" kind="Pod" apiVersion="v1" type="Warning" reason="FilteringFailed" message="no available node, 7 nodes do not meet"
I1103 03:39:23.112623 1 route.go:44] Entering Predicate Route handler
I1103 03:39:23.113063 1 scheduler.go:482] "Starting schedule filter process" pod="g
- The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
- Any relevant kernel output lines from dmesg
Environment:
- HAMi version: 2.7.0
- nvidia driver or other AI device driver version: NVIDIA-SMI 575.57.08, Driver Version 575.57.08, CUDA Version 12.9
- Docker version from docker version:
- Docker command, image and tag used:
- Kernel version from uname -a:
- Others:
- pod yaml
name: gputest3
namespace: jingyu-test
spec:
  containers:
    - command:
        - sh
      image: >-
        harbor.xx.local/busybox/cuda-sample@sha256:72cf391aba0795ec2141a15a83185e2dac901eb01d9e60d05477c1f43bc272e9
      imagePullPolicy: IfNotPresent
      name: container-0
      resources:
        limits:
          nvidia.com/gpu: '2'