
Commit b34f195

[CI] Fix nightly CI for A2 series (#3825)
### What this PR does / why we need it?

For a multi-node CI setup, we need to ensure that cluster resources meet the expected specification before running multi-node interoperability tests. Otherwise, unexpected errors may occur (for example, we might mistakenly assume all nodes are ready and perform a global cluster IP acquisition, which throws an exception in Python because some nodes are not actually ready at that point). Therefore, we need to wait at the workflow level until all resources meet the expected specification.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main: vllm-project/vllm@2918c1b

---------

Signed-off-by: wangli <[email protected]>
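As a rough sketch of the gating idea described above (illustrative only, not code from this commit; `NAMESPACE` and `EXPECTED_PODS` are assumed to come from the workflow environment), a step can poll until the expected number of Running pods is reached or a deadline passes:

```bash
# Illustrative sketch of the workflow-level gate; values and checks are assumptions.
set -euo pipefail
TIMEOUT=${TIMEOUT:-1200}
deadline=$(( $(date +%s) + TIMEOUT ))

until [[ "$(kubectl get pods -n "$NAMESPACE" --field-selector=status.phase=Running --no-headers 2>/dev/null | wc -l)" -ge "${EXPECTED_PODS:-2}" ]]; do
  if (( $(date +%s) >= deadline )); then
    echo "Timed out waiting for cluster resources to reach the expected specification"
    exit 1
  fi
  sleep 2
done
echo "All expected pods are Running; safe to proceed."
```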
1 parent ab51fce commit b34f195

File tree

1 file changed (+81 additions, -11 deletions)

.github/workflows/_e2e_nightly_multi_node.yaml

Lines changed: 81 additions & 11 deletions
@@ -60,7 +60,7 @@ defaults:
 # only cancel in-progress runs of the same workflow
 # and ignore the lint / 8 cards test type
 concurrency:
-  group: ascend-nightly-${{ github.workflow_ref }}-${{ github.ref }}-${{ inputs.config_file_path }}
+  group: ascend-nightly-${{ github.workflow_ref }}-${{ github.ref }}-${{ inputs.soc_version }}
   cancel-in-progress: true

 jobs:
@@ -115,8 +115,39 @@ jobs:

       - name: Clear resources
         run: |
-          # pre clear the crd resources created by lws
-          kubectl delete leaderworkerset vllm -n "$NAMESPACE" --ignore-not-found
+          set -euo pipefail
+
+          CRD_NAME="${CRD_NAME:-vllm}"
+          TIMEOUT=${TIMEOUT:-120}
+          SLEEP_INTERVAL=2
+
+          echo "Deleting leaderworkerset [$CRD_NAME] in namespace [$NAMESPACE]..."
+          kubectl delete leaderworkerset "$CRD_NAME" -n "$NAMESPACE" --ignore-not-found
+
+          echo "Waiting for all pods starting with 'vllm' to be deleted..."
+          START_TIME=$(date +%s)
+
+          while true; do
+            NOW=$(date +%s)
+            ELAPSED=$((NOW - START_TIME))
+
+            if [[ $ELAPSED -ge $TIMEOUT ]]; then
+              echo "Timeout reached ($TIMEOUT seconds), some pods still exist:"
+              kubectl get pods -n "$NAMESPACE" | grep '^vllm' || true
+              exit 1
+            fi
+
+            PODS_EXIST=$(kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].metadata.name}' 2>/dev/null | tr ' ' '\n' | grep '^vllm' || true)
+
+            if [[ -z "$PODS_EXIST" ]]; then
+              echo "All vllm pods deleted."
+              break
+            else
+              echo "Waiting for pods to be deleted: $PODS_EXIST"
+              sleep $SLEEP_INTERVAL
+            fi
+          done
+
       - name: Launch cluster
         id: launcher
         run: |
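If the deletion-polling loop above ever needs to be simplified, one possible alternative (a sketch only, not part of this commit; the label selector is an assumption about how LeaderWorkerSet labels its pods, and behavior when no pods match varies across kubectl versions) is to let kubectl block on deletion directly:

```bash
# Sketch only: delete the LWS and block until its pods are gone.
# The leaderworkerset.sigs.k8s.io/name label is an assumption, not verified in this PR.
kubectl delete leaderworkerset "$CRD_NAME" -n "$NAMESPACE" --ignore-not-found
kubectl wait pod \
  --selector=leaderworkerset.sigs.k8s.io/name="$CRD_NAME" \
  --for=delete \
  -n "$NAMESPACE" \
  --timeout="${TIMEOUT}s"
```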
@@ -164,19 +195,58 @@ jobs:

       - name: Waiting for pod ready
         run: |
-          echo "waiting for Pod [$LEADER_POD] in namespace [$NAMESPACE] to Ready..."
+          POD_PREFIX="${POD_PREFIX:-vllm-0}"
+          SIZE="${{ inputs.size }}"
+          TIMEOUT=1200 # default timeout 20 minutes
+
+          echo "Waiting for Pods in namespace [$NAMESPACE] to become Running and Ready (timeout ${TIMEOUT}s)..."
+
+          START_TIME=$(date +%s)

           while true; do
-            # get pod status
-            READY_STATUS=$(kubectl get pod "$LEADER_POD" -n "$NAMESPACE" -o jsonpath='{.status.containerStatuses[*].ready}')
+            NOW=$(date +%s)
+            ELAPSED=$((NOW - START_TIME))
+            if [[ $ELAPSED -ge $TIMEOUT ]]; then
+              echo "Timeout reached after ${ELAPSED}s"
+              echo "Dumping pod status for debugging:"
+              kubectl get pods -n "$NAMESPACE"
+              kubectl describe pod "$LEADER_POD" -n "$NAMESPACE"
+              exit 1
+            fi
+
+            # 1) check follower pods
+            ALL_FOLLOWERS_READY=true
+            for ((i=1; i<${SIZE}; i++)); do
+              POD="${POD_PREFIX}-${i}"
+              PHASE=$(kubectl get pod "$POD" -n "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || echo "NotFound")
+              READY=$(kubectl get pod "$POD" -n "$NAMESPACE" -o jsonpath='{.status.containerStatuses[*].ready}' 2>/dev/null)
+
+              echo "Follower [$POD] phase=$PHASE ready=$READY"

-            if [[ "$READY_STATUS" == "true" ]]; then
-              echo "Pod [$LEADER_POD] is Ready!"
+              if [[ "$PHASE" != "Running" || "$READY" != "true" ]]; then
+                echo "Follower [$POD] not Ready yet..."
+                ALL_FOLLOWERS_READY=false
+                break
+              fi
+            done
+
+            # 2) check leader pod
+            LEADER_PHASE=$(kubectl get pod "$LEADER_POD" -n "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || echo "NotFound")
+            LEADER_READY=$(kubectl get pod "$LEADER_POD" -n "$NAMESPACE" -o jsonpath='{.status.containerStatuses[*].ready}' 2>/dev/null)
+
+            echo "Leader [$LEADER_POD] phase=$LEADER_PHASE ready=$LEADER_READY"
+
+            if [[ "$LEADER_PHASE" != "Running" || "$LEADER_READY" != "true" ]]; then
+              echo "Leader not Ready yet..."
+              ALL_FOLLOWERS_READY=false
+            fi
+
+            if [[ "$ALL_FOLLOWERS_READY" == "true" ]]; then
+              echo "All follower pods and leader pod are Running and Ready — continuing."
               break
-            else
-              echo "Pod [$LEADER_POD] not ready, waiting..."
-              sleep 3
             fi
+
+            sleep 2
           done

       - name: Stream logs
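For comparison, an equivalent readiness gate could in principle lean on kubectl wait instead of the explicit loop (a sketch only, not part of this commit; the label selector is an assumption about how LeaderWorkerSet labels its pods, and the per-pod log detail of the loop above would be lost):

```bash
# Sketch only: wait for every pod of the vllm LeaderWorkerSet to be Ready.
# The leaderworkerset.sigs.k8s.io/name label is an assumption, not verified in this PR.
kubectl wait pod \
  --selector=leaderworkerset.sigs.k8s.io/name=vllm \
  --for=condition=Ready \
  -n "$NAMESPACE" \
  --timeout=1200s
```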
