[Pools] Improve Concurrent Job Launch #7891
base: master
Conversation
Force-pushed from eec5de9 to 844f506
/smoke-test
/smoke-test --managed-jobs
cg505 left a comment:
Testing - can we make sure as well that /quicktest-core is testing sky exec on a cluster launched with the other version?
/quicktest-core
Force-pushed from a723611 to 418f71f
/smoke-test --managed-jobs --kubernetes
/smoke-test -k test_managed_jobs_storage --kubernetes
/smoke-test
/smoke-test -k test_managed_jobs_basic --aws
/smoke-test --managed-jobs --kubernetes
/smoke-test -k test_pools --kubernetes
Problem
Concurrently launching multiple jobs on pools is currently slow and failure-prone. Most of the time is spent unnecessarily repeating steps of the job provisioning process for every job (submitting controller tasks, rsyncing files, invoking the jobs scheduler).
Approach
This PR improves the submission of multiple jobs by sharing nearly all of the job submission steps across the job replicas. The changes are as follows.
Sharing the job DAG took a bit of extra care because of the `$SKYPILOT_JOB_RANK` environment variable, which lets a task use its rank to parallelize work. That variable is currently set by appending an env var to the task object, but since the value needs to be different for each job, we can't append it once to the shared task and have it differ per replica. To fix this, we create a dictionary in the controller task that maps each replica ID to its rank, store it in a file on the jobs controller, and load it when we create the `JobController` instance for a job.
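As a rough illustration of that mapping (the file path and helper names here are hypothetical, not the ones used in the PR):

```python
import json
import os

# Hypothetical location on the jobs controller; the real path is an assumption.
RANK_MAP_PATH = os.path.expanduser('~/.sky/pool_job_rank_map.json')


def write_rank_map(replica_ids):
    """Persist the replica-ID -> rank mapping once, when the shared DAG is submitted."""
    rank_map = {str(rid): rank for rank, rid in enumerate(replica_ids)}
    with open(RANK_MAP_PATH, 'w', encoding='utf-8') as f:
        json.dump(rank_map, f)


def load_rank(replica_id):
    """Look up one replica's rank when its JobController instance is created,
    so SKYPILOT_JOB_RANK can be set per job without mutating the shared task."""
    with open(RANK_MAP_PATH, 'r', encoding='utf-8') as f:
        return json.load(f)[str(replica_id)]
```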
I have also added support for using gRPC for this task creation via `add_job`, adding a new `num_jobs` field to indicate the number of jobs we want to create, plus new `job_ids` and `log_dirs` return fields so that we can get the job IDs back in bulk. For both codegen and gRPC, I've added logic to stay compatible with a legacy jobs controller by repeatedly calling `add_job` until we have the number of jobs we need.
I've also modified consolidation mode to support concurrent launch (previously it would create the tasks but fail to schedule them).
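A hedged sketch of the bulk-vs-legacy compatibility pattern described above, assuming a hypothetical `controller` client object; the actual codegen/gRPC interfaces in the PR differ in detail:

```python
from typing import List, Tuple


def add_jobs_compat(controller, dag_yaml: str,
                    num_jobs: int) -> Tuple[List[int], List[str]]:
    """Create num_jobs jobs, preferring the bulk call but falling back to a
    legacy jobs controller that only creates one job per add_job call."""
    try:
        # New controllers: one call returns job_ids and log_dirs in bulk.
        resp = controller.add_job(dag_yaml, num_jobs=num_jobs)
        return list(resp.job_ids), list(resp.log_dirs)
    except TypeError:
        # Legacy controllers don't accept num_jobs, so call add_job repeatedly
        # until we have the number of jobs we need.
        job_ids, log_dirs = [], []
        while len(job_ids) < num_jobs:
            resp = controller.add_job(dag_yaml)
            job_ids.append(resp.job_id)
            log_dirs.append(resp.log_dir)
        return job_ids, log_dirs
```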
Testing
- `$SKYPILOT_JOB_RANK` is properly set
- `--num-jobs` is shortened
Remaining Work
Performance
Launching 100 concurrent jobs used to take us 4.5 minutes and now takes 30 seconds!
job-launch.mp4
Update:
We have added a new backend call that instructs the jobs controller to set the job info and set the job to pending; the implications affect `add_job`.
This also addresses #6932.
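Purely illustrative: a self-contained sketch of the shape such a backend call could take, with made-up names and an in-memory stand-in for the controller's job table:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class JobStatus(Enum):
    PENDING = 'PENDING'
    RUNNING = 'RUNNING'


@dataclass
class JobRecord:
    name: str
    pool: str
    status: JobStatus


# In-memory stand-in for the jobs controller's state table (hypothetical).
_JOBS: Dict[int, JobRecord] = {}


def set_job_info_and_pending(job_id: int, name: str, pool: str) -> None:
    """Record the job's metadata and mark it PENDING in a single backend call,
    so the scheduler can pick it up without a separate per-job submission."""
    _JOBS[job_id] = JobRecord(name=name, pool=pool, status=JobStatus.PENDING)
```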
Tested (run the relevant ones):
- `bash format.sh`
- `/smoke-test` (CI) or `pytest tests/test_smoke.py` (local)
- `/smoke-test -k test_name` (CI) or `pytest tests/test_smoke.py::test_name` (local)
- `/quicktest-core` (CI) or `pytest tests/smoke_tests/test_backward_compat.py` (local)