Add acpt-pytorch-2.8-cuda12.6 env #4534
Merged
19 commits
e700742 add acpt-pytorch-2.7-cuda12.6 env (iamrk04)
7054cf4 update torch to 2.8 (iamrk04)
a8a3089 Merge branch 'main' into iamrk04/acpt_upgrade (iamrk04)
c4630b4 Merge branch 'main' into iamrk04/acpt_upgrade (iamrk04)
9624325 Merge branch 'main' into iamrk04/acpt_upgrade (iamrk04)
be3fbb0 Merge branch 'main' into iamrk04/acpt_upgrade (iamrk04)
c274172 Merge branch 'main' into iamrk04/acpt_upgrade (iamrk04)
720894e Merge branch 'main' into iamrk04/acpt_upgrade (iamrk04)
dd0764c fix comments (iamrk04)
a0c655c remove unnecessary apt update (iamrk04)
0e23de2 prevent image build failure (iamrk04)
2d794d1 update SKU (iamrk04)
9dea639 Merge branch 'main' into iamrk04/acpt_upgrade (iamrk04)
2563d52 Merge branch 'main' into iamrk04/acpt_upgrade (iamrk04)
e8acbb9 update packages (iamrk04)
8ac0c8a Merge branch 'iamrk04/acpt_upgrade' of https://github.com/Azure/azure… (iamrk04)
4b55154 Merge branch 'main' into iamrk04/acpt_upgrade (iamrk04)
a3f7fc4 Merge branch 'main' into iamrk04/acpt_upgrade (iamrk04)
f601026 Merge branch 'main' into iamrk04/acpt_upgrade (iamrk04)
11 changes: 11 additions & 0 deletions
assets/training/general/environments/acpt-pytorch-2.8-cuda12.6/asset.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| name: acpt-pytorch-2.8-cuda12.6 | ||
| version: auto | ||
| type: environment | ||
| spec: spec.yaml | ||
| extra_config: environment.yaml | ||
| test: | ||
| pytest: | ||
| enabled: true | ||
| pip_requirements: tests/requirements.txt | ||
| tests_dir: tests | ||
| categories: ["PyTorch", "Training"] |
33 changes: 33 additions & 0 deletions
assets/training/general/environments/acpt-pytorch-2.8-cuda12.6/context/Dockerfile
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,33 @@ | ||
| FROM mcr.microsoft.com/aifx/acpt/stable-ubuntu2204-cu126-py310-torch280:{{latest-image-tag:biweekly\.\d{6}\.\d{1}.*}} | ||
| | ||
| # Install pip dependencies | ||
| COPY requirements.txt . | ||
| RUN pip install -r requirements.txt --no-cache-dir | ||
| | ||
| # Inference requirements | ||
| COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20230419.v1 /artifacts /var/ | ||
| RUN apt-get update && \ | ||
| apt-get install -y --no-install-recommends \ | ||
| libcurl4 \ | ||
| liblttng-ust1 \ | ||
| libunwind8 \ | ||
| libxml++2.6-2v5 \ | ||
| nginx-light \ | ||
| psmisc \ | ||
| rsyslog \ | ||
| runit \ | ||
| unzip && \ | ||
| apt-get clean && rm -rf /var/lib/apt/lists/* && \ | ||
| cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \ | ||
| cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \ | ||
| ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \ | ||
| rm -f /etc/nginx/sites-enabled/default | ||
| ENV SVDIR=/var/runit | ||
| ENV WORKER_TIMEOUT=400 | ||
| EXPOSE 5001 8883 8888 | ||
| | ||
| # support Deepspeed launcher requirement of passwordless ssh login | ||
| RUN apt-get update | ||
| RUN apt-get install -y openssh-server openssh-client | ||
| | ||
| RUN pip list |
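
The `{{latest-image-tag:...}}` placeholder in the `FROM` line carries a regular expression. As a rough illustration of which tags that expression accepts, the sketch below applies it to a few made-up tag names. The tag values and the `pick_latest` helper are hypothetical, and choosing the newest match by string ordering is an assumption made here, not something the diff states about the build tooling.

```python
import re

# Pattern copied from the FROM line of the Dockerfile above.
TAG_PATTERN = re.compile(r"biweekly\.\d{6}\.\d{1}.*")


def matching_tags(tags):
    """Return the tags that fully match the biweekly pattern."""
    return [tag for tag in tags if TAG_PATTERN.fullmatch(tag)]


def pick_latest(tags):
    """Hypothetical selection rule: the highest matching tag in string order."""
    candidates = matching_tags(tags)
    return max(candidates) if candidates else None


if __name__ == "__main__":
    # Made-up tag names, used only to exercise the pattern.
    example_tags = [
        "biweekly.202508.2",
        "biweekly.202509.1",
        "latest",
        "nightly.20250901",
    ]
    print(matching_tags(example_tags))
    print(pick_latest(example_tags))
```

Only the two `biweekly.` names pass the pattern here; `latest` and the `nightly.` name are rejected because the pattern requires the `biweekly.` prefix followed by six digits, a dot, and at least one more digit.
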
25 changes: 25 additions & 0 deletions
assets/training/general/environments/acpt-pytorch-2.8-cuda12.6/context/requirements.txt
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,25 @@ | ||
| azureml-core=={{latest-pypi-version}} | ||
| azureml-dataset-runtime=={{latest-pypi-version}} | ||
| azureml-defaults=={{latest-pypi-version}} | ||
| azure-ml-component=={{latest-pypi-version}} | ||
| azureml-mlflow=={{latest-pypi-version}} | ||
| azureml-contrib-services=={{latest-pypi-version}} | ||
| azureml-inference-server-http | ||
| inference-schema | ||
| MarkupSafe | ||
| regex | ||
| pybind11 | ||
| urllib3 | ||
| requests | ||
| pillow | ||
| transformers | ||
| aiohttp>=3.12.14 | ||
| py-spy | ||
| debugpy | ||
| ipykernel | ||
| tensorboard | ||
| psutil | ||
| matplotlib | ||
| tqdm | ||
| py-cpuinfo | ||
| torch-tb-profiler | ||
12 changes: 12 additions & 0 deletions
assets/training/general/environments/acpt-pytorch-2.8-cuda12.6/environment.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| image: | ||
| name: azureml/curated/acpt-pytorch-2.8-cuda12.6 | ||
| os: linux | ||
| context: | ||
| dir: context | ||
| dockerfile: Dockerfile | ||
| template_files: | ||
| - Dockerfile | ||
| - requirements.txt | ||
| publish: | ||
| location: mcr | ||
| visibility: public |
26 changes: 26 additions & 0 deletions
assets/training/general/environments/acpt-pytorch-2.8-cuda12.6/spec.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,26 @@ | ||
| $schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json | ||
| | ||
| description: >- | ||
| Recommended environment for deep learning with PyTorch on Azure (public preview). It contains the Azure ML SDK with the latest compatible versions of Ubuntu, Python, PyTorch, and CUDA/ROCm, combined with optimizers such as ORT Training, DeepSpeed, MSCCL, ORT MoE, and more. The image introduces PyTorch 2.8 for early testing, along with a preview of Nebula, a new fast checkpointing capability. | ||
| Azure Container Registry:acptdev.azurecr.io/test/public/aifx/acpt/stable-ubuntu2004-cu121-py310-torch212 | ||
| | ||
| name: "{{asset.name}}" | ||
| version: "{{asset.version}}" | ||
| | ||
| build: | ||
| path: "{{image.context.path}}" | ||
| dockerfile_path: "{{image.dockerfile.path}}" | ||
| | ||
| os_type: linux | ||
| | ||
| tags: | ||
| PyTorch: "2.8" | ||
| GPU: Cuda12 | ||
| OS: Ubuntu22.04 | ||
| Training: "" | ||
| Preview: "" | ||
| Python: "3.10" | ||
| DeepSpeed: "0.13.1" | ||
| ONNXRuntime: "1.17.1" | ||
| torch_ORT: "1.17.0" | ||
| Checkpointing:Nebula: "0.16.10" |
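
The `tags` block records the versions the image is expected to ship. One way to sanity-check such tags from inside a running container is to compare them with the versions the interpreter reports. The sketch below is an illustrative check and not part of this PR; `matches_tag` and `EXPECTED_TAGS` are hypothetical names, and the PyTorch comparison only runs when `torch` is importable.

```python
import platform
import re

# Subset of the tags from spec.yaml that can be checked at runtime.
EXPECTED_TAGS = {"Python": "3.10", "PyTorch": "2.8"}


def matches_tag(tag, version):
    """Return True if `version` starts with the numeric components of `tag`."""
    wanted = tuple(int(part) for part in tag.split("."))
    # Drop any local build suffix such as "+cu126" before extracting numbers.
    found = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in found[: len(wanted)]) == wanted


if __name__ == "__main__":
    python_ok = matches_tag(EXPECTED_TAGS["Python"], platform.python_version())
    print("Python tag matches:", python_ok)
    try:
        import torch
    except ImportError:
        print("torch is not installed; skipping the PyTorch check")
    else:
        torch_ok = matches_tag(EXPECTED_TAGS["PyTorch"], torch.__version__)
        print("PyTorch tag matches:", torch_ok)
```

Comparing numeric components rather than string prefixes avoids treating a version such as `2.80.1` as a match for the tag `2.8`.
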
94 changes: 94 additions & 0 deletions
...s/training/general/environments/acpt-pytorch-2.8-cuda12.6/tests/pytorch2_8_sample_test.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,94 @@ | ||
| # Copyright (c) Microsoft Corporation. | ||
| # Licensed under the MIT License. | ||
| | ||
| """Test running a sample job in the pytorch 2.8 environment.""" | ||
| import os | ||
| import time | ||
| from pathlib import Path | ||
| from azure.ai.ml import command, Output, MLClient, PyTorchDistribution | ||
| from azure.ai.ml.entities import Environment, BuildContext, JobResourceConfiguration | ||
| from azure.identity import AzureCliCredential | ||
| import subprocess | ||
| | ||
| BUILD_CONTEXT = Path("../context") | ||
| JOB_SOURCE_CODE = "../../acpt-tests/src" | ||
| TIMEOUT_MINUTES = int(os.environ.get("timeout_minutes", 60)) | ||
| STD_LOG = Path("artifacts/user_logs/std_log.txt") | ||
| | ||
| | ||
| def test_pytorch_2_8(): | ||
| """Tests a sample job using pytorch 2.8 as the environment.""" | ||
| this_dir = Path(__file__).parent | ||
| | ||
| subscription_id = os.environ.get("subscription_id") | ||
| resource_group = os.environ.get("resource_group") | ||
| workspace_name = os.environ.get("workspace") | ||
| | ||
| ml_client = MLClient( | ||
| AzureCliCredential(), subscription_id, resource_group, workspace_name | ||
| ) | ||
| | ||
| env_name = "acpt-pytorch-2_8-cuda12_6" | ||
| | ||
| env_docker_context = Environment( | ||
| build=BuildContext(path=this_dir / BUILD_CONTEXT), | ||
| name=env_name, | ||
| description="Pytorch 2.8 environment created from a Docker context.", | ||
| ) | ||
| ml_client.environments.create_or_update(env_docker_context) | ||
| | ||
| # create the command | ||
| job = command( | ||
| code=this_dir / JOB_SOURCE_CODE, # local path where the code is stored | ||
| command="pip install -r requirements.txt" | ||
| " && python pretrain_glue.py --tensorboard_log_dir \"/outputs/runs/\"" | ||
| " --deepspeed ds_config.json --num_train_epochs 5 --output_dir outputs --disable_tqdm 1" | ||
| " --local_rank $RANK --logging_strategy \"epoch\"" | ||
| " --per_device_train_batch_size 93 --gradient_accumulation_steps 1" | ||
| " --per_device_eval_batch_size 93 --learning_rate 3e-05 --adam_beta1 0.8 --adam_beta2 0.999" | ||
| " --weight_decay 3e-07 --warmup_steps 500 --fp16 --logging_steps 1000" | ||
| " --model_checkpoint \"bert-large-uncased\"", | ||
| outputs={ | ||
| "output": Output( | ||
| type="uri_folder", | ||
| mode="rw_mount", | ||
| path="azureml://datastores/workspaceblobstore/paths/outputs" | ||
| ) | ||
| }, | ||
| environment=f"{env_name}@latest", | ||
| compute=os.environ.get("gpu_v100_cluster"), | ||
| display_name="bert-pretrain-GLUE", | ||
| description="Pretrain the BERT model on the GLUE dataset.", | ||
| experiment_name="pytorch28_Cuda126_py310_Experiment", | ||
| distribution=PyTorchDistribution(process_count_per_instance=1), | ||
| resources=JobResourceConfiguration(instance_count=2, shm_size='3100m'), | ||
| ) | ||
| | ||
| returned_job = ml_client.create_or_update(job) | ||
| assert returned_job is not None | ||
| | ||
| # Poll until final status is reached or timed out | ||
| timeout = time.time() + (TIMEOUT_MINUTES * 60) | ||
| while time.time() <= timeout: | ||
| current_status = ml_client.jobs.get(returned_job.name).status | ||
| if current_status in ["Completed", "Failed", "Canceled"]: | ||
| break | ||
| time.sleep(30) # sleep 30 seconds | ||
| | ||
| bashCommand = "ls" | ||
| process = subprocess.Popen(bashCommand.split(), stdout=subprocess.PIPE) | ||
| output, error = process.communicate() | ||
| print(output) | ||
| print(error) | ||
| | ||
| if current_status in ["Failed", "Canceled"]: | ||
| ml_client.jobs.download(returned_job.name) | ||
| if STD_LOG.exists(): | ||
| print(f"*** BEGIN {STD_LOG} ***") | ||
| with open(STD_LOG, "r") as f: | ||
| print(f.read(), end="") | ||
| print(f"*** END {STD_LOG} ***") | ||
| else: | ||
| ml_client.jobs.stream(returned_job.name) | ||
| | ||
| assert current_status == "Completed" |
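
The test reads `timeout_minutes` from the environment and then polls until the job reaches a final status. A minimal sketch of the same idea, written as standalone helpers so it can run without a workspace, might look like the following. The helper names, the injectable `get_status` callable, and the assumption that the service reports cancellation as `Canceled` are illustrative and not taken from the PR.

```python
import os
import time

# Assumed set of final job states; "Canceled" is an assumption about the
# spelling the service reports.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def read_timeout_minutes(env=os.environ, default=60):
    """Environment values are strings, so convert before doing arithmetic."""
    return int(env.get("timeout_minutes", default))


def wait_for_terminal_status(get_status, timeout_seconds, poll_seconds=30,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until a terminal state is seen or the deadline passes."""
    deadline = clock() + timeout_seconds
    status = get_status()
    while status not in TERMINAL_STATES and clock() < deadline:
        sleep(poll_seconds)
        status = get_status()
    return status


if __name__ == "__main__":
    # Stand-in for ml_client.jobs.get(name).status, for illustration only.
    fake_statuses = iter(["Queued", "Running", "Completed"])
    final = wait_for_terminal_status(
        lambda: next(fake_statuses),
        timeout_seconds=read_timeout_minutes() * 60,
        poll_seconds=0,
    )
    print(final)
```

Returning the last observed status lets the caller tell a timeout (a non-terminal value such as `Running`) apart from a job that finished.
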
2 changes: 2 additions & 0 deletions
assets/training/general/environments/acpt-pytorch-2.8-cuda12.6/tests/requirements.txt
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| azure-ai-ml==1.27.1 | ||
| azure.identity==1.10.0 |
3 changes: 1 addition & 2 deletions
assets/training/general/environments/acpt-tests/src/requirements.txt
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,7 +1,6 @@ | ||
| transformers | ||
| datasets | ||
| evaluate | ||
| accelerate | ||
| scikit-learn | ||
| apache_beam | ||
| apache_beam~=2.69.0 | ||
| evaluate |
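
The hunk above pins `apache_beam` with the compatible-release operator `~=`. For a three-part version such as `2.69.0`, that operator accepts any `2.69.x` release at or above `2.69.0` and rejects `2.70.0`. The sketch below is a simplified illustration for plain dotted numeric versions only; it ignores pre-releases, post-releases, and epochs, and `satisfies_compatible_release` is a hypothetical helper rather than pip's own implementation.

```python
def parse_version(text):
    """Parse a plain dotted numeric version such as '2.69.0' into a tuple."""
    return tuple(int(part) for part in text.split("."))


def satisfies_compatible_release(candidate, spec):
    """Return True if `candidate` satisfies `~=spec` (numeric versions only)."""
    base = parse_version(spec)
    if len(base) < 2:
        raise ValueError("~= needs at least two version components")
    version = parse_version(candidate)
    # Pad with zeros so that '2.69' compares equal to '2.69.0'.
    version += (0,) * (len(base) - len(version))
    prefix = base[:-1]
    return version >= base and version[: len(prefix)] == prefix


if __name__ == "__main__":
    for candidate in ["2.68.9", "2.69.0", "2.69.3", "2.70.0"]:
        print(candidate, satisfies_compatible_release(candidate, "2.69.0"))
```
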