Commit 0cda3c4 (2 parents: ee376fe + d4e5de5)

feat: add documentation and NVIDIA Isaac GR00T N1.5 fine-tuning pipeline

File tree: 195 files changed, +3621 −7 lines


.gitattributes

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@

```text
# Git LFS configuration for sample dataset
# Only track large binary files in sample_dataset to allow selective downloads

# Track specific file types in sample_dataset only
training/sample_dataset/**/*.mp4 filter=lfs diff=lfs merge=lfs -text
training/sample_dataset/**/*.parquet filter=lfs diff=lfs merge=lfs -text
training/sample_dataset/**/*.jsonl filter=lfs diff=lfs merge=lfs -text
```
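The comment above mentions selective downloads: with these attributes, a clone made with `GIT_LFS_SKIP_SMUDGE=1` can later fetch only chosen file types via `git lfs pull --include`. A minimal, self-contained sketch (repository and file paths are hypothetical) checks that the patterns resolve as intended:

```shell
# Sketch: confirm which paths the LFS pattern captures (paths are hypothetical)
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
printf '%s\n' 'training/sample_dataset/**/*.mp4 filter=lfs diff=lfs merge=lfs -text' > .gitattributes

# A video under sample_dataset resolves to the lfs filter
match=$(git check-attr filter training/sample_dataset/ep0/cam0.mp4)
echo "$match"

# A video elsewhere is left unspecified, so it is stored as a normal blob
nomatch=$(git check-attr filter training/other/cam0.mp4)
echo "$nomatch"
```

A selective download would then look like `GIT_LFS_SKIP_SMUDGE=1 git clone <repo>` followed by `git lfs pull --include 'training/sample_dataset/**/*.parquet'` (both are standard git-lfs mechanisms).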

.gitignore

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@

```text
devel/
log/
build/
install/
bin/
lib/
msg_gen/
srv_gen/
msg/*Action.msg
msg/*ActionFeedback.msg
msg/*ActionGoal.msg
msg/*ActionResult.msg
msg/*Feedback.msg
msg/*Goal.msg
msg/*Result.msg
msg/_*.py
build_isolated/
devel_isolated/

# Generated by dynamic reconfigure
*.cfgc
/cfg/cpp/
/cfg/*.py

# Ignore generated docs
*.dox
*.wikidoc

# eclipse stuff
.project
.cproject

# qcreator stuff
CMakeLists.txt.user

srv/_*.py
*.pcd
*.pyc
qtcreator-*
*.user

/planning/cfg
/planning/docs
/planning/src

*~

# Emacs
.#*

# Catkin custom files
CATKIN_IGNORE

# credentials
.env

# AWS CDK
training/**/infra/cdk.out/*
training/**/infra/cdk.context.json

# Kiro IDE settings
.kiro/

# macOS files
**/.DS_Store
```

README.md

Lines changed: 47 additions & 7 deletions
```diff
@@ -1,17 +1,57 @@
-## My Project
+# Sample Embodied AI Platform
 
-TODO: Fill this README out!
+A reference platform with components for collecting data, training, evaluating, and deploying embodied AI systems on AWS.
 
-Be sure to:
+## What's New
 
-* Change the title in this README
-* Edit your repository description on GitHub
+* [October 2025] We added the first component: a fine-tuning pipeline for the NVIDIA Isaac GR00T vision-language-action (VLA) model using teleoperation and imitation learning, with deployment for inference on the cost-effective SO-ARM100/101.
+
+## Project goals
+
+- **Accelerate adoption**: An end-to-end reference architecture combining AWS managed services with open source, purpose-built for physical/embodied AI.
+- **Lower the barrier**: Train and test in the cloud, then deploy to real robots, cost-effectively and reproducibly.
+- **Move fast**: Re-train overnight on AWS as tasks and environments change.
+- **Ecosystem enablement**: A practical baseline for startups and enterprises building scalable physical AI pipelines on AWS.
+- **Cloud-to-robot path**: Demonstrates the integration from simulation and training to on-device inference.
+
+## Component overview
+
+This repository is organized into modular components. Each component has its own documentation with setup, deployment, and usage instructions.
+
+### Available components
+
+| Component | Path | Purpose | Docs |
+| --- | --- | --- | --- |
+| NVIDIA Isaac GR00T Training | `training/gr00t/` | Fine-tune NVIDIA Isaac GR00T with teleop/sim data; reproducible workflow on AWS Batch; DCV workstation for monitoring/eval | [training/gr00t/README.md](training/gr00t/README.md) |
+
+## Roadmap
+
+- Additional VLA backbones and training recipes
+- Alternative data generation: teleop, scripted, sim-to-real augmentation, synthetic video
+- More embodiments (humanoids, robotic arms, etc.)
+- Serving patterns (SageMaker, EKS) and agents (Bedrock, OSS)
+- Robust IoT/edge deployment (AWS IoT/Greengrass), safety/telemetry best practices
 
 ## Security
 
-See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
+Review and run security scans before production use. See:
+- each component's own security considerations and best practices
+- [CONTRIBUTING](CONTRIBUTING.md)
+
+## Reporting Issues
+
+If you notice a defect, feel free to create an [Issue](https://github.com/aws-samples/sample-embodied-ai-platform/issues).
+
+## Contributing
+
+Contributions are welcome. Please see [CONTRIBUTING](CONTRIBUTING.md) and [CODE_OF_CONDUCT](CODE_OF_CONDUCT.md).
 
 ## License
 
-This library is licensed under the MIT-0 License. See the LICENSE file.
+This project is licensed under the MIT-0 License. See [LICENSE](LICENSE).
+
+## Acknowledgments
+
+- AWS teams and community projects
+- NVIDIA Isaac team and open-source contributors
 
```

training/gr00t/.dockerignore

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@

```text
# Exclude infrastructure and build artifacts
infra/
cdk.out/
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
env/
venv/
.venv
pip-log.txt
pip-delete-this-directory.txt
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.log
.git/
.mypy_cache/
.pytest_cache/
.hypothesis/
**/.DS_Store
```

training/gr00t/Dockerfile

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@

```dockerfile
# Alternative Dockerfile from the GR00T repo: https://github.com/NVIDIA/Isaac-GR00T/blob/main/Dockerfile
# Optimized Dockerfile for Isaac-GR00T using the UV package manager
FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04

# Build arguments to control the GR00T version (stable vs latest)
ARG USE_STABLE=true
ARG STABLE_COMMIT=db107f03d165060998df166292578f1d7fb3c79a

# Set the frontend to non-interactive to avoid any user input
# being required during package installation
ENV DEBIAN_FRONTEND=noninteractive

# System dependencies - consolidated for better layer caching
RUN apt-get update && apt-get install -y --no-install-recommends \
    # Core utilities
    wget curl ca-certificates unzip \
    # Git and version control
    git git-lfs \
    # Build essentials
    build-essential cmake \
    # Media processing
    ffmpeg \
    # OpenCV dependencies
    libopencv-dev libgl1-mesa-glx libglib2.0-0 libsm6 libxext6 libxrender-dev \
    # Python development
    python3.10 python3.10-dev python3.10-distutils python3-pip \
    # Utilities for debugging
    vim less htop \
    && rm -rf /var/lib/apt/lists/*

# Install AWS CLI v2 (official AWS method)
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
    unzip awscliv2.zip && \
    ./aws/install && \
    rm -rf awscliv2.zip aws && \
    rm -rf /usr/local/aws-cli/v2/*/dist/awscli/examples

# Set Python 3.10 as default
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1 && \
    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1

# Install the UV package manager (much faster than pip/conda)
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# Configure UV to use the system Python
ENV UV_SYSTEM_PYTHON=1
ENV PYTHONPATH=/workspace

# ============================================================================
# Clone the Isaac-GR00T repository
# This layer changes when building with USE_STABLE=false
# ============================================================================
RUN git clone https://github.com/NVIDIA/Isaac-GR00T.git /workspace && \
    cd /workspace && \
    if [ "${USE_STABLE}" = "true" ]; then \
        echo "Using stable commit: ${STABLE_COMMIT} (default)"; \
        git checkout ${STABLE_COMMIT}; \
    else \
        echo "Using latest version from main branch"; \
    fi && \
    echo "GR00T version info:" && \
    git log -1 --format="%H %ai %s"

# Set working directory
WORKDIR /workspace

# Upgrade pip and setuptools using UV
RUN uv pip install --upgrade pip setuptools wheel

# Install GR00T base dependencies using UV (faster resolution and installation)
RUN uv pip install --no-cache -e .[base]

# Install flash-attention separately (requires build isolation disabled)
RUN pip install --no-build-isolation flash-attn==2.7.1.post4

# Install additional utilities
RUN uv pip install --no-cache notebook gpustat wandb

# Install the HuggingFace CLI and additional dependencies if necessary
# RUN pip install huggingface_hub[cli] datasets

# Copy the workflow scripts
COPY finetune_gr00t.py /workspace/scripts/
COPY run_finetune_workflow.sh /workspace/scripts/
RUN chmod +x /workspace/scripts/run_finetune_workflow.sh

# Set environment variables with defaults
ENV DATASET_LOCAL_DIR="/workspace/train"
ENV OUTPUT_DIR="/workspace/checkpoints"

# If there is an issue with the latest model version, pin the checkpoint
# tested as of 09 July 2025 by uncommenting the following two lines
# RUN hf download nvidia/GR00T-N1.5-3B --revision 869830fc749c35f34771aa5209f923ac57e4564e --local-dir ./GR00T-N1.5-3B
# ENV BASE_MODEL_PATH="./GR00T-N1.5-3B"

# Create directories using the environment variables
RUN mkdir -p ${DATASET_LOCAL_DIR} ${OUTPUT_DIR}

# Entrypoint is /bin/bash, so the CMD below is executed as its argument
ENTRYPOINT ["/bin/bash"]
# Default command runs the workflow, but can be overridden
CMD ["/workspace/scripts/run_finetune_workflow.sh"]
# CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
```
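The `USE_STABLE`/`STABLE_COMMIT` arguments select what the clone step checks out. The selection logic reduces to the sketch below; the `docker build` invocations and image tag are illustrative assumptions, not commands from this repo:

```shell
# Same ref selection the Dockerfile's clone step performs (sketch)
resolve_ref() {
  # $1 = USE_STABLE, $2 = STABLE_COMMIT
  if [ "$1" = "true" ]; then
    echo "$2"      # pinned commit: reproducible builds
  else
    echo "main"    # track upstream HEAD
  fi
}

resolve_ref true  db107f03d165060998df166292578f1d7fb3c79a
resolve_ref false db107f03d165060998df166292578f1d7fb3c79a

# Hypothetical builds (tag name is an assumption):
#   docker build -t gr00t-finetune .                               # stable (default)
#   docker build --build-arg USE_STABLE=false -t gr00t-finetune .  # latest main
```

Pinning to `STABLE_COMMIT` by default keeps the training image reproducible even if upstream `main` breaks.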

training/gr00t/README.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@

# NVIDIA Isaac GR00T Training Component

Fine-tune NVIDIA Isaac GR00T VLA models using teleoperation/simulation datasets. Supports GPU training on AWS Batch, with an Amazon DCV workstation for monitoring and evaluation. This README covers high-level usage and structure; detailed infrastructure and deployment instructions live in `infra/README.md`.

## Links

- Component docs (this file): [README.md](README.md)
- Infrastructure and deployment: [infra/README.md](infra/README.md)
- Workflow scripts: [run_finetune_workflow.sh](run_finetune_workflow.sh), [finetune_gr00t.py](finetune_gr00t.py)

## Deployment

See [infra/README.md](infra/README.md).

## Module Structure

```text
training/gr00t/
├── README.md                # GR00T training overview
├── Dockerfile               # Training container
├── build_container.sh       # Build/test/push helper
├── env.example              # Example environment variables
├── finetune_gr00t.py        # GR00T training script
├── run_finetune_workflow.sh # Entrypoint: dataset, auth, uploads
└── infra/                   # AWS CDK stacks for Batch and DCV
    ├── README.md            # Deployment guide (paths 1–3, troubleshooting)
    ├── app.py
    ├── batch_stack.py
    ├── dcv_stack.py
    ├── configure_dcv_instance.sh
    ├── requirements.txt
    ├── cdk.json             # Context (VPC/EFS/SG IDs) when importing existing resources
    └── architecture.drawio.png
```

## Submitting Jobs

After deploying the infrastructure (see [infra/README.md](infra/README.md)), submit training jobs to AWS Batch:

**AWS CLI:**
```bash
aws batch submit-job \
  --job-name "IsaacGr00tFinetuning" \
  --job-queue "IsaacGr00tJobQueue" \
  --job-definition "IsaacGr00tJobDefinition"
```

**With custom environment variables:**
```bash
aws batch submit-job \
  --job-name "IsaacGr00tFinetuning" \
  --job-queue "IsaacGr00tJobQueue" \
  --job-definition "IsaacGr00tJobDefinition" \
  --container-overrides 'environment=[
    {name=HF_DATASET_ID,value=lerobot/your-dataset},
    {name=MAX_STEPS,value=6000},
    {name=SAVE_STEPS,value=2000}
  ]'
```

**AWS Console:**
1. Go to AWS Batch → Jobs → Submit new job
2. Select `IsaacGr00tJobDefinition` and `IsaacGr00tJobQueue`
3. Add environment variables as needed
4. Submit

**Monitor progress:**
```bash
# Check status
aws batch describe-jobs --jobs <JOB_ID>

# Stream logs (once RUNNING)
aws logs tail /aws/batch/job --follow \
  --log-stream-names "$(aws batch describe-jobs --jobs <JOB_ID> \
    --query 'jobs[0].container.logStreamName' --output text)"
```

> Default: 6000 steps (~3 hours on g6e.4xlarge). Checkpoints are saved every 2000 steps to `/mnt/efs/gr00t/checkpoints`.

## Configuration (env vars)

See [env.example](env.example) for configuring the training job parameters:
- Dataset sources: `DATASET_LOCAL_DIR`, `DATASET_S3_URI`, `HF_DATASET_ID`
- Uploads: `UPLOAD_TARGET` (hf|s3|none), `HF_TOKEN`, `HF_MODEL_REPO_ID`, `S3_UPLOAD_URI`
- Training: `MAX_STEPS`, `SAVE_STEPS`, `NUM_GPUS`, `BATCH_SIZE`, `LEARNING_RATE`
- Model/data: `BASE_MODEL_PATH`, `DATA_CONFIG`, `VIDEO_BACKEND`, `EMBODIMENT_TAG`
- Tuning: `TUNE_LLM`, `TUNE_VISUAL`, `TUNE_PROJECTOR`, `TUNE_DIFFUSION_MODEL`, LoRA params
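The workflow script itself is not shown in this diff, but the precedence implied above, where a Batch `containerOverrides` value beats the baked-in default, is commonly handled with shell parameter expansion. A sketch under that assumption (variable names and the 6000/2000 defaults come from this README; the script's actual code may differ):

```shell
# Default-with-override pattern: ${VAR:-default} keeps VAR if the job
# submission set it, otherwise falls back to the documented default.
unset MAX_STEPS        # simulate: no containerOverrides entry for MAX_STEPS
SAVE_STEPS=2500        # simulate: SAVE_STEPS overridden at submit time

MAX_STEPS="${MAX_STEPS:-6000}"     # -> 6000 (baked-in default applies)
SAVE_STEPS="${SAVE_STEPS:-2000}"   # -> 2500 (submitted override wins)

echo "MAX_STEPS=$MAX_STEPS SAVE_STEPS=$SAVE_STEPS"
```

This is why the `submit-job --container-overrides` examples above only need to list the variables that differ from the defaults.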
