Commit 0cda3c4 (2 parents: ee376fe + d4e5de5)

feat: add documentation and NVIDIA Isaac GR00T N1.5 fine-tuning pipeline

File tree: 195 files changed, +3621 −7 lines


.gitattributes

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@

```text
# Git LFS configuration for sample dataset
# Only track large binary files in sample_dataset to allow selective downloads

# Track specific file types in sample_dataset only
training/sample_dataset/**/*.mp4 filter=lfs diff=lfs merge=lfs -text
training/sample_dataset/**/*.parquet filter=lfs diff=lfs merge=lfs -text
training/sample_dataset/**/*.jsonl filter=lfs diff=lfs merge=lfs -text
```
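The comment above mentions selective downloads: with these attributes, a clone made with `GIT_LFS_SKIP_SMUDGE=1` can later fetch only chosen file types via `git lfs pull --include`. A minimal, self-contained sketch (repository and file paths are hypothetical) checks that the patterns resolve as intended:

```shell
# Sketch: confirm which paths the LFS pattern captures (paths are hypothetical)
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
printf '%s\n' 'training/sample_dataset/**/*.mp4 filter=lfs diff=lfs merge=lfs -text' > .gitattributes

# A video under sample_dataset resolves to the lfs filter
match=$(git check-attr filter training/sample_dataset/ep0/cam0.mp4)
echo "$match"

# A video elsewhere is left unspecified, so it is stored as a normal blob
nomatch=$(git check-attr filter training/other/cam0.mp4)
echo "$nomatch"
```

A selective download would then look like `GIT_LFS_SKIP_SMUDGE=1 git clone <repo>` followed by `git lfs pull --include 'training/sample_dataset/**/*.parquet'` (both are standard git-lfs mechanisms).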

.gitignore

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@

```text
devel/
log/
build/
install/
bin/
lib/
msg_gen/
srv_gen/
msg/*Action.msg
msg/*ActionFeedback.msg
msg/*ActionGoal.msg
msg/*ActionResult.msg
msg/*Feedback.msg
msg/*Goal.msg
msg/*Result.msg
msg/_*.py
build_isolated/
devel_isolated/

# Generated by dynamic reconfigure
*.cfgc
/cfg/cpp/
/cfg/*.py

# Ignore generated docs
*.dox
*.wikidoc

# eclipse stuff
.project
.cproject

# qcreator stuff
CMakeLists.txt.user

srv/_*.py
*.pcd
*.pyc
qtcreator-*
*.user

/planning/cfg
/planning/docs
/planning/src

*~

# Emacs
.#*

# Catkin custom files
CATKIN_IGNORE

# credentials
.env

# AWS CDK
training/**/infra/cdk.out/*
training/**/infra/cdk.context.json

# Kiro IDE settings
.kiro/

# macOS files
**/.DS_Store
```

README.md

Lines changed: 47 additions & 7 deletions
```diff
@@ -1,17 +1,57 @@
-## My Project
+# Sample Embodied AI Platform
 
-TODO: Fill this README out!
+A reference platform with components for collecting data, training, evaluating, and deploying embodied AI systems on AWS.
 
-Be sure to:
+## What's New
 
-* Change the title in this README
-* Edit your repository description on GitHub
+* [October 2025] We added the first component: a fine-tuning pipeline for the NVIDIA Isaac GR00T vision-language-action (VLA) model using teleoperation and imitation learning, with deployment for inference on the cost-effective SO-ARM100/101.
+
+## Project goals
+
+- **Accelerate adoption**: An end-to-end reference architecture combining AWS managed services with open source, purpose-built for physical/embodied AI.
+- **Lower the barrier**: Train and test in the cloud, then deploy to real robots, cost-effectively and reproducibly.
+- **Move fast**: Re-train overnight on AWS as tasks and environments change.
+- **Ecosystem enablement**: A practical baseline for startups and enterprises building scalable physical AI pipelines on AWS.
+- **Cloud-to-robot path**: Demonstrates the integration from simulation and training to on-device inference.
+
+## Component overview
+
+This repository is organized into modular components. Each component has its own documentation with setup, deployment, and usage instructions.
+
+### Available components
+
+| Component | Path | Purpose | Docs |
+| --- | --- | --- | --- |
+| NVIDIA Isaac GR00T Training | `training/gr00t/` | Fine-tune NVIDIA Isaac GR00T with teleop/sim data; reproducible workflow on AWS Batch; DCV workstation for monitoring/eval | [training/gr00t/README.md](training/gr00t/README.md) |
+
+## Roadmap
+
+- Additional VLA backbones and training recipes
+- Alternative data generation: teleop, scripted, sim-to-real augmentation, synthetic video
+- More embodiments (humanoids, robotic arms, etc.)
+- Serving patterns (SageMaker, EKS) and agents (Bedrock, OSS)
+- Robust IoT/edge deployment (AWS IoT/Greengrass), safety/telemetry best practices
 
 ## Security
 
-See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
+Review and run security scans before production use. See:
+- each component's own security considerations and best practices
+- [CONTRIBUTING](CONTRIBUTING.md)
+
+## Reporting Issues
+
+If you notice a defect, feel free to create an [Issue](https://github.com/aws-samples/sample-embodied-ai-platform/issues).
+
+## Contributing
+
+Contributions are welcome. Please see [CONTRIBUTING](CONTRIBUTING.md) and [CODE_OF_CONDUCT](CODE_OF_CONDUCT.md).
 
 ## License
 
-This library is licensed under the MIT-0 License. See the LICENSE file.
+This project is licensed under the MIT-0 License. See [LICENSE](LICENSE).
+
+## Acknowledgments
+
+- AWS teams and community projects
+- NVIDIA Isaac team and open-source contributors
 
```

training/gr00t/.dockerignore

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@

```text
# Exclude infrastructure and build artifacts
infra/
cdk.out/
__pycache__/
*.pyc
*.pyo
*.pyd
.Python
env/
venv/
.venv
pip-log.txt
pip-delete-this-directory.txt
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.log
.git/
.mypy_cache/
.pytest_cache/
.hypothesis/
**/.DS_Store
```

training/gr00t/Dockerfile

Lines changed: 102 additions & 0 deletions
@@ -0,0 +1,102 @@

```dockerfile
# Alternative Dockerfile from the GR00T repo: https://github.com/NVIDIA/Isaac-GR00T/blob/main/Dockerfile
# Optimized Dockerfile for Isaac-GR00T using the UV package manager
FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04

# Build arguments to control the GR00T version (stable vs latest)
ARG USE_STABLE=true
ARG STABLE_COMMIT=db107f03d165060998df166292578f1d7fb3c79a

# Set the frontend to non-interactive to avoid any user input
# being required during package installation
ENV DEBIAN_FRONTEND=noninteractive

# System dependencies - consolidated for better layer caching
RUN apt-get update && apt-get install -y --no-install-recommends \
    # Core utilities
    wget curl ca-certificates unzip \
    # Git and version control
    git git-lfs \
    # Build essentials
    build-essential cmake \
    # Media processing
    ffmpeg \
    # OpenCV dependencies
    libopencv-dev libgl1-mesa-glx libglib2.0-0 libsm6 libxext6 libxrender-dev \
    # Python development
    python3.10 python3.10-dev python3.10-distutils python3-pip \
    # Utilities for debugging
    vim less htop \
    && rm -rf /var/lib/apt/lists/*

# Install AWS CLI v2 (official AWS method)
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
    unzip awscliv2.zip && \
    ./aws/install && \
    rm -rf awscliv2.zip aws && \
    rm -rf /usr/local/aws-cli/v2/*/dist/awscli/examples

# Set Python 3.10 as default
RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.10 1 && \
    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1

# Install the UV package manager (much faster than pip/conda)
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

# Configure UV to use the system Python
ENV UV_SYSTEM_PYTHON=1
ENV PYTHONPATH=/workspace

# ============================================================================
# Clone the Isaac-GR00T repository
# This layer changes when building with USE_STABLE=false
# ============================================================================
RUN git clone https://github.com/NVIDIA/Isaac-GR00T.git /workspace && \
    cd /workspace && \
    if [ "${USE_STABLE}" = "true" ]; then \
        echo "Using stable commit: ${STABLE_COMMIT} (default)"; \
        git checkout ${STABLE_COMMIT}; \
    else \
        echo "Using latest version from main branch"; \
    fi && \
    echo "GR00T version info:" && \
    git log -1 --format="%H %ai %s"

# Set working directory
WORKDIR /workspace

# Upgrade pip and setuptools using UV
RUN uv pip install --upgrade pip setuptools wheel

# Install GR00T base dependencies using UV (faster resolution and installation)
RUN uv pip install --no-cache -e .[base]

# Install flash-attention separately (requires build isolation disabled)
RUN pip install --no-build-isolation flash-attn==2.7.1.post4

# Install additional utilities
RUN uv pip install --no-cache notebook gpustat wandb

# Install the HuggingFace CLI and additional dependencies if necessary
# RUN pip install huggingface_hub[cli] datasets

# Copy the workflow scripts
COPY finetune_gr00t.py /workspace/scripts/
COPY run_finetune_workflow.sh /workspace/scripts/
RUN chmod +x /workspace/scripts/run_finetune_workflow.sh

# Set environment variables with defaults
ENV DATASET_LOCAL_DIR="/workspace/train"
ENV OUTPUT_DIR="/workspace/checkpoints"

# If there is an issue with the latest model version, pin the checkpoint
# tested as of 09 July 2025 by uncommenting the following two lines
# RUN hf download nvidia/GR00T-N1.5-3B --revision 869830fc749c35f34771aa5209f923ac57e4564e --local-dir ./GR00T-N1.5-3B
# ENV BASE_MODEL_PATH="./GR00T-N1.5-3B"

# Create directories using the environment variables
RUN mkdir -p ${DATASET_LOCAL_DIR} ${OUTPUT_DIR}

# Entrypoint is /bin/bash, so the CMD below is executed as its argument
ENTRYPOINT ["/bin/bash"]
# Default command runs the workflow, but can be overridden
CMD ["/workspace/scripts/run_finetune_workflow.sh"]
# CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
```
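The `USE_STABLE`/`STABLE_COMMIT` arguments select what the clone step checks out. The selection logic reduces to the sketch below; the `docker build` invocations and image tag are illustrative assumptions, not commands from this repo:

```shell
# Same ref selection the Dockerfile's clone step performs (sketch)
resolve_ref() {
  # $1 = USE_STABLE, $2 = STABLE_COMMIT
  if [ "$1" = "true" ]; then
    echo "$2"      # pinned commit: reproducible builds
  else
    echo "main"    # track upstream HEAD
  fi
}

resolve_ref true  db107f03d165060998df166292578f1d7fb3c79a
resolve_ref false db107f03d165060998df166292578f1d7fb3c79a

# Hypothetical builds (tag name is an assumption):
#   docker build -t gr00t-finetune .                               # stable (default)
#   docker build --build-arg USE_STABLE=false -t gr00t-finetune .  # latest main
```

Pinning to `STABLE_COMMIT` by default keeps the training image reproducible even if upstream `main` breaks.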

training/gr00t/README.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@

# NVIDIA Isaac GR00T Training Component

Fine-tune NVIDIA Isaac GR00T VLA models using teleoperation/simulation datasets. Supports GPU training on AWS Batch, with an Amazon DCV workstation for monitoring and evaluation. This README covers high-level usage and structure; detailed infrastructure and deployment instructions live in `infra/README.md`.

## Links

- Component docs (this file): [README.md](README.md)
- Infrastructure and deployment: [infra/README.md](infra/README.md)
- Workflow scripts: [run_finetune_workflow.sh](run_finetune_workflow.sh), [finetune_gr00t.py](finetune_gr00t.py)

## Deployment

See [infra/README.md](infra/README.md).

## Module Structure

```text
training/gr00t/
├── README.md                # GR00T training overview
├── Dockerfile               # Training container
├── build_container.sh       # Build/test/push helper
├── env.example              # Example environment variables
├── finetune_gr00t.py        # GR00T training script
├── run_finetune_workflow.sh # Entrypoint: dataset, auth, uploads
└── infra/                   # AWS CDK stacks for Batch and DCV
    ├── README.md            # Deployment guide (paths 1–3, troubleshooting)
    ├── app.py
    ├── batch_stack.py
    ├── dcv_stack.py
    ├── configure_dcv_instance.sh
    ├── requirements.txt
    ├── cdk.json             # Context (VPC/EFS/SG IDs) when importing existing resources
    └── architecture.drawio.png
```

## Submitting Jobs

After deploying the infrastructure (see [infra/README.md](infra/README.md)), submit training jobs to AWS Batch:

**AWS CLI:**
```bash
aws batch submit-job \
  --job-name "IsaacGr00tFinetuning" \
  --job-queue "IsaacGr00tJobQueue" \
  --job-definition "IsaacGr00tJobDefinition"
```

**With custom environment variables:**
```bash
aws batch submit-job \
  --job-name "IsaacGr00tFinetuning" \
  --job-queue "IsaacGr00tJobQueue" \
  --job-definition "IsaacGr00tJobDefinition" \
  --container-overrides 'environment=[
    {name=HF_DATASET_ID,value=lerobot/your-dataset},
    {name=MAX_STEPS,value=6000},
    {name=SAVE_STEPS,value=2000}
  ]'
```

**AWS Console:**
1. Go to AWS Batch → Jobs → Submit new job
2. Select `IsaacGr00tJobDefinition` and `IsaacGr00tJobQueue`
3. Add environment variables as needed
4. Submit

**Monitor progress:**
```bash
# Check status
aws batch describe-jobs --jobs <JOB_ID>

# Stream logs (once RUNNING)
aws logs tail /aws/batch/job --follow \
  --log-stream-names "$(aws batch describe-jobs --jobs <JOB_ID> \
    --query 'jobs[0].container.logStreamName' --output text)"
```

> Default: 6000 steps (~3 hours on g6e.4xlarge). Checkpoints are saved every 2000 steps to `/mnt/efs/gr00t/checkpoints`.

## Configuration (env vars)

See [env.example](env.example) for configuring the training job parameters:
- Dataset sources: `DATASET_LOCAL_DIR`, `DATASET_S3_URI`, `HF_DATASET_ID`
- Uploads: `UPLOAD_TARGET` (hf|s3|none), `HF_TOKEN`, `HF_MODEL_REPO_ID`, `S3_UPLOAD_URI`
- Training: `MAX_STEPS`, `SAVE_STEPS`, `NUM_GPUS`, `BATCH_SIZE`, `LEARNING_RATE`
- Model/data: `BASE_MODEL_PATH`, `DATA_CONFIG`, `VIDEO_BACKEND`, `EMBODIMENT_TAG`
- Tuning: `TUNE_LLM`, `TUNE_VISUAL`, `TUNE_PROJECTOR`, `TUNE_DIFFUSION_MODEL`, LoRA params
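The workflow script itself is not shown in this diff, but the precedence implied above, where a Batch `containerOverrides` value beats the baked-in default, is commonly handled with shell parameter expansion. A sketch under that assumption (variable names and the 6000/2000 defaults come from this README; the script's actual code may differ):

```shell
# Default-with-override pattern: ${VAR:-default} keeps VAR if the job
# submission set it, otherwise falls back to the documented default.
unset MAX_STEPS        # simulate: no containerOverrides entry for MAX_STEPS
SAVE_STEPS=2500        # simulate: SAVE_STEPS overridden at submit time

MAX_STEPS="${MAX_STEPS:-6000}"     # -> 6000 (baked-in default applies)
SAVE_STEPS="${SAVE_STEPS:-2000}"   # -> 2500 (submitted override wins)

echo "MAX_STEPS=$MAX_STEPS SAVE_STEPS=$SAVE_STEPS"
```

This is why the `submit-job --container-overrides` examples above only need to list the variables that differ from the defaults.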
