Commit 4a35bbb

Merge pull request #955 from A7med7x7/main
Final Report NYC.
2 parents 8efc518 + 24b400c commit 4a35bbb

4 files changed

+99
-1
lines changed

content/report/osre25/nyu/mlops/07292025-alghali/index.md

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ The progress on the project so far is as follows:
3838

3939
Artifacts are now logged directly from the MLflow server to the Chameleon object store, without relying on a database backend or an intermediate MinIO S3 layer.
4040

41-
#### Different jupyter lap images for each framework.
41+
#### Different jupyter lab images for each framework.
4242
We’ve started with the top ML frameworks — PyTorch Lightning, Keras/TensorFlow, and Scikit-Learn. Each framework now has its own image, which will later be tailored to the user’s selection.
4343

4444
#### Github CLI and Hugging Face integration inside the container.
Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
1+
---
2+
title: "Final Report: Streamlining Reproducible Machine Learning Research with Automated MLOps Workflows"
3+
subtitle: ""
4+
summary: " "
5+
authors:
6+
- alghali
7+
tags: ["osre25","reproducibility","linux", "experiment tracking","machine learning research"]
8+
categories: []
9+
date: 2025-09-18
10+
lastmod: 2025-09-20
11+
featured: true
12+
draft: false
13+
14+
# Featured image
15+
# To use, add an image named `featured.jpg/png` to your page's folder.
16+
# Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
17+
image:
18+
caption: ""
19+
focal_point: ""
20+
preview_only: false
21+
22+
---
23+
24+
# Final Report: Applying MLOps to Overcome Reproducibility Barriers in ML
25+
![Generating project](image1.png)
26+
## Background
27+
28+
Hello! I’m Ahmed Alghali, and this is my final report for the project [**Applying MLOps to Overcome Reproducibility Barriers in ML**](https://ucsc-ospo.github.io/project/osre25/nyu/mlops/), carried out under the mentorship of Professor [Fraida Fund](https://ucsc-ospo.github.io/author/fraida-fund/) and [Mohamed Saeed](https://ucsc-ospo.github.io/author/mohamed-saeed/).
29+
30+
This project aims to address the **reproducibility problem** in machine learning—both in core ML research and in applications to other areas of science.
31+
32+
The focus is on making large-scale ML experiments **reproducible on [Chameleon Cloud](https://www.chameleoncloud.org/)**. To do this, we developed [**ReproGen**](https://github.com/A7med7x7/ReproGen), a template generator that produces ready-to-use, reproducible ML training workflows. The goal is to make the cloud easy to use for researchers setting up experiments, without having to worry about the complexity of stitching everything together.
33+
34+
---
35+
36+
## Progress Since Mid-Report
37+
38+
### Migration from Cookiecutter to Copier
39+
40+
We initially used [Cookiecutter](https://www.cookiecutter.io/) as the templating engine, but it lacked features we were interested in (e.g., conditional questions). We switched to [Copier](https://copier.readthedocs.io/en/stable/), which provides more flexibility and better matches our use case.
41+
42+
### Support for Multiple Setup Modes
43+
44+
We now offer **two setup modes**, designed to serve both beginners and users who want advanced options/customization:
45+
46+
- **Basic Mode** – minimal prompts (project name, repository link, framework).
47+
48+
- **Advanced Mode** – detailed control (compute site, GPU type, CUDA version, storage site, etc.).
49+
50+
51+
This ensures accessibility for new users while still enabling fine-grained control for advanced users.
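
The two modes can be pictured as one ordered list of questions in which some are asked only when a condition on earlier answers holds. The following is a minimal Python sketch of that idea, not ReproGen's or Copier's actual implementation: the `Question` class, the `collect_answers` helper, the question names, and all default values are hypothetical and chosen only to mirror the prompts listed above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Question:
    """One prompt; `when` decides whether it is asked, given earlier answers."""
    name: str
    default: str
    when: Callable[[Dict[str, str]], bool] = lambda answers: True


def is_advanced(answers: Dict[str, str]) -> bool:
    """Condition used by the prompts that only appear in advanced mode."""
    return answers.get("setup_mode") == "advanced"


# Hypothetical question set; names and defaults are illustrative assumptions.
QUESTIONS: List[Question] = [
    Question("setup_mode", "basic"),
    Question("project_name", "my-project"),
    Question("repository_link", ""),
    Question("framework", "pytorch"),
    Question("compute_site", "default-site", when=is_advanced),
    Question("gpu_type", "nvidia", when=is_advanced),
    Question("cuda_version", "default", when=is_advanced),
    Question("storage_site", "default-site", when=is_advanced),
]


def collect_answers(provided: Dict[str, str]) -> Dict[str, str]:
    """Walk the questions in order, skipping any whose condition is false."""
    answers: Dict[str, str] = {}
    for question in QUESTIONS:
        if question.when(answers):
            answers[question.name] = provided.get(question.name, question.default)
    return answers


basic = collect_answers({"project_name": "demo"})
advanced = collect_answers({"setup_mode": "advanced", "gpu_type": "amd"})
```

In this sketch, `basic` contains only the four minimal answers, while `advanced` also carries the compute site, GPU type, CUDA version, and storage site.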
52+
![prompting](image2.png)
53+
### Automated Credential Generation
54+
55+
Previously, users had to manually generate application credentials (via the Horizon OpenStack UI). Now, we provide scripts that programmatically generate two types of credentials, **Swift** and **EC2**, using **Chameleon JupyterHub credentials** with `python-chi` and the `openstack-sdk` client.
56+
57+
### Automatic README.md Generation
58+
59+
Each generated project includes a **customized README.md**, containing setup guidance and commands tailored to the user’s configuration.
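
As an illustration only, tailoring a README to the user's answers can be sketched with Python's standard `string.Template`. This is not the project's actual template: the template text, the field names (`project_name`, `framework`, `repository_link`), and the `render_readme` helper are assumptions made for the example.

```python
from string import Template

# Hypothetical README skeleton; the real generated README is more detailed.
README_TEMPLATE = Template(
    "# $project_name\n\n"
    "Framework: $framework\n\n"
    "## Setup\n\n"
    "Clone the repository:\n\n"
    "    git clone $repository_link\n"
)


def render_readme(answers: dict) -> str:
    """Fill the skeleton with the user's answers; raises KeyError if one is missing."""
    return README_TEMPLATE.substitute(answers)


readme = render_readme(
    {
        "project_name": "demo",
        "framework": "pytorch",
        "repository_link": "https://example.com/demo.git",
    }
)
```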
60+
61+
### Bug Fixes and UX Enhancements
62+
63+
Alongside major features, we implemented numerous smaller changes and fixes to improve the reliability and user experience of the tool.
64+
65+
---
66+
67+
## Deliverables
68+
69+
- [**ReproGen GitHub Repository**](https://github.com/A7med7x7/ReproGen): source code for the template generator.
70+
71+
- [**mlflow-replay branch**](https://github.com/A7med7x7/ReproGen/tree/mlflow-replay): explore a past experiment, artifacts, and logged insights.
72+
73+
- [**LLM-Demo branch**](https://github.com/A7med7x7/ReproGen/tree/training-demo): hands-on demo to track fine-tuning of an LLM using infrastructure generated by ReproGen.
74+
75+
76+
---
77+
78+
## Next Steps
79+
80+
1. **Compatibility Matrix**
81+
82+
- The tool and the generated setup both depend on software whose compatibility needs attention at every level: hardware, OS, drivers, computing platforms, and core and third-party libraries. Writing documentation is a first step toward easier debugging and toward adding new pieces without breaking what is already there.
83+
84+
2. **Maintain Docker Images**
85+
86+
So far we have CPU and GPU Docker images for several of the most frequently used frameworks.
87+
- **CPU-based image**: for data science workloads (Scikit-Learn)
88+
- **GPU-NVIDIA variant**: for deep learning workloads on NVIDIA machines (PyTorch, Lightning, TensorFlow)
89+
- **GPU-AMD variant**: for deep learning workloads on AMD machines (PyTorch, Lightning, TensorFlow)
90+
Adding more variants for more frameworks and enhancing the experience of the existing images is recommended.
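
The mapping from framework and hardware to image variant described above could serve as a seed for the compatibility matrix. Below is a minimal Python sketch under stated assumptions: the variant labels (`cpu`, `gpu-nvidia`, `gpu-amd`), the framework keys, and the `select_image_variant` function are hypothetical names, not identifiers taken from ReproGen.

```python
from typing import Optional

# Frameworks per image family, as listed above (keys are illustrative).
CPU_FRAMEWORKS = {"scikit-learn"}
GPU_FRAMEWORKS = {"pytorch", "lightning", "tensorflow"}
GPU_VENDORS = {"nvidia", "amd"}


def select_image_variant(framework: str, gpu_vendor: Optional[str] = None) -> str:
    """Return a hypothetical image variant label for a framework/hardware pair."""
    name = framework.lower()
    if name in CPU_FRAMEWORKS:
        return "cpu"
    if name in GPU_FRAMEWORKS:
        vendor = (gpu_vendor or "").lower()
        if vendor not in GPU_VENDORS:
            raise ValueError(f"{framework} needs a GPU vendor from {sorted(GPU_VENDORS)}")
        return f"gpu-{vendor}"
    raise ValueError(f"no image variant known for framework {framework!r}")
```

Raising an error on unknown combinations, rather than falling back to a default image, keeps unsupported setups visible, which is the kind of information a compatibility matrix is meant to record.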
91+
92+
93+
94+
---
95+
96+
## Reflection
97+
98+
When I first joined SoR 2025, I had trouble crystallizing how I could practically achieve reproducibility and package a tool that maximizes the chance of reproducing experiments built with it. Throughout the journey, my mentors took me under their wing and helped me understand the **reproducibility challenges in ML**. My mentor, Professor [Fraida Fund](https://ucsc-ospo.github.io/author/fraida-fund/), wrote materials that saved me a lot of time in familiarizing myself with the [testbed](https://www.chameleoncloud.org/) and with important Linux tools and commands, and that even gave me hands-on practice with how [large model training](https://teaching-on-testbeds.github.io/mltrain-chi/) with an MLflow tracking server is done in the cloud. [Mohamed Saeed](https://ucsc-ospo.github.io/author/mohamed-saeed/) took the time to review my presentation and pushed me to do my best. I'm forever thankful for the way they shaped the project and my personal growth. This hands-on experience helped me view **MLOps, cloud APIs, and workflow design** through a different lens, and I’m proud to have contributed a tool that can help simplify reproducible research for others.
