|
| 1 | +--- |
| 2 | +title: "Final Report: Streamlining Reproducible Machine Learning Research with Automated MLOps Workflows" |
| 3 | +subtitle: "" |
| 4 | +summary: " " |
| 5 | +authors: |
| 6 | + - alghali |
| 7 | +tags: ["osre25","reproducibility","linux", "experiment tracking","machine learning research"] |
| 8 | +categories: [] |
| 9 | +date: 2025-09-18 |
| 10 | +lastmod: 2025-09-20 |
| 11 | +featured: true |
| 12 | +draft: false |
| 13 | + |
| 14 | +# Featured image |
| 15 | +# To use, add an image named `featured.jpg/png` to your page's folder. |
| 16 | +# Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight. |
| 17 | +image: |
| 18 | + caption: "" |
| 19 | + focal_point: "" |
| 20 | + preview_only: false |
| 21 | + |
| 22 | +--- |
| 23 | + |
| 24 | +# Final Report: Applying MLOps to Overcome Reproducibility Barriers in ML |
| 25 | + |
| 26 | +## Background |
| 27 | + |
| 28 | +Hello! I’m Ahmed Alghali, and this is my final report for the project [**Applying MLOps to Overcome Reproducibility Barriers in ML**](https://ucsc-ospo.github.io/project/osre25/nyu/mlops/), carried out under the mentorship of Professor [Fraida Fund](https://ucsc-ospo.github.io/author/fraida-fund/) and [Mohamed Saeed](https://ucsc-ospo.github.io/author/mohamed-saeed/). |
| 29 | + |
| 30 | +This project aims to address the **reproducibility problem** in machine learning—both in core ML research and in applications to other areas of science. |
| 31 | + |
| 32 | +The focus is on making large-scale ML experiments **reproducible on [Chameleon Cloud](https://www.chameleoncloud.org/)**. To do this, we developed [**ReproGen**](https://github.com/A7med7x7/ReproGen), a template generator that produces ready-to-use, reproducible ML training workflows. The goal is to make the cloud easy to use for researchers setting up experiments, so they do not have to worry about the complexity of stitching everything together. |
| 33 | + |
| 34 | +--- |
| 35 | + |
| 36 | +## Progress Since Mid-Report |
| 37 | + |
| 38 | +### Migration from Cookiecutter to Copier |
| 39 | + |
| 40 | +We initially used [Cookiecutter](https://www.cookiecutter.io/) as the templating engine, but it lacked features we needed (e.g., conditional questions). We switched to [Copier](https://copier.readthedocs.io/en/stable/), which provides more flexibility and better matches our use case. |
| 41 | + |
| 42 | +### Support for Multiple Setup Modes |
| 43 | + |
| 44 | +We now offer **two setup modes**, designed to serve both beginners and users who want advanced options/customization: |
| 45 | + |
| 46 | +- **Basic Mode** – minimal prompts (project name, repository link, framework). |
| 47 | + |
| 48 | +- **Advanced Mode** – detailed control (compute site, GPU type, CUDA version, storage site, etc.). |
| 49 | + |
| 50 | + |
| 51 | +This keeps the tool accessible for new users while still enabling fine-grained control for advanced users. |
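
The relationship between the two modes and conditional questions can be sketched in plain Python. This is only an illustration of the idea, not Copier's implementation and not ReproGen's actual prompt set: the question names, defaults, and the `resolve_answers` helper are all hypothetical.

```python
from typing import Any

# Hypothetical question set; names and defaults are illustrative only.
# A question with a "when" predicate is asked only if the predicate holds
# for the answers collected so far.
QUESTIONS: dict[str, dict[str, Any]] = {
    "project_name": {"default": "my-experiment"},
    "setup_mode": {"default": "basic"},
    "framework": {"default": "pytorch"},
    "gpu_type": {
        "default": "nvidia",
        "when": lambda a: a["setup_mode"] == "advanced",
    },
    "cuda_version": {
        "default": "12.4",
        "when": lambda a: a["setup_mode"] == "advanced" and a["gpu_type"] == "nvidia",
    },
}


def resolve_answers(questions: dict[str, dict[str, Any]],
                    provided: dict[str, Any]) -> dict[str, Any]:
    """Walk the questions in order, skipping those whose condition is false."""
    answers: dict[str, Any] = {}
    for name, spec in questions.items():
        when = spec.get("when")
        if when is not None and not when(answers):
            continue  # conditional question does not apply; skip it
        answers[name] = provided.get(name, spec["default"])
    return answers


if __name__ == "__main__":
    print(resolve_answers(QUESTIONS, {"project_name": "demo"}))
    print(resolve_answers(QUESTIONS, {"setup_mode": "advanced", "gpu_type": "amd"}))
```

In this sketch, basic mode never reaches the GPU and CUDA questions, while advanced mode asks about CUDA only when an NVIDIA GPU is selected.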
| 52 | + |
| 53 | +### Automated Credential Generation |
| 54 | + |
| 55 | +Previously, users had to generate application credentials manually (via the OpenStack Horizon UI). Now we provide scripts that generate two types of credentials programmatically, **Swift** and **EC2**, using **Chameleon JupyterHub credentials** with `python-chi` and the `openstack-sdk` client. |
| 56 | + |
| 57 | +### Automatic README.md Generation |
| 58 | + |
| 59 | +Each generated project includes a **customized README.md** containing setup guidance and commands tailored to the user’s configuration. |
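
As a rough illustration of rendering a README from a user's configuration, the sketch below uses Python's standard `string.Template` as a stand-in for the real templating engine. The template text, the configuration keys, and the `render_readme` helper are assumptions made for this example and are not taken from ReproGen.

```python
from string import Template

# Hypothetical README template; the placeholders mirror the kind of answers
# a user gives during setup (project name, framework, repository link).
README_TEMPLATE = Template("""\
# $project_name

Framework: $framework
Setup mode: $setup_mode

## Getting started

Clone the repository:

    git clone $repo_url
""")


def render_readme(config: dict[str, str]) -> str:
    """Fill the template; a missing key raises KeyError instead of leaving a gap."""
    return README_TEMPLATE.substitute(config)


if __name__ == "__main__":
    print(render_readme({
        "project_name": "demo",
        "framework": "pytorch",
        "setup_mode": "basic",
        "repo_url": "https://example.org/demo.git",  # placeholder URL
    }))
```

Using `substitute` rather than `safe_substitute` means an incomplete configuration fails loudly, so a README with unfilled placeholders is never written.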
| 60 | + |
| 61 | +### Bug Fixes and UX Enhancements |
| 62 | + |
| 63 | +Alongside major features, we implemented numerous smaller changes and fixes to improve the reliability and user experience of the tool. |
| 64 | + |
| 65 | +--- |
| 66 | + |
| 67 | +## Deliverables |
| 68 | + |
| 69 | +- [**ReproGen GitHub Repository**](https://github.com/A7med7x7/ReproGen): source code for the template generator. |
| 70 | + |
| 71 | +- [**mlflow-replay branch**](https://github.com/A7med7x7/ReproGen/tree/mlflow-replay): explore a past experiment, artifacts, and logged insights. |
| 72 | + |
| 73 | +- [**LLM-Demo branch**](https://github.com/A7med7x7/ReproGen/tree/training-demo): hands-on demo to track fine-tuning of an LLM using infrastructure generated by ReproGen. |
| 74 | + |
| 75 | + |
| 76 | +--- |
| 77 | + |
| 78 | +## Next Steps |
| 79 | + |
| 80 | +1. **Compatibility Matrix** |
| 81 | + |
| 82 | +   - The tool and the generated setup both depend on software whose compatibility needs attention at every level: hardware, OS, drivers, computing platforms, and core and third-party libraries. Documenting these constraints is a first step that will help with future debugging and with adding new pieces without breaking what is already there. |
| 83 | + |
| 84 | +2. **Maintain Docker Images** |
| 85 | + |
| 86 | +   So far we have CPU and GPU Docker images for several of the most frequently used frameworks: |
| 87 | +   - **CPU-based image**: for data science workloads (scikit-learn) |
| 88 | +   - **GPU-NVIDIA variant**: for deep learning workloads on NVIDIA machines (PyTorch, Lightning, TensorFlow) |
| 89 | +   - **GPU-AMD variant**: for deep learning workloads on AMD machines (PyTorch, Lightning, TensorFlow) |
| 90 | +   Adding variants for more frameworks and improving the existing images is recommended. |
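
One possible starting point for a compatibility matrix is a small lookup from image variant to supported frameworks, using only the pairings listed above. The variant keys and the `variants_for` helper are hypothetical names chosen for this sketch; a fuller matrix would also need to record driver, CUDA/ROCm, and library versions, which are not specified here.

```python
# Image variants and the frameworks each one targets, as listed above.
# The keys ("cpu", "gpu-nvidia", "gpu-amd") are illustrative identifiers.
IMAGE_VARIANTS: dict[str, set[str]] = {
    "cpu": {"scikit-learn"},
    "gpu-nvidia": {"pytorch", "lightning", "tensorflow"},
    "gpu-amd": {"pytorch", "lightning", "tensorflow"},
}


def variants_for(framework: str) -> list[str]:
    """Return the image variants that list the given framework, sorted by name."""
    wanted = framework.lower()
    return sorted(
        variant
        for variant, frameworks in IMAGE_VARIANTS.items()
        if wanted in frameworks
    )


if __name__ == "__main__":
    for name in ("PyTorch", "scikit-learn", "jax"):
        print(name, "->", variants_for(name))
```

An empty result, as for a framework that no image covers yet, is a simple signal that a new variant would be needed.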
| 91 | + |
| 92 | + |
| 93 | + |
| 94 | +--- |
| 95 | + |
| 96 | +## Reflection |
| 97 | + |
| 98 | +When I first joined SoR 2025, I struggled to crystallize how I could practically achieve reproducibility and package a tool that maximizes the chance of reproducing experiments built with it. Throughout the journey, my mentors took me under their wings and helped me understand the **reproducibility challenges in ML**. My mentor, Professor [Fraida Fund](https://ucsc-ospo.github.io/author/fraida-fund/), wrote materials that saved me a lot of time in familiarizing myself with the [testbed](https://www.chameleoncloud.org/) and with important Linux tools and commands, and that gave me hands-on practice with how [large model training](https://teaching-on-testbeds.github.io/mltrain-chi/) with an MLflow tracking server is done in the cloud. [Mohamed Saeed](https://ucsc-ospo.github.io/author/mohamed-saeed/) took the time to review my presentation and pushed me to do my best. I'm forever thankful for the way they shaped the project and my personal growth. This hands-on experience helped me view **MLOps, cloud APIs, and workflow design** through a different lens, and I’m proud to have contributed a tool that can help make reproducible research simpler for others. |