Commit 51a9f80: Merge branch 'main' into main
2 parents: db6a7df + 9c7d25d
File tree: 16 files changed (+461, -5 lines)
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
---
title: "Reproducibility of Interactive Notebooks in Distributed Environments"
subtitle: ""
summary:
authors:
- rahmad
- tanu-malik
tags: ["osre25"]
categories: [distributed systems, notebooks]
date: 2025-07-25
lastmod: 2025-07-25
featured: false
draft: false

# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
# Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
image:
  caption: ""
  focal_point: ""
preview_only: false
---
I am sharing an overview of my project [Reproducibility of Interactive Notebooks in Distributed Environments](/report/osre25/ucsc/06122025-rahmad) and an update at the midway mark.
# Project Overview

This project aims to improve the reproducibility of interactive notebooks executed in distributed environments. Notebooks, such as those in the [Jupyter](https://jupyter.org/) environment, have become increasingly popular and are widely used in the scientific community due to their ease of use and portability. Reproducing these notebooks is a challenging task, especially in a distributed cluster environment.

In the distributed environments we consider, the notebook code is divided into manager and worker code. The manager code is the main entry point of the program; it divides the task at hand into one or more worker codes that run in a parallel, distributed fashion. We utilize several open source tools to package and containerize the application code so that it can be reproduced across different machines and environments: [Sciunit](https://github.com/radiant-systems-lab/sciunit), [FLINC](https://github.com/radiant-systems-lab/Flinc), and [TaskVine](https://cctools.readthedocs.io/en/stable/taskvine/). These are the high-level goals of this project:
1. Generate execution logs for a notebook program.
2. Generate code and data dependencies for notebook programs in an automated manner.
3. Utilize the generated dependencies at various granularities to automate the deployment and execution of notebooks in a parallel and distributed environment.
4. Audit and package the notebook code running in a distributed environment.
# Progress So Far

Here are the details of the progress made so far.

## Generation of Execution Logs

We generate execution logs for notebook programs in a distributed environment using the Linux utility [strace](https://man7.org/linux/man-pages/man1/strace.1.html), which records every system call made by the notebook, including accesses to all files used during execution. We collect separate logs for the manager and the worker code, since they are executed on different machines and have different dependencies. By recording the entire notebook execution, we capture all libraries, packages, and data files referenced during the run in the form of execution logs. These logs are then used for further analyses.
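As a rough illustration (file names and the entry-point script are assumptions for this sketch, not the project's actual code), the strace invocation described above can be built like this:

```python
import shlex

def strace_command(program_argv, log_path):
    """Build an strace invocation that logs file-related system calls.

    -f follows forked child processes, -e trace=%file restricts the log
    to file-related system calls (open, openat, stat, ...), and -o
    writes the log to a file instead of stderr.
    """
    return ["strace", "-f", "-e", "trace=%file", "-o", log_path] + list(program_argv)

# Hypothetical entry point: the manager code exported from the notebook.
cmd = strace_command(["python", "manager.py"], "manager.strace.log")
print(shlex.join(cmd))
```

The same wrapper would be applied on each worker machine with its own log path, giving one log per node.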
## Extracting Software Dependencies

When a library, such as the Python package *NumPy*, is used by the notebook program, an entry is made in the execution log containing the complete path of the accessed library file(s) along with additional information. We analyze the execution logs for both the manager and the workers to find and list all dependencies. So far, we are limited to Python packages, though this methodology is general and can be used to find dependencies for any programming language. For Python packages, version numbers are also obtained by querying package managers like *pip* or *Conda* on the local system.
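A minimal sketch of this log analysis (the log lines and regex here are simplified illustrations, not the project's actual parser) pulls package names out of `openat` entries by looking for `site-packages`/`dist-packages` paths:

```python
import re

# Illustrative strace output; real logs contain one entry per call.
SAMPLE_LOG = '''\
openat(AT_FDCWD, "/usr/lib/python3/dist-packages/numpy/__init__.py", O_RDONLY) = 3
openat(AT_FDCWD, "/home/user/data/input.csv", O_RDONLY) = 4
openat(AT_FDCWD, "/usr/lib/python3/dist-packages/pandas/core/frame.py", O_RDONLY) = 5
'''

# Capture the top-level package directory under a site-/dist-packages path.
PKG_RE = re.compile(r'"(?:[^"]*/(?:site|dist)-packages)/([A-Za-z0-9_]+)/')

def packages_from_log(log_text):
    """Return the set of Python packages whose files were opened."""
    return {m.group(1) for line in log_text.splitlines()
            for m in PKG_RE.finditer(line)}

print(sorted(packages_from_log(SAMPLE_LOG)))  # -> ['numpy', 'pandas']
```

Versions for each discovered package could then be looked up locally, for example with `importlib.metadata.version(pkg)` or by querying `pip`/`conda` as described above.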
## Extracting Data Dependencies

We utilize similar execution logs to identify which data files were used by the notebook program. The list of logged files also contains various configuration or settings files used by certain packages and libraries. These files are removed from the list of data dependencies through post-processing that analyzes file paths.
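The path-based post-processing might be sketched as a simple prefix filter (the prefix list and example paths are assumptions for illustration; the project's actual rules may differ):

```python
# Paths under these locations are treated as software or configuration
# files rather than experiment data.
NON_DATA_PREFIXES = ("/usr/", "/lib/", "/etc/", "/proc/", "/sys/", "/dev/")

def data_files(opened_paths):
    """Keep only paths that look like data dependencies, dropping
    library, configuration, and pseudo-filesystem entries."""
    return [p for p in opened_paths
            if not p.startswith(NON_DATA_PREFIXES)
            and "-packages/" not in p]

paths = [
    "/usr/lib/python3/dist-packages/numpy/__init__.py",
    "/etc/ld.so.cache",
    "/home/user/data/input.csv",
]
print(data_files(paths))  # -> ['/home/user/data/input.csv']
```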
## Testing the Pipeline

We have conducted our experiments on three use cases from different domains, using between 5 and 10 workers: distributed image convolution, climate trend analysis, and high-energy physics experiment analysis. The results so far are promising, with good accuracy and only a slight running-time overhead.
# Next Steps

The next steps in this project are as follows:

1. Generate the execution logs and dependencies of a notebook at the level of each code cell.
2. Utilize the dependencies at multiple levels of granularity with the goal of automating the deployment and execution of notebooks in a parallel and distributed environment.
3. Audit notebook program execution in a distributed environment and package it into a container on a single node.

I am very happy about the experience I have had so far in this project, and I am excited about the milestones to come.
Stay tuned!

content/report/osre25/sf/LLMSeqRec/20250614-Connor/index.md

Lines changed: 3 additions & 3 deletions
@@ -3,11 +3,11 @@ title: "LLMSeqRec: LLM Enhanced Contextual Sequential Recommender"
  authors: [Connor,LinseyPang,bindong]
  author_notes: ["Salesforce","Research Scientist, Lawrence Berkeley Lab"]
  tags: ["osre25", "uc", "AI", "LLM", "Recommender"]
- date: 2025-02-06T10:15:56-07:00
- lastmod: 2025-02-06T10:15:56-07:00
+ date: 2025-06-06T10:15:56-07:00
+ lastmod: 2025-06-06T10:15:56-07:00
  ---

- ### Project Description
+ ### Project Description
Sequential Recommender Systems are widely used in scientific and business applications to analyze and predict patterns over time. In biology and ecology, they help track species behavior by suggesting related research on migration patterns and environmental changes. Medical applications include personalized treatment recommendations based on patient history and predicting disease progression. In physics and engineering, these systems optimize experimental setups by suggesting relevant past experiments or simulations. Environmental and climate science applications include forecasting climate trends and recommending datasets for monitoring deforestation or pollution. In business and e-commerce, sequential recommenders enhance user experiences by predicting consumer behavior, suggesting personalized products, and optimizing marketing strategies based on browsing and purchase history. By leveraging sequential dependencies, these recommender systems enhance research efficiency, knowledge discovery, and business decision-making across various domains. Traditional sequential recommendation systems rely on historical user interactions to predict future preferences, but they often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors. Existing models primarily use predefined embeddings and handcrafted features, limiting their ability to generalize across diverse recommendation scenarios. To address these challenges, we propose LLM Enhanced Contextual Sequential Recommender (LLMSeqRec), which leverages Large Language Models (LLMs) to enrich sequential recommendations with deep contextual understanding and adaptive reasoning.
By integrating LLM-generated embeddings and contextual representations, LLMSeqRec enhances user intent modeling, cold-start recommendations, and long-range dependencies in sequential data. Unlike traditional models that rely solely on structured interaction logs, LLMSeqRec dynamically interprets and augments sequences with semantic context, leading to more accurate and personalized recommendations. This fusion of LLM intelligence with sequential modeling enables a more scalable, adaptable, and explainable recommender system, bridging the gap between traditional sequence-based approaches and advanced AI-driven recommendations.

content/report/osre25/sf/LLMSeqRec/20250722-Connor/index.md

Lines changed: 2 additions & 2 deletions
@@ -3,8 +3,8 @@ title: "LLMSeqRec: LLM Enhanced Contextual Sequential Recommender"
  authors: [Connor,LinseyPang,bindong]
  author_notes: ["Salesforce","Research Scientist, Lawrence Berkeley Lab"]
  tags: ["osre25", "uc", "AI", "LLM", "Recommender"]
- date: 2025-01-22T10:15:56-07:00
- lastmod: 2025-01-22T10:15:56-07:00
+ date: 2025-07-22T10:15:56-07:00
+ lastmod: 2025-07-22T10:15:56-07:00
  ---

# Midway Through OSRE
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
---
title: "Mid-Term Update: MPI Appliance for HPC Research on Chameleon"
subtitle: ""
summary:
authors:
- rohanbabbar04
tags: ["osre25", "reproducibility", "MPI", "cloud computing"]
categories: ["osre25", "reproducibility", "HPC", "MPI"]
date: 2025-08-03
lastmod: 2025-08-03
featured: false
draft: false

# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
# Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
image:
  caption: "Chameleon Cloud"
  focal_point: Top
preview_only: false
---
Hi everyone! This is my mid-term blog update for the project [MPI Appliance for HPC Research on Chameleon](https://ucsc-ospo.github.io/project/osre25/uchicago/mpi/), developed in collaboration with Argonne National Laboratory and the Chameleon Cloud community.
This blog follows up on my earlier post, which you can find [here](https://ucsc-ospo.github.io/report/osre25/uchicago/mpi/20250614-rohan-babbar/).
### 🔧 June 15 – June 29, 2025

I worked on creating and configuring images on Chameleon Cloud for three sites: CHI@UC, CHI@TACC, and KVM@TACC.

Key features of the images:
- **Spack**: Pre-installed and configured for easy package management of HPC software.
- **Lua Modules (LMod)**: Installed and configured for environment module management.
- **MPI Support**: Both MPICH and Open MPI are pre-installed, enabling users to run distributed applications out of the box.

These images are now publicly available in the Chameleon Appliance Catalog, titled [MPI and Spack for HPC (Ubuntu 22.04)](https://chameleoncloud.org/appliances/127/).

I also put together some example Jupyter notebooks showing how to get started with these images.
### 🔧 June 30 – July 13, 2025

With the MPI Appliance now published on Chameleon Cloud, the next step was to automate the setup of an MPI-Spack cluster.

To achieve this, I developed a set of Ansible playbooks that:

1) Configure both master and worker nodes with site-specific settings
2) Set up seamless access to Chameleon NFS shares
3) Allow users to easily install Spack packages, compilers, and dependencies across all nodes

These playbooks aim to simplify the deployment of reproducible HPC environments and reduce the time required to get a working cluster up and running.
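To give a flavor of tasks 2 and 3, a hypothetical playbook excerpt might look like the following; the host group, NFS export, and mount path are assumptions for illustration, and the playbooks published on Trovi are the authoritative version:

```yaml
# Hypothetical sketch, not the published playbooks.
- name: Prepare MPI/Spack worker nodes
  hosts: workers
  become: true
  tasks:
    - name: Mount the shared Chameleon NFS export
      ansible.posix.mount:
        src: "{{ nfs_server }}:/exports/shared"
        path: /mnt/shared
        fstype: nfs
        state: mounted

    - name: Make the shared Spack instance visible on every node
      ansible.builtin.lineinfile:
        path: /etc/profile.d/spack.sh
        line: ". /mnt/shared/spack/share/spack/setup-env.sh"
        create: true
```

Keeping Spack on a shared NFS mount means a package installed once is immediately available on all nodes.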
### 🔧 July 14 – July 28, 2025

I began this period by fixing some issues in python-chi, the official Python client for the Chameleon testbed.
We also discussed adding support for CUDA-based packages, which would make it easier to work with NVIDIA GPUs.
We successfully published a new image on Chameleon, titled [MPI and Spack for HPC (Ubuntu 22.04 - CUDA)](https://chameleoncloud.org/appliances/130/), and added an example to demonstrate its usage.

We compiled the artifact containing the Jupyter notebooks and Ansible playbooks and published it on Chameleon Trovi.
Feel free to check it out [here](https://chameleoncloud.org/experiment/share/7424a8dc-0688-4383-9d67-1e40ff37de17). The documentation still needs some work.

📌 That’s it for now! I’m currently working on the documentation, a ROCm-based image for AMD GPUs, and some container-based examples.
Stay tuned for more updates in the next blog.
Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
---
title: "Improving Usability and Performance in cc-snapshot: My Midterm Update"
summary:
authors: ["zahratm"]
tags: ["osre25", "reproducibility", "cc-snapshot"]
categories: ["SummerofReproducibility25"]
date: 2024-07-24
lastmod: 2024-07-24
featured: true
draft: false

# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
# Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
image:
  caption: "CC-Snapshot performance and usability improvement"
  focal_point: Center
preview_only: false
---
Hi! I'm Zahra Temori, a rising junior studying Computer Science at the University of Delaware. This summer, I’ve had the exciting opportunity to participate in the Chameleon Summer Reproducibility Program, where I’ve been working under the mentorship of Paul Marshall.
In this blog post, I’d love to share a midterm update on my project [cc-snapshot](https://github.com/ChameleonCloud/cc-snapshot) and highlight what I’ve accomplished so far, what I’ve learned, and what’s coming next. It's been a challenging but rewarding experience diving into real-world research and contributing to tools that help make science more reproducible!
## Project Overview

CC-Snapshot is a powerful tool on the Chameleon testbed that enables users to package their customized environments for reproducibility and experiment replication. In research, reproducibility is essential. It allows scientists to run experiments consistently, share complete setups with others, and avoid environment-related errors. However, the current snapshotting mechanism has limitations that make it unreliable and inefficient, particularly in terms of usability and performance. These issues can slow down workflows and create barriers for users trying to reproduce results.

Our goal is to improve both the usability and performance of the cc-snapshot tool. A more user-friendly and optimized system means that users can create and restore snapshots more quickly and easily, without needing to manually rebuild environments, ultimately saving time and improving reliability in scientific computing.
## Progress So Far

To structure the work, we divided the project into two main phases:
1. Improving usability, and
2. Optimizing performance.

I’ve nearly completed the first phase and have just started working on the second.
## Phase One – Usability Improvements

The original version of the cc-snapshot tool had several usability challenges that made it difficult for users to interact with and for developers to maintain. These issues included a rigid interface, a lack of flexibility, and limited testing support, all of which made the tool harder to use and extend.
To address these, I worked on the following improvements:

**Problem**: The command-line interface was limited and inflexible. Users couldn’t easily control features or customize behavior, which limited their ability to create snapshots in different scenarios.

**Solution**: I enhanced the CLI by adding:
- A flag to disable automatic updates, giving users more control.
- A `--dry-run` flag to simulate actions before actually running them, which is useful for testing and safety.
- Support for a custom source path, allowing snapshots of specific directories. This makes the tool much more useful for testing smaller environments.
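The flags above could be sketched with a small parser; note that cc-snapshot itself is a shell script, so this Python `argparse` version is purely illustrative, and the `--no-update` and `--source-path` option names are assumptions (only `--dry-run` is named in the text):

```python
import argparse

# Illustrative parser mirroring the flags described above.
parser = argparse.ArgumentParser(prog="cc-snapshot")
parser.add_argument("--no-update", action="store_true",
                    help="skip the automatic self-update step (hypothetical name)")
parser.add_argument("--dry-run", action="store_true",
                    help="print the actions that would run, then exit")
parser.add_argument("--source-path", default="/",
                    help="directory to snapshot instead of the full disk (hypothetical name)")

args = parser.parse_args(["--dry-run", "--source-path", "/home/cc/experiment"])
print(args.dry_run, args.source_path)
```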
**Problem**: The code lacked automated tests. Without tests, developers have to manually verify everything, which is time-consuming and error-prone.

**Solution**: I implemented a basic test suite and integrated it with GitHub Actions, so the tool is automatically tested on every pull request.

**Problem**: The tool didn’t follow a modular design. The logic was tightly coupled, making it hard to isolate or extend parts of the code.

**Solution**: I refactored the code by extracting key functions. This makes the code cleaner, easier to understand, and more maintainable in the long term.
## Next Steps – Phase Two: Performance Optimization

After improving the usability of the cc-snapshot tool, the next phase of the project focuses on addressing key performance bottlenecks. Currently, the snapshotting process can be slow and resource-intensive, which makes it less practical for frequent use, especially with large environments.

**Problem 1: Slow Image Compression**
The current implementation uses the qcow2 image format with zlib compression, which is single-threaded and often inefficient for large disk images. This leads to long snapshot creation times and high CPU usage.

**Solution**: I will benchmark and compare different compression strategies, specifically:
- qcow2 with no compression
- qcow2 with zstd compression, which is faster and multi-threaded
- raw image format, which has no compression but may benefit from simpler processing

These tests will help determine which method provides the best tradeoff between speed, size, and resource usage.
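The three strategies map naturally onto `qemu-img convert` options; a sketch of the benchmark matrix might look like the following (file names are placeholders, and `compression_type=zstd` assumes a reasonably recent qemu-img that supports zstd for qcow2):

```python
import shlex

# Candidate snapshot formats to benchmark: label -> qemu-img options.
STRATEGIES = {
    "qcow2-uncompressed": ["-O", "qcow2"],
    "qcow2-zlib":         ["-O", "qcow2", "-c"],
    "qcow2-zstd":         ["-O", "qcow2", "-c", "-o", "compression_type=zstd"],
    "raw":                ["-O", "raw"],
}

def convert_command(strategy, src="disk.img", dst=None):
    """Build the qemu-img convert command for one strategy."""
    dst = dst or f"snapshot-{strategy}.img"
    return ["qemu-img", "convert", *STRATEGIES[strategy], src, dst]

for name in STRATEGIES:
    print(shlex.join(convert_command(name)))
```

Each command would be timed and its output size recorded, giving one row per strategy in the comparison.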
**Problem 2: Suboptimal Storage Backend**
Snapshots are currently uploaded to Glance, which can be slow and unreliable. Uploading large images can take several minutes, which slows down the user workflow.

**Solution**: I will compare Glance with a faster alternative, the Object Store. Smaller, compressed images may upload significantly faster to the Object Store (e.g. 30 seconds vs. 2 minutes). By measuring upload speeds and reliability, I can recommend a better default or optional backend for users.
## How I Will Measure Performance

To understand the impact of different strategies, I will collect detailed metrics across three stages:
1. Image creation: how long it takes to build the image, depending on compression and format
2. Image upload: how quickly the snapshot can be transferred to Glance or the Object Store
3. Instance boot time: how fast a new instance can start from that image (compressed formats must be decompressed)

I will run multiple tests for each scenario and record performance metrics like CPU usage, memory usage, disk throughput, and total time for each step. This will help identify the most efficient and practical configuration for real-world use.
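A minimal sketch of the per-stage timing harness (stage names are placeholders; each `with` block would wrap the real creation, upload, or boot command) could look like this:

```python
import time
from contextlib import contextmanager

results = {}

@contextmanager
def timed(stage):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[stage] = time.perf_counter() - start

# Placeholder work; real runs would invoke qemu-img, the upload
# client, and the instance-boot step here.
with timed("image_creation"):
    time.sleep(0.01)
with timed("image_upload"):
    time.sleep(0.01)

print({k: round(v, 3) for k, v in results.items()})
```

Repeating this over every strategy and backend combination yields the comparison table described above; CPU, memory, and disk metrics would be sampled alongside with system tools.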
## Conclusion

Addressing the current usability and performance issues in cc-snapshot is essential to improving the overall user experience. By making the tool easier to use, faster, and more flexible, we can support researchers and developers who depend on reproducible computing for their work. So far, I’ve worked on enhancing the tool’s interface, adding testing support, and refactoring the codebase for better maintainability. In the next phase, I’ll be focusing on benchmarking different compression methods, image formats, and storage backends to improve speed and efficiency.
These improvements will help make cc-snapshot a more powerful and user-friendly tool for the scientific community.

Stay tuned for the next update and thank you for following my journey!
