Commit 0a1a8bb

Merge pull request #908 from devadigapratham/main
GSoC 2025: Mid-Term Blog Post Submission
2 parents 5c21e05 + 66be297 commit 0a1a8bb

File tree

5 files changed (+80, −2 lines)

content/authors/devadigapratham/_index.md

Lines changed: 11 additions & 2 deletions
@@ -18,7 +18,7 @@ organizations:
    url: "https://www.pes.edu"

# Short bio (displayed in user profile at end of posts)
-bio: Prathamesh Devadiga is a B.Tech Computer Science student at PES University. He works on vector search, deep learning systems, and retrieval-augmented generation.
+bio: My research interests lie in Trustworthy AI, AI Safety and Alignment, Machine Learning Systems (MLSys), and all things Deep Learning.

# Social/Academic Networking
social:
@@ -43,4 +43,13 @@ user_groups:
  - 2025 Contributors
---

-Prathamesh Devadiga is a B.Tech Computer Science student at PES University, Bangalore. He is passionate about deep learning, vector search, and large-scale RAG systems. During OSRE 2025, he is building a billion-scale open-source vector embedding dataset for benchmarking ANN algorithms and empowering real-world retrieval systems.
+Welcome to my OSRE 2025 profile!
+This reflects my current interests and ongoing work.
+**For the latest updates, visit my [homepage](https://prathameshdevadiga.vercel.app/).**
+
+I am currently a final year B.Tech Computer Science student at [PES University](https://www.pes.edu), Bangalore.
+My research interests include Trustworthy AI, AI Safety & Alignment, deep learning, vector search, and large-scale Retrieval-Augmented Generation (RAG) systems.
+As part of OSRE 2025, I’m building a billion-scale open-source vector embedding dataset to benchmark ANN algorithms and support real-world retrieval systems.
+
+In my free time, I enjoy mentoring, experimenting with open-source tools, and exploring system internals.
+You’ll also find me vibing to music, diving into sci-fi shows, or geeking out over futuristic tech ideas.
File renamed without changes.
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
---
title: "Midway Through GSoC"
subtitle: "Building a Billion-Scale Code Embeddings Dataset"
summary: A midterm update on my GSoC 2025 project under OSRE. This post covers the motivation, goals, and current progress on creating a real-world code embeddings dataset for ANN and RAG applications.
authors:
  - devadigapratham
tags: ["osre25", "gsoc", "vector-embeddings", "code", "benchmarking", "rag"]
date: 2025-07-14
lastmod: 2025-07-14
featured: true
draft: false
---

# Midway Through GSoC

Hello everyone! I’m Pratham Devadiga, and I’m thrilled to share a midterm progress update on my [GSoC 2025 project](https://summerofcode.withgoogle.com/programs/2025/projects/GcstSGAO) with the Open Source Research Experience (OSRE). My project is focused on building the **first open-source billion-scale vector embeddings dataset** from **real-world open source code**, to support benchmarking of Approximate Nearest Neighbor (ANN) algorithms and facilitate research in Retrieval-Augmented Generation (RAG).

## Project Overview

The goal of this project is to address a critical gap in the ecosystem: existing ANN benchmarks are either synthetic or limited in scale. With the explosion of code-focused LLMs and embedding models, there is a pressing need for:

- **High-volume, high-dimensional vector datasets** built from real-world data (open-source codebases).
- **Open, reproducible benchmarks** that reflect realistic RAG workloads.
- A dataset that can be used to evaluate **ANN libraries** such as FAISS, hnswlib, and Annoy on massive, practical retrieval tasks.

Our approach is to extract meaningful code chunks from high-quality open-source repositories, encode them into vector embeddings using open models, and publish the resulting datasets, with metadata, for downstream benchmarking and analysis.

## Progress So Far

We’ve made substantial foundational progress in the first half of the coding period. Key highlights:

- **Tested multiple embedding models**, including `CodeBERT`, `MiniLM-L6-v2`, and `all-mpnet-base-v2`, evaluating trade-offs in speed, dimensionality, and GPU memory.
- **Selected `codebert-base`** (768-dimensional) as the model for phase one, due to its stable performance and manageable resource footprint.
- Implemented and validated a complete **script pipeline** to:
  - Traverse large open-source repositories.
  - Extract and chunk code intelligently (functions, classes, modules).
  - Encode code into embeddings and attach metadata (repo, file path, license).
  - Store results efficiently in Parquet and NumPy formats.
- **Tested all components** of the pipeline on sample datasets using multi-GPU setups, ensuring compatibility and robustness.
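To make the chunking step concrete, here is a minimal sketch (not the project's actual pipeline code) using Python's standard `ast` module to extract function- and class-level chunks from a source file, attaching the kind of metadata described above; the `chunk_source` name and the metadata fields are illustrative:

```python
import ast

def chunk_source(source: str, path: str, repo: str) -> list[dict]:
    """Split Python source into function/class chunks with metadata."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "repo": repo,                # metadata: source repository
                "path": path,                # metadata: file path
                "name": node.name,
                "kind": type(node).__name__,
                # Exact source text of this definition (Python 3.8+).
                "code": ast.get_source_segment(source, node),
            })
    return chunks

if __name__ == "__main__":
    src = "def add(a, b):\n    return a + b\n"
    for c in chunk_source(src, "example.py", "demo/repo"):
        print(c["kind"], c["name"])  # e.g. "FunctionDef add"
```

Each chunk dictionary maps naturally onto one row of a Parquet table, with the embedding vectors stored alongside as NumPy arrays.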

## Challenges and Learnings

Building a billion-scale dataset from real-world codebases is no small task. Here is what we’ve encountered and learned along the way:

### 1. Multi-GPU Pipeline Design
Naively parallelizing the embedding process caused memory overflows and deadlocks, because the model was reloaded in every process. We refactored the code to use `torch.multiprocessing` with pinned GPU contexts, which avoided these issues and improved throughput on multi-GPU machines.
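The worker-per-device pattern described above can be sketched with the standard `multiprocessing` module (stripped of any real model or GPU code, so it runs anywhere): each worker is created once, pinned to one device id, and pulls batches from a shared queue instead of reloading the model per batch. All names here are illustrative, not the project's actual code:

```python
import multiprocessing as mp

ctx = mp.get_context("fork")  # fork keeps this sketch simple; spawn needs a __main__ guard

def worker(device_id: int, jobs, results) -> None:
    # In the real pipeline, the embedding model is loaded exactly ONCE here,
    # pinned to cuda:{device_id}; reloading per batch caused the deadlocks above.
    while True:
        batch = jobs.get()
        if batch is None:          # sentinel: no more work for this worker
            break
        results.put([f"emb({x})" for x in batch])  # placeholder for real embeddings

def run(batches, n_workers: int = 2):
    jobs, results = ctx.Queue(), ctx.Queue()
    procs = [ctx.Process(target=worker, args=(d, jobs, results))
             for d in range(n_workers)]
    for p in procs:
        p.start()
    for b in batches:
        jobs.put(b)
    for _ in procs:
        jobs.put(None)             # one stop sentinel per worker
    out = [results.get() for _ in batches]
    for p in procs:
        p.join()
    return out
```

The key design point is that model state lives inside the worker process for its whole lifetime, so GPU memory is allocated once per device rather than once per batch.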

### 2. Embedding Trade-offs
We experimented with larger models but found their generation time and memory use too high to be practical in the early phases. This helped us narrow down to scalable configurations for initial dataset generation.

### 3. Preparing for Scale
Although the embeddings have not been generated yet, all scripts are now **modular, parallelized, and reproducible**, ensuring a smooth transition to billion-scale data generation in the second half.

## What’s Next

The second half of the project will focus on:

- **Scaling up embedding generation** to more than 1B code chunks across hundreds of open-source repositories.
- **Running benchmarks** with FAISS, hnswlib, and Annoy on these embeddings.
- **Releasing the dataset** on Hugging Face and AWS S3 with sharded access and metadata.
- **Writing a detailed benchmarking report** comparing speed, accuracy, and memory trade-offs across ANN algorithms.
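The benchmarking step above boils down to a recall@k comparison: exact brute-force neighbors serve as ground truth, and an ANN index is scored by how many true neighbors it recovers. Here is a NumPy-only sketch of that scoring loop (no ANN library assumed); the deliberately crude "approximate" search, which probes only a random half of the database, stands in for a real index:

```python
import numpy as np

def exact_topk(db: np.ndarray, queries: np.ndarray, k: int) -> np.ndarray:
    # Ground truth: full pairwise squared-L2 distances, k closest ids per query.
    d = ((queries[:, None, :] - db[None, :, :]) ** 2).sum(-1)
    return np.argsort(d, axis=1)[:, :k]

def recall_at_k(true_ids: np.ndarray, approx_ids: np.ndarray) -> float:
    # Fraction of true neighbors that the approximate search recovered.
    hits = sum(len(set(t) & set(a)) for t, a in zip(true_ids, approx_ids))
    return hits / true_ids.size

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 64)).astype(np.float32)
q = rng.standard_normal((10, 64)).astype(np.float32)

truth = exact_topk(db, q, k=10)
# Crude stand-in for an ANN index: search only a random half of the database,
# then map subset-relative ids back to database ids.
subset = rng.choice(1000, size=500, replace=False)
approx = subset[exact_topk(db[subset], q, k=10)]
print("recall@10:", recall_at_k(truth, approx))
```

With a real index, only the `approx` line changes: the ids come from FAISS, hnswlib, or Annoy instead of the subset search, while the ground truth and scoring stay identical.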

## Final Thoughts

This journey so far has taught me a lot about building large-scale ML pipelines, managing real-world compute constraints, and ensuring reproducibility for research-grade datasets. I'm grateful to my mentor **Jayjeet Chakraborty** and the OSRE team for their continuous support and guidance.

Excited for the next half, where the real scale begins!

Stay tuned for updates. You can find more about the project on my [OSRE project page](/project/osre25/ucsc/embeddings).
