---
title: "Midway Through GSoC"
subtitle: "Building a Billion-Scale Code Embeddings Dataset"
summary: A midterm update on my GSoC 2025 project under OSRE. This post covers the motivation, goals, and current progress on creating a real-world code embeddings dataset for ANN and RAG applications.
authors:
  - devadigapratham
tags: ["osre25", "gsoc", "vector-embeddings", "code", "benchmarking", "rag"]
date: 2025-07-14
lastmod: 2025-07-14
featured: true
draft: false
---

# Midway Through GSoC

Hello everyone! I’m Pratham Devadiga, and I’m thrilled to share a midterm progress update on my [GSoC 2025 project](https://summerofcode.withgoogle.com/programs/2025/projects/GcstSGAO) with the Open Source Research Experience (OSRE). My project focuses on building the **first open-source billion-scale vector embeddings dataset** from **real-world open-source code**, to support benchmarking of Approximate Nearest Neighbor (ANN) algorithms and to facilitate research in Retrieval-Augmented Generation (RAG).

## Project Overview

The goal of this project is to address a critical gap in the ecosystem: existing ANN benchmarks are either synthetic or limited in scale. With the explosion of code-focused LLMs and embedding models, there’s a pressing need for:

- **High-volume, high-dimensional vector datasets** built from real-world data (open-source codebases).
- **Open, reproducible benchmarks** that reflect realistic RAG workloads.
- A dataset that can be used to evaluate **ANN libraries** such as FAISS, HNSW, and Annoy on massive, practical retrieval tasks.

Our approach is to take high-quality open-source code repositories, extract meaningful code chunks, encode them into vector embeddings using open models, and publish the resulting datasets with metadata for downstream benchmarking and analysis.
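
To make the chunking step concrete, here is a minimal sketch of function- and class-level extraction for Python files using the standard `ast` module. It is illustrative only: the `extract_chunks` name is mine, and the actual pipeline handles more languages than Python.

```python
import ast

def extract_chunks(source: str) -> list[dict]:
    """Split one Python source file into function- and class-level chunks.

    Illustrative sketch; the real pipeline covers more languages.
    """
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,  # FunctionDef / AsyncFunctionDef / ClassDef
                "code": ast.get_source_segment(source, node),
            })
    return chunks
```

Each chunk then carries its own source text plus enough context (name, kind) to be traced back later, which is what makes per-chunk metadata attachment possible downstream.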

## Progress So Far

We’ve made substantial foundational progress in the first half of the coding period. Key highlights:

- **Tested multiple embedding models**, including `CodeBERT`, `MiniLM-L6-v2`, and `all-mpnet-base-v2`, evaluating trade-offs in speed, dimensionality, and GPU memory.
- **Selected `codebert-base`** (768-dimensional) as the model for phase one, thanks to its stable performance and manageable resource footprint.
- Implemented and validated a complete **script pipeline** to:
  - Traverse large open-source repositories.
  - Extract and chunk code intelligently (functions, classes, modules).
  - Encode code into embeddings and attach metadata (repo, file path, license).
  - Store results efficiently in Parquet and NumPy formats.
- **Tested all components** of the pipeline on sample datasets using multi-GPU setups, ensuring compatibility and robustness.
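
For the metadata attachment, each embedding travels with its provenance. The sketch below shows roughly what one row looks like before rows are buffered and flushed to Parquet (metadata) and NumPy shards (vectors); the `make_record` helper and its exact field names are hypothetical, not the pipeline’s actual schema.

```python
import hashlib

def make_record(repo: str, path: str, license_id: str,
                chunk: str, vector: list) -> dict:
    """One dataset row: the embedding vector plus its provenance metadata.

    Hypothetical schema for illustration, not the project's real layout.
    """
    # A stable content-derived id lets vector shards be de-duplicated
    # and joined back to their metadata rows.
    digest = hashlib.sha256(f"{repo}:{path}:{chunk}".encode()).hexdigest()
    return {
        "id": digest[:16],
        "repo": repo,           # e.g. "org/name"
        "path": path,           # file path inside the repository
        "license": license_id,  # license identifier of the source repo
        "dim": len(vector),     # 768 for codebert-base
        "vector": vector,
    }
```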

## Challenges and Learnings

Building a billion-scale dataset from real-world codebases is no small task. Here’s what we’ve encountered and learned along the way:

### 1. Multi-GPU Pipeline Design
Naively parallelizing the embedding process caused memory overflows and deadlocks, because the model was reloaded across processes. We refactored the code to use `torch.multiprocessing` with pinned GPU contexts, which avoided these issues and improved throughput on multi-GPU machines.
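
The shape of that fix: split the work up front, then hand each `torch.multiprocessing` worker a fixed GPU (via `torch.cuda.set_device(rank)`) and a single model load for its whole shard. The sharding itself is plain round-robin; `assign_shards` below is an illustrative helper, not the project’s actual code.

```python
def assign_shards(num_files: int, num_gpus: int) -> list[list[int]]:
    """Round-robin file indices across GPU worker processes.

    Each worker then pins its GPU once and loads the model once for its
    whole shard, instead of the per-task reloads that caused OOM and
    deadlocks. Illustrative sketch only.
    """
    shards = [[] for _ in range(num_gpus)]
    for i in range(num_files):
        shards[i % num_gpus].append(i)
    return shards
```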

### 2. Embedding Trade-offs
We experimented with larger models but found their generation time and memory use too high to be practical in the early phases. This helped us narrow down to scalable configurations for the initial dataset generation.

### 3. Preparing for Scale
Although the full-scale embeddings have not been generated yet, all scripts are now **modular, parallelized, and reproducible**, ensuring a smooth transition to billion-scale data generation in the second half.

## What’s Next

The second half of the project will focus on:

- **Scaling up embedding generation** to more than one billion code chunks across hundreds of open-source repositories.
- **Running benchmarks** with FAISS, HNSW, and Annoy on these embeddings.
- **Releasing the dataset** on Hugging Face and AWS S3 with sharded access and metadata.
- **Writing a detailed benchmarking report** comparing speed, accuracy, and memory trade-offs across ANN algorithms.
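
A core metric for those benchmarks is recall@k: the fraction of the true k nearest neighbors that the ANN index actually returns, averaged over queries. A minimal sketch (the helper name is mine; FAISS, HNSW, and Annoy all return ranked candidate-id lists that can be scored this way against exact ground truth):

```python
def recall_at_k(ground_truth: list[list[int]],
                ann_results: list[list[int]], k: int) -> float:
    """Average overlap between the true top-k ids and the ANN's top-k ids.

    ground_truth: exact nearest-neighbor ids per query (e.g. brute force).
    ann_results:  approximate nearest-neighbor ids per query.
    """
    hits = sum(
        len(set(gt[:k]) & set(approx[:k]))
        for gt, approx in zip(ground_truth, ann_results)
    )
    return hits / (k * len(ground_truth))
```

Reporting recall@k alongside queries per second and index memory is what lets the speed/accuracy/memory trade-offs be compared on equal footing.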

## Final Thoughts

This journey so far has taught me a lot about building large-scale ML pipelines, managing real-world compute constraints, and ensuring reproducibility for research-grade datasets. I’m grateful to my mentor **Jayjeet Chakraborty** and the OSRE team for their continuous support and guidance.

Excited for the next half, where the real scale begins!

Stay tuned for updates. You can find more about the project on my [OSRE project page](/project/osre25/ucsc/embeddings).