Commit c6a9c36

Merge pull request #910 from debangi29/main
Added Mid Term Blog
2 parents 0a1a8bb + d82e85e commit c6a9c36

6 files changed, +170 -0 lines changed

---
title: "Mid-term Blog: StatWrap: Cross-Project Searching and Classification using Local Indexing"
subtitle: "Enhancing Search Functionality and Classification of Research Projects"
summary: ""
authors: [debangi29]
tags: ["osre25", "statwrap", "reproducibility", "search", "indexing", "user interface"]
categories: ["osre25", "SoR"]
date: 2025-07-15
lastmod: 2025-07-15
featured: false
draft: false

# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
# Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
image:
  caption: ""
  focal_point: ""
  preview_only: false
---
## Introduction

Hello everyone!
I am Debangi Ghosh from India, an undergraduate student at the Indian Institute of Technology (IIT) BHU, Varanasi. I am working on the [StatWrap: Cross-Project Searching and Classification using Local Indexing](/project/osre25/northwestern/statwrap/) project under the mentorship of {{% mention lrasmus %}}. My [proposal](https://drive.google.com/file/d/1dxyBP2oMJwYDCKyIWzr465zNmm6UWtnI/view?usp=sharing) focuses on developing a full-text search service within the StatWrap user interface. This involves evaluating different search libraries and implementing a classification system to distinguish between active and past projects.

## About the Project

I am working on making StatWrap more usable by adding efficient cross-project search. The goal is to make it easier for investigators to discover relevant projects, notes, and assets across both current and archived work, using information that is either user-entered or passively collected by StatWrap.

Because the data involved is sensitive, a key requirement is that all indexing and search operations run locally. My responsibilities are:

* **Evaluating open-source search libraries** suitable for local indexing and retrieval
* **Building the full-text search functionality** directly into the StatWrap UI to allow seamless querying across projects
* **Ensuring reliability** through the development of unit tests and comprehensive system testing
* **Implementing a classification system** to label projects as “Active,” “Pinned,” or “Past” within the user interface

This project is a great opportunity to work at the intersection of software development, information retrieval, and user-centric design, while contributing to research reproducibility and collaboration within scientific workflows.

## Progress

More than six weeks into the project, significant progress has been made. Here's a breakdown:

### 1. Descriptive Comparison of Open-Source Libraries
I compared various open-source search libraries on the following evaluation criteria: **indexing speed, search speed, memory usage, typo tolerance, fuzzy searching, partial matching, full-text queries, contextual search, Boolean support, exact word match, installation ease, maintenance, documentation**, and **developer experience**.

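To make criteria such as indexing speed, search speed, and memory usage concrete, here is a minimal timing harness. It is a sketch, not the benchmark used in this project: `benchmark`, `buildIndex`, and `search` are hypothetical names, and the `naive` substring scan is only a stand-in for a real library adapter.

```javascript
// Hypothetical harness: `buildIndex` and `search` stand in for whichever
// library is under test. They are not StatWrap or library APIs.
function benchmark(name, docs, queries, buildIndex, search) {
  const heapBefore = process.memoryUsage().heapUsed;

  const t0 = process.hrtime.bigint();
  const index = buildIndex(docs);
  const t1 = process.hrtime.bigint();

  let hits = 0;
  for (const q of queries) {
    hits += search(index, q).length;
  }
  const t2 = process.hrtime.bigint();

  const heapAfter = process.memoryUsage().heapUsed;
  return {
    name,
    indexMs: Number(t1 - t0) / 1e6,   // nanoseconds -> milliseconds
    searchMs: Number(t2 - t1) / 1e6,
    // Rough figure only: garbage collection can make this noisy or negative.
    heapDeltaMB: (heapAfter - heapBefore) / (1024 * 1024),
    hits,
  };
}

// Baseline "library": no index, case-insensitive substring scan.
const naive = {
  buildIndex: (docs) => docs,
  search: (docs, q) =>
    docs.filter((d) => d.text.toLowerCase().includes(q.toLowerCase())),
};

const docs = [
  { id: 1, text: 'Survival analysis notes' },
  { id: 2, text: 'Data cleaning script' },
];
const result = benchmark('naive-scan', docs, ['analysis', 'data'], naive.buildIndex, naive.search);
console.log(result);
```

Running every library through the same harness with the same documents and queries keeps the numbers comparable; a real comparison would also repeat each run several times and use much larger inputs.
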
### 2. The Libraries

- **Lunr.js**
  A small, client-side full-text search engine that mimics Solr capabilities.
  - Field-based search, boosting
  - Supports TF-IDF, inverted index
  - Limited fuzzy search: wildcards and an edit-distance operator (e.g. `foo~1`), but no automatic typo tolerance
  - Can serialize/deserialize index
  - Not designed for large datasets
  - Moderate memory usage and indexing speed
  - Good documentation
  - **Best for**: Static websites or SPAs needing simple in-browser search

- **ElasticLunr.js**
  A lightweight, more flexible alternative to Lunr.js.
  - Dynamic index (add/remove docs)
  - Field-based and weighted search
  - No advanced fuzzy matching
  - Faster and more customizable than Lunr
  - Smaller footprint
  - Easy to use and maintain
  - **Best for**: Developers wanting Lunr-like features with simpler customization

- **Fuse.js**
  A fuzzy search library ideal for small to medium datasets.
  - Fuzzy search with typo tolerance
  - Deep key/path searching
  - No need to build index
  - Highly configurable (threshold, distance, etc.)
  - Linear scan = slower on large datasets
  - Not full-text search (scoring-based match)
  - Extremely easy to set up and use
  - **Best for**: Fuzzy search in small in-memory arrays (e.g., auto-suggest, dropdown filters)

- **FlexSearch**
  A blazing-fast, modular search engine with advanced indexing options.
  - Extremely fast search and indexing
  - Supports phonetic, typo-tolerant, and partial matching
  - Asynchronous support
  - Multi-language + Unicode-friendly
  - Low memory footprint
  - Configuration can be complex for beginners
  - **Best for**: High-performance search in large/multilingual datasets

- **MiniSearch**
  A small, full-text search engine with balanced performance and simplicity.
  - Fast indexing and searching
  - Fuzzy search, stemming, stop words
  - Field boosting and prefix search
  - Compact, can serialize index
  - Clean and modern API
  - Lightweight and easy to maintain
  - **Best for**: Balanced, in-browser full-text search for moderate datasets

- **Search-Index**
  A persistent, full-featured search engine for Node.js and browsers.
  - Persistent storage with LevelDB
  - Real-time indexing
  - Fielded queries, faceting, filtering
  - Advanced queries (Boolean, range, etc.)
  - Slightly heavier setup
  - Good for offline/local-first apps
  - Browser usage more complex than others
  - **Best for**: Node.js apps; **not directly compatible with the Electron + React environment of StatWrap**

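Several of the libraries above are built around an inverted index and can serialize it, while Fuse.js scans the data linearly instead. The toy class below illustrates the idea in plain JavaScript. It is not the implementation of any of these libraries: `TinyIndex` and `tokenize` are hypothetical names, and real engines add scoring (such as TF-IDF), stemming, and field boosts on top.

```javascript
// Toy inverted index: term -> set of document ids. Illustrative only.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

class TinyIndex {
  constructor() {
    this.postings = new Map(); // term -> Set of document ids
  }

  add(id, text) {
    for (const term of tokenize(text)) {
      if (!this.postings.has(term)) this.postings.set(term, new Set());
      this.postings.get(term).add(id);
    }
  }

  // Returns ids of documents that contain every query term (AND semantics).
  search(query) {
    const terms = tokenize(query);
    if (terms.length === 0) return [];
    let result = null;
    for (const term of terms) {
      const ids = this.postings.get(term) || new Set();
      result = result === null
        ? new Set(ids)
        : new Set([...result].filter((id) => ids.has(id)));
    }
    return [...result].sort();
  }

  // Called automatically by JSON.stringify, so the index can be saved to disk.
  toJSON() {
    return Object.fromEntries(
      [...this.postings].map(([term, ids]) => [term, [...ids]])
    );
  }

  static fromJSON(obj) {
    const idx = new TinyIndex();
    for (const [term, ids] of Object.entries(obj)) {
      idx.postings.set(term, new Set(ids));
    }
    return idx;
  }
}

const idx = new TinyIndex();
idx.add('a', 'Cohort analysis notes');
idx.add('b', 'Analysis of survey data');

// Round-trip through JSON, as one would when persisting the index.
const restored = TinyIndex.fromJSON(JSON.parse(JSON.stringify(idx)));
console.log(restored.search('analysis'));      // [ 'a', 'b' ]
console.log(restored.search('analysis data')); // [ 'b' ]
```

A lookup touches only the postings for the query terms, which is why index-based libraries scale better than a linear scan, at the cost of keeping the postings in memory or on disk.
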
### 3. Developer Experience and Maintenance
We analyzed the download trends of the search libraries using npm trends, and also reviewed their maintenance statistics to assess how frequently they are updated.

![DOWNLOADS](downloads.png)
![Maintenance](Maintenance.png)

### 4. Comparative Analysis After Testing
Each search library was benchmarked against a predefined set of queries using the evaluation criteria listed above.
The weights for each criterion are not yet finalized; that will be done during the end-term evaluation.

![COMPARATIVE ANALYSIS](image.png)

### 5. The User Interface

![User Interface](UI.png)
![Debug Tools](image-1.png)

The user interface offers three search modes (Basic, Advanced, and Boolean operators) with configurable parameters. Results are sorted by relevance score (highest first) and grouped by category.

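The sort-then-group step described above can be sketched as follows. The result shape (`id`, `category`, `score`) and the function name `sortAndGroup` are assumptions made for illustration, not StatWrap's actual schema.

```javascript
// Hypothetical result shape: { id, category, score }.
function sortAndGroup(results) {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...results].sort((a, b) => b.score - a.score);

  const groups = {};
  for (const r of sorted) {
    if (!groups[r.category]) groups[r.category] = [];
    groups[r.category].push(r);
  }
  return groups;
}

const grouped = sortAndGroup([
  { id: 'notes.md', category: 'files', score: 0.4 },
  { id: 'Trial A', category: 'projects', score: 0.9 },
  { id: 'analysis.R', category: 'files', score: 0.7 },
]);

console.log(Object.keys(grouped));               // [ 'projects', 'files' ]
console.log(grouped.files.map((r) => r.id));     // [ 'analysis.R', 'notes.md' ]
```

Sorting before grouping means each category keeps its results in relevance order, and categories appear in the order of their best-scoring hit.
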
### 6. Overall Functioning

- **Indexing Workflow**
  - Projects are processed sequentially
  - Metadata, files, people, and notes are indexed (larger files are queued for later)
  - Uses a "brute-force" recursive approach to walk through project directories
  - Skips directories like `node_modules`, `.git`, `.statwrap`
  - Identifies eligible text files for indexing
  - Logs progress every 10 files

- **Document Creation Logic**
  - Reads file content as UTF-8 text
  - Builds searchable documents with filename, content, and metadata
  - Auto-generates tags based on content and file type
  - Adds documents to the search index and document store
  - Handles errors gracefully with debug logging

- **Search Functionality**
  - Uses field-weighted search
  - Enriches results with document metadata
  - Supports filtering by type or project
  - Groups results by category (files, projects, people, etc.)
  - Implements caching for improved performance
  - Search statistics are generated to monitor performance

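The file-walking and document-creation steps listed above can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the StatWrap code: the extension list, the 1 MB cut-off for "larger files", the document fields, and the names `walk` and `indexProject` are all hypothetical, and tags here come from the file extension only.

```javascript
const fs = require('fs');
const path = require('path');

const SKIP_DIRS = new Set(['node_modules', '.git', '.statwrap']);
// Assumed values: the post does not state the extensions or the size cut-off.
const TEXT_EXTENSIONS = new Set(['.txt', '.md', '.r', '.py', '.json', '.csv']);
const MAX_INLINE_BYTES = 1024 * 1024;

// Recursive "brute-force" walk that skips the excluded directories.
function walk(dir, onFile) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      if (!SKIP_DIRS.has(entry.name)) walk(full, onFile);
    } else if (entry.isFile()) {
      onFile(full);
    }
  }
}

function indexProject(projectId, rootDir) {
  const documents = []; // searchable documents built from small text files
  const deferred = [];  // larger files queued for later
  let seen = 0;

  walk(rootDir, (file) => {
    const ext = path.extname(file).toLowerCase();
    if (!TEXT_EXTENSIONS.has(ext)) return; // not an eligible text file

    seen += 1;
    if (seen % 10 === 0) console.log(`[${projectId}] processed ${seen} files`);

    try {
      if (fs.statSync(file).size > MAX_INLINE_BYTES) {
        deferred.push(file);
        return;
      }
      documents.push({
        id: `${projectId}:${path.relative(rootDir, file)}`,
        type: 'file',
        projectId,
        filename: path.basename(file),
        content: fs.readFileSync(file, 'utf8'),
        tags: [ext.slice(1)], // file-type tag only in this sketch
      });
    } catch (err) {
      // One unreadable file should not abort the whole run.
      console.debug(`skipping ${file}: ${err.message}`);
    }
  });

  return { documents, deferred };
}
```

Calling `indexProject('demo', '/path/to/project')` returns the documents ready to be added to a search index, plus the list of files that were too large to read inline.
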
## Challenges and End-Term Goals

- **In-Memory Index and Metadata Storage**
  Most JavaScript search libraries (like Fuse.js, Lunr, MiniSearch) store indexes entirely in memory, which can become problematic for large-scale datasets. A key challenge is designing a scalable solution that supports disk persistence or lazy loading so that the application does not run out of memory.

- **Deciding the Relevance Weights**
  An important challenge is tuning the relevance scoring by assigning appropriate weights to different aspects of the search, such as exact word matches, prefix matches, and typo tolerance. For instance, we prefer exact matches to be ranked higher than fuzzy or partial matches.

- **Implementing the Selected Library**
  Once a library is selected (based on speed, features, and compatibility with Electron + React), the next challenge is integrating it into StatWrap efficiently, ensuring local indexing, accurate search results, and smooth performance even with large projects.

- **Classifying Active and Past Projects in the User Interface**
  To improve navigation and search scoping, we plan to introduce three project sections in the interface: **Pinned**, **Active**, and **Past** projects. This classification will help users prioritize relevant content while enabling smarter indexing strategies.

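To illustrate the relevance-weighting goal from the list above (exact matches outranking prefix and fuzzy matches), here is a small scoring sketch. The weight values are placeholders, since the actual weights are still to be decided, and `WEIGHTS`, `matchKind`, and `scoreDocument` are hypothetical names rather than part of StatWrap or any library.

```javascript
// Placeholder weights: only the ordering exact > prefix > fuzzy comes from the post.
const WEIGHTS = { exact: 1.0, prefix: 0.6, fuzzy: 0.3 };

// Levenshtein edit distance using a single rolling row.
function editDistance(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0]; // value from the previous row, previous column
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j];
      row[j] = Math.min(
        above + 1,                              // deletion
        row[j - 1] + 1,                         // insertion
        diag + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution or match
      );
      diag = above;
    }
  }
  return row[b.length];
}

// Classifies how a query term matches a document term, strongest kind first.
function matchKind(queryTerm, docTerm, maxEdits = 1) {
  if (docTerm === queryTerm) return 'exact';
  if (docTerm.startsWith(queryTerm)) return 'prefix';
  if (editDistance(queryTerm, docTerm) <= maxEdits) return 'fuzzy';
  return null;
}

// Sums, over query terms, the weight of the best match found in the text.
function scoreDocument(query, text) {
  const docTerms = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  const queryTerms = query.toLowerCase().match(/[a-z0-9]+/g) || [];
  let score = 0;
  for (const q of queryTerms) {
    let best = 0;
    for (const t of docTerms) {
      const kind = matchKind(q, t);
      if (kind) best = Math.max(best, WEIGHTS[kind]);
    }
    score += best;
  }
  return score;
}

console.log(scoreDocument('index', 'local index'));       // 1   (exact)
console.log(scoreDocument('index', 'indexing workflow')); // 0.6 (prefix)
console.log(scoreDocument('indx', 'local index'));        // 0.3 (fuzzy, one edit)
```

With weights like these, a document that contains the exact word always ranks above one that matches only by prefix or typo correction; tuning then comes down to choosing the gaps between the three values.
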
Stay tuned for the next blog!
