feat(gen-ai): add eval tests COMPASS-10084 #7641
base: main
Conversation
Pull request overview
This PR adds evaluation tests for gen-ai features using the staging chatbot endpoint. The evaluation framework uses Braintrust to track experiment results and compares the implementation's accuracy against expected outputs.
- Adds comprehensive eval test infrastructure using Braintrust (a minimal sketch of this wiring follows this list)
- Creates test datasets from multiple collections (airbnb, berlin bars, netflix, NYC parking)
- Implements eval cases for both find and aggregate query generation
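For readers unfamiliar with Braintrust, the wiring roughly follows the pattern below. This is a minimal sketch, not the PR's actual code: the project name, the single eval case, and the `generateQuery` stub are illustrative stand-ins for what gen-ai.eval.ts, the use-case builders, and chatbot-api.ts do (it also assumes an OpenAI API key is available, which the `Factuality` scorer needs).

```ts
import { Eval } from 'braintrust';
import { Factuality } from 'autoevals';

// Illustrative stand-in for the call chatbot-api.ts makes to the staging chatbot endpoint.
async function generateQuery(prompt: string): Promise<string> {
  return `/* query generated for: ${prompt} */`;
}

Eval('compass-gen-ai', {
  // Hypothetical eval case; the PR builds these from the fixture collections instead.
  data: () => [
    {
      input: 'Return only the price of the cheapest listing',
      expected: '[{$project: {_id: 0, price: 1}}, {$sort: {price: 1}}, {$limit: 1}]',
    },
  ],
  // The task runs the implementation under test for each case.
  task: (input) => generateQuery(input),
  // Factuality compares the generated output against the expected output.
  scores: [Factuality],
});
```

Running a file like this through the Braintrust eval runner records each case's score as an experiment in the Braintrust dashboard, which is what enables the accuracy comparison described above.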
Reviewed changes
Copilot reviewed 13 out of 15 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| packages/compass-generative-ai/tests/evals/utils.ts | Utility functions for text processing, sampling, and schema generation |
| packages/compass-generative-ai/tests/evals/types.ts | TypeScript type definitions for eval cases and scorers |
| packages/compass-generative-ai/tests/evals/scorers.ts | Factuality scorer implementation using autoevals |
| packages/compass-generative-ai/tests/evals/gen-ai.eval.ts | Main eval entry point configuring the Braintrust evaluation |
| packages/compass-generative-ai/tests/evals/chatbot-api.ts | Chatbot API client for making eval requests |
| packages/compass-generative-ai/tests/evals/use-cases/index.ts | Builds eval cases from test datasets and prompts |
| packages/compass-generative-ai/tests/evals/use-cases/find-query.ts | Find query test cases with expected outputs |
| packages/compass-generative-ai/tests/evals/use-cases/aggregate-query.ts | Aggregate query test cases with expected outputs |
| packages/compass-generative-ai/tests/evals/fixtures/*.ts | Test data fixtures for multiple collections |
| packages/compass-generative-ai/package.json | Added dependencies for AI SDK, braintrust, and autoevals |
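As a rough illustration of what the utilities in utils.ts cover, the sketch below shows one plausible shape for the sampling and schema-generation helpers. The names and signatures here are assumptions for illustration, not the file's actual exports.

```ts
type Doc = Record<string, unknown>;

// Take up to `n` documents from a fixture collection to keep prompts small.
function sampleDocuments(docs: Doc[], n: number): Doc[] {
  return docs.slice(0, Math.min(n, docs.length));
}

// Derive a flat { field: type } description from the sampled documents,
// which can then be embedded in the prompt sent to the chatbot endpoint.
function generateSchema(docs: Doc[]): Record<string, string> {
  const schema: Record<string, string> = {};
  for (const doc of docs) {
    for (const [key, value] of Object.entries(doc)) {
      schema[key] ??= Array.isArray(value) ? 'array' : typeof value;
    }
  }
  return schema;
}
```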
```ts
apiKey: '',
headers: {
  'X-Request-Origin': 'compass-gen-ai-braintrust',
  'User-Agent': 'mongodb-compass/x.x.x',
},
```
Copilot AI · Dec 15, 2025
Using a placeholder version 'x.x.x' in the User-Agent header is not ideal for tracking or debugging. Consider using an actual version number or a constant that can be updated, or make it clear this is for testing purposes.
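One way to act on this suggestion is sketched below. It assumes `resolveJsonModule` is enabled and that the relative path to package.json is correct; neither is confirmed by the PR.

```ts
// Read the real version instead of hard-coding 'x.x.x'.
// The import path is an assumption about where package.json sits relative to the eval files.
import { version } from '../../package.json';

const headers = {
  'X-Request-Origin': 'compass-gen-ai-braintrust',
  'User-Agent': `mongodb-compass/${version}`,
};
```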
```ts
[{
  $project: {_id: 0, precio: "$price"},
  $sort: {price: 1},
  $limit: 1
}]
```
Copilot AI · Dec 15, 2025
The aggregation pipeline is malformed - it's an array containing a single object with multiple pipeline stages as properties. Each pipeline stage should be a separate object in the array. This should be: `[{$project: {_id: 0, precio: "$price"}}, {$sort: {price: 1}}, {$limit: 1}]`.
Suggested change:

```ts
// Before
[{
  $project: {_id: 0, precio: "$price"},
  $sort: {price: 1},
  $limit: 1
}]

// After
[
  {$project: {_id: 0, precio: "$price"}},
  {$sort: {price: 1}},
  {$limit: 1}
]
```
paula-stacho left a comment
I can't test this, but looks good. Thanks for reducing the fixtures!
Added eval tests for gen-ai features using the staging chatbot endpoint. When run locally, the experiments show up under Braintrust (BT) experiments.
I compared the two implementations (gen-ai using the mms api vs. the chatbot). Results are very close in terms of accuracy. Check the first three chatbot-experiments and mms-experiments; they were run on the same dataset under the same conditions (though the sampled documents naturally differ).
For additional context regarding Braintrust, check #7216.
Description
Checklist
Motivation and Context
Open Questions
Dependents
Types of changes