feat(gen-ai): add eval tests COMPASS-10084 #7641
base: main
Conversation
Pull request overview
This PR adds evaluation tests for gen-ai features using the staging chatbot endpoint. The evaluation framework uses Braintrust to track experiment results and compares the implementation's accuracy against expected outputs.
- Adds comprehensive eval test infrastructure using Braintrust (a minimal sketch of this wiring follows this list)
- Creates test datasets from multiple collections (airbnb, berlin bars, netflix, NYC parking)
- Implements eval cases for both find and aggregate query generation
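For readers unfamiliar with Braintrust, the wiring roughly follows the pattern below. This is a minimal sketch, not the PR's actual code: the project name, the single eval case, and the `generateQuery` stub are illustrative stand-ins for what gen-ai.eval.ts, the use-case builders, and chatbot-api.ts do (it also assumes an OpenAI API key is available, which the `Factuality` scorer needs).

```ts
import { Eval } from 'braintrust';
import { Factuality } from 'autoevals';

// Illustrative stand-in for the call chatbot-api.ts makes to the staging chatbot endpoint.
async function generateQuery(prompt: string): Promise<string> {
  return `/* query generated for: ${prompt} */`;
}

Eval('compass-gen-ai', {
  // Hypothetical eval case; the PR builds these from the fixture collections instead.
  data: () => [
    {
      input: 'Return only the price of the cheapest listing',
      expected: '[{$project: {_id: 0, price: 1}}, {$sort: {price: 1}}, {$limit: 1}]',
    },
  ],
  // The task runs the implementation under test for each case.
  task: (input) => generateQuery(input),
  // Factuality compares the generated output against the expected output.
  scores: [Factuality],
});
```

Running a file like this through the Braintrust eval runner records each case's score as an experiment in the Braintrust dashboard, which is what enables the accuracy comparison described above.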
Reviewed changes
Copilot reviewed 13 out of 15 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| packages/compass-generative-ai/tests/evals/utils.ts | Utility functions for text processing, sampling, and schema generation |
| packages/compass-generative-ai/tests/evals/types.ts | TypeScript type definitions for eval cases and scorers |
| packages/compass-generative-ai/tests/evals/scorers.ts | Factuality scorer implementation using autoevals |
| packages/compass-generative-ai/tests/evals/gen-ai.eval.ts | Main eval entry point configuring the Braintrust evaluation |
| packages/compass-generative-ai/tests/evals/chatbot-api.ts | Chatbot API client for making eval requests |
| packages/compass-generative-ai/tests/evals/use-cases/index.ts | Builds eval cases from test datasets and prompts |
| packages/compass-generative-ai/tests/evals/use-cases/find-query.ts | Find query test cases with expected outputs |
| packages/compass-generative-ai/tests/evals/use-cases/aggregate-query.ts | Aggregate query test cases with expected outputs |
| packages/compass-generative-ai/tests/evals/fixtures/*.ts | Test data fixtures for multiple collections |
| packages/compass-generative-ai/package.json | Added dependencies for AI SDK, braintrust, and autoevals |
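As a rough illustration of what the utilities in utils.ts cover, the sketch below shows one plausible shape for the sampling and schema-generation helpers. The names and signatures here are assumptions for illustration, not the file's actual exports.

```ts
type Doc = Record<string, unknown>;

// Take up to `n` documents from a fixture collection to keep prompts small.
function sampleDocuments(docs: Doc[], n: number): Doc[] {
  return docs.slice(0, Math.min(n, docs.length));
}

// Derive a flat { field: type } description from the sampled documents,
// which can then be embedded in the prompt sent to the chatbot endpoint.
function generateSchema(docs: Doc[]): Record<string, string> {
  const schema: Record<string, string> = {};
  for (const doc of docs) {
    for (const [key, value] of Object.entries(doc)) {
      schema[key] ??= Array.isArray(value) ? 'array' : typeof value;
    }
  }
  return schema;
}
```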
```ts
apiKey: '',
headers: {
  'X-Request-Origin': 'compass-gen-ai-braintrust',
  'User-Agent': 'mongodb-compass/x.x.x',
},
```
Copilot AI · Dec 15, 2025
Using a placeholder version 'x.x.x' in the User-Agent header is not ideal for tracking or debugging. Consider using an actual version number or a constant that can be updated, or make it clear this is for testing purposes.
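One way to act on this suggestion is sketched below. It assumes `resolveJsonModule` is enabled and that the relative path to package.json is correct; neither is confirmed by the PR.

```ts
// Read the real version instead of hard-coding 'x.x.x'.
// The import path is an assumption about where package.json sits relative to the eval files.
import { version } from '../../package.json';

const headers = {
  'X-Request-Origin': 'compass-gen-ai-braintrust',
  'User-Agent': `mongodb-compass/${version}`,
};
```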
```ts
[{
  $project: {_id: 0, precio: "$price"},
  $sort: {price: 1},
  $limit: 1
}]
```
Copilot AI · Dec 15, 2025
The aggregation pipeline is malformed - it's an array containing a single object with multiple pipeline stages as properties. Each pipeline stage should be a separate object in the array. This should be: `[{$project: {_id: 0, precio: "$price"}}, {$sort: {price: 1}}, {$limit: 1}]`.
Suggested change:

```ts
// Before
[{
  $project: {_id: 0, precio: "$price"},
  $sort: {price: 1},
  $limit: 1
}]

// After
[
  {$project: {_id: 0, precio: "$price"}},
  {$sort: {price: 1}},
  {$limit: 1}
]
```
paula-stacho left a comment
I can't test this, but looks good. Thanks for reducing the fixtures!
Added eval tests for gen-ai features using the staging chatbot endpoint. When run locally, the experiments show up under Braintrust (BT) experiments.
I compared the two implementations (gen-ai using the mms api vs. the chatbot). Results are very close in terms of accuracy. Check the first three chatbot-experiments and mms-experiments; they were run on the same dataset under the same conditions (though the sampled documents naturally differ).
For additional context regarding Braintrust, check #7216.
Description
Checklist
Motivation and Context
Open Questions
Dependents
Types of changes