feat(evaluation): add core evaluation framework #245
Conversation
I saw your question in the community call. I'll have to defer to @mazas-google on roadmap questions regarding Eval. I imagine the API will closely follow adk-python's implementation.

Thanks @ivanmkc!
I'd love to help on this if
This PR introduces an evaluation framework for testing and measuring AI agent performance. It supports both algorithmic and LLM-as-Judge evaluation methods, with built-in support for response quality, tool usage, safety, and hallucination detection.
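As a rough illustration of the kind of abstraction such a framework typically exposes, here is a minimal Python sketch with one algorithmic metric and one LLM-as-Judge metric. All names here (`EvalCase`, `EvalResult`, `Evaluator`, `ExactMatchEvaluator`, `LLMJudgeEvaluator`) are hypothetical and are not taken from this PR's actual code.

```python
# Hypothetical sketch only: illustrates an evaluator abstraction supporting both
# algorithmic and LLM-as-Judge metrics. Names are illustrative, not this PR's API.
from dataclasses import dataclass
from typing import Callable, Optional, Protocol


@dataclass
class EvalCase:
    """One evaluation case: the user prompt, the agent's response,
    and an optional reference answer to compare against."""
    prompt: str
    response: str
    reference: Optional[str] = None


@dataclass
class EvalResult:
    """Outcome of one metric on one case: a score in [0, 1] plus an optional rationale."""
    metric: str
    score: float
    rationale: str = ""


class Evaluator(Protocol):
    """Common interface shared by algorithmic and LLM-as-Judge evaluators."""

    def evaluate(self, case: EvalCase) -> EvalResult: ...


class ExactMatchEvaluator:
    """Algorithmic metric: 1.0 if the response matches the reference exactly."""

    def evaluate(self, case: EvalCase) -> EvalResult:
        matched = case.reference is not None and case.response.strip() == case.reference.strip()
        return EvalResult(metric="exact_match", score=1.0 if matched else 0.0)


class LLMJudgeEvaluator:
    """LLM-as-Judge metric: asks a judge model to rate the response.

    `judge` is any callable that takes a prompt string and returns the judge
    model's reply; wiring it to a real model is out of scope for this sketch.
    """

    def __init__(self, judge: Callable[[str], str], criterion: str = "response quality"):
        self.judge = judge
        self.criterion = criterion

    def evaluate(self, case: EvalCase) -> EvalResult:
        reply = self.judge(
            f"Rate the following response for {self.criterion} on a scale of 0 to 1.\n"
            f"Prompt: {case.prompt}\nResponse: {case.response}\nScore:"
        )
        return EvalResult(metric=self.criterion, score=float(reply.strip()), rationale=reply)
```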
Tip
This PR uses atomic commits organized by feature. For the best review experience, I suggest reviewing it commit-by-commit to see the logical progression of the implementation.
Note
I follow the Conventional Commits specification for a structured commit history.
Features:
Usage
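As a rough illustration of how such a framework is typically driven, here is a hypothetical usage snippet continuing the sketch above; it is not this PR's actual API.

```python
# Hypothetical usage, continuing the sketch above; not the actual API in this PR.
cases = [
    EvalCase(prompt="What is 2 + 2?", response="4", reference="4"),
    EvalCase(prompt="Summarize the report.", response="The report covers Q3 revenue."),
]

evaluators = [
    ExactMatchEvaluator(),
    # A trivial stand-in judge that always returns a perfect score,
    # so this example runs without a real model behind it.
    LLMJudgeEvaluator(judge=lambda prompt: "1.0", criterion="response quality"),
]

for case in cases:
    for evaluator in evaluators:
        result = evaluator.evaluate(case)
        print(f"{result.metric}: {result.score:.2f}")
```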
Testing
Two examples are provided to demonstrate the features:
Run examples: