# Evaluation Framework Examples

This directory contains examples demonstrating the ADK evaluation framework for testing and measuring AI agent performance.

## Available Examples

### [Basic](./basic/) ⚡ **Start Here**
Simple introduction to LLM-based evaluation:
- Core evaluation setup
- 2 evaluators (algorithmic + LLM-as-Judge)
- Built-in rate limiting
- In-memory storage
- Clear result output

**Best for:** Getting started, understanding fundamentals

### [Comprehensive](./comprehensive/)
- All 8 evaluation metrics
- Agent with custom tools
- File-based persistent storage
- Rubric-based evaluation
- Safety and hallucination detection
- Automatic rate limiting
- Detailed result reporting

## Quick Start

1. Set your API key:
```bash
export GOOGLE_API_KEY=your_api_key_here
```

2. From this directory, try the basic example (with LLM evaluation):
```bash
cd basic
go run main.go
```

3. From this directory, run the comprehensive example (all features):
```bash
cd comprehensive
go run main.go
```

## Evaluation Framework Overview

### Core Components

- **EvalSet**: Collection of test cases for systematic evaluation
- **EvalCase**: Single test scenario with conversation flow and expected outcomes
- **Evaluator**: Metric-specific evaluation logic
- **Runner**: Orchestrates evaluation execution
- **Storage**: Persists eval sets and results

### Available Metrics

#### Response Quality
1. **RESPONSE_MATCH_SCORE** - ROUGE-1 algorithmic comparison
2. **SEMANTIC_RESPONSE_MATCH** - LLM-as-Judge semantic validation
3. **RESPONSE_EVALUATION_SCORE** - Coherence assessment (1-5 scale)
4. **RUBRIC_BASED_RESPONSE_QUALITY** - Custom quality criteria

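For intuition about what a ROUGE-1 comparison measures, the sketch below computes unigram overlap between a response and a reference. It is not the ADK implementation: the function names, the whitespace tokenizer, and the choice to report precision, recall, and F1 together are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize lowercases the text and splits it on whitespace.
// Assumption: production ROUGE implementations often also strip
// punctuation or apply stemming, which this sketch does not do.
func tokenize(s string) []string {
	return strings.Fields(strings.ToLower(s))
}

// rouge1 returns unigram precision, recall, and F1 of candidate against reference.
func rouge1(candidate, reference string) (precision, recall, f1 float64) {
	cand := tokenize(candidate)
	ref := tokenize(reference)
	if len(cand) == 0 || len(ref) == 0 {
		return 0, 0, 0
	}

	// Count reference unigrams so that repeated candidate words are clipped
	// to the number of times they occur in the reference.
	refCounts := make(map[string]int, len(ref))
	for _, tok := range ref {
		refCounts[tok]++
	}

	overlap := 0
	for _, tok := range cand {
		if refCounts[tok] > 0 {
			refCounts[tok]--
			overlap++
		}
	}
	if overlap == 0 {
		return 0, 0, 0
	}

	precision = float64(overlap) / float64(len(cand))
	recall = float64(overlap) / float64(len(ref))
	f1 = 2 * precision * recall / (precision + recall)
	return precision, recall, f1
}

func main() {
	p, r, f := rouge1("the cat sat", "the cat sat on the mat")
	fmt.Printf("precision=%.2f recall=%.2f f1=%.2f\n", p, r, f)
	// prints: precision=1.00 recall=0.50 f1=0.67
}
```

A score like this is cheap and deterministic, but it rewards shared wording rather than shared meaning, which is the gap the LLM-as-Judge metrics are meant to cover.
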
#### Tool Usage
5. **TOOL_TRAJECTORY_AVG_SCORE** - Exact tool sequence matching
6. **RUBRIC_BASED_TOOL_USE_QUALITY** - Custom tool quality criteria

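Exact sequence matching can be pictured as follows. This sketch assumes each turn scores 1 when the actual tool calls equal the expected ones (same names, same arguments, same order) and 0 otherwise, and that the turn scores are then averaged. `ToolCall` and all three functions are hypothetical and do not come from the ADK API.

```go
package main

import (
	"fmt"
	"reflect"
)

// ToolCall is a hypothetical record of one tool invocation.
type ToolCall struct {
	Name string
	Args map[string]any
}

// argsEqual treats nil and empty argument maps as equal; otherwise it
// compares deeply. Note that DeepEqual distinguishes int(1) from float64(1).
func argsEqual(a, b map[string]any) bool {
	if len(a) == 0 && len(b) == 0 {
		return true
	}
	return reflect.DeepEqual(a, b)
}

// trajectoryMatches reports whether actual equals expected, call for call, in order.
func trajectoryMatches(actual, expected []ToolCall) bool {
	if len(actual) != len(expected) {
		return false
	}
	for i := range expected {
		if actual[i].Name != expected[i].Name || !argsEqual(actual[i].Args, expected[i].Args) {
			return false
		}
	}
	return true
}

// trajectoryAvgScore scores each turn 1 for an exact match and 0 otherwise,
// then averages over the expected turns. Missing actual turns count as 0.
func trajectoryAvgScore(actual, expected [][]ToolCall) float64 {
	if len(expected) == 0 {
		return 0
	}
	matched := 0
	for i := range expected {
		if i < len(actual) && trajectoryMatches(actual[i], expected[i]) {
			matched++
		}
	}
	return float64(matched) / float64(len(expected))
}

func main() {
	expected := [][]ToolCall{
		{{Name: "lookup_order", Args: map[string]any{"id": "42"}}},
		{{Name: "get_weather"}, {Name: "send_email"}},
	}
	actual := [][]ToolCall{
		{{Name: "lookup_order", Args: map[string]any{"id": "42"}}},
		{{Name: "send_email"}, {Name: "get_weather"}}, // same tools, wrong order
	}
	fmt.Println(trajectoryAvgScore(actual, expected)) // prints 0.5
}
```
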
#### Safety & Quality
7. **SAFETY** - Harmlessness evaluation
8. **HALLUCINATIONS** - Unsupported claim detection

### Evaluation Methods

- **Algorithmic**: Fast, deterministic comparisons (ROUGE, exact matching)
- **LLM-as-Judge**: Flexible semantic evaluation with customizable rubrics

## Use Cases

### Development Testing
```go
// Quick validation during development
config := &evaluation.EvalConfig{
	Criteria: map[string]evaluation.Criterion{
		"response_match": &evaluation.Threshold{MinScore: 0.7},
	},
}
```

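The idea behind a minimum-score criterion can be sketched without the framework. The helper below is hypothetical (`failingMetrics` is not an ADK function) and assumes a metric passes when its score is at least the configured minimum; how the framework treats a score exactly at the threshold is not stated in this README.

```go
package main

import (
	"fmt"
	"sort"
)

// failingMetrics returns the sorted names of metrics whose score is
// missing or below the configured minimum.
// Assumption: a score equal to the minimum counts as a pass.
func failingMetrics(scores, minScores map[string]float64) []string {
	var failed []string
	for metric, minScore := range minScores {
		score, ok := scores[metric]
		if !ok || score < minScore {
			failed = append(failed, metric)
		}
	}
	sort.Strings(failed) // map iteration order is random, so sort for stable output
	return failed
}

func main() {
	minScores := map[string]float64{"response_match": 0.7}
	scores := map[string]float64{"response_match": 0.64}
	fmt.Println(failingMetrics(scores, minScores)) // prints [response_match]
}
```
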
## Storage Options

### In-Memory
```go
evalStorage := storage.NewMemoryStorage()
```
- Fast, no persistence
- Ideal for testing and development

### File-Based
```go
evalStorage, err := storage.NewFileStorage("./eval_data")
// Check err before using evalStorage.
```
- JSON persistence to disk
- Ideal for CI/CD and analysis

## Integration Patterns

### CI/CD Integration
Run evaluations in your pipeline:
```bash
go run ./evaluation_runner.go || exit 1
```

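The shell line above only fails the pipeline if the runner itself exits non-zero when an evaluation fails, since `go run` exits non-zero when the program it runs does. A minimal sketch of that contract, with hypothetical case names and a hypothetical `exitCode` helper:

```go
package main

import (
	"fmt"
	"os"
)

// exitCode maps per-case pass/fail outcomes to a process exit status:
// 0 when every case passed, 1 when at least one failed.
func exitCode(passed map[string]bool) int {
	code := 0
	for name, ok := range passed {
		if !ok {
			fmt.Fprintf(os.Stderr, "FAIL %s\n", name)
			code = 1
		}
	}
	return code
}

func main() {
	// Hypothetical outcomes; a real runner would fill these from evaluation results.
	passed := map[string]bool{"greeting": true, "refund_policy": true}
	os.Exit(exitCode(passed))
}
```

Whether an empty result set should count as success is a policy choice; this sketch returns 0 in that case.
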
### REST API
Expose evaluation via HTTP endpoints (see the comprehensive example).

### Custom Evaluators
Register your own domain-specific evaluators:
```go
evaluation.Register(myMetric, myEvaluatorFactory)
```
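The registration call above follows a common factory-registry pattern. The sketch below shows that pattern in isolation; the `Evaluator` interface, `Factory` type, and `Registry` are illustrative assumptions, not the signatures used by the `evaluation` package.

```go
package main

import (
	"fmt"
	"strings"
)

// Evaluator is a hypothetical interface: it scores a response in [0, 1].
type Evaluator interface {
	Evaluate(actual, expected string) float64
}

// Factory builds a fresh Evaluator each time a metric is requested.
type Factory func() Evaluator

// Registry maps metric names to factories. It is not safe for concurrent use.
type Registry struct {
	factories map[string]Factory
}

func NewRegistry() *Registry {
	return &Registry{factories: make(map[string]Factory)}
}

// Register adds a factory and rejects duplicate metric names.
func (r *Registry) Register(metric string, f Factory) error {
	if _, exists := r.factories[metric]; exists {
		return fmt.Errorf("metric %q already registered", metric)
	}
	r.factories[metric] = f
	return nil
}

// New instantiates the evaluator registered for metric.
func (r *Registry) New(metric string) (Evaluator, error) {
	f, ok := r.factories[metric]
	if !ok {
		return nil, fmt.Errorf("no evaluator registered for metric %q", metric)
	}
	return f(), nil
}

// keywordEvaluator is a toy domain-specific evaluator: it gives full
// credit when the response mentions the expected keyword.
type keywordEvaluator struct{}

func (keywordEvaluator) Evaluate(actual, expected string) float64 {
	if strings.Contains(strings.ToLower(actual), strings.ToLower(expected)) {
		return 1
	}
	return 0
}

func main() {
	reg := NewRegistry()
	if err := reg.Register("mentions_keyword", func() Evaluator { return keywordEvaluator{} }); err != nil {
		panic(err)
	}
	ev, err := reg.New("mentions_keyword")
	if err != nil {
		panic(err)
	}
	fmt.Println(ev.Evaluate("Refunds take 5 business days.", "refunds")) // prints 1
}
```

Registering factories rather than instances lets each evaluation run get its own evaluator, which matters if an evaluator carries per-run state.
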

## Requirements

- Go 1.24.4 or later
- Google API key (for Gemini models)
- ADK dependencies (automatically managed by Go modules)