Commit a28312c
docs(examples): add evaluation examples overview
1 parent 1565838

1 file changed: examples/evaluation/README.md (+123, -0)
# Evaluation Framework Examples
This directory contains examples demonstrating the ADK evaluation framework for testing and measuring AI agent performance.

## Available Examples

### [Basic](./basic/) - **Start Here**

Simple introduction to LLM-based evaluation:

- Core evaluation setup
- 2 evaluators (algorithmic + LLM-as-Judge)
- Built-in rate limiting
- In-memory storage
- Clear result output

**Best for:** Getting started, understanding fundamentals
### [Comprehensive](./comprehensive/)

Full demonstration of the framework:

- All 8 evaluation metrics
- Agent with custom tools
- File-based persistent storage
- Rubric-based evaluation
- Safety and hallucination detection
- Automatic rate limiting
- Detailed result reporting
## Quick Start

1. Set your API key:
```bash
export GOOGLE_API_KEY=your_api_key_here
```

2. Try the basic example (with LLM evaluation):
```bash
cd basic
go run main.go
```

3. Run the comprehensive example (all features):
```bash
cd comprehensive
go run main.go
```

## Evaluation Framework Overview
### Core Components

- **EvalSet**: Collection of test cases for systematic evaluation
- **EvalCase**: Single test scenario with conversation flow and expected outcomes
- **Evaluator**: Metric-specific evaluation logic
- **Runner**: Orchestrates evaluation execution
- **Storage**: Persists eval sets and results
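
A rough sketch of how these pieces compose. Apart from `storage.NewMemoryStorage` (covered under Storage Options below), the constructors and field names here are illustrative assumptions, not the confirmed API; see `basic/main.go` for the real wiring:

```go
// Illustrative only: names below are assumptions, not the confirmed ADK API.
evalSet := &evaluation.EvalSet{ // collection of test cases
	Name: "smoke-tests",
	Cases: []*evaluation.EvalCase{
		{Name: "greeting"}, // one scenario: conversation flow + expected outcome
	},
}

evalStorage := storage.NewMemoryStorage() // persists eval sets and results

runner := evaluation.NewRunner(agent, evalStorage, config) // orchestrates execution
results, err := runner.Run(ctx, evalSet)                   // applies each Evaluator
```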

### Available Metrics

#### Response Quality

1. **RESPONSE_MATCH_SCORE** - ROUGE-1 algorithmic comparison
2. **SEMANTIC_RESPONSE_MATCH** - LLM-as-Judge semantic validation
3. **RESPONSE_EVALUATION_SCORE** - Coherence assessment (1-5 scale)
4. **RUBRIC_BASED_RESPONSE_QUALITY** - Custom quality criteria

#### Tool Usage

5. **TOOL_TRAJECTORY_AVG_SCORE** - Exact tool sequence matching
6. **RUBRIC_BASED_TOOL_USE_QUALITY** - Custom tool quality criteria

#### Safety & Quality

7. **SAFETY** - Harmlessness evaluation
8. **HALLUCINATIONS** - Unsupported claim detection
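
Judging from the Development Testing snippet under Use Cases below, metrics are enabled through the `Criteria` map on `evaluation.EvalConfig`. An illustrative sketch enabling several at once; the criterion keys are assumptions derived from the metric names above, so check the comprehensive example for the exact identifiers:

```go
// Sketch: the map keys are guesses based on the metric names; only the
// EvalConfig/Criterion/Threshold shapes are taken from the snippet below.
config := &evaluation.EvalConfig{
	Criteria: map[string]evaluation.Criterion{
		"response_match_score":      &evaluation.Threshold{MinScore: 0.7}, // ROUGE-1
		"tool_trajectory_avg_score": &evaluation.Threshold{MinScore: 1.0}, // exact tool sequence
		"safety":                    &evaluation.Threshold{MinScore: 0.9}, // harmlessness
	},
}
```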

### Evaluation Methods

- **Algorithmic**: Fast, deterministic comparisons (ROUGE, exact matching)
- **LLM-as-Judge**: Flexible semantic evaluation with customizable rubrics

## Use Cases

### Development Testing

```go
// Quick validation during development
config := &evaluation.EvalConfig{
	Criteria: map[string]evaluation.Criterion{
		"response_match": &evaluation.Threshold{MinScore: 0.7},
	},
}
```

## Storage Options

### In-Memory

```go
evalStorage := storage.NewMemoryStorage()
```

- Fast, no persistence
- Ideal for testing and development

### File-Based

```go
evalStorage, err := storage.NewFileStorage("./eval_data")
```

- JSON persistence to disk
- Ideal for CI/CD and analysis
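
Since `NewFileStorage` returns an error (presumably when the directory cannot be created or accessed), handle it in the usual Go way:

```go
evalStorage, err := storage.NewFileStorage("./eval_data")
if err != nil {
	log.Fatalf("creating eval storage: %v", err) // needs the standard "log" import
}
```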

## Integration Patterns

### CI/CD Integration

Run evaluations in your pipeline:

```bash
go run ./evaluation_runner.go || exit 1
```
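
A minimal skeleton for such a runner; `runEvaluations` is a placeholder for your own wiring of the framework, not an ADK function, and the only load-bearing part is the non-zero exit code that fails the CI step:

```go
package main

import (
	"context"
	"os"
)

// runEvaluations is a placeholder: build your eval set, runner, and
// storage here and report whether every criterion passed.
func runEvaluations(ctx context.Context) (passed bool, err error) {
	// ... wire up EvalSet, Runner, and Storage ...
	return true, nil
}

func main() {
	passed, err := runEvaluations(context.Background())
	if err != nil || !passed {
		os.Exit(1) // non-zero exit fails the pipeline step
	}
}
```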

### REST API

Expose evaluation via HTTP endpoints (see the comprehensive example).

### Custom Evaluators

Register your own domain-specific evaluators:

```go
evaluation.Register(myMetric, myEvaluatorFactory)
```
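
The shapes of `myMetric` and `myEvaluatorFactory` are not spelled out here; one plausible arrangement, in which everything except `evaluation.Register` itself is an assumption:

```go
// Illustrative guess at the extension point; consult the framework
// source for the real Evaluator interface and Register signature.
const myMetric = "MY_DOMAIN_METRIC" // hypothetical metric name

type myEvaluator struct{}

// Implement the framework's Evaluator interface on myEvaluator here.

func myEvaluatorFactory() evaluation.Evaluator { // assumed factory shape
	return &myEvaluator{}
}

func init() {
	// Register at package init so the runner can resolve the metric by name.
	evaluation.Register(myMetric, myEvaluatorFactory)
}
```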

## Requirements

- Go 1.24.4 or later
- Google API key (for Gemini models)
- ADK dependencies (automatically managed by Go modules)
