93 changes: 30 additions & 63 deletions examples/k_module_problem/README.md
@@ -52,51 +52,6 @@ Generation 2 (crossover):

**Key insight**: Evolution discovers correct modules in different individuals and **crossover combines them**. This is the "Building Block Hypothesis" - complex solutions are assembled from simpler discovered components.
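
A minimal sketch of that mechanism, assuming each candidate is just a mapping from four module slots to one of five implementation names (the slot names, option names, and scoring below are illustrative, not the evaluator used in this example):

```python
import random

MODULES = ["reader", "preprocess", "sort", "output"]              # hypothetical slot names
OPTIONS = {m: [f"{m}_v{i}" for i in range(5)] for m in MODULES}   # 5 candidate implementations per slot
TARGET = {m: OPTIONS[m][0] for m in MODULES}                      # assume one correct combination

def score(candidate):
    """Fraction of slots filled with the correct implementation."""
    return sum(candidate[m] == TARGET[m] for m in MODULES) / len(MODULES)

def crossover(parent_a, parent_b):
    """Uniform crossover: each slot is inherited from one parent at random."""
    return {m: random.choice((parent_a[m], parent_b[m])) for m in MODULES}

# Two parents that each discovered a different pair of correct modules (score 0.5 each)
a = {"reader": "reader_v0", "preprocess": "preprocess_v0", "sort": "sort_v3", "output": "output_v2"}
b = {"reader": "reader_v4", "preprocess": "preprocess_v1", "sort": "sort_v0", "output": "output_v0"}

child = crossover(a, b)
print(score(a), score(b), score(child))  # the child can reach 1.0 by inheriting the right slots
```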

## Theoretical Analysis

| Method | Expected Evaluations | Why |
|--------|---------------------|-----|
| **Random Search** | ~312 (50% of space) | Pure luck |
| **Pass@100 (LLM)** | ~100 calls, ~15% success | Independent samples, no learning |
| **Iterative Refinement** | ~312+ | No gradient, random walk |
| **Evolution (pop=20)** | ~40-60 | Parallel exploration + crossover |

With five candidate implementations per module the search space is 5^K, so the gap widens exponentially with more modules (see the quick check after this list):
- K=5 modules: Iterative ~1,562, Evolution ~70
- K=6 modules: Iterative ~7,812, Evolution ~90
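
A quick check of where those numbers come from, assuming five candidate implementations per module and a single correct configuration (so unguided search covers about half the space on average):

```python
for k in (4, 5, 6):
    space = 5 ** k                   # number of distinct module configurations
    print(k, space, space // 2)      # K, search-space size, expected evaluations for blind search
# 4 -> 625 / ~312,  5 -> 3125 / ~1562,  6 -> 15625 / ~7812
```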

### Note on Pass@k with Closed Models

The pass@k metric (the probability of finding a solution in k independent attempts) is commonly used to evaluate LLM capabilities. However:

- **Open models** (local): Can generate k responses in parallel with `n=k` parameter
- **Closed models** (API): Most don't support `n>1`, requiring k separate API calls

For this comparison, we include a **random baseline** that simulates pass@k without an LLM. This establishes the "no learning" baseline.
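
A minimal sketch of what such a baseline does, assuming 4 modules with 5 options each and exactly one correct configuration (the actual script is `run_random_baseline.py`; this only illustrates the idea):

```python
import random

N_CONFIGS = 5 ** 4            # 625 possible module combinations
TRIALS, SAMPLES = 100, 100    # 100 independent trials, up to 100 uniform samples each

first_hits = []
for _ in range(TRIALS):
    target = random.randrange(N_CONFIGS)                    # index of the one correct configuration
    draws = [random.randrange(N_CONFIGS) for _ in range(SAMPLES)]
    if target in draws:
        first_hits.append(draws.index(target) + 1)          # sample count at first success

print(f"pass@{SAMPLES}: {len(first_hits) / TRIALS:.0%}")
if first_hits:
    print(f"avg samples to solution: {sum(first_hits) / len(first_hits):.1f}")
```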

### Random Baseline Results (100 trials, 100 samples each)

| Metric | Value |
|--------|-------|
| **Success rate (pass@100)** | 16% (16/100 trials found a solution) |
| **Avg samples to solution** | 43.3 (when found) |
| **Min samples** | 5 (lucky guess) |
| **Max samples** | 91 |

**Pass@k breakdown:**

| k | Empirical | Theoretical |
|---|-----------|-------------|
| 1 | 0% | 0.2% |
| 10 | 1% | 1.6% |
| 20 | 4% | 3.2% |
| 50 | 9% | 7.7% |
| 100 | 16% | 14.8% |

The empirical results closely match the theoretical prediction `pass@k ≈ 1 - (624/625)^k`.
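
The theoretical column can be reproduced directly from that formula (a small standalone check, not one of the example's scripts):

```python
# pass@k for uniform sampling over 625 configurations with one correct answer
for k in (1, 10, 20, 50, 100):
    print(k, round(1 - (624 / 625) ** k, 3))
# 1 -> 0.002, 10 -> 0.016, 20 -> 0.032, 50 -> 0.077, 100 -> 0.148
```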

Any method that beats this baseline is demonstrating actual optimization, not just random sampling.

## Running the Experiment

### Prerequisites
@@ -159,6 +114,17 @@ This generates:

## Experimental Results

### Random Baseline (100 trials, 100 samples each)

| Metric | Value |
|--------|-------|
| **Success rate (pass@100)** | 16% (16/100 trials found a solution) |
| **Avg samples to solution** | 43.3 (when found) |
| **Min samples** | 5 (lucky guess) |
| **Max samples** | 91 |

This establishes the "no learning" baseline. Any method that beats this is demonstrating actual optimization, not just random sampling.

### Iterative Refinement Results (3 trials, 100 iterations max)

| Trial | Iterations | Result | Best Score |
@@ -174,31 +140,31 @@

**Key observation**: The iterative agent repeatedly finds configurations with 3/4 correct modules (`csv_reader`, `quicksort`, `json`) but cannot identify that `preprocess` is the wrong module. It keeps cycling through variations without escaping this local optimum.

### OpenEvolve (Evolutionary) Results
### OpenEvolve (Evolutionary) Results (3 trials, 100 iterations max)

| Trial | Iterations | Result | Best Score | Notes |
|-------|------------|--------|------------|-------|
| 1 | 21 | SUCCESS | 100% (4/4) | Solution found through population diversity |
| Trial | Iterations | Result | Best Score |
|-------|------------|--------|------------|
| 1 | 18 | SUCCESS | 100% (4/4) |
| 2 | 50 | SUCCESS | 100% (4/4) |
| 3 | 89 | SUCCESS | 100% (4/4) |

**Summary:**
- **Success rate**: 100% (1/1 trial found the solution)
- **Solution found at**: Iteration 21
- **Key observation**: OpenEvolve's population-based approach explores multiple configurations in parallel. By iteration 9, the population already had diverse configurations, and by iteration 21, the correct combination was discovered.
- **Success rate**: 100% (3/3 trials found the solution)
- **Avg iterations to solution**: 52.3
- **Min iterations**: 18
- **Max iterations**: 89

**Progression:**
- Iteration 3: 25% (1/4) - Initial exploration
- Iteration 9: 50% (2/4) - Multiple 50% configs in population
- Iteration 21: 100% (4/4) - csv_reader, normalize, quicksort, json - PERFECT!

**Key advantage**: OpenEvolve's prompt encourages systematic exploration ("try DIFFERENT options for EACH module") rather than following potentially misleading hints. Combined with higher temperature (0.9), larger population (25), and more frequent migration, this leads to faster discovery.
**Key advantage**: OpenEvolve's population-based approach maintains diverse configurations that explore different module combinations in parallel. Even when some individuals get stuck at local optima (75% with wrong preprocessing), others explore alternatives and eventually discover the correct solution.

### Comparison Summary

| Method | Success Rate | Evaluations to Solution | Key Limitation |
|--------|-------------|------------------------|----------------|
| **Random Baseline** | 16% | 43.3 avg (when found) | No learning |
| **Iterative Refinement** | 33% | 13 (when found) | Gets stuck at 75%, can't escape local optima |
| **OpenEvolve** | 100% | 21 | Population diversity + systematic exploration |
| Method | Success Rate | Avg Iterations | Key Finding |
|--------|-------------|----------------|-------------|
| **Random Baseline** | 16% | 43.3 (when found) | No learning baseline |
| **Iterative Refinement** | 33% (1/3) | 13 (when found) | Gets stuck at 75% local optimum |
| **OpenEvolve** | **100% (3/3)** | 52.3 | Always finds solution |

**Key insight**: While OpenEvolve takes more iterations on average (52.3 vs 13), it has a **100% success rate** compared to iterative refinement's 33%. The evolutionary approach's population diversity ensures it eventually escapes local optima that trap single-trajectory methods.

## Why This Matters

Expand All @@ -224,6 +190,7 @@ Real-world examples:
| `config.yaml` | OpenEvolve configuration |
| `iterative_agent.py` | Iterative refinement agent using OpenRouter API |
| `run_iterative_trials.py` | Run multiple trials of iterative agent |
| `run_openevolve_trials.py` | Run multiple trials of OpenEvolve |
| `run_random_baseline.py` | Random search baseline with pass@k analysis |
| `compare_results.py` | Analysis and visualization |

4 changes: 2 additions & 2 deletions examples/k_module_problem/config.yaml
@@ -81,7 +81,7 @@ evaluator:
use_llm_feedback: false
enable_artifacts: true

# Early stopping - stop when we find the solution
early_stopping_patience: 30 # Reduced - expect faster convergence
# Early stopping - disabled to allow full exploration
early_stopping_patience: 100 # Allow full run
convergence_threshold: 0.001
early_stopping_metric: "combined_score"
169 changes: 169 additions & 0 deletions examples/k_module_problem/run_openevolve_trials.py
@@ -0,0 +1,169 @@
#!/usr/bin/env python3
"""Run multiple trials of OpenEvolve to get statistics."""

import json
import os
import shutil
import subprocess
import sys
from pathlib import Path

# Run from the example directory
os.chdir(Path(__file__).parent)


def run_trial(trial_num: int, max_iterations: int = 100, seed: int = None):
"""Run a single OpenEvolve trial."""
output_dir = f"openevolve_output_trial_{trial_num}"

# Clean output directory
if os.path.exists(output_dir):
shutil.rmtree(output_dir)

# Update config with new seed if provided
if seed is not None:
# Read config
with open("config.yaml", "r") as f:
config_content = f.read()

# Replace seed
import re
config_content = re.sub(r'random_seed:\s*\d+', f'random_seed: {seed}', config_content)

# Write temp config
temp_config = f"config_trial_{trial_num}.yaml"
with open(temp_config, "w") as f:
f.write(config_content)
else:
temp_config = "config.yaml"

# Run OpenEvolve
cmd = [
"openevolve-run",
"initial_program.py",
"evaluator.py",
"--config", temp_config,
"--iterations", str(max_iterations),
"--output", output_dir,
]

print(f"\n{'='*60}")
print(f"TRIAL {trial_num + 1}: Running OpenEvolve with seed {seed}")
print('='*60)

result = subprocess.run(cmd, capture_output=True, text=True)

# Clean up temp config
if seed is not None and os.path.exists(temp_config):
os.remove(temp_config)

# Parse results from log
solution_found_at = None
best_score = 0.0

log_dir = Path(output_dir) / "logs"
if log_dir.exists():
log_files = list(log_dir.glob("*.log"))
if log_files:
with open(log_files[0], "r") as f:
log_content = f.read()

import re

# Find best score
score_matches = re.findall(r'combined_score[=:]\s*([\d.]+)', log_content)
if score_matches:
best_score = max(float(s) for s in score_matches)

# Look for first 100% solution - find the "New best" line with 1.0000
new_best_matches = re.findall(r'New best solution found at iteration (\d+):', log_content)
perfect_matches = re.findall(r'Iteration (\d+):.*?combined_score=1\.0000', log_content)

if perfect_matches:
solution_found_at = int(perfect_matches[0])
elif best_score >= 1.0 and new_best_matches:
# Fallback: find last new best if we have 100%
solution_found_at = int(new_best_matches[-1])

return {
"trial": trial_num,
"seed": seed,
"solution_found_at": solution_found_at,
"best_score": best_score,
"max_iterations": max_iterations,
}


def run_trials(num_trials: int = 3, max_iterations: int = 100, base_seed: int = 100):
"""Run multiple trials and collect statistics."""
results = []
solutions_found = []

for trial in range(num_trials):
seed = base_seed + trial * 111 # Different seeds for each trial
result = run_trial(trial, max_iterations, seed)
results.append(result)

if result["solution_found_at"] is not None:
solutions_found.append(result["solution_found_at"])
print(f"Trial {trial + 1}: SUCCESS at iteration {result['solution_found_at']}")
else:
print(f"Trial {trial + 1}: FAILED (best score: {result['best_score']:.2%})")

# Calculate statistics
success_rate = len(solutions_found) / num_trials
avg_iterations = sum(solutions_found) / len(solutions_found) if solutions_found else float('inf')
min_iterations = min(solutions_found) if solutions_found else None
max_iterations_found = max(solutions_found) if solutions_found else None

print(f"\n{'='*60}")
print("OPENEVOLVE TRIAL RESULTS")
print('='*60)
print(f"Trials: {num_trials}")
print(f"Max iterations per trial: {max_iterations}")
print(f"Success rate: {success_rate:.0%} ({len(solutions_found)}/{num_trials})")
if solutions_found:
print(f"Avg iterations to solution: {avg_iterations:.1f}")
print(f"Min iterations: {min_iterations}")
print(f"Max iterations: {max_iterations_found}")
print('='*60)

# Save summary
summary = {
"config": {
"num_trials": num_trials,
"max_iterations": max_iterations,
},
"summary": {
"success_rate": success_rate,
"avg_iterations_to_solution": avg_iterations if solutions_found else None,
"min_iterations": min_iterations,
"max_iterations": max_iterations_found,
"solutions_found": len(solutions_found),
},
"trials": results,
}

with open("openevolve_trials_results.json", "w") as f:
json.dump(summary, f, indent=2)

print(f"\nResults saved to: openevolve_trials_results.json")

# Clean up trial output directories
for trial in range(num_trials):
output_dir = f"openevolve_output_trial_{trial}"
if os.path.exists(output_dir):
shutil.rmtree(output_dir)

return summary


if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--trials", type=int, default=3, help="Number of trials")
parser.add_argument("--iterations", type=int, default=100, help="Max iterations per trial")
parser.add_argument("--seed", type=int, default=100, help="Base random seed")
args = parser.parse_args()

run_trials(num_trials=args.trials, max_iterations=args.iterations, base_seed=args.seed)