diff --git a/examples/k_module_problem/README.md b/examples/k_module_problem/README.md index 897d58c38..dc995c279 100644 --- a/examples/k_module_problem/README.md +++ b/examples/k_module_problem/README.md @@ -52,51 +52,6 @@ Generation 2 (crossover): **Key insight**: Evolution discovers correct modules in different individuals and **crossover combines them**. This is the "Building Block Hypothesis" - complex solutions are assembled from simpler discovered components. -## Theoretical Analysis - -| Method | Expected Evaluations | Why | -|--------|---------------------|-----| -| **Random Search** | ~312 (50% of space) | Pure luck | -| **Pass@100 (LLM)** | ~100 calls, ~15% success | Independent samples, no learning | -| **Iterative Refinement** | ~312+ | No gradient, random walk | -| **Evolution (pop=20)** | ~40-60 | Parallel exploration + crossover | - -The gap widens exponentially with more modules: -- K=5 modules: Iterative ~1,562, Evolution ~70 -- K=6 modules: Iterative ~7,812, Evolution ~90 - -### Note on Pass@k with Closed Models - -The pass@k metric (probability of finding solution in k independent attempts) is commonly used to evaluate LLM capabilities. However: - -- **Open models** (local): Can generate k responses in parallel with `n=k` parameter -- **Closed models** (API): Most don't support `n>1`, requiring k separate API calls - -For this comparison, we include a **random baseline** that simulates pass@k without an LLM. This establishes the "no learning" baseline. - -### Random Baseline Results (100 trials, 100 samples each) - -| Metric | Value | -|--------|-------| -| **Success rate (pass@100)** | 16% (16/100 trials found solution) | -| **Avg samples to solution** | 43.3 (when found) | -| **Min samples** | 5 (lucky guess) | -| **Max samples** | 91 | - -**Pass@k breakdown:** - -| k | Empirical | Theoretical | -|---|-----------|-------------| -| 1 | 0% | 0.2% | -| 10 | 1% | 1.6% | -| 20 | 4% | 3.2% | -| 50 | 9% | 7.7% | -| 100 | 16% | 14.8% | - -The empirical results closely match the theoretical prediction `pass@k ≈ 1 - (624/625)^k`. - -Any method that beats this baseline is demonstrating actual optimization, not just random sampling. - ## Running the Experiment ### Prerequisites @@ -159,6 +114,17 @@ This generates: ## Experimental Results +### Random Baseline (100 trials, 100 samples each) + +| Metric | Value | +|--------|-------| +| **Success rate (pass@100)** | 16% (16/100 trials found solution) | +| **Avg samples to solution** | 43.3 (when found) | +| **Min samples** | 5 (lucky guess) | +| **Max samples** | 91 | + +This establishes the "no learning" baseline. Any method that beats this is demonstrating actual optimization, not just random sampling. + ### Iterative Refinement Results (3 trials, 100 iterations max) | Trial | Iterations | Result | Best Score | @@ -174,31 +140,31 @@ This generates: **Key observation**: The iterative agent repeatedly finds configurations with 3/4 correct modules (`csv_reader`, `quicksort`, `json`) but cannot identify that `preprocess` is the wrong module. It keeps cycling through variations without escaping this local optimum. 
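+
+For reference, the random-baseline numbers above can be reproduced with a short simulation. This is a minimal sketch, not the actual `run_random_baseline.py`: it assumes the 4-module, 5-option search space (5^4 = 625 configurations) and uses placeholder option names:
+
+```python
+import random
+
+# Illustrative stand-in for the real problem: 4 modules with 5 options each (625 configs)
+OPTIONS = {m: [f"{m}_v{i}" for i in range(5)]
+           for m in ["reader", "preprocess", "sorter", "formatter"]}
+TARGET = {m: opts[0] for m, opts in OPTIONS.items()}  # exactly one correct configuration
+
+def pass_at_k(k: int, trials: int = 2000) -> float:
+    """Empirical probability that at least one of k uniform samples hits the target."""
+    hits = sum(
+        any({m: random.choice(o) for m, o in OPTIONS.items()} == TARGET
+            for _ in range(k))
+        for _ in range(trials)
+    )
+    return hits / trials
+
+for k in (1, 10, 20, 50, 100):
+    # Theoretical value for a uniform search over 625 configs: 1 - (624/625)^k
+    print(f"pass@{k}: empirical={pass_at_k(k):.1%}, theoretical={1 - (624/625) ** k:.1%}")
+```
+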
-### OpenEvolve (Evolutionary) Results +### OpenEvolve (Evolutionary) Results (3 trials, 100 iterations max) -| Trial | Iterations | Result | Best Score | Notes | -|-------|------------|--------|------------|-------| -| 1 | 21 | SUCCESS | 100% (4/4) | Solution found through population diversity | +| Trial | Iterations | Result | Best Score | +|-------|------------|--------|------------| +| 1 | 18 | SUCCESS | 100% (4/4) | +| 2 | 50 | SUCCESS | 100% (4/4) | +| 3 | 89 | SUCCESS | 100% (4/4) | **Summary:** -- **Success rate**: 100% (1/1 trial found solution) -- **Solution found at**: Iteration 21 -- **Key observation**: OpenEvolve's population-based approach explores multiple configurations in parallel. By iteration 9, the population already had diverse configurations, and by iteration 21, the correct combination was discovered. +- **Success rate**: 100% (3/3 trials found solution) +- **Avg iterations to solution**: 52.3 +- **Min iterations**: 18 +- **Max iterations**: 89 -**Progression:** -- Iteration 3: 25% (1/4) - Initial exploration -- Iteration 9: 50% (2/4) - Multiple 50% configs in population -- Iteration 21: 100% (4/4) - csv_reader, normalize, quicksort, json - PERFECT! - -**Key advantage**: OpenEvolve's prompt encourages systematic exploration ("try DIFFERENT options for EACH module") rather than following potentially misleading hints. Combined with higher temperature (0.9), larger population (25), and more frequent migration, this leads to faster discovery. +**Key advantage**: OpenEvolve's population-based approach maintains diverse configurations that explore different module combinations in parallel. Even when some individuals get stuck at local optima (75% with wrong preprocessing), others explore alternatives and eventually discover the correct solution. ### Comparison Summary -| Method | Success Rate | Evaluations to Solution | Key Limitation | -|--------|-------------|------------------------|----------------| -| **Random Baseline** | 16% | 43.3 avg (when found) | No learning | -| **Iterative Refinement** | 33% | 13 (when found) | Gets stuck at 75%, can't escape local optima | -| **OpenEvolve** | 100% | 21 | Population diversity + systematic exploration | +| Method | Success Rate | Avg Iterations | Key Finding | +|--------|-------------|----------------|-------------| +| **Random Baseline** | 16% | 43.3 (when found) | No learning baseline | +| **Iterative Refinement** | 33% (1/3) | 13 (when found) | Gets stuck at 75% local optimum | +| **OpenEvolve** | **100% (3/3)** | 52.3 | Always finds solution | + +**Key insight**: While OpenEvolve takes more iterations on average (52.3 vs 13), it has a **100% success rate** compared to iterative refinement's 33%. The evolutionary approach's population diversity ensures it eventually escapes local optima that trap single-trajectory methods. 
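+
+The crossover mechanism behind this result - the "Building Block Hypothesis" described at the top of this README - can be illustrated in isolation. A minimal sketch, not OpenEvolve's actual operators; option names other than the four correct modules are made up:
+
+```python
+import random
+
+MODULES = ["reader", "preprocess", "sorter", "formatter"]
+CORRECT = {"reader": "csv_reader", "preprocess": "normalize",
+           "sorter": "quicksort", "formatter": "json"}
+
+def fitness(config):
+    """Fraction of correct modules, mirroring the 1/4-per-module scores above."""
+    return sum(config[m] == CORRECT[m] for m in MODULES) / len(MODULES)
+
+def crossover(a, b):
+    """Uniform crossover: each module is inherited from a randomly chosen parent."""
+    return {m: random.choice((a[m], b[m])) for m in MODULES}
+
+# Two 50% parents carrying complementary correct building blocks
+p1 = {"reader": "csv_reader", "preprocess": "normalize", "sorter": "mergesort", "formatter": "xml"}
+p2 = {"reader": "raw_reader", "preprocess": "denoise", "sorter": "quicksort", "formatter": "json"}
+
+# Each child is perfect with probability (1/2)^4 = 1/16, so 100 children
+# almost surely include a 4/4 configuration.
+children = [crossover(p1, p2) for _ in range(100)]
+print(max(fitness(c) for c in children))
+```
+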
## Why This Matters

@@ -224,6 +190,7 @@ Real-world examples:

| `config.yaml` | OpenEvolve configuration |
| `iterative_agent.py` | Iterative refinement agent using OpenRouter API |
| `run_iterative_trials.py` | Run multiple trials of iterative agent |
+| `run_openevolve_trials.py` | Run multiple trials of OpenEvolve |
| `run_random_baseline.py` | Random search baseline with pass@k analysis |
| `compare_results.py` | Analysis and visualization |
diff --git a/examples/k_module_problem/config.yaml b/examples/k_module_problem/config.yaml
index 0eb244652..5812b1c04 100644
--- a/examples/k_module_problem/config.yaml
+++ b/examples/k_module_problem/config.yaml
@@ -81,7 +81,7 @@ evaluator:
  use_llm_feedback: false
  enable_artifacts: true

-# Early stopping - stop when we find the solution
-early_stopping_patience: 30  # Reduced - expect faster convergence
+# Early stopping - patience raised to the full iteration budget so runs are never cut short
+early_stopping_patience: 100  # Matches the 100-iteration trial budget
 convergence_threshold: 0.001
 early_stopping_metric: "combined_score"
diff --git a/examples/k_module_problem/run_openevolve_trials.py b/examples/k_module_problem/run_openevolve_trials.py
new file mode 100644
index 000000000..3db4d2111
--- /dev/null
+++ b/examples/k_module_problem/run_openevolve_trials.py
@@ -0,0 +1,169 @@
+#!/usr/bin/env python3
+"""Run multiple trials of OpenEvolve to get statistics."""
+
+import json
+import os
+import re
+import shutil
+import subprocess
+from pathlib import Path
+from typing import Optional
+
+# Run from the example directory
+os.chdir(Path(__file__).parent)
+
+
+def run_trial(trial_num: int, max_iterations: int = 100, seed: Optional[int] = None):
+    """Run a single OpenEvolve trial and parse its results from the logs."""
+    output_dir = f"openevolve_output_trial_{trial_num}"
+
+    # Clean output directory
+    if os.path.exists(output_dir):
+        shutil.rmtree(output_dir)
+
+    # Update config with new seed if provided
+    if seed is not None:
+        # Read config
+        with open("config.yaml", "r") as f:
+            config_content = f.read()
+
+        # Replace seed
+        config_content = re.sub(r'random_seed:\s*\d+', f'random_seed: {seed}', config_content)
+
+        # Write temp config
+        temp_config = f"config_trial_{trial_num}.yaml"
+        with open(temp_config, "w") as f:
+            f.write(config_content)
+    else:
+        temp_config = "config.yaml"
+
+    # Run OpenEvolve
+    cmd = [
+        "openevolve-run",
+        "initial_program.py",
+        "evaluator.py",
+        "--config", temp_config,
+        "--iterations", str(max_iterations),
+        "--output", output_dir,
+    ]
+
+    print(f"\n{'='*60}")
+    print(f"TRIAL {trial_num + 1}: Running OpenEvolve with seed {seed}")
+    print('='*60)
+
+    result = subprocess.run(cmd, capture_output=True, text=True)
+    if result.returncode != 0:
+        # Surface failures instead of silently parsing empty logs
+        print(f"WARNING: openevolve-run exited with code {result.returncode}")
+
+    # Clean up temp config
+    if seed is not None and os.path.exists(temp_config):
+        os.remove(temp_config)
+
+    # Parse results from log
+    solution_found_at = None
+    best_score = 0.0
+
+    log_dir = Path(output_dir) / "logs"
+    if log_dir.exists():
+        log_files = list(log_dir.glob("*.log"))
+        if log_files:
+            with open(log_files[0], "r") as f:
+                log_content = f.read()
+
+            # Find best score
+            score_matches = re.findall(r'combined_score[=:]\s*([\d.]+)', log_content)
+            if score_matches:
+                best_score = max(float(s) for s in score_matches)
+
+            # Look for the first iteration line reporting a 100% solution
+            new_best_matches = re.findall(r'New best solution found at iteration (\d+):', log_content)
+            perfect_matches = re.findall(r'Iteration (\d+):.*?combined_score=1\.0000', log_content)
+
+            if perfect_matches:
+                solution_found_at = int(perfect_matches[0])
+            elif best_score >= 1.0 and new_best_matches:
+                # Fallback: the best score reached 100%, so use the last "New best" iteration
+                solution_found_at = int(new_best_matches[-1])
+
+    return {
+        "trial": trial_num,
+        "seed": seed,
+        "solution_found_at": solution_found_at,
+        "best_score": best_score,
+        "max_iterations": max_iterations,
+    }
+
+
+def run_trials(num_trials: int = 3, max_iterations: int = 100, base_seed: int = 100):
+    """Run multiple trials and collect statistics."""
+    results = []
+    solutions_found = []
+
+    for trial in range(num_trials):
+        seed = base_seed + trial * 111  # Different seed for each trial
+        result = run_trial(trial, max_iterations, seed)
+        results.append(result)
+
+        if result["solution_found_at"] is not None:
+            solutions_found.append(result["solution_found_at"])
+            print(f"Trial {trial + 1}: SUCCESS at iteration {result['solution_found_at']}")
+        else:
+            print(f"Trial {trial + 1}: FAILED (best score: {result['best_score']:.2%})")
+
+    # Calculate statistics
+    success_rate = len(solutions_found) / num_trials
+    avg_iterations = sum(solutions_found) / len(solutions_found) if solutions_found else None
+    min_iterations = min(solutions_found) if solutions_found else None
+    max_iterations_found = max(solutions_found) if solutions_found else None
+
+    print(f"\n{'='*60}")
+    print("OPENEVOLVE TRIAL RESULTS")
+    print('='*60)
+    print(f"Trials: {num_trials}")
+    print(f"Max iterations per trial: {max_iterations}")
+    print(f"Success rate: {success_rate:.0%} ({len(solutions_found)}/{num_trials})")
+    if solutions_found:
+        print(f"Avg iterations to solution: {avg_iterations:.1f}")
+        print(f"Min iterations: {min_iterations}")
+        print(f"Max iterations: {max_iterations_found}")
+    print('='*60)
+
+    # Save summary
+    summary = {
+        "config": {
+            "num_trials": num_trials,
+            "max_iterations": max_iterations,
+        },
+        "summary": {
+            "success_rate": success_rate,
+            "avg_iterations_to_solution": avg_iterations,
+            "min_iterations": min_iterations,
+            "max_iterations": max_iterations_found,
+            "solutions_found": len(solutions_found),
+        },
+        "trials": results,
+    }
+
+    with open("openevolve_trials_results.json", "w") as f:
+        json.dump(summary, f, indent=2)
+
+    print("\nResults saved to: openevolve_trials_results.json")
+
+    # Clean up trial output directories
+    for trial in range(num_trials):
+        output_dir = f"openevolve_output_trial_{trial}"
+        if os.path.exists(output_dir):
+            shutil.rmtree(output_dir)
+
+    return summary
+
+
+if __name__ == "__main__":
+    import argparse
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--trials", type=int, default=3, help="Number of trials")
+    parser.add_argument("--iterations", type=int, default=100, help="Max iterations per trial")
+    parser.add_argument("--seed", type=int, default=100, help="Base random seed")
+    args = parser.parse_args()
+
+    run_trials(num_trials=args.trials, max_iterations=args.iterations, base_seed=args.seed)
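+
+# Minimal sketch for consuming the saved results elsewhere (keys exactly as
+# built in run_trials() above; the file name is this script's default output):
+#
+#     import json
+#     with open("openevolve_trials_results.json") as f:
+#         data = json.load(f)
+#     print(data["summary"]["success_rate"], data["summary"]["avg_iterations_to_solution"])
+#     for t in data["trials"]:
+#         print(t["trial"], t["seed"], t["solution_found_at"], t["best_score"])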