**File:** `dev/swebench/README_art_style.md` (new file, +142 lines)

# ART-Style Training for SWE-bench

This implementation provides an ART (Agent Reinforcement Trainer)-style training script for SWE-bench, inspired by `qwen_rollout.py` but following idiomatic ART patterns.

## Files

- `art_style_rollout.py` - Core rollout function that executes agent interactions in ART style
- `train_art_style.py` - Main training script with both inference and training modes
- `test_art_style.py` - Test script to verify the implementation

## Key Features

### ART-Style Rollout (`art_style_rollout.py`)

The rollout function follows ART idioms (a minimal sketch follows this list):
- Returns `art.Trajectory` objects with messages, rewards, and metrics
- Uses retry decorators for robustness
- Tracks detailed metrics including progress, maintenance, and resolution
- Implements proper tool handling for bash commands and file editing
- Calculates rewards based on test pass/fail rates
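
A minimal sketch of that shape is below. `run_agent_step` and `score_instance` are hypothetical helpers, and the exact `art.Trajectory` constructor arguments are assumed from the description above rather than taken from `art_style_rollout.py`:

```python
import art


async def run_agent_step(model, instance, trajectory) -> bool:
    """Hypothetical helper: execute one bash/file-edit tool call, append the
    exchange to trajectory.messages_and_choices, and report whether the agent
    has decided it is done."""
    return True  # placeholder


async def score_instance(instance):
    """Hypothetical helper: run the instance's tests and return (reward, metrics)."""
    return 0.0, {"progress": 0.0, "maintenance": 0.0, "resolved": 0.0}


async def sketch_rollout(model: art.Model, instance: dict) -> art.Trajectory:
    trajectory = art.Trajectory(
        messages_and_choices=[],  # chat history accumulated during the run
        reward=0.0,
        metrics={"progress": 0.0, "maintenance": 0.0, "resolved": 0.0},
    )
    for _ in range(30):  # max_steps from ARTModelConfig
        if await run_agent_step(model, instance, trajectory):
            break
    trajectory.reward, test_metrics = await score_instance(instance)
    trajectory.metrics.update(test_metrics)
    return trajectory
```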

### Training Script (`train_art_style.py`)

Supports two modes:

1. **Inference Mode** - For testing with existing models:
   ```bash
   python train_art_style.py --mode inference --num-instances 10
   ```

2. **Training Mode** - For training new models with gradients:
   ```bash
   python train_art_style.py --mode train --epochs 1 --batch-size 4
   ```

### Configuration

The `ARTModelConfig` class allows customization of the following (a construction sketch follows this list):
- `max_steps`: Maximum interaction steps (default: 30)
- `temperature`: Model temperature (default: 0.0)
- `max_tokens`: Maximum tokens per response (default: 4096)
- `system_prompt`: System prompt for the model
- `instance_prompt_template`: Template for problem descriptions
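
For example, a customized config might be constructed like this; the import path, the exact constructor signature, and the prompt strings are assumptions, while the field names come from the list above:

```python
from art_style_rollout import ARTModelConfig  # import path assumed

config = ARTModelConfig(
    max_steps=30,
    temperature=0.0,
    max_tokens=4096,
    system_prompt="You are an autonomous software engineer...",  # illustrative prompt
    instance_prompt_template="{problem_statement}",               # illustrative template
)
```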

### Command-Line Options

Key training script options:
- `--mode`: Choose between `inference` (no gradients) or `train` (with gradients)
- `--model`: Model name or path
- `--num-instances`: Number of instances to use (inference mode)
- `--batch-size`: Batch size for training
- `--rollouts-per-instance`: Number of rollouts per instance
- `--epochs`: Number of training epochs
- `--learning-rate`: Learning rate for training
- `--reward-power`: Power to apply to progress metric
- `--no-quality-filter`: Disable quality filtering (not recommended)
- `--require-non-zero-tests`: Require instances to have tests (default: True)

## Usage Examples

### Quick Test
```bash
# Test a single instance
python test_art_style.py

# Test a trajectory group
python test_art_style.py group
```

### Inference with Local Model
```bash
python train_art_style.py \
--mode inference \
--model "willcb/Qwen3-32B" \
--api-base "http://localhost:8000/v1" \
--num-instances 5 \
--rollouts-per-instance 2
```

### Disable Quality Filtering (Not Recommended)
```bash
# Use all instances without quality filtering
python train_art_style.py \
--mode inference \
--model "willcb/Qwen3-32B" \
--num-instances 10 \
--no-quality-filter
```

### Training a New Model
```bash
python train_art_style.py \
--mode train \
--model "Qwen/Qwen3-32B" \
--batch-size 4 \
--rollouts-per-instance 4 \
--epochs 1 \
--learning-rate 5e-5
```

## Reward Calculation

The reward function follows the same formula as the original implementation:
- 20% weight on test maintenance (keeping passing tests passing)
- 30% weight on progress (fixing failing tests)
- 50% weight on full resolution (all tests passing)

The `reward_power` parameter can be used to adjust the progress component.
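
As a sketch, the weighting above can be written as follows; treating `reward_power` as an exponent on the progress fraction is an assumption about how the adjustment is applied.

```python
def compute_reward(
    maintenance: float,       # fraction of initially passing tests that still pass
    progress: float,          # fraction of initially failing tests that now pass
    resolved: bool,           # True if every test passes
    reward_power: float = 1.0,
) -> float:
    return (
        0.2 * maintenance
        + 0.3 * (progress ** reward_power)
        + 0.5 * (1.0 if resolved else 0.0)
    )
```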

## Quality Filtering

The implementation includes an instance quality filter that identifies instances with reliable test behavior. **Quality filtering is enabled by default.**

- **Keeps only instances where** (see the sketch after this list):
1. All FAIL_TO_PASS tests initially pass
2. All FAIL_TO_PASS tests fail after applying the patch (bug introduction)
3. All PASS_TO_PASS tests remain passing

- **Usage**: Quality filtering is automatic. To disable it (not recommended), use `--no-quality-filter`
- **Statistics**: Approximately 54% of instances (4,577 out of 8,480) meet quality criteria
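
A sketch of that predicate, assuming per-test boolean results (the dict-of-booleans layout is illustrative, not the actual data structure used by the filter):

```python
def is_reliable_instance(f2p_before, f2p_after, p2p_after) -> bool:
    """f2p_before / f2p_after: FAIL_TO_PASS results (test name -> passed) before
    and after the bug patch; p2p_after: PASS_TO_PASS results after the patch."""
    return (
        all(f2p_before.values())          # 1. every F2P test passes at baseline
        and not any(f2p_after.values())   # 2. every F2P test fails once the bug is introduced
        and all(p2p_after.values())       # 3. every P2P test keeps passing
    )
```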

## Differences from Original Implementation

1. **Simplified Architecture**: No dependency on SWE-Agent framework
2. **Direct Tool Implementation**: Tools are implemented directly without complex abstractions
3. **ART Integration**: Native support for ART training loops and trajectory management
4. **Cleaner Error Handling**: Uses ART retry decorators and proper exception handling
5. **Quality Filtering**: Built-in filtering based on test reliability

## Requirements

- Python 3.8+
- ART framework (`art`)
- OpenAI Python client
- PyTorch (for training mode)
- Access to SWE-bench instances and sandboxes

## Environment Variables

- `OPENAI_API_KEY`: API key for OpenAI (can be "default" for local inference)
- `OPENAI_BASE_URL`: Base URL for API (e.g., "http://localhost:8000/v1")
- Standard SWE-bench environment variables

---

**File:** `dev/swebench/analysis/INVESTIGATION_NOTES.md` (new file, +110 lines)

# SWE-bench Investigation Notes

**Last Updated**: 2025-07-10
**Analysis Summary**: 74.6% perfect pass rate (6,329/8,480 instances)

## Critical Issues: Repositories with 0% Perfect Pass Rate

### 1. P2P-Dominant Failures (Patches Break Existing Tests)

These repositories have near-100% P2P failure rates, meaning patches consistently break existing functionality:

#### **oauthlib** (164/166 instances fail P2P - 98.8%)
- **Pattern**: All patches break existing OAuth functionality tests
- **Example instances**:
- `oauthlib__oauthlib.1fd52536.lm_rewrite__q9ve64pd`: 670 P2P failures
- `oauthlib__oauthlib.1fd52536.lm_rewrite__rchxmbd6`: 625 P2P failures
- **Hypothesis**: OAuth library may have tightly coupled components where any change breaks auth flows

#### **cloudpipe** (37/37 instances fail P2P - 100%)
- **Pattern**: Every single patch breaks existing tests
- **Hypothesis**: Likely has integration tests that are sensitive to any code changes

#### **pyparsing** (28/29 instances fail P2P - 96.6%)
- **Pattern**: Parser modifications break existing parsing tests
- **Hypothesis**: Grammar/parser changes have cascading effects

### 2. F2P-Initial Failures (Baseline Tests Don't Pass)

These repositories have tests that fail even before applying patches:

#### **Project-MONAI** (159/161 instances fail F2P-initial - 98.8%)
- **Pattern**: Tests don't pass in baseline state
- **Hypothesis**: May require specific GPU/CUDA setup or have incorrect test specifications

#### **burnash** (26/26 instances fail F2P-initial - 100%)
- **Pattern**: All baseline tests fail
- **Hypothesis**: Test specifications may be incorrect or environment setup issues

#### **python-trio** (90/139 instances fail F2P-initial - 64.7%)
- **Pattern**: Mix of F2P-initial (90) and P2P (125) failures
- **Example**: `python-trio__trio.cfbbe2c1.pr_2937`: 730 F2P initial failures
- **Hypothesis**: Async testing framework may have special requirements

### 3. Mixed Failure Patterns

#### **seperman** (170 P2P + 115 F2P-initial failures out of 172 instances)
- **Pattern**: Both baseline and regression failures
- **Hypothesis**: Fundamental test environment or specification issues

#### **django-money**, **facebookresearch**, **tweepy**, **aio-libs**
- All show similar patterns of both F2P-initial and P2P failures
- Suggests both incorrect test specs AND patches that break functionality

## Root Cause Analysis

### Environmental Issues
1. **GPU/CUDA Requirements**: Project-MONAI likely needs GPU setup
2. **Async Test Runners**: python-trio, aio-libs may need special async test configurations
3. **Database/Services**: django-money might need database setup

### Test Specification Issues
1. **Incorrect Baseline**: F2P tests that fail initially suggest wrong test selections
2. **Version Mismatches**: Tests may be from different versions than the code

### Patch Quality Issues
1. **Over-broad Changes**: Patches might modify more than intended
2. **Missing Context**: Patches may not account for all usages of modified code

## Debugging Strategy

### Quick Checks
```bash
# Check a specific failing instance
uv run python analyze_results.py | grep "oauthlib__oauthlib"

# Look at error patterns for a specific repo
grep "oauthlib" unique_errors.txt
```

```python
# Run a single instance manually (from an async context, with `instance` loaded
# from the dataset and `new_sandbox` imported from the project's sandbox helpers)
async with new_sandbox(image="swesmith/oauthlib__oauthlib.1fd52536", provider="daytona") as sandbox:
    failed, passed = await sandbox.eval(instance["FAIL_TO_PASS"])
```

### Deep Investigation Steps
1. **Pick one instance from each failure category**
2. **Run with verbose logging to see actual test output**
3. **Compare working vs failing repos to identify patterns**
4. **Check if special test commands or setup is needed**

## Recommendations for Future Work

### High Priority
1. **oauthlib**: Investigate why ALL patches break OAuth flows
2. **Project-MONAI**: Check GPU requirements and test specifications
3. **seperman**: High volume of failures (170) makes this impactful to fix

### Medium Priority
1. **python-trio**: Mixed failures suggest complex issues
2. **cloudpipe**: Small repo but 100% failure rate is concerning

### System Improvements
1. Add pre-flight checks for GPU/special requirements
2. Validate test specifications before running
3. Add regression test validation to patch generation
4. Consider repo-specific configuration overrides

## Success Metrics
- Current: 74.6% perfect pass rate
- Goal: >85% by addressing top 5 problematic repos
- 52 repos already at 100% - system works well for standard cases

---

**File:** `dev/swebench/analysis/SWE_BENCH_ANALYSIS_REPORT.md` (new file, +103 lines)

# SWE-Bench Test Results Analysis Report

**Date**: 2025-07-10
**Total Instances Analyzed**: 8,480 (unique instances, latest run per instance)

## Executive Summary

### Overall Performance
- **Perfect Pass Rate**: 74.6% (6,329/8,480)
- **Test Failures**: 24.2% (2,055 instances)
- **Errors**: 1.5% (122 instances)

A "perfect pass" means:
- No errors during sandbox creation or test execution
- All FAIL_TO_PASS tests passed initially (baseline behavior)
- All FAIL_TO_PASS tests failed after patch (patch introduces intended bug)
- All PASS_TO_PASS tests passed (no regression)

## Key Findings

### 1. Most Common Failure Modes

| Failure Type | Count | Percentage | Description |
|-------------|-------|------------|-------------|
| P2P Failures | 1,758 | 20.7% | Patches broke existing functionality |
| F2P Initial Failures | 1,171 | 13.8% | Tests failed baseline (should pass initially) |
| F2P Post Pass | 182 | 2.1% | Tests still passed after patch (should fail) |
| Execution Errors | 122 | 1.5% | Timeouts, sandbox failures, etc. |

### 2. Repositories with Fundamental Issues (0% Perfect Pass Rate)

15 repositories achieved 0% perfect pass rate, indicating systematic problems:

#### P2P-Dominant Failures (patches consistently break existing tests):
- **oauthlib**: 164/166 instances fail P2P (98.8%)
- **cloudpipe**: 37/37 instances fail P2P (100%)
- **pyparsing**: 28/29 instances fail P2P (96.6%)
- **life4**: 40/43 instances fail P2P (93.0%)
- **borntyping**: 16/17 instances fail P2P (94.1%)

#### F2P-Initial Dominant Failures (baseline tests don't pass):
- **Project-MONAI**: 159/161 instances fail F2P-initial (98.8%)
- **burnash**: 26/26 instances fail F2P-initial (100%)
- **python-trio**: 90/139 instances fail F2P-initial (64.7%)
- **Cog-Creators**: 64/91 instances fail F2P-initial (70.3%)
- **alanjds**: 18/27 instances fail F2P-initial (66.7%)

#### Mixed Failures (both baseline and regression issues):
- **seperman**: 115 F2P-initial + 170 P2P failures (172 instances)
- **django-money**: 67 F2P-initial + 67 P2P failures (68 instances)
- **facebookresearch**: 57 F2P-initial + 71 P2P failures (72 instances)
- **tweepy**: 52 F2P-initial + 52 P2P failures (52 instances)
- **aio-libs**: 8 F2P-initial + 8 P2P failures (8 instances)

### 3. Repository Performance Distribution

| Perfect Pass Rate | Repository Count | Percentage |
|------------------|------------------|------------|
| 0% | 15 | 14.3% |
| 1-25% | 5 | 4.8% |
| 26-50% | 4 | 3.8% |
| 51-75% | 3 | 2.9% |
| 76-99% | 26 | 24.8% |
| 100% | 52 | 49.5% |

### 4. Error Analysis

Most errors (1.5% of instances) were infrastructure-related:
- 504 Gateway Timeout: 76 instances
- Command execution timeout: 64 instances
- Empty command errors: 31 instances
- Sandbox creation failures: 10 instances

## Recommendations

### High Priority Investigations

1. **oauthlib & cloudpipe**: Near 100% P2P failure rate suggests patches consistently break core functionality. These need immediate review of patch generation logic.

2. **Project-MONAI & burnash**: 100% F2P-initial failure rate indicates test specifications may be incorrect or tests are not properly capturing baseline behavior.

3. **seperman**: Highest absolute number of failures (170 P2P failures) combined with F2P-initial failures suggests both test specification and patch generation issues.

### System Improvements

1. **Test Specification Review**: 13.8% F2P-initial failure rate suggests many test specifications don't correctly capture baseline behavior.

2. **Patch Quality**: 20.7% P2P failure rate indicates patches often break existing functionality. Consider adding regression checks.

3. **Infrastructure**: Raising the test count limit to 8,000 resolved the earlier failures caused by instances exceeding the lower limit.

## Success Stories

- 52 repositories (49.5%) achieved 100% perfect pass rate
- 74.6% overall perfect pass rate shows the system works well for most cases
- Very low error rate (1.5%) indicates stable infrastructure

## Technical Notes

- Analysis based on latest run per instance (aggregated by instance_id)
- Used daytona provider with up to 128 concurrent sandboxes
- Retry logic with exponential backoff for transient failures (see the sketch below)
- Test count limit increased from 3,000 to 8,000
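
The retry behaviour referenced above corresponds roughly to a wrapper like this sketch; the attempt count and delay values are illustrative, not the ones used in the run.

```python
import asyncio
import random


async def with_backoff(run, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry an async callable with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return await run()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt + random.random())
```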