**File:** `dev/swebench/README_art_style.md` (new file, +142 lines)

# ART-Style Training for SWE-bench

This implementation provides an ART (Agent Reinforcement Trainer)-style training script for SWE-bench, inspired by `qwen_rollout.py` but following idiomatic ART patterns.

## Files

- `art_style_rollout.py` - Core rollout function that executes agent interactions in ART style
- `train_art_style.py` - Main training script with both inference and training modes
- `test_art_style.py` - Test script to verify the implementation

## Key Features

### ART-Style Rollout (`art_style_rollout.py`)

The rollout function follows ART idioms (a minimal sketch follows this list):
- Returns `art.Trajectory` objects with messages, rewards, and metrics
- Uses retry decorators for robustness
- Tracks detailed metrics including progress, maintenance, and resolution
- Implements proper tool handling for bash commands and file editing
- Calculates rewards based on test pass/fail rates
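
A minimal sketch of that shape is below. `run_agent_step` and `score_instance` are hypothetical helpers, and the exact `art.Trajectory` constructor arguments are assumed from the description above rather than taken from `art_style_rollout.py`:

```python
import art


async def run_agent_step(model, instance, trajectory) -> bool:
    """Hypothetical helper: execute one bash/file-edit tool call, append the
    exchange to trajectory.messages_and_choices, and report whether the agent
    has decided it is done."""
    return True  # placeholder


async def score_instance(instance):
    """Hypothetical helper: run the instance's tests and return (reward, metrics)."""
    return 0.0, {"progress": 0.0, "maintenance": 0.0, "resolved": 0.0}


async def sketch_rollout(model: art.Model, instance: dict) -> art.Trajectory:
    trajectory = art.Trajectory(
        messages_and_choices=[],  # chat history accumulated during the run
        reward=0.0,
        metrics={"progress": 0.0, "maintenance": 0.0, "resolved": 0.0},
    )
    for _ in range(30):  # max_steps from ARTModelConfig
        if await run_agent_step(model, instance, trajectory):
            break
    trajectory.reward, test_metrics = await score_instance(instance)
    trajectory.metrics.update(test_metrics)
    return trajectory
```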

### Training Script (`train_art_style.py`)

Supports two modes:

1. **Inference Mode** - For testing with existing models:
   ```bash
   python train_art_style.py --mode inference --num-instances 10
   ```

2. **Training Mode** - For training new models with gradients:
   ```bash
   python train_art_style.py --mode train --epochs 1 --batch-size 4
   ```

### Configuration

The `ARTModelConfig` class allows customization of the following (a construction sketch follows this list):
- `max_steps`: Maximum interaction steps (default: 30)
- `temperature`: Model temperature (default: 0.0)
- `max_tokens`: Maximum tokens per response (default: 4096)
- `system_prompt`: System prompt for the model
- `instance_prompt_template`: Template for problem descriptions
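
For example, a customized config might be constructed like this; the import path, the exact constructor signature, and the prompt strings are assumptions, while the field names come from the list above:

```python
from art_style_rollout import ARTModelConfig  # import path assumed

config = ARTModelConfig(
    max_steps=30,
    temperature=0.0,
    max_tokens=4096,
    system_prompt="You are an autonomous software engineer...",  # illustrative prompt
    instance_prompt_template="{problem_statement}",               # illustrative template
)
```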

### Command-Line Options

Key training script options:
- `--mode`: Choose between `inference` (no gradients) or `train` (with gradients)
- `--model`: Model name or path
- `--num-instances`: Number of instances to use (inference mode)
- `--batch-size`: Batch size for training
- `--rollouts-per-instance`: Number of rollouts per instance
- `--epochs`: Number of training epochs
- `--learning-rate`: Learning rate for training
- `--reward-power`: Power to apply to progress metric
- `--no-quality-filter`: Disable quality filtering (not recommended)
- `--require-non-zero-tests`: Require instances to have tests (default: True)

## Usage Examples

### Quick Test
```bash
# Test a single instance
python test_art_style.py

# Test a trajectory group
python test_art_style.py group
```

### Inference with Local Model
```bash
python train_art_style.py \
--mode inference \
--model "willcb/Qwen3-32B" \
--api-base "http://localhost:8000/v1" \
--num-instances 5 \
--rollouts-per-instance 2
```

### Disable Quality Filtering (Not Recommended)
```bash
# Use all instances without quality filtering
python train_art_style.py \
--mode inference \
--model "willcb/Qwen3-32B" \
--num-instances 10 \
--no-quality-filter
```

### Training a New Model
```bash
python train_art_style.py \
--mode train \
--model "Qwen/Qwen3-32B" \
--batch-size 4 \
--rollouts-per-instance 4 \
--epochs 1 \
--learning-rate 5e-5
```

## Reward Calculation

The reward function follows the same formula as the original implementation:
- 20% weight on test maintenance (keeping passing tests passing)
- 30% weight on progress (fixing failing tests)
- 50% weight on full resolution (all tests passing)

The `reward_power` parameter can be used to adjust the progress component.
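
As a sketch, the weighting above can be written as follows; treating `reward_power` as an exponent on the progress fraction is an assumption about how the adjustment is applied.

```python
def compute_reward(
    maintenance: float,       # fraction of initially passing tests that still pass
    progress: float,          # fraction of initially failing tests that now pass
    resolved: bool,           # True if every test passes
    reward_power: float = 1.0,
) -> float:
    return (
        0.2 * maintenance
        + 0.3 * (progress ** reward_power)
        + 0.5 * (1.0 if resolved else 0.0)
    )
```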

## Quality Filtering

The implementation includes an instance quality filter that identifies instances with reliable test behavior. **Quality filtering is enabled by default.**

- **Keeps only instances where** (see the sketch after this list):
1. All FAIL_TO_PASS tests initially pass
2. All FAIL_TO_PASS tests fail after applying the patch (bug introduction)
3. All PASS_TO_PASS tests remain passing

- **Usage**: Quality filtering is automatic. To disable it (not recommended), use `--no-quality-filter`
- **Statistics**: Approximately 54% of instances (4,577 out of 8,480) meet quality criteria
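
A sketch of that predicate, assuming per-test boolean results (the dict-of-booleans layout is illustrative, not the actual data structure used by the filter):

```python
def is_reliable_instance(f2p_before, f2p_after, p2p_after) -> bool:
    """f2p_before / f2p_after: FAIL_TO_PASS results (test name -> passed) before
    and after the bug patch; p2p_after: PASS_TO_PASS results after the patch."""
    return (
        all(f2p_before.values())          # 1. every F2P test passes at baseline
        and not any(f2p_after.values())   # 2. every F2P test fails once the bug is introduced
        and all(p2p_after.values())       # 3. every P2P test keeps passing
    )
```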

## Differences from Original Implementation

1. **Simplified Architecture**: No dependency on SWE-Agent framework
2. **Direct Tool Implementation**: Tools are implemented directly without complex abstractions
3. **ART Integration**: Native support for ART training loops and trajectory management
4. **Cleaner Error Handling**: Uses ART retry decorators and proper exception handling
5. **Quality Filtering**: Built-in filtering based on test reliability

## Requirements

- Python 3.8+
- ART framework (`art`)
- OpenAI Python client
- PyTorch (for training mode)
- Access to SWE-bench instances and sandboxes

## Environment Variables

- `OPENAI_API_KEY`: API key for OpenAI (can be "default" for local inference)
- `OPENAI_BASE_URL`: Base URL for API (e.g., "http://localhost:8000/v1")
- Standard SWE-bench environment variables

---

**File:** `dev/swebench/analysis/INVESTIGATION_NOTES.md` (new file, +110 lines)

# SWE-bench Investigation Notes

**Last Updated**: 2025-07-10
**Analysis Summary**: 74.6% perfect pass rate (6,329/8,480 instances)

## Critical Issues: Repositories with 0% Perfect Pass Rate

### 1. P2P-Dominant Failures (Patches Break Existing Tests)

These repositories have near-100% P2P failure rates, meaning patches consistently break existing functionality:

#### **oauthlib** (164/166 instances fail P2P - 98.8%)
- **Pattern**: All patches break existing OAuth functionality tests
- **Example instances**:
- `oauthlib__oauthlib.1fd52536.lm_rewrite__q9ve64pd`: 670 P2P failures
- `oauthlib__oauthlib.1fd52536.lm_rewrite__rchxmbd6`: 625 P2P failures
- **Hypothesis**: OAuth library may have tightly coupled components where any change breaks auth flows

#### **cloudpipe** (37/37 instances fail P2P - 100%)
- **Pattern**: Every single patch breaks existing tests
- **Hypothesis**: Likely has integration tests that are sensitive to any code changes

#### **pyparsing** (28/29 instances fail P2P - 96.6%)
- **Pattern**: Parser modifications break existing parsing tests
- **Hypothesis**: Grammar/parser changes have cascading effects

### 2. F2P-Initial Failures (Baseline Tests Don't Pass)

These repositories have tests that fail even before applying patches:

#### **Project-MONAI** (159/161 instances fail F2P-initial - 98.8%)
- **Pattern**: Tests don't pass in baseline state
- **Hypothesis**: May require specific GPU/CUDA setup or have incorrect test specifications

#### **burnash** (26/26 instances fail F2P-initial - 100%)
- **Pattern**: All baseline tests fail
- **Hypothesis**: Test specifications may be incorrect or environment setup issues

#### **python-trio** (90/139 instances fail F2P-initial - 64.7%)
- **Pattern**: Mix of F2P-initial (90) and P2P (125) failures
- **Example**: `python-trio__trio.cfbbe2c1.pr_2937`: 730 F2P initial failures
- **Hypothesis**: Async testing framework may have special requirements

### 3. Mixed Failure Patterns

#### **seperman** (170 P2P + 115 F2P-initial failures out of 172 instances)
- **Pattern**: Both baseline and regression failures
- **Hypothesis**: Fundamental test environment or specification issues

#### **django-money**, **facebookresearch**, **tweepy**, **aio-libs**
- All show similar patterns of both F2P-initial and P2P failures
- Suggests both incorrect test specs AND patches that break functionality

## Root Cause Analysis

### Environmental Issues
1. **GPU/CUDA Requirements**: Project-MONAI likely needs GPU setup
2. **Async Test Runners**: python-trio, aio-libs may need special async test configurations
3. **Database/Services**: django-money might need database setup

### Test Specification Issues
1. **Incorrect Baseline**: F2P tests that fail initially suggest wrong test selections
2. **Version Mismatches**: Tests may be from different versions than the code

### Patch Quality Issues
1. **Over-broad Changes**: Patches might modify more than intended
2. **Missing Context**: Patches may not account for all usages of modified code

## Debugging Strategy

### Quick Checks
```bash
# Check a specific failing instance
uv run python analyze_results.py | grep "oauthlib__oauthlib"

# Look at error patterns for a specific repo
grep "oauthlib" unique_errors.txt
```

```python
# Run a single instance manually (from an async context, with `instance` loaded
# from the dataset and `new_sandbox` imported from the project's sandbox helpers)
async with new_sandbox(image="swesmith/oauthlib__oauthlib.1fd52536", provider="daytona") as sandbox:
    failed, passed = await sandbox.eval(instance["FAIL_TO_PASS"])
```

### Deep Investigation Steps
1. **Pick one instance from each failure category**
2. **Run with verbose logging to see actual test output**
3. **Compare working vs failing repos to identify patterns**
4. **Check if special test commands or setup is needed**

## Recommendations for Future Work

### High Priority
1. **oauthlib**: Investigate why ALL patches break OAuth flows
2. **Project-MONAI**: Check GPU requirements and test specifications
3. **seperman**: High volume of failures (170) makes this impactful to fix

### Medium Priority
1. **python-trio**: Mixed failures suggest complex issues
2. **cloudpipe**: Small repo but 100% failure rate is concerning

### System Improvements
1. Add pre-flight checks for GPU/special requirements
2. Validate test specifications before running
3. Add regression test validation to patch generation
4. Consider repo-specific configuration overrides

## Success Metrics
- Current: 74.6% perfect pass rate
- Goal: >85% by addressing top 5 problematic repos
- 52 repos already at 100% - system works well for standard cases

---

**File:** `dev/swebench/analysis/SWE_BENCH_ANALYSIS_REPORT.md` (new file, +103 lines)

# SWE-Bench Test Results Analysis Report

**Date**: 2025-07-10
**Total Instances Analyzed**: 8,480 (unique instances, latest run per instance)

## Executive Summary

### Overall Performance
- **Perfect Pass Rate**: 74.6% (6,329/8,480)
- **Test Failures**: 24.2% (2,055 instances)
- **Errors**: 1.5% (122 instances)

A "perfect pass" means:
- No errors during sandbox creation or test execution
- All FAIL_TO_PASS tests passed initially (baseline behavior)
- All FAIL_TO_PASS tests failed after patch (patch introduces intended bug)
- All PASS_TO_PASS tests passed (no regression)

## Key Findings

### 1. Most Common Failure Modes

| Failure Type | Count | Percentage | Description |
|-------------|-------|------------|-------------|
| P2P Failures | 1,758 | 20.7% | Patches broke existing functionality |
| F2P Initial Failures | 1,171 | 13.8% | Tests failed baseline (should pass initially) |
| F2P Post Pass | 182 | 2.1% | Tests still passed after patch (should fail) |
| Execution Errors | 122 | 1.5% | Timeouts, sandbox failures, etc. |

### 2. Repositories with Fundamental Issues (0% Perfect Pass Rate)

15 repositories achieved 0% perfect pass rate, indicating systematic problems:

#### P2P-Dominant Failures (patches consistently break existing tests):
- **oauthlib**: 164/166 instances fail P2P (98.8%)
- **cloudpipe**: 37/37 instances fail P2P (100%)
- **pyparsing**: 28/29 instances fail P2P (96.6%)
- **life4**: 40/43 instances fail P2P (93.0%)
- **borntyping**: 16/17 instances fail P2P (94.1%)

#### F2P-Initial Dominant Failures (baseline tests don't pass):
- **Project-MONAI**: 159/161 instances fail F2P-initial (98.8%)
- **burnash**: 26/26 instances fail F2P-initial (100%)
- **python-trio**: 90/139 instances fail F2P-initial (64.7%)
- **Cog-Creators**: 64/91 instances fail F2P-initial (70.3%)
- **alanjds**: 18/27 instances fail F2P-initial (66.7%)

#### Mixed Failures (both baseline and regression issues):
- **seperman**: 115 F2P-initial + 170 P2P failures (172 instances)
- **django-money**: 67 F2P-initial + 67 P2P failures (68 instances)
- **facebookresearch**: 57 F2P-initial + 71 P2P failures (72 instances)
- **tweepy**: 52 F2P-initial + 52 P2P failures (52 instances)
- **aio-libs**: 8 F2P-initial + 8 P2P failures (8 instances)

### 3. Repository Performance Distribution

| Perfect Pass Rate | Repository Count | Percentage |
|------------------|------------------|------------|
| 0% | 15 | 14.3% |
| 1-25% | 5 | 4.8% |
| 26-50% | 4 | 3.8% |
| 51-75% | 3 | 2.9% |
| 76-99% | 26 | 24.8% |
| 100% | 52 | 49.5% |

### 4. Error Analysis

Most errors (1.5% of instances) were infrastructure-related:
- 504 Gateway Timeout: 76 instances
- Command execution timeout: 64 instances
- Empty command errors: 31 instances
- Sandbox creation failures: 10 instances

## Recommendations

### High Priority Investigations

1. **oauthlib & cloudpipe**: Near 100% P2P failure rate suggests patches consistently break core functionality. These need immediate review of patch generation logic.

2. **Project-MONAI & burnash**: 100% F2P-initial failure rate indicates test specifications may be incorrect or tests are not properly capturing baseline behavior.

3. **seperman**: Highest absolute number of failures (170 P2P failures) combined with F2P-initial failures suggests both test specification and patch generation issues.

### System Improvements

1. **Test Specification Review**: 13.8% F2P-initial failure rate suggests many test specifications don't correctly capture baseline behavior.

2. **Patch Quality**: 20.7% P2P failure rate indicates patches often break existing functionality. Consider adding regression checks.

3. **Infrastructure**: Raising the test count limit to 8,000 resolved the earlier failures caused by instances exceeding the lower limit.

## Success Stories

- 52 repositories (49.5%) achieved 100% perfect pass rate
- 74.6% overall perfect pass rate shows the system works well for most cases
- Very low error rate (1.5%) indicates stable infrastructure

## Technical Notes

- Analysis based on latest run per instance (aggregated by instance_id)
- Used daytona provider with up to 128 concurrent sandboxes
- Retry logic with exponential backoff for transient failures (see the sketch below)
- Test count limit increased from 3,000 to 8,000
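
The retry behaviour referenced above corresponds roughly to a wrapper like this sketch; the attempt count and delay values are illustrative, not the ones used in the run.

```python
import asyncio
import random


async def with_backoff(run, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry an async callable with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return await run()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt + random.random())
```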