
Project Contribution


See Installation to install Web-Bench from source.

1. Create new project

  1. Use rush new-project to create a new project:
     rush new-project
  2. Run:
     rush update
     cd projects/new-project-name
     npm t # should pass
  3. For more project configuration, see Project Hierarchy.

2. Design Project

  1. Plan the initial and final states of the project, as well as several key intermediate states.
  2. Sometimes it’s difficult to design everything in one step — you can first try 1 task, 3 tasks, or 5 tasks, and then go back to the previous step.
  3. Use readme.md as the Project Design Document. It is recommended to include: technical background, project background, Feature Coverage and Reference. Refer to the DOM readme for an example.

For projects focused on business scenarios, we recommend referencing calculator and svg-chart.

Video: prepare - write a task - write tests - write code.

new-project-3x.mp4

Directions that may be more challenging for LLMs:

  1. Multi-file projects tend to be more challenging than single-file ones — we’ve seen this in multiple existing projects.
  2. Single tasks with more detailed and feature-rich requirements pose greater difficulty.
  3. Logical dependencies between multiple tasks increase complexity.
  4. Tasks involving numerical computation can be particularly challenging.

Design Tasks

Figure: benchmark-construction
  1. Feature Coverage
    1. Standard or Framework: Aim for a high level of core capability coverage. Refer to the DOM Feature Coverage as an example.
    2. Business Scenarios: Prioritize frequently used scenarios, while including a proportion of advanced features.
  2. The 20 tasks should be arranged from easy to difficult. The difficulty distribution is for reference only (not strict):
    1. Tasks 1–5, Easy. Use rush eval to validate with the following models (see doc):
      "models": [
        "claude-3-7-sonnet-latest",
        "openai/gpt-4o",
        "deepseek/deepseek-chat"
      ]
    2. Tasks 6–10, Moderate. You can use the above models to verify feasibility; both success and failure are acceptable.
    3. Tasks 11–20, Challenging. These should ideally be difficult even for the current top-performing LLMs (i.e., not guaranteed to succeed every time).
  3. After completing the tasks, use an LLM to validate:
    • English grammar
    • Sentence clarity and conciseness
    • Logical consistency or potential issues

Task Data Structure

{"id":"init","date":"2024-11-15","level":"easy","description":"Generate html..."}
{"id":"task-1","date":"2024-11-15","level":"moderate","description":"Do something..."}
{"id":"task-20","date":"2025-1-5","level":"challenging","description":"Do something..."}
  • id, init | task-n
  • date, used to filter contaminated models
  • level, easy | moderate | challenging
  • description, detailed task description

Both JSONL and YAML formats are supported.
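
For reference, the task record can be sketched as a TypeScript interface. The type below is illustrative only, derived from the field list above; it is not an official type exported by the repository.

// Illustrative shape of one line in tasks.jsonl (not an official type).
type TaskLevel = 'easy' | 'moderate' | 'challenging'

interface Task {
  id: string          // 'init' or 'task-n'
  date: string        // used to filter contaminated models
  level: TaskLevel
  description: string // detailed task description
}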

Design Tests

  1. Write 3-5 test cases for each task (see the sketch below).
  2. Tests must pass against the latest code.
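
Below is a minimal sketch of what the 3-5 test cases for one task might look like, assuming Playwright is used as the test runner (as in existing projects). The task, selectors, and page path are hypothetical.

import { test, expect } from '@playwright/test'

// Hypothetical task-1: "Render a counter starting at 0 with an #increment button."
test.beforeEach(async ({ page }) => {
  await page.goto('/index.html')
})

test('counter renders with initial value 0', async ({ page }) => {
  await expect(page.locator('#counter')).toHaveText('0')
})

test('clicking the button increments the counter', async ({ page }) => {
  await page.click('#increment')
  await expect(page.locator('#counter')).toHaveText('1')
})

test('repeated clicks keep incrementing', async ({ page }) => {
  await page.click('#increment')
  await page.click('#increment')
  await expect(page.locator('#counter')).toHaveText('2')
})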

Design Codes

  1. src/ stores the latest code covering all tasks.

Reference Guide

This section is about extracting the underlying knowledge stack from the references below, to ensure that the designed project effectively covers core capabilities. Refer to the Paper, Section 2.2 "Standard and Framework", for guidance.

Standards

  1. Specifications: From official sources such as W3C, WHATWG, ECMA, etc.
  2. MDN: Reference documentation, demos, tutorials
  3. Well-known third-party tutorials: Independent sites (e.g., CSS-Tricks) or GitHub repositories

Frameworks

  1. Official tutorials of the framework
  2. Officially recommended use cases
  3. Well-known third-party demos

Business Scenarios

  1. Representative products for the given scenario
  2. Core technology stack or frameworks used

3. Calibrate Project

Use rush eval to calibrate a project. That is, check for design issues in the project:

  1. tasks.jsonl issues, such as ambiguous descriptions, vulnerabilities, or obvious conflicts with test cases.
  2. test/ case issues, such as tests that are not rigorous enough or are too strict (see the sketch below).
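
As a hypothetical illustration of test rigor (Playwright-style assertions; the task, selectors, and expected values are made up):

import { test, expect } from '@playwright/test'

// Hypothetical task: "Clicking #calc shows the sum 42 in an element with class 'result'."
test('shows the computed sum', async ({ page }) => {
  await page.goto('/index.html')
  await page.click('#calc')

  // Too loose: passes even if the displayed value is wrong.
  // await expect(page.locator('.result')).toBeVisible()

  // Too strict: couples the test to a DOM structure the task never asked for.
  // await expect(page.locator('div > span.result:nth-child(3)')).toHaveText('42')

  // Reasonable: checks exactly what the task description requires.
  await expect(page.locator('.result')).toHaveText('42')
})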

Assuming the current path is /path/to/projects/selector, run:

  1. rush eval with the following config.json5:
    {  
      projects: ['@web-bench/selector'],
      models: ['claude-3-5-sonnet-20241022'],
    }
  2. IF: task-20 passed, end
  3. ELSE: use eval project files to replace src/:
    mv src src-20 # your codes
    cp -r eval/eval-2025xx/claude-3-5-xx/task-5-2 src # LLM's codes
  4. IF: you modify src/ or test/ code
    1. IF: all tests pass in src/ and src-20/
    2. THEN: rush eval with new config.json5
        {
          projects: ['@web-bench/selector'],
          models: ['claude-3-5-sonnet-20241022'],
          startTask: 'task-6', // continue from task-6
        }
    3. goto Step-2
  5. IF: you modify tasks.jsonl, where task-n is the smallest index among the modified tasks
    1. THEN: use the successful task-{n-1} eval project files to replace src/:
      rm -rf src
      cp -r eval/eval-2024xx/claude-3-5-xx/task-{n-1} src
    2. THEN: rush eval with new config.json5
      {
        projects: ['@web-bench/selector'],
        models: ['claude-3-5-sonnet-20241022'],
        startTask: 'task-n', // continue from task-n
      }
    3. goto Step-2
  6. Create a calibration report doc from the template below.

Template

Calibrate with 'claude-3-5-sonnet-20241022':

calibrate-template

4. Pull Request

When submitting a Pull Request to the dev/vx.0.0 branch, you must include the calibration report from the previous step.
