
Project Contribution


See Installation to install Web-Bench from source.

1. Create new project

  1. Use rush new-project to create a new project:
     rush new-project
  2. Run:
     rush update
     cd projects/new-project-name
     npm t # should pass
  3. For more project configuration, see Project Hierarchy.

2. Design Project

  1. Plan the initial and final states of the project, as well as several key intermediate states.
  2. Sometimes it’s difficult to design everything in one step — you can first try 1 task, 3 tasks, or 5 tasks, and then go back to the previous step.
  3. Use readme.md as the Project Design Document. It is recommended to include: technical background, project background, Feature Coverage and Reference. Refer to the DOM readme for an example.

For projects focused on business scenarios, we recommend referencing calculator and svg-chart.

Video: prepare - write a task - write tests - write code.

new-project-3x.mp4

Directions that may be more challenging for LLMs:

  1. Multi-file projects tend to be more challenging than single-file ones — we’ve seen this in multiple existing projects.
  2. Single tasks with more detailed and feature-rich requirements pose greater difficulty.
  3. Logical dependencies between multiple tasks increase complexity.
  4. Tasks involving numerical computation can be particularly challenging.

Design Tasks

Figure: benchmark-construction
  1. Feature Coverage
    1. Standard or Framework: Aim for a high level of core capability coverage. Refer to the DOM Feature Coverage as an example.
    2. Business Scenarios: Prioritize frequently used scenarios, while including a proportion of advanced features.
  2. The 20 tasks should be arranged from easy to difficult. The difficulty distribution is for reference only (not strict):
    1. Tasks 1–5, Easy. Use rush eval to validate with the following models (see doc):
      "models": [
        "claude-3-7-sonnet-latest",
        "openai/gpt-4o",
        "deepseek/deepseek-chat"
      ]
    2. Tasks 6–10, Moderate. You can use the above models to verify feasibility; both success and failure are acceptable.
    3. Tasks 11–20, Challenging. These should ideally be difficult even for the current top-performing LLMs (i.e., not guaranteed to succeed every time).
  3. After completing the tasks, use an LLM to validate:
    • English grammar
    • Sentence clarity and conciseness
    • Logical consistency or potential issues

Task Data Structure

{"id":"init","date":"2024-11-15","level":"easy","description":"Generate html..."}
{"id":"task-1","date":"2024-11-15","level":"moderate","description":"Do something..."}
{"id":"task-20","date":"2025-1-5","level":"challenging","description":"Do something..."}
  • id, init | task-n
  • date, used to filter contaminated models
  • level, easy | moderate | challenging
  • description, detailed task description

Both JSONL and YAML formats are supported.
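
For reference, the task record can be sketched as a TypeScript interface. The type below is illustrative only, derived from the field list above; it is not an official type exported by the repository.

// Illustrative shape of one line in tasks.jsonl (not an official type).
type TaskLevel = 'easy' | 'moderate' | 'challenging'

interface Task {
  id: string          // 'init' or 'task-n'
  date: string        // used to filter contaminated models
  level: TaskLevel
  description: string // detailed task description
}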

Design Tests

  1. Write 3-5 test cases for each task (see the sketch below).
  2. Tests must pass against the latest code.
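
Below is a minimal sketch of what the 3-5 test cases for one task might look like, assuming Playwright is used as the test runner (as in existing projects). The task, selectors, and page path are hypothetical.

import { test, expect } from '@playwright/test'

// Hypothetical task-1: "Render a counter starting at 0 with an #increment button."
test.beforeEach(async ({ page }) => {
  await page.goto('/index.html')
})

test('counter renders with initial value 0', async ({ page }) => {
  await expect(page.locator('#counter')).toHaveText('0')
})

test('clicking the button increments the counter', async ({ page }) => {
  await page.click('#increment')
  await expect(page.locator('#counter')).toHaveText('1')
})

test('repeated clicks keep incrementing', async ({ page }) => {
  await page.click('#increment')
  await page.click('#increment')
  await expect(page.locator('#counter')).toHaveText('2')
})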

Design Codes

  1. src/ stores the latest code covering all tasks.

Reference Guide

This section is about extracting the underlying knowledge stack from the references below, to ensure that the designed project effectively covers core capabilities. Refer to the Paper, Section 2.2 "Standard and Framework", for guidance.

Standards

  1. Specifications: From official sources such as W3C, WHATWG, ECMA, etc.
  2. MDN: Reference documentation, demos, tutorials
  3. Well-known third-party tutorials: Independent sites (e.g., CSS-Tricks) or GitHub repositories

Frameworks

  1. Official tutorials of the framework
  2. Officially recommended use cases
  3. Well-known third-party demos

Business Scenarios

  1. Representative products for the given scenario
  2. Core technology stack or frameworks used

3. Calibrate Project

Use rush eval to calibrate a project. That is, check for design issues in the project:

  1. tasks.jsonl issues, such as ambiguous descriptions, vulnerabilities, or obvious conflicts with test cases.
  2. test/ case issues, such as tests that are not rigorous enough or are too strict (see the sketch below).
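
As a hypothetical illustration of test rigor (Playwright-style assertions; the task, selectors, and expected values are made up):

import { test, expect } from '@playwright/test'

// Hypothetical task: "Clicking #calc shows the sum 42 in an element with class 'result'."
test('shows the computed sum', async ({ page }) => {
  await page.goto('/index.html')
  await page.click('#calc')

  // Too loose: passes even if the displayed value is wrong.
  // await expect(page.locator('.result')).toBeVisible()

  // Too strict: couples the test to a DOM structure the task never asked for.
  // await expect(page.locator('div > span.result:nth-child(3)')).toHaveText('42')

  // Reasonable: checks exactly what the task description requires.
  await expect(page.locator('.result')).toHaveText('42')
})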

Assuming the current path is /path/to/projects/selector, run:

  1. rush eval with the following config.json5:
    {  
      projects: ['@web-bench/selector'],
      models: ['claude-3-5-sonnet-20241022'],
    }
  2. IF: task-20 passed, end
  3. ELSE: use eval project files to replace src/:
    mv src src-20 # your codes
    cp -r eval/eval-2025xx/claude-3-5-xx/task-5-2 src # LLM's codes
  4. IF: you modify src/ or test/ code
    1. IF: all tests pass in src/ and src-20/
    2. THEN: rush eval with new config.json5
        {
          projects: ['@web-bench/selector'],
          models: ['claude-3-5-sonnet-20241022'],
          startTask: 'task-6', // continue from task-6
        }
    3. goto Step-2
  5. IF: you modify tasks.jsonl, where task-n is the smallest index among the modified tasks
    1. THEN: use the successful task-{n-1} eval project files to replace src/:
      rm -rf src
      cp -r eval/eval-2024xx/claude-3-5-xx/task-{n-1} src
    2. THEN: rush eval with new config.json5
      {
        projects: ['@web-bench/selector'],
        models: ['claude-3-5-sonnet-20241022'],
        startTask: 'task-n', // continue from task-n
      }
    3. goto Step-2
  6. Create a calibration report doc from the template below.

Template

Calibrate with 'claude-3-5-sonnet-20241022':

calibrate-template

4. Pull Request

When submitting a Pull Request to the dev/vx.0.0 branch, you must include the calibration report from the previous step.
