Project Contribution
See Installation to install Web-Bench from source.
- Use `rush new-project` to create a new project
- Run:

  ```shell
  rush update
  cd projects/new-project-name
  npm t # should pass
  ```

- For more project configuration, see Project Hierarchy.
- Plan the initial and final states of the project, as well as several key intermediate states.
- Sometimes it is difficult to design everything in one step; you can first try 1, 3, or 5 tasks, then go back to the previous step.
- Use `readme.md` as the Project Design Document. It is recommended to include: technical background, project background, Feature Coverage, and Reference. Refer to the DOM readme for an example.
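As a concrete illustration, a project design document might be skeletoned like this (the section comments are hypothetical suggestions, not a required template):

```markdown
# Project Name

## Technical Background
<!-- standards, frameworks, or APIs the project exercises -->

## Project Background
<!-- what the project builds and why it is representative -->

## Feature Coverage
<!-- table or list mapping tasks to covered features -->

## Reference
<!-- specs, MDN pages, and tutorials consulted -->
```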
For projects focused on business, we recommend referencing: calculator, svg-chart.
Video (`new-project-3x.mp4`): prepare, write a task, write tests, write codes.
Directions that may be more challenging for LLMs:
- Multi-file projects tend to be more challenging than single-file ones — we’ve seen this in multiple existing projects.
- Single tasks with more detailed and feature-rich requirements pose greater difficulty.
- Logical dependencies between multiple tasks increase complexity.
- Tasks involving numerical computation can be particularly challenging.
- Feature Coverage
- Standard or Framework: Aim for a high level of core capability coverage. Refer to the DOM Feature Coverage as an example.
- Business Scenarios: Prioritize frequently used scenarios, while including a proportion of advanced features.
- The 20 tasks should be arranged from easy to difficult. The difficulty distribution is for reference only (not strict):
- Tasks 1–5, Easy. Use `rush eval` to validate with the following models (see the doc):

  ```json5
  "models": [
    "claude-3-7-sonnet-latest",
    "openai/gpt-4o",
    "deepseek/deepseek-chat"
  ]
  ```
- Tasks 6–10, Moderate. You can use the above models to verify feasibility; both success and failure are acceptable.
- Tasks 11–20, Challenging. These should ideally be difficult even for the current top-performing LLMs (i.e., not guaranteed to succeed every time).
- After completing the tasks, use an LLM to validate:
- English grammar
- Sentence clarity and conciseness
- Logical consistency or potential issues
```jsonl
{"id":"init","date":"2024-11-15","level":"easy","description":"Generate html..."}
{"id":"task-1","date":"2024-11-15","level":"moderate","description":"Do something..."}
{"id":"task-20","date":"2025-1-5","level":"challenging","description":"Do something..."}
```

- `id`: `init` | `task-n`
- `date`: used to filter contaminated models
- `level`: `easy` | `moderate` | `challenging`
- `description`: detailed description of the task
Both JSONL and YAML formats are supported.
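For reference, the same example entries expressed in YAML might look like this (a sketch mirroring the JSONL fields; values are illustrative):

```yaml
- id: init
  date: 2024-11-15
  level: easy
  description: Generate html...
- id: task-1
  date: 2024-11-15
  level: moderate
  description: Do something...
```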
- 3–5 test cases for each task
- Tests must pass against the latest code.
- `src/` stores the latest code for all tasks
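The `tasks.jsonl` field requirements above can be sanity-checked with a short script. This is only a sketch; the file name, function name, and the exact checks are illustrative, not part of the benchmark tooling:

```python
import json

REQUIRED = {"id", "date", "level", "description"}
LEVELS = {"easy", "moderate", "challenging"}

def validate_tasks(path="tasks.jsonl"):
    """Return a list of (line_number, problem) pairs; an empty list means the file is valid."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, 1):
            if not line.strip():
                continue  # tolerate blank lines
            task = json.loads(line)
            missing = REQUIRED - task.keys()
            if missing:
                problems.append((n, f"missing fields: {sorted(missing)}"))
            elif task["level"] not in LEVELS:
                problems.append((n, f"unknown level: {task['level']!r}"))
    return problems
```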
This section is about extracting the underlying knowledge stack from those references to ensure that the designed project effectively covers core capabilities. Refer to the paper, Section 2.2 "Standard and Framework", for guidance.
Standards
- Specifications: From official sources such as W3C, WHATWG, ECMA, etc.
- MDN: Reference documentation, demos, tutorials
- Well-known third-party tutorials: Independent sites (e.g., CSS-Tricks) or GitHub repositories
Frameworks
- Official tutorials of the framework
- Officially recommended use cases
- Well-known third-party demos
Business Scenarios
- Representative products for the given scenario
- Core technology stack or frameworks used
Use `rush eval` to calibrate a project, i.e., check it for design issues:

- `tasks.jsonl` issues, such as ambiguous descriptions, vulnerabilities, or obvious conflicts with test cases.
- `test/cases` issues, such as tests that are not rigorous enough or are too strict.
Assume the current path is `/path/to/projects/selector`, and run:

- `rush eval` with the following `config.json5`:

  ```json5
  {
    projects: ['@web-bench/selector'],
    models: ['claude-3-5-sonnet-20241022'],
  }
  ```
- IF: task-20 passed, end
- ELSE: use the eval project files to replace `src/`:

  ```shell
  mv src src-20 # your codes
  cp -r eval/eval-2025xx/claude-3-5-xx/task-5-2 src # LLM's codes
  ```
- IF: you modify `src/` or `test/` codes
  - IF: all tests pass in both `src/` and `src-20/`
  - THEN: run `rush eval` with a new `config.json5`:

    ```json5
    {
      projects: ['@web-bench/selector'],
      models: ['claude-3-5-sonnet-20241022'],
      startTask: 'task-6', // continue from task-6
    }
    ```

  - goto Step-2
- IF: you modify `tasks.jsonl`, and task-n is the smallest index among the modified tasks
  - THEN: use the successful task-{n-1} eval project files to replace `src/`:

    ```shell
    rm -rf src
    cp -r eval/eval-2024xx/claude-3-5-xx/task-{n-1} src
    ```

  - THEN: run `rush eval` with a new `config.json5`:

    ```json5
    {
      packageNames: ['@web-bench/selector'],
      models: ['claude-3-5-sonnet-20241022'],
      startTask: 'task-n', // continue from task-n
    }
    ```

  - goto Step-2
- Create a doc from the template
- Calibrate with `claude-3-5-sonnet-20241022`

When submitting a Pull Request to the `dev/vx.0.0` branch, you must include the calibration report from the previous step.