feat(benchmark): add evaluation support for finsearchcomp #51

JubSteven · 2025-09-25T03:31:43Z

Describe this PR

Summary

Adds comprehensive support for the FinSearchComp benchmark, enabling financial search and analysis evaluation with dynamic judge prompts and regional analysis.

Key Changes

New Files: FinSearchComp agent config, benchmark config, documentation, evaluation scripts, and progress checker.
Core Updates: Enhanced the pipeline to carry task metadata, so LLM judges can evaluate each task with the prompt template that matches it.
Evaluation Features: Task type handling (T1/T2/T3), progress monitoring, and support for multiple runs.
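The task type handling and regional progress reporting described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the template text, the record fields (`task_type`, `region`, `correct`, `question`, `reference`, `response`), and both function names are assumptions. The default of skipping T1 mirrors the commit message about excluding T1 from the progress file.

```python
from collections import defaultdict

# Hypothetical judge prompt templates, one per task type. The wording is a
# placeholder, not the templates shipped in this PR.
_BODY = "\nQuestion: {question}\nReference: {reference}\nResponse: {response}"
JUDGE_TEMPLATES = {
    "T1": "[T1 judge]" + _BODY,
    "T2": "[T2 judge]" + _BODY,
    "T3": "[T3 judge]" + _BODY,
}


def build_judge_prompt(task: dict) -> str:
    """Pick the template for the task's type and fill it from task metadata."""
    template = JUDGE_TEMPLATES[task["task_type"]]
    return template.format(
        question=task["question"],
        reference=task["reference"],
        response=task["response"],
    )


def summarize_by_region(results, excluded_types=("T1",)):
    """Return accuracy per (region, task_type), skipping excluded task types."""
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for r in results:
        if r["task_type"] in excluded_types:
            continue
        bucket = counts[(r["region"], r["task_type"])]
        bucket[0] += int(r["correct"])
        bucket[1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}
```

Keying the summary on a (region, task type) pair keeps the Global and Greater China subsets separate, which is the behavior the later commits describe for the progress checker.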

Checklist for PR

Must Do

Write a good PR title and description, e.g. feat(agent): add pdf tool via mcp; perf: make llm client async; or fix(utils): load custom config via importlib. The CI job check-pr-title enforces the Angular commit message format on the PR title.
Run make precommit locally. The CI job lint enforces ruff's default format/lint rules on all new code.
Run make pytest. Check the test summary (located at report.html) and the coverage report (located at htmlcov/index.html) for the new code.
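For a quick local check of a title before pushing, a regular expression along these lines may help. It is a sketch of the Angular convention (type, optional scope, colon, subject), not the actual rule set of the check-pr-title job; the list of accepted types and the name `is_angular_title` are assumptions.

```python
import re

# Commonly used Angular commit types; the CI job may accept a different set.
ANGULAR_TYPES = (
    "build", "chore", "ci", "docs", "feat", "fix",
    "perf", "refactor", "revert", "style", "test",
)

# type, optional "(scope)", then ": " and a non-empty subject.
TITLE_RE = re.compile(
    r"^(?P<type>" + "|".join(ANGULAR_TYPES) + r")"
    r"(\((?P<scope>[^()\s]+)\))?"
    r": (?P<subject>\S.*)$"
)


def is_angular_title(title: str) -> bool:
    """Return True if the title looks like 'type(scope): subject'."""
    return TITLE_RE.match(title) is not None
```

Under this pattern the final title of this PR, feat(benchmark): add evaluation support for finsearchcomp, matches, while the original title Explorations does not.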

Nice To Have

(Optional) Write/update tests under /tests for feat and test PR.
(Optional) Write/update docs under /docs for docs and ci PR.

- Resolved formatting conflicts in utils/extract_futurex_results.py
- Resolved formatting conflicts in utils/prepare_benchmark/gen_futurex.py
- Resolved formatting conflicts in utils/progress_check/check_futurex_progress.py

All conflicts were due to code formatting differences (whitespace, line breaks, trailing commas). Functionality remains identical between branches.

JubSteven added 14 commits September 18, 2025 10:35

upd: add futurex evaluation support.

56b235d

upd: support multiple eval for futurex and add relevant doc.

287a7bc

upd: fix bugs with doc for futurex.

bf43b37

debug: fix wrong calling path.

d1e1637

add preparation for finsearchcomp.

eb6f302

update a premature version of finsearchcomp benchmark.

4dabaee

clean redundant code in merging.

c086e41

upd: modify yaml to use Mirothinker as the main agent, add check progress file to exclude T1.

d6a8715

upd: check_progress function for finsearchcomp now consider globe and greater china respectively.

e7163d3

Merge remote-tracking branch 'upstream/miroflow-v0.3' into explorations

b0e494f

upd: add docs and shell script for multiple runs.

256ba2c

fix: check_finsearchcomp_progress not displaying results from greater china region.

835e590

Merge remote-tracking branch 'upstream/miroflow-v0.3' into explorations

5ffc269

JubSteven changed the title ~~Explorations~~ feat(finsearchcomp): add evaluation support for finsearchcomp Sep 25, 2025

JubSteven changed the title ~~feat(finsearchcomp): add evaluation support for finsearchcomp~~ feat(benchmark): add evaluation support for finsearchcomp Sep 25, 2025

BinWang28 approved these changes Sep 25, 2025

BinWang28 merged commit e276581 into MiroMindAI:miroflow-v0.3 Sep 25, 2025
0 of 2 checks passed
