
Some problems in evaluation configuration (missing criterion/dataframe extraction failure/markdown JSON parsing issues). #2

@mingju-c


Dear Author,

First, thank you very much for your outstanding research.
I’ve been using the benchmark and found several structural issues in the evaluation configuration and implementation that significantly affect scoring correctness, reproducibility, and stability. I’d like to document two major problems in the hope that they can be improved or patched. My concerns are as follows:

  1. Missing criterion for the number_near metric in deep2wide instances:
    In the deep2wide instances, the config uses `"metric": ["number_near"]`, but no `criterion` field is provided.
    For example, in `"instance_id": "deep2wide_result_2_成都"`, the metric for the column "报考年龄" (application age) is `{"preprocess": ["extract_number"], "metric": ["number_near"]}`.
    Since number_near is implemented as `if abs((response_num - target_num)) <= abs(target_num) * criterion:`, the evaluator runs with `criterion = None` and throws `TypeError: unsupported operand type(s) for *: 'float' and 'NoneType'` (see the guarded sketch after this list).
    Also, many of the "number" fields contain content that is not purely numeric, e.g. "18周岁以上、30周岁及以下" ("18 or older and 30 or younger"), "1:2 比例进入面试" ("candidates advance to the interview at a 1:2 ratio"), and date ranges, which I do not see how number_near can handle.

  2. response_df = response.extract_dataframe() fails because the response variable is overridden:
    In evaluation.py, line 127 reads:
    `response = llm_completion(messages=[{'role': 'user', 'content': entity_acc_check_prompt}], model_config_name=eval_model_config_name)`
    and line 155 then calls:
    `response_df = response.extract_dataframe()`
    By that point, `response` holds the LLM-judge result (e.g., "Yes"/"No"), not the original model prediction object containing the markdown table. As a result, `response.extract_dataframe()` fails with:
    `AttributeError: 'APIResponse' object has no attribute 'extract_dataframe'`
    A possible renaming fix is sketched after this list.
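
To illustrate the first point, here is a minimal sketch of the guard I have in mind for number_near, assuming it receives the extracted response value, the target value, and the per-column criterion. The fallback name and value (`DEFAULT_RELATIVE_TOLERANCE = 0.05`) are my own placeholders, not taken from the repository:

```python
# Hypothetical fallback used only when the config omits `criterion`;
# the actual default value is of course up to the authors.
DEFAULT_RELATIVE_TOLERANCE = 0.05


def number_near(response_num: float, target_num: float, criterion=None) -> bool:
    # Fall back to a documented default instead of multiplying by None,
    # which is what currently raises the TypeError.
    if criterion is None:
        criterion = DEFAULT_RELATIVE_TOLERANCE
    return abs(response_num - target_num) <= abs(target_num) * criterion
```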

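For the second point, here is a minimal sketch of the renaming I have in mind: keep the LLM-judge verdict in its own variable so it no longer shadows the prediction object. The function wrapper and its argument names are only for illustration and are not part of evaluation.py:

```python
def entity_acc_and_dataframe(response, entity_acc_check_prompt,
                             llm_completion, eval_model_config_name):
    # `response` is the original model prediction object holding the markdown table.
    judge_result = llm_completion(
        messages=[{'role': 'user', 'content': entity_acc_check_prompt}],
        model_config_name=eval_model_config_name,
    )  # LLM-judge verdict, e.g. "Yes"/"No"

    # The dataframe is still extracted from the original prediction,
    # not from the judge output.
    response_df = response.extract_dataframe()
    return judge_result, response_df
```
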
If possible, could you please clarify or improve these evaluation details — especially the missing criterion for the number_near metric in the deep2wide instances? The absence of a criterion makes the metric undefined and prevents reproducible evaluation.

Any guidance or updates would be greatly appreciated. Thank you again for your excellent work and for open-sourcing this benchmark.

Best wishes,
Mingju Chen
