
Some problems in evaluation configuration (missing criterion/dataframe extraction failure/markdown JSON parsing issues). #2

@mingju-c


Dear Author,

First, thank you very much for your outstanding research.
I’ve been using the benchmark and found several structural issues in the evaluation configuration and implementation that significantly affect scoring correctness, reproducibility, and stability. I’d like to document two major problems in the hope that they can be improved or patched. My concerns are as follows:

  1. Missing criterion for the number_near metric in deep2wide instances:
    In the deep2wide instances, the config uses `"metric": ["number_near"]`, but no `criterion` field is provided.
    For example, in `"instance_id": "deep2wide_result_2_成都"`, the metric for the column "报考年龄" (application age) is `{"preprocess": ["extract_number"], "metric": ["number_near"]}`.
    Since number_near is implemented as `if abs((response_num - target_num)) <= abs(target_num) * criterion:`, the evaluator runs with `criterion = None` and throws `TypeError: unsupported operand type(s) for *: 'float' and 'NoneType'` (see the guarded sketch after this list).
    Also, many of the "number" fields contain content that is not purely numeric, e.g. "18周岁以上、30周岁及以下" ("18 or older and 30 or younger"), "1:2 比例进入面试" ("candidates advance to the interview at a 1:2 ratio"), and date ranges, which I do not see how number_near can handle.

  2. response_df = response.extract_dataframe() fails because the response variable is overridden:
    In evaluation.py, line 127 reads:
    `response = llm_completion(messages=[{'role': 'user', 'content': entity_acc_check_prompt}], model_config_name=eval_model_config_name)`
    and line 155 then calls:
    `response_df = response.extract_dataframe()`
    By that point, `response` holds the LLM-judge result (e.g., "Yes"/"No"), not the original model prediction object containing the markdown table. As a result, `response.extract_dataframe()` fails with:
    `AttributeError: 'APIResponse' object has no attribute 'extract_dataframe'`
    A possible renaming fix is sketched after this list.
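
To illustrate the first point, here is a minimal sketch of the guard I have in mind for number_near, assuming it receives the extracted response value, the target value, and the per-column criterion. The fallback name and value (`DEFAULT_RELATIVE_TOLERANCE = 0.05`) are my own placeholders, not taken from the repository:

```python
# Hypothetical fallback used only when the config omits `criterion`;
# the actual default value is of course up to the authors.
DEFAULT_RELATIVE_TOLERANCE = 0.05


def number_near(response_num: float, target_num: float, criterion=None) -> bool:
    # Fall back to a documented default instead of multiplying by None,
    # which is what currently raises the TypeError.
    if criterion is None:
        criterion = DEFAULT_RELATIVE_TOLERANCE
    return abs(response_num - target_num) <= abs(target_num) * criterion
```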

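For the second point, here is a minimal sketch of the renaming I have in mind: keep the LLM-judge verdict in its own variable so it no longer shadows the prediction object. The function wrapper and its argument names are only for illustration and are not part of evaluation.py:

```python
def entity_acc_and_dataframe(response, entity_acc_check_prompt,
                             llm_completion, eval_model_config_name):
    # `response` is the original model prediction object holding the markdown table.
    judge_result = llm_completion(
        messages=[{'role': 'user', 'content': entity_acc_check_prompt}],
        model_config_name=eval_model_config_name,
    )  # LLM-judge verdict, e.g. "Yes"/"No"

    # The dataframe is still extracted from the original prediction,
    # not from the judge output.
    response_df = response.extract_dataframe()
    return judge_result, response_df
```
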
If possible, could you please clarify or improve these evaluation details — especially the missing criterion for the number_near metric in the deep2wide instances? The absence of a criterion makes the metric undefined and prevents reproducible evaluation.

Any guidance or updates would be greatly appreciated. Thank you again for your excellent work and for open-sourcing this benchmark.

Best wishes,
Mingju Chen
