Meaning of fold in the results parquet file #209

@puhsu

Hi! I am trying to match some of my run results with the ones reported in TabArena and have a question about how folds and repeats are reported in the provided results dataframe. I am using the following dataframe:

import pandas as pd

df = pd.read_parquet("https://tabarena.s3.us-west-2.amazonaws.com/results/df_results_leaderboard.parquet")

It has the following columns:

# df.columns -> returns
Index(['dataset', 'fold', 'method', 'metric_error', 'time_train_s',
       'time_infer_s', 'metric_error_val', 'config_selected', 'seed',
       'method_metadata', 'ensemble_weight', 'problem_type', 'metric',
       'method_type', 'method_subtype', 'config_type'],
      dtype='object')

There are up to 30 "folds". I understand that the paper used 3-fold outer validation for scoring with up to 10 repeats. Could you please help me understand how the dataset fold and repeat indexes are converted to these 30 folds?

Basically, which of the two options is correct (do we first iterate over fold and then over repeat or is it the other way around):

# just an example
df_fold = 13

# option 1:
openml_fold = df_fold // 10
openml_repeat = df_fold % 10

# or option 2:
openml_fold = df_fold % 3
openml_repeat = df_fold // 3
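To make the difference between the two options concrete, here is a minimal sketch that decodes a few `fold` values both ways. It does not say which option TabArena actually uses. The constants `N_FOLDS = 3` and `N_REPEATS = 10` and the helper names `decode_option_1` / `decode_option_2` are assumptions made for illustration, based on the 3-fold, up-to-10-repeats setup described above. For datasets with fewer than 10 repeats, the multiplier in option 1 would itself be a guess.

```python
N_FOLDS = 3     # assumed: outer folds per repeat
N_REPEATS = 10  # assumed: maximum number of repeats


def decode_option_1(df_fold):
    """Option 1: df_fold = openml_fold * N_REPEATS + openml_repeat."""
    return df_fold // N_REPEATS, df_fold % N_REPEATS


def decode_option_2(df_fold):
    """Option 2: df_fold = openml_repeat * N_FOLDS + openml_fold."""
    return df_fold % N_FOLDS, df_fold // N_FOLDS


# Each printed tuple is (openml_fold, openml_repeat).
for df_fold in (0, 1, 3, 13, 29):
    print(df_fold, decode_option_1(df_fold), decode_option_2(df_fold))
```

For `df_fold = 13`, option 1 gives fold 1, repeat 3, while option 2 gives fold 1, repeat 4, so the two readings agree on some rows and disagree on others.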

To give some context for the question: I wanted to quickly check some results on a few different small dataset splits (e.g. fold=0, repeat in range(0, 5), in OpenML terms when loading the data), and now I want to match the results I have to the dataframe.
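As a sketch of what the matching would look like under each hypothesis, the snippet below lists which dataframe `fold` values would correspond to OpenML fold 0 with repeats 0 to 4. The helper names and the constants `N_FOLDS` and `N_REPEATS` are hypothetical and only encode the two candidate orderings from this question, not confirmed TabArena behaviour.

```python
N_FOLDS = 3     # assumed: outer folds per repeat
N_REPEATS = 10  # assumed: maximum number of repeats


def df_folds_option_1(openml_fold, openml_repeats):
    """Candidate 1: the repeat index varies fastest."""
    return [openml_fold * N_REPEATS + r for r in openml_repeats]


def df_folds_option_2(openml_fold, openml_repeats):
    """Candidate 2: the fold index varies fastest."""
    return [r * N_FOLDS + openml_fold for r in openml_repeats]


print(df_folds_option_1(0, range(5)))  # [0, 1, 2, 3, 4]
print(df_folds_option_2(0, range(5)))  # [0, 3, 6, 9, 12]
```

Either list could then be passed to `df["fold"].isin(...)` to select the candidate rows, and comparing `metric_error` against my own runs for one method and dataset might show which ordering lines up.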


Labels: documentation (Improvements or additions to documentation)
