[WIP][New Model] LimiX #208
base: main
Conversation
I added batching for the test predictions, which allowed me to run it on a few more datasets. Even with this batching, I am not able to run LimiX on an H200 with the current setup for the OpenML task IDs 363628 and 363673, as they have too many samples: 363628 has 90k training samples and 21 features; 363673 has 100k training samples and 10 features. At the same time, the datasets that I can now run with batching on an H200 (roughly up to 70k training samples) still take very long to predict (multiple hours, and even longer due to batching the test predictions). I will postpone further investigation of larger datasets until there is an update from the authors and otherwise stick to the TabPFN-subset limits, as these seem roughly in scope in terms of efficiency.
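A minimal sketch of the kind of test-time batching described above (not the PR's actual implementation); `model`, `predict_proba`, and `batch_size` are placeholder names used only for illustration:

```python
import numpy as np

def predict_in_batches(model, X_test: np.ndarray, batch_size: int = 2048) -> np.ndarray:
    """Predict on X_test in fixed-size chunks and concatenate the results,
    so only one chunk of test samples is held on the GPU at a time."""
    outputs = []
    for start in range(0, len(X_test), batch_size):
        chunk = X_test[start:start + batch_size]
        outputs.append(model.predict_proba(chunk))  # placeholder API of a fitted model
    return np.concatenate(outputs, axis=0)
```

Note that this only bounds the memory taken by the test queries; the full training set presumably still has to fit into the model's context, which is why the very large tasks above remain out of reach.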
The authors of LimiX published a new, smaller model and updated their code. Moreover, they benchmark on TabArena datasets in their paper. Yet, the results in the paper are problematic (holdout validation, weak baselines, ...); basically, most of the errors TabArena wants to avoid. Moreover, the regression CD plot looks weird as well, likely because they don't have enough datasets in their subset. Hence, I think it would be good to get a new snapshot of their method. I contacted the authors to see if they want to help with this. Besides that, I first need to rebase/merge this PR with all our recent main-line changes, but the model code should stay similar.
Regarding the points you raised, we believe there are some mismatches here:

(1) Holdout validation: Our model evaluation strictly follows standard practices in the machine learning community. As described in our technical report, for public benchmarks (e.g., OpenML datasets), we use the official train/test splits provided by the benchmark. For models that require hyperparameter tuning, we perform cross-validation only on the training set to select the optimal hyperparameters. Final evaluation is always conducted on a held-out test set with no leakage. We consider this a fair and widely accepted protocol. For transparency and reproducibility, the complete list of datasets is publicly available on GitHub (https://github.com/limix-ldm/LimiX/tree/main/benchmark_list). That said, we acknowledge that alternative, equally valid methodologies exist in academic research, and we welcome diverse perspectives. We do not view differing approaches as inherently “problematic.”

(2) Baseline models: In the interest of fair comparison, we follow the implementation and hyperparameter settings provided in TALENT (https://github.com/LAMDA-Tabular/TALENT) for all NN-based models. Tree-based models undergo hyperparameter optimization via Optuna, and AutoGluon employs its built-in hyperparameter search. Importantly, LimiX is evaluated without any hyperparameter tuning. The reported results of LimiX use a single, generic set of hyperparameters that is neither dataset-specific nor benchmark-specific. Thus, we consider these reasonable baselines for comparison.

(3) TabArena regression subset: Our inclusion criteria for the TabArena regression subset follow the standards commonly adopted in other benchmarks: ≤ 50,000 training samples and ≤ 10,000 features. Based on these criteria, all 13 regression datasets referenced in the TabArena paper are included in our benchmark (see the GitHub link above).
Accordingly, the differences observed in the CD diagram likely arise from the fact that some prior works (e.g., PFN) evaluated only a smaller subset of the data (e.g., 7 out of the 13 datasets), whereas our analysis covers the full set of 13 regression datasets. It is also worth noting that our implementation of the CD plot follows the same strategy used in https://github.com/hfawaz/cd-diagram and TALENT (see the GitHub link above). In these implementations, the average-rank comparison is replaced by a Wilcoxon signed-rank test (Wilcoxon, 1945) with Holm’s alpha correction (Holm, 1979; Garcia and Herrera, 2008). We greatly appreciate the dialogue and remain open to further discussion and collaboration.
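For reference, a hedged sketch of the pairwise Wilcoxon signed-rank + Holm procedure described above, roughly in the spirit of hfawaz/cd-diagram; the function name and the `scores` layout are illustrative assumptions, not code from either repository:

```python
from itertools import combinations

from scipy.stats import wilcoxon
from statsmodels.stats.multitest import multipletests

def pairwise_wilcoxon_holm(scores: dict[str, list[float]], alpha: float = 0.05):
    """scores maps method name -> per-dataset scores (same dataset order for all methods).

    Returns, for every pair of methods, the Holm-corrected p-value and whether the
    null hypothesis of equal performance is rejected at the given alpha."""
    pairs = list(combinations(scores, 2))
    pvals = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
    reject, pvals_holm, _, _ = multipletests(pvals, alpha=alpha, method="holm")
    return {pair: (p, r) for pair, p, r in zip(pairs, pvals_holm, reject)}
```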
Heyho @limix-ldm, thank you for the reply! I think we slightly disagree on some points, so I wanted to share some of my thoughts below:

(1) Inner holdout validation is very well known to bias results. This was a big mistake of many prior benchmarks, which TabArena avoids. Outer holdout validation can be appropriate in some cases, such as TabArena-Lite, but should still be avoided when reporting final results. I'm happy to talk more about this later if needed, but for now, I recommend checking out our NeurIPS paper for the latest state of the literature on this topic.

(2) The problem with TALENT is that the baselines are weak and, in some cases, not implemented in pipelines that obtain peak performance (also check out our paper for a lot of details on this topic). I agree that they are reasonable baselines, but beating them is not enough to claim state-of-the-art. In TabArena, we ensure all baselines are as strong as possible. Thus, it is much harder to claim state-of-the-art, but when a method is state-of-the-art, it is very representative.

(3) I have no problem with the datasets used in this case. Note that the CD plot would look even weirder with fewer datasets, as in the TabPFN-subset -- but it does not for TabArena (check out the generated results, as we also compute CD plots for all subsets). Small note: you are not using a CD plot if you use the implementation from hfawaz/cd-diagram. A CD plot requires a critical difference by definition, which is not given by the Wilcoxon-Holm approach (see https://www.jmlr.org/papers/volume7/demsar06a/demsar06a.pdf). I recommend using the CD plot from Autorank (https://github.com/sherbold/autorank).

The interpretation of your results depends on what you claim based on these points. There is nothing inherently wrong with them, but they might not be entirely accurate, as we know there are better ways to do benchmarking. Sadly, the tabular benchmarking community has been ignoring the better ways for too long. With TabArena, we aim to change this and hence have to call out such behavior as well!

As a closing thought: note that in your work, you have compared on the datasets of TabArena, not on the TabArena benchmark. To guarantee a fair comparison, it is essential to use the same pipeline as all baselines, as we require and recommend with TabArena. Just comparing on the datasets from TabArena with your own pipeline and a different evaluation protocol removes all the improvements to benchmarking we introduced with TabArena, making the results less robust.
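A hedged sketch of the Autorank-based CD plot suggested above; the method names and score values in the DataFrame are placeholders purely for illustration, not results from any benchmark:

```python
import matplotlib.pyplot as plt
import pandas as pd
from autorank import autorank, plot_stats

# Rows are datasets, columns are methods, values are per-dataset scores
# (placeholder numbers, not real benchmark results).
results = pd.DataFrame(
    {
        "method_a": [0.91, 0.85, 0.78, 0.88, 0.92, 0.81],
        "method_b": [0.90, 0.86, 0.75, 0.84, 0.89, 0.79],
        "method_c": [0.83, 0.80, 0.70, 0.81, 0.85, 0.74],
    },
    index=[f"dataset_{i}" for i in range(6)],
)
analysis = autorank(results, alpha=0.05, verbose=False)  # picks the omnibus and post-hoc tests automatically
plot_stats(analysis, allow_insignificant=True)  # CD-style diagram; allow plotting even without significance
plt.show()
```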
Thanks for the reply. This is a valuable starting point for highlighting the limitations of current benchmarks. We will study it and consider incorporating the TabArena pipeline in future work.



This code adds LimiX (https://arxiv.org/pdf/2509.03505) from https://github.com/limix-ldm/LimiX
Notes
`.squeeze()` will crash if one runs on a dataset with just one feature (in one of the preprocessing configs). I removed the `.squeeze()` and am not sure why it was here in the first place, as a randomly appearing dimension sounds more like a bug (see the standalone illustration at the end of these notes).

Performance on the TabPFN-subset of TabArena-Full
I have tried running the method on more datasets as well, and it worked. However, for the larger datasets in TabArena (e.g., 50k samples, 130 features), it runs out of VRAM (given 40 GB of VRAM). So, for now, I will stick to this subset.
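As referenced in the notes above, a standalone illustration of the `.squeeze()` pitfall; this is a generic demonstration, not LimiX's actual preprocessing code:

```python
import torch

X_single_feature = torch.randn(128, 1)    # (n_samples, n_features) with n_features == 1
print(X_single_feature.squeeze().shape)   # torch.Size([128]) -- the feature axis silently disappears
# Downstream code that still expects a 2-D (n_samples, n_features) tensor then
# fails or misbehaves. Squeezing only an explicitly named axis is safer:
print(X_single_feature.squeeze(0).shape)  # torch.Size([128, 1]) -- unchanged, since dim 0 is not singleton
```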
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.