Thanks for the work on this benchmark.
I was wondering why the baseline accuracies on code.Debug are so low.
code.Debug | 37.06% | < 5% | 17.77% | < 5% | 9.14% | 13.96% | 7.36%
Since the task is multiple choice with four options, random guessing should give about 25% on average, so several of these scores are well below chance.
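As a quick sanity check on that 25% figure, here is a minimal sketch that simulates uniform random guessing on a four-option task. The function name and the number of simulated questions are my own placeholders, not anything from the benchmark code, and I'm assuming the correct answer is equally likely to be any of the four options.

```python
import random


def random_guess_accuracy(num_questions, num_options=4, seed=0):
    """Simulate uniform random guessing and return the fraction answered correctly.

    Hypothetical helper for illustration only; it is not part of the benchmark.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(num_questions):
        answer = rng.randrange(num_options)  # assumed: gold label is uniform over options
        guess = rng.randrange(num_options)   # the "model" picks an option at random
        correct += guess == answer
    return correct / num_questions


if __name__ == "__main__":
    expected = 1 / 4  # analytic expectation for four options
    simulated = random_guess_accuracy(100_000)  # placeholder question count
    print(f"expected: {expected:.2%}, simulated: {simulated:.2%}")
```

With that many simulated questions the result should land very close to 25%, which is why scores under 5% look surprising to me.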
Have you released the outputs from your evaluation runs anywhere?