Surprisingly low performance on code.Debug

Thanks for the work on this benchmark. 

I was wondering why the baseline accuracies on code.Debug are so low. 
```
de.Debug | 37.06% | < 5% | 17.77% | < 5% | 9.14% | 13.96% | 7.36%
```

Since it's multiple choice with four options, random guessing should give about 25% in expectation, yet most of these scores fall well below that.
Have you released the outputs from your evaluation runs anywhere?
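
To make the comparison concrete, here is a rough sketch of how unlikely the sub-25% scores would be if a model were guessing uniformly among four options. Note that `n_questions = 400` is a placeholder I picked for illustration, not the actual size of the code.Debug split, and `binom_cdf` is just a helper I wrote for this post.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n_questions = 400   # hypothetical; substitute the real number of code.Debug items
p_guess = 1 / 4     # uniform guessing over four options

# Reported accuracies from the table row above that fall below 25%
# (the "< 5%" entries are omitted because no exact value is given).
reported = [0.1777, 0.1396, 0.0914, 0.0736]

for acc in reported:
    k = round(acc * n_questions)  # approximate number of correct answers
    prob = binom_cdf(k, n_questions, p_guess)
    print(f"accuracy {acc:.2%}: P(at most {k} correct by guessing) = {prob:.2e}")
```

If those probabilities come out tiny for the real question count, the low scores are more likely due to answer extraction or output formatting than to the models doing worse than chance, which is why the raw outputs would be helpful.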