Current approach: 12 epochs
The baseline MLP also shows an early rise in accuracy but struggles in the final stages, around and beyond 90%.
The network learns a vector (embedding) for each possible integer value of a and b. It then concatenates the two vectors and uses a small MLP to map them to logits over mod_value classes.
embedding_dim = 16
AdamW: lr = 5e-4, weight_decay = 1e-4
MLP head: sequence of Linear + ReLU (+ optional Dropout) layers, final Linear with output dimension mod_value (the logits).
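A minimal sketch of this setup in PyTorch, assuming the components listed above; mod_value, hidden_dim, and the dropout rate are placeholder assumptions, not values taken from these runs:

```python
import torch
import torch.nn as nn

mod_value = 97      # assumption: modulus / number of output classes
embedding_dim = 16
hidden_dim = 128    # assumption: width of the MLP head

class ModAddMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # One embedding vector per possible integer value of a and b.
        self.embed = nn.Embedding(mod_value, embedding_dim)
        # MLP head: Linear + ReLU (+ optional Dropout), final Linear -> logits.
        self.head = nn.Sequential(
            nn.Linear(2 * embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),  # optional
            nn.Linear(hidden_dim, mod_value),
        )

    def forward(self, a, b):
        # Concatenate the two embeddings, then map to logits over mod_value classes.
        x = torch.cat([self.embed(a), self.embed(b)], dim=-1)
        return self.head(x)

model = ModAddMLP()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=1e-4)
```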
Is this a record?