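The line 'grad = grad @ layers[i][0].T' should come before the update lines 'layers[i][0] -= w_grad * lr' and 'layers[i][1] -= b_grad * lr'. As written, the weights are updated before the gradient is pulled back through the layer, so the gradient handed to the previous layer is computed with the already-modified weights instead of the weights that were used in the forward pass.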
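
To make the ordering concrete, here is a minimal sketch of what the corrected loop could look like. I can't see the rest of your code, so several things here are assumptions: the function name backward_linear_layers, the activations list (the input each layer saw in the forward pass), the way w_grad and b_grad are computed, and the absence of any activation function between layers. Only the three lines taken from your snippet are yours; adapt the rest to your setup.

```python
import numpy as np

def backward_linear_layers(layers, activations, grad, lr):
    """Backpropagate through purely linear layers and apply SGD updates.

    Assumed layout (hypothetical, not taken from the original code):
      layers[i]      = [W, b], W of shape (n_in, n_out), b of shape (n_out,)
      activations[i] = input fed to layer i in the forward pass, (batch, n_in)
      grad           = dLoss/dOutput of the last layer, (batch, n_out)
    """
    for i in reversed(range(len(layers))):
        # Parameter gradients for this layer, from its input and the incoming grad.
        w_grad = activations[i].T @ grad
        b_grad = grad.sum(axis=0)

        # Pull the gradient through the layer FIRST, while layers[i][0]
        # still holds the weights that were used in the forward pass.
        grad = grad @ layers[i][0].T

        # Only now apply the update (in place on the float arrays).
        layers[i][0] -= w_grad * lr
        layers[i][1] -= b_grad * lr

    # Gradient with respect to the network input.
    return grad
```

If the two update lines come first, 'grad @ layers[i][0].T' uses the new weights, and every earlier layer receives a slightly wrong gradient. The error grows with the learning rate, which is why it can look like the network "sort of" trains with a small lr and falls apart with a larger one.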