Could you explain why inverse sigmoid is applied to the refpoint? Also, when we get the output of the transformer's decoder layer, the same thing is done to the reference points.
self.refpoint_embed.weight.data[:, :2].uniform_(0, 1)
self.refpoint_embed.weight.data[:, :2] = inverse_sigmoid(self.refpoint_embed.weight.data[:, :2])
And
tmp[..., :self.query_dim] += inverse_sigmoid(reference_points)
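
To make the question concrete, here is a minimal sketch of how I read the pattern, in plain Python on a single coordinate. The `sigmoid` and `inverse_sigmoid` definitions below (including `eps`) are my own assumption of the usual clamp-and-log form, and `ref` and `delta` are made-up values, not taken from the repository.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def inverse_sigmoid(p, eps=1e-5):
    # Assumed form: clamp p to [0, 1], then keep both the numerator and
    # the denominator at least eps so the log stays finite.
    p = min(max(p, 0.0), 1.0)
    return math.log(max(p, eps) / max(1.0 - p, eps))

# A reference coordinate stored in normalized [0, 1] space (made-up value).
ref = 0.25
# Hypothetical unbounded offset, standing in for the decoder output `tmp`.
delta = 0.8

# Map the reference to logit space, add the offset there, then map back.
refined = sigmoid(inverse_sigmoid(ref) + delta)

# Without an offset, the round trip returns the original coordinate.
round_trip = sigmoid(inverse_sigmoid(ref))

print(refined, round_trip)
```

If this reading is right, the addition happens in logit space so that the refined point still lands inside (0, 1) after the final sigmoid, whatever the size of the offset. Is that the intended reason?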