Coordinates were not meant to be discrete, you risk class imbalance and no intrinsic knowledge of number line or a loss that caters to that, just use a linear head to output 4 coordinates with an MSE loss. Fourier embed at input side. Here’s a paper we did on that
This #CVPR2026 paper from our research team is trending #1 on @HuggingFace 🤗
Meet LocateAnything: a vision-language detection model that rethinks bounding box prediction. For AI agents and robots, “seeing” is only useful if a model can pinpoint where something is fast enough to act.
Trained on 138M high-quality samples, LocateAnything decodes bounding boxes in parallel instead of one coordinate at a time, improving localization accuracy while dramatically increasing throughput for visual grounding and detection.
Project page: https://nvda.ws/4dKSohb
