NVIDIA releases LocateAnything-3B, a vision-language model that predicts coordinates in parallel to speed up agent localization

Very interesting work! It should open up plenty of room for research in fast VLMs/VLAs :)

This #CVPR2026 paper from our research team is trending #1 on @HuggingFace 🤗

Meet LocateAnything: a vision-language detection model that rethinks bounding box prediction. For AI agents and robots, “seeing” is only useful if a model can pinpoint where something is fast enough to act.

Trained on 138M high-quality samples, LocateAnything decodes bounding boxes in parallel instead of one coordinate at a time, improving localization accuracy while dramatically increasing throughput for visual grounding and detection.

Project page: https://nvda.ws/4dKSohb
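As a rough illustration of why decoding in parallel can raise throughput, the toy count below compares decoder forward passes for one-coordinate-at-a-time decoding against block-wise decoding. It is a minimal sketch, not the model's actual decoding loop: the function names, the four-tokens-per-box layout, and the `block_len` parameter are all assumptions made for illustration.

```python
import math


def sequential_steps(num_boxes: int, tokens_per_box: int = 4) -> int:
    """Forward passes when every coordinate token is decoded one at a time.

    Assumes (hypothetically) that each box is written as 4 coordinate tokens.
    """
    return num_boxes * tokens_per_box


def blockwise_steps(num_boxes: int, tokens_per_box: int = 4, block_len: int = 4) -> int:
    """Forward passes when `block_len` tokens are emitted per pass.

    `block_len` is an illustrative knob, not a documented model setting.
    """
    total_tokens = num_boxes * tokens_per_box
    return math.ceil(total_tokens / block_len)


if __name__ == "__main__":
    boxes = 10
    print("sequential passes:", sequential_steps(boxes))
    print("block-wise passes:", blockwise_steps(boxes))
```

Under these assumptions, fewer passes per image is where the throughput gain would come from; the real speedup also depends on per-pass cost, which this count ignores.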

Zhiding Yu@ZhidingYu

Thank you NVIDIA!

I will be presenting LocateAnything at #CVPR2026 at the NVIDIA Booth:
June 5, 4:20 - 4:40 pm MDT (Friday)
June 6, 2:00 - 2:20 pm MDT (Saturday)

You're welcome to join! See you at CVPR!

Project Page: https://research.nvidia.com/labs/lpr/locate-anything/
Tech Report: https://huggingface.co/papers/2605.27365
Model: https://huggingface.co/nvidia/LocateAnything-3B
Code: https://github.com/NVlabs/Eagle

pratham bhatnagar@prathamqq

@NVIDIAAI @huggingface Built a solution using the same tech to help a client analyze exactly how much a shelf is earning in just 30 seconds.

It saves their team hours of manual audits and instantly provides the data needed to optimize product positioning.

NVIDIA AI@NVIDIAAI

@huggingface Check out all of our @CVPR papers, sessions, and events https://nvda.ws/4dD3nZY

Ethan@torchcompiled

Not only are you outputting just a single token and avoiding the sampling-from-the-marginals problem they have, but it also adds only a single token to your sequence length. The damning figure in the paper is this: naive sampling does worse than the baseline.
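One way to read the marginal-sampling concern above (an interpretation, not something taken from the paper) is that sampling each coordinate independently from its own marginal can mix coordinates from different boxes. The toy below uses a made-up two-box distribution over `(x1, x2)` and measures how much probability an independent-per-coordinate sampler puts on combinations the true joint never produces. All names and numbers are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Hypothetical joint over (x1, x2): two equally likely boxes.
JOINT = {(10, 20): 0.5, (60, 70): 0.5}


def marginals(joint):
    """Per-coordinate marginal distributions of a joint over tuples."""
    n = len(next(iter(joint)))
    margs = [defaultdict(float) for _ in range(n)]
    for outcome, p in joint.items():
        for i, value in enumerate(outcome):
            margs[i][value] += p
    return margs


def independent_product(margs):
    """Distribution obtained by sampling every coordinate independently."""
    out = {}
    for combo in product(*(m.items() for m in margs)):
        values = tuple(v for v, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        out[values] = prob
    return out


def off_support_mass(joint, approx):
    """Probability the approximation assigns to outcomes absent from the joint."""
    return sum(p for outcome, p in approx.items() if outcome not in joint)


if __name__ == "__main__":
    approx = independent_product(marginals(JOINT))
    print(off_support_mass(JOINT, approx))
```

In this toy, half of the independently sampled pairs are mixtures such as `(60, 20)`, where the right edge lands left of the left edge.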

Miles AI Wizard@MilesDigitek

@ZhidingYu @NVIDIAAI @huggingface 2.5x throughput with better high-IoU accuracy. Hybrid mode (fast PBD with NTP fallback) is the key architectural choice. Curious how block length L=6 holds up in really dense scenes.
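For readers less familiar with the "high-IoU accuracy" mentioned above: IoU (intersection over union) is the standard overlap score between a predicted and a ground-truth box, and accuracy at a high IoU threshold rewards tight localization. The sketch below is a generic illustration with hypothetical helper names and an assumed `(x1, y1, x2, y2)` box format; it is not the paper's evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, targets, threshold):
    """Fraction of paired predictions whose IoU meets the threshold."""
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    targets = [(0, 0, 10, 10), (5, 0, 15, 10)]
    print(accuracy_at_iou(preds, targets, threshold=0.5))
```

Raising `threshold` toward 0.9 or higher makes the metric sensitive to small coordinate errors, which is why it is a useful check on any faster decoding scheme.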

Zizheng Pan@zizhpan

Amazing speed and quality.

Abhishek Satarkar@goofyslaveowner

@NVIDIAAI @huggingface how's the latency on LocateAnything compared to existing detectors like YOLO?

mario@veloxcity

@NVIDIAAI @huggingface we need to name these 3 cute cats

Amine@AmineTX

@NVIDIAAI @huggingface I’ve been thinking about vibe coding an app that lets you take a photo of your pantry, then automatically creates a shopping list or orders what you’re missing.

TaraT@tarat_211

@NVIDIAAI @huggingface i remember how slow and weird this problem used to be back in 2020. we've come so far ahead.

Exo Research@Exo_Researcv5

@NVIDIAAI @huggingface It hits the exact pressure point AI agents and robots have been struggling with: fast, precise localization.

Bounding boxes decoded in parallel instead of one coordinate at a time is a quiet revolution.

It means agents can act at the speed the world actually moves.

SomacoSF@somaco_sf

@NVIDIAAI @huggingface So how can we get this bundled onto a new Jetson Orin Nano Super Developer Kit? And we need to find the best input eyes. It would be great to set up old cell phones so they can just join your little detection group for video... not a swarm... oh I know, A FLOCK

Run on orin

SomacoSF@somaco_sf

@NVIDIAAI @huggingface And have a FLOCK of your old cameras/whatever devices you have join and have the model running on ORIN take input from all your streams and classify everything your orbit saw in a day.

Mike Gannotti@MichaelGannotti

@NVIDIAAI @huggingface Amazing work with the vision model

Jacky@fujinumagic

@NVIDIAAI SWEET....?

Okino Chills@ChillsOkin6790

@NVIDIAAI @huggingface Why are these boxes popping up one at a time? How many classes, and what accuracy? Nobody wants your shit models; people can train them themselves.

Aryan@arynnsgh

@NVIDIAAI @huggingface AR decoding for bbox coords was always an odd fit, the four numbers aren't sequentially dependent in any real way. curious what the throughput gap looks like at matched mAP vs an AR baseline.

Synthella@synthellanexus

@NVIDIAAI @huggingface You're not just building faster chips, you're systematically constructing the entire vertical stack for the agentic era. From silicon to optimized inference and tooling. The depth of integration is becoming a serious competitive advantage.

Jang Hyun (Vincent) Cho@vincent_jh_cho

@NVIDIAAI @huggingface I love the paper, but your model is labeling the sushi as “SWEET” 😅😅😅 (same goes for “candle”)
