Reasoning over visual content remains an open problem, and we are sharing our first study on formalizing visual reasoning. In this blog post we walk through two reasoning capabilities (backtracking and spatial grounding) in a case study of a visual game (Set!).
Perceptron AI releases research formalizing visual reasoning, using the card game Set to show how spatial grounding improves reinforcement learning
Initial reasoning strategies were found to shape reinforcement learning outcomes.
Starting to share deeper portions of our research. The reasoning strategy a base model starts with determines where RL ends up; for vision, grounding wins.
We taught a model to play Set! to explore which reasoning strategies make a model successful during RL. A base model's initial reasoning strategies determine the final outcome of RL, and for visual problem solving, visually grounded reasoning is superior. 🧵
@ArmenAgha One of my fav games :)
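For readers who have not played Set, the rule the model has to apply can be written down in a few lines. This is a minimal sketch assuming the standard rules of the card game (four attributes per card, and three cards form a set when each attribute is either all the same or all different). The names `Card`, `is_set`, and `find_sets` are illustrative and are not taken from the blog post.

```python
from itertools import combinations
from typing import NamedTuple


class Card(NamedTuple):
    # Hypothetical card representation: four attributes, three values each.
    number: int
    color: str
    shape: str
    shading: str


def is_set(a: Card, b: Card, c: Card) -> bool:
    # For every attribute the three values must be all equal (1 distinct
    # value) or all different (3 distinct values); exactly 2 is invalid.
    return all(len({x, y, z}) != 2 for x, y, z in zip(a, b, c))


def find_sets(board: list[Card]) -> list[tuple[Card, Card, Card]]:
    # Brute-force search over every trio of cards on the board.
    return [trio for trio in combinations(board, 3) if is_set(*trio)]
```

A brute-force search is fine here because a typical board has only a dozen or so cards; the hard part for a vision model is reading the cards correctly, not the combinatorics.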

Backtracking and verification are essential for effective reasoning, even under the SFT condition alone. We also find that grounded reasoning trains more stably for longer.

Building off work by @gandhikanishk @noahdgoodman @achakravarthy01 et al., we expand into the visual domain by constructing reasoning chains to elicit the desired reasoning capabilities: backtracking vs. no backtracking, and grounded (via bounding boxes) vs. not grounded.
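The two-by-two design described above can be illustrated with a small generator of reasoning chains. This is only a sketch under assumptions: the text format of the chains, the `[x1, y1, x2, y2]` box notation, and the names `Condition`, `describe`, and `build_chain` are all hypothetical and are not the format used in the study.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Condition:
    backtracking: bool  # include a failed attempt followed by a retry
    grounded: bool      # attach a bounding box to every card mention


def describe(trio, boxes, grounded):
    # Render a trio of card names, optionally with their boxes.
    parts = []
    for name in trio:
        if grounded:
            x1, y1, x2, y2 = boxes[name]
            parts.append(f"{name} at [{x1}, {y1}, {x2}, {y2}]")
        else:
            parts.append(name)
    return ", ".join(parts)


def build_chain(cond, wrong_trio, right_trio, boxes):
    # Assumed chain layout: an optional wrong attempt, then the answer.
    lines = []
    if cond.backtracking:
        lines.append(f"Try: {describe(wrong_trio, boxes, cond.grounded)}.")
        lines.append("Check: not a valid set. Backtrack.")
    lines.append(f"Try: {describe(right_trio, boxes, cond.grounded)}.")
    lines.append("Answer: valid set.")
    return "\n".join(lines)


# The four training conditions: every combination of the two switches.
CONDITIONS = [Condition(b, g) for b, g in product([False, True], repeat=2)]
```

Keeping the two switches independent means the same board and the same correct answer can be rendered four ways, so any difference after training can be attributed to the reasoning style rather than to the data.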

Read more here: https://www.perceptron.inc/blog/teaching-vlms-to-think-visually

We find that, without grounding, the most common failure on out-of-distribution (OOD) board configurations is hallucinating cards that don’t exist. We also see that grounding shifts attention mass in the CoT and the answer towards the image.
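Both observations can be turned into simple measurements. The sketch below shows one plausible way to do it, not the method used in the study: `hallucinated_cards` flags cards mentioned in a reasoning chain that are absent from the board, and `image_attention_mass` computes the share of one token's attention that lands on image tokens. Both function names and the input formats are assumptions made for illustration.

```python
def hallucinated_cards(cited, board):
    # Cards the model mentioned that are not actually on the board.
    # Assumes cards are hashable values (e.g. strings or tuples).
    on_board = set(board)
    return [card for card in cited if card not in on_board]


def image_attention_mass(weights, is_image):
    # weights: attention from one generated token to each context token.
    # is_image: parallel list of booleans marking image tokens.
    if len(weights) != len(is_image):
        raise ValueError("weights and is_image must have the same length")
    total = sum(weights)
    if total == 0:
        return 0.0
    on_image = sum(w for w, img in zip(weights, is_image) if img)
    return on_image / total
```

Averaging `image_attention_mass` over the tokens of the CoT and of the answer separately would give two numbers per model that can be compared between the grounded and ungrounded conditions.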