Last keynote: "Scaling Laws vs. Neural Laws: Toward More Natural Artificial Vision" kicking off now with @tserre
Some users express enthusiasm for throwbacks to historical computer vision papers on hierarchical vision models in Thomas Serre's CVPR keynote on scaling laws versus neural laws.

starting with the tl;dr: as we in computer vision are both benchmaxxing and (successfully) improving real performance, we are making our vision systems "more artificial" (less human-like).
this talk will advocate ways to make our artificial vision systems less artificial

the analysis was repeated in 2026 using many models from timm (h/t @wightmanr).
neural alignment is computed between primates and the models.
we see that CNNs (red dots) and especially w/ adversarial training (yellow dots) show positive correlation between perf and alignment
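a rough sketch of how a "perf vs. alignment" trend like this could be quantified: one rank correlation per model family. the talk did not say which statistic was used, so Spearman is an assumption here, and the numbers below are toy placeholders, not values from the slides.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation between two score lists.

    Simplification: assumes no tied values (ties would need average ranks).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # argsort of argsort turns raw values into 0-based ranks
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    # Pearson correlation of the ranks is the Spearman correlation
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# toy placeholder scores for four hypothetical models (NOT from the talk)
accuracy = [0.55, 0.62, 0.70, 0.74]
alignment = [0.20, 0.24, 0.29, 0.33]
rho = spearman(accuracy, alignment)  # positive when alignment rises with accuracy
```

a positive value corresponds to the "better models are more brain-like" regime; a negative value would correspond to a reversed trend.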

so if these recurrent models are so good, why haven't we heard of them?
well, training and scaling RNNs is (comparatively) hard.
first attempt: do some network surgery, adding recurrent layers into existing ViT models and fine-tuning a bit

starting with a demo: animal and non-animal images are flashed very briefly, showing that we can all still identify them

visual importance data was also collected from humans by crowdsourcing from volunteers.
we see (via gradient-based saliency) that modern models attend quite differently than humans (more diffuse, less interpretable).
human <-> model results mirror primate results closely

Quick demo to assess the gist capabilities of recognizing animal vs non-animal from a few ms of image display. Who remembers the GIST descriptors from the 2000s? #cvpr2026

Enter State Space Models (gated delta net), which allow parallelizing over the sequence while preserving recurrence.
SSM / GDN yields a new pareto frontier for these models
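for reference, a minimal sequential sketch of the gated delta rule as I understand it from the Gated DeltaNet literature: S_t = a_t * S_{t-1} (I - b_t k_t k_t^T) + b_t v_t k_t^T, with readout o_t = S_t q_t. this is an assumption about the general technique, not the speaker's implementation; the function name and argument layout are made up for illustration, and real implementations use a chunked parallel form rather than this step-by-step loop.

```python
import numpy as np

def gated_delta_scan(q, k, v, alpha, beta):
    """Step-by-step reference scan of a gated delta rule (illustrative only).

    q, k: (T, d_k) queries and keys (keys assumed unit-norm)
    v:    (T, d_v) values
    alpha, beta: (T,) decay gate and write strength, each in [0, 1]
    Returns the outputs, shape (T, d_v).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))      # recurrent state: a key -> value memory
    eye = np.eye(d_k)
    out = np.zeros((T, d_v))
    for t in range(T):
        kt = k[t]
        # decay the old state, erase what was stored under kt, then write v[t]
        erase = eye - beta[t] * np.outer(kt, kt)
        S = alpha[t] * S @ erase + beta[t] * np.outer(v[t], kt)
        out[t] = S @ q[t]         # read the memory with the query
    return out
```

with alpha = beta = 1 and a repeated unit-norm key, the second write fully overwrites the first, which is the "delta" behaviour: the memory stores a correction instead of accumulating values.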

Loving these throwbacks to papers from computer vision's past: hierarchical vision models #cvpr2026

one might think this only applies to CNNs (because architecturally it is true that CNNs grow receptive field with depth), but even models with global attention at each layer (ViT) empirically fail at the pathfinder task

interestingly - it became clear that abandoning biological inspiration not only improved absolute performance, but also improved correlation with observed activations in primate visual systems (at least at first).
this eval was performed in 2018

of course AlexNet changed everything. Not only in CV, but also in neuro-bio for vision

some history: pre-AlexNet there were some biologically inspired hierarchical vision models

the task: positive = "2 dots on the same contour", negative = "dots on different contours". as contour length grows, this is easier to learn with RNNs than with feed-forward networks, because feed-forward networks only grow their receptive field with depth.
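a back-of-the-envelope sketch of the receptive-field argument, assuming a plain stack of stride-1, undilated convolutions (the helper names are hypothetical). straight-line pixel extent is only a rough proxy here, since a curved contour is longer than the distance between its endpoints.

```python
def receptive_field(num_layers, kernel_size):
    """Receptive field (in pixels) of stacked stride-1, undilated convs."""
    return 1 + num_layers * (kernel_size - 1)

def layers_to_cover(extent, kernel_size):
    """Smallest number of such layers whose receptive field spans `extent` pixels."""
    needed = max(0, extent - 1)
    step = kernel_size - 1
    return -(-needed // step)  # ceiling division using integers only

# each extra 3x3 layer adds 2 pixels of reach, so depth must grow linearly
# with the extent to be covered
depth_for_short = layers_to_cover(15, 3)
depth_for_long = layers_to_cover(151, 3)
```

a recurrent layer that reapplies the same kernel for T steps reaches the same 1 + T * (kernel_size - 1) extent with a fixed set of weights, which is one way to read the claim that recurrence copes better with longer contours.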

so is the image on the right positive or negative?

Visual illusions (resulting from internal feedback / recurrent / "horizontal" connections in the brain)
seems like a bug: maybe actually a feature?

2nd part of talk:
Architectural capacity vs. recurrent dynamics
assertion: we don't have a deep multistage pipeline with distinct layers in our brains. we have a small, fixed number of stages, but neurons / stages can communicate locally and propagate signals in a reentrant way

skipping a few results - will add back later (want to catch up to live)
but tl;dr - human salience is stable under viewpoint changes, not so for models. adding certain proxy tasks (e.g. next-frame prediction) increases model salience stability.

however, explicitly supervising ViT gradients succeeds in aligning importance between models and humans (leftmost column)

however, as we move closer to SOTA, web scale data, ViT, etc., the trend reverses and we see fairly strong negative correlation between ImageNet multi-label accuracy and Neural alignment

The AlexNet moment was pivotal not only to the computer vision community, but also to neuroscience researchers. Better AI initially meant better neuroscience models #cvpr2026