Lucas Beyer, Vision Transformer researcher, details his 2013 experiments using von-Mises loss to predict continuous angles from discrete labels

The approach reduced angular regression error to 29.4 degrees.

The first big insight is that, as we should all know, linear regression is doing MLE with a Gaussian. I learned that there exists such a thing as a Gaussian defined on the circle: the von-Mises distribution. So I turn it into a loss, and yay, massive improvements!

12:29 PM · May 26, 2026
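To make the "Gaussian on the circle" idea concrete, here is a minimal sketch of a von Mises negative log-likelihood used as a loss. It is an illustration only, not necessarily the exact form used in the paper: the function name `von_mises_nll` and the fixed concentration `kappa` are assumptions of this sketch.

```python
import numpy as np

def von_mises_nll(pred_angle, target_angle, kappa=1.0):
    """Negative log-likelihood of target_angle under a von Mises
    distribution centred on pred_angle (both in radians).

    kappa is the concentration; treating it as a fixed hyperparameter
    is an assumption of this sketch.
    """
    # Log of the normaliser 2*pi*I0(kappa); np.i0 is the modified
    # Bessel function of the first kind, order 0.
    log_norm = np.log(2.0 * np.pi * np.i0(kappa))
    return -kappa * np.cos(target_angle - pred_angle) + log_norm

# 359 deg and 1 deg are only 2 deg apart on the circle.
wrap = von_mises_nll(np.deg2rad(359.0), np.deg2rad(1.0))
near = von_mises_nll(np.deg2rad(3.0), np.deg2rad(1.0))
far = von_mises_nll(np.deg2rad(181.0), np.deg2rad(1.0))
```

Because the loss depends only on the cosine of the angular difference, `wrap` and `near` come out equal, whereas a plain squared error on the raw numbers would treat 359 vs 1 as a huge miss.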

The second big insight is that the "output space" for the model, a single number, is awkward.

From my time doing 3D graphics, I know that Quaternions (4-number vectors) are so much better for representing angles in 3D space. So I made a "2D quaternion", which I naturally call a "Biternion", and used that for the output parametrization.

Oh wow, another huge gain from giving the model a more natural output space!
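As a rough sketch of what such an output space can look like: represent an angle as the unit 2-vector (cos, sin), normalise the model's raw 2-number output onto the unit circle, and recover the angle with `arctan2`. All function names below are hypothetical and chosen for illustration.

```python
import numpy as np

def angle_to_biternion(angle_rad):
    """Encode an angle (radians) as a unit 2-vector (cos, sin)."""
    return np.stack([np.cos(angle_rad), np.sin(angle_rad)], axis=-1)

def normalize_to_biternion(raw):
    """Project raw 2-number model outputs onto the unit circle."""
    raw = np.asarray(raw, dtype=float)
    norm = np.linalg.norm(raw, axis=-1, keepdims=True)
    # Guard against division by zero for a degenerate (0, 0) output.
    return raw / np.maximum(norm, 1e-12)

def biternion_to_angle(b):
    """Decode a unit 2-vector back to an angle in (-pi, pi]."""
    b = np.asarray(b, dtype=float)
    return np.arctan2(b[..., 1], b[..., 0])

def biternion_cosine(pred, target):
    """Dot product of two unit 2-vectors = cosine of the angle between them."""
    return np.sum(np.asarray(pred) * np.asarray(target), axis=-1)
```

One nice property of this parametrization is that there is no jump at 0/360 degrees, and the dot product of prediction and target directly gives the cosine term a von Mises style loss needs.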

However, because of the robot's first-person perspective, this dataset/model was useless, and it was the only existing dataset.

Luckily this was a large collaboration project across many unis, so each time we got together, I grabbed each of my colleagues and made them walk around in circles in front of the robot to record data.

I then played around with different (self-made) UIs to annotate the data (myself). But everything I could make was tedious, slow, and imprecise. I am and have always been a lazy bum. No way I spend half my PhD annotating this, just to throw it all in the bin when we decide to put the robot's camera elsewhere. I had to come up with something scalable.

After some thinking and prototyping, the only way I'd tolerate annotating that data myself is to dump individual frames/head-crops, then range-select them and classify them into quadrants.

This is fast but coarse. Since the images are dumps from videos, there's continuity, and I could average around 5 img/sec or so.

I did it twice: once for front/left/right/back, and once again for front-left/front-right/back-left/back-right. Then we can merge these to get an 8-bin orientation.

Turns out @PINTO03091, probably the most GOATED labeller in history, converged to the same schema, which is what reminded me of my work yesterday.
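One plausible way to read the merge step: each pass assigns a 90-degree sector, the two sets of sectors are offset by 45 degrees, so their overlap is a 45-degree bin. The sketch below is an illustration under an assumed angle convention (0 deg = front, counter-clockwise, so left = 90 deg); the convention, the dictionaries, and `merge_labels` are not from the original work.

```python
# Assumed convention (not from the source): 0 deg = front, angles grow
# counter-clockwise, so left = 90 deg, back = 180 deg, right = 270 deg.
CARDINAL_CENTER = {"front": 0.0, "left": 90.0, "back": 180.0, "right": 270.0}
DIAGONAL_CENTER = {
    "front-left": 45.0,
    "back-left": 135.0,
    "back-right": 225.0,
    "front-right": 315.0,
}

def merge_labels(cardinal, diagonal):
    """Return the centre (degrees) of the 45-degree bin where the two
    90-degree sectors overlap."""
    a = CARDINAL_CENTER[cardinal]
    b = DIAGONAL_CENTER[diagonal]
    # Signed difference between the sector centres, wrapped to [-180, 180).
    diff = (b - a + 180.0) % 360.0 - 180.0
    if abs(diff) != 45.0:
        # Sectors that are not neighbours do not overlap: inconsistent labels.
        raise ValueError(f"inconsistent labels: {cardinal!r} and {diagonal!r}")
    # The overlap is centred halfway between the two sector centres.
    return (a + diff / 2.0) % 360.0
```

For example, "front" combined with "front-left" gives the bin centred at 22.5 degrees, while "front" with "front-right" gives 337.5 degrees. Contradictory pairs such as "back" with "front-left" raise an error, which could double as a cheap consistency check on the annotations.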

A more "quantitative" evaluation and ablation that this actually works.

This is an "angle heatmap" of predictions. The "simple" but non-smooth thing to do with such discrete data would be softmax classification, then using the probabilities to interpolate into a continuous output. As you can see, that doesn't really become smooth, largely because softmax + deep learning is notoriously over-confident.
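For reference, the softmax baseline described above could look roughly like this: take the class probabilities over the angle bins and compute a probability-weighted circular mean of the bin centres. This is a guess at one reasonable interpolation scheme, not the exact baseline from the paper; the bin centres and function names are assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def interpolate_angle(logits, bin_centers_deg):
    """Probability-weighted circular mean of the bin centres, in [0, 360)."""
    p = softmax(np.asarray(logits, dtype=float))
    centers = np.deg2rad(np.asarray(bin_centers_deg, dtype=float))
    # Average on the unit circle so that 350 deg and 10 deg average to 0 deg.
    x = np.sum(p * np.cos(centers), axis=-1)
    y = np.sum(p * np.sin(centers), axis=-1)
    return np.rad2deg(np.arctan2(y, x)) % 360.0

# Eight 45-degree bins (assumed layout for this sketch).
centers = np.arange(8) * 45.0 + 22.5
tie = interpolate_angle([5.0, 5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0], centers)
sharp = interpolate_angle([20.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], centers)
```

With a perfect tie between two neighbouring bins, `tie` lands halfway between them. But as soon as one logit dominates, as in `sharp`, the output snaps almost exactly onto a bin centre, which is the over-confidence problem in miniature.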

Now it turns out that training with these coarse labels as targets lets the model predict a continuous angle! First, a qualitative example in the pic below: a video of me (a held-out person; it's never seen me) turning in circles, generating a ~smooth two-circle prediction.

That this works (continuous prediction from discrete-label training) is not a coincidence. It's due to four factors coming together:
- The smoothness of CNNs
- The smoothness of the Biternion output space
- The smoothness of the von-Mises loss
- The data being (forcibly/naturally) noisy at the "borders" of the discrete classes.

I never put the paper on arXiv because my plots were too big, I hit arXiv's size limit, and I have since lost the source.

Paper: https://lucasb.eyer.be/academic/biternions/biternions_gcpr15.pdf
Video: https://www.youtube.com/watch?v=5Kbsx7CWxIA
Code: http://github.com/lucasb-eyer/BiternionNet

And I decided to make the slides public today: https://docs.google.com/presentation/d/15US8duAtU1dfWh0YMMaIemir33xYKdQECQgq9DoGLS8/edit?usp=sharing

Thanks a ton to @PINTO03091 for reminding me of this, and appreciating my old work, it warms my heart :)

PS: All the data I collected and labeled was never published and destroyed at the end of the project, sadly, as per regulations of the project grants, because it's (consenting) people's faces.

Compare to what you get with the exact same data/labels, but using the smooth outputs (Biternion+vonMises), it's night and day.

Making the model output/loss naturally fit your problem-space is a game-changer.

This is about this:

kache (@yacineMTB)

if you're doing AI research at all; I recommend doing the "ETH zurich" route Train models that use a single GPU. Make sure that it takes less than a minute to train models. Pufferlib is a great example. The more models you train the more you learn

1:48 PM · May 26, 2026 · 135.9K Views
7:31 PM · May 26, 2026 · 1.7K Views

magic
