The first big insight is that, as we should all know, linear regression is doing MLE with a Gaussian. I learned that there exists such a thing as a Gaussian defined on the circle: the von Mises distribution. So I turned it into a loss, and yay, massive improvements!
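A minimal sketch of what such a loss can look like, assuming angles in radians and a fixed concentration `kappa`. The name `von_mises_nll` and the default `kappa=1.0` are illustrative choices, not necessarily the exact loss used in the experiments; it is simply the negative log-likelihood of the von Mises density exp(kappa * cos(theta - mu)) / (2 * pi * I0(kappa)).

```python
import numpy as np

def von_mises_nll(pred, target, kappa=1.0):
    """Mean negative log-likelihood of target angles under a von Mises
    distribution centred at pred. Angles are in radians.

    kappa (concentration) is a hypothetical fixed hyperparameter here;
    it plays roughly the role of 1/variance in the Gaussian case.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # log of the normaliser 2*pi*I0(kappa); np.i0 is the modified
    # Bessel function of the first kind, order 0.
    log_norm = np.log(2.0 * np.pi * np.i0(kappa))
    # Only cos(target - pred) enters, so the loss is 2*pi-periodic.
    return float(np.mean(log_norm - kappa * np.cos(target - pred)))
```

Because only the cosine of the difference enters, predicting 359° for a target of 1° costs the same as predicting 1° for a target of 3°, which a squared error on raw angles gets badly wrong.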
First, I found a dataset with head-angle annotations, "TownCentre". This was sometime around 2013, so deep learning had just started entering vision. I was the first to train a net for this.
Then, the simple baselines: linear regression (on the pixels in the bounding box), then add some conv+pool layers, then learn that modulo is useless in a loss function.
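One possible reading of the modulo point (an assumption, since the reason is not spelled out here): `x mod 360` is piecewise linear with slope 1, so wherever its gradient exists it is the same as if the modulo were not there, and at the wrap-around it just jumps. A small sketch in degrees, with hypothetical helper names:

```python
def numeric_grad(f, x, eps=1e-4):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def mod_error(pred_deg, target_deg=1.0):
    """Signed error wrapped into [0, 360) with a modulo."""
    return (pred_deg - target_deg) % 360.0

def plain_error(pred_deg, target_deg=1.0):
    """The same error without any wrapping."""
    return pred_deg - target_deg

# Away from the jump, both errors have the same slope with respect
# to the prediction, so the modulo changes nothing for the optimiser.
grad_mod = numeric_grad(mod_error, 359.0)
grad_plain = numeric_grad(plain_error, 359.0)

# Right at the target the wrapped error jumps by almost 360.
just_below = mod_error(0.999)
just_above = mod_error(1.001)
```

So the network gets no signal that 359° is only 2° away from 1°; it only sees a discontinuity it cannot differentiate through.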
