New work from @YanhongLi2062 digging into the specific tokens that hybrid models predict better than transformers 📈
Spoiler alert: gains are broad across token categories and especially large on content words. Gains diminish on copying tokens, but even there hybrids aren't worse.
Hybrid (transformer–RNN) models are fast becoming a serious alternative to the transformer, but a big question remains: how do they process tokens differently & how does this impact performance?
We compared our transformer (Olmo 3) & hybrid (Olmo Hybrid) models to find out. 🧵