Transformers are better at copying, while RNNs are better at modeling "meaning-bearing words—the nouns, verbs, & adjectives that say what a sentence is about"
Hybrid (transformer–RNN) models are fast becoming a serious alternative to the transformer, but a big question remains: how do they process tokens differently & how does this impact performance?
We compared our transformer (Olmo 3) & hybrid (Olmo Hybrid) models to find out. 🧵