i think people underestimate the value of memorization as a precondition for how much you can learn. for example: if a model hasn't already memorized the gist of some research paper, any shorthand reference to it elsewhere in the data reads like a non sequitur when doing next-token prediction (NTP) over it
@gleech nah you're conflating two things here
language models trained through CE loss are incentivized to memorize random shit, and the stuff you really care about (agency/reasoning/language comprehension) is a tiny fraction of params

