For the record, here is my TLDR on Native Multimodal Models (NMMs)
NMMs will eventually work slightly better. They are more elegant but data hungry
But for very long context tasks, NMMs don't stand a chance.
A dynamic compressor is needed.
Dynamic means that the compression rate depends on input, which is something NMMs can't handle.
This will be true for both visual and textual tasks, and it will be the only way to unlock 100M context lengths in a clean way (there's plenty of hacks to simulate longer context lengths but I'm not talking about those)