How much of the cost of training LLMs (and similar models) is tied to the Transformer architecture and its variants? Is there any reason to expect that an architecture could be 2-3 orders of magnitude cheaper while behaving similarly? Or is there a fundamental limit?
I am not talking about sample efficiency or new capabilities. Just the compute cost.
Of course, the cost depends on the hardware. The question can be relaxed: If we are allowed to change the hardware minimally (*), can we come up with a much cheaper architecture?
(*) By minimally, I mean something that can be designed and mass-produced by the current chip makers.
P.S. I have not been following architecture design efforts, so this question might have a simple answer. I don't want to ask ChatGPT either, at least not yet.