Muyu He proposes Bank of Values, an attention architecture variant that eliminates the V cache in deep layers by using context-free value vectors.
The method outperformed standard attention baselines at 780M scale.
Engram… for Attention?
In our new paper, we derive a new attention variant from a surprising finding: deep layers benefit the most from learning context-free value vectors, with no input from the residual stream.

The attention variant: since the value vector does not depend on the context, it can be learned directly as a sparse model parameter and stored in a table of value vectors for the current layer.

The idea of having a value vector table is not new. Nanochat, for example, has it. But it was always learned as an extra group of weights added to the existing value vector. With the insight that deep layers **only** need a context-free vector, we can rewrite attention completely by making the table the only source of value vectors.

On two model sizes, it significantly outperforms the standard attention baseline on both validation loss and benchmark scores, and slightly surpasses the strictly more expressive nanochat variant.

Besides the boost in performance, some very interesting consequences emerge in terms of memory:

- We do not need a V cache anymore for the deep layers, which saves memory at long context.
- Because we only need the token IDs to fetch value vectors, we can offload the table entirely and just prefetch the relevant entries while previous layers are busy computing.

@YuchenL52766559 and I are fascinated by this attention architecture's unique properties, and are running experiments with it at bigger scales.

Paper: https://github.com/RiddleHe/nanochat/blob/master/papers/bank_of_values.pdf
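For readers who want to see the idea in code, here is a rough single-head sketch of attention where the value vectors come only from a per-layer table indexed by token ID. It is not the paper's implementation: the function name `bank_of_values_attention`, the argument names, the shapes, and the single-head causal setup are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bank_of_values_attention(x, token_ids, W_q, W_k, value_table):
    """Illustrative single-head causal attention with table-only values.

    x:           (T, d_model) residual stream for T tokens
    token_ids:   (T,) integer token IDs
    W_q, W_k:    (d_model, d_head) query/key projections (hypothetical names)
    value_table: (vocab_size, d_head) learned per-layer table (hypothetical name)
    """
    q = x @ W_q
    k = x @ W_k
    # Context-free values: a lookup by token ID, no projection of x.
    # Nothing here depends on the residual stream, so no V cache is needed;
    # the token IDs are enough to re-fetch (or prefetch) the rows.
    v = value_table[token_ids]

    T = x.shape[0]
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    # Causal mask: position i may only attend to positions <= i.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1) @ v
```

Queries and keys are still computed from the residual stream, so the attention pattern stays context-dependent; only the content being mixed is fixed per token. In this sketch the first position attends only to itself, so its output is exactly the table row for its token ID.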