Merging Transformer QKV projections cuts KV cache memory by 50% with a 3.1% perplexity increase · Digg
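To put the 50% figure in context, here is a rough back-of-the-envelope sketch of KV cache sizing. It uses the standard estimate of two cached tensors (K and V) per token, per layer. It assumes, purely for illustration, that the merged projection leaves one cached tensor per token instead of two; the headline does not say how the saving is achieved. The model dimensions (32 layers, 32 KV heads, head size 128, 4096-token context, 16-bit values) and the names `kv_cache_bytes` and `tensors_per_token` are hypothetical and not taken from the source.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2, tensors_per_token=2):
    """Estimate KV cache size in bytes.

    tensors_per_token=2 models the usual separate K and V caches.
    tensors_per_token=1 is an ASSUMED stand-in for a merged projection
    that stores a single tensor per token (not confirmed by the source).
    """
    return (tensors_per_token * layers * kv_heads * head_dim
            * seq_len * batch * bytes_per_elem)


# Hypothetical model shape, chosen only to make the arithmetic concrete.
shape = dict(layers=32, kv_heads=32, head_dim=128, seq_len=4096)

baseline = kv_cache_bytes(**shape)                     # separate K and V
merged = kv_cache_bytes(**shape, tensors_per_token=1)  # assumed merged cache
saving = 1 - merged / baseline

print(f"baseline: {baseline / 2**30:.2f} GiB")
print(f"merged:   {merged / 2**30:.2f} GiB")
print(f"saving:   {saving:.0%}")
```

Under these assumptions the cache drops from 2 GiB to 1 GiB per sequence. The saving scales linearly with context length and batch size, which is why a halved cache matters most for long-context or high-concurrency serving.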