Zhaofeng Wu releases ><former, a variable-width transformer that improves training loss while reducing KV cache size