it is truly amazing we can supervise a trillion parameters from 16 bits of information. the fact that credit assignment across such a massive & discontinuous network is such a non-issue deserves so much reflection
Charles Foster says batch and sequence dimensions facilitate this.
it is truly amazing we can supervise a trillion parameters from 16 bits of information. the fact that credit assignment across such a massive & discontinuous network is such a non-issue deserves so much reflection
Positive users note catastrophic forgetting is nearly a non-issue at scale for trillion-parameter models, while negative users dismiss 16-bit supervision as still excessive.

@willdepue why do you think that is? high dimensions just dont really have that many local minima so it will "eventually work"?

@willdepue @lu_sichu *batch/sequence dimension voice*

@CFGeek @lu_sichu lol

@willdepue Catastrophic forgetting almost a non-issue as well at scale!

@willdepue 16 bits is a metric fuckton
Charles Foster says batch and sequence dimensions facilitate this.
it is truly amazing we can supervise a trillion parameters from 16 bits of information. the fact that credit assignment across such a massive & discontinuous network is such a non-issue deserves so much reflection