Users express surprise at the hidden nonlinearity driving optimization in deep linear networks, describing the finding as wild.
nonlinearities are essentially just implementing piecewise linear functions that compose with depth. SwiGLU was famously called "divine benevolence", but mechanistically you're just cutting out the implicit/"symbolic rule" gating of ReLU and making it directly differentiable via products
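
A rough sketch of the gating point, not anything from the thread itself: ReLU can be rewritten as a hard 0/1 gate multiplied by its input, while SwiGLU (as defined in the GLU-variants paper, Swish(xW) times xV) replaces that hard gate with a smooth, learned one. The function names and the weight matrices `W` and `V` here are illustrative assumptions.

```python
import numpy as np

def relu_as_gate(x):
    # ReLU written as a product: a hard 0/1 gate (the implicit "rule") times the input.
    gate = (x > 0).astype(x.dtype)
    return gate * x

def silu(x):
    # SiLU / Swish with beta = 1: x * sigmoid(x), a smooth stand-in for the hard gate.
    return x / (1.0 + np.exp(-x))

def swiglu(x, W, V):
    # SwiGLU: one linear projection is passed through SiLU and multiplies another.
    # W and V are hypothetical weight matrices chosen for illustration.
    return silu(x @ W) * (x @ V)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 5))
V = rng.normal(size=(3, 5))
print(swiglu(x, W, V).shape)
```

The hard gate has zero gradient almost everywhere, whereas in the product form both branches receive gradient through the multiplication.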
i keep thinking back to deep linear nets theory and the importance of following a sequence of products in the chain rule... optimizing a deep linear network is still beneficially nonlinear in its dynamics even without nonlinearities...
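
To make the deep-linear-net point concrete, here is a toy sketch under simplifying assumptions (scalar weights, a single target, plain gradient descent; the function name and hyperparameters are made up for illustration). The model is just a product of weights, so it is linear in its input, but by the chain rule each weight's gradient is scaled by the product of all the other weights, which makes the training dynamics nonlinear in the parameters.

```python
import numpy as np

def train_scalar_chain(depth, target=1.0, init=0.1, lr=0.1, steps=200):
    """Gradient descent on 0.5 * (prod(w) - target)**2 for a chain of scalar weights."""
    w = np.full(depth, init, dtype=float)
    history = []
    for _ in range(steps):
        prod = np.prod(w)
        history.append(prod)
        err = prod - target
        # Chain rule: d(prod)/d(w_i) is the product of every other weight in the chain.
        grad = np.array([err * np.prod(np.delete(w, i)) for i in range(depth)])
        w -= lr * grad
    return np.array(history)

shallow = train_scalar_chain(depth=1)  # a single weight: error decays geometrically
deep = train_scalar_chain(depth=3)     # three weights in a chain: plateau, then fast rise

# First step at which the end-to-end product reaches half the target.
print(int(np.argmax(shallow >= 0.5)), int(np.argmax(deep >= 0.5)))
```

From the same small initialization, the depth-1 model starts closing the gap immediately, while the depth-3 chain sits near zero for a long stretch before rising quickly, even though neither model contains an activation function.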

@kalomaze Deep linear nets show that order alone drives nonlinear behavior.
Even without nonlinear activations, the chain rule introduces dynamics that mimic real-world learning inefficiencies, showing fundamental limits in current scaling approaches that ignore structural order.

@kalomaze That hidden nonlinearity in linear nets is wild