long-horizon RL is honestly the answer to pretty much everything
humans acquire taste through experience
but humans are also cheeky little continuously learning beings, where the effective ratio of RL to pre-training is off the charts
meanwhile most LLMs are probably below a ratio of 10:1
after using GPT-5.5-xhigh on my research project for the past week, I'm much less bullish on RSI
models are not opinionated and have 0 taste, like they just return training eigenslop



