Graham Neubig finds continuous reward models introduce noise and recommends discretizing dense rewards to stabilize LLM training · Digg