This is actually pretty interesting and surprisingly useful.
Context: Some of us from MSFTResearch were planning to train an SLM with a fixed harness (essentially optimising the model for the harness).
One person raised a valid point: we'd need coding data in addition to RL environments, since an SLM's world knowledge is poor due to size constraints. Without it, the model would not know which libraries to call or how to use them.
I realised the inherent problem with RL here is that the model learns essentially zero world knowledge. This is because the tool-output tokens (what the environment returns) are masked out of the loss, so no external signal gets internalised by the model.
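To make the masking point concrete, here is a minimal sketch, assuming "output tokens" means the tokens returned by the environment (tool or library output) rather than the tokens the model samples. The trajectory layout, the role names, and the helpers `build_loss_mask` and `masked_policy_loss` are all hypothetical and not taken from any particular RL framework.

```python
from typing import List, Tuple

# Hypothetical trajectory layout: (role, tokens) segments.
# "model" = tokens sampled by the policy.
# "tool"  = tokens returned by the environment (e.g. a library error message).
Segment = Tuple[str, List[str]]


def build_loss_mask(trajectory: List[Segment]) -> List[int]:
    """Return 1 for tokens that contribute to the loss, 0 for masked tokens."""
    mask: List[int] = []
    for role, tokens in trajectory:
        mask.extend([1 if role == "model" else 0] * len(tokens))
    return mask


def masked_policy_loss(logprobs: List[float], advantage: float, mask: List[int]) -> float:
    """REINFORCE-style loss averaged over unmasked (model-generated) tokens only."""
    n = sum(mask)
    if n == 0:
        return 0.0
    return -sum(lp * advantage * m for lp, m in zip(logprobs, mask)) / n


trajectory = [
    ("model", ["import", "numpy", "np.foo()"]),
    ("tool", ["AttributeError:", "no", "attribute", "'foo'"]),
    ("model", ["np.zeros(3)"]),
]
mask = build_loss_mask(trajectory)

# Placeholder per-token log-probabilities under the current policy.
logprobs = [-0.5] * len(mask)
loss = masked_policy_loss(logprobs, advantage=1.0, mask=mask)

# Changing the log-probabilities of the tool tokens leaves the loss unchanged,
# so the content of the error message never produces a gradient.
perturbed = list(logprobs)
for i, m in enumerate(mask):
    if m == 0:
        perturbed[i] = -9.0
loss_perturbed = masked_policy_loss(perturbed, advantage=1.0, mask=mask)
```

In this sketch the `AttributeError` text is exactly the kind of library knowledge the model would need, but every token in it has mask 0, so it only reaches the model indirectly through the scalar reward.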
A very simple fix we're considering is ECHO, which would let us get away with just RL environments (since the model would learn library behaviour through its own exploration).
http://x.com/i/article/2056344151235387392