Our new podcast on evals, with Max Niederman, Ege Erdil, and Stephen Yang.
0:00:00 – What's an eval, and how's it different from an RL environment?
0:19:33 – Why are models bad at building an emulator when the task is fully verifiable?
0:42:00 – How does training on bad data teach models to write terrible code?
1:04:00 – Why is continual learning still so bad?
2:25:24 – Why haven't software engineers been replaced when coding is basically solved?
Listen to the Mechanize Podcast on YouTube, Spotify, etc. Enjoy!




