14h ago
SWE-bench creator Ofir Press says sandbox environments and data cleaning drive 40% of recent AI coding gains
Custom human-made evaluation tasks can cost up to $500,000.