AI Science Models Fail Basic Benchmark Tasks 20 Percent of Time · Digg