1d ago
Opus 4.8 sets a BullshitBench record for resisting sycophancy and pushing back against incorrect user prompts
The high scores may require harder benchmark evaluation criteria.