22h ago

METR's Daniel Filan argues 'computer use evals' should be renamed to 'GUI use evals' to improve benchmark precision

David Manheim says current benchmarks conflate GUI use with computer use.
