BenchPress estimates LLM performance across multiple benchmarks within 3.9% error using just five core evaluations · Digg