wow
Introducing EdgeBench, a benchmark designed to study how agents learn from environments over runs lasting 12 to 72 hours. We find that performance is fit with high precision by a log-sigmoid function of environment interaction time.
EdgeBench is built with three ingredients:
- 🌍 Real & Diverse: 134 real-world tasks across 6 task categories, spanning scientific problems, professional knowledge work, software engineering, optimization, formal math, and games.
- ⏳ Ultra-Long-Horizon: Each task supports 12–72 hours of agent work. Recorded human effort averages 57.2 hours.
- 🔁 Informative Feedback: Agents receive real-world feedback for continuous improvement.
After 38,000 hours of agent runs on EdgeBench, a scaling law for learning from environments emerges:
- 📈 As agents interact with task environments over time, their aggregate performance is precisely fit by a log-sigmoid function.
- 🧠 This phenomenon can be explained by an elegant theory of graph exploration.
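For intuition, here is a minimal sketch of what fitting such a curve could look like. The thread does not state the functional form, so this assumes one plausible reading: a sigmoid applied to log-time. The function names, the parameters (`ceiling`, `slope`, `midpoint_hours`), and the data points are all hypothetical and are not taken from EdgeBench or its paper.

```python
import math

def log_time_sigmoid(t_hours, ceiling, slope, midpoint_hours):
    """Assumed form: a sigmoid in log-time, reaching ceiling/2 at midpoint_hours."""
    z = slope * (math.log(t_hours) - math.log(midpoint_hours))
    return ceiling / (1.0 + math.exp(-z))

def fit_log_time_sigmoid(times, scores, ceiling=1.0):
    """Estimate slope and midpoint by linear least squares on the logit scale.

    Requires every score to lie strictly between 0 and ceiling.
    """
    xs = [math.log(t) for t in times]
    ys = [math.log(s / (ceiling - s)) for s in scores]  # logit of score/ceiling
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # logit = slope * ln(t) - slope * ln(midpoint), so solve for the midpoint.
    midpoint_hours = math.exp(-intercept / slope)
    return slope, midpoint_hours

# Synthetic, made-up points generated from the assumed curve (not real results).
times = [1, 2, 4, 8, 16, 32, 64]
scores = [log_time_sigmoid(t, 1.0, 1.2, 10.0) for t in times]
slope, midpoint = fit_log_time_sigmoid(times, scores)
```

Because the synthetic points are noise-free, the fit recovers the parameters used to generate them. Real measurements would be noisy, and the ceiling would likely need to be estimated too, for example with a nonlinear least-squares routine.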
We are releasing an initial 51 of the 134 tasks, together with the full evaluation framework, to help advance long-horizon agent research. Check our blog & paper for more findings!
Blog: https://edge-bench.org/
Paper: https://edge-bench.org/paper.pdf
GitHub: https://github.com/ByteDance-Seed/EdgeBench
Dataset: https://huggingface.co/datasets/ByteDance-Seed/EdgeBench
Details below 👇🧵









