ByteDance-Seed releases EdgeBench, showing AI agent performance follows a log-sigmoid scaling law over 38,000 hours of runs

this is probably the most important benchmark since METR time horizons

sick stuff

ByteDance Seed cooked

They developed a new ultra-long horizon benchmark for studying how agents learn in 134 real-world environments over day-long horizons

what they found:
- learning-speed doubles every 3 months
- overall performance follows a log-sigmoid scaling law as a function of environment interaction time
- "The improvement is not explained by repeated sampling alone: accumulating and reusing task experience drives progress beyond what independent restarts achieve." (see section 5.2 in paper)
- "A longer context yields a consistent multi-point gain throughout the 12-hour window (Figure 12b). The 1M-context Opus 4.8 stays above the 200k variant at every checkpoint" (see section 5.3 in paper)
- Opus 4.8 > GPT-5.5

highly recommend looking at their page and reading the paper:
- https://edge-bench.org/
- https://edge-bench.org/paper.pdf

Eric Alcaide@eric_alcaide

Really cool to see it

Deyao Zhu@tikgiau

Introducing EdgeBench, a benchmark designed to study how agents learn from environments over runs of 12 to 72 hours. We find that performance follows a log-sigmoid function of environment interaction time with high precision.

EdgeBench is built with three ingredients:

- 🌍 Real & Diverse: 134 real-world tasks across 6 task categories, spanning scientific problems, professional knowledge work, software engineering, optimization, formal math, and games.
- ⏳ Ultra-Long-Horizon: Each task supports 12–72 hours of agent work. Recorded human effort averages 57.2 hours.
- 🔁 Informative Feedback: Agents receive real-world feedback for continuous improvement.

After 38,000 hours of agent runs on EdgeBench, a scaling law for learning from environments emerges:

- 📈 As agents interact with task environments over time, their aggregate performance is precisely fit by a log-sigmoid function.
- 🧠 This phenomenon can be explained by an elegant theory of graph exploration.

We are releasing an initial 51 of the 134 tasks, together with the full evaluation framework, to help advance long-horizon agent research. Check our blog & paper for more findings!

- Blog: https://edge-bench.org/
- Paper: https://edge-bench.org/paper.pdf
- GitHub: https://github.com/ByteDance-Seed/EdgeBench
- Dataset: https://huggingface.co/datasets/ByteDance-Seed/EdgeBench

Details below 👇🧵

Deyao Zhu@tikgiau

[8/n] Here’s a time-aware leaderboard showing best-so-far performance at 2h, 6h, and 12h.
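
For readers new to the metric: best-so-far at a checkpoint is just the highest score among attempts completed by that time. A minimal sketch, assuming a run is logged as (elapsed hours, score) pairs; the function name, the log layout, and the sample numbers are hypothetical and are not EdgeBench data or its evaluation code.

```python
def best_so_far(attempts, checkpoints=(2, 6, 12)):
    """Return {checkpoint_hours: best score seen by then, or None}.

    attempts: iterable of (elapsed_hours, score) pairs (assumed log layout).
    """
    ordered = sorted(attempts)  # sort by elapsed time
    result = {}
    best = None
    i = 0
    for cp in sorted(checkpoints):
        # Consume every attempt that finished at or before this checkpoint.
        while i < len(ordered) and ordered[i][0] <= cp:
            score = ordered[i][1]
            best = score if best is None else max(best, score)
            i += 1
        result[cp] = best
    return result


# Hypothetical run: individual scores can dip, but best-so-far never decreases.
run = [(0.5, 40.0), (1.5, 45.0), (3.0, 43.0), (5.0, 52.0), (9.0, 50.0), (11.0, 60.0)]
print(best_so_far(run))  # {2: 45.0, 6: 52.0, 12: 60.0}
```

Reporting the running maximum rather than the latest score means a late regression does not erase earlier progress, which is presumably why a time-aware leaderboard would use it.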

Deyao Zhu@tikgiau

[5/n] Here's a fun theory of learning from environments: a task's score consists of many tiny units, sitting on a graph. Learning advances like a frontier: each node unlocked reveals new unseen neighbors, so progress compounds; but the shrinking pool of unexplored nodes sets the limit. The pace of the frontier depends on both.

In fact, this heuristic admits a formal proof. Let the fraction of explored nodes be x, so the unexplored fraction is (1 − x). Our theory says the frontier expands at a rate proportional to the product x(1 − x). Taking the natural time scale to be u = log(t), the differential equation dx/du = c·x(1 − x) solves to exactly the log-sigmoid law.
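
The step from the differential equation to the closed form can be spelled out by separating variables. This is a sketch: the integration constant u_0 = log t_0 (the time at which half the nodes are explored) is notation introduced here, and the paper's parameterization may differ, for example by scaling x with a maximum score.

```latex
\begin{aligned}
\frac{dx}{du} &= c\,x(1-x)
\;\Longrightarrow\;
\int \frac{dx}{x(1-x)} = \int c\,du
\;\Longrightarrow\;
\log\frac{x}{1-x} = c\,(u - u_0), \\
x(u) &= \frac{1}{1 + e^{-c\,(u - u_0)}} = \sigma\bigl(c\,(u - u_0)\bigr), \\
x(t) &= \frac{1}{1 + (t_0/t)^{c}}
\quad \text{with } u = \log t,\; u_0 = \log t_0 .
\end{aligned}
```

So x is a sigmoid in log-time, which is what "log-sigmoid" refers to here.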

More importantly, we prove that even if each individual task's progress is jagged, the log-sigmoid law still emerges in the benchmark average over many tasks. This gives an elegant explanation of the phenomenon we observed.
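
A toy simulation can illustrate the averaging effect. It is not the paper's proof: it assumes each task's score is split into a few units whose unlock times in log-time follow a logistic distribution, so the log-sigmoid shape is built in by construction. What it shows is only that jagged step curves average into a smooth one. All names and parameters below are hypothetical.

```python
import math
import random


def log_sigmoid(u, c=1.0, u0=0.0):
    """Logistic function of log-time u = log(t)."""
    return 1.0 / (1.0 + math.exp(-c * (u - u0)))


def simulate_task(rng, units=8, c=1.0, u0=0.0):
    """Toy task: each score unit unlocks at a random log-time.

    Unlock times are drawn so that P(unlocked by u) = log_sigmoid(u),
    an assumption made for illustration only.
    """
    times = []
    for _ in range(units):
        p = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        times.append(u0 + math.log(p / (1.0 - p)) / c)  # inverse logistic CDF
    return sorted(times)


def task_curve(times, grid):
    """Step-function score in [0, 1]: fraction of units unlocked by each u."""
    return [sum(t <= u for t in times) / len(times) for u in grid]


rng = random.Random(0)
grid = [-6.0 + 12.0 * i / 199 for i in range(200)]
curves = [task_curve(simulate_task(rng), grid) for _ in range(134)]
mean_curve = [sum(col) / len(col) for col in zip(*curves)]
target = [log_sigmoid(u) for u in grid]


def max_gap(curve):
    """Largest absolute deviation from the log-sigmoid on the grid."""
    return max(abs(a - b) for a, b in zip(curve, target))


mean_task_gap = sum(max_gap(cv) for cv in curves) / len(curves)
print(f"typical single-task gap: {mean_task_gap:.3f}")
print(f"gap of 134-task average: {max_gap(mean_curve):.3f}")
```

With only 8 units per task, each individual curve is a coarse staircase, while the 134-task average pools over a thousand units and sits much closer to the smooth curve.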

Deyao Zhu@tikgiau

[6/n] We evaluated model releases from September 2025 to May 2026, using performance improvement within 2 hours as the learning-speed metric. The frontier trend shows that AI learning speed from environments roughly doubles every three months.
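
To get a feel for what a three-month doubling time implies, here is a small arithmetic sketch. It assumes the trend holds exactly over the whole September 2025 to May 2026 window, which the thread only states as "roughly"; the function name is made up for illustration.

```python
def growth_factor(months, doubling_months=3.0):
    """Multiplicative change after `months` at a fixed doubling period."""
    return 2.0 ** (months / doubling_months)


# September 2025 to May 2026 spans 8 months.
print(round(growth_factor(8), 2))  # 6.35
```

Under that assumption, the frontier learning speed at the end of the window would be a bit over six times the starting value.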

Deyao Zhu@tikgiau

[2/n] EdgeBench covers real work across six capability families. It includes 134 day-long tasks spanning scientific and ML problems, systems and software engineering, optimization, professional knowledge work, formal math, and interactive games. Each task gives agents at least 12 hours in an executable environment with informative real-world feedback, while recorded human expert effort averages 57.2 hours per task.

Deyao Zhu@tikgiau

[3/n] Agents continuously learn from environments and improve their performance in EdgeBench. The representative curves below, drawn from all six capability families, show that agents continually turn environmental feedback into better artifacts, strategies, and final outcomes.

Deyao Zhu@tikgiau

[4/n] We average learning curves of different models across 134 tasks. The noisy task-specific trajectories collapse into a simple log-sigmoid, with high precision and mean R^2=0.998.
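
For anyone who wants to try a fit like this on their own averaged curve, here is a minimal sketch. The thread does not describe the fitting procedure, so this is one simple option rather than the authors' method: it treats the ceiling `s_max` as known (an assumption), linearizes with a logit transform, and runs ordinary least squares in log-time. A fuller fit would likely estimate the ceiling too with nonlinear least squares.

```python
import math


def fit_log_sigmoid(times, scores, s_max=1.0):
    """Least-squares fit of logit(score / s_max) = c * (log t - log t0).

    Returns (c, t0, r2), with r2 computed on the original score scale.
    Scores must lie strictly between 0 and s_max.
    """
    us = [math.log(t) for t in times]
    zs = [math.log((s / s_max) / (1.0 - s / s_max)) for s in scores]
    n = len(us)
    u_bar = sum(us) / n
    z_bar = sum(zs) / n
    c = sum((u - u_bar) * (z - z_bar) for u, z in zip(us, zs)) / sum(
        (u - u_bar) ** 2 for u in us
    )
    intercept = z_bar - c * u_bar  # equals -c * log(t0)
    t0 = math.exp(-intercept / c)
    preds = [s_max / (1.0 + (t0 / t) ** c) for t in times]
    s_bar = sum(scores) / n
    ss_res = sum((s - p) ** 2 for s, p in zip(scores, preds))
    ss_tot = sum((s - s_bar) ** 2 for s in scores)
    return c, t0, 1.0 - ss_res / ss_tot


# Synthetic check: noiseless data generated with c = 1.5 and t0 = 3 hours.
hours = [0.5, 1, 2, 4, 8, 12]
data = [1.0 / (1.0 + (3.0 / t) ** 1.5) for t in hours]
c, t0, r2 = fit_log_sigmoid(hours, data)
print(round(c, 3), round(t0, 3), round(r2, 3))
```

On noiseless synthetic data the fit should recover the generating parameters with R^2 essentially equal to 1; on real curves, R^2 measures how much of the variance in the averaged score the log-sigmoid explains.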

Deyao Zhu@tikgiau

[7/n] Below we show a 12-hour agent performance trace on the gravitational-wave task. Across 247 scored attempts, the performance climbs from 42.8 to 67.0, with seven turning points where the agent reframes the problem rather than just tuning.

Deyao Zhu@tikgiau

[9/n] EdgeBench was an awesome team effort. Huge thanks to the incredible team💪: @mingwuzheng @zinuxo87 @odysseusqs @zhu_xuekai @_foreverpiano @Zixin_Wen Zhonglin Xie @poppingG5

Hangliang Ding@_foreverpiano

@tikgiau Fun fact: no single task follows the law. Zoom into any one curve — pure chaos. The clean sigmoid only emerges when you average across 134 tasks, exactly as the graph-exploration theory predicts.

Hanchen Li@lihanc02

@tikgiau Well, I have never been a believer in "AI replacing humans". This benchmark probably justifies this ideology, I guess

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@scaling01 I really want to see GLM 5.2 and how graph efficiency scales (vs just increasing reasoning tokens)

Lisan al Gaib@scaling01

@tikgiau great stuff

James Han@hanzhi98368555

@tikgiau Very interesting research! Looking forward to more results!

Ethan TS. Liu@ethantsliu

@tikgiau interesting! i always wondered about agent hillclimbing on long term tasks

wulala@wulala261087343

@tikgiau Really valuable work! Curious which models will be evaluated on this benchmark next.

Merinmiro@mybeautifulYe

@tikgiau wow

Richard Hundt@HundtRichard

@tikgiau GLM-5.1 looks very interesting. Highest S max of them all but much shallower slope. Any ideas what that model does differently? It's like you let it churn for long enough, it eventually organizes itself to the point where it beats everything else out there?

Lisan al Gaib@scaling01

@tikgiau Will you keep the leaderboard updated with new models like GLM-5.2 for example?
