/AI · 11h ago

Anjali argues AI labs must build software to harvest interactive "Work Data" as scaling laws hit data limits

These interaction traces train agents via reinforcement learning

#931

Original post

Herbie Bradley#1012

anjali@anjali_shriva

the scaling laws in models might feel like inevitable progress if compute and data continue growing. but data has some underrated limitations…

a thread on a new kind of data ("Work Data"): what it is, and why labs now need to build and sell product for continued growth

judah@joodalooped

all aboard the data train!

https://anjalishriva.com/work-data/

7:41 AM · Jun 9, 2026 · 8.4K Views

Sentiment

Many users are excited about the new primer on work data for training AI agents because of its insightful content, collaborative reviews, and playful emphasis on the core concept.

Pos

100.0%

Neg

0.0%

6 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

anjali@anjali_shriva

For most domains, real work is the only environment that useful data can come from.

A big reason why labs are building products, acquiring companies, and forward-deploying engineers into enterprises is to gather enough work data to train their agents on a wider range of tasks.

11h9969

BOOKMARKS2

jihad@jaesmail

@anjali_shriva two most anticipated pieces of media of 2026: - This essay - Iceman

11h14332

LIKES14

anjali@anjali_shriva

with thanks to the many people who reviewed and gave comments

@divya_venn @annihalated @aishdoingthings @peytoncasper @herbiebradley @JoshPurtell @nobu_hibiki @shacrw_ @akbirthko @seconds_0

and @analoguegroup

🫶

11h11614

RETWEETS16

anjali@anjali_shriva

the scaling laws in models might feel like inevitable progress if compute and data continue growing. but data has some underrated limitations…

a thread on a new kind of data ("Work Data"): what it is, and why labs now need to build and sell product for continued growth

judah@joodalooped

all aboard the data train!

https://anjalishriva.com/work-data/

11h8.4K6728

REPLIES2

Aashish Reddy@_AashishReddy

@anjali_shriva @ankit2119 My take is that what an agent needs to learn is basically representations for how to take actions in the world, how to chain together sequences of actions and make plans and so on. So synthetic data makes RL-generalisation feasible

10h192

anjali@anjali_shriva

if work data is what matters, where do you get it?

you can't just scrape it from the web. Work data is a fundamentally different distribution, and the corrective signals that matter aren’t in any textbook, manual, or written wiki.

11h160111

anjali@anjali_shriva

But once you understand it, it raises many, many questions about the future.

Read the full post for our predictions (co-written with @joodalooped) 👉 http://anjalishriva.com/work-data

11h8591

anjali@anjali_shriva

Okay, no dataset. Can't we just build a simulation? A "work gym" (RL env) where agents learn by trial and error?

unfortunately, knowledge work lacks the verifiability that RL relies on: the feedback is too sparse, too delayed, too noisy to learn from (h/t @gwern)

11h839

anjali@anjali_shriva

It's hard to grasp how much work data you generate in a session, let alone the sheer scale of data that's needed

Our examples of its specific nature / excerpt from recent @dwarkesh_sp post:

11h649

aishwarya🍎@aishdoingthings

@anjali_shriva @divya_venn @annihalated @peytoncasper @herbiebradley @JoshPurtell @nobu_hibiki @shacrw_ @akbirthko @seconds_0 @analoguegroup WORK DATA!!!!!

10h416

Analogue@analoguegroup

@anjali_shriva @divya_venn @annihalated @aishdoingthings @peytoncasper @herbiebradley @JoshPurtell @nobu_hibiki @shacrw_ @akbirthko @seconds_0 we love supporting your work!

10h388

Aashish Reddy@_AashishReddy

@anjali_shriva Are you bearish on synthetic data

10h972

anjali@anjali_shriva

@jaesmail i'm fr laughing at how long it took

what can i say, we enjoy the finishing touches

10h472

anjali@anjali_shriva

@_AashishReddy for anti-inductive domains, yeah. and i think this is a *huge* portion of white collar work

good writeup from @ankit2119 https://ankitmaloo.com/anti-inductive/

10h342

Soren Larson@hypersoren

@anjali_shriva @jaesmail 🥲

10h92

Aashish Reddy@_AashishReddy

@anjali_shriva @ankit2119 Whereas it wouldn't for learning models of the world, since synthetic data doesn't tell you what the world is like. But fortunately we already get that from pretraining. I still have exams but hope to have time to flesh this out before the singularity

10h231

anjali@anjali_shriva

@aishdoingthings @divya_venn @annihalated @peytoncasper @herbiebradley @JoshPurtell @nobu_hibiki @shacrw_ @akbirthko @seconds_0 @analoguegroup #werk data

10h264

judah@joodalooped

@hypersoren @anjali_shriva @jaesmail sometimes it includes a CSS rewrite

10h62

harsh@harshh_jainn

@anjali_shriva would be hilarious if the models got dumber after post training on work data 😅

10h181

anjali@anjali_shriva

@_AashishReddy @ankit2119 i can see this, yeah. the full post is a bit more nuanced and comes with an author's note

we mainly wanted to get across

1) this type of data is important, even for compute-rich labs (see labs starting deployment companies, cursor-xai partnership) 2) and it's under-discussed

10h19