
Researcher Releases Data Agent Benchmark For Enterprise AI Workflows

Original post: Shreya Shankar (#409)
Michael Griffiths (@msjgriffiths)

I asked Codex to download this repo and run the benchmark with GPT 5.5 (with hints). Frankly, it did better than I expected, since it's not even using the Codex harness!

Codex makes running benchmarks easier. This failed all PATENTS queries, so likely a floor.

It is great that the frontier labs want to support data analytics workflows. It is the #1 thing many enterprises want now. But enterprise data is difficult. Turns out most models are bad at it! They’re OK at writing single SQL queries or Python scripts, thanks to the plethora of text-to-SQL and data science benchmarks, but they struggle to query, clean, and make sense of data from multiple database systems (relational and non-relational).

Fortunately, we released a new benchmark to help, the Data Agent Benchmark, with plans to get it into a super well-known benchmark very soon :-) stay tuned!

https://arxiv.org/abs/2603.20576

10:02 AM · Jun 5, 2026 · 3.4K Views
Sentiment

Users praise GPT-5.5's strong performance on the Data Agent Benchmark as run through Codex, noting scores up to 72% and near-perfect results despite how difficult such tests are to build.

Positive: 100.0%
Negative: 0.0%
1 comment with sentiment.

@msjgriffiths Interesting! Also we’ve just updated the benchmark (there should be ~100 queries in the repo)

21h · Views 96 · Likes 1
Michael Griffiths (@msjgriffiths)

Making these benchmarks is hard. A few weeks ago I built a tool that uses Datalog to generate increasingly hard SQL queries (a "ladder"). GPT 5.5 (low!!) did obnoxiously well: frankly about perfect for almost all SQL, and very solid in general. xhigh fixes most of this.

23h · Views 12