Researcher Releases Data Agent Benchmark For Enterprise AI Workflows

Original post

Michael Griffiths@msjgriffiths

I asked Codex to download this repo, and run the benchmark with GPT 5.5 (with hints). Frankly did better than I expected it to, since it's not even using the Codex harness!

Codex makes running benchmarks easier. This failed all PATENTS queries, so likely a floor.

Shreya Shankar@sh_reya

It is great that the frontier labs want to support data analytics workflows. It is the #1 thing many enterprises want now. But enterprise data is difficult. Turns out most models are bad at it! They’re OK at writing single SQL queries or Python scripts, thanks to the plethora of text-to-SQL and data science benchmarks, but they struggle to query, clean, and make sense of data from multiple database systems (relational and non-relational).

Fortunately, we released a new benchmark to help, the Data Agent Benchmark, with plans to get it into a super well-known benchmark very soon :-) stay tuned!

https://arxiv.org/abs/2603.20576

10:02 AM · Jun 5, 2026 · 3.5K Views

Sentiment

Users praise GPT-5.5 for scoring up to 72% on the Data Agent Benchmark when run through Codex, describe its handling of SQL queries generated by a custom tool as nearly perfect, and discuss the benchmark's latest update.

Pos: 100.0% · Neg: 0.0% · 2 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 96 · Likes: 1

Shreya Shankar@sh_reya

@msjgriffiths Interesting! Also we’ve just updated the benchmark (there should be ~100 queries in the repo)

Michael Griffiths@msjgriffiths

Making these benchmarks is hard. A few weeks ago I built a tool that uses Datalog to generate increasingly hard SQL queries (a "ladder"). GPT 5.5 (low!!) did obnoxiously well, frankly about perfect for almost all SQL, and very solid in general. xhigh fixes most of this.
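To make the "ladder" idea concrete, here is a minimal sketch of what a difficulty ladder of SQL queries could look like. It is not the tool described in the post: it uses hand-written rungs and Python's built-in `sqlite3` instead of Datalog, and the schema, the rung names, and the helpers `build_db`, `reference_answers`, and `highest_rung_passed` are all hypothetical.

```python
import sqlite3

def build_db():
    """Create a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'Ada', 'EU'), (2, 'Bo', 'US'), (3, 'Cy', 'EU');
        INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 70.0), (3, 2, 20.0), (4, 3, 10.0);
    """)
    return conn

# Each rung adds one SQL feature on top of the previous rung.
LADDER = [
    (1, "filter",
     "SELECT name FROM customers WHERE region = 'EU' ORDER BY name"),
    (2, "join",
     "SELECT c.name, o.amount FROM customers c "
     "JOIN orders o ON o.customer_id = c.id ORDER BY o.id"),
    (3, "join + aggregate",
     "SELECT c.region, SUM(o.amount) FROM customers c "
     "JOIN orders o ON o.customer_id = c.id GROUP BY c.region ORDER BY c.region"),
    (4, "aggregate + subquery",
     "SELECT c.name FROM customers c JOIN orders o ON o.customer_id = c.id "
     "GROUP BY c.id HAVING SUM(o.amount) > (SELECT AVG(amount) FROM orders) "
     "ORDER BY c.name"),
]

def reference_answers(conn):
    """Run every reference query and return {rung: rows}."""
    return {rung: conn.execute(sql).fetchall() for rung, _, sql in LADDER}

def highest_rung_passed(conn, candidates):
    """Climb the ladder until a candidate query is missing, fails, or is wrong.

    candidates maps rung number to a model-written SQL string.
    """
    expected = reference_answers(conn)
    passed = 0
    for rung, _, _ in LADDER:
        sql = candidates.get(rung)
        if sql is None:
            break
        try:
            got = conn.execute(sql).fetchall()
        except sqlite3.Error:
            break  # invalid SQL counts as a failure at this rung
        if got != expected[rung]:
            break
        passed = rung
    return passed
```

Comparing result rows rather than SQL text lets differently written but equivalent queries pass. A real generator would presumably produce the rungs programmatically instead of listing them by hand.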
