I asked Codex to download this repo and run the benchmark with GPT 5.5 (with hints). Frankly, it did better than I expected, since it's not even using the Codex harness!
Codex makes running benchmarks easier. This run failed all PATENTS queries, so the score is likely a floor.
It is great that the frontier labs want to support data analytics workflows. It is the #1 thing many enterprises want now. But enterprise data is difficult, and it turns out most models are bad at it! They’re OK at writing single SQL queries or Python scripts, thanks to the plethora of text-to-SQL and data science benchmarks, but they struggle to query, clean, and make sense of data spread across multiple database systems (relational and non-relational).
Fortunately, we released a new benchmark to help, the Data Agent Benchmark, with plans to get it into a super well-known benchmark very soon :-) Stay tuned!
https://arxiv.org/abs/2603.20576

