AI · 6h ago

DocETL creator Shreya Shankar releases Data Agent Benchmark to evaluate AI data agents on complex enterprise tasks beyond SQL generation

It tests agents on messy joins and unstructured text.

Original post · Shreya Shankar #407
Parul Pandey @pandeyparul

If you're building or evaluating data agents, you should check out the Data Agent Benchmark (DAB) by @ruiyingm1120, @sh_reya et al. DAB evaluates agents on more realistic enterprise data problems involving multiple databases, messy joins, unstructured text, and domain-specific knowledge. This imo is more vital than evaluating agents on SQL generation alone.

OpenAI @OpenAI

We’re making Codex more useful for your work by expanding plugins beyond individual tools.

These plugins turn Codex into a specialist for a specific role with a single install, no coding required.

Codex can access 62 popular apps and 110 skills for work across sales, data analytics, creative production, product design, and public equity investing.

https://openai.com/index/codex-for-every-role-tool-workflow/

10:05 PM · Jun 2, 2026 · 1K Views
Posts from X
Views 1.5K · Bookmarks 17 · Likes 17 · Replies 2

Some thoughts after building the Data Agent Benchmark:

- build a "semantic layer" for all your data. a simple first pass can be done by running an LLM over a sample of data from each table, to come up with column annotations (commonly known as semantic types) and possible functional dependencies (i.e., columns that depend on each other)
- use the semantic layer in the prompt for all questions
- enterprise data often needs to be cleaned. rather than try to clean all data up front (which is really difficult), keep a memory of subsets of data that need to be cleaned and how, so that the relevant data can be cleaned by the agent at query time
- extend the harness (i.e., codex, claude code) with a tool like DocETL or Claude workflows; basically the ability to run agentic map-reduce. often questions require reasoning about unstructured text columns to come up with the answer, which SQL or code doesn't support
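A minimal Python sketch of the first three points above, under loud assumptions: it is not code from DAB or DocETL. `toy_annotate` is a hypothetical stand-in for the LLM call that would label each column, possible functional dependencies are guessed with a plain check over the sample rather than by an LLM, and the "memory" is just a dictionary keyed by (table, column). All names and the sample data are illustrative.

```python
from itertools import permutations
from typing import Callable

Row = dict[str, str]
# Memory of known-dirty columns: (table, column) -> fix to apply at query time.
CleaningMemory = dict[tuple[str, str], Callable[[str], str]]


def candidate_dependencies(sample: list[Row]) -> list[tuple[str, str]]:
    """Return (a, b) pairs where every value of a maps to one value of b in the sample."""
    if not sample:
        return []
    found = []
    for a, b in permutations(sample[0].keys(), 2):
        seen: dict[str, str] = {}
        # setdefault records the first b seen for each a; a later mismatch breaks the pair.
        if all(seen.setdefault(row[a], row[b]) == row[b] for row in sample):
            found.append((a, b))
    return found


def build_layer(tables: dict[str, list[Row]],
                annotate: Callable[[str, list[str]], str]) -> dict:
    """Build per-table column annotations and possible dependencies from samples."""
    layer = {}
    for name, sample in tables.items():
        columns = list(sample[0].keys()) if sample else []
        layer[name] = {
            "types": {c: annotate(c, [r[c] for r in sample]) for c in columns},
            "dependencies": candidate_dependencies(sample),
        }
    return layer


def layer_to_prompt(layer: dict) -> str:
    """Render the layer as text that can be prepended to every question."""
    lines = []
    for table, info in layer.items():
        lines.append(f"Table {table}:")
        for col, sem_type in info["types"].items():
            lines.append(f"  {col}: {sem_type}")
        for a, b in info["dependencies"]:
            lines.append(f"  {a} -> {b}")
    return "\n".join(lines)


def clean_for_query(table: str, rows: list[Row], columns_used: list[str],
                    memory: CleaningMemory) -> list[Row]:
    """Clean only the columns a query touches, leaving the stored rows unchanged."""
    cleaned = []
    for row in rows:
        new_row = dict(row)
        for col in columns_used:
            fix = memory.get((table, col))
            if fix is not None:
                new_row[col] = fix(new_row[col])
        cleaned.append(new_row)
    return cleaned


def toy_annotate(column: str, values: list[str]) -> str:
    # Hypothetical placeholder: a real system would ask an LLM for a semantic type.
    return "numeric code" if all(v.strip().isdigit() for v in values) else "free text"


SAMPLE = {"customers": [
    {"zip": "94110", "city": "San Francisco ", "note": "late"},
    {"zip": "94110", "city": "San Francisco ", "note": "ok"},
    {"zip": "10001", "city": "New York", "note": "ok"},
]}

layer = build_layer(SAMPLE, toy_annotate)
prompt = layer_to_prompt(layer)
memory: CleaningMemory = {("customers", "city"): str.strip}
cleaned = clean_for_query("customers", SAMPLE["customers"], ["city"], memory)
```

Dependencies found this way only hold on the sample, so they are best treated as hints for the prompt rather than constraints. Keeping fixes in `memory` and applying them per query mirrors the idea of deferring cleaning until the relevant subset is actually needed.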

4h · Views 1.5K · Likes 17 · Bookmarks 17