Amplify Partners GP Sarah Catanzaro argues file systems are better suited than databases for AI agent workloads

@sarahcat21 You are describing…a database? 😆 😆

Agents’ data needs will become increasingly complex. I didn't get the filesystems thing at first. I thought it was just hype. But now I believe that file systems are better suited to agents than DBs or object stores. I'm less convinced that today's file systems are...the future.

Talking to senior folks building data infra, I heard some common refrains on why filesystems are right for agents:

1) Training corpora contain lots of file system operations – there are a lot of public GitHub repos containing file system operations, so these interactions appear frequently in the data that models are trained on. What’s more, filesystem semantics are dead simple. In contrast, database operations do appear in training corpora, but their semantics are kinda tricky, and they are often missing critical context like schemas.

2) File systems do what agents need to work with data – agents typically need high-throughput, low-latency access to specific, small files containing unstructured data. Filesystems support this pattern well. Other data management systems don’t: OLTP databases require too much structure; object storage is too slow.

But while file systems seem to meet the needs of agents today, that will probably change imminently. Specifically, I expect agents will need 4 capabilities:

1) Transactions – when thousands of agents operate simultaneously, reading from and modifying shared state, and writing it back, the file system must store data reliably and mediate interactions between independent processes. Traditional distributed file systems weren't designed for highly concurrent, fine-grained mutation of shared state.

2) Queries – today, an agent might retrieve a single file and make a single update. But as agentic retrieval advances, agents will need to pull data from multiple files, materialize intermediate results, and run operations across those results. As Claude might say, this is not retrieval; it is a query problem.

3) ACLs – file systems do support ACLs, but they’re typically defined at the level of files, directories, users, and groups. Permissions are static, evaluated at access time, and don’t extend to how data is used once it’s read.

4) File scale – today’s agent workloads operate over small files, modest context windows, limited working memory, and bounded outputs. As models improve and context windows expand, agents will start working over much larger artifacts. A coding agent won’t operate file by file; it will need to understand and modify entire codebases.
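
To make the transactions point concrete, here is a minimal sketch of what "mediating interactions between independent processes" looks like on a plain file today. It assumes a Unix system (it uses advisory locks via `fcntl.flock`), and the file layout, the `tasks_done` key, and the function names are all hypothetical. Threads stand in for agents. Without the lock, concurrent read-modify-write cycles can silently overwrite each other.

```python
import fcntl
import json
import os
import tempfile
import threading


def locked_increment(path: str, key: str) -> None:
    """Read-modify-write a JSON file while holding an exclusive advisory lock."""
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other holder remains
        try:
            state = json.load(f)
            state[key] = state.get(key, 0) + 1
            f.seek(0)
            f.truncate()
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the update to disk before unlocking
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def run_agents(n_agents: int, updates_each: int) -> int:
    """Simulate agents (threads here) updating one shared state file."""
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump({"tasks_done": 0}, f)

    def agent() -> None:
        for _ in range(updates_each):
            locked_increment(path, "tasks_done")

    threads = [threading.Thread(target=agent) for _ in range(n_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    with open(path) as f:
        total = json.load(f)["tasks_done"]
    os.remove(path)
    return total
```

This works for one file on one machine, but the lock is advisory (every writer has to cooperate), it covers a single file rather than a multi-file update, and it serializes every writer. That gap is roughly what I mean by needing real transactions.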

There are teams starting to build the post-modern data stack for agents (so unoriginal, but I had to sneak some slop in here). @archildata gives agents fast, consistent access to data across environments, while explicitly meeting the latency requirements agents have and will have in the future. @SpiralDB built a columnar file format for extremely fast reads, including random access, selective reads, and large batch scans - it can give agents analytics capabilities.

To be honest, I’m in awe that data infra has held up so well as agents become more prevalent. But this won’t last forever. Agents will need more. Concurrency needs stronger guarantees. Retrieval becomes a query problem over unstructured data. Access control moves from static ACLs to dynamic, execution-aware policies. And file systems need to support efficient operations over much larger artifacts.
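
As a rough illustration of "retrieval becomes a query problem," here is a minimal sketch in which an agent pulls records from several small files, materializes them as an intermediate table, and runs an aggregate over the result. The file names, the `status` field, and the use of an in-memory SQLite table are all assumptions for illustration, not how any particular system does it.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def query_across_files(root: Path) -> list:
    """Load every *.json file under root into a temp table, then aggregate."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (source TEXT, status TEXT)")
    for path in sorted(root.glob("*.json")):
        for rec in json.loads(path.read_text()):
            conn.execute(
                "INSERT INTO events VALUES (?, ?)", (path.name, rec["status"])
            )
    # The intermediate result now exists as a table, so the "retrieval"
    # step becomes an ordinary query across all source files.
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM events GROUP BY status ORDER BY status"
    ).fetchall()
    conn.close()
    return rows


def demo() -> list:
    """Build two hypothetical agent log files and query across them."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "run_a.json").write_text(
            json.dumps([{"status": "ok"}, {"status": "failed"}])
        )
        (root / "run_b.json").write_text(
            json.dumps([{"status": "ok"}, {"status": "ok"}])
        )
        return query_across_files(root)
```

Note that the agent had to bring its own query engine and re-ingest every file to answer one question. A file system built for agents would presumably make that step native.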

We’re starting to see early answers, but no dominant design. The choices we make now about how agents interact with data will harden quickly, so it’s worth getting them right.

More in my latest blog post: https://www.amplifypartners.com/blog-posts/file-systems-for-agents

Alex Dimakis@AlexGDimakis

@sarahcat21 Told you so :)