This was long overdue for AI in finance.
Making SEC filings readable for machines without flattening the accounting logic.
A Stanford researcher has just released a dataset and method that offer a cleaner way to turn SEC filings into useful LLM training data without losing the meaning inside financial tables.
The release includes a 152B-token public snapshot, and the full archive is estimated to reach about 550B tokens of long financial documents.
The dataset has less than 0.1% overlap with Common Crawl-derived corpora.
The authors propose SEFD, a rebuilt version of EDGAR filings that keeps table structure, indentation, and financial meaning while using fewer tokens for LLM training.
The dataset turns EDGAR into layout-faithful MultiMarkdown, preserving merged headers, indentation, signs, spans, and table hierarchy while shrinking enormous presentation scaffolding into usable tokens.
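To make "layout-faithful" concrete, here is a rough sketch of what a spanned header and an indented line item could look like in MultiMarkdown. This is not the authors' pipeline: the function names, the `&nbsp;` indentation trick, and the figures are all made up for illustration. The only thing borrowed from MultiMarkdown itself is its table syntax, where a cell spanning several columns is closed with that many pipes.

```python
# Illustrative only: a hypothetical renderer, not the SEFD implementation.
# Assumption: indentation is kept with non-breaking-space entities, since
# Markdown table cells normally drop leading whitespace.
INDENT = "&nbsp;&nbsp;"


def render_row(cells):
    """Render one row from (text, colspan) pairs.

    In MultiMarkdown, a cell that spans n columns ends with n pipes.
    """
    return "|" + "".join(f" {text} " + "|" * span for text, span in cells)


def render_table(header_rows, body_rows):
    """Render header rows, an alignment row, then indented body rows.

    body_rows is a list of (indent_level, label, values) tuples.
    """
    n_cols = sum(span for _, span in header_rows[0])
    lines = [render_row(row) for row in header_rows]
    # Left-align the label column, right-align the numeric columns.
    lines.append("|" + " --- |" + " ---: |" * (n_cols - 1))
    for level, label, values in body_rows:
        cells = [(INDENT * level + label, 1)] + [(v, 1) for v in values]
        lines.append(render_row(cells))
    return "\n".join(lines)


# Made-up numbers; parentheses mark negatives, as in typical filings.
header_rows = [
    [("", 1), ("Year ended December 31", 2)],  # merged header over two years
    [("", 1), ("2023", 1), ("2022", 1)],
]
body_rows = [
    (0, "Revenue", ["1,200", "1,050"]),
    (1, "Product", ["900", "800"]),
    (1, "Services", ["300", "250"]),
    (0, "Net loss", ["(45)", "(60)"]),
]
table = render_table(header_rows, body_rows)
print(table)
```

The merged header, the parent/child line items, and the accounting-style negatives all survive as plain text, which is the kind of structure a naive HTML-to-text flattening tends to lose.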
----
Link – arxiv.org/abs/2606.18192v1

