Stanford Releases SEFD Dataset For Token-Efficient SEC Filings Pretraining

Original post

Rohan Paul (@rohanpaul_ai)

This was long needed for AI in finance.

Making SEC filings readable for machines without flattening the accounting logic.

A Stanford researcher has just released a dataset and methods that offer a cleaner way to turn SEC filings into useful LLM training data without losing the meaning inside financial tables.

The release includes a 152B-token public snapshot, and the authors estimate that the full archive could grow to about 550B tokens of long financial documents.

The data has less than 0.1% overlap with Common Crawl-derived corpora.

The authors propose SEFD, a rebuilt version of EDGAR filings that keeps table structure, indentation, and financial meaning while using fewer tokens for LLM training.

The dataset turns EDGAR into layout-faithful MultiMarkdown, preserving merged headers, indentation, signs, spans, and table hierarchy while shrinking enormous presentation scaffolding into usable tokens.

----

Link – arxiv.org/abs/2606.18192v1

4:07 AM · Jun 17, 2026 · 389 Views

Posts from X

Views 1.5K · Bookmarks 12 · Likes 21 · Retweets 8 · Replies 2

Rohan Paul@rohanpaul_ai

This was long needed for AI in finance.

Making SEC filings readable for machines without flattening the accounting logic.

Researchers from Stanford, the University of California, and Nanjing University have just released a dataset and methods that offer a cleaner way to turn SEC filings into useful LLM training data without losing the meaning inside financial tables.

The release includes a 152B-token public snapshot, and the authors estimate that the full archive could grow to about 550B tokens of long financial documents.

The data has less than 0.1% overlap with Common Crawl-derived corpora.

The authors propose SEFD, a rebuilt version of EDGAR filings that keeps table structure, indentation, and financial meaning while using fewer tokens for LLM training.

----

Link – arxiv.org/abs/2606.18192v1

Broadstreet@mattbroadstreet

@rohanpaul_ai ⏫

BetterWay@momocowala

@rohanpaul_ai Layout in a 10-K is the semantics — parens mean negative, a footnote ref changes what the line item even is. XBRL promised machine-readable filings back in 2009 and mostly flopped on exactly that. The token-efficient part is the genuinely new piece here.
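
The point that layout carries the meaning can be made concrete with a small sketch. This is not the SEFD pipeline or anything taken from the paper; it only illustrates, on made-up rows, how a parenthesized cell has to be read as a negative number and how leading indentation encodes the line-item hierarchy. The helper names `parse_amount` and `indent_level`, the two-space indent width, and the sample figures are all assumptions for illustration.

```python
from decimal import Decimal
import re


def parse_amount(cell: str):
    """Parse an accounting-style cell; parentheses denote a negative value."""
    text = cell.strip()
    # Blank cells and dash placeholders carry no number.
    if text in {"", "-", "\u2013", "\u2014"}:
        return None
    negative = text.startswith("(") and text.endswith(")")
    # Drop currency symbols, thousands separators, parentheses, spaces.
    digits = re.sub(r"[^\d.]", "", text)
    if not digits:
        return None
    value = Decimal(digits)
    return -value if negative else value


def indent_level(label: str, width: int = 2) -> int:
    """Hierarchy depth inferred from leading spaces (width is an assumption)."""
    return (len(label) - len(label.lstrip(" "))) // width


# Hypothetical income-statement rows, not taken from any real filing.
rows = [
    ("Revenue", "1,250"),
    ("  Cost of sales", "(730)"),
    ("  Operating expenses", "(310)"),
    ("Net income", "210"),
]

parsed = [(indent_level(label), label.strip(), parse_amount(value))
          for label, value in rows]

# The statement only adds up if the parentheses are read as negatives.
total = sum(amount for _, _, amount in parsed[:3])
```

If the parentheses were stripped during extraction, the same three rows would sum to 2,290 rather than the reported net income of 210, which is the kind of silent flattening the post is describing.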