
PixelRAG Retrieves Web Pages As Screenshots To Improve RAG Accuracy

Yichuan Wang@YichuanM

The web was never meant to be flattened into text.

Yet most web RAG systems start by parsing HTML, a complex and lossy process.

🔥 Introducing PixelRAG: the first RAG system that retrieves and reads 30M+ web pages as pixels.

Instead of extracting text, PixelRAG retrieves screenshots and lets a VLM read them directly.

PixelRAG not only preserves visual information, but also outperforms text-based RAG on text-only QA benchmarks by +18.1%.

Why? (1) HTML-to-text conversion often discards layout, structure, tables, and other useful signals. (2) We continued pretraining a VLM on web page screenshots and turned it into a surprisingly strong visual retriever. (3) Recent VLMs are remarkably good at understanding web pages, often with better accuracy and token efficiency than text-only pipelines.

Takeaway: HTML parsing may be one of the biggest self-inflicted bottlenecks in web RAG.

Demo below 👇

Code: https://github.com/StarTrail-org/PixelRAG

Paper: https://github.com/StarTrail-org/PixelRAG/blob/main/assets/pixelrag-paper.pdf

Playground: https://pixelrag.ai/

10:07 AM · Jun 10, 2026 · 4.3K Views
Sewon Min@sewon__min

Super excited about this work, led by @YichuanM and @andylizf! It is possible to remove HTML parsing completely by retrieving and reading web screenshots directly with a VLM.

HTML parsing is a hidden bottleneck that causes significant complexity and information loss that nobody really pays attention to, and it's so exciting that VLM progress made it possible to remove it.

Please check out this demo as well which is really cool!!! http://pixelrag.ai

Stella Biderman@BlancheMinerva

@sewon__min @YichuanM @andylizf Very cool work! In the Common Pile we ran into this issue because we couldn’t use web agents to analyze the licensing status of websites. It turns out that the footers and sidebars where such info is often recorded are not shown to the AIs!

https://arxiv.org/abs/2506.05209

Yichuan Wang@YichuanM

The dirty secret behind most RAG systems 🤫

Most systems depend on HTML-to-text parsers to get clean text from the web.

And those parsers are surprisingly lossy.

📉 A single parser can throw away 40%+ of useful content

📊 Tables, charts, infoboxes, layouts → flattened or lost

🎰 Changing just the parser can shift RAG accuracy by ~10 points
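
To make the table-flattening point concrete, here is a minimal sketch using only Python's standard `html.parser`. The `NaiveTextExtractor` class and the sample table (with made-up values) are illustrative assumptions, not one of the parsers measured in the paper. It shows how an empty cell disappears once tags are dropped.

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Keeps text nodes and ignores every tag (hypothetical naive parser)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only nodes between tags
            self.chunks.append(text)


def flatten(html: str) -> str:
    """Return the page text as one space-joined string."""
    parser = NaiveTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Toy table with made-up values; the price cell for "Gadget" is empty.
SAMPLE = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget</td><td>10</td></tr>
  <tr><td>Gadget</td><td></td></tr>
  <tr><td>Gizmo</td><td>30</td></tr>
</table>
"""

flat = flatten(SAMPLE)
print(flat)  # Item Price Widget 10 Gadget Gizmo 30
```

In the flattened string nothing marks "Gadget" as a row with a missing price, so a reader could plausibly pair it with the wrong value. A rendered screenshot keeps the grid.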

The web isn't plain text.

It's a visual medium.

Why are we pretending otherwise?

Yichuan Wang@YichuanM

💡 PixelRAG's core idea: Stop parsing. Can we remove text abstractions from RAG and make it fully end-to-end?

Text RAG: HTML → parse → text chunks → retrieve → LLM reads text

PixelRAG: Render page → screenshot tiles → visual retrieval → VLM reads pixels

We built the first visual index covering all of Wikipedia: 30M+ webpage screenshots.
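
The "screenshot tiles" step can be pictured with a small sketch. The thread does not say how PixelRAG cuts pages, so the function name `tile_boxes`, the tile height, and the overlap below are all assumptions chosen for illustration. The function only computes crop boxes; an image library would apply them to the rendered page.

```python
def tile_boxes(page_width, page_height, tile_height=1024, overlap=128):
    """Return (left, top, right, bottom) crop boxes covering a tall screenshot.

    tile_height and overlap are hypothetical defaults, not PixelRAG's settings.
    Consecutive tiles share `overlap` pixels so content on a boundary is not
    split between tiles without context.
    """
    if tile_height <= overlap:
        raise ValueError("tile_height must exceed overlap")
    step = tile_height - overlap
    boxes = []
    top = 0
    while True:
        bottom = min(top + tile_height, page_height)
        boxes.append((0, top, page_width, bottom))
        if bottom >= page_height:  # last tile reaches the end of the page
            break
        top += step
    return boxes


boxes = tile_boxes(1280, 3000)
for box in boxes:
    print(box)
```

For a 1280×3000 page this yields four boxes, the last one shorter than the rest. Each box would then be embedded and stored as one entry in the visual index.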

Architecture below 👇

Yichuan Wang@YichuanM

Text RAG stays flat.

PixelRAG rides the VLM scaling curve.

Every new VLM generation delivers:

↑ Better QA accuracy

↓ Lower token cost (for the same accuracy)

No re-indexing. No retraining. No pipeline changes.

Just a stronger reader.

We're still at the beginning of this curve. 📈

More details + a full reader sweep (25+ VLMs) in the paper.
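
One way to read "no re-indexing, just a stronger reader" is that the index and the reader are decoupled. The sketch below is a toy illustration of that design, not PixelRAG's code: `FixedIndexPipeline`, the string stand-ins for tiles, and both stub readers are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Tile = str  # stand-in for a screenshot tile (an image in a real system)
Reader = Callable[[str, Sequence[Tile]], str]


class FixedIndexPipeline:
    """Retrieval over a fixed index, with a reader that can be swapped freely."""

    def __init__(self, index: List[Tuple[List[float], Tile]], reader: Reader):
        self.index = index
        self.reader = reader

    def retrieve(self, query_vec: Sequence[float], k: int = 2) -> List[Tile]:
        # Rank tiles by dot-product similarity to the query embedding.
        ranked = sorted(
            self.index,
            key=lambda item: -sum(q * v for q, v in zip(query_vec, item[0])),
        )
        return [tile for _, tile in ranked[:k]]

    def answer(self, question: str, query_vec: Sequence[float], k: int = 2) -> str:
        return self.reader(question, self.retrieve(query_vec, k))


def small_reader(question: str, tiles: Sequence[Tile]) -> str:
    return f"small:{len(tiles)}"  # stub for a weaker VLM


def large_reader(question: str, tiles: Sequence[Tile]) -> str:
    return f"large:{len(tiles)}"  # stub for a newer VLM


index = [([1.0, 0.0], "tile_a"), ([0.0, 1.0], "tile_b"), ([0.7, 0.7], "tile_c")]
pipeline = FixedIndexPipeline(index, small_reader)
before = pipeline.answer("q", [1.0, 0.0])

pipeline.reader = large_reader  # upgrade the reader; the index is untouched
after = pipeline.answer("q", [1.0, 0.0])
print(before, after)
```

Only the `reader` attribute changes between the two calls, while the retrieved tiles stay the same.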

Yichuan Wang@YichuanM

🔍 Agentic search with PixelRAG.

Plug PixelRAG into a ReAct agent on MoNaCo.

PixelRAG achieves higher QA accuracy than both Google Search and DS-Serve — at 2–4× lower cost.

Why?

Better retrieval. Fewer tokens.

Visual search finds better evidence, while screenshots provide a far more compact representation of web pages.

Yichuan Wang@YichuanM

🗂️ PixelRAG Training: 30M screenshot tiles. Zero human labels.

We sample knowledge-intensive web pages, use an LLM to synthesize search queries, filter low-quality generations, and mine hard negatives — resulting in a fully automated contrastive data generation pipeline.
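
A rough sketch of the last two steps, mining hard negatives and scoring them contrastively, might look like this. The thread does not name the loss or the mining rule, so the InfoNCE loss, the similarity-based mining, the temperature, and the toy 2-D embeddings are all assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mine_hard_negatives(query, positive_id, corpus, k=2):
    """Pick the k non-positive tiles that score highest against the query."""
    scored = [
        (dot(query, vec), tile_id)
        for tile_id, vec in corpus.items()
        if tile_id != positive_id
    ]
    scored.sort(reverse=True)
    return [tile_id for _, tile_id in scored[:k]]


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss: negative log-softmax of the positive among all candidates."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# Toy embeddings; "tile_a" is the page the synthetic query was generated from.
query = (1.0, 0.0)
corpus = {
    "tile_a": (0.9, 0.1),
    "tile_b": (0.8, 0.2),
    "tile_c": (0.1, 0.9),
    "tile_d": (0.7, 0.3),
}

hard = mine_hard_negatives(query, "tile_a", corpus, k=2)
hard_loss = info_nce(query, corpus["tile_a"], [corpus[t] for t in hard])
easy_loss = info_nce(query, corpus["tile_a"], [corpus["tile_c"]])
print(hard, round(hard_loss, 4), easy_loss)
```

The near-miss tiles produce a much larger loss than the unrelated one, which is the reason to mine them: easy negatives contribute almost no training signal.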

🔥 LoRA on both the VLM and the ViT encoder.

⏱️ ~3 hours of training on a single H100
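
For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update, W + (alpha / r) · B·A, in plain Python. The ranks, scaling, and target layers PixelRAG uses are not given in the thread, so the shapes and values here are illustrative assumptions.

```python
def matmul(x, y):
    """Multiply matrix x (n x k) by matrix y (k x m), both as nested lists."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A.

    w: d_out x d_in frozen weight, a: r x d_in, b: d_out x r.
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def lora_param_count(d_out, d_in, r):
    """Trainable parameters in A and B, versus d_out * d_in for full tuning."""
    return r * (d_in + d_out)


w = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # frozen 2 x 3 weight (toy values)
a = [[1.0, 2.0, 3.0]]                   # rank-1 A
b_init = [[0.0], [0.0]]                 # B starts at zero, so the update is zero
b_trained = [[1.0], [0.0]]              # pretend B after some training

at_init = lora_effective_weight(w, a, b_init, alpha=2.0)
after_training = lora_effective_weight(w, a, b_trained, alpha=2.0)
print(at_init, after_training, lora_param_count(2, 3, 1))
```

Because only A and B are trained, the number of trainable parameters grows with r·(d_in + d_out) instead of d_in·d_out, which is consistent with a short single-GPU training run.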

Yichuan Wang@YichuanM

📊 PixelRAG beats the strongest text-based RAG baseline on every benchmark.

SimpleQA 78.8% (+7.1)

NQ-Tables 48.8% (+6.3)

EVQA 45.1% (+15.5)

LiveVQA 70.3% (+11.3)

The surprising part?

PixelRAG doesn't just win on visual QA.

It also outperforms text-based RAG on text-centric benchmarks like SimpleQA and NQ-Tables.

Additional failure cases and detailed analyses can be found in the paper.

Yichuan Wang@YichuanM

📄 Paper: https://github.com/StarTrail-org/PixelRAG/blob/main/assets/pixelrag-paper.pdf

💻 Code: http://github.com/StarTrail-org/PixelRAG

🌐 Demo: http://pixelrag.ai

🐍 Install: pip install pixelrag

Built at UC Berkeley Sky Computing Lab, BAIR, and Berkeley NLP. @BerkeleySky @berkeley_ai @BerkeleyNLP

This project was co-led with the incredible @andylizf. I couldn’t have asked for a better collaborator throughout this journey. We were also fortunate to work with outstanding researchers @zwcolin, @pteiletche, and @leshenj15.

A huge shout-out to my advisors @matei_zaharia, @profjoeyg, and @sewon__min — I learned an incredible amount throughout this journey.

Would love feedback from people building retrieval, agent, and VLM systems.

Sewon Min@sewon__min

@YichuanM @andylizf Also link to the paper 👉 http://github.com/StarTrail-org/PixelRAG/blob/main/assets/pixelrag-paper.pdf
