The web was never meant to be flattened into text.
Yet most web RAG systems start by parsing HTML, a complex and lossy process.
🔥 Introducing PixelRAG: the first RAG system that retrieves and reads 30M+ web pages as pixels.
Instead of extracting text, PixelRAG retrieves screenshots and lets a VLM read them directly.
PixelRAG not only preserves visual information but also outperforms text-based RAG on text-only QA benchmarks by 18.1%.
Why?
(1) HTML-to-text conversion often discards layout, structure, tables, and other useful signals.
(2) We continued pretraining a VLM on web page screenshots and turned it into a surprisingly strong visual retriever.
(3) Recent VLMs are remarkably good at understanding web pages, often with better accuracy and token efficiency than text-only pipelines.
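To make point (1) concrete, here is a toy sketch of how a naive HTML-to-text step can lose table structure. It is only an illustration using Python's standard library, not PixelRAG's pipeline or the text baseline it was compared against; the `TextOnly` class, the `flatten` helper, and the table contents are all made up for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Naive extractor (hypothetical): keeps text nodes, drops every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Skip whitespace-only nodes between tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


def flatten(html: str) -> str:
    """Return the page's text joined by single spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up table: row B has an empty "Params" cell.
PAGE = """
<table>
  <tr><th>Model</th><th>Params</th><th>Score</th></tr>
  <tr><td>A</td><td>7B</td><td>61.2</td></tr>
  <tr><td>B</td><td></td><td>58.9</td></tr>
</table>
"""

flat = flatten(PAGE)
print(flat)  # Model Params Score A 7B 61.2 B 58.9
```

After flattening, the row and column boundaries are gone and the empty cell vanishes, so nothing in the text says whether 58.9 is B's Params or B's Score. A screenshot of the same table keeps that alignment visible.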
Takeaway: HTML parsing may be one of the biggest self-inflicted bottlenecks in web RAG.
Demo below 👇
Code: https://github.com/StarTrail-org/PixelRAG
Paper: https://github.com/StarTrail-org/PixelRAG/blob/main/assets/pixelrag-paper.pdf
Playground: https://pixelrag.ai/
