1d ago

Largest Synthetic Multilingual OCR Dataset Open Sourced With 1M+ Images


Original post

Can't believe I am doing this.

Just Open Sourced the Largest Synthetic Parallel Multilingual OCR dataset

> 1M+ Document Images
> 22 Languages (Arabic, Bengali, German, English, Spanish, French, Gujarati, Hindi, Italian, Japanese, Kannada, Korean, Malayalam, Marathi, Odia, Punjabi, Russian, Sanskrit, Tamil, Telugu, Thai, Chinese)
> 6 Tasks (OCR, Layout Detection, Layout-aware Translation, Document VQA, Cross-lingual Retrieval, Document VLM Pretraining)

ps: this is the 2025 corpus. 2026 is ~5× bigger (~4.4M images, sharper renders, cleaner annotations).

Reach out to @cognitivelab_ai or contact@cognitivelab.in for more info.

8:10 AM · May 25, 2026
