Can't believe I am doing this.
Just Open Sourced the Largest Synthetic Parallel Multilingual OCR dataset
> 1M+ Document Images
> 22 Languages (Arabic, Bengali, German, English, Spanish, French, Gujarati, Hindi, Italian, Japanese, Kannada, Korean, Malayalam, Marathi, Odia, Punjabi, Russian, Sanskrit, Tamil, Telugu, Thai, Chinese)
> 6 Tasks (OCR, Layout Detection, Layout-aware Translation, Document VQA, Cross-lingual Retrieval, Document VLM Pretraining)
PS: this is the 2025 corpus. The 2026 corpus is ~5× bigger (~4.4M images, sharper renders, cleaner annotations). Reach out to @cognitivelab_ai or contact@cognitivelab.in for more info.