Keshigeyan Chandrasegaran and Kyle Sargent launch GPIC, a permissively licensed 100-million image-text dataset and benchmark

The dataset spans roughly 28 trillion pixels and is hosted on Hugging Face.

Original post

1/ Introducing GPIC: a Giant Permissive Image Corpus and benchmark for visual generation!
🚀 100M VLM-captioned image-text pairs for training
📊 1M image-text pairs for benchmarking
🖼️ ~28 trillion pixels
🤗 Centrally Hosted
✅ Fully permissive for research + commercial use
Dataset, benchmark and models 🧵👇
Co-led with @KyleSargentAI

9:30 AM · May 29, 2026

I’m very excited by this new benchmark dataset for visual generation that is suitable for the modern era of large scale generative models!🤩

Quoting Keshigeyan Chandrasegaran (@keshigeyan) · 4:30 PM · May 29, 2026 · 69.7K Views
4:56 PM · May 29, 2026 · 29K Views

GPIC should be the new standard benchmark for generative modeling. Training 1 epoch on GPIC is the same cost as 100 epochs on ImageNet, but is a much better proxy for real-world problems. If you work in generative modeling, try GPIC for your next project!

Quoting Keshigeyan Chandrasegaran (@keshigeyan) · 4:30 PM · May 29, 2026 · 69.7K Views
10:03 PM · May 29, 2026 · 12.8K Views
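
The "1 epoch on GPIC vs. 100 epochs on ImageNet" comparison can be sanity-checked with rough arithmetic. The sketch below only counts training samples seen; it is not the posters' cost model, and real cost also depends on resolution, caption encoding and architecture. The ImageNet size (the ILSVRC-2012 train split) is an outside figure not given in the posts, and the average-pixel estimate assumes all ~28 trillion pixels belong to the 100M training pairs, which the posts do not state.

```python
import math

# Figures from the announcement
GPIC_TRAIN_PAIRS = 100_000_000   # "100M VLM-captioned image-text pairs"
GPIC_TOTAL_PIXELS = 28e12        # "~28 trillion pixels" (approximate)

# Assumption, not from the posts: ILSVRC-2012 (ImageNet-1k) train split size
IMAGENET_TRAIN_IMAGES = 1_281_167


def samples_seen(dataset_size: int, epochs: int) -> int:
    """Number of training samples processed over the given number of epochs."""
    return dataset_size * epochs


imagenet_100_epochs = samples_seen(IMAGENET_TRAIN_IMAGES, 100)
gpic_1_epoch = samples_seen(GPIC_TRAIN_PAIRS, 1)

# Ratio near 1 means the two schedules process a similar number of samples.
ratio = gpic_1_epoch / imagenet_100_epochs

# Rough average image size, assuming every pixel is in the training split.
avg_pixels_per_image = GPIC_TOTAL_PIXELS / GPIC_TRAIN_PAIRS
approx_square_side = math.sqrt(avg_pixels_per_image)

print(f"ImageNet x100 epochs: {imagenet_100_epochs:,} samples")
print(f"GPIC x1 epoch:        {gpic_1_epoch:,} samples")
print(f"GPIC / ImageNet sample ratio: {ratio:.2f}")
print(f"Average pixels per GPIC image: {avg_pixels_per_image:,.0f} "
      f"(about {approx_square_side:.0f} px square)")
```

Under these assumptions the two schedules land in the same order of magnitude of samples seen, which is consistent with the "same cost" framing only as a ballpark.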

In recent years, academic and industry work in generative modeling has drifted so far apart that they are playing totally different games, and techniques that work in academia may not transfer to industry problems.

The divide isn't just about scale -- the different tasks in academia vs industry lead to different fundamental challenges.

Academic work focuses on class-conditional ImageNet generation. This has a very weak conditioning signal (single categorical label) and the problem is very data-constrained, with all SOTA methods training for hundreds of epochs. The main challenge in this regime is combatting overfitting.

Industry work on image or video generation usually has a much richer conditioning signal (e.g. very long captions, input images, etc) and is almost always underfitting since data can be scaled to absurd degrees. Overfitting (at least for pretraining) isn't a concern; instead we want to fit the complex data distribution *as fast as possible*.

We hope that GPIC is approachable on the academic budgets people are already expending on ImageNet, but will lead to problems more similar to the industry-scale challenges in generative modeling.

Quoting Justin Johnson (@jcjohnss) · 10:03 PM · May 29, 2026 · 12.8K Views
10:03 PM · May 29, 2026 · 1.3K Views