Keshigeyan Chandrasegaran and Kyle Sargent launch GPIC, a permissive image-text dataset and benchmark for training visual models

The 28-trillion-pixel corpus is fully permissive for commercial use.

Original post

Jing Yu Koh#1204

Keshigeyan Chandrasegaran@keshigeyan

1/ Introducing GPIC: a Giant Permissive Image Corpus and benchmark for visual generation!

🚀100M VLM-captioned image-text pairs for training 📊1M image-text pairs for benchmarking 🖼️~28 trillion pixels 🤗Centrally Hosted ✅Fully permissive for research + commercial use

Dataset, benchmark and models🧵👇

Co-led with @KyleSargentAI

9:30 AM · May 29, 2026 · 115K Views

Sentiment

Many users praised the GPIC 100M permissive image-text dataset release as a valuable open contribution that enables better training and benchmarking for visual AI models.

Positive: 100.0% · Negative: 0.0% · 17 comments with sentiment.

Via STANFORD.EDU

Posts from X

Most Activity

Views 47.1K · Bookmarks 100 · Likes 256 · Replies 16

Fei-Fei Li@drfeifei

I’m very excited by this new benchmark dataset for visual generation that is suitable for the modern era of large scale generative models!🤩

25d47.1K256100

Justin Johnson@jcjohnss

GPIC should be the new standard benchmark for generative modeling. Training 1 epoch on GPIC is the same cost as 100 epochs on ImageNet, but is a much better proxy for real-world problems. If you work in generative modeling, try GPIC for your next project!

25d31.6K8031
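
The comparison between one GPIC epoch and 100 ImageNet epochs can be sanity-checked by counting training samples. The sketch below is only a rough estimate: it assumes the standard ImageNet-1K training split of 1,281,167 images and treats every image as equally expensive, ignoring resolution, caption length, and model size. The function name `samples_seen` is made up for illustration.

```python
# Back-of-the-envelope check: count images processed per training run.
# Assumption: cost is proportional to the number of samples seen.
IMAGENET_1K_TRAIN_IMAGES = 1_281_167  # standard ILSVRC-2012 training split
GPIC_TRAIN_IMAGES = 100_000_000       # "100M" from the announcement


def samples_seen(dataset_size: int, epochs: float) -> float:
    """Total training samples processed over the given number of epochs."""
    return dataset_size * epochs


imagenet_100_epochs = samples_seen(IMAGENET_1K_TRAIN_IMAGES, 100)
gpic_one_epoch = samples_seen(GPIC_TRAIN_IMAGES, 1)
ratio = gpic_one_epoch / imagenet_100_epochs

print(f"ImageNet-1K x 100 epochs: {imagenet_100_epochs / 1e6:.1f}M samples")
print(f"GPIC x 1 epoch:           {gpic_one_epoch / 1e6:.1f}M samples")
print(f"ratio (GPIC / ImageNet):  {ratio:.2f}")
```

Under these assumptions the two budgets land within the same order of magnitude (about 128M versus 100M samples), which is consistent with the "same cost" framing as an approximation.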

Kyle Sargent@KyleSargentAI

Today we released “GPIC: A Giant Permissive Image Corpus for Visual Generation.” It’s a 100M image dataset for visual generation, with text captions and 100% known+permissive licenses, hosted on HuggingFace. I’m excited to get this out! Check it out: https://gpic.stanford.edu/

25d4.8K7013

Keshigeyan Chandrasegaran@keshigeyan

7/ Happy pretraining! 🤗 Dataset: https://huggingface.co/datasets/stanford-vision-lab/gpic 🤗 Models: https://huggingface.co/stanford-vision-lab/gpic-baselines 🛠️ Code + evaluation toolkit: https://github.com/keshik6/gpic 🌎 Website: https://gpic.stanford.edu 📄 Paper: https://arxiv.org/abs/2605.30341

25d1.3K234

Justin Johnson@jcjohnss

In recent years, academic and industry work in generative modeling has drifted so far apart that they are playing totally different games, and techniques that work in academia may not transfer to industry problems.

The divide isn't just about scale -- the different tasks in academia vs industry lead to different fundamental challenges.

Academic work focuses on class-conditional ImageNet generation. This has a very weak conditioning signal (single categorical label) and the problem is very data-constrained, with all SOTA methods training for hundreds of epochs. The main challenge in this regime is combatting overfitting.

Industry work on image or video generation usually has a much richer conditioning signal (e.g. very long captions, input images, etc) and is almost always underfitting since data can be scaled to absurd degrees. Overfitting (at least for pretraining) isn't a concern; instead we want to fit the complex data distribution *as fast as possible*.

We hope that GPIC is approachable on the academic budgets people are already expending on ImageNet, but will lead to problems more similar to the industry-scale challenges in generative modeling.

25d1.7K184

Keshigeyan Chandrasegaran@keshigeyan

2/📌Why does this matter? ImageNet-1K drove progress in visual generation for over a decade. But to train and benchmark modern visual generative models, we need large, permissive, and accessible datasets with rich text captions.

GPIC is designed for this new setting.

25d1.8K182

Keshigeyan Chandrasegaran@keshigeyan

4/📊GPIC Statistics: 🚀100M training image-text pairs 🧪1M test + 200K validation pairs 📦12.9TB across 8,000 shards 🤗Centrally hosted on Hugging Face 📝Captions: 1% tag, 45% short, 45% medium, 9% long ⚡ Benchmark scales: GPIC-Nano (1M), GPIC-Lite (10M), and GPIC-Full (100M)

25d1.3K152
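
The headline statistics imply some useful per-image and per-shard averages. The arithmetic below is a sketch under stated assumptions: it assumes the pixel count, the 12.9TB size, and the 8,000 shards all refer to the 100M training pairs (the thread does not say whether test and validation data are included), and that "TB" means decimal terabytes. All variable names are illustrative.

```python
import math

# Figures quoted in the thread; the scope of each figure is an assumption.
TOTAL_PIXELS = 28e12          # "~28 trillion pixels"
TRAIN_IMAGES = 100_000_000    # "100M training image-text pairs"
TOTAL_BYTES = 12.9e12         # "12.9TB", assumed decimal terabytes
NUM_SHARDS = 8_000            # "8,000 shards"
CAPTION_MIX_PERCENT = {"tag": 1, "short": 45, "medium": 45, "long": 9}

pixels_per_image = TOTAL_PIXELS / TRAIN_IMAGES
square_side = math.sqrt(pixels_per_image)   # side of an equivalent square image
bytes_per_shard = TOTAL_BYTES / NUM_SHARDS
images_per_shard = TRAIN_IMAGES / NUM_SHARDS
bytes_per_image = TOTAL_BYTES / TRAIN_IMAGES

# Expected number of training captions of each type under the stated mix.
caption_counts = {
    kind: TRAIN_IMAGES * pct // 100 for kind, pct in CAPTION_MIX_PERCENT.items()
}

print(f"average pixels per image: {pixels_per_image:,.0f}")
print(f"equivalent square side:   {square_side:.0f} px")
print(f"average shard size:       {bytes_per_shard / 1e9:.2f} GB")
print(f"average images per shard: {images_per_shard:,.0f}")
print(f"average bytes per image:  {bytes_per_image / 1e3:.0f} KB")
print(f"caption counts:           {caption_counts}")
```

If those assumptions hold, the average image is roughly equivalent to a 529 x 529 square, and each shard holds about 12,500 images in about 1.6 GB.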

Kyle Sargent@KyleSargentAI

I’m also proud of this section of the paper, which gives best practices for compliance with our eval protocol. Without calling out anyone in particular, let me just say that using auxiliary foundation models to get a better FD-DINOv2 on GPIC without being very up front about the huge advantages of the extra data and model FLOPs is super bad – please don’t do it!

25d35981

Keshigeyan Chandrasegaran@keshigeyan

5/🔧GPIC improves the evaluation protocol for visual generative modeling. ImageNet-1K FID is saturated: several models now score better than held-out real images.

GPIC uses FD-DINOv2, an unsaturated metric, evaluated against the 1M GPIC test set.

25d1.3K13
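
For readers unfamiliar with the metric family: a Fréchet distance (FD) fits a Gaussian to each of two feature sets and measures the distance between the Gaussians. FID does this with Inception features; FD-DINOv2 does it with DINOv2 features. The NumPy sketch below shows only the distance computation and is not the GPIC evaluation toolkit. It uses random arrays as stand-ins for features, and the official protocol's preprocessing, feature extractor settings, and numerics may differ.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (n, d) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs. Clip tiny negatives from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Stand-in features (assumption): 2,000 samples of 8-dimensional vectors.
rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))
shifted = real + 1.0  # same covariance, mean moved by 1 in each dimension

fd_same = frechet_distance(real, real)
fd_shift = frechet_distance(real, shifted)
print(f"identical sets: {fd_same:.6f}")
print(f"shifted set:    {fd_shift:.6f}")
```

Identical sets give a distance of essentially zero, and a pure mean shift contributes the squared length of the shift (here 8, one unit in each of 8 dimensions), which makes the two terms of the formula easy to see.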

Keshigeyan Chandrasegaran@keshigeyan

6/🎯We also provide a reference baseline. We train a 1.1B JiT-T2I pixel-space flow matching model on GPIC for one epoch.

It obtains FD=76.3 on the GPIC benchmark (cfg=6.25).

25d1.3K17
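
The "cfg" value refers to a classifier-free guidance scale. A minimal sketch of the usual guidance rule follows, assuming the common convention in which a scale of 1 recovers the plain conditional prediction; the thread does not state which convention the GPIC baseline uses, so the exact meaning of 6.25 there is not confirmed. The function name `apply_cfg` and the toy inputs are illustrative only.

```python
import numpy as np


def apply_cfg(pred_uncond: np.ndarray, pred_cond: np.ndarray, scale: float) -> np.ndarray:
    """Extrapolate from the unconditional prediction toward the conditional one.

    Assumed convention: scale=0 gives the unconditional prediction,
    scale=1 gives the conditional prediction, scale>1 amplifies the
    direction implied by the text condition.
    """
    return pred_uncond + scale * (pred_cond - pred_uncond)


# Toy predictions standing in for a model's two forward passes (assumption).
pred_uncond = np.zeros(4)
pred_cond = np.ones(4)

guided = apply_cfg(pred_uncond, pred_cond, scale=6.25)
print(guided)
```

In a real sampler the two inputs would come from running the model once with the caption and once with an empty or dropped caption at each sampling step.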

Keshigeyan Chandrasegaran@keshigeyan

3/🤔What are the requirements for a visual generation benchmark dataset?

✅Permissive: research + commercial use ✅Stable: the benchmark should not change over time ✅Large: much larger than ImageNet-1K ✅Accessible: centrally hosted and easy to download

GPIC satisfies all four criteria.

25d1.5K15

Lucas Beyer (bl16)@giffmana

@keshigeyan Nice!

25d2.5K61

Manling Li@ManlingLi_

Has been looking forward to such resources for so long! Amazing @keshigeyan

Also really like the quote of Goodhart’s law: When a measure becomes a target, it ceases to be a good measure.

20d1.9K110

Kyle Sargent@KyleSargentAI

Check out Keshik’s awesome thread for more details! It was great co-leading this project with him. In my thread, I’m going to talk more conversationally about our scientific motivations for this dataset, and what we hope the community will get out of it.

25d82341

Jing Yu Koh@kohjingyu

@keshigeyan Wow this is a very valuable contribution! Now that @KyleSargentAI pointed it out it’s really odd we don’t have a standard open T2I training set. Congrats Keshik and Kyle on the release!!

25d1K7

Keshigeyan Chandrasegaran@keshigeyan

8/ This work was done at @StanfordAILab & @StanfordSVL with the amazing @KyleSargentAI, @agarwal_suchir, @michaeljkjang, @MichaelPoli6, @jcniebles, @jcjohnss, @jiajunwu_cs & @drfeifei 🎉 Big thanks to @huggingface for the support!

25d1.1K13

Kyle Sargent@KyleSargentAI

One practical example is epoch count – “state-of-the-art” models on ImageNet-1K train for 300-1700 epochs (Fig. credit: PixGen). But that’s not the way you would do things outside of an academic comparison – you’d just go get more data!

25d2674

Bill The Investor@billtheinvestor

@drfeifei This new benchmark dataset could enhance training efficiency for large-scale generative models, potentially reducing compute costs by 30%.

25d1.4K1

Kyle Sargent@KyleSargentAI

It’s kind of crazy to me that in 2026, we don’t have a huge T2I visual generation benchmark dataset. ImageNet-1K is still the standard, but it’s small and is class-conditional only, so over time the dataset has drifted from the way things are done in practice.

25d3075