Study Shows LLMs Score Visual Creativity With Human-Like Accuracy

Original post

Rohan Paul@rohanpaul_ai · #1260 in Tech

LLMs can look at an image, judge its creativity, and reveal the logic behind the score.

Most models matched human scores fairly well, especially Gemini 3 Flash, which led on both image types.

But the models had clear biases: they rated polished AI images too generously and rough sketches too harshly.

When 3 models showed their reasoning, they mostly talked about what they saw, how original it seemed, visual quality, and the final score.

So this paper shows that visual creativity scoring can scale, while its biases still need calibration.

----

Link – arxiv.org/abs/2606.29672

Title: "How LLMs See Creativity: Zero-Shot Scoring of Visual Creativity with Interpretable Reasoning"

1:17 AM · Jul 4, 2026 · 5.1K Views

Sentiment

Users agree that sketches are undervalued because LLMs scoring visual creativity favor polished completed work over rough drafts.

Pos: 100.0%

Neg: 0.0%

1 comment with sentiment.

Posts from X

Most Activity

VIEWS 89 · REPLIES 1

Lucian Armasu@lucian_armasu

@rohanpaul_ai Could you share the arxiv link in the sub-comment instead? Easier for everyone.

22h · 89 views

LIKES 1

Miles Arden@JKL456497689546

@rohanpaul_ai This is the failure mode to watch. If a judge rewards polish, it will quietly turn creativity scoring into production-value scoring. Useful at scale, but only if the calibration set has enough rough human work to keep the model honest.

1d · 141

RETWEETS 5

Rohan Paul@rohanpaul_ai

1d · 5.1K views

Phi Browser@phibrowser

@rohanpaul_ai the polish bias tracks with my day job. polish is creativity with the risk sanded off: polished sites are all the same site, same nav, same modal, button where i expect it. the rough hand-rolled pages are where i meet ideas nobody warned my selectors about.

1d · 42

安叫兽|Bird🕊️ 🔶 BNB@ajs6888

@rohanpaul_ai The point about sketches being undervalued rings pretty true; models like to see a finished look too.

14h · 11

Vincent Lejeune@VincentLejeune

@rohanpaul_ai Here: https://arxiv.org/pdf/2606.29672v1

14h · 10

Vincent Lejeune@VincentLejeune

@lucian_armasu @rohanpaul_ai Here: https://arxiv.org/pdf/2606.29672v1

14h · 4