LLMs can look at an image, judge its creativity, and reveal the logic behind the score.
Most models matched human scores fairly well, especially Gemini 3 Flash, which led on both image types.
But the models had clear biases: they rated polished AI images too generously and rough sketches too harshly.
When three models showed their reasoning, they mostly discussed what the image depicted, how original it seemed, its visual quality, and the final score.
So this paper shows that visual creativity scoring can scale, while its biases still need calibration.
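
One simple way to picture that calibration step, as a rough sketch and not the paper's method: measure how far the model's scores sit above or below human scores for each image type, then subtract that offset. The function names, the image-type labels, and the numbers are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def group_bias(records):
    """Mean signed gap (model score minus human score) per image type."""
    diffs = defaultdict(list)
    for image_type, model_score, human_score in records:
        diffs[image_type].append(model_score - human_score)
    return {image_type: mean(gaps) for image_type, gaps in diffs.items()}

def calibrate(records, bias):
    """Subtract each image type's mean gap from the model's scores."""
    return [
        (image_type, model_score - bias.get(image_type, 0.0))
        for image_type, model_score, _ in records
    ]

# Hypothetical (image_type, model_score, human_score) triples,
# NOT results from the paper.
records = [
    ("ai_image", 4.5, 4.0),
    ("ai_image", 4.0, 3.0),
    ("sketch", 2.0, 3.0),
    ("sketch", 2.5, 3.0),
]

bias = group_bias(records)            # positive = model too generous
calibrated = calibrate(records, bias)
print(bias)
print(calibrated)
```

In practice the offsets would be estimated on a held-out set with human ratings and then applied to new images; a per-type mean shift is only the simplest possible correction.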
----
Link – arxiv.org/abs/2606.29672
Title: "How LLMs See Creativity: Zero-Shot Scoring of Visual Creativity with Interpretable Reasoning"



