Tech · 38d ago

GPT-5.5 achieves 99.46 percent accuracy on multi-digit multiplication across a 20-by-20 grid of problems with up to 20 digits per number

With medium reasoning effort the accuracy heatmap was almost completely filled in; without reasoning, accuracy was low.



Original post

Yuntian Deng#322

cozyblaze@cozyblaze265065

I redid the multi-digit multiplication experiment, now with gpt-5.5. With medium reasoning and 7 samples per cell, it pretty much aced the test with 99.46% accuracy. The model had no tools to call and had to rely on its reasoning. Can it go further? (1/4)

Yuntian Deng@yuntiandeng

For those curious about how o3-mini performs on multi-digit multiplication, here's the result. It does much better than o1 but still struggles past 13×13. (Same evaluation setup as before, but with 40 test examples per cell.)

1:24 AM · May 22, 2026 · 164.3K Views
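As a rough sanity check on the headline figure, the sketch below works out how many answers the reported 99.46% would correspond to. It assumes the grid is 20 by 20 with 7 samples per cell, and that the percentage is the fraction correct over all samples pooled together. The thread does not confirm how the figure was aggregated, and the function name is made up for illustration.

```python
# Assumptions (not confirmed in the thread): 20x20 grid of digit-count pairs,
# 7 samples per cell, accuracy = correct answers / all samples.
GRID = 20
SAMPLES_PER_CELL = 7
total = GRID * GRID * SAMPLES_PER_CELL


def implied_correct(reported_pct: float, total: int) -> list[int]:
    """Return every correct-count whose percentage rounds to reported_pct."""
    return [k for k in range(total + 1)
            if round(100 * k / total, 2) == reported_pct]


print(total, implied_correct(99.46, total))  # 2800 [2785]
```

Under those assumptions, 99.46% matches exactly one outcome: 2785 correct out of 2800, or 15 wrong answers across the whole grid.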

Sentiment

Many users praise GPT-5.5's near-perfect multi-digit multiplication accuracy via chain-of-thought as proof of powerful systematic reasoning, while some insist LLMs remain fundamentally unsuited for arithmetic.

Positive: 80.0% · Negative: 20.0% (12 comments with sentiment)


Posts from X

Most Activity

Views 4.4K · Bookmarks 5 · Likes 29 · Retweets 4 · Replies 1

Raphaël Millière@raphaelmilliere

I still occasionally hear people claim that LLMs are hilariously bad at arithmetic. Another reminder that it's not 2022 anymore.

cozyblaze@cozyblaze265065

38d

🇺🇦🇮🇱dmitriy samsonov@d0rc

@cozyblaze265065 @yuntiandeng What makes you think the model wasn't using tools on the backend? :) They can hide whatever tools they want and produce fake thinking traces :)

38d

cozyblaze@cozyblaze265065

@olafwillocx Symmetric in digit count. So there are 7 samples of (d1, d2) = (3, 2), and 7 samples of (d1, d2) = (2, 3), but every sample is randomly generated.

38d
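The sampling scheme described in that reply could be sketched as follows. This is only an illustration: the helper names, the fixed seed, and the choice of uniformly random operands with no leading zero are assumptions, not details from the thread.

```python
import random


def random_n_digit(d: int, rng: random.Random) -> int:
    """Uniform random integer with exactly d decimal digits (no leading zero)."""
    return rng.randint(10 ** (d - 1), 10 ** d - 1)


def build_grid(max_digits: int = 20, samples: int = 7, seed: int = 0):
    """Map every ordered pair (d1, d2) to its own independently drawn problems."""
    rng = random.Random(seed)
    return {
        (d1, d2): [(random_n_digit(d1, rng), random_n_digit(d2, rng))
                   for _ in range(samples)]
        for d1 in range(1, max_digits + 1)
        for d2 in range(1, max_digits + 1)
    }


grid = build_grid()
print(len(grid), sum(len(cell) for cell in grid.values()))  # 400 2800
print(grid[(3, 2)][0], grid[(2, 3)][0])  # different random problems
```

Because (3, 2) and (2, 3) are separate cells with separate draws, the grid is symmetric in digit counts but no problem is simply a swapped copy of another.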

Tom Benadryl@olafwillocx

@cozyblaze265065 You tested my exact March 2024 hypothesis oddly enough

38d

Tomo@Tomodovodoo

@davemeyers322 @cozyblaze265065 I think you misread. It's digits.

So, 5x5 would be 63528*28360 for example.

50x50 is 50 digits times 50 digits.

38d

Tom Dupuis@bellmantd

this proves that CoT is still underrated and undervalued.

the only way this is possible is that the CoT naturally knows how to decompose the multiplication into bits that can be computed purely from "raw intuition" in semantic space.

intuitively, there is no reason this should work so well

38d

Dave Meyers@davemeyers322

@cozyblaze265065 This is surprising to me

1) I can't imagine that multiplying 5 by 5 would be impossible for GPT-5.5 w/o reasoning

2) I can't imagine that multiplying 50 by 50 should take 100k tokens; it's just 5 by 5 with two zeros added. I'd be interested to see the reasoning trace

38d · 1.9K views

Tom Benadryl@olafwillocx

@cozyblaze265065 Are the attempts symmetric? Like if you tried 123*12 for a 2-3 digit test, did you also try 12*123?

38d · 1.2K views

Prathik Kini@kini_prathik

@cozyblaze265065 Does 5.5 not have access to a calculator with a tool call?

38d

takenvard@vardtaken

@cozyblaze265065 LLMs can never learn multiplication by design. It's mathematically impossible with the way their nodes are currently constructed.

38d

My name@PitaAndJelly

@vardtaken @cozyblaze265065 Why? They can just reason the process

38d

My name@PitaAndJelly

@vardtaken @cozyblaze265065 “The user wants to know what 2628638271 times 826381652 is… let's start with 1 times 2. That's 2. Now let's do 7 times 2. That's 14…” Blah blah blah

38d

Kitten 🐈@kitten_beloved

@cozyblaze265065 Does the chain of thought carry out a multiplication algorithm step by step?

38d

🧟@RaghavKoch19380

@cozyblaze265065 What about Low reasoning?

38d

Max Andrews@madmaxbr5

@bellmantd @cozyblaze265065 i mean it's just doing the multiplication "by hand" like we would, no? Though I'm curious if it did long multiplication or partial products.

38d
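For readers wondering what doing the multiplication "by hand" involves, here is a minimal sketch of schoolbook long multiplication on digit strings. It only illustrates the kind of procedure being discussed; nothing in the thread shows which method the model's reasoning actually followed, and the function name is chosen here for illustration.

```python
def long_multiply(a: str, b: str) -> str:
    """Schoolbook multiplication of two non-negative decimal digit strings."""
    x = [int(c) for c in reversed(a)]  # least significant digit first
    y = [int(c) for c in reversed(b)]
    out = [0] * (len(x) + len(y))      # a product has at most len(a)+len(b) digits
    for j, dy in enumerate(y):         # one row per digit of b
        carry = 0
        for i, dx in enumerate(x):     # single-digit product plus carry
            total = out[i + j] + dx * dy + carry
            out[i + j] = total % 10
            carry = total // 10
        out[len(x) + j] += carry       # leftover carry starts the next column
    digits = "".join(map(str, reversed(out))).lstrip("0")
    return digits or "0"


# Operands quoted elsewhere in the thread, checked against Python's exact integers.
print(long_multiply("63528", "28360") == str(63528 * 28360))
print(long_multiply("2628638271", "826381652") == str(2628638271 * 826381652))
```

Each inner step is a one-digit multiplication and an addition, so an n-by-m digit product takes n × m such steps. That is consistent with the observation in the thread that reasoning length grows with digit count.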

cozyblaze@cozyblaze265065

What about xhigh reasoning? For the 100x100 digits test, it came back with 1/3 correct (and a large bill💸). I believe with enough reasoning budget, it can perform long multiplication beyond 50x50, but I'll stop here since I'm seeing the bottom of my wallet. (3/4)

38d

takenvard@vardtaken

@PitaAndJelly @cozyblaze265065 How?

38d

cozyblaze@cozyblaze265065

I tested medium reasoning with 30x30 and 50x50 digits multiplication, and each came back with 7/7 correct. As expected, reasoning increased with digit count. I tried to push it to 100x100 digits with no success (0/7 correct). (2/4)

38d
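Tallies such as 7/7 or 0/7 can be checked mechanically, because Python integers are exact at any length. The sketch below is one hypothetical way to grade replies. It assumes the last number in a reply is the model's final answer, which the thread does not specify, and the function names are invented for illustration.

```python
import re
from typing import Optional


def extract_integer(reply: str) -> Optional[int]:
    """Return the last run of digits in the reply, ignoring comma separators."""
    matches = re.findall(r"\d[\d,]*", reply)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


def count_correct(replies: list[str], a: int, b: int) -> int:
    """Count replies whose extracted answer equals the exact product a * b."""
    target = a * b  # arbitrary-precision, so 100-digit operands are fine
    return sum(extract_integer(r) == target for r in replies)


a, b = 63528, 28360
replies = [f"The product is {a * b}.", f"{a * b:,}", "I could not finish."]
print(count_correct(replies, a, b), "/", len(replies))  # 2 / 3
```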

Ryan Topps@RyanJTopps

@cozyblaze265065 Are you sure it didn't sneak a Python code construction in there?

38d

Penguin Sensei@PenguinSensei__

@cozyblaze265065 @fentasyl ChatGPT will eventually create better IQ tests than humans can create

38d