QR challenge 🔥
I tried to speedrun the QR_2 challenge of GPU MODE in 16 hours
i stopped at 4.7ms, this was my first "proper" gpu mode contest. constraints being: no way to tokenmaxx, no gpu to profile against extensively.
what I did to understand the math: > householder Reflection method to do QR decomposition
- reading the problem statement multiple times - did a 3x3 matrix dry run by hand - consulted gemini for validation of my mental model - studied parallel prefix sums - drew out the tree structure on paper before writing a single line of code - thought of a simpler approach (blocked method) instead of directly jumping to the tree like implementation
what worked: - blocked householder with GEMM trailing update - loading panels at ONCE into the SRAM - tuning block sizes to prevent spilling - tuning warp count for better occupancy - (unintended) pytorch fallback for n=4096
what did not work: - fp8/nvfp4 quantisation (precision violated) - Cholesky method (not allowed) - Mixed precision with iterative correction - TF32 global flag (failed tests lmao) - full panel resident in shared memory (spilled badly)
takeaways:
i didn't get a chance to do extensive kernel programming before, so I had to bet on my intuitions: - bytes moved is the real bottleneck, i have to reduce the round trips of data - asymptomatic wins can still lose in practise - if you're broke, tokenmaxxing is not an option. manual? doable, but time is a constraint. human x ai? hell yeah!