A pseudonymous PyTorch engineer notes that the CUDA caching allocator produces hard-to-diagnose bugs like stale data and allocation failures in NVIDIA GPU code.
Edward Z. Yang replies inquiring about fragmentation or stream issues.
and add expandable segments if you want a real challenge
the CUDA caching allocator is such a great way to create extremely "interesting" bugs for yourself
@ezyang we have a kernel that's corrupting memory between the forward and backward pass and i think caching allocator was making it non-deterministic (really not its fault, i was just being stupid and didn't realize what was going on)
@typedfemale What kinds of interesting bugs? Fragmentation? Streams?
@typedfemale I have had some "fun" out of bounds bugs where CUDA sanitizer didn't help because all the memory accessed was technically valid 😂
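For readers following the thread, here is a minimal sketch of why an out-of-bounds write can be "technically valid" under a caching allocator. It is a pure-Python toy model, not PyTorch's actual allocator: `ToyCachingAllocator`, `buggy_kernel`, and the 64-byte segment are all hypothetical names and sizes chosen for illustration. The assumption being modeled is that many tensors are carved out of one large device allocation, so a tool that only checks whether an address lies inside some live allocation has nothing to flag.

```python
class ToyCachingAllocator:
    """Toy model: one big 'device' segment, sub-allocated into blocks."""

    def __init__(self, segment_size):
        # Stands in for a single large driver-level allocation.
        self.segment = bytearray(segment_size)
        self.next_offset = 0
        self.free_blocks = {}  # size -> offsets cached for reuse

    def malloc(self, size):
        cached = self.free_blocks.get(size)
        if cached:
            # Reuse a cached block without clearing it: old bytes remain.
            return cached.pop()
        offset = self.next_offset
        if offset + size > len(self.segment):
            raise MemoryError("toy segment exhausted")
        self.next_offset += size
        return offset

    def free(self, offset, size):
        # The block goes back to the cache, not back to the "driver".
        self.free_blocks.setdefault(size, []).append(offset)

    def in_segment(self, addr):
        # What a segment-level bounds check would see.
        return 0 <= addr < len(self.segment)


def buggy_kernel(alloc, offset, size):
    """Hypothetical kernel with an off-by-one: writes size + 1 bytes."""
    touched = []
    for i in range(size + 1):
        addr = offset + i
        alloc.segment[addr] = 0xFF
        touched.append(addr)
    return touched


alloc = ToyCachingAllocator(segment_size=64)
a = alloc.malloc(8)
b = alloc.malloc(8)  # placed directly after `a` in the same segment

touched = buggy_kernel(alloc, a, 8)
all_valid = all(alloc.in_segment(addr) for addr in touched)
neighbour_corrupted = alloc.segment[b] == 0xFF

# Freeing and reallocating the same size hands back the same block,
# still holding whatever the previous owner wrote.
alloc.free(a, 8)
c = alloc.malloc(8)
stale_bytes_visible = (c == a) and alloc.segment[c] == 0xFF
```

In this model every address the kernel touches passes the segment-level check even though it overwrites the first byte of `b`, and the reused block `c` still carries the old contents of `a`. Which neighbour gets hit depends on allocation order and cache state, which is one plausible reading of the non-determinism mentioned above. When debugging real code, PyTorch documents the `PYTORCH_NO_CUDA_MEMORY_CACHING=1` environment variable for disabling caching, which may let memory-checking tools see individual allocations.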