I added a new section explaining how torch.profiler sometimes isn't enough: you could be losing a huge amount of time to overhead on every fwd/bwd call without being aware of it. In that case cProfile is needed, and then the problem is easy to spot:
https://github.com/stas00/the-art-of-debugging/tree/master/pytorch#when-torchprofiler-isnt-enough
As a demonstration I used a liger-kernel recompilation issue, a hidden and very costly problem that hurt performance. The issue itself has been fixed in recent liger-kernel releases.
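
For a rough idea of what the cProfile approach looks like, here is a minimal sketch using only the Python standard library. It is not the code from the linked section: `train_step`, `hidden_overhead` and `profile_call` are hypothetical names, and `train_step` is just a stand-in for a real fwd/bwd call, with a `time.sleep` playing the role of hidden Python-side overhead such as a repeated recompilation.

```python
import cProfile
import io
import pstats
import time


def hidden_overhead():
    # Hypothetical stand-in for costly Python-side work,
    # e.g. something that gets recompiled on every call.
    time.sleep(0.05)


def train_step():
    # Hypothetical placeholder for one forward/backward call.
    hidden_overhead()
    return sum(i * i for i in range(10_000))


def profile_call(fn, *args, top=10, **kwargs):
    """Run fn under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time,
    so expensive callees show up near the top.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = fn(*args, **kwargs)
    finally:
        profiler.disable()

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


if __name__ == "__main__":
    _, report = profile_call(train_step)
    print(report)
```

In a real training loop you would wrap a single fwd/bwd call the same way and look for unexpected functions with a large cumulative time. Sorting by cumulative time is a choice made for this sketch; sorting by `"tottime"` instead highlights functions that are expensive on their own rather than through their callees.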