elanapearl.github.io
# the bug that taught me more about PyTorch than years of using it

a loss plateau that looked like my mistake turned out to be a PyTorch bug. tracking it down meant peeling back every layer of abstraction, from optimizer internals to GPU kernels.

`Expected to fix: my hyperparameters. Actually had to fix: PyTorch backend.`

My training loss plateaued and wouldn’t budge. Obviously I’d screwed something up. I tried every hyperparameter combination, rewrote my loss function, and spent days assuming I’d made some stupid mistake. Because it’s always user error.

…

**The Bug:** A PyTorch GPU kernel silently failed when writing to non-contiguous memory, causing my model’s encoder weights to freeze during training on Apple Silicon (MPS backend, PyTorch <2.4).

**The Technical Details:** PyTorch’s MPS (Apple Silicon GPU) backend had a kernel bug where the `addcmul_` and `addcdiv_` operations silently failed when writing to non-contiguous output tensors.

…

- Encoder weights were initialized as the transpose of the decoder’s → non-contiguous memory layout
- Adam’s state tensors inherited this layout (`exp_avg` and `exp_avg_sq` became non-contiguous)
- The MPS kernels for `addcmul_`/`addcdiv_` don’t handle non-contiguous outputs correctly

…

- **Adjust your code:** make the weights contiguous at initialization
- **Upgrade PyTorch:** upgrade to PyTorch ≥2.4 (fixes `addcmul_`/`addcdiv_`)
- **(Complete fix) Upgrade your operating system:** upgrade to macOS 15+ (native non-contiguous tensor support)

…

## The Mystery: A Plateauing Loss

Training loss plateaued way too early. This felt like a standard hyperparameter issue, but I’d trained this same architecture on similar data with similar hyperparameters countless times and hit much lower losses. What had changed? Those runs were months old. I tried reproducing them exactly, but couldn’t pin down the exact environment: the codebase had evolved through multiple projects, refactors, and dependency updates. Without a clean “before vs. after,” I had to debug forward.
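The chain above is easy to reproduce on any backend. A minimal sketch (shapes are hypothetical, not from my model) showing how a transposed initialization produces a non-contiguous weight, how Adam-style state allocated with `zeros_like` inherits that layout, and what the `.contiguous()` workaround does:

```python
import torch

# A weight created as the transpose of another tensor is a *view* with a
# non-contiguous memory layout (hypothetical shapes for illustration).
decoder_weight = torch.randn(128, 64)
encoder_weight = decoder_weight.t()      # tied/transposed initialization

print(decoder_weight.is_contiguous())    # True
print(encoder_weight.is_contiguous())    # False

# Adam allocates its state with zeros_like(param), which by default preserves
# the param's memory format, so the state tensors are non-contiguous too.
exp_avg = torch.zeros_like(encoder_weight)
print(exp_avg.is_contiguous())           # False

# Workaround: make the weight contiguous at initialization, so the optimizer
# state (and every in-place kernel writing into it) sees a contiguous tensor.
encoder_weight = decoder_weight.t().contiguous()
print(encoder_weight.is_contiguous())    # True
```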
…

The second bug masked the first, creating a silent failure: the spookiest type of error. The model appeared to be learning (the decoder was training normally), but progress stalled because the encoder stayed frozen. A subtle plateau that looked exactly like a hyperparameter issue 🙃

**Side note: why did forward and backward passes work fine with non-contiguous weights?**

…

To understand why some operations work and others don’t, I needed to look at PyTorch’s source code for the buggy kernels. While I’d normally trace through a Python codebase by jumping to definitions in my IDE, that doesn’t work with `tensor.addcmul_()`. When you call this function, no Python source code executes; instead, Python immediately jumps into compiled C++ code for performance. And since PyTorch ships this as a pre-compiled binary, I can’t see that C++ implementation.
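Even without the C++ source, the operation itself is easy to exercise. A small sketch of the update pattern Adam applies to its second-moment state, written into a non-contiguous (transposed) output tensor; on CPU, and on the patched MPS backend, the in-place write lands correctly (the bug was that the old MPS kernel silently left the output untouched):

```python
import torch

# Adam's second-moment update uses exactly the in-place op that was buggy:
#   exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
grad = torch.ones(2, 3)
exp_avg_sq = torch.zeros(3, 2).t()       # non-contiguous output tensor
assert not exp_avg_sq.is_contiguous()

beta2 = 0.999
exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

# On a healthy backend every entry is now (1 - beta2) * grad * grad = 0.001;
# on the buggy MPS kernel the write was dropped and the state stayed at zero,
# which is what froze the encoder.
print(exp_avg_sq)
```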