blog.ezyang.com

New Year's resolutions for PyTorch in 2025

4/1/2025 · Updated 10/25/2025

Excerpt

In my previous two posts "" and "", I often said that PyTorch would be good for a use case, but there might be some downsides. Some of the downsides are foundational and difficult to remove. But some... just seem like a little something is missing from PyTorch. In this post, here are some things I hope we will end up shipping in 2025!

**Pre-compilation: beyond single graph export.** Whenever someone realizes that torch.compile compilation is taking a substantial amount of time on expensive cluster machines, the first thing they ask is, "Why don't we just compile it in advance?" Supporting precompilation with the torch.compile API exactly as-is is not so easy; unlike a traditional compiler, which gets the source program directly as input, users of torch.compile must actually run their Python program to hit the regions of code that are intended to be compiled. Nor can these regions be trivially enumerated and then compiled: not only must we know the metadata of all input tensors flowing into a region, but a user might not even *know* what the compiled graphs are if a model has graph breaks. OK, but why not just run the model, dump all the compiled products, and then reuse them later? This works! Here is where a special decorator …

**Improving caching further.** There are some gaps with caching which we hope to address in the near future: (1) loading Triton cache artifacts takes a long time because we still re-parse the Triton code before doing a cache lookup (James Wu is on this), (2) if you have a lot of small graphs, the remote cache ends up having to do lots of small network requests, instead of one batched network request at the beginning (Oguz Ulgen recently landed this), (3) AOTAutograd cache is not fully rolled out yet (James Wu again). Collectively, these should be worth a 2x speedup or even more on warm cache time.

**Fix multithreading.** We should just make sure multithreading works, doing the testing and fiddly thread-safety auditing needed to make it work.
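The "run once, dump, reuse" workflow above can be pictured with a toy analogy. This is not PyTorch's actual caching mechanism, just a minimal sketch of the idea: compiled products are keyed by the source of a region plus the metadata of its example inputs, so a warm run can persist artifacts that a later cold start reuses without recompiling. All names here (`ToyCompileCache`, the metadata dict) are illustrative.

```python
import hashlib
import json
import os
import tempfile

class ToyCompileCache:
    """Toy persistent cache keyed by (region source, input tensor metadata)."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _key(self, source, input_metadata):
        # The key must cover everything compilation depends on: here, the
        # source text of the region and the metadata of its inputs.
        blob = json.dumps({"src": source, "meta": input_metadata}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def lookup(self, source, input_metadata):
        path = os.path.join(self.cache_dir, self._key(source, input_metadata))
        if os.path.exists(path):
            with open(path) as f:
                return f.read()  # previously dumped "compiled artifact"
        return None

    def store(self, source, input_metadata, artifact):
        path = os.path.join(self.cache_dir, self._key(source, input_metadata))
        with open(path, "w") as f:
            f.write(artifact)

# Warm run: "compile" the region and dump its artifact.
cache = ToyCompileCache(tempfile.mkdtemp())
meta = {"shape": [8, 16], "dtype": "float32"}  # hypothetical tensor metadata
assert cache.lookup("def f(x): return x * 2", meta) is None  # cold start: miss
cache.store("def f(x): return x * 2", meta, "compiled-kernel-bytes")
# Later cold start: the artifact is reused instead of recompiled.
assert cache.lookup("def f(x): return x * 2", meta) == "compiled-kernel-bytes"
```

The same picture explains the caching gaps above: many small graphs mean many such lookups, which is why batching them into one request at startup helps.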
**ABI stable PyTorch extensions.** It's hard work being a third-party PyTorch extension with native code, because whenever there's a new release of Python or PyTorch you have to rebuild all of your wheels. If there were a limited ABI that you could build your extension against that didn't expose CPython and only relied on a small, stable ABI of PyTorch functions, your binary packaging situation would be much simpler!
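The CPython half of this story already exists today: extensions can build against Python's limited API and ship a single `abi3` wheel that works across Python releases. The small, stable PyTorch ABI is the aspirational half. A minimal build-script sketch of the CPython side, with illustrative names (`my_ext`, the source path), might look like:

```python
from setuptools import Extension, setup

setup(
    name="my_ext",  # hypothetical extension package
    ext_modules=[
        Extension(
            "my_ext._C",
            sources=["csrc/my_ext.cpp"],
            # Restrict the extension to CPython's limited API, so one wheel
            # (tagged abi3) works on Python 3.9+ without per-release rebuilds.
            define_macros=[("Py_LIMITED_API", "0x03090000")],
            py_limited_api=True,
        )
    ],
    options={"bdist_wheel": {"py_limited_api": "cp39"}},
)
```

Today the PyTorch C++ API has no such stability guarantee, which is exactly the gap this resolution is about: with an analogous small, stable set of PyTorch functions, the same single-wheel story would extend to PyTorch releases too.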

Source URL

https://blog.ezyang.com/2025/01/new-years-resolutions-for-pytorch-in-2025/
