Triton

2 pains · avg 6.5/10

performance 2

torch.compile does not support true pre-compilation without running the Python program

Users on expensive clusters want to pre-compile models to avoid paying compilation costs at runtime, but torch.compile requires actually executing the Python program to discover compilable regions, making straightforward ahead-of-time compilation impossible. This is compounded by graph breaks and unknown input metadata.

performance · PyTorch · torch.compile · Triton
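
To make the point above concrete, here is a minimal sketch of how compilation is deferred until the wrapped function is actually called with real inputs. The backend `counting_backend` and the function `f` are hypothetical names introduced only for illustration; the backend skips real code generation and just records when it is invoked. Exact behavior may differ across PyTorch versions.

```python
import torch

# Records one entry per compilation request (illustrative helper, not a PyTorch API).
compile_calls = []

def counting_backend(gm, example_inputs):
    # A custom torch.compile backend receives the captured FX graph and example inputs.
    compile_calls.append(len(example_inputs))
    return gm.forward  # run the captured graph as-is, without optimization

def f(x):
    return torch.sin(x) + 1

compiled = torch.compile(f, backend=counting_backend)
calls_before = len(compile_calls)        # wrapping alone compiles nothing

compiled(torch.randn(4))                 # first call: tracing and compilation happen here
calls_after_first = len(compile_calls)

compiled(torch.randn(4))                 # same input metadata: the compiled graph is reused
calls_after_second = len(compile_calls)
```

Because the backend only runs once the program reaches the first call, a pre-compilation step has to execute the model with representative inputs rather than just inspect its source.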

torch.compile caching is slow and incomplete, causing long warm-up times

Multiple gaps in PyTorch's compilation caching pipeline — including slow Triton cache artifact loading, excessive small network requests for remote caches with many small graphs, and an incomplete AOTAutograd cache rollout — collectively add significant overhead even on warm-cache runs.

performancePyTorchtorch.compileTriton
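
For context, the caches mentioned above are commonly controlled through environment variables. The sketch below sets a few of them before PyTorch is imported. The helper name `configure_compile_caches` and the directory layout are assumptions made for illustration, and which variables are honored (or already enabled by default) depends on the installed PyTorch and Triton versions. It does not address the loading and network overheads described in this item.

```python
import os
import tempfile

def configure_compile_caches(cache_root):
    """Point compile caches at cache_root; call this before importing torch."""
    settings = {
        # Directory for Inductor's on-disk cache artifacts.
        "TORCHINDUCTOR_CACHE_DIR": os.path.join(cache_root, "inductor"),
        # Directory where Triton stores compiled kernels.
        "TRITON_CACHE_DIR": os.path.join(cache_root, "triton"),
        # Opt in to the local FX graph cache and the AOTAutograd cache.
        "TORCHINDUCTOR_FX_GRAPH_CACHE": "1",
        "TORCHINDUCTOR_AUTOGRAD_CACHE": "1",
    }
    for name, value in settings.items():
        os.environ[name] = value
    return settings

# Hypothetical location; a shared or persistent path would be used on a real cluster.
cache_root = os.path.join(tempfile.gettempdir(), "compile_cache_demo")
applied = configure_compile_caches(cache_root)
```

Keeping both caches under one root makes it easier to persist or ship the directory between runs, though a warm cache still pays the artifact-loading cost noted above.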