dev-discuss.pytorch.org
Meta PyTorch Team 2025 H1 Roadmaps
Excerpt
While `cudaMallocManaged` offers convenient automatic memory management, I'd strongly advise against using it everywhere. When GPU memory is saturated, UVM has to perform costly double transfers, evicting pages to CPU before bringing in new ones, which effectively halves your memory bandwidth. For DL workloads that fit in GPU memory (which is most cases), explicit placement consistently outperforms UVM, since there are no page faults and access patterns remain predictable.

… you mean in 2025? …

I think, in addition to this roadmap, for the distributed section, if the PyTorch team could regularly benchmark TP, PP, CP, etc., against a large cluster setup (which is usually not available to mere mortals), it would help the community a lot. Also, lately, converting a torch distributed checkpoint to an HF checkpoint has become extremely painful. NVIDIA has apparently decided not to contribute to that for the sake of their NeMo framework. It would be really beneficial for the community if there were starter code and/or an implementation for converting distributed checkpoints to Transformers HF checkpoints. Huge props for this, …

> In the developer infra doc, O[3] mentions PEP 759, which has been withdrawn.

Yes, that is unfortunate, but it was deemed not the best way forward.
Related Pain Points
Converting PyTorch distributed checkpoints to Hugging Face format is extremely painful
There is no official or well-supported path for converting PyTorch distributed training checkpoints to Hugging Face Transformers-compatible checkpoints. NVIDIA has deprioritized this in favor of their NeMo framework, leaving the community without reliable tooling for this common workflow.
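In the absence of official tooling, a common stopgap is plain state-dict key remapping: first consolidate the sharded checkpoint into a single `torch.save` file (recent PyTorch versions provide `torch.distributed.checkpoint.format_utils.dcp_to_torch_save` for that step), then strip the wrapper prefixes that DDP, FSDP, and `torch.compile` prepend so keys line up with Transformers naming. Below is a minimal, hypothetical sketch of the key-remapping step only; the prefix list is an assumption for illustration, not an official mapping, and real conversions may also need per-architecture renames.

```python
# NOTE: this prefix list is an assumption for illustration,
# not an official or exhaustive mapping.
WRAPPER_PREFIXES = ("module.", "_orig_mod.", "_fsdp_wrapped_module.")


def to_hf_keys(state_dict):
    """Return a copy of state_dict with wrapper prefixes stripped from
    each key, so names match what Transformers' from_pretrained expects."""
    out = {}
    for key, value in state_dict.items():
        stripped = True
        while stripped:  # prefixes can nest, e.g. "module._orig_mod."
            stripped = False
            for prefix in WRAPPER_PREFIXES:
                if key.startswith(prefix):
                    key = key[len(prefix):]
                    stripped = True
        out[key] = value
    return out
```

With the keys normalized, the resulting dict can be loaded into a freshly instantiated Transformers model and written out with `save_pretrained`; anything beyond prefix stripping (fused vs. split QKV weights, tied embeddings) still has to be handled per architecture.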
CUDA Unified Virtual Memory (UVM) causes severe performance degradation when GPU memory is saturated
Using cudaMallocManaged (UVM) in PyTorch workloads leads to costly double-transfer overhead when GPU memory is full — pages are evicted to CPU and re-fetched, effectively halving memory bandwidth. Explicit memory placement consistently outperforms UVM for typical deep learning workloads.
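The bandwidth-halving claim follows from a simple cost model: under oversubscription, every page fetched onto the GPU forces an eviction back over the same CPU–GPU link, so the link moves two pages per page of forward progress. A back-of-envelope sketch (the numbers below are illustrative, not measured):

```python
def effective_gbps(link_gbps: float, oversubscribed: bool) -> float:
    """Effective useful bandwidth of the CPU<->GPU link under UVM.

    When GPU memory is saturated, each page brought in must first evict
    a victim page to host memory, so the link carries two transfers for
    every one page of useful data.
    """
    return link_gbps / 2 if oversubscribed else link_gbps


def transfer_seconds(working_set_gb: float, link_gbps: float,
                     oversubscribed: bool) -> float:
    """Time to stream a working set across the link once."""
    return working_set_gb / effective_gbps(link_gbps, oversubscribed)
```

For example, streaming a 64 GB working set over a hypothetical 32 GB/s link takes 2.0 s when it fits in GPU memory but 4.0 s once eviction traffic kicks in — the same 2x penalty the post describes, before even counting page-fault latency.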