dev-discuss.pytorch.org

Meta PyTorch Team 2025 H1 Roadmaps

2/19/2025 · Updated 3/7/2026

Excerpt

While `cudaMallocManaged` offers convenient automatic memory management, I’d strongly advise against using it everywhere. When GPU memory is saturated, UVM has to perform costly double transfers, evicting pages to the CPU before bringing in new ones, which effectively halves your memory bandwidth. For DL workloads that fit in GPU memory (which is most cases), explicit placement consistently outperforms UVM: there are no page faults, and access patterns remain predictable.

… you mean in 2025?

… I think, in addition to this roadmap, for the distributed section, it would help the community a lot if the PyTorch team could regularly benchmark TP, PP, CP, etc. against a large cluster setup (which is usually not available to mere mortals). Also, lately, converting a torch distributed checkpoint to an HF checkpoint has become extremely painful. NVIDIA has apparently decided not to contribute to that, for the sake of their NeMo framework. It would be really beneficial for the community if there were starter code and/or an implementation for converting distributed checkpoints to Transformers HF checkpoints. Huge props for this, …

> In the developer infra doc, O[3] mentions PEP 759, which has been withdrawn here.

Yes, that is unfortunate, but it was deemed not the best way forward.
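The "double transfer" argument about UVM oversubscription can be made concrete with a back-of-the-envelope cost model. This is a minimal sketch, not a benchmark: it only assumes that a faulting page under memory pressure costs two link transfers (evict a resident page to the CPU, then fetch the new one) instead of one.

```python
def effective_bandwidth(link_bw_gbs: float, oversubscribed: bool) -> float:
    """Simplified cost model for UVM page migration.

    When GPU memory is oversubscribed, every page brought in requires an
    eviction transfer plus a fetch transfer over the same link, so the
    useful bandwidth for new data is roughly halved.
    """
    transfers_per_useful_page = 2 if oversubscribed else 1
    return link_bw_gbs / transfers_per_useful_page


# With explicit placement (no faults), the full link bandwidth is available;
# under UVM oversubscription, only about half of it moves new data.
print(effective_bandwidth(100.0, oversubscribed=False))  # 100.0
print(effective_bandwidth(100.0, oversubscribed=True))   # 50.0
```

This is, of course, an idealized model; real UVM behavior also depends on prefetching hints (`cudaMemPrefetchAsync`) and access patterns, but it captures why workloads that fit entirely in GPU memory are better served by explicit placement.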
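On the checkpoint-conversion request: the core of any distributed-to-HF conversion is (a) reassembling tensor shards split across ranks and (b) renaming keys into the Transformers naming scheme. The sketch below is purely illustrative starter code under assumed names; the key mapping and sharding layout are hypothetical, not the actual DCP or HF formats, and real tensors would use `torch` rather than nested lists.

```python
# Hypothetical sketch of converting sharded checkpoint weights to an
# HF-style state dict. The key mapping below is an assumption for
# illustration only, not a real DCP or Transformers schema.
DCP_TO_HF_KEY = {
    "decoder.layers.0.attn.q_proj": "model.layers.0.self_attn.q_proj.weight",
}


def merge_tp_shards(shards: list[list[list[float]]]) -> list[list[float]]:
    """Concatenate per-rank shards along dim 0 (row-parallel layout assumed)."""
    merged: list[list[float]] = []
    for shard in shards:
        merged.extend(shard)
    return merged


def convert_to_hf(dcp_shards: dict[str, list]) -> dict[str, list]:
    """Rename keys and reassemble shards into a single state dict."""
    return {DCP_TO_HF_KEY[key]: merge_tp_shards(shards)
            for key, shards in dcp_shards.items()}


# Two ranks each hold half the rows of one (4 x 2) weight matrix.
sharded = {"decoder.layers.0.attn.q_proj": [
    [[1.0, 2.0], [3.0, 4.0]],   # rank 0
    [[5.0, 6.0], [7.0, 8.0]],   # rank 1
]}
hf_state = convert_to_hf(sharded)
```

A real implementation would additionally handle column-parallel layers (concatenation along dim 1), fused QKV projections, and streaming via `safetensors` to avoid materializing the full model in host memory.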

Source URL

https://dev-discuss.pytorch.org/t/meta-pytorch-team-2025-h1-roadmaps/2794

Related Pain Points