CUDA

11 pains, avg score 6.7/10
Categories: performance 4, architecture 2, compatibility 2, dependency 1, config 1, docs 1

Power Delivery and Cooling Infrastructure Insufficient for Production Workloads

Pain score: 9/10

GPU infrastructure planned around 6-8 kW per node can draw 10-12 kW once higher TDP profiles are enabled in production, forcing renegotiation of power and cooling contracts and a redesign of rack topology.

Tags: architecture, CUDA, NCCL
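The gap above is easy to sanity-check with back-of-the-envelope arithmetic. A minimal sketch, using the per-node figures from this pain point; the node count and the rack power budget are hypothetical illustration values, not from the source:

```python
# Rack power check: planned 6-8 kW/node vs. observed 10-12 kW/node.
# NODES_PER_RACK and PDU_LIMIT_KW are assumed values for illustration.
NODES_PER_RACK = 4
PDU_LIMIT_KW = 35.0  # hypothetical rack power budget

planned_kw = NODES_PER_RACK * 8    # worst case of the planned 6-8 kW range
actual_kw = NODES_PER_RACK * 12    # worst case with higher TDP profiles on

print(f"planned draw: {planned_kw} kW (fits budget: {planned_kw <= PDU_LIMIT_KW})")
print(f"actual draw:  {actual_kw} kW (fits budget: {actual_kw <= PDU_LIMIT_KW})")
```

The same rack that fits comfortably at planning-time TDP blows past the budget once production profiles are enabled, which is why the fix ends up being physical (power, cooling, topology) rather than software-side.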

PyTorch hardware-specific backend bugs cause failures across MPS, CUDA, and ONNX

Pain score: 8/10

Multiple hardware-specific issues affect PyTorch across different backends: LayerNorm/BatchNorm fail to compile on Apple M4 MPS, Conv2d is slower on macOS without MKLDNN, CUDA CI tests exhibit memory corruption (SIGIOT), and ONNX exports with dynamic inputs regressed between versions. These issues require constant per-platform debugging.

Tags: compatibility, PyTorch, CUDA, ONNX, +1
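Per-platform debugging usually starts with deciding which backend is even usable. A minimal device-selection sketch (not from the source) that degrades to CPU when a backend is missing or broken; PyTorch is imported lazily so the helper is safe to call on machines without it installed:

```python
# Pick the best available PyTorch backend, falling back to CPU.
def pick_device() -> str:
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed at all
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```

Centralizing the choice in one helper also gives you a single place to blacklist a backend (e.g. force CPU for an op known to miscompile on MPS) instead of scattering platform checks through the codebase.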

Version Mismatch Across GPU Software Stack Components

Pain score: 8/10

CUDA, driver, NCCL, container runtime, and Kubernetes device plugin version conflicts cause cluster flakiness when not strictly pinned, with uncontrolled upgrades introducing silent failures.

Tags: dependency, CUDA, NCCL, Kubernetes
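In practice "strictly pinned" means every layer of the stack is named in one manifest and upgraded together. A sketch of such a manifest; every version below is a placeholder, not a validated compatibility matrix:

```
# Illustrative pin manifest -- placeholders, not a tested combination.
# The point is that each layer is pinned explicitly and upgraded as a set.
nvidia-driver:            <pinned driver version>
cuda-toolkit:             <pinned CUDA version>
nccl:                     <pinned NCCL version>
nvidia-container-toolkit: <pinned runtime version>
k8s-device-plugin:        <pinned plugin image digest>
```

Pinning the device-plugin image by digest rather than tag is what closes the "uncontrolled upgrade" hole: a re-pulled mutable tag can silently change the plugin underneath a running cluster.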

Inference Latency SLOs Conflict with Training Throughput Optimization

Pain score: 7/10

Optimizing GPU systems solely for training throughput ignores inference latency requirements; when p99 latency targets (e.g., 300ms) are introduced, existing optimization strategies become inadequate.

Tags: performance, CUDA
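The conflict is visible in the metrics themselves: mean latency (which throughput tuning optimizes) and p99 latency (which the SLO constrains) can diverge wildly. A small stdlib sketch with synthetic latencies; the 300 ms target comes from the example above:

```python
# Why a p99 SLO is a different objective from mean throughput.
import statistics

latencies_ms = [40] * 97 + [280, 310, 900]  # mostly fast, a few slow tails
mean = statistics.mean(latencies_ms)
p99 = statistics.quantiles(latencies_ms, n=100)[98]  # 99th percentile

print(f"mean {mean:.0f} ms, p99 {p99:.0f} ms")
```

A system can look healthy on mean latency while badly violating a 300 ms p99 target, so training-style optimizations that fatten the tail (large batches, aggressive queueing) work against the inference SLO.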

CUDA version alignment for PyTorch GPU setup is error-prone for newcomers

Pain score: 7/10

Developers must manually align PyTorch, CUDA toolkit, and Python versions to enable GPU acceleration. Mismatches produce cryptic errors like 'Torch not compiled with CUDA enabled,' and newcomers unfamiliar with CUDA can spend significant time debugging installation issues.

Tags: config, PyTorch, CUDA
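A quick diagnostic helper shortens that triage considerably: report the interpreter version, the installed torch build, and whether that build was compiled with CUDA at all. A minimal sketch (not from the source), safe to run even without PyTorch installed:

```python
# Triage for "Torch not compiled with CUDA enabled"-style errors.
import sys

def cuda_env_report() -> dict:
    info = {"python": sys.version.split()[0]}
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch not installed at all
        return info
    info["torch"] = torch.__version__
    info["built_with_cuda"] = torch.version.cuda  # None => CPU-only wheel
    info["cuda_available"] = torch.cuda.is_available()
    return info

print(cuda_env_report())
```

`built_with_cuda` being `None` distinguishes the common newcomer failure (a CPU-only wheel was installed) from a driver or toolkit mismatch, where the wheel is CUDA-enabled but `cuda_available` is still `False`.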

CUDA Unified Virtual Memory (UVM) causes severe performance degradation when GPU memory is saturated

Pain score: 7/10

Using cudaMallocManaged (UVM) in PyTorch workloads leads to costly double-transfer overhead when GPU memory is full — pages are evicted to CPU and re-fetched, effectively halving memory bandwidth. Explicit memory placement consistently outperforms UVM for typical deep learning workloads.

Tags: performance, PyTorch, CUDA
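A toy cost model makes the "effectively halving" claim concrete: once GPU memory is oversubscribed, a managed page may be evicted to the host and fetched back, so each useful byte crosses the link twice. The bandwidth figure is an assumed round number (roughly PCIe 4.0 x16), not a measurement:

```python
# Double-transfer cost model for UVM under memory pressure.
def transfer_time_s(bytes_moved: float, link_bps: float = 16e9,
                    oversubscribed: bool = False) -> float:
    # Oversubscription: evict to host, then fetch back = 2 link crossings.
    crossings = 2 if oversubscribed else 1
    return crossings * bytes_moved / link_bps

gib8 = 8 * 2**30
print(f"explicit placement: {transfer_time_s(gib8):.2f} s")
print(f"UVM, memory full:   {transfer_time_s(gib8, oversubscribed=True):.2f} s")
```

This is why explicit placement (sizing working sets to fit, staging transfers yourself) wins for typical deep learning workloads: it pays the one-way cost once instead of letting the pager pay it twice.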

Developers Lack Understanding of Data Transfer Costs in GPU Computing

Pain score: 6/10

Programmers underestimate PCIe and memory bandwidth costs for moving data between CPU and GPU, leading to poor algorithm designs that don't account for transfer overhead, particularly for smaller workloads.

Tags: docs, CUDA
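The missing mental model is a break-even calculation: offloading only pays when the GPU's compute savings exceed the transfer cost. A sketch with assumed round-number rates (not benchmarks) that shows why small workloads lose:

```python
# Break-even model: transfer cost vs. compute savings.
def worth_offloading(n_bytes: float, flops: float,
                     pcie_bps: float = 16e9,     # assumed link bandwidth
                     cpu_flops_s: float = 2e11,  # assumed CPU throughput
                     gpu_flops_s: float = 2e13   # assumed GPU throughput
                     ) -> bool:
    transfer_s = n_bytes / pcie_bps
    return transfer_s + flops / gpu_flops_s < flops / cpu_flops_s

# A light kernel over a big buffer is dominated by the transfer:
print(worth_offloading(n_bytes=1e9, flops=1e9))   # transfer swamps savings
# Heavy compute over the same buffer amortizes the transfer:
print(worth_offloading(n_bytes=1e9, flops=1e13))
```

The ratio that matters is arithmetic intensity (FLOPs per byte moved): below the break-even intensity, a "faster" GPU algorithm is slower end to end.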

GPU Memory Hogging and Allocation Issues

Pain score: 6/10

By default, TensorFlow allocates nearly all available GPU memory at startup, which blocks other processes from using the same device and limits flexibility in local development, where developers often want to split a GPU across several tasks.

Tags: performance, TensorFlow, GPU, CUDA
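TensorFlow's documented escape hatch is enabling memory growth per GPU, so memory is claimed incrementally instead of all at once. A minimal sketch (lazy import, so it is a harmless no-op on machines without TensorFlow or without a visible GPU):

```python
# Opt TensorFlow out of its grab-all-GPU-memory default.
def enable_gpu_memory_growth():
    try:
        import tensorflow as tf
    except ImportError:
        return []  # TensorFlow not installed; nothing to configure
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must be set before the first op initializes the GPUs.
        tf.config.experimental.set_memory_growth(gpu, True)
    return gpus

print(enable_gpu_memory_growth())
```

Note the ordering constraint: memory growth must be configured before any op touches the device, which is why this belongs at the very top of a script rather than next to the model code.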

GPU Acceleration Not Seamless in Java for AI Workloads

Pain score: 6/10

GPU acceleration support in Java requires extra setup and tuning compared to Python, and forcing GPU allocation per application instance (even when idle) creates scaling and maintenance challenges with higher infrastructure costs and lower resource efficiency.

Tags: performance, Java, GPU, CUDA

Sequential and Fine-Grained Branching Algorithms Inefficient on GPUs

Pain score: 5/10

GPU programming is poorly suited to problems with sequential dependent steps and to algorithms with heavy fine-grained branching (many if-statements): divergent threads within a warp serialize, so such code underutilizes the hardware despite the GPU's raw throughput.

Tags: architecture, CUDA
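The standard mitigation for fine-grained branching is predication: compute both sides and select, so every thread executes the same instructions. A sketch in plain Python standing in for a kernel body (the transform itself is an illustration, not from the source):

```python
# Branchy kernel body: each element takes a different path, so a GPU warp
# would serialize both sides of the branch.
def branchy(xs):
    return [x * 2.0 if x > 0 else x * 0.5 for x in xs]

# Predicated equivalent: select(cond, a, b) = cond*a + (1-cond)*b.
# Both sides are computed for every element; there is no branch.
def predicated(xs):
    return [(x > 0) * (x * 2.0) + (x <= 0) * (x * 0.5) for x in xs]

print(branchy([3.0, -4.0]))     # [6.0, -2.0]
print(predicated([3.0, -4.0]))  # same result, branch-free
```

Predication trades extra arithmetic (both sides are always evaluated) for uniform control flow, which is usually a win on SIMT hardware; sequential dependent-step problems have no such rewrite and are better left on the CPU.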

Limited GPU Support (NVIDIA/Python Only)

Pain score: 5/10

TensorFlow's GPU programming path supports only NVIDIA GPUs and only Python, with no first-class support for other accelerators, limiting cross-platform development flexibility.

Tags: compatibility, TensorFlow, GPU, NVIDIA, +2