stackoverflow.blog
You're probably underutilizing your GPUs
Excerpt
With numerous complex challenges at hand, we need to develop better tools, which is what I've been dedicated to, alongside a community that has been pushing for progress in this space. ...

**Ryan Donovan:** Since the rise of AI, many people are grappling with finding the right tools, and there's a surge in hardware scaling. It's common to hear concerns about acquiring enough GPUs to train models and run inference on incoming data. You mentioned that this isn't merely a GPU supply issue but rather an efficiency challenge. Can you elaborate? …

**Jared Quincy Davis:** There are several reasons, both economic and technical. The systems we've developed over decades to optimize CPU sharing don't translate as effectively to GPUs. For instance, many GPU workloads today involve large language models. When we say "large," we often mean the model requires more GPU memory than even a top-tier server can accommodate. …

This necessitates thoughtful scheduling, and potentially rethinking pricing models, to enhance efficiency in the GPU context. Otherwise, poor allocation decisions can lead to significant underutilization. Many companies have opted to sidestep this complex issue by selling large blocks of single-tenant capacity to major clients, effectively shifting the burden of utilization back onto them, contradicting the cloud's original value proposition of abstracting away infrastructure complexity. …

**Jared Quincy Davis:** Exactly! It's often not just a single GPU but a network of connected nodes. Additionally, standard virtualization tools don't easily adapt to GPUs, so significant work is required to optimize performance. This friction contributes to the inefficiency we see. On reflection, I think we take the cloud's evolution for granted, overlooking the significant decisions made by individuals that shaped its development. …

**Jared Quincy Davis:** This is a complex topic.
Even before a fundamental redesign, we need to accurately express the topology of nodes, including CPU-GPU and CPU-NIC affinities. Proper initialization of memory and PCI connections is crucial. Much work is required to adapt existing technologies for performance in the GPU context. Many large cloud providers have chosen to stick to bare-metal, single-tenant models, forgoing the multi-tenant approach, which diminishes the democratizing aspect of cloud services. …

**Jared Quincy Davis:** Yes, indeed. While GPU workloads require consideration of storage and data gravity, that isn't the predominant factor, as GPUs are typically constrained by memory bandwidth. So it's very much a scheduling challenge. Our concept of the Omni Cloud recognizes that today's sophisticated users often operate in multi-cloud environments. The high capital expenditure associated with GPUs necessitates efficient resource routing to maximize economic returns. Users want the flexibility to run workloads in various clouds, mixing spot and reserved resources for optimal cost efficiency.

**Ryan Donovan:** We recently spoke with Arm, the chip designer, who mentioned moving some GPU workloads to CPUs in resource-constrained settings. ...

**Jared Quincy Davis:** Not typically. Most of our clients have GPU-intensive workloads, and shifting those to CPUs generally proves inefficient. Instead, we often see the reverse: clients migrating CPU workloads to GPU-native architectures, which significantly improves efficiency in terms of time, power, and cost.
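To make the topology point concrete, here is a minimal sketch of the kind of bookkeeping involved. Everything in it is hypothetical: the `TOPOLOGY` table, the core ranges, and the helper names are made up for illustration. On a real node you would derive this information from tools like `nvidia-smi topo -m` or hwloc rather than hard-coding it.

```python
# Illustrative sketch only: choosing NUMA-local CPU cores and a NUMA-local
# NIC for each GPU, from a hand-written (hypothetical) topology table.

# Hypothetical 8-GPU node: each GPU and NIC hangs off one of two NUMA nodes.
TOPOLOGY = {
    "gpu0": {"numa": 0}, "gpu1": {"numa": 0},
    "gpu2": {"numa": 0}, "gpu3": {"numa": 0},
    "gpu4": {"numa": 1}, "gpu5": {"numa": 1},
    "gpu6": {"numa": 1}, "gpu7": {"numa": 1},
    "nic0": {"numa": 0}, "nic1": {"numa": 1},
}
CPUS_PER_NUMA = {0: range(0, 32), 1: range(32, 64)}  # made-up core layout

def local_cpus(device: str) -> list[int]:
    """CPU cores on the same NUMA node as the given GPU or NIC."""
    return list(CPUS_PER_NUMA[TOPOLOGY[device]["numa"]])

def local_nic(gpu: str) -> str:
    """NIC sharing a NUMA node with the GPU, avoiding cross-socket traffic."""
    numa = TOPOLOGY[gpu]["numa"]
    return next(d for d in TOPOLOGY
                if d.startswith("nic") and TOPOLOGY[d]["numa"] == numa)

# A data-loader process feeding gpu5 should be pinned to cores 32-63
# and should move data over nic1, not nic0.
```

Getting this mapping wrong does not crash anything, which is part of the problem: a loader pinned to the far socket simply pays extra PCI and interconnect latency, and the GPU quietly runs underutilized.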
Related Pain Points
Standard virtualization tools inadequate for GPU performance optimization
Existing virtualization technologies don't easily adapt to GPU workloads, requiring significant custom work to optimize performance. Multi-tenant GPU virtualization friction contributes to widespread GPU underutilization, pushing cloud providers to abandon multi-tenancy for bare-metal, single-tenant models.
Scalability challenges with multi-GPU setups
Enterprise architects report difficulties scaling Hugging Face models across multiple GPUs, limiting the platform's applicability for large-scale production deployments.
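The "model bigger than one GPU" situation Davis describes can be sketched as a placement problem. The following toy is not Hugging Face's or anyone's actual sharding logic; the layer sizes, GPU count, and memory budget are made-up numbers, and the function name `place_layers` is invented for illustration.

```python
# Toy sketch: greedily assign model layers to GPUs so that no device
# exceeds its memory budget. All figures below are hypothetical.

def place_layers(layer_gib: list[float], gpus: int, gib_per_gpu: float) -> list[int]:
    """Return the GPU index chosen for each layer, filling devices in order.

    Raises ValueError when the model cannot fit, which is exactly the case
    where you need more devices (or a smarter sharding strategy), not a
    bigger single server.
    """
    placement, device, used = [], 0, 0.0
    for size in layer_gib:
        if used + size > gib_per_gpu:       # current GPU is full: spill over
            device, used = device + 1, 0.0
        if device >= gpus or size > gib_per_gpu:
            raise ValueError("model does not fit on this node")
        placement.append(device)
        used += size
    return placement

# A hypothetical 160 GiB model (80 layers x 2 GiB) on 8 x 24 GiB GPUs:
layers = [2.0] * 80
print(place_layers(layers, gpus=8, gib_per_gpu=24.0))
```

Even this toy shows the utilization trap: with 12 layers per 24 GiB device, the 80-layer model occupies seven GPUs and leaves the last one of the eight idle, capacity that a multi-tenant scheduler could hand to another workload but a single-tenant allocation simply strands.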