Gokul Chandra
Contributor

How I doubled my GPU efficiency without buying a single new card

opinion
Apr 23, 2026 | 7 mins

Stop overpaying for idle GPUs by splitting your LLM workload into prompt and generation pools. It’s like giving your AI its own dedicated fast and slow lanes.

[Image: black and white photo of a highway with fast and slow lanes. Credit: Phil Hearing]

Late last year I got pulled into a capacity planning exercise for a global retailer that had wired a 70B model into their product search and recommendation pipeline. Every search query triggered an inference call. During holiday traffic their cluster was burning through GPU-hours at a rate that made their cloud finance team physically uncomfortable. They had already scaled from 24 to 48 H100s and latency was still spiking during peak hours. I was brought in to answer a simple question: Do we need 96 GPUs for the January sale, or is something else going on?

I started where I always start with these engagements: profiling. I instrumented the serving layer and broke the utilization data down by inference phase. What came back changed how I think about GPU infrastructure.

During prompt processing — the phase where the model reads the entire user input in parallel — the H100s were running at 92% compute utilization. Tensor cores fully saturated. Exactly what you want to see on a $30K GPU. But that phase lasted about 200 milliseconds per request. The next phase, token generation, ran for 3 to 9 seconds. During that stretch the same GPUs dropped to 30% utilization. The compute cores sat idle while the memory bus worked flat out reading the attention cache.

We were paying H100-hour rates for peak compute capability and getting peak performance for roughly 5% of every request’s wall time. The other 95% was a memory bandwidth problem wearing a compute-priced GPU.
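The arithmetic behind that 5% figure is worth making explicit. A minimal sketch, using the per-request phase numbers above (a 200 ms prefill and a decode window at the short end of the 3 to 9 second range):

```python
# Hypothetical per-request phase profile, mirroring the numbers above:
# a 200 ms prefill at 92% utilization, then ~3.8 s of decode at 30%.
PREFILL_SEC, PREFILL_UTIL = 0.2, 0.92   # dense, compute-bound phase
DECODE_SEC, DECODE_UTIL = 3.8, 0.30     # sequential, memory-bound phase

wall = PREFILL_SEC + DECODE_SEC
peak_fraction = PREFILL_SEC / wall       # share of wall time at peak compute

print(f"peak compute for {peak_fraction:.0%} of each request")  # -> 5%
```

Stretch the decode window toward 9 seconds and the peak-compute share drops toward 2%; the 5% figure is the optimistic end of the range.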

The pattern hiding in plain sight

Once I saw it, I couldn’t unsee it. LLM inference is two workloads pretending to be one. Prompt processing (the industry calls it prefill) is a dense matrix multiplication that lights up every core on the chip. Token generation (decode) is a sequential memory read that touches a fraction of the compute. They alternate on the same hardware inside the same scheduling loop. I’ve worked on carrier-scale Kubernetes clusters and high-throughput data pipelines, and I’ve never seen a workload profile this bimodal running on hardware this expensive.
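A roofline-style back-of-the-envelope calculation shows why decode can't keep the compute cores busy no matter how well it's scheduled. The sketch below uses approximate public H100 specs (~990 TFLOPS dense FP16, ~3.35 TB/s HBM3) and ignores KV-cache reads; the numbers are assumptions for illustration:

```python
# Roofline check for a 70B-parameter model served in FP16.
PARAMS = 70e9
BYTES_PER_PARAM = 2       # FP16 weights
FLOPS_PEAK = 990e12       # approx H100 dense FP16 throughput
BW_PEAK = 3.35e12         # approx H100 HBM3 bandwidth, bytes/sec

# Decode generates one token per forward pass: ~2 FLOPs per parameter,
# but every weight byte must stream in from HBM for that single token.
decode_intensity = (2 * PARAMS) / (PARAMS * BYTES_PER_PARAM)  # FLOPs/byte

# The "ridge point": arithmetic intensity needed to become compute-bound.
ridge = FLOPS_PEAK / BW_PEAK   # ~295 FLOPs/byte

print(decode_intensity, ridge)
```

Decode sits at roughly 1 FLOP per byte against a ridge near 295, so it is memory-bound by two orders of magnitude. Prefill processes the whole prompt in one pass, reusing each weight across every prompt token, which multiplies its arithmetic intensity by the prompt length and pushes it past the ridge into compute-bound territory.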

If you ran a database this way — provisioning for peak write throughput and then using the server 90% of the time for reads — you’d split it into a write primary and read replicas without a second thought. But most teams serving LLMs haven’t made that connection yet.

The monitoring tools make it worse. Every inference dashboard I looked at reported a single “GPU utilization” number: The average of both phases blended together. Our cluster showed 55%. Looks fine. Nobody panics at 55%. But 55% was the average of 92% for a few hundred milliseconds and 30% for several seconds. The dashboards were hiding a bimodal distribution behind a single number.
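A toy time-weighted blend (numbers hypothetical) shows how completely a single average erases the two modes — where the blended figure lands depends on the phase mix and the sampling method, not on how well either phase is actually running:

```python
# Toy utilization trace: 200 ms of prefill at 92%, then 6 s of decode at 30%.
# A single blended number collapses two very different workloads into one.
phases = [(0.2, 0.92), (6.0, 0.30)]  # (seconds, utilization), hypothetical

total = sum(d for d, _ in phases)
blended = sum(d * u for d, u in phases) / total

print(f"blended: {blended:.0%}")  # -> blended: 32%
```

Whether the dashboard reports 32% or 55% depends on batching and sampling windows; either way, neither phase is anywhere near the number shown.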

Researchers at UC San Diego’s Hao AI Lab published a paper called DistServe at OSDI 2024 that laid out the problem with numbers I could have pulled from my own profiling. Their measurements on H100s showed the same pattern: Prefill at 90–95% utilization, decode at 20–40%. They also proposed the fix.

Splitting the work in two

The fix is called disaggregated inference. Instead of running both phases on the same GPU pool you stand up two pools: One tuned for compute throughput (prompt processing) and one tuned for memory bandwidth (token generation). A routing layer in front sends each request to the right pool at the right time and the attention cache transfers between them over a fast network link.
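The routing layer's job is simple in outline. Here's a minimal sketch of the request path — all names, the least-loaded selection policy, and the function signatures are my assumptions for illustration, not any particular engine's API:

```python
# Sketch of a phase-aware router for disaggregated inference.
# Prefill runs on the compute-tuned pool; the resulting attention (KV)
# cache hops to the bandwidth-tuned pool, which streams tokens back.
from dataclasses import dataclass

@dataclass
class Pools:
    prefill: list   # workers tuned for compute throughput
    decode: list    # workers tuned for memory bandwidth

def serve(prompt, pools, prefill_fn, transfer_fn, decode_fn):
    """Route one request through both pools; returns generated tokens."""
    pf = min(pools.prefill, key=lambda w: w.queue_depth)   # least-loaded prefill worker
    kv_cache = prefill_fn(pf, prompt)                      # dense, parallel pass
    dec = min(pools.decode, key=lambda w: w.queue_depth)   # least-loaded decode worker
    transfer_fn(kv_cache, dec)                             # cache hop (e.g. over RDMA)
    return decode_fn(dec, kv_cache)                        # sequential generation
```

In production the transfer step is where the engineering lives: the cache for a long prompt can run to gigabytes, which is why the serious deployments move it over RDMA or NVLink rather than the ordinary datacenter network.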

When I first proposed this to the customer, they were skeptical. Two pools mean more operational complexity. A cache transfer protocol adds a network dependency that monolithic serving doesn’t have. Fair objections. So, I pointed them at who’s already running it.

Perplexity built their entire production serving stack on disaggregated inference using RDMA for cache transfers. Meta runs it. LinkedIn runs it. Mistral runs it. By early 2026 NVIDIA shipped an orchestration framework called Dynamo that treats prefill and decode as first-class pool types. The open-source engines — vLLM and SGLang — both added native disaggregated serving modes. Red Hat and IBM Research open-sourced a Kubernetes-native implementation called llm-d that maps the architecture onto standard cluster management workflows.

This isn’t a research prototype waiting for someone brave enough to try it. It’s the default architecture at the companies serving more LLM traffic than anyone else on the planet.

What changed when we split the pools

We ran a two-week proof of concept. I split the cluster into two pools: Eight GPUs dedicated to prompt processing and the remaining 40 handling token generation. No new hardware, no new cluster — just a configuration change in the serving layer and a routing policy that sent each request to the right pool based on its inference phase. The prompt-processing pool hit 90–95% compute utilization consistently because that’s all it did. No token generation competing for scheduling slots. No decode requests sitting idle while a prefill burst hogged the cores.

The token-generation pool was the bigger surprise. By batching hundreds of concurrent decode requests together the memory reads got amortized across more work. Bandwidth utilization climbed above 70% — far better than the 30% we’d been seeing when decode requests were interleaved with prefill on the same GPU. Overall compute efficiency roughly doubled.
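The amortization effect is easy to see on paper. In a bandwidth-limited regime, one sweep of the weights from HBM can serve every request in the decode batch, so throughput scales roughly linearly with batch size until compute or cache capacity intervenes. A toy model (FP16 70B weights, approximate H100 bandwidth, KV-cache reads ignored):

```python
# Why batching decode helps: one streaming pass over the weights serves
# every request in the batch, so bytes-per-token falls with batch size.
WEIGHT_BYTES = 140e9    # 70B params in FP16
BW = 3.35e12            # bytes/sec, approx H100 HBM3

def tokens_per_sec(batch):
    sweep_time = WEIGHT_BYTES / BW   # seconds to stream the weights once
    return batch / sweep_time        # one token per request per sweep

print(tokens_per_sec(1), tokens_per_sec(256))  # ~24 vs ~6,100 tokens/sec
```

Real engines don't scale this cleanly — KV-cache reads grow with batch size and sequence length — but the direction of the effect is exactly what we measured: dedicating a pool to decode and batching it aggressively is what lifted bandwidth utilization past 70%.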

The cost math followed. The customer was spending about $2M annually on inference GPU-hours. After disaggregation they were on track to cut that by $600–800K while serving the same request volume at the same latency targets. No new hardware purchased. Same GPUs, same cluster, same model weights — different architecture.

The latency story was just as good. In the monolithic setup every time a new prompt arrived its processing burst would stall active token-generation requests. Users watching streaming responses would see the text pause mid-sentence while someone else’s prompt got processed. After the split: Steady token cadence with no prefill-induced stalls. P99 inter-token latency flattened out completely.

There are workloads where this doesn’t pay off. Short prompts under 512 tokens with short outputs don’t generate enough cache to justify a network transfer. Multi-turn conversations where 80%+ of the cache already lives on the decode worker from a previous turn are better served locally. And if you have fewer than a dozen GPUs the scheduling overhead of two pools can eat into whatever you save on utilization. But the teams complaining about GPU shortages and GPU bills are not running 4-GPU deployments with 512-token prompts. They’re running dozens to hundreds of GPUs at enterprise scale where the utilization waste adds up to millions per year.
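Those exceptions can be encoded as a routing heuristic. A minimal sketch — the thresholds are the ones mentioned above, but treat them and the function shape as assumptions to tune per deployment:

```python
# Hypothetical routing heuristic: keep short prompts and cache-warm
# multi-turn requests local; disaggregate the rest.
def route(prompt_tokens: int, cache_hit_ratio: float,
          min_prompt: int = 512, warm_cache_ratio: float = 0.8) -> str:
    if prompt_tokens < min_prompt:
        return "local"           # too little cache to justify a transfer
    if cache_hit_ratio >= warm_cache_ratio:
        return "local"           # cache already lives on the decode worker
    return "disaggregated"       # split prefill/decode across pools
```

The useful property of making this an explicit policy is that the monolithic path stays available: requests that wouldn't benefit from the split never pay its network cost.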

The industry spends a lot of energy on the GPU supply side: Build more fabs, design better chips, negotiate bigger cloud contracts. Those things matter. But I keep coming back to what I saw in that profiling data. If the teams running monolithic LLM inference today switched to disaggregated serving the effective GPU supply would roughly double overnight. No new silicon required. The tools are ready. The proof points are in production. The only thing missing is the profiling step that makes the waste visible.

If you haven’t broken your inference utilization down by phase yet, do it this week. Add per-phase instrumentation to your serving layer. Plot prefill utilization and decode utilization separately over a 24-hour window. If the two lines look like they belong on different charts — and they will — you have your answer. You’ll stop paying for compute you’re not using.
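One way to bolt that instrumentation on is a pair of phase-scoped timers around the prefill and decode steps of your serving loop. A sketch — the names are hypothetical, and `read_util` stands in for whatever GPU metrics source you already have (DCGM, nvidia-smi polling, your engine's own counters):

```python
# Per-phase instrumentation sketch: wrap prefill and decode separately
# and record (duration, utilization) samples per phase.
import time
from collections import defaultdict

PHASE_SAMPLES = defaultdict(list)   # phase name -> [(duration_s, util), ...]

class phase_timer:
    def __init__(self, phase, read_util):
        self.phase, self.read_util = phase, read_util
    def __enter__(self):
        self.t0 = time.perf_counter()
        return self
    def __exit__(self, *exc):
        PHASE_SAMPLES[self.phase].append(
            (time.perf_counter() - self.t0, self.read_util()))

def mean_util(phase):
    samples = PHASE_SAMPLES[phase]
    return sum(u for _, u in samples) / len(samples)

# Usage inside the serving loop:
#   with phase_timer("prefill", read_util=get_gpu_util): run_prefill(req)
#   with phase_timer("decode",  read_util=get_gpu_util): run_decode(req)
```

Export the two series to whatever dashboard you already use — the point is simply that prefill and decode never get averaged into one line.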

This article is published as part of the Foundry Expert Contributor Network.

Gokul Chandra

Gokul Chandra is a principal solutions architect with over 12 years of experience architecting production-scale cloud platforms for global operators and Fortune 500 enterprises. His work spans hybrid and edge cloud deployments, sovereign cloud architectures and large-scale migration and modernization programs across industry verticals. Gokul builds the cloud-native foundations enterprises rely on to run mission-critical workloads at scale, with deep focus on edge inference, distributed AI infrastructure and MLOps and LLMOps platforms that turn model experiments into production systems. He writes and speaks about the architectural decisions that separate prototypes from systems that survive production traffic.