Everyone’s obsessing over FLOPs. Benchmarks, leaderboards, token throughput. But here’s the dirty secret nobody in AI infrastructure wants to admit: the memory wall is the real bottleneck, and we’ve been pretending it doesn’t exist. While GPU suppliers print money selling GPUs with ever-fatter HBM stacks, a quiet revolution is happening in how we think about memory hierarchy—and it’s about to reshape the entire inference stack.
The proof is in a simple comparison. Two terminals, same model, same question, same hardware. In the first, with 100 tokens of context, the model answers in half a second. In the second, with 100,000 tokens of context, it takes 12 seconds. That is 24 times slower for the same question. The intuitive guess is “more context just means more math.” But that doesn’t add up: the context grew 1,000×, while the latency went up only 24×. Pure compute isn’t the bottleneck. Your next guess might be “the model is thinking harder.” It can’t: the weights are frozen at inference time. The real answer is something almost nobody talks about, even though it shapes every modern AI system you use. It’s not compute. It’s memory bandwidth. This is the memory wall, a phrase coined in 1995, decades before transformers existed.
The Memory Wall, Quantified
Let’s ground this with numbers. Every token in a prompt has two associated vectors: a key (K) and a value (V). Each is a vector of about 4,000 numbers in half precision — roughly 8 KB per vector. Two vectors per token, two bytes per number: that’s 16 KB per token per layer. A modern model has around 96 layers. So one token takes about 1.5 megabytes in KV cache. Multiply by 100,000 tokens (context window), and you get 150 gigabytes of data for a single prompt sitting in GPU memory right now. Here’s where it gets uncomfortable: a modern H100 GPU has 80 gigabytes of high-bandwidth memory. The cache is almost twice as big as the slot it’s trying to fit into. In practice, every production system does some combination of three things: shards the KV cache across multiple GPUs and pays the network cost; compresses it and sacrifices some accuracy; or simply eats the latency. Most do all three.
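The arithmetic above is easy to check in a few lines. This is a minimal sketch that assumes plain multi-head attention with a hidden size of 4,096 (the "about 4,000 numbers" above); the function name and defaults are mine, not from any library.

```python
def kv_cache_bytes(tokens, layers=96, hidden=4096, bytes_per_value=2):
    """Bytes of KV cache for one sequence: K and V, every layer, every token."""
    per_token_per_layer = 2 * hidden * bytes_per_value  # one K vector + one V vector
    return tokens * layers * per_token_per_layer


H100_HBM_BYTES = 80e9  # 80 GB, the figure quoted in the text

per_token = kv_cache_bytes(1)
prompt = kv_cache_bytes(100_000)

print(f"per token:   {per_token / 2**20:.2f} MiB")
print(f"100k tokens: {prompt / 1e9:.0f} GB")
print(f"vs one H100: {prompt / H100_HBM_BYTES:.2f}x its HBM")
```

With 4,096 instead of a round 4,000 the total comes out near 157 GB rather than 150 GB, which does not change the conclusion: the cache is roughly twice the size of the memory it needs to live in.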
But it gets worse. Even when the KV cache fits on one GPU, why is it still slow? Because attention isn’t bottlenecked by compute. It’s bottlenecked by bandwidth. Every time the model generates a new token, it has to read every prior token’s key and value from memory. That’s not arithmetic. That’s just moving bytes. An H100 does roughly 989 teraflops of dense half-precision compute, but its memory bandwidth is only 3.35 terabytes per second. Compute keeps doubling. Bandwidth grows far more slowly. This is the gap that has been widening for 30 years. Wulf and McKee saw it coming in 1995 and called it the memory wall. 30 years later, transformers ran straight into it.
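To make the bandwidth argument concrete, here is a rough roofline-style sketch using the two H100 numbers quoted above. The 40 GB cache size is a made-up example, and the model ignores weights, caching effects, and batching; it only shows the floor that byte movement puts under every generated token.

```python
HBM_BANDWIDTH = 3.35e12  # bytes per second, H100 figure from the text
PEAK_FLOPS = 989e12      # FLOP per second, H100 figure from the text


def kv_read_seconds(kv_bytes, bandwidth=HBM_BANDWIDTH):
    """Lower bound on time per generated token: stream the whole KV cache once."""
    return kv_bytes / bandwidth


def break_even_intensity(flops=PEAK_FLOPS, bandwidth=HBM_BANDWIDTH):
    """FLOPs the chip can perform in the time it takes to load one byte."""
    return flops / bandwidth


example_cache = 40e9  # hypothetical 40 GB KV cache
per_token = kv_read_seconds(example_cache)

print(f"KV read per token: {per_token * 1e3:.1f} ms")
print(f"ceiling: {1 / per_token:.0f} tokens/s")
print(f"break-even: {break_even_intensity():.0f} FLOP per byte")
```

Decode attention does on the order of one multiply-add per cached value it loads, so it sits hundreds of times below the break-even point. The arithmetic units mostly wait.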
And then it gets worse still, because the total cost of processing a prompt grows quadratically. Every token has to compare itself against every prior token. 10 tokens, 100 comparisons. 100 tokens, 10,000. 100,000 tokens, 10 billion. Double the context, four times the cost. This is why a million-token context window was a big deal, and why it took years to get there.
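A quick way to see the scaling, assuming causal attention where each token only looks at itself and the tokens before it. That halves the n² count used above but leaves the shape of the curve unchanged.

```python
def prefill_comparisons(n_tokens):
    """Query-key comparisons for one causal pass: token i attends to tokens 0..i."""
    return n_tokens * (n_tokens + 1) // 2


for n in (10, 100, 100_000):
    print(n, prefill_comparisons(n))

# Doubling the context roughly quadruples the work.
ratio = prefill_comparisons(200_000) / prefill_comparisons(100_000)
print(f"2x context -> {ratio:.3f}x comparisons")
```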
Four Tricks the Field Uses to Fight Back
The field has been fighting the memory wall with four techniques that long-context models stack on top of each other.
- Flash Attention tiles the computation so the full attention matrix is never written to HBM; K and V stream through the GPU’s fast on-chip SRAM in blocks.
- Sliding Window Attention only attends to the last N tokens — linear cost instead of quadratic.
- KV Quantization stores keys and values at 4 bits instead of 16, cutting memory traffic by a factor of four.
- Multi-Query Attention shares a single K and V across all attention heads, shrinking the cache by a factor equal to the number of heads (close to 100× for a model with around 100 heads) with little quality loss.
Flash, sliding, quantize, share. Modern long-context models stack as many of these as they can. And still, long context lives or dies on bandwidth.
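Three of these tricks change how many bytes the cache holds, and the effect compounds. The sketch below extends the earlier back-of-envelope math; the 32 heads of 128 dimensions (matching a 4,096 hidden size) and the 4,096-token window are assumptions for illustration. Flash Attention is absent on purpose: it cuts traffic for intermediate results, not the size of the cache itself.

```python
def kv_bytes(tokens, layers=96, heads=32, head_dim=128,
             kv_heads=None, bits=16, window=None):
    """KV cache size in bytes under the chosen combination of tricks."""
    kv_heads = heads if kv_heads is None else kv_heads              # multi-query: kv_heads=1
    cached_tokens = tokens if window is None else min(tokens, window)  # sliding window
    values_per_token = 2 * kv_heads * head_dim                      # K and V
    return cached_tokens * layers * values_per_token * bits // 8    # quantization via bits


TOKENS = 100_000
scenarios = {
    "baseline":       kv_bytes(TOKENS),
    "multi-query":    kv_bytes(TOKENS, kv_heads=1),
    "4-bit KV":       kv_bytes(TOKENS, bits=4),
    "sliding window": kv_bytes(TOKENS, window=4096),
    "all three":      kv_bytes(TOKENS, kv_heads=1, bits=4, window=4096),
}

for name, size in scenarios.items():
    print(f"{name:15s} {size / 1e9:10.3f} GB")
```

Note that with 32 heads multi-query attention saves 32×, not 100×. The saving always equals the head count.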
Memory Basics
Let’s start with the basics. SRAM and DRAM are not the same beast. SRAM needs six transistors per bit, doesn’t require refresh cycles, and sits right next to the compute. It’s fast, expensive, and power-hungry, and it transforms the performance of everything it feeds. DRAM (HBM is just DRAM stacked all fancy) uses one transistor and one capacitor per bit. Cheaper, denser, but with the latency penalty of being off-chip.
The hierarchy is straightforward: registers and L1/L2/L3 caches are SRAM. Then you have on-chip scratchpads, also SRAM. Then HBM (stacked DRAM), then system DDR, then storage. Each step away from the compute adds latency. In AI, latency is the enemy.
In layman’s terms, the 16/32/64 GB of RAM in your Mac is DRAM. SRAM is inside every CPU and GPU as cache, measured in kilobytes or megabytes, which is why it seldom comes up in consumer tech spec discussions.
How the Vendors Play the Memory Game
Every AI accelerator vendor has made a different bet on this tradeoff. And their choices reveal what they think the future looks like.
Cerebras went all-in on SRAM. Their WSE-3 puts ~44 GB of on-chip SRAM across an entire wafer: 46,225 square millimeters of silicon. That’s ~21 PB/s of on-chip memory bandwidth feeding ~900,000 AI cores. No HBM on the wafer. Models that fit in 44 GB run entirely from fast SRAM; anything larger (44 GB holds about 22 billion parameters at 16 bits) has to be split across multiple wafers or streamed from external memory. The catch? ~25 kW per system and a price tag that makes CFOs weep, ~$100 million for a cluster of 45 nodes. But for certain inference workloads, the latency story is unmatched.
Groq took a similar path with a different execution. Their LPU puts 230-500 MB of on-chip SRAM per chip as primary weight storage. Not cache, primary storage. We’re talking 80+ TB/s internal bandwidth versus the 3-8 TB/s you get from HBM on GPUs. Their deterministic streaming architecture treats inference like an assembly line. The limitation is capacity: large models need sharding across multiple LPUs. But for the models that fit, the latency is absurd.
AMD plays it traditional. MI300X: 192 GB HBM3 at 5.3 TB/s. MI350X: up to 288 GB HBM3E at 8 TB/s. They have Infinity Cache (on-chip SRAM) but it’s HBM-dominant. This is the safe, scalable bet. Works great for training clusters. The Infinity Fabric interconnect shines for multi-node setups.
Google TPU uses HBM as main memory—96 GB+ per chip, scalable to 192 GB. Their VMEM gives tens of MB per TensorCore as software-managed scratchpad SRAM. The XLA compiler handles the tedious data movement between HBM and SRAM. SparseCores connect directly to HBM for embeddings. It’s a hybrid approach that reflects Google’s scale.
Intel Gaudi 3 splits the difference: 128 GB HBM2e at 3.7 TB/s plus 96 MB on-die SRAM at 12.8 TB/s. They emphasize unified memory space with Xeon CPUs and Ethernet-heavy scaling. It’s a pragmatic architecture for enterprises already in the Intel ecosystem.
The pattern is clear. SRAM is transformative when you can afford it. HBM is the workhorse when you need capacity. Hybrids win for general-purpose infrastructure because the real world demands both.
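One way to compare these bets is to ask how fast each memory could stream a model’s weights, since single-stream decoding reads every weight once per token. The sketch below uses the bandwidth figures quoted above and a hypothetical 8-billion-parameter model at 16 bits. It deliberately ignores capacity, sharding, interconnect, and compute, so treat the output as a ceiling, not a benchmark.

```python
# Bandwidth figures as quoted in the text, in bytes per second.
BANDWIDTH = {
    "H100 HBM":            3.35e12,
    "Gaudi 3 HBM2e":       3.7e12,
    "MI300X HBM3":         5.3e12,
    "Gaudi 3 on-die SRAM": 12.8e12,
    "Groq LPU SRAM":       80e12,
    "Cerebras WSE-3 SRAM": 21e15,
}


def max_decode_tokens_per_s(bandwidth, weight_bytes):
    """Bandwidth ceiling for batch-size-1 decoding: one full weight read per token."""
    return bandwidth / weight_bytes


weight_bytes = 8e9 * 2  # hypothetical 8B-parameter model, 2 bytes per weight

for name, bw in BANDWIDTH.items():
    print(f"{name:22s} {max_decode_tokens_per_s(bw, weight_bytes):>12,.0f} tokens/s ceiling")
```

The SRAM rows win by orders of magnitude on paper. The same table says nothing about whether 16 GB of weights fit in a few hundred megabytes of SRAM, which is exactly the capacity tradeoff described above.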
The ongoing mad rush in the memory space is focused on HBM, where SK Hynix, Samsung and Micron are the leading suppliers and are building new fabs to serve the demand.
The Real Battleground: Tokens Per Watt
Here’s where it gets interesting. The scarcity of HBM and the brutal cost of SRAM are forcing innovation in unexpected places. NVIDIA knows this—they’ve been beating the “tokens per watt” drum for a reason. The B200 adds native FP4 support specifically to push more concurrent requests through the same power envelope.
Schneider Electric puts it bluntly: tokens per watt measures how much work a system produces per watt consumed. Strictly speaking that is tokens per second per watt, which is the same thing as tokens per joule. Once you normalize for joules, energy-optimal hardware routinely outperforms brute-force approaches. This isn’t just about being green. It’s about economics. Power is often the constraint in data centers. More tokens per watt means more revenue per rack.
Price per token follows the same logic. If you can compress your memory footprint, you can serve more requests from the same hardware. The scarce resource isn’t compute cycles—it’s memory bandwidth and capacity.
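Both metrics reduce to simple ratios. Here is a minimal sketch; the throughput, power draw, and electricity price are invented inputs, and the cost covers energy only, not hardware or cooling.

```python
def tokens_per_joule(tokens_per_second, watts):
    """Tokens per second per watt is the same quantity as tokens per joule."""
    return tokens_per_second / watts


def energy_cost_per_million_tokens(tokens_per_second, watts, usd_per_kwh):
    """Electricity cost of generating one million tokens."""
    joules = 1_000_000 / tokens_per_joule(tokens_per_second, watts)
    kwh = joules / 3.6e6  # 1 kWh = 3.6 million joules
    return kwh * usd_per_kwh


# Two hypothetical systems in the same 700 W power envelope.
baseline = energy_cost_per_million_tokens(2_000, 700, usd_per_kwh=0.10)
compressed = energy_cost_per_million_tokens(6_000, 700, usd_per_kwh=0.10)

print(f"baseline:   ${baseline:.4f} per million tokens")
print(f"compressed: ${compressed:.4f} per million tokens")
```

Tripling throughput at fixed power cuts the energy cost per token to a third, which is the whole argument for squeezing more requests through the same envelope.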
Scarcity Breeds Innovation: The KV Cache Revolution
This is where the story gets good. The constraints on HBM supply and SRAM economics have sparked a wave of innovation in KV cache optimization. And it’s happening across the stack—from NVIDIA’s research labs to DeepSeek’s open-source infrastructure to Moonshot AI’s production systems.
NVIDIA KVTC (KV Cache Transform Coding) is the most intriguing. It shrinks KV caches by 20-40×, treating them like a media codec treats video. Think about that for a second. If your serving bottleneck is KV cache capacity (and for long-context, multi-turn systems, it absolutely is), this changes the economics entirely. A 40× compression ratio means you can either keep roughly 40× more concurrent sessions’ caches in the same memory or handle context windows that were previously impossible.
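The capacity math is worth writing down, because it is what drives the economics. In this sketch the 40 GB budget and 5 GB per-session cache are hypothetical, and it assumes the compressed form is what occupies the budget (for example, caches parked between turns), not that attention runs directly on compressed data.

```python
def sessions_that_fit(budget_bytes, kv_bytes_per_session, compression_ratio=1):
    """How many sessions' KV caches fit in a budget at a given compression ratio."""
    return (budget_bytes * compression_ratio) // kv_bytes_per_session


BUDGET = 40_000_000_000      # hypothetical 40 GB set aside for KV storage
PER_SESSION = 5_000_000_000  # hypothetical 5 GB uncompressed cache per session

for ratio in (1, 20, 40):
    print(f"{ratio:>2}x compression -> {sessions_that_fit(BUDGET, PER_SESSION, ratio)} sessions")
```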
DeepSeek FlashMLA attacks the problem from a different angle. Their optimized attention kernels for Multi-head Latent Attention (MLA) compress the K and V projections into a smaller joint latent vector, then expand via learned matrices. This powers DeepSeek V3 and V3.2-Exp. They use token-level sparse attention for both prefill and decoding with FP8 KV cache, plus paged KV cache with block size 64 for variable-length sequence batching.
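The compress-then-expand idea fits in a few lines. This toy sketch uses random matrices as stand-ins for the learned ones, a single head, and tiny made-up dimensions. It leaves out the separate positional-encoding path real MLA carries and is not DeepSeek’s kernel code; it only shows what gets cached and what gets recomputed.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_latent, d_head = 64, 8, 16  # toy sizes chosen for illustration

W_down = random_matrix(d_latent, d_model, rng)  # hidden state -> joint latent
W_up_k = random_matrix(d_head, d_latent, rng)   # latent -> key
W_up_v = random_matrix(d_head, d_latent, rng)   # latent -> value

hidden = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

latent = matvec(W_down, hidden)  # the only thing stored in the cache for this token
key = matvec(W_up_k, latent)     # rebuilt on the fly at attention time
value = matvec(W_up_v, latent)   # rebuilt on the fly at attention time

cached_plain = len(key) + len(value)  # what ordinary attention would store
cached_latent = len(latent)           # what the latent scheme stores
print(f"cache per token: {cached_plain} values -> {cached_latent} values")
```

The trade is explicit: a little extra math per token in exchange for far fewer bytes read from memory, which is the right trade on hardware that is starved for bandwidth.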
But DeepSeek didn’t stop there. Their disaggregated KV cache architecture achieves 40+ GiB/s peak throughput per client node for KV cache lookups, with 6.6 TiB/s aggregate read throughput across a 180-node cluster. They’re seeing over 95% KV cache hit rates in multi-round agentic workflows. The kicker? Loading efficiency has replaced computation as the dominant performance factor. We’re optimizing data movement now, not just matrix multiplies.
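Why hit rate matters so much becomes obvious once you price a miss. The sketch below uses the 40 GiB/s per-node figure quoted above; the 10 GiB cache size and 30-second recompute time are hypothetical.

```python
GIB = 2**30


def load_seconds(kv_bytes, gib_per_s=40):
    """Time to pull a stored KV cache over the storage network."""
    return kv_bytes / (gib_per_s * GIB)


def expected_prefix_seconds(hit_rate, load_s, recompute_s):
    """Average time to get a session's prefix ready: load on a hit, recompute on a miss."""
    return hit_rate * load_s + (1 - hit_rate) * recompute_s


load_s = load_seconds(10 * GIB)  # hypothetical 10 GiB cache for a long session
recompute_s = 30.0               # hypothetical cost of re-running prefill

for hit_rate in (0.0, 0.5, 0.95):
    avg = expected_prefix_seconds(hit_rate, load_s, recompute_s)
    print(f"hit rate {hit_rate:.0%}: {avg:.2f} s average")
```

At high hit rates the average is dominated by how fast you can load bytes, not how fast you can multiply matrices.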
Kimi K2.6 from Moonshot AI uses the same MLA approach but targets a specific problem: quality degradation on long-horizon coding tasks as the context grows. Their 256K context window on a 1T parameter MoE model (32B active) would be impossible without aggressive KV cache compression into a low-dimensional latent space. Meanwhile, vLLM now supports CPU offload or NVMe-backed KV storage for sessions that exceed HBM capacity, an admission that the memory hierarchy extends beyond the accelerator.
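The offload idea is a classic two-tier cache. Below is a minimal sketch of the pattern, not vLLM’s actual API: the class name, methods, and the count-based capacity are all hypothetical, with a small dictionary standing in for HBM and a second one standing in for CPU RAM or NVMe.

```python
from collections import OrderedDict


class TieredKVStore:
    """Toy two-tier KV store: a small fast tier with LRU eviction into a slow tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # stands in for HBM; most recently used last
        self.slow = {}             # stands in for CPU RAM or NVMe

    def put(self, session_id, kv_blob):
        self.slow.pop(session_id, None)  # a session lives in exactly one tier
        self.fast[session_id] = kv_blob
        self.fast.move_to_end(session_id)
        while len(self.fast) > self.fast_capacity:
            victim, blob = self.fast.popitem(last=False)  # least recently used
            self.slow[victim] = blob                      # spill instead of dropping

    def get(self, session_id):
        """Return (blob, tier) where tier is 'fast', 'slow', or 'miss'."""
        if session_id in self.fast:
            self.fast.move_to_end(session_id)
            return self.fast[session_id], "fast"
        if session_id in self.slow:
            blob = self.slow[session_id]
            self.put(session_id, blob)  # promote back into the fast tier
            return blob, "slow"
        return None, "miss"             # caller must recompute the prefill


store = TieredKVStore(fast_capacity=2)
for sid in ("a", "b", "c"):
    store.put(sid, f"kv-for-{sid}")

print(store.get("a"))  # spilled earlier, so it comes back from the slow tier
print(store.get("a"))  # now promoted
print(store.get("x"))  # never stored
```

A production system would size tiers in bytes, move data asynchronously, and overlap transfers with compute, but the control flow is the same: spill on pressure, promote on reuse, recompute only on a true miss.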
What’s Coming Next
The inference stack is about to get reshuffled. We’re moving from a world where FLOPs were the scarce resource to one where memory bandwidth and clever compression algorithms determine who wins.
Expect to see:
- More aggressive quantization of KV caches—FP8 is just the start
- Disaggregated serving architectures that separate prefill from decoding and cache KV across nodes
- Custom silicon optimized specifically for compressed attention patterns
- Software-managed memory hierarchies that treat HBM as a cache for compressed KV, not the source of truth
The vendors that figure out how to minimize data movement while maximizing cache hit rates will own the next generation of inference infrastructure. This is why Cerebras and Groq’s SRAM-heavy approaches aren’t just interesting experiments—they’re previews of where the industry is heading once HBM scarcity bites harder.
The Memory Hierarchy Isn’t Solved
Here’s the contrarian take: we’ve been treating memory as a solved problem. Stack more HBM, add some on-chip cache, call it a day. The KV cache innovations from NVIDIA, DeepSeek, and Moonshot prove that’s wrong. Memory hierarchy is the new frontier.
The scarce resources—HBM supply, SRAM economics, power budgets—are forcing exactly the kind of creative pressure that produces breakthroughs. We’re seeing 40× compression ratios, disaggregated architectures, and software-defined memory hierarchies because we have to. Not because it’s elegant, but because the alternative is hitting a wall.
Tokens per watt and price per token are the metrics that matter. Everything else is noise. The teams that internalize this and build their systems around memory efficiency—not just compute density—will define the next era of AI infrastructure.
The memory wall isn’t coming. It’s already here. And the companies treating it as an opportunity rather than a constraint are the ones worth watching.