All That Jazz in GPUs but a Quiet Hero Stirs
Attention mechanism enhancements, lightning-fast matrix computing, fluid task parallelism, and high-bandwidth memory—these strengths propelled us into the Large Language Model (LLM) era. Graphics processing units (GPUs) crushed the massive number-crunching needed for Transformers—those clever systems that pick out key words in sentences—making AI like ChatGPT, Claude, or Grok possible. GPUs juggle thousands of tasks at once, turning years of training into mere days. With that power, we now marvel at the AI revolution. GPUs deserve a standing ovation for all that jazz.
However, the winds are shifting in AI land—recent breakthroughs like DeepSeek show we don't always need a firehose of computing power—not in every scenario. There are smart ways to trim the fat off those hefty calculations, making AI computing cheaper and faster. What this revealed is that the CPU—Central Processing Unit—the quiet, steady heart of every computer and smartphone—is underestimated. The CPU, born over 50 years ago (with the Intel 4004), isn't a faded memory; it's a chapter yet to unfold. A golden age for the CPU is actually on the horizon.
The DeepSeek Surprise: A Full-Stack Feat of Efficiency Over Muscle
DeepSeek cracked open the door to a new world. Everyone thought top-tier language models needed a fortune in computing power—OpenAI’s GPT-4 supposedly burned through $100 million in GPUs. But DeepSeek built their masterpiece, DeepSeek R1, with 671 billion parameters for reportedly under $6 million, using just 2048 Nvidia H800 GPUs.
How’d they do it? Imagine you’re hosting a dinner party. Instead of hiring a full crew of chefs to cook everything, DeepSeek picked a few star cooks—a design called Mixture of Experts (MoE). Only about 5-10% of the model’s parameters work at a time, like having just two chefs whip up the meal while the rest nap. They also used simpler math—8-bit floating-point numbers (FP8) instead of 32-bit (FP32)—like swapping a fancy calculator for a pencil and paper. And their DualPipe trick kept GPUs busy, like a chef chopping veggies while the sauce simmers. The result? Less power, fewer GPUs, and a model that still talks as smoothly as the big shots.
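The "star cooks" routing idea can be sketched in a few lines. This is a minimal, illustrative toy, not DeepSeek's actual architecture: the gate weights, expert functions, and top-2 choice here are made-up stand-ins for the real learned networks.

```python
import math

def softmax(xs):
    # Turn raw gate scores into probabilities.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, gate_weights, top_k=2):
    """Score every expert for this token, but only run the top_k of them.

    token        : list of floats (a toy input vector)
    experts      : list of callables, each a stand-in for an expert network
    gate_weights : one score vector per expert (hypothetical learned weights)
    """
    scores = softmax([sum(w * t for w, t in zip(ws, token))
                      for ws in gate_weights])
    ranked = sorted(range(len(experts)),
                    key=lambda i: scores[i], reverse=True)[:top_k]
    norm = sum(scores[i] for i in ranked)
    # Weighted sum of the chosen experts' outputs; the rest stay idle,
    # which is where the compute savings come from.
    return sum(scores[i] / norm * experts[i](token) for i in ranked)
```

With, say, 3 experts and top_k=2, one expert per token never runs at all; scale that to hundreds of experts and a small top_k, and most of the model sleeps through every step.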
Here’s the kicker: DeepSeek’s efficiency means you don’t need a GPU army. Smaller versions of their model can even run on a CPU, churning out words at a decent pace. It’s proof that smart design can outshine raw computing power, setting the stage for CPUs to step up.
L-Mul: Crunching Numbers the Smart Way
DeepSeek isn’t just a one-trick pony—it’s a full-stack force, slashing computing power across the board and cutting the need for fancy GPUs. Meanwhile, there’s another method proving that clever math alone can save energy in computing. It’s called L-Mul, proposed by a Cambridge, MA-based research team—a perfect handoff from DeepSeek’s big-picture wins to a laser focus on number-crunching magic.
Back in 2017, the paper Attention is All You Need shook the tech world, introducing the Transformer—a genius idea that sparked today’s LLM revolution by teaching AI to focus on what matters in a sentence. Fast forward to late 2024, and another paper, Addition is All You Need for Energy-efficient Language Models, took it further, showing how crunching numbers differently can save the day. More specifically, instead of chugging through energy-consuming multiplication, simple addition and subtraction can slash the need for computing power even more.
The Cambridge, MA-based team called this trick L-Mul—Linear-complexity Multiplication. The paper drops a bombshell: we don’t always need the GPU’s favorite trick—matrix multiplication. In language models, like the Transformer magic, GPUs shine by multiplying giant number grids fast. Think of it as a blender pulverizing ingredients in seconds. But L-Mul says, “Hold on, let’s use a spoon instead.”
Picture this: you need to figure out 3 × 4. A GPU would zip to 12 with multiplication. L-Mul says, “Why not add 3 + 3 + 3 + 3?” (In practice, L-Mul doesn’t literally repeat additions; it approximates a floating-point multiplication with a few integer additions on the numbers’ bits, but the spirit is the same.) The savings are real: the paper notes that multiplying two 32-bit floating-point numbers costs roughly 37 times more energy than adding two 32-bit integers. In AI, those multiplications happen billions of times, especially in the Transformer’s “attention” trick, where it decides which words matter most in a sentence. L-Mul swaps most of those power-hungry multiplications for simple additions, cutting the energy of those operations by up to 80% while keeping the AI’s answers nearly spot-on.
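To see how a multiplication can become an addition, here is a sketch of the classic logarithmic-multiplication trick that L-Mul builds on (this is the well-known Mitchell-style approximation, not the paper's exact algorithm, which adds a small correction term). Because an IEEE 754 float stores its exponent and mantissa side by side, adding two floats' raw bit patterns roughly adds their logarithms, which multiplies the values:

```python
import struct

def float_bits(x: float) -> int:
    # Reinterpret a positive float32 as its raw 32-bit integer pattern.
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_float(b: int) -> float:
    # Reinterpret a 32-bit integer pattern back into a float32.
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

# The float32 exponent bias (127) shifted into the exponent field.
BIAS = 0x3F800000

def approx_mul(x: float, y: float) -> float:
    """Approximate x * y for positive floats with one integer addition.

    Adding the bit patterns adds the exponents exactly and the mantissas
    approximately (since log2(1+m) is close to m), so a single cheap
    integer add stands in for an expensive floating-point multiply.
    """
    return bits_float(float_bits(x) + float_bits(y) - BIAS)
```

Try it: `approx_mul(3.0, 4.0)` lands on 12.0 exactly, while `approx_mul(3.0, 5.0)` gives 14.0 instead of 15, a few percent off. That small error, traded for a big energy win, is the bargain L-Mul formalizes and tightens.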
Why does this matter? GPUs are built for multiplication marathons, but if AI can run simply on addition, CPUs—great at all-purpose tasks—can step in. No need for a blender when a spoon stirs just as well for less. L-Mul whispers a future where CPUs could take charge, loosening the GPU’s energy-guzzling grip.
WLRU: Rewriting Cache Rules
We’ve all heard of Moore’s Law—the idea that tech gets faster and smaller every couple of years, like magic doubling our computer power. But there’s a snag nobody talks about enough: the “memory wall.” CPUs can think fast, but they’re stuck waiting for data from slow memory, like a chef twiddling thumbs while the oven preheats.
To fix this, we’ve leaned on cache—a temporary storage area that holds frequently accessed data for quick grabs, bridging the gap between CPU and main memory (RAM). However, imagine you’re cooking a big meal and think keeping all your spices on the counter will save time—but with too many, you’re still fumbling to find the right one. Cache helps, but the memory wall keeps it from being a perfect fix—too often, it’s holding yesterday’s ingredients when you need today’s.
To address that, CPUs have long followed an old trick called the “principle of locality.” It’s like assuming you’ll need your trusty spatula tomorrow because you flipped pancakes with it today—keep it handy. CPUs stash recently used data in cache, betting it’ll be needed soon. Picture your kitchen: you just used salt for soup, so you keep it in a little basket nearby, figuring you’ll grab it again. But here’s the rub: this “temporal locality” idea is crumbling. Today’s internet and apps hop around data like kids on a playground, not neatly reusing the same stuff. Caches miss, and CPUs waste time fetching from slower memory. It’s like reaching for salt again, only to realize you’re now baking a cake and need sugar—your basket’s cluttered with the wrong stuff, and you’re stuck digging through the pantry.
Enter WLRU—Weighted Least Recently Used—a clever twist that tackles the principle of locality’s limits and ties our story together. Picture your kitchen: instead of tossing out the saffron you haven’t used in weeks, WLRU checks, “Wait, is it precious?” It gives data a “weight”—how valuable it is—not just “when’d I last use it.” Like keeping rare saffron over everyday salt, even if you grabbed the salt yesterday.
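The saffron-versus-salt idea can be made concrete with a toy cache. To be clear, the actual WLRU design is patented and its internals aren't spelled out here, so the scoring rule below (a simple blend of recency and a per-item weight, with a made-up `alpha` knob) is only an illustrative guess at the general principle:

```python
class WLRUCache:
    """Toy weighted-LRU cache: eviction considers an item's value ("weight"),
    not just how recently it was used. Illustrative sketch only."""

    def __init__(self, capacity: int, alpha: float = 0.5):
        self.capacity = capacity
        self.alpha = alpha      # 0.0 = pure LRU recency; 1.0 = pure weight
        self.items = {}         # key -> (value, weight, last_access_tick)
        self.tick = 0

    def put(self, key, value, weight: float = 1.0):
        self.tick += 1
        self.items[key] = (value, weight, self.tick)
        if len(self.items) > self.capacity:
            # Evict the entry with the lowest blended score: cheap AND stale
            # items go first; precious items survive even if untouched lately.
            victim = min(self.items, key=self._score)
            del self.items[victim]

    def get(self, key):
        if key not in self.items:
            return None
        value, weight, _ = self.items[key]
        self.tick += 1
        self.items[key] = (value, weight, self.tick)  # refresh recency
        return value

    def _score(self, key) -> float:
        _, weight, last = self.items[key]
        recency = last / self.tick  # more recent accesses score higher
        return (1 - self.alpha) * recency + self.alpha * weight
```

With `alpha=0` this degenerates into plain LRU, the salt-basket habit. Turn `alpha` up and the rarely used but high-weight saffron outlives yesterday's salt, which is the behavioral shift WLRU is after.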
This cracks the memory wall myth—Intel and friends piling on bigger caches like it’s a numbers game. More cache doesn’t tame chaotic data needs. WLRU’s smarter approach could let CPUs rethink memory, playing to their strength: flexibility. Unlike GPUs, built for brute force, CPUs can adapt, and WLRU shows they’ve got plenty of room to grow.
CPUs: Poised for a Golden Leap
If DeepSeek delivers a full-stack overhaul that trims computing needs, if L-Mul shows that clever arithmetic alone can slash energy use, and if WLRU tackles the memory mess by proving cache doesn’t need to be a guessing game, then together they point the way toward pushing CPU transformation even further.
GPUs sparked an AI revolution. It’s like inventing the car, speeding us into the AI era. But revolutions evolve. DeepSeek shows efficiency beats muscle, L-Mul proves we don’t need the GPU’s math monopoly, and WLRU hints CPUs can outsmart memory limits. GPUs are supercars—fast, loud, thirsty. CPUs? They’re the trusty sedans, ready for a makeover.
Our next computing chapter isn’t about piling on power—it’s about transformation. AI’s future craves energy savings and adaptability, where CPUs shine. Their golden age isn’t behind us; it’s coming of age. So next time you tap your iPhone or boot your laptop, remember: that little CPU inside might just be the hero of tomorrow’s tech tale.
Disclaimer: The authors hold a patent for WLRU.