Kernel Fusion on CPU: What llama.cpp's RMS_NORM + MUL Fusion Teaches Us About LLM Performance
llama.cpp's PR #22423 landed a kernel fusion for RMS_NORM + MUL in the ggml CPU backend a few weeks ago. The speedup: 1.60×. Consistently. Across dimension sizes, thread counts, even hardware variatio
Modern C++ // dev · Apr 21, 2026 · 7 min read
Anatomy of llama.cpp: How 105K Stars of C++ Runs LLMs on Your Laptop
I spent a week reading llama.cpp's source. Not the GitHub README, not the model card — the actual C that runs when you type `./llama-cli -m llama-7b-q4.gguf`. What I found is one of the better-enginee
Modern C++ // dev · Mar 13, 2026 · 13 min read