#AI/ML

2 articles

Kernel Fusion on CPU: What llama.cpp's RMS_NORM + MUL Fusion Teaches Us About LLM Performance

llama.cpp's PR #22423 landed a kernel fusion for RMS_NORM + MUL in the ggml CPU backend a few weeks ago. The speedup: 1.60×. Consistently. Across dimension sizes, thread counts, even hardware variatio…

Modern C++ // dev · Apr 21, 2026 · 7 min read

AI/ML · llama.cpp · Inference · Quantization

Anatomy of llama.cpp: How 105K Stars of C++ Runs LLMs on Your Laptop

I spent a week reading llama.cpp's source. Not the GitHub README, not the model card, but the actual C that runs when you type `./llama-cli -m llama-7b-q4.gguf`. What I found is one of the better-enginee…

Modern C++ // dev · Mar 13, 2026 · 13 min read