Your threads are doing independent work on independent data, and yet adding a second thread makes everything six times slower. Not twice as fast. Not the same speed. Six times slower. The profiler shows no lock contention, no syscall overhead, nothing obviously wrong. The data structures are correct. The algorithm is correct. The hardware is lying to you about where your data lives.
This is false sharing, and it hides in struct layouts and thread-local counters across more production codebases than anyone wants to admit.
What the Hardware Actually Does
A modern x86 CPU doesn’t read memory one byte at a time. It reads in cache lines: 64-byte chunks on every Intel and AMD processor shipped in the last two decades. When core 0 writes to byte 0 of a cache line, the entire 64-byte line gets marked as modified. If core 1 holds a copy of that line because it was reading byte 8, the coherence protocol (MESI, or a variant such as Intel’s MESIF or AMD’s MOESI) invalidates core 1’s copy, and core 1’s next access has to re-fetch the line.
The critical insight: the cores don’t know that bytes 0 and 8 are logically independent. They only see the line. Two threads writing to adjacent members of the same struct will bounce that cache line back and forth between cores on every write. This is false sharing — the data isn’t actually shared, but the hardware treats it as if it were.
Here’s what it looks like in code:
#include <atomic>
#include <cstdint>
struct SharedCounters {
    std::atomic<std::uint64_t> a{0}; // offset 0
    std::atomic<std::uint64_t> b{0}; // offset 8
};
static_assert(sizeof(SharedCounters) <= 64, "Both fit in one cache line");
Both atomics sit within the same 64-byte cache line. When two threads each
increment their own counter, every fetch_add triggers a cross-core
invalidation.
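If you want to confirm that two objects really do land on the same line, a small address check is enough. This is a minimal sketch, not part of the original benchmark: `kLineSize`, `line_of`, `same_cache_line`, and `g_shared` are names made up for illustration, and the 64-byte line size is an assumption that matches the x86 parts discussed here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed line size; 64 bytes matches the x86 processors discussed above.
constexpr std::size_t kLineSize = 64;

// Index of the cache line that contains the given address.
inline std::uintptr_t line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kLineSize;
}

// True when both addresses fall on the same cache line.
inline bool same_cache_line(const void* x, const void* y) {
    return line_of(x) == line_of(y);
}

struct SharedCounters {
    std::atomic<std::uint64_t> a{0};  // offset 0
    std::atomic<std::uint64_t> b{0};  // offset 8
};

// Aligning the instance to a line boundary makes the answer deterministic.
// Without it, a 16-byte object could in principle straddle two lines.
alignas(kLineSize) SharedCounters g_shared;
```

Calling `same_cache_line(&g_shared.a, &g_shared.b)` returns true for this layout, which is exactly the condition that lets the two writers interfere.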
The measured cost on an Intel i7-4790 at 3.60 GHz:
| Threads | Shared (ns/op) | Padded (ns/op) | Slowdown |
|---|---|---|---|
| 1 | 6.56 | 6.56 | 1.0x |
| 2 | 40.00 | 6.65 | 6.0x |
| 4 | 78.25 | 57.32 | 1.4x |
| 8 | 156.33 | 97.66 | 1.6x |
At two threads, the false-sharing version is six times slower than the padded version. At single-threaded, they’re identical — the cache line never bounces. The 4-thread and 8-thread numbers show the i7-4790’s 4 physical cores saturating: even the padded version slows down from core contention, but the false-sharing version is consistently worse.
That 6x gap at two threads is not a contrived worst case. It’s two atomic counters in a struct. This pattern shows up in connection managers, stats collectors, and lock-free queues across real codebases.
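The harness behind these numbers isn’t shown, so here is a rough sketch of how a two-thread measurement could be set up. Everything in it is an assumption: the `ns_per_op` helper is hypothetical, it uses relaxed `fetch_add`, and it includes thread start-up time in the measurement, so it will not reproduce the table exactly.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

struct SharedCounters {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

struct PaddedCounters {
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};

// Runs two threads, one per counter, and returns wall-clock nanoseconds
// divided by the number of increments each thread performed.
template <typename Counters>
double ns_per_op(Counters& c, std::uint64_t iters) {
    assert(iters > 0);
    auto work = [iters](std::atomic<std::uint64_t>& ctr) {
        for (std::uint64_t i = 0; i < iters; ++i) {
            ctr.fetch_add(1, std::memory_order_relaxed);
        }
    };
    const auto start = std::chrono::steady_clock::now();
    std::thread t1(work, std::ref(c.a));
    std::thread t2(work, std::ref(c.b));
    t1.join();
    t2.join();
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double, std::nano>(elapsed).count() /
           static_cast<double>(iters);
}
```

Run it once with a `SharedCounters` instance and once with a `PaddedCounters` instance, using a large iteration count so thread start-up cost is amortized, and compare the two results.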
Detection: perf c2c Finds It in Seconds
You don’t need to guess. Linux ships a tool that points directly at the
offending cache line: perf c2c.
perf c2c record -- ./your-binary
perf c2c report --stdio
Run against our false-sharing test binary with 8 threads hammering the shared
counters for 50 million iterations each, perf c2c produces this:
=================================================
Global Shared Cache Line Event Information
=================================================
Total Shared Cache Lines : 27
Load HITs on shared lines : 112055
Load Local HITM : 7171
=================================================
Shared Data Cache Line Table
=================================================
# ----------- Cacheline ---------- Tot
# Index Address Node PA cnt Hitm
0 0x405080 0 177441 99.61%
One cache line, 0x405080, accounts for 99.61% of all HITM (Hit-In-Modified)
events: 7,143 of the 7,171 recorded. HITM is the signal to look for. Each one
is a load that hit a line held modified in another core’s cache, and here every
one is a false-sharing penalty.
The pareto breakdown identifies the exact offending accesses:
0 0 7143 214344 0 0 0x405080
---------------------------------------------------------------
0.00% 49.10% ... 0x0 [.] writer_a atomic_base.h:631
0.00% 50.90% ... 0x8 [.] writer_b atomic_base.h:631
Offset 0x0 and offset 0x8 on the same cache line, split nearly 50/50
between writer_a and writer_b. That’s the two std::atomic members of
SharedCounters. The tool tells you the symbol name, the source file, and the
exact byte offset. On Intel VTune, the equivalent analysis is the “Memory
Access” viewpoint with “Contested Accesses” highlighted — same data, friendlier
GUI.
Fix Patterns
Three approaches, each with different tradeoffs.
1. Alignment Padding
The direct fix: force each counter onto its own cache line.
struct PaddedCounters {
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};
static_assert(sizeof(PaddedCounters) >= 128);
This wastes 56 bytes per counter (64-byte line minus 8-byte atomic). For a handful of hot counters, that’s nothing. For an array of ten thousand objects, you’ve just blown 560 KB on padding. Whether that matters depends on whether those objects are hot enough to false-share in the first place.
The performance result at two threads: 6.65 ns/op versus 40.00 ns/op for the unpadded layout. The padding eliminates the problem entirely.
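The layout and the memory cost can both be checked at compile time. The sketch below assumes an 8-byte `std::atomic<std::uint64_t>` (true on x86_64) and introduces a hypothetical helper, `padding_bytes`, to redo the padding arithmetic.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct PaddedCounters {
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};

// Each counter starts its own 64-byte line.
static_assert(alignof(PaddedCounters) == 64);
static_assert(sizeof(PaddedCounters) == 128);
static_assert(offsetof(PaddedCounters, a) == 0);
static_assert(offsetof(PaddedCounters, b) == 64);

// Bytes spent on padding when `count` objects of `payload` bytes are each
// rounded up to a multiple of `line` bytes.
constexpr std::size_t padding_bytes(std::size_t payload, std::size_t line,
                                    std::size_t count) {
    const std::size_t padded = (payload + line - 1) / line * line;
    return (padded - payload) * count;
}

static_assert(padding_bytes(8, 64, 1) == 56);            // per counter
static_assert(padding_bytes(8, 64, 10'000) == 560'000);  // about 560 KB
```

If a later refactor adds a field or changes the alignment, these assertions fail at compile time instead of showing up as a performance regression.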
2. Hot/Cold Field Restructuring
When a struct mixes frequently-written fields with read-only configuration, the writes invalidate cache lines that readers need. The fix: group the hot fields together on one cache line, cold fields on another.
struct GroupedData {
    // Hot partition: writers touch this line
    alignas(64) std::atomic<std::uint64_t> counter{0};
    std::atomic<std::uint64_t> counter2{0};
    // Cold partition: readers touch this line
    alignas(64) std::uint64_t config_a{42};
    std::uint64_t config_b{99};
};
Measured at two threads (one writer, one reader):
| Layout | 2 Threads (ns/op) | 8 Threads (ns/op) |
|---|---|---|
| Interleaved | 32.90 | 134.03 |
| Grouped | 7.35 | 82.65 |
| Speedup | 4.5x | 1.6x |
The 4.5x improvement at two threads comes purely from keeping the reader’s cache line stable. No algorithmic change, no new data structure — just field ordering.
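The interleaved layout from the table isn’t shown, so the version below is a guess at what it might look like, with hot and cold fields alternating. The `InterleavedData` struct and the `line_index` helper are assumptions for illustration. The point is only that `offsetof` can show which fields share a line in each layout.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLine = 64;  // assumed cache line size

// Hypothetical interleaved layout: hot and cold fields alternate,
// so all four end up on the same 64-byte line.
struct InterleavedData {
    alignas(kLine) std::atomic<std::uint64_t> counter{0};  // hot
    std::uint64_t config_a{42};                            // cold
    std::atomic<std::uint64_t> counter2{0};                // hot
    std::uint64_t config_b{99};                            // cold
};

// Grouped layout: hot fields on one line, cold fields on the next.
struct GroupedData {
    alignas(kLine) std::atomic<std::uint64_t> counter{0};
    std::atomic<std::uint64_t> counter2{0};
    alignas(kLine) std::uint64_t config_a{42};
    std::uint64_t config_b{99};
};

// Line index of a member, relative to the (line-aligned) start of its struct.
constexpr std::size_t line_index(std::size_t offset) { return offset / kLine; }

// Interleaved: the writer's counter and the reader's config share a line.
static_assert(line_index(offsetof(InterleavedData, counter)) ==
              line_index(offsetof(InterleavedData, config_a)));
// Grouped: hot and cold fields sit on different lines.
static_assert(line_index(offsetof(GroupedData, counter)) !=
              line_index(offsetof(GroupedData, config_a)));
static_assert(line_index(offsetof(GroupedData, config_a)) ==
              line_index(offsetof(GroupedData, config_b)));
```

Note that the grouped struct occupies two lines (128 bytes) while the interleaved one fits in a single line, so the fix trades one extra line for a stable reader.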
3. Thread-Local Accumulation
For stats counters, the best fix is often structural: don’t share at all. Each thread accumulates into its own counter and the values merge on read. This eliminates coherence traffic entirely at the cost of slightly stale reads.
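// Each thread owns a counter in its own cache line.
// The value is atomic so the reader's concurrent load is not a data race.
struct alignas(64) ThreadLocalCounter {
    std::atomic<std::uint64_t> value{0};
};
std::array<ThreadLocalCounter, MAX_THREADS> per_thread_counts;
// Writer: only this thread writes its slot, so the line never bounces
per_thread_counts[my_thread_id].value.fetch_add(1, std::memory_order_relaxed);
// Reader: sum all thread slots (may miss in-flight increments, usually fine)
auto total = std::accumulate(per_thread_counts.begin(),
                             per_thread_counts.end(), std::uint64_t{0},
                             [](std::uint64_t sum, const ThreadLocalCounter& c) {
                                 return sum + c.value.load(std::memory_order_relaxed);
                             });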
This is the same idea behind parallel reductions (per-worker partial results
merged at the end, as std::reduce with std::execution::par is free to do),
jemalloc’s per-thread arenas, and most high-performance counters in
production. We didn’t benchmark it separately because it sidesteps the problem
rather than demonstrating it, but it’s the right answer for most stats
collection.
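Wrapped up as a small class, the per-thread pattern might look like the sketch below. `ShardedCounter`, `Slot`, and `kMaxThreads` are hypothetical names, and the design assumes each slot index is used by exactly one writer thread. The slot value is a relaxed atomic so that a reader summing the slots while writers are running stays well-defined.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxThreads = 8;  // hypothetical upper bound

// One slot per thread, each occupying its own 64-byte line.
struct alignas(64) Slot {
    std::atomic<std::uint64_t> value{0};
};

class ShardedCounter {
public:
    // Must only be called by the single thread that owns `slot`. Because
    // there is one writer per slot, a relaxed load followed by a relaxed
    // store is enough; no atomic read-modify-write is needed.
    void increment(std::size_t slot) {
        assert(slot < kMaxThreads);
        auto& v = slots_[slot].value;
        v.store(v.load(std::memory_order_relaxed) + 1,
                std::memory_order_relaxed);
    }

    // Safe to call from any thread. The sum may lag increments that are
    // in flight while it runs.
    std::uint64_t total() const {
        std::uint64_t sum = 0;
        for (const auto& s : slots_) {
            sum += s.value.load(std::memory_order_relaxed);
        }
        return sum;
    }

private:
    std::array<Slot, kMaxThreads> slots_;
};
```

Each worker calls `increment` with its own index, and a reporting thread calls `total()` whenever it needs a snapshot. If two threads can ever share a slot index, switch `increment` to `fetch_add`.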
std::hardware_destructive_interference_size
C++17 gave us a portable way to express “keep these things on separate cache
lines”: std::hardware_destructive_interference_size. It’s a constexpr std::size_t defined in <new> that the implementation sets to the cache line
size that causes destructive interference (false sharing).
#include <new>
// std::hardware_destructive_interference_size == 64 on both
// GCC 15.2.1 and Clang 21.1.8 (x86_64, Fedora 43)
Both GCC 15 and Clang 21 define __cpp_lib_hardware_interference_size to
201703L and report a value of 64. Its complement,
std::hardware_constructive_interference_size (also 64), tells you the line
size for data you want to share — useful for packing related fields together.
Using it in practice:
struct StdPadded {
    alignas(std::hardware_destructive_interference_size)
        std::atomic<std::uint64_t> a{0};
    alignas(std::hardware_destructive_interference_size)
        std::atomic<std::uint64_t> b{0};
};
The measured performance is identical to alignas(64):
| Variant | 2 Threads (ns/op) | 8 Threads (ns/op) |
|---|---|---|
| No alignment | 38.82 | 143.56 |
| alignas(64) | 6.66 | 96.96 |
| alignas(std::h…) | 6.90 | 92.34 |
The numbers overlap within measurement noise. On x86_64, the constant is 64 on every implementation I’ve tested, so the codegen is identical.
The reality check. The constant is a compile-time value baked into the binary. If you compile on a machine with 64-byte cache lines and deploy to a machine with 128-byte lines (some ARM server cores), the constant is wrong. Apple shipped this header with a value of 128 on Apple Silicon specifically to avoid that trap. For x86_64 targets, 64 has been correct for over two decades and isn’t changing soon. For portable libraries targeting ARM, consider making the alignment value a build-system parameter rather than trusting the constant from the build host.
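One way to make the alignment a build-system parameter is a macro override with the standard constant as a fallback. This is a sketch under assumptions: `MYPROJ_CACHE_LINE`, `kCacheLine`, and `PaddedCounter` are hypothetical names, and the final fallback of 64 is only a reasonable default for x86_64.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// MYPROJ_CACHE_LINE is a hypothetical macro the build system could define,
// for example -DMYPROJ_CACHE_LINE=128 for a target with 128-byte lines.
#if defined(MYPROJ_CACHE_LINE)
inline constexpr std::size_t kCacheLine = MYPROJ_CACHE_LINE;
#elif defined(__cpp_lib_hardware_interference_size)
inline constexpr std::size_t kCacheLine =
    std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t kCacheLine = 64;  // assumed default for x86_64
#endif

static_assert(kCacheLine >= sizeof(std::uint64_t) &&
                  (kCacheLine & (kCacheLine - 1)) == 0,
              "cache line size must be a power of two");

// A counter that occupies a whole line of the configured size.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

static_assert(alignof(PaddedCounter) == kCacheLine);
static_assert(sizeof(PaddedCounter) % kCacheLine == 0);
```

Keeping the value in one named constant also means that a port to a new target is a one-line build change rather than a search for hard-coded 64s.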
The Production Scenario
A web server’s stats collector tracking requests, errors, bytes sent, and bytes received — four atomic counters, four threads, each updating its own counter:
| Layout | 1T (ns/op) | 2T (ns/op) | 4T (ns/op) | 8T (ns/op) |
|---|---|---|---|---|
| Packed | 6.81 | 37.54 | 77.02 | 149.37 |
| Padded | 6.80 | 6.96 | 7.16 | 57.71 |
| Speedup | 1.0x | 5.4x | 10.8x | 2.6x |
At four threads, the sweet spot where each thread gets its own counter and its
own physical core, the padded version delivers 139.6 million operations per
second versus 13.0 million for packed. That’s a 10.8x difference from adding
alignas(64) to four struct members. The total memory cost: 224 extra bytes of
padding (56 per counter).
At eight threads on four physical cores, hyperthreading muddies the picture: two logical threads per core share the L1 cache, so even the padded version sees some slowdown. But the packed version is still 2.6x worse.
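For reference, the two layouts compared here might look like the sketch below. The field names are guesses based on the description (requests, errors, bytes sent, bytes received), and `mops_per_second` is a hypothetical helper that converts the table’s ns/op figures into throughput.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical field names for the four counters described above.
struct PackedStats {
    std::atomic<std::uint64_t> requests{0};
    std::atomic<std::uint64_t> errors{0};
    std::atomic<std::uint64_t> bytes_sent{0};
    std::atomic<std::uint64_t> bytes_received{0};
};

struct PaddedStats {
    alignas(64) std::atomic<std::uint64_t> requests{0};
    alignas(64) std::atomic<std::uint64_t> errors{0};
    alignas(64) std::atomic<std::uint64_t> bytes_sent{0};
    alignas(64) std::atomic<std::uint64_t> bytes_received{0};
};

static_assert(sizeof(PackedStats) == 32);   // small enough to fit in one line
static_assert(sizeof(PaddedStats) == 256);  // one 64-byte line per counter

// Converts a latency in ns/op into millions of operations per second.
constexpr double mops_per_second(double ns_per_op) {
    return 1000.0 / ns_per_op;
}
```

Plugging in the 4-thread figures, `mops_per_second(7.16)` is roughly 139.7 and `mops_per_second(77.02)` is roughly 13.0, which is where the throughput comparison comes from.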
The Checklist
Before you alignas(64) everything in sight:
- Measure first. Use perf c2c record on Linux, or VTune’s Memory Access analysis on any platform. If HITM counts are low, you don’t have false sharing.
- Only pad hot fields. Read-only data shared across threads doesn’t false-share. Contention requires at least one thread writing to the cache line.
- Consider the memory cost. Padding a 16-million-element array of 8-byte counters to 64-byte alignment turns 128 MB into 1 GB. That’s probably not what you want.
- Restructure before padding. Hot/cold separation often eliminates the problem without wasting memory, and it improves cache utilization for readers too.
- Thread-local first. If the data is only aggregated periodically, per-thread accumulation eliminates coherence traffic entirely.
False sharing is a hardware-level performance bug with a software-level fix. The tooling finds it in seconds. The fixes are mechanical. The only hard part is knowing to look.
publish_date: 2026-04-19