You write a struct with three fields. A char, a double, another char. Ten bytes of data.

sizeof returns 24.

Fourteen bytes appeared out of nowhere. To understand where they came from, you need to see the grid your data lives on — and the two opposing forces that grid creates.

Cache lines and the hardware grid

Your memory is not a flat sea of bytes. The hardware divides it into fixed-size blocks called cache lines, 64 bytes on most mainstream CPUs.

When the CPU reads a single byte, it loads the entire 64-byte block that contains it. Every memory access operates on this grid — 64-byte-aligned boundaries, carved into your address space by the hardware.

Your data either fits inside one of these blocks or straddles the boundary between two. That difference has real cost.

Split loads

A double is 8 bytes. If it starts at an address that is a multiple of 8, it sits inside a single cache line. One fetch.

If that double starts at byte 61 of one cache line, three bytes land in one block, five in the next. The CPU fetches two cache lines, extracts the pieces from each, and reassembles them. That is a split load — twice the memory traffic. On some architectures it costs extra cycles. On others it is a fault.
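The boundary check is simple arithmetic. A small sketch, treating addresses as offsets from the start of memory and assuming 64-byte lines:

```cpp
#include <cstddef>
#include <cstdint>

// Index of the 64-byte cache line containing a given byte address.
constexpr std::uintptr_t line_of(std::uintptr_t addr) { return addr / 64; }

// An object straddles a line boundary when its first and last bytes
// fall in different 64-byte blocks.
constexpr bool straddles(std::uintptr_t addr, std::size_t size) {
    return line_of(addr) != line_of(addr + size - 1);
}

static_assert(!straddles(56, 8));  // bytes 56..63: one line, one fetch
static_assert( straddles(61, 8));  // bytes 61..68: split load, two fetches
```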

The compiler prevents this with padding. It inserts gaps so every field starts at an address that is a multiple of its own size. A double aligns to 8 bytes. An int to 4. A char needs no alignment.

Those 14 bytes are the compiler making sure no field ever straddles a boundary that would split a load.

Padding made visible

Here is the struct with offsets annotated:

struct Sensor {
    char   id;       // offset 0   — 1 byte
    // 7 bytes padding (align double to offset 8)
    double reading;  // offset 8   — 8 bytes
    char   status;   // offset 16  — 1 byte
    // 7 bytes padding (round struct size to 8)
};
// sizeof(Sensor) == 24

The alignment rule: each field aligns to its own size, and the struct’s total size rounds up to the largest member’s alignment. That final padding ensures that in an array, the next element’s double also lands on an 8-byte boundary.
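The rule is checkable at compile time. A sketch using offsetof and static_assert; the concrete numbers assume a typical 64-bit ABI where double aligns to 8 bytes:

```cpp
#include <cstddef>  // offsetof

struct Sensor {
    char   id;
    double reading;
    char   status;
};

// Each field lands at a multiple of its own alignment.
static_assert(offsetof(Sensor, id)      == 0);
static_assert(offsetof(Sensor, reading) == 8);   // 7 padding bytes before it
static_assert(offsetof(Sensor, status)  == 16);

// The total size rounds up to the largest member's alignment.
static_assert(sizeof(Sensor) == 24);
static_assert(alignof(Sensor) == alignof(double));
```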

Reorder the fields largest-first and the padding collapses:

struct Sensor {
    double reading;  // offset 0   — 8 bytes
    char   id;       // offset 8   — 1 byte
    char   status;   // offset 9   — 1 byte
    // 6 bytes padding (round up to 8)
};
// sizeof(Sensor) == 16

Same data. Eight bytes smaller. In an array of a million sensors, that is 8 MB less memory — and more objects per cache line.

Empty members and hidden padding

Field reordering handles one source of waste. C++20 handles another.

Consider a struct that carries a stateless comparator or allocator — common in generic code:

struct Config {
    double threshold;          // 8 bytes
    std::less<> comparator;    // 0 bytes of state
};
// sizeof(Config) == 16

std::less<> stores nothing. It has no data members. But C++ requires every object to occupy at least one byte — and alignment padding rounds that single byte up to 8. The empty member costs you 8 bytes.
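Both facts are easy to check. std::less<> is an empty type, yet as a plain member it still occupies storage; the sizes below assume a typical 64-bit ABI:

```cpp
#include <functional>   // std::less
#include <type_traits>  // std::is_empty_v

struct Config {
    double threshold;        // 8 bytes
    std::less<> comparator;  // stateless, but still an object
};

static_assert(std::is_empty_v<std::less<>>);  // no data members
static_assert(sizeof(std::less<>) == 1);      // yet every object needs an address
static_assert(sizeof(Config) == 16);          // 1 byte member + 7 padding bytes
```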

C++20’s [[no_unique_address]] removes that guarantee for members:

struct Config {
    double threshold;
    [[no_unique_address]] std::less<> comparator;
};
// sizeof(Config) == 8

The attribute tells the compiler that comparator does not need its own address. If the member is empty, the compiler can overlap it with the padding of a neighboring field — or eliminate its footprint entirely.

This matters most in templates. std::tuple, for instance, uses this (or equivalent techniques) to avoid paying for empty types — the empty base optimization applied to members instead of bases. If your generic container carries a stateless allocator, deleter, or comparator, [[no_unique_address]] can shave bytes off every instance. Across a million objects in a contiguous array, those bytes determine how many fit in each cache line the prefetcher loads.
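A minimal sketch of the pattern in generic code: a container-like wrapper that carries a comparator without paying for it when the comparator is stateless. FlatSet is an illustrative name, not a standard type, and the zero-cost overlap holds on compilers that honor the attribute (MSVC requires its [[msvc::no_unique_address]] spelling instead):

```cpp
#include <functional>
#include <vector>

template <class T, class Compare = std::less<T>>
class FlatSet {
    std::vector<T> data_;
    [[no_unique_address]] Compare cmp_;  // zero footprint when Compare is empty
public:
    bool less(const T& a, const T& b) const { return cmp_(a, b); }
};

// With a stateless comparator, the wrapper costs no more than the vector alone.
static_assert(sizeof(FlatSet<int>) == sizeof(std::vector<int>));
```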

Where alignment comes from

The compiler handles alignment within a struct. But who aligns the struct itself?

On the stack, the compiler adjusts the stack pointer automatically. Even alignas(64) variables on the stack get the right treatment without intervention from you.

The heap is different. malloc guarantees alignment sufficient for any fundamental type — typically 16 bytes on 64-bit systems. That covers double and long long. But 16 is not 64.

If your struct starts at byte 48 of a cache line, the first 16 bytes live in one block and the remaining bytes in the next. Field-level alignment is intact — no split loads on individual members. But the struct as a whole spans two cache lines. For a struct accessed millions of times in a tight loop, that penalty accumulates.

C11 introduced aligned_alloc — pass the alignment, the size, and you get a pointer on the boundary you asked for. Raw void*, manual lifetime.
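A sketch of that interface, as exposed to C++ via <cstdlib> (C++17, and not available on all platforms; MSVC, for one, does not provide it). Note the C11 quirk: the requested size should be a multiple of the alignment, so the helper rounds up:

```cpp
#include <cstddef>
#include <cstdlib>  // std::aligned_alloc, std::free

// Allocate a 64-byte-aligned buffer. Caller owns it: free with std::free.
void* alloc_cacheline(std::size_t bytes) {
    // C11 expects the size to be a multiple of the alignment.
    std::size_t rounded = (bytes + 63) / 64 * 64;
    return std::aligned_alloc(64, rounded);
}
```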

C++17 went further. If your type has an over-aligned requirement, operator new respects it:

struct alignas(64) SensorBlock {
    double readings[8];
};

// C++17: alignment-aware new — no aligned_alloc needed.
auto* block = new SensorBlock;
auto  uptr  = std::make_unique<SensorBlock>();

Before C++17, new ignored over-alignment silently. Your alignas(64) compiled without warnings but new returned a 16-byte-aligned pointer. The type promised 64-byte alignment. The allocator did not deliver. C++17 closed that gap.
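A quick way to check what your toolchain actually delivers, assuming C++17 or later:

```cpp
#include <cstdint>
#include <memory>

struct alignas(64) SensorBlock {
    double readings[8];
};

// True when p sits on a 64-byte boundary.
bool is_cacheline_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}

// Under C++17's aligned new, make_unique<SensorBlock>() yields a pointer
// for which is_cacheline_aligned returns true; pre-C++17, it could not.
```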

When the grid turns against you: false sharing

Everything above is single-threaded. Add a second core and the 64-byte grid creates a new problem.

Two threads. Two counters. No shared state, no locks, no atomics:

struct Counters {
    int a;  // Thread 1 writes here
    int b;  // Thread 2 writes here
};

Counters c{};

// Thread 1
for (int i = 0; i < N; ++i)
    c.a++;

// Thread 2
for (int i = 0; i < N; ++i)
    c.b++;

No data race. Each thread owns its counter. But a and b are 4 bytes apart — same cache line. And to the hardware, the cache line is the unit of sharing, not the variable.

You add a second thread and the code can run an order of magnitude slower than it did single-threaded.
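A runnable sketch of the experiment. The exact slowdown depends on your core count, interconnect, and compiler; volatile here is only a measurement aid that keeps the optimizer from collapsing each loop into a single add, not a threading tool:

```cpp
#include <chrono>
#include <thread>

// volatile prevents the compiler from hoisting the counter into a register.
struct Counters { volatile int a = 0; volatile int b = 0; };

// Run the two writer threads and return elapsed milliseconds.
long long run_counters(Counters& c, int n) {
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([&] { for (int i = 0; i < n; ++i) c.a = c.a + 1; });
    std::thread t2([&] { for (int i = 0; i < n; ++i) c.b = c.b + 1; });
    t1.join(); t2.join();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - t0).count();
}
```

Time this, then time the same loops run back to back on one thread. With a and b in the same cache line, the two-thread version is the one that loses.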

The MESI protocol

Every modern multicore CPU guarantees cache coherence: if one core writes to a memory location, every other core will eventually see that write. The hardware tracks this per cache line using a protocol called MESI (real chips run MESI or a close variant such as MESIF or MOESI, but the four base states tell the story). Each line on each core sits in one of four states:

Modified — this core wrote to the line. No other core has a copy. The data has not reached main memory yet.

Exclusive — this core holds the only copy, matching main memory. A write can proceed without notifying anyone.

Shared — multiple cores hold clean copies. A write requires the others to invalidate first.

Invalid — this core’s copy is stale. The next access fetches a fresh copy from another core or memory.

The transitions between these states carry the cost. A write to a Shared line forces it to Invalid on every other core. They must re-fetch before they can read again.

The ping-pong

Back to the two counters, both in the same cache line.

Core 1 writes to a. The line moves to Modified on Core 1. Core 2’s copy becomes Invalid.

Core 2 writes to b. Its copy is Invalid, so it sends a request across the interconnect. Core 1 flushes the modified line. Core 2 gets a fresh copy. The line moves to Modified on Core 2. Core 1’s copy becomes Invalid.

Core 1 writes to a again. Same round trip.

Every iteration triggers an invalidation and a cross-core transfer — tens of nanoseconds each. The threads never touch each other’s data. The hardware shares the container the data lives in.

This is false sharing.

The fix: keep apart

You force each counter onto its own cache line:

#include <new>

struct Counters {
    alignas(std::hardware_destructive_interference_size) int a;
    alignas(std::hardware_destructive_interference_size) int b;
};

sizeof(Counters) jumps from 8 to 128. You pay 120 bytes of padding to eliminate cross-core invalidation. In a tight loop that trade is not close — the padding costs nothing at runtime, the false sharing costs an order of magnitude.

hardware_destructive_interference_size is the standard’s name for “the minimum offset between two objects to avoid false sharing.” On most current hardware: 64. The name is verbose but the intent is precise.

The mirror: keep together

hardware_destructive_interference_size answers one question: how far apart must two objects be so that writes to one do not invalidate the other?

C++17 defines a second constant: hardware_constructive_interference_size. It answers the opposite question: how close must two objects be so that loading one brings the other into cache for free?

#include <cstdint>
#include <new>

// Two fields that are always read together.
// Pack them within one cache line.
struct alignas(std::hardware_constructive_interference_size) HotPair {
    uint32_t key;
    uint32_t value;
};

static_assert(
    sizeof(HotPair) <= std::hardware_constructive_interference_size,
    "HotPair must fit in a single cache line"
);

When the CPU fetches key, the entire cache line comes with it. If value sits in that same line, you get it for free — no second memory access, no additional latency. If value spills into the next cache line, you pay for a second fetch every time you access the pair.

This is what the Sensor field reordering accomplished earlier. Shrinking from 24 to 16 bytes packs more sensors per cache line. Each prefetcher load carries more useful data. The mechanism is the same one constructive_interference_size names: co-locate data that you access together.

The two constants form a pair:

Constant                          Question                                Design move
destructive_interference_size     How far apart to avoid invalidation?    Keep apart
constructive_interference_size    How close to share a cache fetch?       Keep together

On most hardware both equal 64. The values match because the mechanism is the same — one cache line, 64 bytes. But the design intent is opposite, and naming that intent in your code tells the next reader why you chose the layout, not just what it is.

The full picture

One grid. 64-byte blocks. Three design moves that follow from it.

The compiler pads your struct fields to prevent split loads across boundaries. When two cores write to the same cache line, the MESI protocol forces an invalidation round trip on every write — you fix that by separating the fields onto different lines. When two fields are always read together, you pack them into the same line so one fetch carries both.

Keep apart. Keep together. Both decisions trace back to the same 64-byte boundary.

Pick a struct you use in a hot path. Run sizeof. Reorder the fields largest-first and run it again. Then check: do two threads write to fields in the same cache line? Separate them. Do you always read two fields together? Make sure they share one. Print the offsets with offsetof and the addresses with std::cout — the numbers will tell you which move to make.