
Optimizing Rust Data Layout for Cache Efficiency
Why cache efficiency matters
Modern CPUs are fast at computation and slow at memory access. When your code touches data that is scattered across memory, the CPU spends more time waiting than working. Cache efficiency is about arranging data so that the bytes needed together are stored together.
In Rust, the default structure choices often encourage clarity first:
- `Vec<T>` is contiguous and usually cache-friendly.
- `HashMap<K, V>` is great for lookup, but not for linear scans.
- `Vec<Box<T>>` and linked structures spread data across the heap.
- Large structs with rarely used fields can waste cache space.
The goal is not to avoid all abstraction. It is to make the memory access pattern match the workload.
The core principle: access patterns should shape layout
A good rule is simple: if your code reads fields together, store them together. If it processes many items in a tight loop, keep those items contiguous. If it only needs a subset of fields, avoid dragging unrelated data into the cache.
Consider a telemetry system that stores millions of sensor readings:
```rust
struct Reading {
    timestamp: u64,
    sensor_id: u32,
    temperature: f32,
    humidity: f32,
    status: u8,
    reserved: [u8; 7],
}
```

This layout is fine if every operation needs all fields. But if one hot path only computes the average temperature, loading the full struct for each reading may waste cache bandwidth. A more cache-aware design might separate frequently scanned numeric values from metadata.
Prefer contiguous collections for hot data
A contiguous `Vec<T>` is usually the best starting point. It gives the CPU a predictable access pattern and benefits from hardware prefetching.
Good fit: sequential processing
```rust
fn sum_temperatures(readings: &[Reading]) -> f32 {
    let mut total = 0.0;
    for reading in readings {
        total += reading.temperature;
    }
    total
}
```

This is cache-friendly because the loop walks memory linearly. The CPU can fetch the next cache line before it is needed.
Less cache-friendly: pointer chasing
```rust
struct Node {
    value: f32,
    next: Option<Box<Node>>,
}
```

This linked-list style layout is easy to express, but each node may live anywhere on the heap. Traversal becomes a chain of dependent pointer loads, which is expensive on modern CPUs.
If you need ordered traversal, prefer `Vec<T>` or an index-based arena structure. For example, store nodes in a `Vec<Node>` and link them with indices instead of pointers.
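A minimal sketch of that index-based approach, with the node shape rebuilt around `usize` links (names illustrative):

```rust
// A singly linked list stored in one Vec, with usize indices instead
// of Box pointers. Traversal still follows links, but all nodes share
// one contiguous allocation.
struct Node {
    value: f32,
    next: Option<usize>, // index into the nodes slice, not a heap pointer
}

fn sum_list(nodes: &[Node], head: usize) -> f32 {
    let mut total = 0.0;
    let mut cur = Some(head);
    while let Some(i) = cur {
        total += nodes[i].value;
        cur = nodes[i].next;
    }
    total
}

fn main() {
    // Build the logical order 3.0 -> 1.0 -> 2.0 inside one Vec.
    let nodes = vec![
        Node { value: 3.0, next: Some(1) },
        Node { value: 1.0, next: Some(2) },
        Node { value: 2.0, next: None },
    ];
    assert_eq!(sum_list(&nodes, 0), 6.0);
}
```

Indices also sidestep the borrow-checker friction that owning pointers cause in cyclic or mutable graph structures.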
Split hot and cold fields
One of the most effective layout optimizations is separating frequently accessed fields from rarely used ones. This is often called hot/cold splitting.
Before: mixed access patterns
```rust
struct Session {
    user_id: u64,
    last_seen: u64,
    request_count: u32,
    auth_token: String,
    profile_json: String,
    debug_notes: Option<String>,
}
```

If a request handler only needs `user_id`, `last_seen`, and `request_count`, loading the large strings into cache is unnecessary.
After: split the structure
```rust
struct SessionHot {
    user_id: u64,
    last_seen: u64,
    request_count: u32,
}

struct SessionCold {
    auth_token: String,
    profile_json: String,
    debug_notes: Option<String>,
}

struct Session {
    hot: SessionHot,
    cold: SessionCold,
}
```

This design keeps the hot fields compact and improves the chance that more active sessions fit in cache. It also makes it easier to move cold data out of the critical path.
When to split
Use hot/cold splitting when:
- one subset of fields is accessed in most requests,
- another subset is accessed rarely,
- the hot path iterates over many instances,
- the cold fields are large, variable-sized, or expensive to touch.
Avoid splitting if it makes the code awkward and the data set is small enough that cache pressure is irrelevant.
Choose array-of-structs or struct-of-arrays deliberately
A major layout decision is whether to store data as an array of structs or a struct of arrays.
| Layout | Example | Best for | Tradeoff |
|---|---|---|---|
| Array of structs | `Vec<Particle>` | Accessing many fields of each item | May load unused fields |
| Struct of arrays | Separate `Vec`s per field | Accessing one or two fields across many items | More bookkeeping, harder to keep in sync |
Array of structs
```rust
struct Particle {
    x: f32,
    y: f32,
    vx: f32,
    vy: f32,
}

fn integrate(particles: &mut [Particle], dt: f32) {
    for p in particles {
        p.x += p.vx * dt;
        p.y += p.vy * dt;
    }
}
```

This is ideal when each iteration needs all fields.
Struct of arrays
```rust
struct Particles {
    x: Vec<f32>,
    y: Vec<f32>,
    vx: Vec<f32>,
    vy: Vec<f32>,
}

fn integrate(p: &mut Particles, dt: f32) {
    for i in 0..p.x.len() {
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
    }
}
```

This layout is often better when a loop processes one field at a time, such as computing all positions or all velocities. Each vector is compact and easy for the CPU to stream through.
Practical guidance
- Use array-of-structs for object-centric logic.
- Use struct-of-arrays for numeric kernels and batch processing.
- Benchmark both if the workload is performance-critical.
Reduce padding and improve field ordering
Rust lays out struct fields with alignment in mind. Poor field ordering can introduce padding bytes that increase size and reduce cache density.
Example of suboptimal ordering
```rust
struct BadOrder {
    a: u8,
    b: u64,
    c: u16,
    d: u32,
}
```

Under `#[repr(C)]`, this declaration order introduces padding bytes between fields to satisfy alignment requirements. (Rust's default representation is allowed to reorder fields to reduce padding, but explicit ordering still matters whenever you need a stable, declared layout.)
Better ordering
```rust
struct GoodOrder {
    b: u64,
    d: u32,
    c: u16,
    a: u8,
}
```

Grouping larger fields first often reduces padding. Smaller structs mean more items per cache line, which can improve throughput when iterating over large collections.
Best practices
- Order fields from largest alignment to smallest.
- Measure `std::mem::size_of::<T>()` when layout matters.
- Be careful with `#[repr(C)]`; it is useful for FFI and stable layout, but it may not always be the most compact arrangement.
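The padding cost is easy to measure directly. A minimal sketch (the `BadOrderC` and `GoodOrderC` names are illustrative), using `#[repr(C)]` so the declared order is exactly what the compiler emits:

```rust
use std::mem::size_of;

// #[repr(C)] fixes the declared order; with Rust's default
// representation the compiler may reorder fields on its own.
#[repr(C)]
struct BadOrderC {
    a: u8,  // offset 0, then 7 bytes of padding
    b: u64, // offset 8
    c: u16, // offset 16, then 2 bytes of padding
    d: u32, // offset 20
}

#[repr(C)]
struct GoodOrderC {
    b: u64, // offset 0
    d: u32, // offset 8
    c: u16, // offset 12
    a: u8,  // offset 14, then 1 byte of tail padding
}

fn main() {
    assert_eq!(size_of::<BadOrderC>(), 24);
    assert_eq!(size_of::<GoodOrderC>(), 16);
}
```

The same 15 bytes of payload cost 24 bytes in one order and 16 in the other, so the good ordering fits half again as many items per cache line.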
Avoid unnecessary indirection
Every extra pointer can cost cache locality. A `Box<T>` is useful when you need ownership with stable addresses or recursive types, but it is not free.
Prefer inline storage when possible
```rust
struct Record {
    id: u64,
    payload: [u8; 32],
}
```

This keeps the data together.
Use indirection only when it pays off
```rust
struct Record {
    id: u64,
    payload: Box<[u8]>,
}
```

This may be appropriate for large, variable-sized payloads, but it adds an extra memory access and scatters data across the heap.
A good rule: if the data is small and frequently accessed, store it inline. If it is large, optional, or rarely touched, consider indirection.
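One way to apply this rule is an enum that stores small payloads inline and boxes only large ones; the `Payload` type and the 32-byte threshold here are illustrative, not from the original:

```rust
// Small payloads live in place; only large ones pay for indirection.
enum Payload {
    Inline { len: u8, bytes: [u8; 32] }, // small: stored inline
    Heap(Box<[u8]>),                     // large: one extra pointer hop
}

fn make_payload(data: &[u8]) -> Payload {
    if data.len() <= 32 {
        let mut bytes = [0u8; 32];
        bytes[..data.len()].copy_from_slice(data);
        Payload::Inline { len: data.len() as u8, bytes }
    } else {
        Payload::Heap(data.to_vec().into_boxed_slice())
    }
}

fn payload_len(p: &Payload) -> usize {
    match p {
        Payload::Inline { len, .. } => *len as usize,
        Payload::Heap(b) => b.len(),
    }
}

fn main() {
    assert!(matches!(make_payload(&[1, 2, 3]), Payload::Inline { .. }));
    assert!(matches!(make_payload(&[0u8; 100]), Payload::Heap(_)));
    assert_eq!(payload_len(&make_payload(&[1, 2, 3])), 3);
}
```

This is the same design idea behind small-string and small-vector optimizations found in several crates.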
Use arenas and indices for graph-like data
Graphs, trees, and other pointer-heavy structures are often cache-unfriendly when modeled with Box and references. An arena-backed design can improve locality by storing nodes in a contiguous vector.
```rust
struct Node {
    value: i32,
    children: Vec<usize>,
}

struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    fn add_node(&mut self, value: i32) -> usize {
        let id = self.nodes.len();
        self.nodes.push(Node {
            value,
            children: Vec::new(),
        });
        id
    }
}
```

This approach has several benefits:
- nodes are stored contiguously,
- traversal uses indices instead of pointers,
- allocation overhead is reduced,
- cache behavior is more predictable.
It is especially useful for compilers, ASTs, ECS systems, and dependency graphs.
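To make the traversal concrete, here is a usage sketch of such an arena; the `add_child` and `sum` helpers are illustrative additions:

```rust
struct Node {
    value: i32,
    children: Vec<usize>,
}

struct Arena {
    nodes: Vec<Node>,
}

impl Arena {
    fn add_node(&mut self, value: i32) -> usize {
        let id = self.nodes.len();
        self.nodes.push(Node { value, children: Vec::new() });
        id
    }

    fn add_child(&mut self, parent: usize, value: i32) -> usize {
        let id = self.add_node(value);
        self.nodes[parent].children.push(id);
        id
    }

    fn sum(&self, root: usize) -> i32 {
        // Iterative DFS over indices; every node lives in one Vec.
        let mut total = 0;
        let mut stack = vec![root];
        while let Some(i) = stack.pop() {
            total += self.nodes[i].value;
            stack.extend(&self.nodes[i].children);
        }
        total
    }
}

fn main() {
    let mut arena = Arena { nodes: Vec::new() };
    let root = arena.add_node(1);
    let left = arena.add_child(root, 2);
    arena.add_child(root, 3);
    arena.add_child(left, 4);
    assert_eq!(arena.sum(root), 10);
}
```

Dropping the arena frees the whole tree in one shot, another advantage over node-by-node `Box` deallocation.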
Keep hot loops simple and linear
Even with a good layout, a complex access pattern can ruin cache efficiency. Hot loops should ideally:
- walk memory linearly,
- avoid jumping between unrelated collections,
- minimize random access,
- process data in batches.
For example, if you need to update counters and filter records, it is often better to do both in one pass over a contiguous slice than to make multiple passes over different structures.
```rust
fn process(readings: &[Reading]) -> (usize, f32) {
    let mut count = 0;
    let mut total = 0.0;
    for r in readings {
        if r.status == 1 {
            count += 1;
            total += r.temperature;
        }
    }
    (count, total)
}
```

This single pass keeps the working set small and avoids repeated scans.
Measure the impact instead of guessing
Cache optimizations are highly workload-dependent. A layout that helps one program may hurt another. Always verify with benchmarks and profiling.
Useful signals include:
- reduced runtime in a tight loop,
- lower memory bandwidth usage,
- fewer cache misses in a profiler,
- improved throughput under realistic data sizes.
In Rust, use benchmarking tools such as `criterion` and profiling tools such as `perf`, `cargo flamegraph`, or platform-specific profilers. Test with production-like data distributions, not just tiny synthetic inputs.
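For a quick sanity check before reaching for those tools, `std::time::Instant` is enough to expose access-pattern effects on large inputs. This sketch times the same sum over contiguous versus scattered index orders (the stride-based shuffle is illustrative); it is no substitute for a real benchmark harness:

```rust
use std::time::Instant;

// Sum `data` in the order given by `order`, returning the sum and
// the elapsed time in nanoseconds.
fn timed_sum(data: &[f32], order: &[usize]) -> (f32, u128) {
    let start = Instant::now();
    let mut total = 0.0;
    for &i in order {
        total += data[i];
    }
    (total, start.elapsed().as_nanos())
}

fn main() {
    let n = 1_000_000;
    let data = vec![1.0f32; n];
    let linear: Vec<usize> = (0..n).collect();
    // Pseudo-random permutation via a prime multiplicative stride.
    let strided: Vec<usize> = (0..n).map(|i| (i * 7_919) % n).collect();

    let (s1, t_linear) = timed_sum(&data, &linear);
    let (s2, t_strided) = timed_sum(&data, &strided);
    assert_eq!(s1, s2); // same work, different access pattern
    println!("linear: {} ns, strided: {} ns", t_linear, t_strided);
}
```

Run in release mode; debug builds can mask or exaggerate the difference.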
Common mistakes to avoid
Over-optimizing small data sets
If your collection has only a few hundred items, cache layout may not matter. Favor clarity unless measurements show a real bottleneck.
Prematurely using `#[repr(packed)]`
Packed structs can reduce size, but they may create unaligned accesses and make field access slower or unsafe in some contexts. They are usually not a general-purpose performance tool.
Mixing unrelated workloads
If one structure serves both a read-heavy analytics path and a write-heavy mutation path, the best layout may differ for each. Consider separating representations by use case.
Ignoring allocation patterns
Layout and allocation are related. A contiguous layout loses much of its advantage if every element contains multiple heap allocations.
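As an illustration of reducing per-element allocations, many small strings can be packed into one buffer with offset spans instead of a `Vec<String>`; this `FlatStrings` type is a sketch, not a library API:

```rust
// One buffer plus (start, end) byte ranges replaces one heap
// allocation per string.
struct FlatStrings {
    buffer: String,
    spans: Vec<(usize, usize)>, // byte ranges into `buffer`
}

impl FlatStrings {
    fn new() -> Self {
        FlatStrings { buffer: String::new(), spans: Vec::new() }
    }

    fn push(&mut self, s: &str) {
        let start = self.buffer.len();
        self.buffer.push_str(s);
        self.spans.push((start, self.buffer.len()));
    }

    fn get(&self, i: usize) -> &str {
        let (start, end) = self.spans[i];
        &self.buffer[start..end]
    }

    fn len(&self) -> usize {
        self.spans.len()
    }
}

fn main() {
    let mut fs = FlatStrings::new();
    fs.push("hello");
    fs.push("world");
    assert_eq!(fs.get(0), "hello");
    assert_eq!(fs.get(1), "world");
    assert_eq!(fs.len(), 2);
}
```

Iterating the packed buffer is one linear walk, whereas `Vec<String>` scatters each string across the heap.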
A practical decision checklist
Before changing a data structure, ask:
- Is this code on a hot path?
- Do I iterate over many items sequentially?
- Are some fields rarely used?
- Can I reduce pointer chasing?
- Would a struct-of-arrays layout fit the access pattern better?
- Have I benchmarked the current and proposed designs?
If the answer to several of these is yes, data layout is likely worth optimizing.
Conclusion
Cache-efficient data layout is one of the highest-leverage performance techniques in Rust. By keeping hot data contiguous, splitting cold fields, reducing indirection, and matching structure to access pattern, you can often gain substantial speedups without unsafe code or algorithmic changes.
The key is to design around the way your program actually reads memory. When the CPU can stream through compact, predictable data, Rust programs become faster, more scalable, and easier to reason about.
