
Why My O(1) Scheduler Was Slower Than O(log N) (And How I Fixed It)

Ankur Rathore
Senior Systems Engineer pivoting to High-Performance Infrastructure. Building zero-allocation network drivers and cache-friendly data structures.

In the world of high-performance networking (think C10M TCP servers), timers are the silent killers. Every connection needs a timeout. If you have 1 million active connections, you have 1 million timers.

Standard library implementations usually rely on a Binary Heap (Priority Queue). Heaps are great for scheduling (O(log N) insertion and pop), but they have a fatal flaw: cancellation. To cancel a specific timer in a standard heap, you typically have to perform a linear scan (O(N)) just to find it.
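To make the flaw concrete, here is a minimal sketch of cancelling one connection's timer in a std BinaryHeap (the Timer alias and cancel helper are my own illustration, not from any particular server). The heap offers no handle-based removal, so the only safe option is to drain, filter, and re-heapify:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// A timer keyed by (deadline, connection id). Reverse turns the max-heap
// into a min-heap on deadline.
type Timer = Reverse<(u64, u32)>;

/// Cancel the timer belonging to `conn_id`. BinaryHeap has no way to
/// remove an arbitrary element, so we scan every entry: O(N).
fn cancel(heap: &mut BinaryHeap<Timer>, conn_id: u32) -> bool {
    let len = heap.len();
    // Drain into a Vec, drop the victim, rebuild the heap.
    let kept: Vec<Timer> = heap
        .drain()
        .filter(|Reverse((_, id))| *id != conn_id)
        .collect();
    *heap = BinaryHeap::from(kept); // heapify: O(N) again
    heap.len() != len
}

fn main() {
    let mut heap: BinaryHeap<Timer> =
        (0..5u32).map(|id| Reverse((100 + id as u64, id))).collect();
    assert!(cancel(&mut heap, 3));
    println!("remaining timers: {}", heap.len()); // prints "remaining timers: 4"
}
```

With a million connections, every timeout reset pays that O(N) twice.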

I decided to fix this by implementing the hierarchical timing wheel from the classic Varghese & Lauck (1987) paper in Rust. The promise? O(1) insertion and O(1) cancellation.

I built the engine. I ran the benchmarks. I expected a blowout victory.

Instead, the Standard Library Heap destroyed my implementation on insertion. Here is the story of how I investigated the hardware, refactored my memory layout, and eventually achieved a 1,700x speedup on the metric that actually matters.

The Architecture: Data-Oriented Design

I knew that a naive implementation using Box<Node> (pointer chasing) would destroy the CPU cache. So, I opted for a Slab Allocator approach.

Instead of scattering nodes across the heap, I allocate one giant Vec<Entry>. The “pointers” are just usize indices into this vector.

// The Slab Entry (Initial Version)
pub struct TimerEntry<T> {
    pub task: T,
    pub deadline: u64,
    // Intrusive Doubly Linked List logic
    pub next: Option<usize>, 
    pub prev: Option<usize>,
}

The Timing Wheel itself is just a set of buckets (arrays) pointing to the head index of a list in the Slab.
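A single-level sketch of that layout (simplified, with my own names; the real implementation is hierarchical and reuses freed slab slots):

```rust
// One slab entry: the "pointers" are plain indices into the slab Vec.
struct Entry {
    deadline: u64,
    next: Option<usize>,
    prev: Option<usize>,
}

// A single-level wheel: 256 buckets, each holding the head index of an
// intrusive doubly linked list whose nodes live in `slab`.
struct Wheel {
    slab: Vec<Entry>,
    buckets: [Option<usize>; 256],
}

impl Wheel {
    fn new() -> Self {
        Wheel { slab: Vec::new(), buckets: [None; 256] }
    }

    /// O(1): pick the bucket with a bitwise mask, push at the list head.
    fn insert(&mut self, deadline: u64) -> usize {
        let slot = (deadline & 255) as usize; // power-of-two mask, no division
        let idx = self.slab.len();
        let head = self.buckets[slot];
        self.slab.push(Entry { deadline, next: head, prev: None });
        if let Some(h) = head {
            self.slab[h].prev = Some(idx); // back-link the old head
        }
        self.buckets[slot] = Some(idx);
        idx // this index is the O(1) cancellation handle
    }
}

fn main() {
    let mut w = Wheel::new();
    let handle = w.insert(5);
    println!("deadline {} stored at slab index {}", w.slab[handle].deadline, handle);
}
```

The returned index doubles as the cancellation handle, which is the whole point of the design.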

The “Oh No” Moment

I fired up criterion to benchmark inserting 1,000,000 timers.

  • Theory: O(1) (Wheel) should crush O(log N) (Heap).
  • Reality:
    • Binary Heap: 2.0 ms
    • My Wheel: 64.5 ms

My O(1) implementation was 30x slower than the O(log N) implementation.
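For context, the heap side of such a benchmark can be approximated without criterion. This bare std::time::Instant harness is my simplification (it lacks criterion's warm-up and statistical analysis) but reproduces the monotonically increasing workload:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::time::Instant;

/// Push `n` monotonically increasing deadlines and report elapsed µs.
fn bench_heap_insert(n: u64) -> (BinaryHeap<Reverse<u64>>, u128) {
    let mut heap = BinaryHeap::with_capacity(n as usize);
    let start = Instant::now();
    // Sorted deadlines: 0, 1, 2, ... — the pattern that flatters the heap.
    for deadline in 0..n {
        heap.push(Reverse(deadline));
    }
    (heap, start.elapsed().as_micros())
}

fn main() {
    let (heap, micros) = bench_heap_insert(1_000_000);
    println!("inserted {} timers in {} µs", heap.len(), micros);
}
```

As the next section explains, this sorted input pattern is exactly what made the comparison unfair.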

The Investigation: “Big O” vs. “The Metal”

I posted my findings to the Rust community, and a senior engineer (matthieum) pointed out a critical flaw in my benchmark methodology and my memory layout.

It turns out I was fighting two invisible enemies: The Hardware Prefetcher and Struct Padding.

1. The Prefetcher Cheat

My benchmark was inserting monotonically increasing deadlines (0, 1, 2...).

  • The Heap: When you insert sorted data into a Binary Heap, it often just appends to the end of the backing vector. The CPU prefetcher loves sequential access. It predicts the next write perfectly.
  • The Wheel: Even with sorted input, the bitwise hashing logic sends timers to random slots. This meant writing to random indices in my Slab vector. Random writes = L1 Cache Misses.

To fix this, I changed the benchmark to use Random Inputs.
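Random deadlines can be generated without external crates. Both helpers below are my own illustrations: slot shows the per-level bitwise masking (assuming 64 slots per level, i.e. 6 bits each), and lcg is a tiny deterministic generator using Knuth's MMIX constants, enough to defeat the prefetcher:

```rust
/// Bucket index for a deadline at a given wheel level, assuming 64 slots
/// (6 bits) per level as in a classic hierarchical wheel.
fn slot(deadline: u64, level: u32) -> usize {
    ((deadline >> (6 * level)) & 63) as usize
}

/// Minimal LCG (Knuth's MMIX multiplier/increment) so the benchmark can
/// use random deadlines without pulling in the `rand` crate.
fn lcg(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state
}

fn main() {
    let mut seed = 0x9E3779B97F4A7C15u64;
    for _ in 0..3 {
        // Random deadlines scatter writes across buckets AND slab lines.
        println!("random deadline: {}", lcg(&mut seed) % 1_000_000);
    }
    println!("deadline 64 maps to level-1 slot {}", slot(64, 1)); // slot 1
}
```

Feeding the same LCG stream to both the heap and the wheel makes the workload identical and prefetcher-hostile for both.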

2. The Memory Bloat

I fired up samply (a fantastic Rust profiler) to see where the CPU was spending time.

The flamegraph was dominated by a single assembly instruction:

mov dword [rsi + rbx * 1 + 0x10], edx

This instruction was writing the next pointer in my Slab. Because the index (rbx) was random, the CPU stalled waiting to fetch that cache line from RAM.

But why was it so slow? Struct Layout. In 64-bit Rust:

  • usize = 8 bytes.
  • Option<usize> = 16 bytes (8 bytes data + 8 bytes tag/padding).

My struct was massive (~48 bytes). This clogged the memory bandwidth.

The Optimization: NonZeroU32

To fix the memory bloat, I switched to NonZeroU32.

  • We don’t need 64-bit indices (4 billion timers is enough).
  • Option<NonZeroU32> optimizes down to just 4 bytes (Rust uses the niche value 0 to represent None).

// The Optimized Entry (24 bytes total)
pub struct TimerEntry<T> {
    pub task: T,        // 8 bytes
    pub deadline: u64,  // 8 bytes
    pub level: u8,      // 1 byte
    // 4-byte handles!
    pub next: Option<NonZeroU32>, 
    pub prev: Option<NonZeroU32>, 
}

This simple change reduced the struct size by 50%.
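A quick way to check these numbers is std::mem::size_of. Note that the 48 → 24 figures assume a 4-byte task payload (e.g. a u32 handle) on a 64-bit target; with an 8-byte task, alignment pads the optimized struct to 32 bytes instead:

```rust
use std::mem::size_of;
use std::num::NonZeroU32;

#[allow(dead_code)]
struct OldEntry<T> {
    task: T,
    deadline: u64,
    next: Option<usize>, // 16 bytes: no niche in usize
    prev: Option<usize>,
}

#[allow(dead_code)]
struct NewEntry<T> {
    task: T,
    deadline: u64,
    level: u8,
    next: Option<NonZeroU32>, // 4 bytes: 0 is the None niche
    prev: Option<NonZeroU32>,
}

fn main() {
    // Guaranteed by the niche optimization (on a 64-bit target):
    assert_eq!(size_of::<Option<usize>>(), 16);
    assert_eq!(size_of::<Option<NonZeroU32>>(), 4);
    // With a 4-byte task, e.g. a u32 handle:
    println!("old entry: {} bytes", size_of::<OldEntry<u32>>()); // 48
    println!("new entry: {} bytes", size_of::<NewEntry<u32>>()); // 24
}
```

Two entries per 64-byte cache line instead of one and a spill — that is the whole optimization.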

The Final Result

After applying the memory optimizations and fixing the benchmark to be “fair” (using random data for both), the results were staggering.

1. Insertion (Stress Test)

  • Binary Heap: Slowed down to 15.3ms (because random data forces re-balancing).
  • Timing Wheel: Improved to 46.0ms (better struct packing).
  • Verdict: The Heap still wins on raw insertion because Vec::push is hard to beat, but the gap narrowed significantly.

2. Cancellation (The “Money Shot”)

This is the metric that matters for networking.

Cancelling 10,000 timers:

  • Binary Heap: 51,600 µs (51.6 ms)
  • Timing Wheel: 30 µs

The Timing Wheel is 1,700x Faster at cancellation.
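Why O(1)? Cancellation is just a doubly linked list unlink through slab indices: patch two neighbours and possibly one bucket head. A simplified single-level sketch (names are mine):

```rust
// Minimal slab-backed list, just enough to show the unlink.
struct Entry {
    next: Option<usize>,
    prev: Option<usize>,
    slot: usize, // which bucket this entry lives in
}

struct Wheel {
    slab: Vec<Entry>,
    heads: Vec<Option<usize>>, // one head index per bucket
}

impl Wheel {
    /// O(1) cancellation by handle: no scan, no re-heapify.
    fn cancel(&mut self, idx: usize) {
        let (prev, next, slot) = {
            let e = &self.slab[idx];
            (e.prev, e.next, e.slot)
        };
        match prev {
            Some(p) => self.slab[p].next = next,
            None => self.heads[slot] = next, // idx was the bucket head
        }
        if let Some(n) = next {
            self.slab[n].prev = prev;
        }
        // Real code would also push `idx` onto a free list for reuse.
    }
}

fn main() {
    let mut w = Wheel {
        slab: vec![
            Entry { next: None, prev: Some(1), slot: 0 },    // idx 0: tail
            Entry { next: Some(0), prev: None, slot: 0 },    // idx 1: head
        ],
        heads: vec![Some(1)],
    };
    w.cancel(1); // remove the head in constant time
    println!("new head: {:?}", w.heads[0]); // prints "new head: Some(0)"
}
```

Three pointer-sized writes, all reachable directly from the handle — that is where the 1,700x comes from.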

Conclusion

This project reinforced a core lesson in Systems Engineering: Data layout is performance.

  1. Big O is a lie (sometimes). Constant factors like Cache Misses matter more than algorithmic complexity for “small” N (even up to 1 million).
  2. Memory Layout: Shrinking structs to fit more into a Cache Line (64 bytes) yields massive gains.
  3. Profiling: Tools like samply don’t lie. Seeing the CPU stall on a single mov instruction told me everything I needed to know.

You can find the full code and benchmarks on GitHub: sharded-timing-wheel