
System Performance Deep Dive: From Mantras to Micro-architectures

Ankur Rathore
Senior Systems Engineer pivoting to High-Performance Infrastructure. Building zero-allocation network drivers and cache-friendly data structures.
System Performance Reference - This article is part of a series.
Part 3: This Article

Understanding system performance requires looking at two worlds: the Application World (Chapter 5), where code logic and synchronization live, and the Hardware World (Chapter 6), where instructions are pipelined and executed.


Part 1: The Performance Mindset

Before touching a tool, performance engineers must follow the Performance Mantras. They are ranked from most to least effective.

  1. Don’t do it. (Eliminate the work)
  2. Do it, but don’t do it again. (Caching)
  3. Do it less. (Batching/Reducing frequency)
  4. Do it later. (Asynchronous/Queues)
  5. Do it when they’re not looking. (Background/Pre-fetching)
  6. Do it concurrently. (Parallelism/SMT)
  7. Do it cheaper. (Optimization/Algorithms)

Part 2: Application Analysis (Chapter 5)

Profiling and the “99Hz Trick”

Profiling characterizes behavior by taking periodic snapshots (samples) of the Instruction Pointer and Stack Trace.

Why 99Hz? If you sample at 100Hz and your application has a heartbeat task that also runs at 100Hz (every 10ms), your samples might get in “lockstep” with the task. You will either overcount (hit it every time) or undercount (miss it every time). Using 99Hz ensures you drift across the activity, providing an unbiased statistical average.

Synchronization: The Traffic Lights

To prevent data corruption (race conditions), we use synchronization primitives.

| Primitive | Behavior | Best Used For… |
| --- | --- | --- |
| Mutex | Sleeps (blocks) while waiting. | Long critical sections; avoids wasting CPU. |
| Spinlock | Spins (busy-waits) on-CPU. | Very short critical sections (e.g., interrupt context). |
| RW Lock | Multiple readers OR one writer. | Frequent reads, rare updates. |
| Hashed Locks | Map many objects to a fixed array of locks. | Balancing memory overhead vs. contention. |

The Hashed Lock Analogy: Instead of 1 lock for a whole library (too slow) or 1 lock per book (too much memory), you have 64 locks. Every book is assigned to one of those 64 locks based on its ID.


Part 3: CPU Architecture (Chapter 6)

The Instruction Pipeline

Modern CPUs work like an assembly line. While one instruction is being Executed, the next is being Decoded, and the one after is being Fetched.

Branch Prediction: The Guessing Game

When code hits an if/else, the CPU guesses which path will be taken to keep the pipeline full.

  • Misprediction: If the guess is wrong, the pipeline is flushed, typically costing on the order of 15–20 cycles on modern x86 cores.
  • Linux/C: Use likely() and unlikely() macros.
  • C++20: Use [[likely]] and [[unlikely]] attributes.
  • Rust: Has no stable hint syntax; use PGO (Profile-Guided Optimization) to observe real workloads and let the compiler rearrange the generated code.
  • Java/C#: The JIT compiler re-writes code at runtime based on observed behavior.

Instruction Size: CISC vs. RISC

  • x86 (CISC): Instructions are 1–15 bytes. Hard to decode, but very flexible.
  • ARM (RISC): Instructions are fixed (usually 4 bytes). Easy to decode and power-efficient.

Part 4: The “CPU Utilization” Lie

If top shows 100% CPU, it doesn’t mean the CPU is working at max capacity. It just means it’s not idle. To understand reality, we look at IPC (Instructions Per Cycle).

Stalls vs. Progress

  • Low IPC (0.1–0.5): The CPU is Stalled, typically waiting on memory accesses or spinning on a lock. It looks busy in top, but it is mostly waiting.
  • High IPC (1.0 - 2.0+): The CPU is Retiring instructions. It is doing real work.

SMT (Simultaneous Multithreading)

SMT (Hyper-Threading) hides stalls. When Thread A stalls on memory, the core's execution slots are filled with work from Thread B instead.

  • Note: Two hardware threads on one core share the same execution units. If both want to do heavy math, they will fight for the hardware.

The Two Flavors of CPU Saturation

Brendan Gregg highlights that saturation isn’t just about a 100% busy CPU. In modern cloud environments, you can be saturated while appearing idle.

1. Hardware Saturation (Physical)

  • Observation: CPU is at 100%.
  • Symptom: Run queues are long.
  • Cause: Too much work for the physical silicon.

2. Resource Limit Saturation (Software)

  • Observation: CPU might be at 20% or 50%.
  • Symptom: High “Throttled Time” or “Steal Time.”
  • Cause: You have hit your cgroup limit or cloud quota.

SRE Tip: If your application is slow but top shows low CPU usage, check for CPU throttling. In Kubernetes, look at the metric container_cpu_cfs_throttled_seconds_total. In AWS, check the CPUCreditBalance or Steal Time.

Priority Inversion

One of the most dangerous performance bugs is Priority Inversion. This occurs when a high-priority task is indirectly blocked by a medium-priority task.

The Scenario:

  1. Low Priority holds a lock.
  2. Medium Priority preempts the Low task (hogging the CPU).
  3. High Priority needs the lock but is stuck: it cannot run until Low releases the lock, and Low cannot run because Medium is hogging the CPU.

The Fix: Priority Inheritance (PI)

In Linux, we use PI-Mutexes. The kernel temporarily lends the blocked task's high priority to the low-priority lock holder, allowing it to finish its critical section quickly, release the lock, and get out of the way of the high-priority task.

SRE Note: If you see a high-priority process “Sleeping” while a lower-priority process is “Running,” check for lock contention. You might be witnessing a priority inversion.

Observability Toolbox Cheat Sheet

Summary Tools

  • uptime: Check Load Averages.
  • pidstat 1: Per-process CPU breakdown.

Deep Dive Tools

  • strace -cp <PID>: Summarize system call latency and frequency.
  • perf record -F 99 -p <PID> -g -- sleep 60: Capture a profile for a Flame Graph.
  • perf stat -p <PID>: Crucial! This shows you the IPC (Instructions Per Cycle).
```shell
# Example of checking if your process is stalled or working:
perf stat -p 1234
# Output will show "insn per cycle" (IPC).
# If < 0.5, you have a memory/stalling bottleneck.
```

Final Summary

  1. Eliminate work first (Mantras).
  2. Sample at odd rates (99Hz) to avoid bias.
  3. Help the CPU guess (Branch prediction hints) to keep the pipeline moving.
  4. Don’t trust 100% CPU usage. Check the IPC to see if the CPU is actually finishing instructions or just waiting for RAM.

Reference: Gregg, B. (2020). Systems Performance: Enterprise and the Cloud, 2nd Edition.
