Understanding system performance requires looking at two worlds: the Application World (Chapter 5), where code logic and synchronization live, and the Hardware World (Chapter 6), where instructions are pipelined and executed.
Part 1: The Performance Mindset #
Before touching a tool, performance engineers must follow the Performance Mantras. They are ranked from most to least effective.
- Don’t do it. (Eliminate the work)
- Do it, but don’t do it again. (Caching)
- Do it less. (Batching/Reducing frequency)
- Do it later. (Asynchronous/Queues)
- Do it when they’re not looking. (Background/Pre-fetching)
- Do it concurrently. (Parallelism/SMT)
- Do it cheaper. (Optimization/Algorithms)
Part 2: Application Analysis (Chapter 5) #
Profiling and the “99Hz Trick” #
Profiling characterizes behavior by taking periodic snapshots (samples) of the Instruction Pointer and Stack Trace.
Why 99Hz? If you sample at 100Hz and your application has a heartbeat task that also runs at 100Hz (every 10ms), your samples might get in “lockstep” with the task. You will either overcount (hit it every time) or undercount (miss it every time). Using 99Hz ensures you drift across the activity, providing an unbiased statistical average.
Synchronization: The Traffic Lights #
To prevent data corruption (race conditions), we use primitives.
| Primitive | Behavior | Best Used For… |
|---|---|---|
| Mutex | Sleeps (Blocks) while waiting. | Long operations; prevents wasting CPU. |
| Spinlock | Spins (Busy-waits) on-CPU. | Very short critical sections (e.g., interrupt context). |
| RW Lock | Multiple readers OR one writer. | Frequent reads, rare updates. |
| Hashed Locks | Map many objects to a fixed array of locks. | Balancing memory overhead vs. contention. |
The Hashed Lock Analogy: Instead of 1 lock for a whole library (too slow) or 1 lock per book (too much memory), you have 64 locks. Every book is assigned to one of those 64 locks based on its ID.
Part 3: CPU Architecture (Chapter 6) #
The Instruction Pipeline #
Modern CPUs work like an assembly line. While one instruction is being Executed, the next is being Decoded, and the one after is being Fetched.
Branch Prediction: The Guessing Game #
When code hits an if/else, the CPU guesses which path will be taken to keep the pipeline full.
- Misprediction: If the guess is wrong, the pipeline is flushed (huge performance penalty).
- Linux/C: Use `likely()` and `unlikely()` macros.
- C++20: Use `[[likely]]` and `[[unlikely]]` attributes.
- Rust: Uses PGO (Profile-Guided Optimization) to observe real traffic and rearrange binary code.
- Java/C#: The JIT compiler re-writes code at runtime based on observed behavior.
Instruction Size: CISC vs. RISC #
- x86 (CISC): Instructions are 1–15 bytes. Hard to decode, but very flexible.
- ARM (RISC): Instructions are fixed (usually 4 bytes). Easy to decode and power-efficient.
Part 4: The “CPU Utilization” Lie #
If `top` shows 100% CPU, it doesn’t mean the CPU is working at max capacity. It just means it’s not idle. To understand reality, we look at IPC (Instructions Per Cycle).
Stalls vs. Progress #
- Low IPC (0.1 - 0.5): The CPU is Stalled. It is mostly waiting on memory access (cache misses) or spinning on a lock: it looks “busy” but makes little progress.
- High IPC (1.0 - 2.0+): The CPU is Retiring instructions. It is doing real work.
SMT (Simultaneous Multithreading) #
SMT (Hyper-Threading) hides stalls. When Thread A stalls on memory, the core fills the otherwise-idle pipeline slots with instructions from Thread B.
- Note: Two hardware threads on one core share the same execution units. If both want to do heavy math, they will fight for the hardware.
Observability Toolbox Cheat Sheet #
Summary Tools #
- `uptime`: Check Load Averages.
- `pidstat 1`: Per-process CPU breakdown.
Deep Dive Tools #
- `strace -cp <PID>`: Summarize system call latency and frequency.
- `perf record -F 99 -p <PID> -g -- sleep 60`: Capture a profile for a Flame Graph.
- `perf stat -p <PID>`: Crucial! This shows you the IPC (Instructions Per Cycle).
```shell
# Example of checking if your process is stalled or working:
perf stat -p 1234
# Output will show "insn per cycle" (IPC).
# If < 0.5, you have a memory/stalling bottleneck.
```

Final Summary #
- Eliminate work first (Mantras).
- Sample at odd rates (99Hz) to avoid bias.
- Help the CPU guess (Branch prediction hints) to keep the pipeline moving.
- Don’t trust 100% CPU usage. Check the IPC to see if the CPU is actually finishing instructions or just waiting for RAM.
Reference: Gregg, B. (2020). Systems Performance: Enterprise and the Cloud (2nd ed.).