Application-level analysis is the most critical stage of performance engineering because the application is where the “work” begins. This post serves as a quick-reference guide to the methodologies, synchronization primitives, and observability tools found in Chapter 5 of Systems Performance.
The Performance Mantras #
When optimizing, follow these in order. The fastest way to finish a task is to never start it.
- Don’t do it. (Eliminate unnecessary work)
- Do it, but don’t do it again. (Caching)
- Do it less. (Batching/Frequency reduction)
- Do it later. (Asynchronous processing)
- Do it when they’re not looking. (Background/Idle processing)
- Do it concurrently. (Parallelism)
- Do it cheaper. (Algorithm/Hardware optimization)
Core Methodologies #
The USE Method #
For every resource (CPU, Memory, Disk), check:
- Utilization: How busy is the resource?
- Saturation: Is there a queue of work waiting?
- Errors: Are there explicit error counts (logs/counters)?
Workload Characterization #
Identify the nature of the load:
- Who: PID, User, Remote IP.
- Why: Code path, API endpoint, Database query.
- What: Throughput (Ops/sec), Data size (Bytes), Latency (ms).
Synchronization Primitives #
These “traffic lights” manage access to shared memory. Choosing the wrong one wastes CPU (needless spinning) or adds latency (needless sleeping), and heavy competition for any of them shows up as Lock Contention.
| Primitive | Behavior | Use Case |
|---|---|---|
| Mutex | Sleeps (Blocks) off-CPU while waiting. | Long operations; process context. |
| Spinlock | Spins (Busy-waits) on-CPU in a loop. | Very fast operations; Interrupt handlers. |
| RW Lock | Allows multiple readers OR one writer. | Data read often but modified rarely. |
| Semaphore | A counter allowing N parallel ops. | Managing pools of resources. |
Hashed Locks (The Middle Ground) #
Instead of one Global Lock (slow) or a lock for Every Object (high memory overhead), use a Hash Table of Locks.
- Mechanism: `LockIndex = ObjectAddress % NumberOfLocks`
- Benefit: Reduces contention while keeping memory usage fixed and predictable.
Observability Tools & Commands #
Use these tools to gather data without (usually) stopping the application.
1. Basic Process Counters #
| Tool | Command | Description |
|---|---|---|
| uptime | `uptime` | Checks system load averages (1, 5, 15 min). |
| pidstat | `pidstat 1` | Per-process CPU usage every second. |
| pidstat I/O | `pidstat -d 1` | Identifies which process is hogging the disk. |
| ps | `ps -eo pid,ppid,cmd,%cpu,%mem` | Detailed process tree and resource consumption. |
2. Interface Tracing (Syscalls & Libraries) #
strace can slow down an application by 10x or more. Use it for debugging, not high-load production monitoring.
- System Call Summary:

  ```
  # See which syscalls are most frequent/slowest
  strace -cp <PID>
  ```

- Library Call Tracing:

  ```
  # See calls to shared libraries like malloc() or strlen()
  ltrace -p <PID>
  ```
3. Profiling (Sampling) #
Profilers take snapshots of the CPU state at regular intervals.
- The 99Hz Rule: Always sample at an “odd” frequency (e.g., 99Hz or 49Hz) instead of 100Hz. This prevents the profiler from syncing up with internal timers, which causes biased data.
- CPU Profiling with `perf`:

  ```
  # Record CPU stack traces for 60 seconds at 99Hz
  perf record -F 99 -p <PID> -g -- sleep 60

  # View the results in-terminal
  perf report -n --stdio
  ```
4. Advanced BPF Tools (Low Overhead) #
These tools use BPF (originally the Berkeley Packet Filter, now extended as eBPF) for low-overhead, in-kernel observability.
| Tool | Command | What it shows |
|---|---|---|
| opensnoop | `opensnoop` | Real-time file opens (process and filename). |
| execsnoop | `execsnoop` | Shows every new process as it is created. |
| ext4slower | `ext4slower 1` | Lists ext4 file I/O slower than 1 ms. |
Key Analysis Tips #
- Off-CPU Analysis: If the app is slow but CPU is at 0%, it is likely blocked on a lock, disk, or network. Traditional profilers won’t show this—you need Off-CPU tracing.
- The Streetlight Effect: Don’t look at `top` just because it’s easy. If the bottleneck is I/O, `top` won’t help you. Follow the data, not the tool.
- Instruction Pointers: A snapshot of the Instruction Pointer tells you what is running. A Stack Trace tells you why it was called.
Reference: Gregg, B. (2020). Systems Performance: Enterprise and the Cloud. 2nd Edition.