
System Performance: Application Analysis (Chapter 5 Reference)

Ankur Rathore
Senior Systems Engineer pivoting to High-Performance Infrastructure. Building zero-allocation network drivers and cache-friendly data structures.
System Performance Reference - This article is part of a series.
Part 1: This Article

Application-level analysis is the most critical stage of performance engineering because the application is where the “work” begins. This post serves as a quick-reference guide to the methodologies, synchronization primitives, and observability tools found in Chapter 5 of Systems Performance.


The Performance Mantras
#

When optimizing, follow these in order. The fastest way to finish a task is to never start it.

  1. Don’t do it. (Eliminate unnecessary work)
  2. Do it, but don’t do it again. (Caching)
  3. Do it less. (Batching/Frequency reduction)
  4. Do it later. (Asynchronous processing)
  5. Do it when they’re not looking. (Background/Idle processing)
  6. Do it concurrently. (Parallelism)
  7. Do it cheaper. (Algorithm/Hardware optimization)

Core Methodologies
#

The USE Method
#

For every resource (CPU, Memory, Disk), check:

  • Utilization: How busy is the resource?
  • Saturation: Is there a queue of work waiting?
  • Errors: Are there explicit error counts (logs/counters)?
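For the CPU resource, the saturation leg of this check can be sketched on Linux by comparing the 1-minute load average against the number of online CPUs: a ratio above 1.0 means runnable work is queuing. This is a minimal, Linux-only sketch; `cpu_saturation_ratio` is an illustrative name, not a standard API.

```c
#include <stdio.h>
#include <unistd.h>

/* Minimal CPU saturation check: 1-minute load average divided by the
   number of online CPUs. A ratio > 1.0 suggests threads are queuing
   for CPU time (saturation). Returns -1.0 on error. Linux-specific. */
double cpu_saturation_ratio(void)
{
    double load1 = -1.0;
    FILE *f = fopen("/proc/loadavg", "r");  /* first field: 1-min load */
    if (!f)
        return -1.0;
    if (fscanf(f, "%lf", &load1) != 1)
        load1 = -1.0;
    fclose(f);

    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    return (ncpu > 0 && load1 >= 0.0) ? load1 / (double)ncpu : -1.0;
}
```

Utilization and errors would come from `/proc/stat` and per-device counters respectively; the point is that each leg of the USE method maps to a concrete, cheap measurement.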

Workload Characterization
#

Identify the nature of the load:

  • Who: PID, User, Remote IP.
  • Why: Code path, API endpoint, Database query.
  • What: Throughput (Ops/sec), Data size (Bytes), Latency (ms).

Synchronization Primitives
#

These “traffic lights” manage access to shared memory. Choosing the wrong one for the workload leads to lock contention, where threads spend their time waiting for the lock instead of doing work.

| Primitive | Behavior | Use Case |
|-----------|----------|----------|
| Mutex | Sleeps (blocks) off-CPU while waiting. | Long operations; process context. |
| Spinlock | Spins (busy-waits) on-CPU in a loop. | Very fast operations; interrupt handlers. |
| RW Lock | Allows multiple readers OR one writer. | Data read often but modified rarely. |
| Semaphore | A counter allowing N parallel ops. | Managing pools of resources. |

Hashed Locks (The Middle Ground)
#

Instead of one Global Lock (slow) or a lock for Every Object (high memory overhead), use a Hash Table of Locks.

  • Mechanism: LockIndex = ObjectAddress % NumberOfLocks
  • Benefit: Reduces contention while keeping memory usage fixed and predictable.
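The mechanism above can be sketched in C with a fixed array of POSIX mutexes. The pool size and names are illustrative, and the range initializer is a GCC/Clang extension.

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define NUM_LOCKS 64  /* fixed pool: memory cost is constant, tune to concurrency */

static pthread_mutex_t lock_table[NUM_LOCKS] = {
    [0 ... NUM_LOCKS - 1] = PTHREAD_MUTEX_INITIALIZER  /* GCC/Clang range init */
};

/* Map an object's address to one lock slot in the fixed pool. */
size_t lock_index(const void *obj)
{
    /* Shift away the low bits first: heap objects are typically 16- or
       64-byte aligned, so a plain modulo would cluster onto a few slots. */
    return ((uintptr_t)obj >> 6) % NUM_LOCKS;
}

/* Lock/unlock whichever slot guards this object. */
void object_lock(const void *obj)   { pthread_mutex_lock(&lock_table[lock_index(obj)]); }
void object_unlock(const void *obj) { pthread_mutex_unlock(&lock_table[lock_index(obj)]); }
```

Two distinct objects may hash to the same slot and serialize unnecessarily (a false conflict), but with enough slots that cost is small compared to a single global lock.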

Observability Tools & Commands
#

Use these tools to gather data without (usually) stopping the application.

1. Basic Process Counters
#

| Tool | Command | Description |
|------|---------|-------------|
| uptime | uptime | System load averages (1, 5, 15 min). |
| pidstat | pidstat 1 | Per-process CPU usage every second. |
| pidstat I/O | pidstat -d 1 | Identifies which process is hogging the disk. |
| ps | ps -eo pid,ppid,cmd,%cpu,%mem | Detailed process listing with CPU and memory consumption. |

2. Interface Tracing (Syscalls & Libraries)
#

strace can slow down an application by 10x or more. Use it for debugging, not high-load production monitoring.

  • System Call Summary:
    # See which syscalls are most frequent/slowest
    strace -cp <PID>
  • Library Call Tracing:
    # See calls to shared libraries like malloc() or strlen()
    ltrace -p <PID>

3. Profiling (Sampling)
#

Profilers take snapshots of the CPU state at regular intervals.

  • The 99Hz Rule: Always sample at an “odd” frequency (e.g., 99Hz or 49Hz) instead of 100Hz. This prevents the profiler from syncing up with internal timers, which causes biased data.
  • CPU Profiling with perf:
    # Record CPU stack traces for 60 seconds at 99Hz
    perf record -F 99 -p <PID> -g -- sleep 60
    
    # View the results in-terminal
    perf report -n --stdio

4. Advanced BPF Tools (Near-Zero Overhead)
#

These tools use extended BPF (eBPF) to run filters in the kernel, giving near-zero-overhead observability.

| Tool | Command | What it shows |
|------|---------|---------------|
| opensnoop | opensnoop | File opens in real time (process, filename, errors). |
| execsnoop | execsnoop | Every new process as it is created. |
| ext4slower | ext4slower 1 | ext4 file system I/O slower than 1 ms. |

Key Analysis Tips
#

  • Off-CPU Analysis: If the app is slow but CPU is at 0%, it is likely blocked on a lock, disk, or network. Traditional profilers won’t show this—you need Off-CPU tracing.
  • The Streetlight Effect: Don’t look at top just because it’s easy. If the bottleneck is I/O, top won’t help you. Follow the data, not the tool.
  • Instruction Pointers: A snapshot of the Instruction Pointer tells you what is running. A Stack Trace tells you why it was called.
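On glibc systems you can capture both pieces yourself with `backtrace(3)`: each raw entry is a saved instruction pointer, and symbolizing the whole array recovers the calling chain, which is exactly what a sampling profiler records on every tick. An illustrative sketch:

```c
#include <execinfo.h>  /* glibc-specific backtrace API */

/* Capture the current stack: each frame is a saved instruction pointer
   (return address). Returns the number of frames captured. */
int capture_stack_depth(void)
{
    void *frames[32];
    return backtrace(frames, 32);  /* fills frames[], returns frame count */
}

/* Print the symbolized trace: the "why" behind the current IP. */
void print_stack(void)
{
    void *frames[32];
    int n = backtrace(frames, 32);
    backtrace_symbols_fd(frames, n, 2);  /* write to stderr; no malloc */
}
```

Note that function names only appear in the output when the binary is linked with -rdynamic (or symbols are otherwise exported); otherwise you get raw addresses.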

Reference: Gregg, B. (2020). Systems Performance: Enterprise and the Cloud (2nd ed.). Addison-Wesley.
