
Building tsastat: Engineering a High-Resolution Linux Thread State Analyzer

·826 words·4 mins
Ankur Rathore
Senior Systems Engineer pivoting to High-Performance Infrastructure. Building zero-allocation network drivers and cache-friendly data structures.
System Performance Reference - This article is part of a series.
Part 1: This Article


The Challenge: A Footnote in History

In the classic textbook Systems Performance: Enterprise and the Cloud, Brendan Gregg includes a footnote in Chapter 4 (Observability Tools) that serves as a call to action for systems engineers. He notes that while Thread State Analysis (TSA) is a vital methodology for identifying bottlenecks, a native, high-resolution Linux tool for this purpose has been historically missing. Gregg eventually wrote tstates.d for FreeBSD, but the Linux equivalent remained an exercise for the reader.

I took that challenge as a pivot point in my career. I wanted to build a tool that didn’t just parse text files in /proc, but interfaced directly with the Linux Kernel Scheduler to provide ground-truth telemetry.

The result is tsastat, recently recognized as Crate of the Week in This Week in Rust #645.

1. The Theory of Thread State Analysis (TSA)

Standard tools like htop show CPU Utilization, but utilization is a poor proxy for Saturation. A process can show 10% CPU usage while being pathologically slow because it is waiting 90% of the time for the CPU scheduler or Disk I/O.

TSA categorizes thread time into six distinct states. tsastat focuses on the “Big Four”:

  1. Executing (EXEC): On-CPU time.
  2. Runnable (CPU WAIT): Ready to run, but waiting for a CPU core (Scheduler Latency).
  3. Blocked (I/O WAIT): Waiting for synchronous block I/O (Disk).
  4. Anonymous Paging (SWAP WAIT): Waiting for memory pages to be swapped in.

To get this data, we have to bypass the /proc filesystem and use Linux Delay Accounting.
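Once the delay counters are in hand, turning them into the "Big Four" percentages is simple arithmetic over the accounted time. A minimal sketch, borrowing field names from the kernel's linux/taskstats.h (the struct, helper, and sample values are illustrative, not tsastat's actual API):

```rust
/// Delay-accounting counters in nanoseconds, mirroring a subset of the
/// kernel's `struct taskstats` fields (names follow linux/taskstats.h).
struct DelayCounters {
    cpu_run_real_total: u64, // on-CPU time (EXEC)
    cpu_delay_total: u64,    // runqueue wait (CPU WAIT)
    blkio_delay_total: u64,  // block I/O wait (I/O WAIT)
    swapin_delay_total: u64, // swap-in wait (SWAP WAIT)
}

/// Convert raw counters into percentages of total accounted time.
fn tsa_percentages(c: &DelayCounters) -> [f64; 4] {
    let total = (c.cpu_run_real_total
        + c.cpu_delay_total
        + c.blkio_delay_total
        + c.swapin_delay_total) as f64;
    if total == 0.0 {
        return [0.0; 4];
    }
    [
        c.cpu_run_real_total as f64 / total * 100.0,
        c.cpu_delay_total as f64 / total * 100.0,
        c.blkio_delay_total as f64 / total * 100.0,
        c.swapin_delay_total as f64 / total * 100.0,
    ]
}

fn main() {
    // A thread that ran 100 ms but waited 900 ms: 10% EXEC, 90% waiting.
    let c = DelayCounters {
        cpu_run_real_total: 100_000_000,
        cpu_delay_total: 700_000_000,
        blkio_delay_total: 200_000_000,
        swapin_delay_total: 0,
    };
    let [exec, cpu_wait, io_wait, swap_wait] = tsa_percentages(&c);
    println!("EXEC {exec:.1}%  CPU WAIT {cpu_wait:.1}%  I/O WAIT {io_wait:.1}%  SWAP {swap_wait:.1}%");
}
```

This is exactly the htop blind spot from above: that thread would show as 10% CPU, while TSA reveals it spends 70% of its time merely waiting for a core.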

2. The Transport: Generic Netlink (Genetlink)

Netlink is the Linux kernel’s internal “Internet”: a socket-based IPC mechanism for talking to kernel subsystems. Specifically, we use Generic Netlink (Genetlink), which lets the kernel assign family IDs dynamically as subsystems register, rather than reserving static protocol numbers.

The Handshake Protocol

Before querying stats, tsastat must perform a “Discovery Handshake” with the GENL_ID_CTRL (the Netlink Controller) to find the dynamic Family ID for TASKSTATS.

The message format is a “Russian Doll” of headers:

  • nlmsghdr: The outer Netlink envelope (Length, Type, Flags).
  • genlmsghdr: The Generic Netlink envelope (Command, Version).
  • nlattr: The payload attributes, each encoded as TLV (Type-Length-Value).
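Assembled by hand, the discovery request looks roughly like this. This is a sketch, not tsastat's actual implementation; the constants come from &lt;linux/netlink.h&gt; and &lt;linux/genetlink.h&gt;, and the layout uses host byte order as Netlink expects:

```rust
// Constants from <linux/netlink.h> and <linux/genetlink.h>.
const GENL_ID_CTRL: u16 = 0x10;        // the Netlink controller family
const NLM_F_REQUEST: u16 = 0x01;
const CTRL_CMD_GETFAMILY: u8 = 3;
const CTRL_ATTR_FAMILY_NAME: u16 = 2;

/// Build the "Russian Doll" discovery message asking the controller
/// for the dynamic family ID of "TASKSTATS".
fn build_discovery_msg(seq: u32, pid: u32) -> Vec<u8> {
    let family = b"TASKSTATS\0";
    // nlattr: 4-byte header (len, type) + payload, padded to 4 bytes
    let attr_len = 4 + family.len();
    let attr_pad = (4 - attr_len % 4) % 4;
    // total = nlmsghdr (16 bytes) + genlmsghdr (4 bytes) + padded attribute
    let total = 16 + 4 + attr_len + attr_pad;

    let mut buf = Vec::with_capacity(total);
    // nlmsghdr: length, type, flags, sequence number, port ID
    buf.extend((total as u32).to_ne_bytes());
    buf.extend(GENL_ID_CTRL.to_ne_bytes());
    buf.extend(NLM_F_REQUEST.to_ne_bytes());
    buf.extend(seq.to_ne_bytes());
    buf.extend(pid.to_ne_bytes());
    // genlmsghdr: command, version, reserved
    buf.push(CTRL_CMD_GETFAMILY);
    buf.push(1);
    buf.extend(0u16.to_ne_bytes());
    // nlattr TLV: length, type, value ("TASKSTATS\0"), padding
    buf.extend((attr_len as u16).to_ne_bytes());
    buf.extend(CTRL_ATTR_FAMILY_NAME.to_ne_bytes());
    buf.extend(family);
    buf.extend(std::iter::repeat(0u8).take(attr_pad));
    buf
}

fn main() {
    let msg = build_discovery_msg(1, 0);
    println!("discovery message: {} bytes", msg.len());
}
```

The kernel's reply contains a CTRL_ATTR_FAMILY_ID attribute with the u16 family ID that all subsequent TASKSTATS queries must carry in their nlmsghdr type field.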

3. Deep Dive: Binary Deserialization and Memory Alignment

This is where the project moved from a high-level Rust app to a low-level systems tool. Netlink attributes are strictly aligned to 4-byte boundaries. However, the TaskStats C-struct used by the kernel contains u64 fields, which modern CPUs expect to be aligned to 8-byte boundaries.

The Alignment Math

When parsing the TLV stream, we must manually align the “cursor” after every attribute:

// Round length up to the nearest 4-byte boundary
let aligned_len = (len + 3) & !3;
offset += aligned_len;
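Applied across a whole receive buffer, that alignment rule drives the TLV cursor. A sketch of such a parse loop (Attr and parse_attrs are illustrative names, not tsastat's actual API; the sample buffer in main assumes a little-endian host):

```rust
/// One parsed Netlink attribute borrowed from a receive buffer.
struct Attr<'a> {
    ty: u16,
    payload: &'a [u8],
}

/// Walk a TLV attribute stream, advancing the cursor by the
/// 4-byte-aligned attribute length after every attribute.
fn parse_attrs(mut buf: &[u8]) -> Vec<Attr<'_>> {
    let mut attrs = Vec::new();
    while buf.len() >= 4 {
        // nlattr header: u16 length (header + payload), u16 type
        let len = u16::from_ne_bytes([buf[0], buf[1]]) as usize;
        let ty = u16::from_ne_bytes([buf[2], buf[3]]);
        if len < 4 || len > buf.len() {
            break; // truncated or corrupt attribute
        }
        attrs.push(Attr { ty, payload: &buf[4..len] });
        // Round length up to the nearest 4-byte boundary
        let aligned = (len + 3) & !3;
        buf = &buf[aligned.min(buf.len())..];
    }
    attrs
}

fn main() {
    // Two attributes: len 7 (3-byte payload + 1 pad byte), then len 8.
    let buf = [7u8, 0, 1, 0, 0xAA, 0xBB, 0xCC, 0, 8, 0, 2, 0, 1, 2, 3, 4];
    for a in parse_attrs(&buf) {
        println!("type {} payload {:?}", a.ty, a.payload);
    }
}
```

Note that the nlattr length covers the 4-byte header plus the payload, but not the padding; forgetting that distinction is the classic off-by-four bug in Netlink parsers.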

Handling Pointer Misalignment

In my initial builds, the program panicked with a misaligned pointer dereference. Because the kernel packed the struct at a 4-byte offset, casting the pointer directly to a Rust struct violated memory safety.

The solution was using std::ptr::read_unaligned, which tells the CPU to perform a byte-by-byte copy into a properly aligned stack variable:

#[repr(C)]
pub struct TaskStats {
    pub version: u16,
    pub ac_exitcode: u32,
    // ... 300+ bytes of kernel fields ...
}

// Safely extract from the network buffer
let stats: TaskStats = unsafe {
    std::ptr::read_unaligned(data_ptr as *const TaskStats)
};

4. The “Zero Percent” Mystery: Kernel Heuristics

During testing with stress-ng, I encountered a bizarre bug: the tool reported 0.0% CPU Wait despite 100% core saturation.

By inspecting the raw hex dumps of the Netlink packets, I discovered that modern Linux kernels (5.15+) use a Lazy Accounting Heuristic. To minimize scheduler overhead, the kernel does not flush updated taskstats counters to the userspace interface for threads pinned in a tight loop.

The metrics are only “flushed” when:

  1. The thread voluntarily sleeps/yields.
  2. The thread undergoes a major state transition (e.g., Blocked on I/O).
  3. The thread terminates.

This realization convinced me that polling-based telemetry has hard limits: the counters you read are only as fresh as the kernel chooses to make them. It is what eventually motivated me to move my research toward eBPF/XDP, whose event-driven model doesn’t suffer from this “lazy evaluation” lag.

5. TUI Architecture: The Interactive Inspector

I used the Ratatui library to build the dashboard. To keep the TUI responsive while polling the kernel, I refactored the codebase into a state-machine architecture:

  • src/netlink.rs: Handles the raw binary communication and sequence number tracking to prevent socket desync.
  • src/app.rs: Manages the UI state, including the TableState for row selection.
  • src/ui.rs: Renders the dual-pane layout.
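The selection logic that src/app.rs manages can be sketched independently of Ratatui. A minimal stand-in for the table-selection state (field and method names here are illustrative, not tsastat's actual API):

```rust
/// Minimal stand-in for the UI selection state kept in src/app.rs.
struct AppState {
    selected: usize,
    row_count: usize,
}

impl AppState {
    fn new(row_count: usize) -> Self {
        Self { selected: 0, row_count }
    }

    /// Move the highlight down one row, wrapping at the end of the table.
    fn next(&mut self) {
        if self.row_count > 0 {
            self.selected = (self.selected + 1) % self.row_count;
        }
    }

    /// Move the highlight up one row, wrapping at the top.
    fn previous(&mut self) {
        if self.row_count > 0 {
            self.selected = (self.selected + self.row_count - 1) % self.row_count;
        }
    }
}

fn main() {
    let mut state = AppState::new(3);
    state.next();     // row 1
    state.previous(); // back to row 0
    state.previous(); // wraps to the last row
    println!("selected row: {}", state.selected);
}
```

Keeping this state separate from rendering is what makes the state-machine split work: the netlink poller and the UI never block each other on shared widget state.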

The Inspector Pane at the bottom allows users to select a specific thread and see fields like the Kernel ABI version, Nice value, and cumulative nanosecond counters—exposing the raw truth behind the percentages.

Final Thoughts

Building tsastat taught me that Mechanical Sympathy is not just about writing fast code; it’s about understanding the complex contract between the Operating System and the hardware.

If you’re debugging a “slow” Linux system today, remember: Utilization is not Saturation.
