In distributed systems, application metrics and fancy Grafana dashboards are great—until they aren’t. Eventually, a server goes rogue, the dashboards go blind, and you are forced to SSH into a Linux box to figure out why your microservice is crawling or crashing.
Over the years of managing high-throughput pipelines and containerized workloads, I’ve noticed that production Linux issues usually fall into a few specific patterns.
Here are 5 real-world scenarios, how to diagnose them, and the exact commands to fix them.
## Scenario 1: The “Ghost File” Eating Disk Space
The Symptom:
Your monitoring system alerts you: No space left on device. You SSH in, run df -h, and confirm the root partition is 100% full. However, when you run du -sh /* to find the culprit, the directories only add up to a few gigabytes. The math doesn’t make sense. Where is the hidden data?
The Root Cause:
A junior developer (or an automated script) likely deleted a massive log file using rm. However, a running process (like a Python backend or Nginx) still has a file descriptor open pointing to that file’s inode. rm only removes the file’s name from the directory tree; the kernel keeps the data on disk until the last open file descriptor is closed.
The Fix: You need to find the process holding the deleted file open. Run:
```bash
lsof | grep deleted
```

You will see a list of processes holding onto “ghost” files. Once you identify the culprit (e.g., PID 1234), simply restart that service:
```bash
systemctl restart <service-name>
# or kill the specific PID
```

The kernel will immediately free the file descriptor and drop the gigabytes of trapped data.
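If you want to watch the mechanism yourself, here is a minimal, disposable reproduction. It only assumes a Linux /proc filesystem; the tail process stands in for a long-running service holding a log file open:

```bash
# Reproduce a "ghost" file: deleted from the tree, still on disk.
tmp=$(mktemp)
dd if=/dev/zero of="$tmp" bs=1M count=10 status=none   # a 10 MB "log file"
tail -f "$tmp" > /dev/null &                           # a process keeps it open
pid=$!
sleep 1                                                # give tail time to open it
rm "$tmp"                                              # the name is gone...
ghost=$(ls -l "/proc/$pid/fd" | grep deleted)          # ...but the inode lives on
echo "$ghost"
kill "$pid"                                            # "restarting" frees the space
```

The grep output shows the open descriptor with a `(deleted)` suffix, which is exactly what the lsof command above is matching on.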
## Scenario 2: High Load, Low CPU (The I/O Trap)
The Symptom:
Your API latency spikes from 100ms to 15 seconds. You check top and see a massive Load Average (e.g., 25.00 on a 4-core machine), but your CPU usage is sitting at a relaxed 5%. If the CPU isn’t working, why is the server choking?
The Root Cause:
Look closely at the %Cpu(s) line in top. You will likely see something like 90.0 wa.
The wa stands for I/O Wait. Your processes are stuck in the D state (Uninterruptible Sleep). They are waiting for a slow hard drive, a dying EBS volume, or a stalled NFS network mount to return data. The CPU is idle because it has nothing to compute while it waits for the disk.
The Fix: Find out exactly which disk and which process is causing the bottleneck. First, check disk utilization:
```bash
iostat -xz 1
```

Look at the %util column. If sda is at 100%, your disk is maxed out. Next, find the exact process hammering the disk:
```bash
sudo iotop
```

This will show you exactly which thread is aggressively writing or reading, allowing you to throttle it or move the workload.
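A quick triage sketch that combines the two signals described above: load relative to core count, plus any processes stuck in the D state. It assumes standard procps tools; on a healthy box the D-state listing may print nothing:

```bash
# Is the load real CPU work or I/O wait? Two quick checks.
cores=$(nproc)
load1=$(cut -d ' ' -f 1 /proc/loadavg)   # 1-minute load average
echo "1-min load: $load1 on $cores cores"
# Processes in uninterruptible sleep (state D) are the usual I/O-wait suspects:
ps -eo pid,stat,comm | awk '$2 ~ /^D/'
```

If the load far exceeds the core count while the D-state list is long, you are in the I/O trap rather than a CPU crunch.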
## Scenario 3: The Silent Assassin (OOM Killer)
The Symptom:
A heavy data-processing script periodically vanishes. There is no Python stack trace, no try/except block catching an error, and no application log. The process simply stops existing.
The Root Cause:
The process tried to allocate more RAM than the system had available. To protect the OS from a total crash, the Linux kernel invoked the OOM (Out of Memory) Killer. The kernel calculates an oom_score for every process, picks the one with the highest score, and sends it a SIGKILL (signal 9). Because it’s a kernel-level kill, the application has no chance to log its own death.
The Fix: To prove the kernel assassinated your process, check the kernel ring buffer:
```bash
dmesg -T | grep -i oom
```

Alternatively, check the systemd journal:
```bash
journalctl -k | grep -i "killed process"
```

Note: If this happens frequently to critical daemons, you can protect them by adjusting their oom_score_adj to a negative value, forcing the kernel to sacrifice less important sidecar containers first.
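Here is a small sketch of inspecting the OOM killer's bookkeeping for the current shell. The /proc paths are standard on Linux; the -500 value in the comments is an illustrative choice, not a recommendation:

```bash
# Inspect how "killable" the current shell looks to the OOM killer.
pid=$$
score=$(cat /proc/$pid/oom_score)     # higher = killed first
adj=$(cat /proc/$pid/oom_score_adj)   # -1000 (never kill) .. 1000
echo "oom_score=$score oom_score_adj=$adj"
# To protect a critical daemon (root required):
#   echo -500 > /proc/<pid>/oom_score_adj
# Or persistently, in its systemd unit file:
#   [Service]
#   OOMScoreAdjust=-500
```

The systemd directive survives restarts, whereas writing to /proc only lasts for the life of the process.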
## Scenario 4: The Inode Exhaustion
The Symptom:
Your application crashes claiming No space left on device. You run df -h and see you have 50 GB of free space. You check for ghost files, but find nothing.
The Root Cause: Hard drives have two limits: Bytes (total size) and Inodes (the index that tracks files). If your application generates millions of tiny 1KB files (like old session tokens or cache fragments) and never cleans them up, you will run out of Inodes before you run out of Bytes. You have space, but the filesystem has run out of “names” to give to new files.
The Fix: Check your Inode usage:
```bash
df -i
```

If IUse% is at 100%, you need to find the directory hoarding millions of tiny files. Use this pipeline to count files per directory:
```bash
find / -xdev -type f | cut -d "/" -f 2 | sort | uniq -c | sort -n
```

Once located, clear them out (be careful using standard rm * as it will fail with “Argument list too long”; use find . -type f -delete instead).
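To see the bytes-versus-inodes distinction concretely, here is a disposable demo that creates 1,000 empty files in a temp directory, counts them with find, and bulk-deletes them the safe way:

```bash
# Tiny files consume inodes, not bytes; count and delete them safely.
dir=$(mktemp -d)
i=1
while [ "$i" -le 1000 ]; do
    : > "$dir/f$i"                           # create an empty (0-byte) file
    i=$((i + 1))
done
count=$(find "$dir" -xdev -type f | wc -l)   # same counting idea as above
echo "$count tiny files in $dir"
find "$dir" -type f -delete                  # rm * would hit ARG_MAX at scale
rmdir "$dir"
```

Each of those files occupies an inode despite holding zero bytes, which is why df -h can look healthy while df -i is at 100%.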
## Scenario 5: Parsing Massive Logs on the Fly
The Symptom:
Your reverse proxy is throwing a wave of HTTP 500 errors. You have a 10 GB Nginx access.log file, and your manager needs to know which IP addresses are causing the most errors immediately.
The Solution: You don’t need to write a Python script for this. The classic Unix pipeline is incredibly efficient at slicing text. We use the Sort | Uniq | Sort pattern:
grep " 500 " /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -n 10How it works:
grep " 500 "filters the 10GB file down to just the error lines.awk '{print $1}'extracts just the first column (the IP address).sortgroups identical IPs together.uniq -ccounts the consecutive identical IPs.sort -nrsorts the output numerically and in reverse (highest count at the top).head -n 10gives you the top 10 worst offenders.
## Conclusion
Modern DevOps relies heavily on abstractions like Kubernetes and serverless platforms, but underneath it all, it’s just Linux. Knowing how the kernel handles file descriptors, memory pressure, and I/O scheduling isn’t just sysadmin trivia—it is the difference between an outage lasting 3 hours and an outage lasting 3 minutes.