Linux Performance Troubleshooting: CPU, Memory, Disk, and Network
When a production server grinds to a halt at 3 AM, you need a structured methodology, not a panicked scramble through random commands. Over years of managing Linux servers in production, I have learned that the difference between resolving an incident in five minutes versus five hours comes down to one thing: a systematic approach. This guide covers everything from the foundational USE method to advanced tools like perf and strace, complete with real-world scenarios drawn from production war stories.
The USE Method: Your Troubleshooting Framework
Brendan Gregg's USE method is the single most valuable framework for performance analysis. For every resource on a system, you check three things:
- Utilization: What percentage of the resource's capacity is being consumed? A CPU running at 95% utilization is nearly saturated.
- Saturation: Is there work waiting in a queue because the resource cannot keep up? The run queue for CPUs and the I/O wait queue for disks are classic examples.
- Errors: Are there error events? Disk I/O errors, network packet drops, and TCP retransmits all fall here.
The power of the USE method is that it gives you a checklist. Instead of guessing, you systematically walk through each resource and check all three dimensions.
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | top, mpstat, sar -u | Load average, run queue (vmstat r) | dmesg, perf stat |
| Memory | free -h, sar -r | Swap usage (vmstat si/so), OOM kills | dmesg, /var/log/syslog |
| Disk | iostat -x, df -h | await, aqu-sz in iostat | dmesg, smartctl |
| Network | iftop, sar -n DEV | TCP retransmits, socket backlog | netstat -s, ethtool -S, ss -ti |
Apply this to each resource in order. Whichever resource shows high utilization, saturation, or errors is your bottleneck. Do not skip ahead to "solutions" until you have identified the actual constraint.
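The walk-through above can be scripted as a first-pass triage. The following is a minimal sketch that grabs one utilization signal per resource plus recent kernel errors; the commands assume GNU coreutils and a readable kernel log, and the output is a starting point, not a diagnosis:

```shell
#!/bin/sh
# First-pass USE triage: one quick probe per resource, errors last.
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
echo "== CPU =="
echo "cores=$cores 1-min-load=$load1"
echo "== Memory =="
free -h | sed -n '2p'                       # the Mem: line
echo "== Disk =="
df -h --output=target,pcent | sed -n '2,5p' # first few mounts
echo "== Errors (recent kernel warnings) =="
dmesg --level=err,warn 2>/dev/null | tail -5
```

Run it before anything else during an incident; whichever section looks abnormal tells you where to apply the deeper tools in the rest of this guide.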
CPU Analysis
Load Average: The First Signal
The very first command to run on a slow server:
uptime
# 10:23:45 up 42 days, 3:15, 2 users, load average: 4.52, 3.18, 2.01
Load average represents the average number of processes in a runnable or uninterruptible state over 1, 5, and 15 minutes. On a 4-core system:
- Load 4.0 means all cores are fully occupied with no queuing.
- Load 8.0 means 4 processes are running and 4 are waiting in the queue.
- Load 1.0 means 75% of CPU capacity is idle.
A critical nuance: load average includes processes in uninterruptible sleep (state D), which are typically waiting on disk I/O. A load of 20 on a 4-core system does not necessarily mean CPU is the problem. It could mean 16 processes are stuck waiting on a slow NFS mount. Always cross-reference with actual CPU utilization.
The three numbers tell a trend story. If the 1-minute average is much higher than the 15-minute, the load is increasing (you caught it early). If the 15-minute is highest, the load spike may already be subsiding.
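Because "high load" only means something relative to core count, it helps to normalize. A small sketch reading /proc/loadavg directly:

```shell
# Normalize the 1-minute load average against the number of logical CPUs.
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
per_core=$(awk -v l="$load1" -v c="$cores" 'BEGIN { printf "%.2f", l / c }')
echo "1-min load: $load1 across $cores CPUs -> $per_core per core"
# Above 1.00 per core, work is queuing -- but cross-check iowait before
# blaming the CPU, since D-state processes inflate this number too.
```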
Understanding /proc/cpuinfo
# How many logical CPUs
nproc
# 8
# CPU model and features
grep "model name" /proc/cpuinfo | head -1
# model name : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
# Physical cores vs. logical (hyperthreading)
grep "cpu cores" /proc/cpuinfo | head -1
# cpu cores : 4
# Check for CPU frequency scaling
grep "cpu MHz" /proc/cpuinfo
Knowing your core count is essential for interpreting load averages. If you have 8 logical CPUs (4 cores with hyperthreading), a load average of 8 is 100% utilization, not 200%.
top: The Swiss Army Knife
top
The header provides immediate insight:
top - 10:23:45 up 42 days, 3:15, 2 users, load average: 4.52, 3.18, 2.01
Tasks: 187 total, 3 running, 184 sleeping, 0 stopped, 0 zombie
%Cpu(s): 72.3 us, 12.1 sy, 0.0 ni, 14.2 id, 0.5 wa, 0.0 hi, 0.9 si, 0.0 st
MiB Mem : 7976.4 total, 512.3 free, 5234.1 used, 2230.0 buff/cache
MiB Swap: 2048.0 total, 1800.5 free, 247.5 used. 2102.3 avail Mem
The CPU percentage line is dense with information:
| Field | Meaning | What to Watch For |
|---|---|---|
| us | User-space CPU (your applications) | High means apps are compute-heavy |
| sy | Kernel/system CPU | High means lots of syscalls, context switches, or kernel work |
| ni | Nice'd (low-priority) processes | Usually irrelevant |
| id | Idle CPU | What's left over |
| wa | I/O wait | CPU idle because processes are blocked on disk. High wa means disk is the bottleneck, not CPU |
| hi | Hardware interrupts | Rarely high unless network card issues |
| si | Software interrupts | High on busy network servers |
| st | Steal time | Hypervisor taking CPU from your VM |
Essential top keyboard shortcuts:
- P -- sort by CPU usage
- M -- sort by memory usage
- 1 -- toggle per-CPU core breakdown
- c -- show full command line (critical for distinguishing Java processes)
- H -- show individual threads
- k -- kill a process by PID
- d -- change refresh interval
For batch mode (scripting or quick snapshots):
# One-shot snapshot, 20 lines
top -bn1 | head -20
# Capture 10 iterations at 2-second intervals to a file
top -bn10 -d2 > /tmp/top-capture.txt
htop: The Better Interactive Monitor
htop
htop provides colored CPU bars per core, a tree view with F5, search with F3, and the ability to filter processes by user or string with F4. Its biggest advantage over top is visual clarity: you can see at a glance which cores are busy, how memory is split between used and cached, and navigate the process list intuitively.
Install it if it is not present:
sudo apt install htop # Debian/Ubuntu
sudo yum install htop # RHEL/CentOS
mpstat: Per-CPU Core Statistics
mpstat -P ALL 1 5
CPU %usr %sys %iowait %idle
all 45.23 8.12 2.34 44.31
0 95.10 4.90 0.00 0.00
1 12.00 3.50 5.20 79.30
2 38.40 9.80 1.10 50.70
3 35.40 14.30 3.05 47.25
Core 0 at 95% while the others sit partly idle. This is the signature of a single-threaded bottleneck. The application is bound to one core and cannot use the rest. Common culprits include Node.js event loops, single-threaded database queries, and Python's GIL-bound workloads. The fix is usually architectural: worker processes, connection pooling, or horizontal scaling.
pidstat: Per-Process CPU Breakdown
# CPU usage per process, every 2 seconds, 5 samples
pidstat -u 2 5
# CPU usage for a specific PID
pidstat -p 12345 1 5
# Include threads
pidstat -t -p 12345 1 5
Average: UID PID %usr %system %guest %wait %CPU CPU Command
Average: 1000 12345 85.20 4.80 0.00 2.50 90.00 - java
Average: 33 8901 8.50 3.20 0.00 0.50 11.70 - nginx
Average: 999 4567 5.10 1.80 0.00 0.30 6.90 - postgres
pidstat comes from the sysstat package. It is cleaner than parsing top output and works well in scripts.
CPU Steal in Virtual Machines
On cloud VMs (AWS EC2, GCP, Azure), the st (steal) percentage in top deserves special attention. Steal time is time during which your VM's virtual CPUs were ready to run but the hypervisor scheduled another guest instead. In other words, your VM wants the CPU and the physical host is overcommitted.
# Check steal time: watch the %steal column
mpstat 1 5
# Or from vmstat: watch the 'st' column (last column of the cpu section)
vmstat 1 5
If steal is consistently above 5%, you are on an oversubscribed host. Options include migrating the instance, upgrading to a dedicated host, or moving to a burstable instance type that matches your workload.
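You can also measure steal directly from /proc/stat, which avoids depending on any tool's column layout. This sketch assumes the standard /proc/stat field order (user nice system idle iowait irq softirq steal ...) present on all modern kernels:

```shell
# Measure steal over a short window straight from /proc/stat.
# Field 9 of the aggregate 'cpu' line is steal ticks.
steal_ticks() { awk '/^cpu / { print $9 }' /proc/stat; }
total_ticks() { awk '/^cpu / { t = 0; for (i = 2; i <= NF; i++) t += $i; print t }' /proc/stat; }
s1=$(steal_ticks); t1=$(total_ticks)
sleep 2
s2=$(steal_ticks); t2=$(total_ticks)
steal_pct=$(awk -v s="$((s2 - s1))" -v t="$((t2 - t1))" \
    'BEGIN { printf "%.1f", t ? 100 * s / t : 0 }')
echo "steal over 2s window: ${steal_pct}%"
# Sustained readings above 5% point at an oversubscribed host.
```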
perf: Deep CPU Profiling
perf is the Linux kernel's built-in profiler. It tells you not just that the CPU is busy, but exactly where the cycles are going.
# Install perf
sudo apt install linux-tools-common linux-tools-$(uname -r)
# Count CPU events for 10 seconds
sudo perf stat -a sleep 10
# Record CPU samples for a running process
sudo perf record -p $(pgrep myapp) -g sleep 30
# View the profile
sudo perf report
# Quick: which functions use the most CPU system-wide
sudo perf top
perf is invaluable when top tells you a process is using 100% CPU but you need to know which function or code path is responsible. The output of perf report shows a call graph ranked by sample count.
Memory Analysis
free: The Memory Overview
free -h
total used free shared buff/cache available
Mem: 7.8Gi 5.1Gi 500Mi 128Mi 2.2Gi 2.1Gi
Swap: 2.0Gi 247Mi 1.8Gi
Understanding each column is critical:
- total: Physical RAM installed.
- used: Memory allocated to processes (includes shared memory, excludes buffers/cache on modern kernels).
- free: Completely unused memory. On a healthy Linux system, this is often near zero and that is normal.
- shared: Memory used by tmpfs filesystems and shared memory segments.
- buff/cache: Memory used by the kernel for disk caching. Buffers are metadata caches (directory entries, inodes), while cache holds actual file contents (page cache).
- available: An estimate of how much memory is available for new applications without swapping. This is the number that matters most.
The biggest misconception in Linux memory management: seeing "free" near zero and thinking the server is out of memory. Linux deliberately uses free RAM for page cache because unused RAM is wasted RAM. The available column accounts for reclaimable cache and is your real indicator of memory headroom.
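Since MemAvailable is the number that matters, a small check against /proc/meminfo makes the headroom explicit; the 10% threshold below is an illustrative alerting choice, not a kernel constant:

```shell
# MemAvailable already accounts for reclaimable cache -- use it, not 'free'.
avail=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
total=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
avail_pct=$(awk -v a="$avail" -v t="$total" 'BEGIN { printf "%.0f", 100 * a / t }')
echo "memory headroom: ${avail_pct}% of ${total} kB total"
if [ "$avail_pct" -lt 10 ]; then
    echo "WARNING: genuine memory pressure (not just full page cache)"
fi
```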
vmstat: Real-Time Memory Dynamics
vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 1 247500 512300 102400 2150000 0 5 12 250 1500 3200 45 8 44 2 0
5 2 247500 498200 102400 2148000 0 0 8 1200 2100 4500 72 12 14 1 0
2 0 247500 510100 102500 2151000 0 0 4 150 1400 3100 40 7 50 2 0
Key columns decoded:
| Column | Meaning | Red Flag |
|---|---|---|
| r | Run queue (runnable processes) | Consistently exceeds CPU core count |
| b | Blocked processes (uninterruptible sleep, usually I/O) | Sustained values above 0 |
| swpd | Swap used (KB) | Growing over time |
| si | Swap in from disk (KB/s) | Any non-zero value |
| so | Swap out to disk (KB/s) | Any non-zero value |
| bi | Blocks read from disk (KB/s) | Context-dependent |
| bo | Blocks written to disk (KB/s) | Context-dependent |
| in | Interrupts per second | Baseline-dependent |
| cs | Context switches per second | Sudden spikes indicate contention |
Active swapping (si and so both non-zero) is one of the strongest signals of memory pressure. The system is thrashing: reading pages from swap only to write other pages out.
/proc/meminfo: The Full Picture
cat /proc/meminfo
Key entries beyond what free shows:
- Slab: Memory used by kernel data structures (dentries, inodes, etc.). Can grow large on systems with millions of files.
- SReclaimable: Portion of Slab that can be reclaimed under pressure.
- SUnreclaim: Slab memory that cannot be reclaimed.
- PageTables: Memory used for page table mappings. Grows with the number of processes and their virtual address space.
- Committed_AS: Total memory committed to processes. If this exceeds physical RAM + swap, the system is overcommitted.
# Check slab usage
slabtop -o | head -20
# Or from /proc
grep -i slab /proc/meminfo
# Slab: 350000 kB
# SReclaimable: 280000 kB
# SUnreclaim: 70000 kB
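The Committed_AS check above can be made concrete with a quick ratio against CommitLimit. Note that CommitLimit is only a hard ceiling when strict overcommit (vm.overcommit_memory=2) is enabled; in the default mode it is advisory, but a ratio well above 100% still signals risk of OOM kills:

```shell
# Compare committed allocations against the kernel's commit limit.
commit=$(awk '/^Committed_AS:/ { print $2 }' /proc/meminfo)
limit=$(awk '/^CommitLimit:/ { print $2 }' /proc/meminfo)
ratio=$(awk -v c="$commit" -v l="$limit" 'BEGIN { printf "%.0f", 100 * c / l }')
echo "Committed_AS is ${ratio}% of CommitLimit (${commit} / ${limit} kB)"
```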
Page Cache vs Buffers
The page cache stores file contents read from disk. When you read a file, the kernel keeps it in memory so subsequent reads are served from RAM. Buffers store filesystem metadata like directory listings and inode information.
# Drop page cache (emergency only, impacts performance)
echo 3 | sudo tee /proc/sys/vm/drop_caches
# 1 = page cache only, 2 = dentries and inodes, 3 = both
Never drop caches on a production database server unless you understand the consequences. The database will need to re-read its entire working set from disk.
The OOM Killer: When Memory Runs Out
When the system exhausts both physical memory and swap, the OOM (Out of Memory) killer activates. It scores every process and kills the one with the highest score to free memory.
# Check for recent OOM kills
dmesg | grep -i "oom"
journalctl -k | grep -i "oom"
# Example OOM log entries:
# [11686.040460] java invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE)
# [11686.040462] Out of memory: Kill process 12345 (java) score 850 or sacrifice child
# [11686.040464] Killed process 12345 (java) total-vm:8234560kB, anon-rss:6120000kB
The OOM score is calculated based on the proportion of memory a process consumes. You can view and adjust it:
# View a process's current OOM score
cat /proc/$(pgrep -f "my-app")/oom_score
# Protect a critical process (range: -1000 to 1000)
echo -1000 | sudo tee /proc/$(pgrep -f "my-critical-app")/oom_score_adj
# In a systemd unit file
# [Service]
# OOMScoreAdjust=-900
Setting OOMScoreAdjust to -1000 makes a process immune to the OOM killer. Use this sparingly and only for truly critical services like your init process or database.
Swap Usage Investigation
# Overall swap usage
swapon --show
# NAME TYPE SIZE USED PRIO
# /dev/dm-1 partition 2G 247.5M -2
# Find processes using the most swap
for pid in /proc/[0-9]*; do
    SWAP=$(awk '/^VmSwap:/{print $2}' "$pid/status" 2>/dev/null)
    if [ -n "$SWAP" ] && [ "$SWAP" -gt 0 ]; then
        NAME=$(awk '/^Name:/{print $2}' "$pid/status" 2>/dev/null)
        echo "${SWAP} kB - PID $(basename "$pid") - $NAME"
    fi
done | sort -rn | head -10
Tuning swap behavior:
# Check current swappiness (default is 60)
cat /proc/sys/vm/swappiness
# For database servers, reduce swappiness to prefer dropping cache over swapping
sudo sysctl vm.swappiness=10
# Make it persistent
echo "vm.swappiness=10" | sudo tee -a /etc/sysctl.d/99-tuning.conf
A swappiness of 10 tells the kernel to strongly prefer reclaiming page cache over swapping application memory. For database servers where keeping data in process memory is critical, this is almost always the right setting.
Disk I/O Analysis
df and du: Space Usage
# Filesystem space usage
df -h
# Filesystem Size Used Avail Use% Mounted on
# /dev/sda1 50G 42G 5.8G 88% /
# /dev/sdb1 200G 120G 71G 63% /data
# Inode usage (can run out even with free space)
df -i
# Find the largest directories under /var
du -sh /var/* 2>/dev/null | sort -rh | head -10
# Find files larger than 100MB
find / -xdev -type f -size +100M -exec ls -lh {} \; 2>/dev/null | sort -k5 -rh | head -10
A full filesystem (100% used) is one of the most common causes of application failures, and inode exhaustion (millions of tiny files) can strike even while df -h shows plenty of free space. Monitor both.
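When df -i shows inodes running low, the next question is which directory holds the flood of tiny files. A rough sketch that ranks top-level directories under /var (a common culprit: sessions, mail queues, cache shards) by file count:

```shell
# Rank directories under /var by inode (file) count, highest first.
# -xdev keeps the count on one filesystem, matching what df -i reports.
inode_report=$(for d in /var/*/; do
    printf "%8d %s\n" "$(find "$d" -xdev 2>/dev/null | wc -l)" "$d"
done | sort -rn | head -5)
echo "$inode_report"
```

Point it at whichever mount df -i flags; /var is just a common starting place.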
lsblk and blkid: Disk Topology
# Block device tree
lsblk -f
# NAME FSTYPE LABEL UUID MOUNTPOINT
# sda
# +-sda1 ext4 a1b2c3d4-e5f6-7890-abcd-ef1234567890 /
# +-sda2 swap f1e2d3c4-b5a6-7890-dcba-0987654321fe [SWAP]
# sdb
# +-sdb1 xfs 12345678-abcd-ef01-2345-678901234567 /data
# Identify filesystems
blkid
iostat: The Disk Performance Microscope
# Install sysstat if not present
sudo apt install sysstat
# Extended statistics, 1-second intervals, 5 samples
iostat -xz 1 5
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s r_await w_await aqu-sz %util
sda 12.50 85.30 200.0 4250.0 0.50 15.20 0.80 5.20 0.45 20.50
nvme0n1 5.20 120.40 320.0 8500.0 0.00 8.50 0.15 0.35 0.04 1.00
The critical metrics:
| Metric | Meaning | Concern Thresholds |
|---|---|---|
| r_await / w_await | Average time (ms) for read/write requests | SSD: above 5ms; HDD: above 20ms |
| %util | Percentage of time the device had I/O in progress | Above 80% sustained |
| aqu-sz | Average queue length | Above 1 consistently |
| rrqm/s / wrqm/s | Merged requests per second | Low values may indicate random I/O |
A common mistake is looking only at %util. A device at 100% utilization that handles all requests with sub-millisecond latency is not a problem. The await values tell you if the disk is actually struggling.
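The await numbers iostat reports can also be derived by hand from /proc/diskstats, which is useful on boxes where sysstat is not installed. This sketch assumes the standard diskstats layout (per device: field 4 = reads completed, 7 = ms spent reading, 8 = writes completed, 11 = ms spent writing) and just diffs two snapshots:

```shell
# Approximate r_await / w_await over a 2-second window from /proc/diskstats.
dev=$(awk '$3 !~ /^(loop|ram|dm-)/ { print $3; exit }' /proc/diskstats)
snap() { awk -v d="$dev" '$3 == d { print $4, $7, $8, $11 }' /proc/diskstats; }
before=$(snap); sleep 2; after=$(snap)
report=$(echo "$before $after" | awk '{
    r = $5 - $1; rt = $6 - $2; w = $7 - $3; wt = $8 - $4
    printf "r_await %.1f ms, w_await %.1f ms", r ? rt / r : 0, w ? wt / w : 0 }')
echo "${dev:-no-disk}: $report"
```

With no I/O during the window both values read 0.0; run it while the workload is active.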
I/O Scheduler Tuning
# Check current scheduler
cat /sys/block/sda/queue/scheduler
# [mq-deadline] kyber bfq none
# Change scheduler (does not persist across reboot)
echo "kyber" | sudo tee /sys/block/sda/queue/scheduler
General guidance:
- mq-deadline: Good default for most workloads. Prevents starvation.
- kyber: Designed for fast devices (NVMe SSDs). Low overhead.
- bfq: Best for interactive/desktop workloads. Provides fair I/O bandwidth.
- none: For NVMe devices that have their own internal queuing. Lowest overhead.
For NVMe SSDs in server environments, none or kyber is typically optimal. For spinning disks, mq-deadline works well.
iowait Explained
When a CPU reports iowait, it means the CPU is idle AND there are processes waiting for I/O to complete. It is a subset of idle time, not additional overhead. High iowait means the disk (or NFS, or other block device) is the bottleneck.
# Confirm iowait with vmstat
vmstat 1 5
# Look at the 'wa' column in the cpu section
# Identify which processes are in D state (uninterruptible sleep, usually I/O)
ps aux | awk '$8 ~ /^D/'
iotop: Identify the I/O Hog
# Show only processes doing I/O, accumulated totals
sudo iotop -oa
# Total DISK READ: 5.20 M/s | Total DISK WRITE: 25.80 M/s
# PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
# 12345 be/4 mysql 4.80 M/s 22.50 M/s 0.00 % 65.00 % mysqld
# 8901 be/4 root 0.40 M/s 3.30 M/s 0.00 % 12.00 % rsync
The -o flag filters to only processes actively doing I/O, cutting through noise. The -a flag accumulates totals so you see the bigger picture, not just instantaneous rates.
Network Performance
ss: The Modern Socket Tool
ss has replaced netstat on modern Linux systems. It is faster and provides more detail.
# Connection summary
ss -s
# Total: 1250
# TCP: 980 (estab 750, closed 100, orphaned 12, timewait 95)
# All TCP connections with process names
ss -tlnp
# Find all connections to a specific port
ss -tn dst :3306
# Count connections per state
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
# 750 ESTAB
# 100 TIME-WAIT
# 95 CLOSE-WAIT
# 20 LISTEN
# 15 FIN-WAIT-2
# Detailed TCP info (retransmits, RTT, congestion window)
ss -ti dst :443
Connection state red flags:
- Many TIME-WAIT: Short-lived connections being created and destroyed. Fix with connection pooling or net.ipv4.tcp_tw_reuse=1.
- Growing CLOSE-WAIT: The application is not closing sockets properly. This is a bug in your code.
- SYN-RECV buildup: Potential SYN flood attack or backlog too small.
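These state counts can also be read straight from /proc/net/tcp, which works even in minimal containers without iproute2. The state column is hexadecimal; this sketch covers IPv4 sockets only (/proc/net/tcp6 holds the IPv6 side):

```shell
# Tally TCP socket states from /proc/net/tcp (4th column, hex 'st' value).
# 01=ESTABLISHED 02=SYN_SENT 03=SYN_RECV 06=TIME_WAIT 08=CLOSE_WAIT 0A=LISTEN
state_counts=$(awk 'NR > 1 { n[$4]++ } END { for (s in n) print n[s], s }' \
    /proc/net/tcp | sort -rn)
echo "$state_counts"
close_wait=$(echo "$state_counts" | awk '$2 == "08" { print $1 }')
echo "CLOSE_WAIT sockets: ${close_wait:-0}"
```

A CLOSE_WAIT count that climbs between runs is the leak signature described above.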
iftop and nethogs: Real-Time Bandwidth
# Bandwidth per connection pair
sudo iftop -i eth0
# Bandwidth per process
sudo nethogs eth0
iftop shows bandwidth between host pairs, which is great for spotting unexpected traffic or DDoS. nethogs groups traffic by process, instantly answering "which process is consuming bandwidth."
Bandwidth Testing with iperf3
# On the server side
iperf3 -s
# On the client side
iperf3 -c server-ip -t 30 -P 4
# -t 30 = test for 30 seconds
# -P 4 = 4 parallel streams
# Sample output
# [SUM] 0.00-30.00 sec 3.28 GBytes 940 Mbits/sec sender
# [SUM] 0.00-30.00 sec 3.27 GBytes 938 Mbits/sec receiver
iperf3 tests the raw network capacity between two points. If iperf3 shows full bandwidth but your application is slow, the bottleneck is in the application layer, not the network.
Packet Drops and TCP Retransmits
# Interface-level statistics
ip -s link show eth0
# Look for RX/TX errors and dropped counts
# TCP retransmit statistics
netstat -s | grep -i retrans
# 12543 segments retransmitted
# 85 fast retransmits
# Per-connection retransmit info
ss -ti | grep -A1 retrans
# Watch packet drops in real time
watch -n1 'cat /proc/net/dev'
TCP retransmits indicate packet loss somewhere in the path. A retransmit rate above 1-2% degrades throughput significantly because TCP backs off its sending rate. Common causes include network congestion, faulty cables, and overloaded switches.
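To turn the raw counters into the 1-2% rate mentioned above, divide RetransSegs by OutSegs from /proc/net/snmp. These counters accumulate since boot, so this gives a lifetime average; for a live rate, take two snapshots and diff them the same way:

```shell
# Lifetime TCP retransmit rate: RetransSegs / OutSegs from /proc/net/snmp.
# On the Tcp values line, OutSegs is field 12 and RetransSegs field 13.
retrans_rate=$(awk '/^Tcp: [0-9]/ {
    out = $12; retrans = $13
    printf "%.2f", out ? 100 * retrans / out : 0 }' /proc/net/snmp)
echo "retransmitted: ${retrans_rate}% of segments sent since boot"
```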
Process Analysis
ps: Process Snapshots
# Top CPU consumers
ps aux --sort=-%cpu | head -10
# Top memory consumers
ps aux --sort=-%mem | head -10
# Tree view showing parent-child relationships
ps auxf
# Find zombie processes
ps aux | awk '$8 ~ /Z/ {print}'
# Long-running processes
ps -eo pid,user,etime,args --sort=-etime | head -10
pstree: Visual Process Hierarchy
pstree -p
# systemd(1)─┬─nginx(1234)─┬─nginx(1235)
#            │             └─nginx(1236)
#            └─sshd(800)───sshd(12340)───bash(12345)───top(12400)
pstree is invaluable for understanding which processes spawned which. If you see hundreds of child processes under a single parent, that parent may be fork-bombing.
strace: System Call Tracing
When you need to understand exactly what a stuck or slow process is doing:
# Summary of syscalls for a running process
sudo strace -p $(pgrep myapp) -c -S time
# % time seconds usecs/call calls errors syscall
# ------ ----------- ----------- --------- --------- --------
# 65.12 2.450000 245 10000 read
# 20.30 0.764000 764 1000 write
# 8.50 0.320000 32 10000 8000 open
# 6.08 0.229000 23 10000 close
# Trace only file operations
sudo strace -p $(pgrep myapp) -e trace=open,read,write,close
# Trace network activity
sudo strace -p $(pgrep myapp) -e trace=network
# Follow child processes and threads, with timestamps
sudo strace -p $(pgrep myapp) -f -tt -T
# Trace a command from start
strace -c curl https://example.com 2>&1
strace adds significant overhead (it pauses the process on every syscall). Never use it on latency-sensitive production processes unless you have no alternative. For lighter profiling, consider perf or bpftrace.
lsof: Open Files and File Descriptors
# Files opened by a specific process
lsof -p $(pgrep nginx)
# Who has a specific file open
lsof /var/log/syslog
# Deleted files still holding disk space
lsof +L1
# Network connections for a process
lsof -i -a -p $(pgrep myapp)
# Count open file descriptors per process
lsof -p $(pgrep myapp) | wc -l
# Check process limits vs actual usage
cat /proc/$(pgrep myapp)/limits | grep "open files"
ls /proc/$(pgrep myapp)/fd | wc -l
A classic gotcha: deleted log files that are still open by a process continue consuming disk space. df shows the space as used, but du cannot account for it because the directory entry is gone. lsof +L1 reveals these phantom files. Restarting the holding process or sending it a signal to reopen its log files frees the space.
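When lsof is not installed, the same phantom files can be found by scanning /proc directly: each open descriptor is a symlink, and deleted targets carry a "(deleted)" suffix. A sketch (without root it only sees your own processes):

```shell
# Sum space still held by deleted-but-open files (df counts it, du cannot).
held_kb=0
for fd in /proc/[0-9]*/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case "$target" in
        *' (deleted)')
            size=$(stat -Lc %s "$fd" 2>/dev/null) || continue
            held_kb=$((held_kb + size / 1024)) ;;
    esac
done
echo "space held by deleted open files: ${held_kb} kB"
```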
/proc/PID: The Process Filesystem
Every running process has a directory under /proc containing detailed information:
PID=$(pgrep myapp)
# Process status (memory, threads, signals)
cat /proc/$PID/status
# Memory map
cat /proc/$PID/maps | head -20
# Current working directory
ls -la /proc/$PID/cwd
# Environment variables
cat /proc/$PID/environ | tr '\0' '\n'
# Open file descriptors
ls -la /proc/$PID/fd | head -20
# I/O statistics
cat /proc/$PID/io
# rchar: 1234567890 (bytes read)
# wchar: 9876543210 (bytes written)
# read_bytes: 500000000 (actual disk reads)
# write_bytes: 800000000 (actual disk writes)
System Overview Tools
sar: Historical Performance Data
sar (System Activity Reporter) is part of the sysstat package and is indispensable for post-incident analysis.
sudo apt install sysstat
# Enable data collection
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl enable --now sysstat
# CPU utilization for today
sar -u
# Memory usage
sar -r
# Disk I/O
sar -d
# Network throughput
sar -n DEV
# Yesterday's data
sar -u -f /var/log/sysstat/sa$(date -d yesterday +%d)
# Specific time range
sar -u -s 09:00:00 -e 12:00:00
# All metrics for a time range (comprehensive)
sar -A -s 02:00:00 -e 04:00:00
sar data is collected every 10 minutes by default. When you are analyzing a 3 AM incident the next morning, sar is often the only source of historical data. Install it on every server.
dstat: Real-Time Everything
# Default: CPU, disk, net, paging, system
dstat
# Custom selection
dstat -tcmndg --top-cpu --top-mem
dstat combines vmstat, iostat, and ifstat into a single colorized output. It is excellent for watching correlations between subsystems in real time during load testing or incident response.
nmon and glances
# nmon: Interactive performance monitor
nmon
# Press c for CPU, m for memory, d for disks, n for network
# glances: Modern system monitor with web UI option
pip install glances
glances
glances -w # Start web server mode on port 61208
glances is particularly useful because it highlights values in red/yellow when they cross thresholds, and it can export to InfluxDB, Elasticsearch, or other backends.
Log Analysis for Troubleshooting
journalctl: systemd Journal
# Logs for a specific service
journalctl -u nginx --since "1 hour ago"
# Kernel messages only
journalctl -k
# Follow mode (like tail -f)
journalctl -f -u myapp
# Logs around a specific time
journalctl --since "2026-03-23 02:00:00" --until "2026-03-23 04:00:00"
# Error and critical priority only
journalctl -p err..crit --since "today"
# Disk usage by journal
journalctl --disk-usage
Traditional Log Files
# System messages
tail -100 /var/log/syslog # Debian/Ubuntu
tail -100 /var/log/messages # RHEL/CentOS
# Authentication events
tail -50 /var/log/auth.log
# Kernel ring buffer (most recent hardware and kernel events)
dmesg --human --level=err,warn
# OOM events
dmesg | grep -i "oom\|out of memory"
# Disk errors
dmesg | grep -i "error\|fault\|fail" | grep -i "sd\|nvme\|disk\|ext4\|xfs"
cgroups and Resource Limits
Control groups (cgroups) allow you to limit, prioritize, and account for resource usage per process group. systemd uses cgroups extensively.
# View cgroup resource usage for systemd services
systemd-cgtop
# Check a service's cgroup limits
systemctl show myapp.service | grep -i "memory\|cpu"
# View cgroup hierarchy
cat /proc/$(pgrep myapp)/cgroup
Setting resource limits in a systemd unit file:
# /etc/systemd/system/myapp.service
# [Service]
# MemoryMax=2G
# MemoryHigh=1.5G
# CPUQuota=200%
# TasksMax=512
# IOWeight=100
- MemoryMax: Hard limit. Process gets OOM-killed if exceeded.
- MemoryHigh: Soft limit. Kernel throttles the process but does not kill it.
- CPUQuota: 200% means the process can use up to 2 full CPU cores.
- TasksMax: Maximum number of tasks (threads + processes).
# Apply changes
sudo systemctl daemon-reload
sudo systemctl restart myapp
cgroups are your safety net in production. Even if a Java process has a memory leak, MemoryMax ensures it gets killed before it takes down the entire server.
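To verify that a limit is actually in force, you can read it back from the cgroup filesystem itself rather than trusting the unit file. This sketch assumes the cgroup v2 unified hierarchy (the default on recent distributions) and inspects the current process's own cgroup:

```shell
# Read the memory limit the kernel enforces for the current cgroup (v2).
cg_path=$(awk -F: '$1 == "0" { print $3 }' /proc/self/cgroup)
limit_file="/sys/fs/cgroup${cg_path}/memory.max"
if [ -r "$limit_file" ]; then
    cg_info="memory.max = $(cat "$limit_file")"   # 'max' means no limit set
else
    cg_info="no readable memory.max (cgroup v1 host or restricted view)"
fi
echo "$cg_info"
```

Substitute a service's cgroup path (shown by systemctl status) to audit its limits the same way.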
Practical Troubleshooting Scenarios
Scenario 1: High Load Average but Low CPU Usage
You see load average at 20 on a 4-core box, but CPU usage is only 15%.
# Check for processes in D state (uninterruptible sleep)
ps aux | awk '$8 ~ /^D/'
# Confirm I/O wait
vmstat 1 5
# Look for high 'wa' and high 'b' (blocked processes)
# Identify what's doing the I/O
sudo iotop -oa
The diagnosis: 16 processes are stuck waiting on an NFS mount that has become unresponsive. They show as D state and inflate the load average, but they are not using CPU. The fix is to address the NFS server, not add more CPU.
Production war story: A web application server showed load average of 45 with CPUs nearly idle. Every request was hanging. Investigation revealed a saturated SAN storage device that had increased I/O latency from 2ms to 800ms. All application threads were blocked on disk reads. The fix was migrating the hot data to local NVMe storage.
Scenario 2: OOM Kills in Production
Your application keeps getting killed. dmesg confirms OOM kills.
# Find the OOM kill details
dmesg | grep -A 5 "Out of memory"
# Check current memory state
free -h
cat /proc/meminfo | grep -E "MemTotal|MemAvailable|SwapTotal|SwapFree|Committed"
# Identify memory hogs
ps aux --sort=-%rss | head -10
# Check if there is a leak (watch RSS grow over time)
while true; do
ps -o pid,rss,vsz,comm -p $(pgrep myapp)
sleep 60
done
Immediate mitigations:
# Add emergency swap if none exists
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Protect the most critical process
echo -1000 | sudo tee /proc/$(pgrep -f "my-database")/oom_score_adj
# Set memory limits to contain the leak
sudo systemctl set-property myapp.service MemoryMax=4G
Long-term: profile the application for memory leaks with language-specific tools (Valgrind for C/C++, heap dumps for Java, tracemalloc for Python).
Scenario 3: Disk Space Emergency
Application is throwing "No space left on device" errors.
# Step 1: What's full?
df -h
df -i # Check inodes too
# Step 2: Find the biggest consumers
du -sh /* 2>/dev/null | sort -rh | head -10
du -sh /var/* 2>/dev/null | sort -rh | head -10
du -sh /var/log/* 2>/dev/null | sort -rh | head -10
# Step 3: Check for deleted files still holding space
lsof +L1
# Step 4: Quick wins for freeing space
sudo journalctl --vacuum-size=100M # Trim systemd journal
sudo apt clean # Clear apt cache
sudo find /var/log -name "*.gz" -mtime +30 -delete # Old compressed logs
sudo find /tmp -type f -mtime +7 -delete # Old temp files (files only; leave directories in place)
Production war story: A server showed 100% disk usage but du -sh / only accounted for 30GB of a 50GB disk. The culprit was a runaway application writing to a log file that had been deleted while the process still had it open. lsof +L1 revealed a 20GB deleted file still held by the process. A kill -HUP to the process (which made it reopen its log file) instantly freed the space.
Scenario 4: Network Latency Spikes
Users report intermittent slowness.
# Check for packet loss
ping -c 100 target-host
# Look for packet loss percentage and latency variance
# Check interface errors
ip -s link show eth0
# TCP retransmits
netstat -s | grep -i retrans
ss -ti dst :443 | grep retrans
# Check for bandwidth saturation
sudo iftop -i eth0
# Check socket buffer overflows
netstat -s | grep -i overflow
cat /proc/net/netstat | grep -i drop
Common fixes:
# Increase socket buffer sizes
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
# Increase connection backlog
sudo sysctl -w net.core.somaxconn=65535
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=65535
Scenario 5: Zombie Processes
top shows zombie processes accumulating.
# Find zombies
ps aux | awk '$8 == "Z"'
# Find the parent of zombies
ps -eo pid,ppid,stat,args | awk '$3 ~ /Z/'
# Then check the parent
ps -p PARENT_PID -o pid,args
# The parent is not calling wait() on its children
# Option 1: Send SIGCHLD to the parent
kill -SIGCHLD PARENT_PID
# Option 2: If that does not work, kill the parent
# The zombies will be adopted by init and reaped
kill PARENT_PID
Zombies themselves consume no resources (just a process table entry), but they indicate a buggy parent process. If they accumulate into the thousands, they can exhaust the PID space.
Scenario 6: Memory Leak Detection
You suspect a process is leaking memory.
# Track RSS growth over time
pidstat -r -p $(pgrep myapp) 60
# Watch the RSS column. If it grows steadily without plateau, it's leaking.
# Sample output showing a leak:
# Time PID minflt/s majflt/s VSZ RSS %MEM Command
# 10:00:00 1234 150.00 0.00 4500000 1200000 15.0 myapp
# 10:01:00 1234 180.00 0.00 4600000 1350000 16.9 myapp
# 10:02:00 1234 200.00 0.00 4750000 1520000 19.0 myapp
# 10:03:00 1234 210.00 0.00 4900000 1700000 21.3 myapp
# Check the memory map for clues
cat /proc/$(pgrep myapp)/smaps_rollup
# Look at the Rss, Pss, and Anonymous fields
# For native code, use valgrind
valgrind --leak-check=full ./myapp
# For Java, capture a heap dump
jmap -dump:format=b,file=heap.hprof $(pgrep java)
# For Python, enable tracemalloc in your code or at startup:
# PYTHONTRACEMALLOC=1 python your_script.py
A useful heuristic: if RSS grows linearly with time regardless of load, it is almost certainly a leak. If RSS grows with load and stabilizes, it is normal working set expansion.
Performance Baselines and Monitoring
Troubleshooting without a baseline is guesswork. You need to know what "normal" looks like.
Establishing Baselines
# Capture a baseline during normal operation
sar -A > /root/baseline-$(date +%Y%m%d).txt
# Key metrics to baseline
echo "=== CPU ===" && mpstat -P ALL 1 5
echo "=== Memory ===" && free -h && cat /proc/meminfo | head -20
echo "=== Disk ===" && iostat -xz 1 5
echo "=== Network ===" && ss -s
echo "=== Connections ===" && ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
Monitoring Stack Recommendations
| Tool | Use Case | Complexity |
|---|---|---|
| sar (sysstat) | Historical data on every server, zero dependencies | Low |
| Prometheus + node_exporter | Metrics collection with alerting | Medium |
| Grafana | Visualization dashboards | Medium |
| Netdata | Real-time, auto-configured dashboards | Low |
| Datadog / New Relic | Full-stack observability (SaaS) | Low (but costs money) |
At minimum, every production server should have sysstat installed and enabled. It costs nothing and provides historical data that is invaluable during incident response.
Alerting Thresholds
Start with these thresholds and tune based on your baseline:
| Metric | Warning | Critical |
|---|---|---|
| CPU utilization | 80% sustained 5min | 95% sustained 5min |
| Memory available | 20% of total | 10% of total |
| Disk space used | 80% | 90% |
| Disk I/O await | 10ms (SSD) | 50ms (SSD) |
| Swap usage | Any increase | Above 500MB |
| Load average | 2x CPU count | 4x CPU count |
| TCP retransmits | 1% of segments | 5% of segments |
Common Performance Anti-Patterns
| Anti-Pattern | Symptom | Fix |
|---|---|---|
| No swap on production servers | OOM kills under memory spikes | Add 1-2 GB swap as a safety net |
| Default swappiness=60 on DB servers | Database page cache gets swapped out | Set vm.swappiness=10 |
| Running out of file descriptors | "Too many open files" errors | Increase LimitNOFILE in systemd unit |
| Disk full from unrotated logs | Application crashes on write errors | Configure logrotate, monitor at 80% |
| TIME-WAIT connection flooding | Port exhaustion, connection failures | Connection pooling, tcp_tw_reuse=1 |
| Single-threaded app on 16 cores | 1 core at 100%, 15 idle | Scale with workers or horizontal instances |
| No monitoring or baseline data | Incidents take hours to diagnose | Deploy Prometheus + Grafana or at minimum sar |
| No resource limits (cgroups) | One runaway process takes down the server | Set MemoryMax and CPUQuota in systemd units |
Key Takeaways
Performance troubleshooting is a methodology, not a bag of commands. The USE method gives you a framework: check utilization, saturation, and errors for each resource. The tools are mostly straightforward, with top, free, iostat, and ss covering 80% of scenarios. What separates effective troubleshooting from flailing is discipline: measure first, form a hypothesis, verify, then act.
Install sysstat and enable sar data collection on every server before you need it. Set up resource limits with cgroups so one misbehaving process cannot bring down the entire system. Establish baselines during normal operation so you know what "healthy" looks like. And never optimize based on a guess. A wrong diagnosis wastes time and can make the problem worse.
The production incidents that resolve quickly share a common pattern: the engineer had the right tools installed, knew what normal looked like, and followed a systematic method to identify the bottleneck. Everything in this guide is aimed at making you that engineer.