
Linux Performance Troubleshooting: CPU, Memory, Disk, and Network

Aareez Asif · 31 min read

When a production server grinds to a halt at 3 AM, you need a structured methodology, not a panicked scramble through random commands. Over years of managing Linux servers in production, I have learned that the difference between resolving an incident in five minutes versus five hours comes down to one thing: a systematic approach. This guide covers everything from the foundational USE method to advanced tools like perf and strace, complete with real-world scenarios drawn from production war stories.

The USE Method: Your Troubleshooting Framework

Brendan Gregg's USE method is the single most valuable framework for performance analysis. For every resource on a system, you check three things:

  • Utilization: What percentage of the resource's capacity is being consumed? A CPU running at 95% utilization is nearly saturated.
  • Saturation: Is there work waiting in a queue because the resource cannot keep up? The run queue for CPUs and the I/O wait queue for disks are classic examples.
  • Errors: Are there error events? Disk I/O errors, network packet drops, and TCP retransmits all fall here.

The power of the USE method is that it gives you a checklist. Instead of guessing, you systematically walk through each resource and check all three dimensions.

| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | top, mpstat, sar -u | Load average, run queue (vmstat r) | dmesg, perf stat |
| Memory | free -h, sar -r | Swap usage (vmstat si/so), OOM kills | dmesg, /var/log/syslog |
| Disk | iostat -x, df -h | await, aqu-sz in iostat | dmesg, smartctl |
| Network | iftop, sar -n DEV | TCP retransmits, socket backlog | netstat -s, ethtool -S, ss -ti |

Apply this to each resource in order. Whichever resource shows high utilization, saturation, or errors is your bottleneck. Do not skip ahead to "solutions" until you have identified the actual constraint.
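The checklist lends itself to a tiny helper. Below is a minimal sketch (not from any standard tool) of a verdict function that classifies one resource from its three USE dimensions; the 80% utilization threshold is an illustrative assumption, not a rule.

```shell
# Hypothetical helper: classify one resource from its USE dimensions.
# The 80% utilization cutoff is an assumed threshold; tune to your baseline.
use_verdict() {
    local util=$1 sat=$2 err=$3   # utilization %, queue depth, error count
    if   [ "$err" -gt 0 ];   then echo "errors"
    elif [ "$sat" -gt 0 ];   then echo "saturated"
    elif [ "$util" -ge 80 ]; then echo "hot"
    else                          echo "ok"
    fi
}

# Example: CPU at 95% utilization, run queue of 6, no errors
use_verdict 95 6 0   # -> saturated (queued work outranks raw utilization)
```

Note the ordering: errors beat saturation, and saturation beats utilization, mirroring how the USE method prioritizes findings.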

CPU Analysis

Load Average: The First Signal

The very first command to run on a slow server:

uptime
#  10:23:45 up 42 days,  3:15,  2 users,  load average: 4.52, 3.18, 2.01

Load average represents the average number of processes in a runnable or uninterruptible state over 1, 5, and 15 minutes. On a 4-core system:

  • Load 4.0 means all cores are fully occupied with no queuing.
  • Load 8.0 means 4 processes are running and 4 are waiting in the queue.
  • Load 1.0 means 75% of CPU capacity is idle.

A critical nuance: load average includes processes in uninterruptible sleep (state D), which are typically waiting on disk I/O. A load of 20 on a 4-core system does not necessarily mean CPU is the problem. It could mean 16 processes are stuck waiting on a slow NFS mount. Always cross-reference with actual CPU utilization.

The three numbers tell a trend story. If the 1-minute average is much higher than the 15-minute, the load is increasing (you caught it early). If the 15-minute is highest, the load spike may already be subsiding.
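That trend reading can be scripted straight from /proc/loadavg. A sketch; the 1.5x sensitivity factor is an arbitrary assumption, not a kernel constant:

```shell
# Classify the load trend by comparing the 1-minute and 15-minute averages.
# The 1.5x factor is an assumed sensitivity threshold.
load_trend() {   # args: 1-min average, 15-min average
    awk -v a="$1" -v c="$2" 'BEGIN {
        if      (a > c * 1.5) print "rising"
        else if (c > a * 1.5) print "subsiding"
        else                  print "steady"
    }'
}

# /proc/loadavg fields: 1min 5min 15min running/total last-pid
read one five fifteen _ < /proc/loadavg
echo "trend: $(load_trend "$one" "$fifteen")"
```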

Understanding /proc/cpuinfo

# How many logical CPUs
nproc
# 8

# CPU model and features
grep "model name" /proc/cpuinfo | head -1
# model name : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz

# Physical cores vs. logical (hyperthreading)
grep "cpu cores" /proc/cpuinfo | head -1
# cpu cores : 4

# Check for CPU frequency scaling
grep "cpu MHz" /proc/cpuinfo

Knowing your core count is essential for interpreting load averages. If you have 8 logical CPUs (4 cores with hyperthreading), a load average of 8 is 100% utilization, not 200%.
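A quick way to apply this: normalize the load by the logical CPU count. A sketch; anything above 1.0 per core means runnable work is queuing:

```shell
# Load per logical CPU: above 1.0 means runnable work is queuing.
load_per_core() {   # args: load average, logical CPU count
    awk -v l="$1" -v c="$2" 'BEGIN { printf "%.2f\n", l / c }'
}

load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```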

top: The Swiss Army Knife

top

The header provides immediate insight:

top - 10:23:45 up 42 days, 3:15, 2 users, load average: 4.52, 3.18, 2.01
Tasks: 187 total,   3 running, 184 sleeping,   0 stopped,   0 zombie
%Cpu(s): 72.3 us, 12.1 sy,  0.0 ni, 14.2 id,  0.5 wa,  0.0 hi,  0.9 si,  0.0 st
MiB Mem :  7976.4 total,   512.3 free,  5234.1 used,  2230.0 buff/cache
MiB Swap:  2048.0 total,  1800.5 free,   247.5 used.  2102.3 avail Mem

The CPU percentage line is dense with information:

| Field | Meaning | What to Watch For |
|---|---|---|
| us | User-space CPU (your applications) | High means apps are compute-heavy |
| sy | Kernel/system CPU | High means lots of syscalls, context switches, or kernel work |
| ni | Nice'd (low-priority) processes | Usually irrelevant |
| id | Idle CPU | What's left over |
| wa | I/O wait | CPU idle because processes are blocked on disk. High wa means disk is the bottleneck, not CPU |
| hi | Hardware interrupts | Rarely high unless network card issues |
| si | Software interrupts | High on busy network servers |
| st | Steal time | Hypervisor taking CPU from your VM |

Essential top keyboard shortcuts:

  • P -- sort by CPU usage
  • M -- sort by memory usage
  • 1 -- toggle per-CPU core breakdown
  • c -- show full command line (critical for distinguishing Java processes)
  • H -- show individual threads
  • k -- kill a process by PID
  • d -- change refresh interval

For batch mode (scripting or quick snapshots):

# One-shot snapshot, 20 lines
top -bn1 | head -20

# Capture 10 iterations at 2-second intervals to a file
top -bn10 -d2 > /tmp/top-capture.txt

htop: The Better Interactive Monitor

htop

htop provides colored CPU bars per core, a tree view with F5, search with F3, and the ability to filter processes by user or string with F4. Its biggest advantage over top is visual clarity: you can see at a glance which cores are busy, how memory is split between used and cached, and navigate the process list intuitively.

Install it if it is not present:

sudo apt install htop    # Debian/Ubuntu
sudo yum install htop    # RHEL/CentOS

mpstat: Per-CPU Core Statistics

mpstat -P ALL 1 5
CPU    %usr   %sys  %iowait   %idle
all   45.23   8.12     2.34   44.31
  0   95.10   4.90     0.00    0.00
  1   12.00   3.50     5.20   79.30
  2   38.40   9.80     1.10   50.70
  3   35.40   14.30    3.05   47.25

Core 0 at 95% while the others sit partly idle. This is the signature of a single-threaded bottleneck. The application is bound to one core and cannot use the rest. Common culprits include Node.js event loops, single-threaded database queries, and Python's GIL-bound workloads. The fix is usually architectural: worker processes, connection pooling, or horizontal scaling.

pidstat: Per-Process CPU Breakdown

# CPU usage per process, every 2 seconds, 5 samples
pidstat -u 2 5

# CPU usage for a specific PID
pidstat -p 12345 1 5

# Include threads
pidstat -t -p 12345 1 5
Average:      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:     1000     12345   85.20    4.80    0.00    2.50   90.00     -  java
Average:       33      8901    8.50    3.20    0.00    0.50   11.70     -  nginx
Average:      999      4567    5.10    1.80    0.00    0.30    6.90     -  postgres

pidstat comes from the sysstat package. It is cleaner than parsing top output and works well in scripts.

CPU Steal in Virtual Machines

On cloud VMs (AWS EC2, GCP, Azure), the st (steal) percentage in top deserves special attention. Steal time means the hypervisor is using CPU cycles that your VM requested. In other words, your VM wants to run but the physical host is overcommitted.

# Check steal time: read the %steal column
mpstat 1 5

# Or from vmstat
vmstat 1 5
# Look at the 'st' column

If steal is consistently above 5%, you are on an oversubscribed host. Options include migrating the instance, upgrading to a dedicated host, or moving to a burstable instance type that matches your workload.
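A sketch of that check in script form. The verdict helper is hypothetical; the live part averages mpstat's %steal column, locating it by header name because its position varies across sysstat versions:

```shell
# Hypothetical verdict on steal time, using the 5% guidance above.
steal_verdict() {   # args: average steal %, threshold %
    awk -v s="$1" -v t="$2" 'BEGIN { print (s + 0 >= t + 0) ? "oversubscribed" : "ok" }'
}

# Guarded live check: average %steal from mpstat, column found via header.
if command -v mpstat >/dev/null 2>&1; then
    avg=$(mpstat 1 3 | awk '
        /%steal/ { for (i = 1; i <= NF; i++) if ($i == "%steal") col = i }
        /^Average/ && $2 == "all" && col { print $col }')
    echo "steal: ${avg:-0}% -> $(steal_verdict "${avg:-0}" 5)"
fi
```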

perf: Deep CPU Profiling

perf is the Linux kernel's built-in profiler. It tells you not just that the CPU is busy, but exactly where the cycles are going.

# Install perf
sudo apt install linux-tools-common linux-tools-$(uname -r)

# Count CPU events for 10 seconds
sudo perf stat -a sleep 10

# Record CPU samples for a running process
sudo perf record -p $(pgrep myapp) -g sleep 30

# View the profile
sudo perf report

# Quick: which functions use the most CPU system-wide
sudo perf top

perf is invaluable when top tells you a process is using 100% CPU but you need to know which function or code path is responsible. The output of perf report shows a call graph ranked by sample count.

Memory Analysis

free: The Memory Overview

free -h
              total        used        free      shared  buff/cache   available
Mem:          7.8Gi       5.1Gi       500Mi       128Mi       2.2Gi       2.1Gi
Swap:         2.0Gi       247Mi       1.8Gi

Understanding each column is critical:

  • total: Physical RAM installed.
  • used: Memory allocated to processes (includes shared memory, excludes buffers/cache on modern kernels).
  • free: Completely unused memory. On a healthy Linux system, this is often near zero and that is normal.
  • shared: Memory used by tmpfs filesystems and shared memory segments.
  • buff/cache: Memory used by the kernel for disk caching. Buffers are metadata caches (directory entries, inodes), while cache holds actual file contents (page cache).
  • available: An estimate of how much memory is available for new applications without swapping. This is the number that matters most.

The biggest misconception in Linux memory management: seeing "free" near zero and thinking the server is out of memory. Linux deliberately uses free RAM for page cache because unused RAM is wasted RAM. The available column accounts for reclaimable cache and is your real indicator of memory headroom.
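That headroom figure is easy to compute directly from /proc/meminfo. A minimal sketch:

```shell
# Memory headroom: MemAvailable as a percentage of MemTotal.
mem_headroom_pct() {   # args: MemAvailable (kB), MemTotal (kB)
    awk -v a="$1" -v t="$2" 'BEGIN { printf "%.0f\n", a * 100 / t }'
}

avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo "headroom: $(mem_headroom_pct "$avail" "$total")% of RAM available"
```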

vmstat: Real-Time Memory Dynamics

vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  1 247500 512300 102400 2150000    0    5    12   250 1500 3200 45  8 44  2  0
 5  2 247500 498200 102400 2148000    0    0     8  1200 2100 4500 72 12 14  1  0
 2  0 247500 510100 102500 2151000    0    0     4   150 1400 3100 40  7 50  2  0

Key columns decoded:

| Column | Meaning | Red Flag |
|---|---|---|
| r | Run queue (runnable processes) | Consistently exceeds CPU core count |
| b | Blocked processes (uninterruptible sleep, usually I/O) | Sustained values above 0 |
| swpd | Swap used (KB) | Growing over time |
| si | Swap in from disk (KB/s) | Any non-zero value |
| so | Swap out to disk (KB/s) | Any non-zero value |
| bi | Blocks read from disk (KB/s) | Context-dependent |
| bo | Blocks written to disk (KB/s) | Context-dependent |
| in | Interrupts per second | Baseline-dependent |
| cs | Context switches per second | Sudden spikes indicate contention |

Active swapping (si and so both non-zero) is one of the strongest signals of memory pressure. The system is thrashing: reading pages from swap only to write other pages out.
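A sketch that turns this into a yes/no check. The parser is separated from vmstat so it also works on captured output; field positions 7 and 8 (si/so) match the vmstat layout shown above:

```shell
# Detect active swapping: any non-zero si/so (fields 7 and 8) in vmstat
# samples. The first sample is averages since boot, so it is skipped.
swap_thrash_check() {   # reads vmstat output on stdin
    awk 'NR > 3 && ($7 + 0 > 0 || $8 + 0 > 0) { hit = 1 }
         END { print hit ? "swapping" : "quiet" }'
}

if command -v vmstat >/dev/null 2>&1; then
    vmstat 1 3 | swap_thrash_check
fi
```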

/proc/meminfo: The Full Picture

cat /proc/meminfo

Key entries beyond what free shows:

  • Slab: Memory used by kernel data structures (dentries, inodes, etc.). Can grow large on systems with millions of files.
  • SReclaimable: Portion of Slab that can be reclaimed under pressure.
  • SUnreclaim: Slab memory that cannot be reclaimed.
  • PageTables: Memory used for page table mappings. Grows with the number of processes and their virtual address space.
  • Committed_AS: Total memory committed to processes. If this exceeds physical RAM + swap, the system is overcommitted.

# Check slab usage
slabtop -o | head -20

# Or from /proc
grep -i slab /proc/meminfo
# Slab:            350000 kB
# SReclaimable:    280000 kB
# SUnreclaim:       70000 kB

Page Cache vs Buffers

The page cache stores file contents read from disk. When you read a file, the kernel keeps it in memory so subsequent reads are served from RAM. Buffers store filesystem metadata like directory listings and inode information.

# Drop page cache (emergency only, impacts performance)
echo 3 | sudo tee /proc/sys/vm/drop_caches

# 1 = page cache only, 2 = dentries and inodes, 3 = both

Never drop caches on a production database server unless you understand the consequences. The database will need to re-read its entire working set from disk.

The OOM Killer: When Memory Runs Out

When the system exhausts both physical memory and swap, the OOM (Out of Memory) killer activates. It scores every process and kills the one with the highest score to free memory.

# Check for recent OOM kills
dmesg | grep -i "oom"
journalctl -k | grep -i "oom"

# Example OOM log entries:
# [11686.040460] java invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE)
# [11686.040462] Out of memory: Kill process 12345 (java) score 850 or sacrifice child
# [11686.040464] Killed process 12345 (java) total-vm:8234560kB, anon-rss:6120000kB

The OOM score is calculated based on the proportion of memory a process consumes. You can view and adjust it:

# View a process's current OOM score
cat /proc/$(pgrep -f "my-app")/oom_score

# Protect a critical process (range: -1000 to 1000)
echo -1000 | sudo tee /proc/$(pgrep -f "my-critical-app")/oom_score_adj

# In a systemd unit file
# [Service]
# OOMScoreAdjust=-900

Setting oom_score_adj to -1000 (OOMScoreAdjust=-1000 in a unit file) makes a process immune to the OOM killer. Use this sparingly and only for truly critical services like your init process or database.
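To see which processes the OOM killer would target first, a sketch ranking everything by its current oom_score:

```shell
# Rank processes by current oom_score (highest = first to be killed).
for pid in /proc/[0-9]*; do
    score=$(cat "$pid/oom_score" 2>/dev/null) || continue
    printf '%s %s %s\n' "$score" "${pid#/proc/}" "$(cat "$pid/comm" 2>/dev/null)"
done | sort -rn | head -10
```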

Swap Usage Investigation

# Overall swap usage
swapon --show
# NAME      TYPE      SIZE   USED PRIO
# /dev/dm-1 partition   2G  247.5M   -2

# Find processes using the most swap
for pid in /proc/[0-9]*; do
    SWAP=$(awk '/VmSwap/{print $2}' "$pid/status" 2>/dev/null || echo 0)
    if [ "$SWAP" -gt 0 ] 2>/dev/null; then
        NAME=$(awk '/Name/{print $2}' "$pid/status" 2>/dev/null)
        echo "${SWAP} kB - PID $(basename $pid) - $NAME"
    fi
done | sort -rn | head -10

Tuning swap behavior:

# Check current swappiness (default is 60)
cat /proc/sys/vm/swappiness

# For database servers, reduce swappiness to prefer dropping cache over swapping
sudo sysctl vm.swappiness=10

# Make it persistent
echo "vm.swappiness=10" | sudo tee -a /etc/sysctl.d/99-tuning.conf

A swappiness of 10 tells the kernel to strongly prefer reclaiming page cache over swapping application memory. For database servers where keeping data in process memory is critical, this is almost always the right setting.

Disk I/O Analysis

df and du: Space Usage

# Filesystem space usage
df -h
# Filesystem      Size  Used Avail Use% Mounted on
# /dev/sda1        50G   42G  5.8G  88% /
# /dev/sdb1       200G  120G   71G  63% /data

# Inode usage (can run out even with free space)
df -i

# Find the largest directories under /var
du -sh /var/* 2>/dev/null | sort -rh | head -10

# Find files larger than 100MB
find / -xdev -type f -size +100M -exec ls -lh {} \; 2>/dev/null | sort -k5 -rh | head -10

A full filesystem (100%) is one of the most common causes of application failures, and inode exhaustion (lots of tiny files) can happen even when df -h shows free space. Monitor both.
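Both dimensions can be watched with one helper. A sketch; the 80% threshold is an assumption to tune, and POSIX `df -P` keeps output to one line per filesystem so the parse stays simple:

```shell
# Warn when any filesystem crosses a usage threshold; works for both
# block usage (df -P) and inode usage (df -Pi).
df_alert() {   # reads POSIX 'df -P' style output on stdin; $1 = percent threshold
    awk -v t="$1" 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 >= t) print $6, $5 "%" }'
}

echo "-- space --";  df -P  | df_alert 80
echo "-- inodes --"; df -Pi | df_alert 80
```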

lsblk and blkid: Disk Topology

# Block device tree
lsblk -f
# NAME   FSTYPE LABEL UUID                                 MOUNTPOINT
# sda
# +-sda1 ext4         a1b2c3d4-e5f6-7890-abcd-ef1234567890 /
# +-sda2 swap         f1e2d3c4-b5a6-7890-dcba-0987654321fe [SWAP]
# sdb
# +-sdb1 xfs          12345678-abcd-ef01-2345-678901234567 /data

# Identify filesystems
blkid

iostat: The Disk Performance Microscope

# Install sysstat if not present
sudo apt install sysstat

# Extended statistics, 1-second intervals, 5 samples
iostat -xz 1 5
Device   r/s     w/s    rkB/s    wkB/s  rrqm/s  wrqm/s  r_await  w_await  aqu-sz  %util
sda     12.50   85.30   200.0   4250.0    0.50    15.20     0.80     5.20    0.45   20.50
nvme0n1  5.20  120.40   320.0   8500.0    0.00     8.50     0.15     0.35    0.04    1.00

The critical metrics:

| Metric | Meaning | Concern Thresholds |
|---|---|---|
| r_await / w_await | Average time (ms) for read/write requests | SSD: above 5ms; HDD: above 20ms |
| %util | Percentage of time the device had I/O in progress | Above 80% sustained |
| aqu-sz | Average queue length | Above 1 consistently |
| rrqm/s / wrqm/s | Merged requests per second | Low values may indicate random I/O |

A common mistake is looking only at %util. A device at 100% utilization that handles all requests with sub-millisecond latency is not a problem. The await values tell you if the disk is actually struggling.
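That check can be automated. A sketch that flags devices whose await exceeds a latency budget; since field positions differ across sysstat versions, the columns are located by header name rather than hard-coded:

```shell
# Flag devices whose average await exceeds a latency budget (ms).
await_alert() {   # reads 'iostat -x' output on stdin; $1 = threshold in ms
    awk -v t="$1" '
        /r_await/ { for (i = 1; i <= NF; i++) {
                        if ($i == "r_await") r = i
                        if ($i == "w_await") w = i } }
        r && $1 ~ /^(sd|nvme|vd|xvd|dm-)/ && ($r + 0 > t || $w + 0 > t) {
            print $1, "r_await=" $r, "w_await=" $w }'
}

if command -v iostat >/dev/null 2>&1; then
    iostat -xz 1 2 | await_alert 20
fi
```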

I/O Scheduler Tuning

# Check current scheduler
cat /sys/block/sda/queue/scheduler
# [mq-deadline] kyber bfq none

# Change scheduler (does not persist across reboot)
echo "kyber" | sudo tee /sys/block/sda/queue/scheduler

General guidance:

  • mq-deadline: Good default for most workloads. Prevents starvation.
  • kyber: Designed for fast devices (NVMe SSDs). Low overhead.
  • bfq: Best for interactive/desktop workloads. Provides fair I/O bandwidth.
  • none: For NVMe devices that have their own internal queuing. Lowest overhead.

For NVMe SSDs in server environments, none or kyber is typically optimal. For spinning disks, mq-deadline works well.
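That guidance can be applied mechanically by reading each device's rotational flag. A sketch; the none-vs-mq-deadline mapping is the assumption stated above, and the apply step (commented out) needs root and does not persist across reboots:

```shell
# Choose a scheduler from the device's rotational flag.
# Assumption: 'none' for SSD/NVMe, 'mq-deadline' for spinning disks.
choose_sched() {   # arg: contents of /sys/block/<dev>/queue/rotational
    if [ "$1" = "0" ]; then echo "none"; else echo "mq-deadline"; fi
}

for dev in /sys/block/*; do
    rot="$dev/queue/rotational"
    [ -r "$rot" ] || continue
    echo "$(choose_sched "$(cat "$rot")") -> ${dev#/sys/block/}"
    # To apply (root, non-persistent):
    # echo "$(choose_sched "$(cat "$rot")")" | sudo tee "$dev/queue/scheduler"
done
```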

iowait Explained

When a CPU reports iowait, it means the CPU is idle AND there are processes waiting for I/O to complete. It is a subset of idle time, not additional overhead. High iowait means the disk (or NFS, or other block device) is the bottleneck.

# Confirm iowait with vmstat
vmstat 1 5
# Look at the 'wa' column in the cpu section

# Identify which processes are in D state (uninterruptible sleep, usually I/O)
ps aux | awk '$8 ~ /^D/'

iotop: Identify the I/O Hog

# Show only processes doing I/O, accumulated totals
sudo iotop -oa

# Total DISK READ:       5.20 M/s | Total DISK WRITE:      25.80 M/s
# PID   PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO    COMMAND
# 12345 be/4  mysql      4.80 M/s   22.50 M/s  0.00 %  65.00 % mysqld
# 8901  be/4  root       0.40 M/s    3.30 M/s  0.00 %  12.00 % rsync

The -o flag filters to only processes actively doing I/O, cutting through noise. The -a flag accumulates totals so you see the bigger picture, not just instantaneous rates.

Network Performance

ss: The Modern Socket Tool

ss has replaced netstat on modern Linux systems. It is faster and provides more detail.

# Connection summary
ss -s
# Total: 1250
# TCP:   980 (estab 750, closed 100, orphaned 12, timewait 95)

# All TCP connections with process names
ss -tlnp

# Find all connections to a specific port
ss -tn dst :3306

# Count connections per state
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn
#    750 ESTAB
#    100 TIME-WAIT
#     95 CLOSE-WAIT
#     20 LISTEN
#     15 FIN-WAIT-2

# Detailed TCP info (retransmits, RTT, congestion window)
ss -ti dst :443

Connection state red flags:

  • Many TIME-WAIT: Short-lived connections being created and destroyed. Fix with connection pooling or net.ipv4.tcp_tw_reuse=1.
  • Growing CLOSE-WAIT: The application is not closing sockets properly. This is a bug in your code.
  • SYN-RECV buildup: Potential SYN flood attack or backlog too small.
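The CLOSE-WAIT case is the one worth automating: run a per-state count twice a few minutes apart and compare. A sketch, with the parser separated from ss so it also works on captured output:

```shell
# Count sockets per TCP state; growth in CLOSE-WAIT between two runs
# points at the application leaking sockets.
tcp_state_count() {   # reads 'ss -tan' output on stdin
    awk 'NR > 1 { c[$1]++ } END { for (s in c) print c[s], s }'
}

if command -v ss >/dev/null 2>&1; then
    ss -tan | tcp_state_count | sort -rn
fi
```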

iftop and nethogs: Real-Time Bandwidth

# Bandwidth per connection pair
sudo iftop -i eth0

# Bandwidth per process
sudo nethogs eth0

iftop shows bandwidth between host pairs, which is great for spotting unexpected traffic or DDoS. nethogs groups traffic by process, instantly answering "which process is consuming bandwidth."

Bandwidth Testing with iperf3

# On the server side
iperf3 -s

# On the client side
iperf3 -c server-ip -t 30 -P 4
# -t 30 = test for 30 seconds
# -P 4 = 4 parallel streams

# Sample output
# [SUM]   0.00-30.00  sec  3.28 GBytes   940 Mbits/sec   sender
# [SUM]   0.00-30.00  sec  3.27 GBytes   938 Mbits/sec   receiver

iperf3 tests the raw network capacity between two points. If iperf3 shows full bandwidth but your application is slow, the bottleneck is in the application layer, not the network.

Packet Drops and TCP Retransmits

# Interface-level statistics
ip -s link show eth0
# Look for RX/TX errors and dropped counts

# TCP retransmit statistics
netstat -s | grep -i retrans
#     12543 segments retransmitted
#     85 fast retransmits

# Per-connection retransmit info
ss -ti | grep -B1 retrans

# Watch packet drops in real time
watch -n1 'cat /proc/net/dev'

TCP retransmits indicate packet loss somewhere in the path. A retransmit rate above 1-2% degrades throughput significantly because TCP backs off its sending rate. Common causes include network congestion, faulty cables, and overloaded switches.
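The 1-2% figure can be computed directly from the kernel's Tcp counters in /proc/net/snmp (OutSegs and RetransSegs). A sketch; columns are located by header name since the counter list varies by kernel:

```shell
# Retransmit rate as a percentage of segments sent out.
tcp_retrans_pct() {   # reads /proc/net/snmp on stdin
    awk '/^Tcp:/ {
        if (!hdr) {
            for (i = 1; i <= NF; i++) {
                if ($i == "OutSegs") o = i
                if ($i == "RetransSegs") r = i }
            hdr = 1
        } else if ($o + 0 > 0) printf "%.2f\n", $r * 100 / $o
          else print "0.00"
    }'
}

tcp_retrans_pct < /proc/net/snmp
```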

Process Analysis

ps: Process Snapshots

# Top CPU consumers
ps aux --sort=-%cpu | head -10

# Top memory consumers
ps aux --sort=-%mem | head -10

# Tree view showing parent-child relationships
ps auxf

# Find zombie processes
ps aux | awk '$8 ~ /Z/ {print}'

# Long-running processes
ps -eo pid,user,etime,args --sort=-etime | head -10

pstree: Visual Process Hierarchy

pstree -p
# systemd(1)---nginx(1234)---nginx(1235)
#           |               ---nginx(1236)
#           ---sshd(800)---sshd(12340)---bash(12345)---top(12400)

pstree is invaluable for understanding which processes spawned which. If you see hundreds of child processes under a single parent, that parent may be fork-bombing.

strace: System Call Tracing

When you need to understand exactly what a stuck or slow process is doing:

# Summary of syscalls for a running process
sudo strace -p $(pgrep myapp) -c -S time
# % time     seconds  usecs/call     calls    errors syscall
# ------ ----------- ----------- --------- --------- --------
#  65.12    2.450000         245     10000           read
#  20.30    0.764000         764      1000           write
#   8.50    0.320000          32     10000      8000 open
#   6.08    0.229000          23     10000           close

# Trace only file operations
sudo strace -p $(pgrep myapp) -e trace=open,read,write,close

# Trace network activity
sudo strace -p $(pgrep myapp) -e trace=network

# Follow child processes and threads, with timestamps
sudo strace -p $(pgrep myapp) -f -tt -T

# Trace a command from start
strace -c curl https://example.com 2>&1

strace adds significant overhead (it pauses the process on every syscall). Never use it on latency-sensitive production processes unless you have no alternative. For lighter profiling, consider perf or bpftrace.

lsof: Open Files and File Descriptors

# Files opened by a specific process
lsof -p $(pgrep nginx)

# Who has a specific file open
lsof /var/log/syslog

# Deleted files still holding disk space
lsof +L1

# Network connections for a process
lsof -i -a -p $(pgrep myapp)

# Count open file descriptors per process
lsof -p $(pgrep myapp) | wc -l

# Check process limits vs actual usage
cat /proc/$(pgrep myapp)/limits | grep "open files"
ls /proc/$(pgrep myapp)/fd | wc -l

A classic gotcha: deleted log files that are still open by a process continue consuming disk space. df shows the space as used, but du cannot account for it because the directory entry is gone. lsof +L1 reveals these phantom files. Restarting the holding process or sending it a signal to reopen its log files frees the space.
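To quantify how much space those phantom files are holding, a sketch that sums the SIZE/OFF column (field 7, bytes for regular files) from lsof +L1 output:

```shell
# Sum the space still held by deleted-but-open files.
deleted_open_bytes() {   # reads 'lsof +L1' output on stdin
    awk 'NR > 1 && $7 ~ /^[0-9]+$/ { sum += $7 } END { printf "%d\n", sum + 0 }'
}

if command -v lsof >/dev/null 2>&1; then
    echo "phantom bytes: $(lsof +L1 2>/dev/null | deleted_open_bytes)"
fi
```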

/proc/PID: The Process Filesystem

Every running process has a directory under /proc containing detailed information:

PID=$(pgrep myapp)

# Process status (memory, threads, signals)
cat /proc/$PID/status

# Memory map
cat /proc/$PID/maps | head -20

# Current working directory
ls -la /proc/$PID/cwd

# Environment variables
cat /proc/$PID/environ | tr '\0' '\n'

# Open file descriptors
ls -la /proc/$PID/fd | head -20

# I/O statistics
cat /proc/$PID/io
# rchar: 1234567890    (bytes read)
# wchar: 9876543210    (bytes written)
# read_bytes: 500000000  (actual disk reads)
# write_bytes: 800000000 (actual disk writes)

System Overview Tools

sar: Historical Performance Data

sar (System Activity Reporter) is part of the sysstat package and is indispensable for post-incident analysis.

sudo apt install sysstat
# Enable data collection
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
sudo systemctl enable --now sysstat
# CPU utilization for today
sar -u

# Memory usage
sar -r

# Disk I/O
sar -d

# Network throughput
sar -n DEV

# Yesterday's data
sar -u -f /var/log/sysstat/sa$(date -d yesterday +%d)

# Specific time range
sar -u -s 09:00:00 -e 12:00:00

# All metrics for a time range (comprehensive)
sar -A -s 02:00:00 -e 04:00:00

sar data is collected every 10 minutes by default. When you are analyzing a 3 AM incident the next morning, sar is often the only source of historical data. Install it on every server.

dstat: Real-Time Everything

# Default: CPU, disk, net, paging, system
dstat

# Custom selection
dstat -tcmndg --top-cpu --top-mem

dstat combines vmstat, iostat, and ifstat into a single colorized output. It is excellent for watching correlations between subsystems in real time during load testing or incident response.

nmon and glances

# nmon: Interactive performance monitor
nmon
# Press c for CPU, m for memory, d for disks, n for network

# glances: Modern system monitor with web UI option
pip install glances
glances
glances -w   # Start web server mode on port 61208

glances is particularly useful because it highlights values in red/yellow when they cross thresholds, and it can export to InfluxDB, Elasticsearch, or other backends.

Log Analysis for Troubleshooting

journalctl: systemd Journal

# Logs for a specific service
journalctl -u nginx --since "1 hour ago"

# Kernel messages only
journalctl -k

# Follow mode (like tail -f)
journalctl -f -u myapp

# Logs around a specific time
journalctl --since "2026-03-23 02:00:00" --until "2026-03-23 04:00:00"

# Error and critical priority only
journalctl -p err..crit --since "today"

# Disk usage by journal
journalctl --disk-usage

Traditional Log Files

# System messages
tail -100 /var/log/syslog          # Debian/Ubuntu
tail -100 /var/log/messages        # RHEL/CentOS

# Authentication events
tail -50 /var/log/auth.log

# Kernel ring buffer (most recent hardware and kernel events)
dmesg --human --level=err,warn

# OOM events
dmesg | grep -i "oom\|out of memory"

# Disk errors
dmesg | grep -i "error\|fault\|fail" | grep -i "sd\|nvme\|disk\|ext4\|xfs"

cgroups and Resource Limits

Control groups (cgroups) allow you to limit, prioritize, and account for resource usage per process group. systemd uses cgroups extensively.

# View cgroup resource usage for systemd services
systemd-cgtop

# Check a service's cgroup limits
systemctl show myapp.service | grep -i "memory\|cpu"

# View cgroup hierarchy
cat /proc/$(pgrep myapp)/cgroup

Setting resource limits in a systemd unit file:

# /etc/systemd/system/myapp.service
# [Service]
# MemoryMax=2G
# MemoryHigh=1.5G
# CPUQuota=200%
# TasksMax=512
# IOWeight=100

  • MemoryMax: Hard limit. Process gets OOM-killed if exceeded.
  • MemoryHigh: Soft limit. Kernel throttles the process but does not kill it.
  • CPUQuota: 200% means the process can use up to 2 full CPU cores.
  • TasksMax: Maximum number of tasks (threads + processes).

# Apply changes
sudo systemctl daemon-reload
sudo systemctl restart myapp

cgroups are your safety net in production. Even if a Java process has a memory leak, MemoryMax ensures it gets killed before it takes down the entire server.

Practical Troubleshooting Scenarios

Scenario 1: High Load Average but Low CPU Usage

You see load average at 20 on a 4-core box, but CPU usage is only 15%.

# Check for processes in D state (uninterruptible sleep)
ps aux | awk '$8 ~ /^D/'

# Confirm I/O wait
vmstat 1 5
# Look for high 'wa' and high 'b' (blocked processes)

# Identify what's doing the I/O
sudo iotop -oa

The diagnosis: 16 processes are stuck waiting on an NFS mount that has become unresponsive. They show as D state and inflate the load average, but they are not using CPU. The fix is to address the NFS server, not add more CPU.

Production war story: A web application server showed load average of 45 with CPUs nearly idle. Every request was hanging. Investigation revealed a saturated SAN storage device that had increased I/O latency from 2ms to 800ms. All application threads were blocked on disk reads. The fix was migrating the hot data to local NVMe storage.

Scenario 2: OOM Kills in Production

Your application keeps getting killed. dmesg confirms OOM kills.

# Find the OOM kill details
dmesg | grep -A 5 "Out of memory"

# Check current memory state
free -h
cat /proc/meminfo | grep -E "MemTotal|MemAvailable|SwapTotal|SwapFree|Committed"

# Identify memory hogs
ps aux --sort=-%rss | head -10

# Check if there is a leak (watch RSS grow over time)
while true; do
    ps -o pid,rss,vsz,comm -p $(pgrep myapp)
    sleep 60
done

Immediate mitigations:

# Add emergency swap if none exists
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Protect the most critical process
echo -1000 | sudo tee /proc/$(pgrep -f "my-database")/oom_score_adj

# Set memory limits to contain the leak
sudo systemctl set-property myapp.service MemoryMax=4G

Long-term: profile the application for memory leaks with language-specific tools (Valgrind for C/C++, heap dumps for Java, tracemalloc for Python).

Scenario 3: Disk Space Emergency

Application is throwing "No space left on device" errors.

# Step 1: What's full?
df -h
df -i  # Check inodes too

# Step 2: Find the biggest consumers
du -sh /* 2>/dev/null | sort -rh | head -10
du -sh /var/* 2>/dev/null | sort -rh | head -10
du -sh /var/log/* 2>/dev/null | sort -rh | head -10

# Step 3: Check for deleted files still holding space
lsof +L1

# Step 4: Quick wins for freeing space
sudo journalctl --vacuum-size=100M    # Trim systemd journal
sudo apt clean                         # Clear apt cache
sudo find /var/log -name "*.gz" -mtime +30 -delete  # Old compressed logs
sudo find /tmp -type f -mtime +7 -delete  # Old temp files

Production war story: A server showed 100% disk usage but du -sh / only accounted for 30GB of a 50GB disk. The culprit was a runaway application writing to a log file that had been deleted while the process still had it open. lsof +L1 revealed a 20GB deleted file still held by the process. A kill -HUP to the process (which made it reopen its log file) instantly freed the space.

Scenario 4: Network Latency Spikes

Users report intermittent slowness.

# Check for packet loss
ping -c 100 target-host
# Look for packet loss percentage and latency variance

# Check interface errors
ip -s link show eth0

# TCP retransmits
netstat -s | grep -i retrans
ss -ti dst :443 | grep retrans

# Check for bandwidth saturation
sudo iftop -i eth0

# Check socket buffer overflows
netstat -s | grep -i overflow
cat /proc/net/netstat | grep -i drop

Common fixes:

# Increase socket buffer sizes
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

# Increase connection backlog
sudo sysctl -w net.core.somaxconn=65535
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=65535

Scenario 5: Zombie Processes

top shows zombie processes accumulating.

# Find zombies
ps aux | awk '$8 == "Z"'

# Find the parent of zombies
ps -eo pid,ppid,stat,args | awk '$3 ~ /Z/'
# Then check the parent
ps -p PARENT_PID -o pid,args

# The parent is not calling wait() on its children
# Option 1: Send SIGCHLD to the parent
kill -SIGCHLD PARENT_PID

# Option 2: If that does not work, kill the parent
# The zombies will be adopted by init and reaped
kill PARENT_PID

Zombies themselves consume no resources (just a process table entry), but they indicate a buggy parent process. If they accumulate into the thousands, they can exhaust the PID space.
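A quick sketch to quantify the exposure: count zombies and print the PID space they could eventually exhaust:

```shell
# Count zombie processes against the PID space.
zombie_count() {   # reads 'ps -eo stat=' output on stdin
    grep -c '^Z' || true
}

if command -v ps >/dev/null 2>&1; then
    echo "zombies: $(ps -eo stat= | zombie_count) (pid_max: $(cat /proc/sys/kernel/pid_max))"
fi
```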

Scenario 6: Memory Leak Detection

You suspect a process is leaking memory.

```bash
# Track RSS growth over time
pidstat -r -p $(pgrep myapp) 60
# Watch the RSS column. If it grows steadily without plateau, it's leaking.

# Sample output showing a leak:
# Time       PID  minflt/s  majflt/s     VSZ      RSS   %MEM  Command
# 10:00:00  1234   150.00      0.00  4500000  1200000  15.0  myapp
# 10:01:00  1234   180.00      0.00  4600000  1350000  16.9  myapp
# 10:02:00  1234   200.00      0.00  4750000  1520000  19.0  myapp
# 10:03:00  1234   210.00      0.00  4900000  1700000  21.3  myapp

# Check the memory map for clues
cat /proc/$(pgrep myapp)/smaps_rollup
# Look at the Rss, Pss, and Anonymous fields

# For native code, use valgrind
valgrind --leak-check=full ./myapp

# For Java, capture a heap dump
jmap -dump:format=b,file=heap.hprof $(pgrep java)

# For Python, enable tracemalloc in your code or start the interpreter with:
# python -X tracemalloc your_script.py
```

A useful heuristic: if RSS grows linearly with time regardless of load, it is almost certainly a leak. If RSS grows with load and stabilizes, it is normal working set expansion.
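The tracemalloc route can also be used programmatically. A minimal sketch that diffs two snapshots to find growing allocation sites; the deliberately leaky list stands in for a real application's leak:

```python
import tracemalloc

tracemalloc.start()

leak = []                                  # stand-in for a real leak
before = tracemalloc.take_snapshot()
for _ in range(10_000):
    leak.append("x" * 100)                 # allocations that are never freed
after = tracemalloc.take_snapshot()

# Rank allocation sites by how much they grew between the two snapshots
top = after.compare_to(before, "lineno")
for stat in top[:3]:
    print(stat)

net_growth = sum(stat.size_diff for stat in top)
print(f"net growth: {net_growth / 1024:.0f} KiB")
```

In a real service you would take snapshots minutes apart under steady load; a site that keeps climbing between every pair of snapshots is your leak.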

Performance Baselines and Monitoring

Troubleshooting without a baseline is guesswork. You need to know what "normal" looks like.

Establishing Baselines

```bash
# Capture a baseline during normal operation
sar -A > /root/baseline-$(date +%Y%m%d).txt

# Key metrics to baseline
echo "=== CPU ===" && mpstat -P ALL 1 5
echo "=== Memory ===" && free -h && cat /proc/meminfo | head -20
echo "=== Disk ===" && iostat -xz 1 5
echo "=== Network ===" && ss -s
echo "=== Connections ===" && ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn
```

Monitoring Stack Recommendations

| Tool | Use Case | Complexity |
|------|----------|------------|
| sar (sysstat) | Historical data on every server, zero dependencies | Low |
| Prometheus + node_exporter | Metrics collection with alerting | Medium |
| Grafana | Visualization dashboards | Medium |
| Netdata | Real-time, auto-configured dashboards | Low |
| Datadog / New Relic | Full-stack observability (SaaS) | Low (but costs money) |

At minimum, every production server should have sysstat installed and enabled. It costs nothing and provides historical data that is invaluable during incident response.

Alerting Thresholds

Start with these thresholds and tune based on your baseline:

| Metric | Warning | Critical |
|--------|---------|----------|
| CPU utilization | 80% sustained 5 min | 95% sustained 5 min |
| Memory available | 20% of total | 10% of total |
| Disk space used | 80% | 90% |
| Disk I/O await | 10 ms (SSD) | 50 ms (SSD) |
| Swap usage | Any increase | Above 500 MB |
| Load average | 2x CPU count | 4x CPU count |
| TCP retransmits | 1% of segments | 5% of segments |
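The load-average thresholds translate directly into a check. A sketch using the 2x/4x rule; `load_status` is a hypothetical helper, not a standard tool:

```python
import os

def load_status(load1: float, cpus: int) -> str:
    """Classify a 1-minute load average: warn at 2x CPU count, critical at 4x."""
    if load1 >= 4 * cpus:
        return "critical"
    if load1 >= 2 * cpus:
        return "warning"
    return "ok"

load1, _, _ = os.getloadavg()
cpus = os.cpu_count() or 1
print(f"load {load1:.2f} on {cpus} CPUs -> {load_status(load1, cpus)}")
```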

Common Performance Anti-Patterns

| Anti-Pattern | Symptom | Fix |
|--------------|---------|-----|
| No swap on production servers | OOM kills under memory spikes | Add 1-2 GB swap as a safety net |
| Default swappiness=60 on DB servers | Database page cache gets swapped out | Set vm.swappiness=10 |
| Running out of file descriptors | "Too many open files" errors | Increase LimitNOFILE in systemd unit |
| Disk full from unrotated logs | Application crashes on write errors | Configure logrotate, monitor at 80% |
| TIME-WAIT connection flooding | Port exhaustion, connection failures | Connection pooling, tcp_tw_reuse=1 |
| Single-threaded app on 16 cores | 1 core at 100%, 15 idle | Scale with workers or horizontal instances |
| No monitoring or baseline data | Incidents take hours to diagnose | Deploy Prometheus + Grafana or at minimum sar |
| No resource limits (cgroups) | One runaway process takes down the server | Set MemoryMax and CPUQuota in systemd units |
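Two of these fixes, LimitNOFILE and the cgroup resource limits, live in the same place: the service's systemd unit. An illustrative drop-in (the service name, path, and values are assumptions to tune for your workload; systemd unit files do not allow inline comments, so each comment gets its own line):

```ini
# /etc/systemd/system/myapp.service.d/limits.conf (illustrative path)
[Service]
# cgroup hard cap: the kernel OOM-kills the service if it exceeds this
MemoryMax=2G
# at most two CPUs' worth of time
CPUQuota=200%
# raise the open-file-descriptor limit
LimitNOFILE=65536
```

Run `sudo systemctl daemon-reload` and restart the service to apply.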

Key Takeaways

Performance troubleshooting is a methodology, not a bag of commands. The USE method gives you a framework: check utilization, saturation, and errors for each resource. The tools are mostly straightforward, with top, free, iostat, and ss covering 80% of scenarios. What separates effective troubleshooting from flailing is discipline: measure first, form a hypothesis, verify, then act.

Install sysstat and enable sar data collection on every server before you need it. Set up resource limits with cgroups so one misbehaving process cannot bring down the entire system. Establish baselines during normal operation so you know what "healthy" looks like. And never optimize based on a guess. A wrong diagnosis wastes time and can make the problem worse.

The production incidents that resolve quickly share a common pattern: the engineer had the right tools installed, knew what normal looked like, and followed a systematic method to identify the bottleneck. Everything in this guide is aimed at making you that engineer.

