Troubleshooting Workflow

A repeatable process for diagnosing Linux service problems.

The short version
  • check status
  • check logs
  • check config
  • check network
  • check auth/certs/time
  • change one thing at a time

Step 1: Define the symptom

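Not "it's broken." Say exactly what is wrong: which service, what you did, what you expected, and what happened instead, including the exact error text.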

Write it down. A clear description of the symptom prevents you from chasing the wrong problem. It also helps when asking someone else for help.

Step 2: Check service state

systemctl status SERVICE
journalctl -u SERVICE -n 50
journalctl -u SERVICE --since today

Why: if the service is dead, fix that first. Logs often tell you exactly why startup failed — look for failed, error, or the last line before it stopped.
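
As a rough sketch of that advice, the helper below reads log text on standard input and prints every line that mentions "failed" or "error", followed by the final line. The name log_highlights is made up for this example and is not part of systemd.

```shell
# Print lines that mention "failed" or "error" (case-insensitive),
# then the last line of the input. Reads log text on stdin.
log_highlights() {
    awk '
        { last = $0 }
        tolower($0) ~ /failed|error/ { print "match: " $0 }
        END { if (NR > 0) print "last:  " last }
    '
}

# Example (SERVICE is a placeholder, as in the commands above):
#   journalctl -u SERVICE -n 50 --no-pager | log_highlights
```

Because it only filters text, the same helper works on a saved log file: log_highlights < /path/to/logfile.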

Step 3: Validate config

Use the service's built-in config validation tool:

nginx -t
apachectl configtest
postfix check
doveconf -n

Why: many failures are just bad config syntax or invalid values. Fixing a syntax error is faster than debugging a running service.
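
One way to make validation a habit is to gate every reload on it. The sketch below assumes you pass the check command and the reload command as two strings; reload_if_valid is a hypothetical helper name, not a standard tool.

```shell
# Usage: reload_if_valid CHECK_COMMAND RELOAD_COMMAND
# Runs RELOAD_COMMAND only when CHECK_COMMAND exits successfully.
reload_if_valid() {
    if sh -c "$1"; then
        sh -c "$2"
    else
        echo "config check failed, skipping: $2" >&2
        return 1
    fi
}

# Example:
#   reload_if_valid "nginx -t" "systemctl reload nginx"
```

A failed check leaves the running service untouched, so a typo in the config never takes down something that was working.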

Step 4: Check network and ports

ss -tulpn
nc -zv host port
curl -vk https://host
ip a
ip r
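
The output of ss -tulpn gets long on a busy host. A minimal sketch for answering "is anything bound to this port?" is shown below. It assumes the column layout ss prints for -tulpn, where the fifth column is "Local Address:Port"; listening_on is a made-up helper name.

```shell
# Succeeds (and prints the matching rows) if any socket in the
# `ss -tulpn` output read on stdin is bound to PORT.
listening_on() {
    awk -v port="$1" '
        $5 ~ (":" port "$") { print; found = 1 }
        END { exit !found }
    '
}

# Example:
#   ss -tulpn | listening_on 443 || echo "nothing is listening on 443"
```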

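Questions to ask:
  • Is the service listening on the expected address and port? (ss -tulpn)
  • Can the port be reached from the client side? (nc -zv host port)
  • Do the TLS handshake and the HTTP request complete? (curl -vk https://host)
  • Does the host have the expected IP address and route? (ip a, ip r)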

Step 5: Check DNS

dig host
host host
nslookup host
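
When you already know which address a name should resolve to, the comparison can be scripted. The sketch below assumes you feed it one answer per line, which is what dig +short prints; resolves_to is a hypothetical helper and 192.0.2.10 is a documentation placeholder address.

```shell
# Succeeds if EXPECTED appears as a whole line in the answers read on stdin.
resolves_to() {
    grep -Fxq -- "$1"
}

# Example:
#   dig +short host | resolves_to 192.0.2.10 || echo "unexpected DNS answer"
```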

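Questions to ask:
  • Does the name resolve at all?
  • Does it resolve to the address you expect?
  • Is the answer coming from the DNS server you expect? (dig prints the responding SERVER)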

Step 6: Check auth, certs, and time

kinit
klist
openssl x509 -in cert.crt -noout -dates
chronyc tracking
timedatectl
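
Reading expiry dates by eye is error-prone, so here is a small sketch that turns the certificate's end date into a number of days. It assumes GNU date (for the -d option) and that openssl is installed; cert_days_left and seconds_to_days are names invented for this example.

```shell
# Convert a number of seconds to whole days (rounds toward zero).
seconds_to_days() {
    echo $(( $1 / 86400 ))
}

# Print how many days remain before the certificate in FILE expires.
# A negative result means the certificate has already expired.
cert_days_left() {
    end=$(openssl x509 -in "$1" -noout -enddate) || return 1
    end=${end#notAfter=}    # strip the "notAfter=" prefix openssl prints
    seconds_to_days $(( $(date -d "$end" +%s) - $(date +%s) ))
}

# Example:
#   cert_days_left cert.crt
```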

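Things to verify:
  • A Kerberos ticket exists and has not expired (klist)
  • The current time falls between the certificate's notBefore and notAfter dates (openssl x509 -dates)
  • The system clock is synchronized (chronyc tracking, timedatectl); both Kerberos and TLS validation fail when the clock is far off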

Step 7: Check permissions and files

ls -l
namei -l /path/to/file
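
namei ships with util-linux and may be missing on minimal systems. As a fallback sketch, the hypothetical helper below lists every ancestor of an absolute path so each one can be inspected with ls -ld. It is an illustration, not a replacement for namei.

```shell
# Print every ancestor of an absolute path, from / down to the path itself.
path_chain() {
    case $1 in
        /*) ;;
        *) echo "path_chain: need an absolute path" >&2; return 1 ;;
    esac
    path=$1
    chain=$path
    while [ "$path" != "/" ]; do
        path=${path%/*}                 # drop the last component
        if [ -z "$path" ]; then path=/; fi
        chain="$path
$chain"
    done
    printf '%s\n' "$chain"
}

# Example:
#   path_chain /path/to/file | while IFS= read -r p; do ls -ld "$p"; done
```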

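Common permission issues:
  • The file's owner, group, or mode does not let the service user read it (ls -l)
  • A parent directory lacks execute (search) permission for the service user (namei -l shows every component of the path)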

Step 8: Change one thing at a time

Discipline: Change one thing, then retest before changing anything else. If you change multiple things at once, you will not know which fix mattered — or what broke something new.

After each change:

  1. Restart or reload the relevant service
  2. Re-run your test
  3. Check logs again
  4. Note what changed and what effect it had
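
Notes are easier to keep when writing one takes a single command. A minimal sketch, assuming a plain-text log in the current directory is good enough; log_change and the CHANGE_LOG variable are invented for this example.

```shell
# Append a timestamped note to a plain-text log.
# The log path defaults to ./changes.log; set CHANGE_LOG to override it.
log_change() {
    printf '%s  %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" \
        >> "${CHANGE_LOG:-changes.log}"
}

# Example (the note text is illustrative):
#   log_change "SERVICE: edited config, reloaded, symptom unchanged"
```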

Common fault domains

When stuck, work through this list:

Resource exhaustion

Services fail silently or in confusing ways when a system runs out of disk space, inodes, memory, or file descriptors. These are easy to miss because the error messages often point elsewhere.

Disk space

df -h      # show disk usage by filesystem (human-readable)
df -i      # show inode usage — a full inode table also stops writes

A filesystem can have free space but exhausted inodes — you will not be able to create new files. Always check both.

Common symptom: A service fails to write logs, start, or create temp files with "No space left on device" — even if df -h shows space available. Check df -i.
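
To check both at once without reading every row, a small filter can flag filesystems above a threshold. This sketch assumes the POSIX output format (df -P), where the fifth column is the use percentage for both space and inodes; over_threshold is a hypothetical helper name.

```shell
# Print "MOUNTPOINT USE%" for each filesystem at or above THRESHOLD percent.
# Reads `df -P` or `df -Pi` output on stdin.
over_threshold() {
    awk -v limit="$1" 'NR > 1 && $5 + 0 >= limit { print $6, $5 }'
}

# Example:
#   df -P  | over_threshold 90    # space
#   df -Pi | over_threshold 90    # inodes
```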

Find what is consuming space:

du -sh /* 2>/dev/null    # top-level usage (hides permission errors)
du -sh /var/log/*        # check log directories specifically
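
The du output above is not sorted, so the biggest consumer is easy to overlook. One possible sketch that ranks the immediate subdirectories by size, in kilobytes; top_dirs is a made-up name and the defaults are arbitrary.

```shell
# Print the COUNT largest entries one level below DIR, largest first.
# Sizes are in kilobytes; -x keeps du on a single filesystem.
top_dirs() {
    dir=${1:-/}
    count=${2:-10}
    du -xk -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example:
#   top_dirs /var 5
```

The listing also includes a total line for DIR itself. Repeat the command on the largest subdirectory to drill down.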

OOM (Out of Memory)

dmesg | grep -iE 'oom|out of memory'           # kernel OOM killer log (may need root)
journalctl -k | grep -iE 'oom|out of memory'   # same via systemd journal

The OOM killer terminates processes when memory is critically low. If a service or process disappeared without explanation, check for OOM kills first.
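
To see which processes were killed, the kernel messages can be reduced to PID and name. This sketch assumes the common "Killed process PID (NAME)" wording; the exact message text varies between kernel versions, so treat a blank result as inconclusive. oom_victims is a hypothetical helper name.

```shell
# Print "PID NAME" for each process the kernel log reports as killed.
# Reads kernel log text on stdin.
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Example:
#   journalctl -k --no-pager | oom_victims
```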

free -h    # check current memory and swap usage

File descriptor limits

ulimit -n                    # your shell's current limit
cat /proc/PID/limits         # limits for a running process (replace PID)

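High-traffic services (nginx, databases) can hit the open file descriptor limit under load. When they do, they stop accepting new connections even though the service is still running, and the logs typically show "Too many open files". To raise the limit, set LimitNOFILE=65535 in the [Service] section of the unit file, run systemctl daemon-reload, and restart the service. The new limit applies only to processes started after the change.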
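
To judge how close a running process is to its limit, compare its open descriptor count with the soft limit. A sketch for Linux, assuming /proc is mounted; soft_nofile and fd_usage are names invented for this example.

```shell
# Extract the soft "Max open files" value from /proc/PID/limits text on stdin.
soft_nofile() {
    awk '/^Max open files/ { print $4 }'
}

# Print "OPEN / LIMIT" for a running process.
# Reading another user's /proc/PID/fd usually requires root.
fd_usage() {
    [ -r "/proc/$1/limits" ] || return 1
    open=$(ls "/proc/$1/fd" | wc -l)
    limit=$(soft_nofile < "/proc/$1/limits")
    echo "$open / $limit"
}

# Example (replace PID):
#   fd_usage PID
```

If the open count sits near the limit while the service is under load, raising the limit is worth trying before deeper debugging.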