Troubleshooting Workflow
- check status
- check logs
- check config
- check network
- check auth/certs/time
- change one thing at a time
Step 1: Define the symptom
Not "it's broken." Say exactly what is wrong:
- nginx won't start
- website returns 502 Bad Gateway
- SSH key auth fails
- mail is stuck in the queue
- host cannot resolve DNS
- cert appears expired
- log forwarding stopped
Step 2: Check service state
systemctl status SERVICE
journalctl -u SERVICE -n 50
journalctl -u SERVICE --since today
Why: if the service is dead, fix that first. Logs often tell you exactly why startup failed — look for failed, error, or the last line before it stopped.
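As a rough sketch of that log scan, the helper below keeps only the lines that mention a failure keyword. The function name recent_failures and the keyword list are assumptions for illustration, not part of systemd; adjust the keywords for the service you are debugging.

```shell
# Hypothetical helper: read log lines on stdin and print only those
# that mention a common failure keyword (case-insensitive).
# The keyword list is an assumption; extend it as needed.
recent_failures() {
    # "|| true" keeps the exit status 0 when nothing matches.
    grep -iE 'fail|error|fatal|denied' || true
}
```

One way to use it: journalctl -u SERVICE -n 50 --no-pager | recent_failures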
Step 3: Validate config
Use the service's built-in config validation tool:
nginx -t
apachectl configtest
postfix check
doveconf
Why: many failures are just bad config syntax or invalid values. Fixing a syntax error is faster than debugging a running service.
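One possible way to make validation a habit is a small lookup that maps a service name to its check command. This is only a sketch: config_check_cmd is a hypothetical name, and it covers just the four tools listed above.

```shell
# Hypothetical lookup: print the config-check command for a service name.
# Only the tools named in this step are covered; anything else returns 1.
config_check_cmd() {
    case "$1" in
        nginx)        echo "nginx -t" ;;
        apache|httpd) echo "apachectl configtest" ;;
        postfix)      echo "postfix check" ;;
        dovecot)      echo "doveconf" ;;
        *)            return 1 ;;
    esac
}
```

For example, cmd=$(config_check_cmd nginx) && $cmd && systemctl reload nginx would reload only if the validation succeeds.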
Step 4: Check network and ports
ss -tulpn
nc -zv host port
curl -vk https://host
ip a
ip r
Questions to ask:
- Is the service listening on the expected port? (ss -tulpn)
- Is the port blocked by a firewall? (firewall-cmd --list-all)
- Can you reach the upstream or backend from this host?
- Is the routing correct?
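The first question can be answered mechanically. The sketch below assumes the usual column layout of ss output (local address in field 4, or field 5 when the Netid column is present); is_listening is a hypothetical name.

```shell
# Hypothetical helper: succeed if ss output on stdin (e.g. from
# `ss -tln` or `ss -tulpn`) shows a socket bound to the given local port.
is_listening() {
    awk -v port="$1" '
        # Local address is field 4 without the Netid column, field 5 with it.
        $4 ~ (":" port "$") || $5 ~ (":" port "$") { found = 1 }
        END { exit !found }
    '
}
```

For example: ss -tln | is_listening 443 || echo "nothing is listening on 443"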
Step 5: Check DNS
dig host
host host
nslookup host
Questions to ask:
- Does the hostname resolve to the expected IP?
- Is the DNS server configured correctly (/etc/resolv.conf)?
- Is there a reverse DNS entry if needed?
- Is a search domain causing unexpected name resolution?
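To check the first question without reading resolver output by eye, something like the following could work. It is a sketch: resolves_to is a hypothetical name, and it expects one address per line, which is what dig +short prints.

```shell
# Hypothetical helper: succeed if the expected address appears as a whole
# line in resolver output on stdin (e.g. from `dig +short NAME`).
resolves_to() {
    grep -Fxq -- "$1"
}
```

For example: dig +short host | resolves_to 192.0.2.10 || echo "unexpected address" (192.0.2.10 is a placeholder from the documentation address range).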
Step 6: Check auth, certs, and time
kinit
klist
openssl x509 -in cert.crt -noout -dates
chronyc tracking
timedatectl
Things to verify:
- Kerberos ticket is valid and not expired
- Certificate is within validity dates
- Certificate CN / SAN matches the hostname being connected to
- System time is in sync: by default Kerberos tolerates at most 5 minutes of clock skew
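Two of these checks reduce to simple date arithmetic. The sketch below assumes GNU date (for the -d option) and whole-second epoch times; cert_days_left and skew_ok are hypothetical names, not part of openssl or chrony.

```shell
# Hypothetical helpers; both assume GNU date and whole-second epoch times.

# Print the number of whole days until a certificate's notAfter date.
# $1: the date string printed after "notAfter=" by
#     `openssl x509 -noout -enddate`
# $2: the current time in seconds since the epoch
# The result is truncated toward zero and is negative for a certificate
# that expired at least a day ago.
cert_days_left() {
    end=$(date -u -d "$1" +%s) || return 1
    echo $(( (end - $2) / 86400 ))
}

# Succeed if two epoch times differ by less than 300 seconds (5 minutes).
skew_ok() {
    diff=$(( $1 - $2 ))
    [ "${diff#-}" -lt 300 ]
}
```

For a plain yes/no answer on expiry, openssl x509 -in cert.crt -noout -checkend 0 exits non-zero if the certificate has already expired.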
Step 7: Check permissions and files
ls -l
namei -l /path/to/file
Common permission issues:
- Config file not readable by the service user
- Private key permissions too open (SSH / TLS services refuse these)
- Log directory not writable by the service
- SELinux label wrong: ls -lZ, ausearch -m avc
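The private-key case is easy to test in a script. A minimal sketch, assuming GNU stat (for -c '%a'); key_perms_ok is a hypothetical name:

```shell
# Hypothetical helper: succeed only if the file grants no permissions to
# group or others (for example mode 600 or 400). Assumes GNU stat.
key_perms_ok() {
    mode=$(stat -c '%a' "$1") || return 1
    case "$mode" in
        *00|0) return 0 ;;   # group and other digits are both 0
        *)     return 1 ;;
    esac
}
```

For example: key_perms_ok ~/.ssh/id_ed25519 || chmod 600 ~/.ssh/id_ed25519 (the key path is only an example).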
Step 8: Change one thing at a time
After each change:
- Restart or reload the relevant service
- Re-run your test
- Check logs again
- Note what changed and what effect it had
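One way to keep track of a single change is to snapshot the file before editing it, then diff afterwards. This is a sketch only; snapshot is a hypothetical name and the .bak.<timestamp> suffix is an arbitrary choice.

```shell
# Hypothetical helper: copy a file to FILE.bak.<timestamp> before editing
# it and print the backup path, so the change can be diffed or reverted.
snapshot() {
    backup="$1.bak.$(date +%Y%m%d%H%M%S)"
    cp -p -- "$1" "$backup" && echo "$backup"
}
```

For example: bak=$(snapshot /etc/nginx/nginx.conf), make the edit, then diff -u "$bak" /etc/nginx/nginx.conf shows exactly what changed.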
Common fault domains
When stuck, work through this list:
- Service down — not started, crashed, or failed on startup
- Bad config — syntax error, wrong path, missing directive
- Bad permissions — file or directory not accessible to the service
- Port blocked — firewall rule, SELinux, wrong bind address
- DNS wrong — hostname does not resolve, or resolves to the wrong IP
- Cert expired or mismatched — check dates and hostname
- Auth broken — wrong credentials, expired ticket, missing key
- Time skew — Kerberos will fail if clocks are too far apart
- Automation rendered wrong file — a template produced unexpected output
- Resource exhaustion — disk full, inode exhaustion, OOM — see below
Resource exhaustion
Services fail silently or in confusing ways when a system runs out of disk space, inodes, memory, or file descriptors. These are easy to miss because the error messages often point elsewhere.
Disk space
df -h # show disk usage by filesystem (human-readable)
df -i # show inode usage — a full inode table also stops writes
A filesystem can have free space but exhausted inodes, in which case you cannot create new files. df -h reports only space, so always check df -i as well.
Find what is consuming space:
du -sh /* # top-level usage
du -sh /var/log/* # check log directories specifically
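To flag nearly full filesystems automatically, a filter like the one below could be used. It is a sketch that assumes the six-column POSIX layout of df -P and that device names and mount points contain no spaces; over_threshold is a hypothetical name.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount
# point and usage of every filesystem at or above the given percentage.
# Assumes device names and mount points contain no spaces.
over_threshold() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the header line
            pct = $5
            sub(/%$/, "", pct)        # "95%" -> "95"
            if (pct + 0 >= limit + 0) print $6, pct "%"
        }
    '
}
```

For example: df -P | over_threshold 90. With GNU df, df -P -i should produce the same column layout for inodes, so the same filter can be applied to it.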
OOM (Out of Memory)
dmesg | grep -i oom # kernel OOM killer log
journalctl -k | grep -i oom # same via systemd journal
The OOM killer terminates processes when memory is critically low. If a service or process disappeared without explanation, check for OOM kills first.
free -h # check current memory and swap usage
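To see which processes were killed, the kernel log can be reduced to PID and name. The sketch below assumes the "Killed process PID (NAME)" wording that recent Linux kernels use; older kernels may word the message differently. oom_victims is a hypothetical name.

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "PID NAME" for each process the OOM killer reports as killed.
# Assumes the "Killed process PID (NAME)" message format.
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}
```

For example: journalctl -k --no-pager | oom_victims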
File descriptor limits
ulimit -n # your shell's current limit
cat /proc/PID/limits # limits for a running process (replace PID)
High-traffic services (nginx, databases) can hit the open file descriptor limit under load. When they do, they fail to accept new connections even though the service is technically running. Raise the limit by setting LimitNOFILE=65535 in the [Service] section of the unit file or of a drop-in (systemctl edit SERVICE), then restart the service.
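To judge how close a process is to its limit, compare the soft limit with the number of descriptors it currently holds. A minimal sketch, assuming the Linux /proc/PID/limits layout; soft_nofile is a hypothetical name.

```shell
# Hypothetical helper: read the contents of /proc/PID/limits on stdin
# and print the soft "Max open files" limit.
soft_nofile() {
    awk '/^Max open files/ { print $4 }'
}
```

For example, compare soft_nofile < /proc/PID/limits with ls /proc/PID/fd | wc -l (replace PID); if the two numbers are close, the service is near its limit.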