Service-Specific Troubleshooting
General first steps — always
Before diving into service-specific steps, do these first for any broken service:
# 1. Is the service running?
systemctl status SERVICE_NAME
# 2. What do the logs say?
journalctl -u SERVICE_NAME -n 50
journalctl -u SERVICE_NAME --since "10 minutes ago"
# 3. Is it listening on the right port?
ss -tlnp | grep PROCESS_NAME # the process name can differ from the unit name (postfix listens as "master")
# 4. Are there recent errors at the system level?
journalctl -p err -b --no-pager | head -30
# 5. Did something change recently?
git log --oneline -10 # in the ansible repo
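Step 3 greps by process name, so it misses services whose listening process has a different name. Checking by port avoids that problem. The sketch below assumes the column layout of `ss -tln`, where the local address is the fourth column. `listening_on` is a hypothetical helper name, not a standard tool.

```shell
#!/usr/bin/env bash
# listening_on PORT: hypothetical helper. Reads `ss -tln` output on stdin,
# prints the matching local addresses, and returns 0 if anything listens on PORT.
listening_on() {
  local port="$1"
  # $4 is "Local Address:Port" in `ss -tln` output; compare its suffix with ":PORT"
  # (so port 80 does not match 8080)
  awk -v p=":${port}" '
    NR > 1 && substr($4, length($4) - length(p) + 1) == p { print $4; found = 1 }
    END { exit !found }'
}

# Usage on a real host:
#   ss -tln | listening_on 80 || echo "nothing listening on port 80"
```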
nginx not responding
# Step 1: Is it running?
systemctl status nginx
journalctl -u nginx -n 30
# Step 2: Config syntax error?
nginx -t
# On a syntax error, nginx -t reports the file and line number
# Step 3: Is it listening?
ss -tlnp | grep nginx
# If not listening: service probably failed to start — read the logs
# Step 4: Is the port open in the firewall?
firewall-cmd --list-all | grep -E "ports|services"
# Step 5: Is SELinux blocking it?
grep "type=AVC" /var/log/audit/audit.log | grep nginx | tail -10
# Step 6: Test from outside
curl -v http://HOSTNAME/
curl -vk https://HOSTNAME/ # -k skips cert check
# Step 7: Check if upstream backend is running (for reverse proxy)
curl -v http://127.0.0.1:8080/ # test backend directly
Common nginx failures:
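- 502 Bad Gateway: nginx is up, but the backend is down or not listening on the expected port (on SELinux hosts, also check the httpd_can_network_connect boolean)
- 504 Gateway Timeout: the backend is too slow or hanging; check proxy_read_timeout
- 413 Request Entity Too Large: the request body exceeds client_max_body_size; increase it
- SSL_ERROR_RX_RECORD_TOO_LONG: the client started a TLS handshake but got plain HTTP back. This usually means the listen directive for port 443 lacks the ssl parameter, or an https:// URL points at an HTTP-only port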
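To map what curl sees onto the failures listed above, you can feed it the status code. `curl -s -o /dev/null -w '%{http_code}'` prints only the code, and prints 000 when no HTTP response came back at all. `nginx_hint` is a hypothetical helper that turns that code into the matching hint; it is a convenience sketch, not a diagnosis.

```shell
#!/usr/bin/env bash
# nginx_hint CODE: hypothetical helper mapping an HTTP status code to a likely cause
nginx_hint() {
  case "$1" in
    000)     echo "no HTTP response: nginx down, port closed, or TLS/plain HTTP mismatch" ;;
    413)     echo "request body too large: raise client_max_body_size" ;;
    502)     echo "bad gateway: backend down or not on the expected port (check SELinux too)" ;;
    504)     echo "gateway timeout: backend slow or hanging; check proxy_read_timeout" ;;
    2??|3??) echo "ok ($1)" ;;
    *)       echo "unexpected status $1: read the nginx error log" ;;
  esac
}

# Usage:
#   nginx_hint "$(curl -s -o /dev/null -w '%{http_code}' http://HOSTNAME/)"
```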
Postfix mail not sending
# Step 1: Is postfix running?
systemctl status postfix
journalctl -u postfix -n 30
# Step 2: What is in the queue?
postqueue -p
# Each stuck message shows its deferral reason in parentheses
# Step 3: Try to flush the queue and watch what happens
postqueue -f
journalctl -u postfix -f # watch in another terminal
# Step 4: Test sending manually
echo "test" | mail -s "test" you@example.com
journalctl -u postfix -n 20 # check what happened
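If `mail` is not installed, Postfix's own `sendmail` binary works for Step 4. The sketch below assumes it lives at `/usr/sbin/sendmail`, the usual Postfix path. `test_message` and `send_test` are hypothetical helpers. The message is built separately so it can be previewed without sending anything.

```shell
#!/usr/bin/env bash
# test_message RECIPIENT: hypothetical helper that prints a minimal test message
test_message() {
  printf 'To: %s\nSubject: postfix test from %s\n\n%s\n' \
    "$1" "$(uname -n)" "Test message sent $(date -u +%Y-%m-%dT%H:%M:%SZ)."
}

# send_test RECIPIENT: hand the message to Postfix.
# -t: take recipients from the To: header; -i: a line with a single "." does not end the message
send_test() {
  test_message "$1" | /usr/sbin/sendmail -t -i
}

# Usage:
#   test_message you@example.com          # preview only
#   send_test you@example.com && journalctl -u postfix -n 20
```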
Reading queue errors in postqueue -p:
# Connection refused to relayhost
connect to smtp.example.com[10.0.0.2]:25: Connection refused
→ relayhost is down or wrong port in main.cf
# Authentication failure
SASL authentication failed
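→ Wrong credentials in sasl_passwd, or sasl_passwd.db not rebuilt after editing (run postmap /etc/postfix/sasl_passwd)
# Relay requires TLS but Postfix did not start it
server requires encryption
→ set smtp_tls_security_level = encrypt (or = may for opportunistic TLS)
# DNS lookup failed
Host or domain name not found. Name service error
→ relayhost does not resolve: fix DNS first. If the name resolves but its MX lookup fails, put it in brackets (relayhost = [smtp.example.com]:587) so Postfix skips the MX lookup and uses the host's address directly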
# Check DNS resolution of relayhost
dig +short smtp.example.com
# Check TCP connectivity
nc -zv smtp.example.com 587
# Check SASL credentials file
postconf smtp_sasl_password_maps
ls -la /etc/postfix/sasl_passwd.db # must exist and be newer than sasl_passwd
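With a large queue, grouping the deferral reasons is faster than reading postqueue -p by eye. The sketch below assumes the classic postqueue -p layout, where each reason sits on its own line in parentheses. Reasons that wrap across lines are not handled. `queue_reasons` is a hypothetical name. Postfix 3.1 and later can also print the queue as JSON with postqueue -j, which is more robust to parse.

```shell
#!/usr/bin/env bash
# queue_reasons: hypothetical helper. Reads `postqueue -p` output on stdin and
# prints each distinct deferral reason with a count, most frequent first.
queue_reasons() {
  # keep only "(reason)" lines, stripping the indentation and the parentheses
  sed -n 's/^[[:space:]]*(\(.*\))[[:space:]]*$/\1/p' | sort | uniq -c | sort -rn
}

# Usage:
#   postqueue -p | queue_reasons | head
```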
SSH connection failing
# Step 1: Test with verbose output
ssh -vvv user@host 2>&1 | head -50
# Look for:
# - "Connecting to host port 22" — network connectivity
# - "Authentications that can continue" — what the server accepts
# - "No more authentication methods to try" — key not accepted
# Step 2: Is sshd running on the target?
systemctl status sshd
# Step 3: Is port 22 open?
ss -tlnp | grep sshd
firewall-cmd --list-all | grep ssh
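# Step 4: Key issues
# Check the key is in authorized_keys on the target. On the client, print the key blob:
cut -d' ' -f2 ~/.ssh/id_ed25519.pub
# then search for it on the target (paste the blob in place of KEY_BLOB):
grep -F "KEY_BLOB" ~/.ssh/authorized_keys
# Check permissions: with StrictModes, sshd rejects group- or world-writable paths
ls -ld ~ ~/.ssh/ # home: not group/world-writable; .ssh dir: 700
ls -la ~/.ssh/authorized_keys # file: 600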
# Step 5: Check SELinux
restorecon -Rv ~/.ssh/ # fix any context issues
# Step 6: Check sshd logs on target
journalctl -u sshd -n 30
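The permission rules in Step 4 can be checked in one pass. The sketch below assumes GNU stat (`stat -c`). `check_ssh_perms` is a hypothetical helper. It checks only the group and world write bits that StrictModes objects to; it does not check ownership, which StrictModes also enforces.

```shell
#!/usr/bin/env bash
# check_ssh_perms [HOME_DIR]: hypothetical helper. Flags paths that sshd's
# StrictModes would reject because they are group- or world-writable.
check_ssh_perms() {
  local home="${1:-$HOME}" path mode rc=0
  for path in "$home" "$home/.ssh" "$home/.ssh/authorized_keys"; do
    if [ ! -e "$path" ]; then
      echo "missing: $path"; rc=1; continue
    fi
    mode=$(stat -c '%a' "$path")      # octal mode, e.g. 700
    if (( 8#$mode & 8#022 )); then    # group-write or other-write bit is set
      echo "too open: $path ($mode)"; rc=1
    fi
  done
  return "$rc"
}

# Usage (on the target, as the user who is logging in):
#   check_ssh_perms && echo "permissions look fine"
#   chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys   # typical fix
```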
Common SSH error messages:
- Connection refused: sshd not running, wrong port, or a firewall rejecting the connection
- Connection timed out: network unreachable or a firewall silently dropping packets
- Permission denied (publickey): key not in authorized_keys, wrong permissions, or key type not accepted
- Host key verification failed: the host key changed or the known_hosts entry is stale. Confirm the change is expected (for example, the host was rebuilt), then remove the old entry with ssh-keygen -R hostname
Time sync problems
# Step 1: Is chrony running?
systemctl status chronyd
# Step 2: Is it synced?
chronyc tracking
# Look for "System time" — should be small (milliseconds)
# Look for "Leap status: Normal" — not "Not synchronised"
# Step 3: What sources is it using?
chronyc sources -v
# '*' = currently synced source
# '+' = acceptable source
# '?' = unreachable source
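# Step 4: Can it reach the NTP servers?
chronyc sources # Reach column (octal): 377 = last 8 polls answered, 0 = no replies
ping ntp1.example.com # tests ICMP only; some NTP servers drop ping even when NTP works
# Step 5: Force a sync (if clock is far off; run as root)
chronyc makestep
# older chrony releases required "chronyc -a makestep"; current versions ignore -a
# Step 6: Check the config
grep -E "^(server|pool)" /etc/chrony.conf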
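For monitoring or scripts, Step 2 can be turned into a pass/fail check. The sketch below assumes the `chronyc tracking` field labels mentioned in Step 2 ("System time", "Leap status"). `chrony_ok` and its 0.1 s default threshold are hypothetical choices, not chrony defaults.

```shell
#!/usr/bin/env bash
# chrony_ok [MAX_SECONDS]: hypothetical helper. Reads `chronyc tracking` on stdin;
# returns 0 only if Leap status is Normal and the system clock offset <= MAX_SECONDS.
chrony_ok() {
  local max="${1:-0.1}"
  awk -F': ' -v max="$max" '
    /^System time/ { split($2, f, " "); offset = f[1] + 0 }   # "0.000012 seconds fast of NTP time"
    /^Leap status/ { leap = $2 }
    END {
      print "offset=" offset "s leap=" leap
      if (leap != "Normal" || offset > max + 0) exit 1
    }'
}

# Usage:
#   chronyc tracking | chrony_ok 0.5 || echo "clock not in sync"
```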
If all NTP sources show ? (unreachable):
# DNS check
dig +short ntp1.example.com
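# Connectivity check (NTP uses UDP port 123)
# nc cannot prove UDP reachability: a success message only means no ICMP rejection came back,
# so confirm with the Reach column in chronyc sources
nc -zuv ntp1.example.com 123
# Firewall check: the local rule matters only if this host serves NTP.
# firewalld does not filter outbound traffic by default, so also check network firewalls
firewall-cmd --list-all | grep -E "ntp|123"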
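To tell "all sources unreachable" apart from "one bad source" in a script, you can count sources by their state character, where '?' means unreachable. The sketch below assumes chronyc's usual layout: each source line starts with a mode character (^ server, = peer, # reference clock) followed by the state character. `unreachable_sources` is a hypothetical name.

```shell
#!/usr/bin/env bash
# unreachable_sources: hypothetical helper. Reads `chronyc sources` on stdin,
# prints "unreachable/total", and returns 1 if there are no sources or all are unreachable.
unreachable_sources() {
  awk '
    # mode char followed by a state char; header and "===" lines do not match
    length($0) >= 2 && index("^=#", substr($0, 1, 1)) && index("*+-?x~", substr($0, 2, 1)) {
      total++
      if (substr($0, 2, 1) == "?") bad++
    }
    END {
      printf "%d/%d\n", bad, total
      if (total == 0 || bad == total) exit 1
    }'
}

# Usage:
#   chronyc sources | unreachable_sources || echo "no reachable NTP source"
```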
Login / authentication failing
# Step 1: Is SSSD running?
systemctl status sssd
# Step 2: Can SSSD resolve the user?
id username@example.com
# Step 3: Test HBAC rules
ipa hbactest --user=username --host=$(hostname -f) --service=sshd # run with an IPA admin ticket; matched rules are shown by default
# Step 4: Check Kerberos
kinit username@EXAMPLE.COM
klist # see if a ticket was issued
# Step 5: Check time sync (Kerberos fails with clock skew > 5 min)
chronyc tracking | grep "System time"
date # compare with date on the IPA server
# Step 6: SSSD logs
tail -f /var/log/sssd/sssd_example.com.log # named after the [domain/...] section in sssd.conf; raise debug_level there for more detail
journalctl -u sssd -n 50
# Step 7: PAM auth logs
journalctl -u sshd -n 20 # sshd pam logs
tail -f /var/log/secure
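Step 5 compares the clocks by eye; the arithmetic can be scripted. The sketch below assumes both clocks are read as Unix epoch seconds and uses 300 seconds, the usual Kerberos default for allowed clock skew. `skew_ok` and the host ipa.example.com in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# skew_ok LOCAL_EPOCH REMOTE_EPOCH [MAX_SECONDS]: hypothetical helper.
# Returns 0 if the two clocks differ by at most MAX_SECONDS (default 300).
skew_ok() {
  local a="$1" b="$2" max="${3:-300}" delta
  delta=$(( a - b ))
  (( delta < 0 )) && delta=$(( -delta ))   # absolute difference
  echo "skew=${delta}s (limit ${max}s)"
  (( delta <= max ))
}

# Usage (ipa.example.com is a placeholder for your IPA server):
#   skew_ok "$(date +%s)" "$(ssh ipa.example.com date +%s)"
```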
DNS resolution failing
# Step 1: Basic test
dig example.com
nslookup example.com
# Step 2: Which resolver is being used?
cat /etc/resolv.conf
resolvectl status # on systemd-resolved systems
# Step 3: Test with a specific resolver
dig example.com @8.8.8.8
dig example.com @10.0.0.10 # your internal DNS
# Step 4: Is the resolver reachable?
dig +short example.com @10.0.0.10 # UDP (dig's default)
dig +tcp +short example.com @10.0.0.10 # TCP (used for large responses)
# Step 5: Check /etc/hosts for overrides
grep example.com /etc/hosts
# Step 6: Check nsswitch.conf
grep "^hosts" /etc/nsswitch.conf # "files" should come before "dns" so /etc/hosts overrides apply
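Step 3 is most useful when the two answers are compared side by side, for example to spot split-horizon DNS or a stale internal record. The sketch below assumes `dig` is installed. `resolver_diff` is hypothetical, and the `DIG` variable exists only so the function can be tried with a stand-in command.

```shell
#!/usr/bin/env bash
# resolver_diff NAME RESOLVER1 RESOLVER2: hypothetical helper.
# Compares the sorted `dig +short` answers from two resolvers; returns 1 if they differ.
resolver_diff() {
  local name="$1" r1="$2" r2="$3" dig_cmd="${DIG:-dig}" a1 a2
  a1=$("$dig_cmd" +short "$name" "@$r1" | sort)
  a2=$("$dig_cmd" +short "$name" "@$r2" | sort)
  if [ "$a1" = "$a2" ]; then
    echo "$name: same answer from $r1 and $r2"
    return 0
  fi
  echo "$name: answers differ"
  # unquoted on purpose: print each answer as one word on a single line
  echo "  $r1: $(printf '%s ' $a1)"
  echo "  $r2: $(printf '%s ' $a2)"
  return 1
}

# Usage:
#   resolver_diff example.com 10.0.0.10 8.8.8.8
```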
Blocked by SELinux
# Step 1: Is SELinux in enforcing mode?
getenforce
# Step 2: Check for recent denials
ausearch -m avc -ts recent | tail -20
grep "type=AVC" /var/log/audit/audit.log | tail -10
# Step 3: Explain the denial
ausearch -m avc -ts recent | audit2why
# Step 4: Quick test — switch to permissive temporarily
setenforce 0
# retry the operation
# if it works in permissive, SELinux is the cause
setenforce 1
# Step 5: Fix it properly
# Check for a boolean that covers this use case:
getsebool -a | grep relevant_keyword
setsebool -P boolean_name on
# Or fix a file context:
semanage fcontext -a -t correct_type_t "/path/to/files(/.*)?"
restorecon -Rv /path/to/files/
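When there are many denials, grouping them by process, source type, and target type shows which boolean or file context to look for in Step 5. The sketch below assumes the usual key=value layout of AVC records (comm=, scontext=, tcontext=, tclass=). `avc_summary` is a hypothetical helper; audit2why remains the authoritative explanation of each denial.

```shell
#!/usr/bin/env bash
# avc_summary: hypothetical helper. Reads audit log lines on stdin and counts
# denials per "process source_type -> target_type class permission".
avc_summary() {
  awk '/avc: +denied/ {
      perm = ""; comm = ""; stype = ""; ttype = ""; tclass = ""
      # "{ name_connect }" -> "name_connect"
      if (match($0, /[{][^}]*[}]/)) perm = substr($0, RSTART + 2, RLENGTH - 4)
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^comm=/)     { comm = substr($i, 6); gsub(/"/, "", comm) }
        if ($i ~ /^scontext=/) { split(substr($i, 10), s, ":"); stype = s[3] }
        if ($i ~ /^tcontext=/) { split(substr($i, 10), t, ":"); ttype = t[3] }
        if ($i ~ /^tclass=/)   { tclass = substr($i, 8) }
      }
      print comm, stype, "->", ttype, tclass, perm
    }' | sort | uniq -c | sort -rn
}

# Usage:
#   ausearch -m avc -ts recent | avc_summary
```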
Blocked by firewall
# Step 1: Check what is allowed
firewall-cmd --list-all
# Step 2: Test from the client side
nc -zv targethost port
curl -v http://targethost:port
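# Step 3: Verify traffic is reaching the server at all
tcpdump -ni any port PORT # run on the server; check if packets arrive
# tcpdump sees incoming packets before the host firewall does, so read the replies too:
# SYN arrives, SYN-ACK goes back → network and firewall are fine; look at the service itself
# SYN arrives, RST goes back → usually nothing listening on that address/port (service down or bound to the wrong interface)
# SYN arrives, no reply or an ICMP "prohibited" reply → the host firewall is dropping or rejecting it
# No packets arrive → blocked upstream (network firewall, routing) or the client is using the wrong address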
# Step 4: Add the rule
firewall-cmd --permanent --add-port=PORT/tcp
firewall-cmd --reload
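Step 4 can be made idempotent so it is safe to re-run from a script. The sketch below uses firewall-cmd's `--query-port` to skip ports that are already allowed in the permanent config, and acts on the default zone only. `allow_port` is a hypothetical helper, and `FIREWALL_CMD` exists only so it can be tried with a stand-in command.

```shell
#!/usr/bin/env bash
# allow_port PORT/PROTO: hypothetical helper. Adds a permanent port rule and
# reloads, unless the permanent config already allows it.
allow_port() {
  local spec="$1" fw="${FIREWALL_CMD:-firewall-cmd}"
  local re='^[0-9]+(-[0-9]+)?/(tcp|udp|sctp|dccp)$'   # e.g. 8080/tcp or 6000-6010/udp
  if ! [[ $spec =~ $re ]]; then
    echo "usage: allow_port PORT/PROTO (e.g. 8080/tcp)" >&2
    return 2
  fi
  if "$fw" --permanent --query-port="$spec" >/dev/null 2>&1; then
    echo "$spec is already allowed in the permanent config"
    return 0
  fi
  "$fw" --permanent --add-port="$spec" && "$fw" --reload
}

# Usage:
#   allow_port 8080/tcp
```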
Disk full / inodes exhausted
# Step 1: Check disk space
df -h
df -i # check inode usage
# Step 2: Find what is using space
du -sh /var/log/* | sort -rh | head
du -sh /var/spool/* | sort -rh | head
# Large log files
find /var/log -size +100M
# Step 3: Check mail queue size
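postqueue -p | tail -1 # last line: "-- N Kbytes in M Requests." (or "Mail queue is empty")
find /var/spool/postfix/deferred -type f | wc -l # deferred files live in hashed subdirectories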
# Step 4: Rotate or truncate logs safely
journalctl --vacuum-size=2G
logrotate -f /etc/logrotate.conf
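# Step 5: Truncate a large log (preserves the file and inode, empties content)
truncate -s 0 /var/log/some.log # same as "> /var/log/some.log", but also works under sudo
# Do NOT delete log files that services have open: they keep writing to the old inode, and the space is not freed until the process closes the file
lsof +L1 # list deleted files that are still held open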
Disk full kills services silently. nginx cannot write access logs, postfix cannot write to the queue, and SSSD cannot cache. Always check df -h and df -i early in any diagnosis.
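Both df views can be checked against one threshold. The sketch below reads POSIX-format output (`df -P` or `df -iP`), where the fifth column is the use percentage and the sixth is the mount point. `df_over` and the 90% example are hypothetical, and mount points containing spaces are not handled.

```shell
#!/usr/bin/env bash
# df_over PERCENT: hypothetical helper. Reads `df -P` or `df -iP` on stdin,
# prints filesystems at or above PERCENT, and returns 1 if there are any.
df_over() {
  awk -v max="$1" '
    NR > 1 {
      use = $5; sub(/%/, "", use)        # "95%" -> 95; "-" (no inode data) counts as 0
      if (use + 0 >= max + 0) { print $6 " " $5; hit = 1 }
    }
    END { exit hit }'
}

# Usage:
#   df -P  | df_over 90 && echo "space ok"
#   df -iP | df_over 90 && echo "inodes ok"
```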
Postfix queue management
# Show the queue (deferred, active, hold)
postqueue -p
mailq # same, shorthand
# Count queued messages
postqueue -p | grep -c "^[0-9A-F]"
# Flush: attempt to deliver all deferred messages now
postqueue -f
# Delete a specific message by queue ID
postsuper -d QUEUEID
# Delete all deferred messages (use with care)
postsuper -d ALL deferred
# Inspect a specific message including headers
postcat -q QUEUEID
# Delete all messages in queue (emergency only)
postsuper -d ALL
postsuper -d ALL deferred only deletes messages stuck in the deferred queue — messages being actively delivered are unaffected. Use this when a large backlog of undeliverable messages is consuming disk space.
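To delete only the messages from one sender (for example, a runaway cron job) rather than the whole deferred queue, use postsuper -d -, which reads queue IDs from standard input. The parsing below follows the approach shown in the postsuper(1) manual page and assumes the classic postqueue -p layout, with the sender in the seventh field of each entry. `queue_ids_from_sender` is a hypothetical name. Review the list of IDs before piping it into postsuper.

```shell
#!/usr/bin/env bash
# queue_ids_from_sender SENDER: hypothetical helper. Reads `postqueue -p` on stdin
# and prints the queue IDs of messages whose envelope sender is exactly SENDER.
queue_ids_from_sender() {
  tail -n +2 |                 # drop the "-Queue ID-" header line
    grep -v '^ *(' |           # drop "(deferral reason)" lines
    awk -v s="$1" 'BEGIN { RS = "" } $7 == s { print $1 }' |   # one record per message
    tr -d '*!'                 # strip the active (*) and hold (!) markers
}

# Usage: review first, then delete
#   postqueue -p | queue_ids_from_sender cron@host.example.com
#   postqueue -p | queue_ids_from_sender cron@host.example.com | postsuper -d -
```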
Advanced systemd troubleshooting
# See all failed units
systemctl list-units --failed
# Find which service is slow at boot
systemd-analyze blame
# See the full critical chain for boot time
systemd-analyze critical-chain
# Show full unit properties (all settings, including computed defaults)
systemctl show nginx
# Show the dependency tree of a unit
systemctl list-dependencies nginx
systemctl list-dependencies nginx --reverse # who depends ON nginx
systemctl show nginx outputs every key=value pair for the unit — useful when a setting from a drop-in is not being picked up, or you want to confirm the actual Restart= or ExecStart= value that systemd is using (not just what the file says).
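To see where the effective values come from, filter `systemctl show` output down to a few properties. FragmentPath is the main unit file, and DropInPaths lists the drop-ins systemd actually loaded. `unit_origin` is a hypothetical helper and the property list is only an example. On a live host, `systemctl show -p` selects the same properties directly; the filter also works on saved output. If a drop-in you expect is missing from DropInPaths, run systemctl daemon-reload.

```shell
#!/usr/bin/env bash
# unit_origin: hypothetical helper. Reads `systemctl show UNIT` (key=value lines)
# on stdin and keeps only properties that explain where settings come from.
unit_origin() {
  grep -E '^(FragmentPath|DropInPaths|Restart|ExecStart)='
}

# Usage:
#   systemctl show nginx | unit_origin
#   systemctl cat nginx        # unit file plus drop-ins, as systemd read them
#   systemctl daemon-reload    # needed after editing or adding a drop-in
```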