How to Troubleshoot a 'Website Down' Incident
The Worst Words in an On-Call Rotation
Your phone buzzes. A customer has emailed. Your monitoring fires. "The website is down." In that moment, the difference between a ten-minute fix and a two-hour scramble is usually method — not luck.
This guide teaches a repeatable, layered approach to diagnosing a web outage. You work the request path from the outside in: DNS → TCP → TLS → HTTP → application → database. You stop when you find the broken layer, gather evidence, then fix. No guessing, no random restarts.
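The outside-in walk can be sketched as a small runner that stops at the first failing layer. This is a minimal illustration, not a finished tool: the function names (check_dns, first_broken_layer, and so on) and the HOST variable are hypothetical, and the TLS check only confirms that a handshake completes, not that the certificate is valid.

```shell
#!/usr/bin/env bash
# Hypothetical names throughout; swap in your own host and checks.
HOST="${HOST:-example.com}"

# Each check returns 0 when its layer looks healthy.
check_dns()  { [ -n "$(dig +short "$HOST" 2>/dev/null)" ]; }
check_tcp()  { nc -zw3 "$HOST" 443 >/dev/null 2>&1; }
# Succeeds when a TLS handshake completes (does not validate the certificate).
check_tls()  { openssl s_client -connect "$HOST:443" -servername "$HOST" </dev/null >/dev/null 2>&1; }
# -f makes curl fail on HTTP status 400 and above.
check_http() { curl -sf -o /dev/null --max-time 10 "https://$HOST"; }

# Run the named checks in order and print the first layer that fails.
first_broken_layer() {
  local layer
  for layer in "$@"; do
    if ! "check_$layer"; then
      echo "$layer"
      return 1
    fi
  done
  echo "none"
}

# Usage (makes real network requests):
#   first_broken_layer dns tcp tls http
```

The order of the arguments is the order of the request path, so the first name printed is where you start digging.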
Step 0 — Confirm and Scope
Before touching anything, answer two questions: "Is it just me, or everyone?" and "Is the whole site affected, or just part?"
# Check from your machine first
curl -svo /dev/null https://example.com 2>&1 | head -40
If it is only you, suspect a stale local DNS cache, a corporate proxy, or an IP block. Flush your local DNS cache and test again, ideally from a second network as well. If it is everyone, proceed down the stack.
The Symptom → Layer Quick-Reference Table
| What you see | Likely broken layer | First command |
|---|---|---|
| Could not resolve host | DNS | dig example.com |
| Connection timed out | Firewall / routing / server down | ping, then traceroute |
| Connection refused | Nothing listening on that port | ss -tulpn on the server |
| Cert warning / TLS error | TLS / certificate | openssl s_client -connect host:443 |
| HTTP 403 Forbidden | App config / permissions / WAF | Check access logs, WAF rules |
| HTTP 502 Bad Gateway | Reverse proxy cannot reach app | Check app process, socket/port |
| HTTP 503 Service Unavailable | App overloaded or in maintenance | Check app health, queue depth |
| HTTP 504 Gateway Timeout | App or DB too slow | Check slow query logs, DB connections |
Keep this table open on a second screen. It will save you time.
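The first few rows of the table line up with curl's documented exit codes, so a script can do the lookup for you. A rough sketch, assuming the helper name layer_for_curl_exit (hypothetical) and covering only the most common codes:

```shell
# Map a curl exit code to the layer to inspect first.
# 6 = could not resolve host, 7 = failed to connect,
# 28 = operation timed out, 35 = TLS handshake error,
# 60 = certificate could not be verified.
layer_for_curl_exit() {
  case "$1" in
    0)     echo "http" ;;      # transport worked: look at the status code
    6)     echo "dns" ;;
    7)     echo "tcp" ;;
    28)    echo "timeout" ;;
    35|60) echo "tls" ;;
    *)     echo "unknown" ;;   # see the EXIT CODES section of the curl man page
  esac
}

# Usage (makes a real request):
#   curl -s -o /dev/null --connect-timeout 5 https://example.com
#   layer_for_curl_exit "$?"
```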
Layer 1 — DNS Resolves
If DNS is broken, nothing else matters.
dig example.com +short # basic resolution
dig NS example.com +short # who is authoritative?
dig @ns1.example.com example.com # query authoritative directly, bypass cache
Look for the expected IP, consistent answers from authoritative nameservers, and a sensible TTL. Learn how resolution works end-to-end in the DNS concept guide.
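Checking for consistent answers by eye gets tedious with several nameservers. The sketch below compares what your default resolver returns against each authoritative server. The function names are hypothetical, and a mismatch is not always a fault: round-robin or geo-aware DNS can legitimately return different records.

```shell
# Succeed when two newline-separated answer sets match, ignoring order.
same_answers() {
  [ "$(printf '%s\n' "$1" | sort)" = "$(printf '%s\n' "$2" | sort)" ]
}

# Print each authoritative nameserver whose A records differ from
# what the default resolver returns.
compare_authoritative() {
  domain="$1"
  cached="$(dig +short "$domain" A)"
  for ns in $(dig +short NS "$domain"); do
    direct="$(dig +short "@$ns" "$domain" A)"
    same_answers "$cached" "$direct" || echo "MISMATCH at $ns: $direct"
  done
}

# Usage (makes real DNS queries):
#   compare_authoritative example.com
```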
Layer 2 — TCP Connects
DNS returned an IP. Can you actually reach it?
ping -c 4 example.com # ICMP (some hosts block this)
nc -zvw3 example.com 443 # attempt a TCP handshake on 443
Connection refused means the host is reachable but nothing is bound to that port (or a firewall is actively rejecting the connection); log in to the server and check what's running. Connection timed out means packets are not arriving or replies are not returning; suspect a firewall rule silently dropping traffic, a routing problem, or the server being down. Check security groups, network ACLs, and the host firewall (iptables -L -n or ufw status).
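If nc is not installed, bash can open a TCP connection itself. The sketch below separates "timed out" from "refused" using the exit status of a timed connect attempt. It assumes GNU coreutils timeout (which exits with 124 when it kills the command) and bash's /dev/tcp feature; tcp_verdict and tcp_probe are hypothetical names.

```shell
# Translate the exit status of a timed connect attempt into a verdict.
tcp_verdict() {
  case "$1" in
    0)   echo "open" ;;
    124) echo "timed out" ;;               # timeout had to kill the attempt
    *)   echo "refused or unreachable" ;;  # connect failed quickly
  esac
}

# Attempt a TCP connect via bash's /dev/tcp, giving up after 3 seconds.
tcp_probe() {
  status=0
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null || status=$?
  tcp_verdict "$status"
}

# Usage (makes a real connection attempt):
#   tcp_probe example.com 443
```

A name that fails to resolve also lands in the last branch, which is one more reason to clear the DNS layer first.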
Layer 3 — TLS Negotiates
A valid certificate is easy to overlook until it expires at 3 a.m. on a bank holiday.
# Inspect the certificate chain and expiry
openssl s_client -connect example.com:443 -servername example.com </dev/null 2>/dev/null \
| openssl x509 -noout -dates -subject -issuer
Look for notAfter in the future, the correct subject (matching your domain), and a trusted issuer. A Let's Encrypt certificate that failed to auto-renew is a common cause of certificate incidents; check certbot renew --dry-run and its logs. For a deeper understanding of the handshake, see the TLS handshake explainer.
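Reading notAfter by eye works during an incident, but a number is easier to alert on. A small sketch that turns the notAfter line into days remaining, assuming GNU date (the -d option; BSD and macOS date differ) and a hypothetical helper name, days_left:

```shell
# Print whole days between now and a "notAfter=..." line from openssl.
# Optional second argument: "now" as epoch seconds.
days_left() {
  end_epoch="$(date -u -d "${1#notAfter=}" +%s)" || return 1
  now_epoch="${2:-$(date -u +%s)}"
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Usage (makes a real TLS connection):
#   line="$(openssl s_client -connect example.com:443 -servername example.com \
#             </dev/null 2>/dev/null | openssl x509 -noout -enddate)"
#   days_left "$line"
```

A negative result means the certificate has already expired; a small positive one means renewal is overdue.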
Layer 4 — HTTP Responds
TLS is fine. What does the application actually return?
curl -svo /dev/null https://example.com 2>&1 # full verbose exchange
curl -sLo /dev/null -w "%{http_code} %{url_effective}\n" https://example.com
403: the server understood the request but refused it. Check filesystem permissions, nginx deny directives, or a WAF rule blocking legitimate traffic.
502: the reverse proxy got no valid response from the upstream; the app process has likely crashed or isn't listening.
503: the app is explicitly reporting unavailability (pool exhausted, queue full, maintenance flag).
504: the upstream did not respond before the proxy's timeout; start looking at the database.
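The status-to-suspect mapping is simple enough to script. A sketch using the status codes discussed here; http_hint is a hypothetical name, and 000 is what curl's %{http_code} prints when no HTTP response arrived at all:

```shell
# Map an HTTP status code (as printed by curl -w "%{http_code}") to a first suspect.
http_hint() {
  case "$1" in
    000)     echo "no HTTP response: recheck DNS, TCP, TLS" ;;
    2??|3??) echo "ok" ;;
    403)     echo "permissions, deny rules, or WAF" ;;
    502)     echo "app process down or not listening" ;;
    503)     echo "app overloaded or in maintenance" ;;
    504)     echo "slow upstream: check the database" ;;
    *)       echo "see access and error logs" ;;
  esac
}

# Usage (makes a real request):
#   http_hint "$(curl -s -o /dev/null -w '%{http_code}' https://example.com)"
```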
Layer 5 — Application and Reverse Proxy
Now you are on the server.
systemctl status myapp.service # is the app running?
journalctl -u myapp.service -n 100 --no-pager
ss -tulpn | grep ':8080' # is anything bound to the port?
tail -100 /var/log/nginx/error.log # reverse-proxy errors
df -h # a full disk kills more services than you'd expect
If the process has crashed, read the logs before restarting. A restart clears transient state and may fix the symptom, but you need to know why it crashed to prevent the next one.
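Since a full disk is such a frequent culprit, it helps to filter the df output rather than scan it. A minimal sketch that reads POSIX-format df output and prints mount points at or above a threshold; full_filesystems is a hypothetical name, the default of 90 percent is an arbitrary choice, and mount points containing spaces are not handled.

```shell
# Read `df -P` output on stdin; print "<mount> <use>%" for every
# filesystem at or above the given usage threshold (default 90).
full_filesystems() {
  awk -v limit="${1:-90}" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= limit) print $6, use "%"
    }'
}

# Usage:
#   df -P | full_filesystems 90
```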
Layer 6 — Backend and Database
If the app is running but the proxy is returning 504s, the database is the prime suspect.
# PostgreSQL — active connections by state
psql -U appuser -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
# Is the DB service healthy?
systemctl status postgresql
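Look for sessions piling up in the idle in transaction state, active queries that have been running for a long time and are blocking others (the wait_event_type and wait_event columns show what a session is waiting on), or the total count approaching max_connections. A missing index on a hot table can turn a routine traffic spike into an outage.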
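To judge how close you are to max_connections, compare the two numbers directly. A rough sketch: conn_pct is a hypothetical helper, appuser comes from the example above, and the count is approximate because pg_stat_activity can include background processes as well as client sessions.

```shell
# Print connection usage as a whole-number percentage.
conn_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Usage (queries a real database):
#   used="$(psql -U appuser -Atc 'SELECT count(*) FROM pg_stat_activity;')"
#   max="$(psql -U appuser -Atc 'SHOW max_connections;')"
#   echo "connections: $(conn_pct "$used" "$max")% of max_connections"
```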
Collect Evidence Before You Restart
It is tempting to systemctl restart everything the moment you identify the broken service. Resist. Take 60 seconds: copy the tail of relevant logs, note process memory and CPU, record connection counts, snapshot the relevant dashboard. A restart without evidence is a gamble that the problem won't recur.
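Those 60 seconds go further if the collection is scripted ahead of time. The sketch below saves a handful of snapshots into one directory. The function names, the default /tmp path, and the choice of commands are assumptions to adapt; myapp.service and the nginx log path are the examples used earlier in this guide.

```shell
# Run a command and save its stdout and stderr to <dir>/<name>.txt.
# A failing or missing command is recorded in the file instead of
# aborting the rest of the collection.
snap() {
  snap_dir="$1"; snap_name="$2"; shift 2
  "$@" > "$snap_dir/$snap_name.txt" 2>&1 || true
}

# Collect a basic evidence bundle and print the directory it was written to.
collect_evidence() {
  dir="${1:-/tmp/incident-$(date +%Y%m%dT%H%M%S)}"
  mkdir -p "$dir"
  snap "$dir" disk    df -h
  snap "$dir" memory  free -m
  snap "$dir" procs   ps aux
  snap "$dir" sockets ss -s
  snap "$dir" app-log journalctl -u myapp.service -n 200 --no-pager
  snap "$dir" nginx   tail -n 200 /var/log/nginx/error.log
  echo "$dir"
}

# Usage:
#   collect_evidence            # writes to /tmp/incident-<timestamp>
#   collect_evidence /root/ev1  # or to a directory you choose
```

Run it first, restart second; the bundle is what you will read in the post-incident review.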
Practice Makes the Method Automatic
When your pager fires at 2 a.m., you want these checks to be muscle memory. The Black Box Lab drops you into a simulated broken server with no hints — a shell, symptoms, and a time limit. The puzzles reinforce the individual commands, and the daily challenge keeps your diagnostic instincts sharp with a fresh scenario every morning. If you want structured progression through Linux and networking fundamentals, join the ShellQuest waitlist.