
How to Troubleshoot a 'Website Down' Incident

20 May 2026 · 10 min

The Worst Words in an On-Call Rotation

Your phone buzzes. A customer has emailed. Your monitoring fires. "The website is down." In that moment, the difference between a ten-minute fix and a two-hour scramble is usually method — not luck.

This guide teaches a repeatable, layered approach to diagnosing a web outage. You work the request path from the outside in: DNS → TCP → TLS → HTTP → application → database. You stop when you find the broken layer, gather evidence, then fix. No guessing, no random restarts.
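The outside-in order described above can be sketched as a small shell helper. This is a minimal sketch rather than a finished tool: first_broken_layer is a hypothetical name, and the probe commands in the commented example are only one reasonable choice per layer.

```shell
# first_broken_layer LABEL COMMAND [LABEL COMMAND ...]
# Runs each COMMAND in order and prints the LABEL of the first one that fails.
# Prints "none" if every check passes. Hypothetical helper, for illustration.
first_broken_layer() (
  while [ "$#" -ge 2 ]; do
    label=$1
    check=$2
    shift 2
    if ! sh -c "$check" >/dev/null 2>&1; then
      echo "$label"
      exit 1   # leaves only this subshell, so the function returns 1
    fi
  done
  echo "none"
)

# Example (commented out so that sourcing this block touches no network):
# first_broken_layer \
#   dns  'dig +short example.com | grep -q .' \
#   tcp  'nc -zw3 example.com 443' \
#   http 'curl -sf -o /dev/null https://example.com'
```

Stopping at the first failure matters because every later layer depends on the earlier ones, so anything reported after the first break is usually noise.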


Step 0 — Confirm and Scope

Before touching anything, answer two questions: "Is it just me, or everyone?" and "Is the whole site affected, or just part?"

# Check from your machine first
curl -svo /dev/null https://example.com 2>&1 | head -40

If it is only you, the problem may be local DNS cache, a corporate proxy, or an IP block. Flush your local cache and test again. If it is everyone, proceed down the stack.


The Symptom → Layer Quick-Reference Table

| What you see | Likely broken layer | First command |
| --- | --- | --- |
| Could not resolve host | DNS | dig example.com |
| Connection timed out | Firewall / routing / server down | ping, then traceroute |
| Connection refused | Nothing listening on that port | ss -tulpn on the server |
| Cert warning / TLS error | TLS / certificate | openssl s_client -connect host:443 |
| HTTP 403 Forbidden | App config / permissions / WAF | Check access logs, WAF rules |
| HTTP 502 Bad Gateway | Reverse proxy cannot reach app | Check app process, socket/port |
| HTTP 503 Service Unavailable | App overloaded or in maintenance | Check app health, queue depth |
| HTTP 504 Gateway Timeout | App or DB too slow | Check slow query logs, DB connections |

Keep this table open on a second screen. It will save you time.
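The first rows of the table can also be read straight from curl's exit status, which is handy in scripts. The sketch below maps a few well-known curl exit codes to a layer; layer_for_curl_exit is a hypothetical name, and the mapping is deliberately partial.

```shell
# layer_for_curl_exit CODE
# Maps a curl exit status to the layer most likely at fault.
# Only a handful of codes are covered; hypothetical helper, for illustration.
layer_for_curl_exit() {
  case "$1" in
    0)     echo "transport ok" ;;   # now look at the HTTP status code
    6)     echo "DNS" ;;            # could not resolve host
    7)     echo "TCP" ;;            # failed to connect (refused or unreachable)
    28)    echo "timeout" ;;        # firewall, routing, or a slow upstream
    35|60) echo "TLS" ;;            # handshake failure or certificate problem
    *)     echo "unclassified" ;;   # see EXIT CODES in the curl man page
  esac
}

# Example (commented out so that sourcing this block touches no network):
# curl -s -o /dev/null --max-time 10 https://example.com
# layer_for_curl_exit "$?"
```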


Layer 1 — DNS Resolves

If DNS is broken, nothing else matters.

dig example.com +short              # basic resolution
dig NS example.com +short           # who is authoritative?
dig @ns1.example.com example.com    # query authoritative directly, bypass cache

Look for the expected IP, consistent answers from authoritative nameservers, and a sensible TTL. Learn how resolution works end-to-end in the DNS concept guide.
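To check for consistent answers without eyeballing two lists, you can compare what your resolver returns with what the authoritative server returns. A minimal sketch follows; same_answers is a hypothetical helper, and ns1.example.com is the same placeholder nameserver used in the commands above.

```shell
# same_answers LIST_A LIST_B
# Succeeds if the two newline-separated record lists hold the same set of
# values, regardless of order. Hypothetical helper, for illustration.
same_answers() (
  a=$(printf '%s\n' "$1" | sort -u)
  b=$(printf '%s\n' "$2" | sort -u)
  [ "$a" = "$b" ]
)

# Example (commented out so that sourcing this block touches no network):
# cached=$(dig +short example.com A)
# auth=$(dig +short @ns1.example.com example.com A)
# same_answers "$cached" "$auth" || echo "resolver and authoritative disagree"
```

A mismatch often just means a cached record has not expired yet, so read it together with the TTL before concluding the zone is wrong.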


Layer 2 — TCP Connects

DNS returned an IP. Can you actually reach it?

ping -c 4 example.com          # ICMP (some hosts block this)
nc -zvw3 example.com 443       # attempt a TCP handshake on 443

Connection refused means the host is reachable but nothing is listening on that port (or a firewall is actively rejecting the connection); jump to the server and check what's running. Connection timed out means packets are not arriving or replies are not coming back; suspect a firewall rule that silently drops traffic, a routing problem, or the server being down. Check security groups, network ACLs, and the host firewall (iptables -L -n or ufw status).
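If nc is not installed, the refused-versus-timed-out distinction can still be made from bash. The sketch below assumes bash with /dev/tcp support and the GNU coreutils timeout command; tcp_probe and explain_probe are hypothetical names.

```shell
# tcp_probe HOST PORT
# Attempts a TCP connect with a 3 second limit.
# Assumes bash built with /dev/tcp support and GNU timeout.
tcp_probe() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# explain_probe STATUS
# Interprets the exit status of tcp_probe.
explain_probe() {
  case "$1" in
    0)   echo "open" ;;      # something accepted the connection
    124) echo "timeout" ;;   # GNU timeout killed the attempt: packets dropped
    *)   echo "refused" ;;   # fast failure: refused, unreachable, or bad name
  esac
}

# Example (commented out so that sourcing this block touches no network):
# tcp_probe example.com 443
# explain_probe "$?"
```

A fast failure can also mean the name did not resolve, so run this only after the DNS layer has checked out.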


Layer 3 — TLS Negotiates

A valid certificate is easy to overlook until it expires at 3 a.m. on a bank holiday.

# Inspect the certificate chain and expiry
openssl s_client -connect example.com:443 -servername example.com </dev/null 2>/dev/null \
  | openssl x509 -noout -dates -subject -issuer

Look for a notAfter date in the future, a subject that matches your domain, and a trusted issuer. Clients actually validate the subjectAltName list, which OpenSSL 1.1.1 or later can print with -ext subjectAltName. A Let's Encrypt certificate that failed to auto-renew is a common cause of certificate incidents; check certbot renew --dry-run and its logs. For a deeper understanding of the handshake, see the TLS handshake explainer.
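Reading notAfter by eye is error-prone at 3 a.m., so it can help to turn it into a number of days. This sketch assumes GNU date (the -d option); days_between is a hypothetical helper.

```shell
# days_between FROM TO
# Prints the whole days from FROM to TO (negative if TO is earlier).
# Both arguments are date strings that GNU date -d can parse.
days_between() (
  from=$(date -u -d "$1" +%s) || exit 2
  to=$(date -u -d "$2" +%s) || exit 2
  echo $(( (to - from) / 86400 ))
)

# Example (commented out so that sourcing this block touches no network):
# not_after=$(openssl s_client -connect example.com:443 -servername example.com \
#     </dev/null 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2)
# days_between "now" "$not_after"
```

For a simple pass/fail check, openssl x509 -noout -checkend 604800 exits non-zero when the certificate expires within the next seven days.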


Layer 4 — HTTP Responds

TLS is fine. What does the application actually return?

curl -svo /dev/null https://example.com 2>&1                       # full verbose exchange
curl -sLo /dev/null -w "%{http_code} %{url_effective}\n" https://example.com

403: the server understood the request but refused it. Check filesystem permissions, nginx deny directives, or a WAF rule blocking legitimate traffic.
502: the reverse proxy got no valid response from the upstream; the app process has likely crashed or isn't listening.
503: the app or proxy is explicitly reporting unavailability (pool exhausted, queue full, maintenance flag).
504: the upstream did not respond before the proxy's timeout; start looking at the database.
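When the site responds but slowly, curl's timing variables show which phase of the request is eating the time. The sketch below takes the four cumulative timings that curl reports and names the phase with the largest share; slowest_phase is a hypothetical helper, and the phase labels are this example's own.

```shell
# slowest_phase
# Reads one line from stdin holding four cumulative curl timings, in order:
#   time_namelookup time_connect time_appconnect time_starttransfer
# and prints whichever of dns, tcp, tls, or server took longest.
slowest_phase() {
  awk '{
    d["dns"]    = $1
    d["tcp"]    = $2 - $1
    d["tls"]    = $3 - $2
    d["server"] = $4 - $3   # request sent until first response byte
    best = "dns"
    for (p in d) if (d[p] > d[best]) best = p
    print best
  }'
}

# Example (commented out so that sourcing this block touches no network):
# curl -s -o /dev/null \
#   -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer}\n' \
#   https://example.com | slowest_phase
```

A result of server points at the application or its database rather than the network, which matches the 504 case above.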


Layer 5 — Application and Reverse Proxy

Now you are on the server.

systemctl status myapp.service          # is the app running?
journalctl -u myapp.service -n 100 --no-pager
ss -tulpn | grep ':8080'                # is anything bound to the port?
tail -100 /var/log/nginx/error.log      # reverse-proxy errors
df -h                                   # a full disk kills more services than you'd expect

If the process has crashed, read the logs before restarting. A restart clears transient state and may fix the symptom, but you need to know why it crashed to prevent the next one.
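Since a full disk is such a frequent culprit, it is worth filtering the df output rather than scanning it. A minimal sketch, assuming POSIX-format output from df -P and mount points without spaces; full_filesystems is a hypothetical helper.

```shell
# full_filesystems THRESHOLD
# Reads `df -P` output on stdin and prints each mount point whose usage is at
# or above THRESHOLD percent. Assumes mount points contain no spaces.
full_filesystems() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)              # "95%" -> "95"
    if (use + 0 >= limit + 0) print $6 " " use "%"
  }'
}

# Example (commented out so the block has no side effects):
# df -P | full_filesystems 90
```

Running the same filter over df -Pi checks inode exhaustion, which produces the same "no space left" errors while df -h still shows free space.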


Layer 6 — Backend and Database

If the app is running but requests are ending in 504s, the database is the prime suspect.

# PostgreSQL — active connections by state
psql -U appuser -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

# Is the DB service healthy?
systemctl status postgresql

Look for sessions piling up in the idle in transaction state, active sessions waiting on locks (wait_event_type = 'Lock' in pg_stat_activity), long-running queries blocking others, or the total approaching max_connections. A missing index on a hot table can turn a routine traffic spike into an outage.
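To judge how close the count is to max_connections, the per-state output of the query above can be summed and compared with the limit. This is a rough sketch: connection_pressure is a hypothetical helper, and the figure is approximate because pg_stat_activity can also list background processes that do not count against the limit.

```shell
# connection_pressure MAX
# Reads "count|state" lines on stdin (the format produced by psql -At) and
# prints "total/MAX (percent%)".
connection_pressure() {
  awk -F'|' -v max="$1" '
    { total += $1 }
    END {
      pct = (max > 0) ? total * 100 / max : 0
      printf "%d/%d (%d%%)\n", total, max, pct
    }'
}

# Example (commented out so that sourcing this block needs no database):
# max=$(psql -U appuser -At -c "SHOW max_connections;")
# psql -U appuser -At \
#   -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;" \
#   | connection_pressure "$max"
```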


Collect Evidence Before You Restart

It is tempting to systemctl restart everything the moment you identify the broken service. Resist. Take 60 seconds: copy the tail of relevant logs, note process memory and CPU, record connection counts, snapshot the relevant dashboard. A restart without evidence is a gamble that the problem won't recur.
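Those 60 seconds go further if the capture is scripted ahead of time. The sketch below saves a small snapshot to a timestamped directory; collect_evidence, the file names, and the default /tmp location are all assumptions to adapt to your own environment.

```shell
# collect_evidence UNIT [BASEDIR]
# Saves logs, sockets, processes, and disk usage before a restart, then prints
# the directory it wrote to. Each capture is best-effort: a missing tool leaves
# an error message in its file instead of aborting the run.
collect_evidence() (
  unit=$1
  base=${2:-/tmp}
  dir="$base/evidence-$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$dir" || exit 1
  journalctl -u "$unit" -n 200 --no-pager > "$dir/journal.txt" 2>&1 || true
  ss -tan                                 > "$dir/sockets.txt" 2>&1 || true
  ps aux --sort=-%mem                     > "$dir/processes.txt" 2>&1 || true
  df -h                                   > "$dir/disk.txt" 2>&1 || true
  echo "$dir"
)

# Example (commented out so the block has no side effects):
# collect_evidence myapp.service /var/tmp
```

Writing to a timestamped directory means repeated incidents do not overwrite each other, so later snapshots can be compared with earlier ones.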


Practice Makes the Method Automatic

When your pager fires at 2 a.m., you want these checks to be muscle memory. The Black Box Lab drops you into a simulated broken server with no hints — a shell, symptoms, and a time limit. The puzzles reinforce the individual commands, and the daily challenge keeps your diagnostic instincts sharp with a fresh scenario every morning. If you want structured progression through Linux and networking fundamentals, join the ShellQuest waitlist.
