
Infrastructure Interview Questions (With Answers)

10 June 2026 · 14 min

If you have spent any time on the hiring side of infrastructure interviews, you know the candidates who stand out are rarely the ones with the most memorised facts. They are the ones who think out loud, catch their own assumptions, and know when they do not know something.

This guide covers the questions that come up again and again in infrastructure, DevOps, and SRE loops — along with what a strong answer sounds like and what the interviewer is quietly listening for. If you want to benchmark where you are first, try the ShellQuest diagnostic.

Quick reference: what these questions are really testing

Question	What they're probing
What happens when you type a URL?	Systems breadth, narrating a mental model
High load, low CPU — why?	I/O intuition, diagnostic instinct
Service won't start — walk me through it	Methodical debugging, not guessing
Process vs thread	Kernel fundamentals, not just definitions
How does DNS work?	Protocol depth, real-world application
TCP vs UDP	Trade-off thinking, not flashcard recall
App is slow — you're paged at 2am	Incident prioritisation, communication
What's using this port?	Practical tooling knowledge
Explain Linux permissions	Comfort with the model, security thinking

Linux fundamentals

"Explain Linux file permissions."

Strong answer: Linux permissions apply to three classes (owner, group, and other), each with read, write, and execute bits. The octal representation makes this concrete: 755 means the owner can read, write, and execute; group and others can read and execute. For directories, execute means the ability to traverse (enter) the directory, which surprises people who only think about files. Then there are the special bits: setuid, setgid, and the sticky bit. Setuid on an executable means it runs with the effective UID of the file's owner, not the caller; that is how sudo and passwd work. The sticky bit on a shared directory such as /tmp means only a file's owner, the directory's owner, or root can delete or rename that file.

What they're probing: Whether you understand the model rather than just the syntax. Weak answers recite chmod 777 without any discussion of what the bits mean or why that would be a bad idea in production.
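
To make the bit layout from the answer above concrete, here is a minimal sketch that decodes an octal mode into the string ls -l would print. The helper name mode_to_rwx is hypothetical, not a standard utility, and it assumes a three- or four-digit octal argument.

```shell
# mode_to_rwx is a hypothetical helper, not a standard utility.
# It turns an octal mode such as 755 or 4755 into the string ls -l would show.
mode_to_rwx() (
    mode=$((0$1))            # the leading 0 makes the shell read the digits as octal
    out=""
    for pos in 6 3 0; do     # bit offsets for owner, group, other
        bits=$(( (mode >> pos) & 7 ))
        r=-; w=-; x=-
        [ $((bits & 4)) -ne 0 ] && r=r
        [ $((bits & 2)) -ne 0 ] && w=w
        [ $((bits & 1)) -ne 0 ] && x=x
        # Special bits reuse the execute column: setuid, setgid, sticky.
        case $pos in
            6) special=$((mode & 04000)); with_x=s; without_x=S ;;
            3) special=$((mode & 02000)); with_x=s; without_x=S ;;
            0) special=$((mode & 01000)); with_x=t; without_x=T ;;
        esac
        if [ "$special" -ne 0 ]; then
            if [ "$x" = x ]; then x=$with_x; else x=$without_x; fi
        fi
        out="$out$r$w$x"
    done
    printf '%s\n' "$out"
)
```

mode_to_rwx 755 prints rwxr-xr-x, and mode_to_rwx 4755 prints rwsr-xr-x, which is the setuid pattern you will see on passwd on many distributions.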

"What's the difference between a process and a thread?"

Strong answer: A process is an independent unit of execution with its own address space, file descriptors, and resources. A thread is a lightweight unit of execution within a process — threads share the same address space and file descriptors, which makes communication cheap but introduces data races if you are not careful. On Linux, both are created via clone() with different flags; the distinction is largely about what gets shared. This is why a bug in one thread can corrupt memory for the entire process, whereas a crashing child process leaves the parent unaffected.

What they're probing: Kernel intuition and awareness of concurrency trade-offs. The weak answer is "threads are lighter than processes" and nothing more.
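
The shell cannot create threads, but it can show the process half of the answer: a forked child works on its own copy of memory. The sketch below assumes a POSIX shell on Linux; isolation_demo and thread_count are hypothetical helper names.

```shell
# A forked child gets its own copy of the parent's memory, so a change made
# in the child never reaches the parent.
isolation_demo() {
    value=parent
    ( value=child )          # parentheses fork a subshell, i.e. a child process
    printf '%s\n' "$value"   # still prints "parent"
}

# thread_count is a hypothetical helper. On Linux, every thread of a process
# appears as a directory under /proc/<pid>/task.
thread_count() {
    ls "/proc/$1/task" | wc -l
}
```

Calling thread_count with the PID of a multi-threaded server shows how many threads share that one address space, which is exactly the sharing that makes a stray write in one thread visible to all the others.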

"Walk me through troubleshooting a service that won't start."

Strong answer: I start with the exit status and the logs before touching anything. systemctl status <service> gives me the last log lines and whether it exited cleanly or was killed. journalctl -u <service> -n 100 --no-pager gives more context. From there I am looking for: a missing binary or library, a port binding failure, a missing config file, a permissions error, or an environment variable that was not set. I would check dmesg if I suspected an OOM kill. The key is I do not make changes until I have a hypothesis — otherwise I am just randomly poking things.

What they're probing: Methodical thinking under pressure. The weak answer jumps straight to "I would restart it". See the interview prep track for hands-on practice with this exact scenario.
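
One way to practise forming a hypothesis from the logs is to map common error strings to the failure classes listed in the answer. This is a rough sketch, not a diagnostic tool: classify_failure is a hypothetical helper, and the patterns are only the standard error texts you typically see from the dynamic loader, the C library, and the kernel.

```shell
# classify_failure is a hypothetical helper: it maps one log line to a
# first hypothesis. The loader check comes first because its message also
# contains "No such file or directory".
classify_failure() {
    case $1 in
        *"error while loading shared libraries"*) echo "missing library" ;;
        *"Address already in use"*)               echo "port binding failure" ;;
        *"Permission denied"*)                    echo "permissions error" ;;
        *"No such file or directory"*)            echo "missing binary or config file" ;;
        *"Out of memory"*)                        echo "possible OOM kill" ;;
        *)                                        echo "no match, keep reading" ;;
    esac
}

# Possible usage against a real unit (myapp is a placeholder name):
#   journalctl -u myapp -n 100 --no-pager |
#       while IFS= read -r line; do classify_failure "$line"; done | sort | uniq -c
```

The output is only a starting hypothesis. The point of the answer still holds: confirm it before changing anything.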

Networking and DNS

"How does DNS resolution work?"

Strong answer: When your browser needs to resolve example.com, it checks its own cache, then asks the OS resolver, which typically consults /etc/hosts and any local cache before querying the configured recursive resolver. If the recursive resolver does not have the answer cached, it starts from the root nameservers, gets referred to the .com TLD nameservers, then to the authoritative nameservers for example.com, and finally retrieves the A or AAAA record. Each answer is cached along the way for as long as its TTL allows. A great follow-up to be ready for: negative caching. NXDOMAIN responses are cached too, for a period taken from the zone's SOA record, which is why a DNS fix sometimes takes time to be visible.

What they're probing: Whether you understand the recursive lookup chain and caching behaviour, not just "DNS turns names into IP addresses". Deep dive: how DNS works.
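
The referral chain in the answer can be sketched as a walk from the root down to the full name. The helper below, delegation_chain (a hypothetical name), only manipulates strings and sends no queries. It also assumes every label is a separate zone, which is not always true in practice.

```shell
# delegation_chain is a hypothetical helper. It prints the names a recursive
# resolver walks through, from the root (".") down to the full name.
delegation_chain() (
    name=${1%.}                  # drop a trailing dot if present
    chain="."
    suffix=""
    while [ -n "$name" ]; do
        label=${name##*.}        # rightmost remaining label
        suffix="$label.$suffix"
        chain="$chain $suffix"
        case $name in
            *.*) name=${name%.*} ;;   # strip the label just consumed
            *)   name="" ;;
        esac
    done
    printf '%s\n' "$chain"
)
```

delegation_chain www.example.com prints ". com. example.com. www.example.com.". To watch a real resolver follow the same path, dig +trace example.com shows each referral in turn.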

"What's the difference between TCP and UDP?"

Strong answer: TCP is a reliable, ordered, connection-oriented protocol. It sets up a connection with the three-way handshake, then guarantees delivery and ordering through sequence numbers, acknowledgements, and retransmission, at the cost of latency and overhead. UDP is connectionless and fire-and-forget. The right choice depends on what you are building: DNS queries normally use UDP because the round-trip overhead of TCP is not worth it for a small query/response, falling back to TCP when a response is too large; video and gaming use UDP because a dropped packet beats a stalled stream; TCP is right when you cannot afford data loss, as with databases, HTTP/1.1 and HTTP/2, and file transfers.

What they're probing: Trade-off reasoning. A weak answer just lists properties without explaining when you'd choose one. More at how TCP works.
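
The round-trip argument for DNS over UDP is easy to put into numbers. The sketch below is a deliberately simplified model: lookup_latency_ms is a hypothetical helper, and it ignores packet loss, server processing time, TLS, and TCP Fast Open.

```shell
# lookup_latency_ms is a hypothetical helper for a simplified model:
#   $1 = transport (udp or tcp), $2 = network round-trip time in milliseconds
lookup_latency_ms() {
    case $1 in
        udp) echo "$2" ;;             # one round trip: query out, answer back
        tcp) echo $(( $2 * 2 )) ;;    # SYN/SYN-ACK first, then query and answer
        *)   echo "unknown transport: $1" >&2; return 1 ;;
    esac
}
```

With a 40 ms round trip, lookup_latency_ms udp 40 prints 40 and lookup_latency_ms tcp 40 prints 80. For a one-packet question and a one-packet answer, the handshake doubles the wait.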

"What happens when you type a URL and press enter?"

Strong answer: This is a systems question disguised as trivia. Browser checks its cache, then DNS resolution to get the IP, then a TCP connection to port 443, TLS handshake to establish an encrypted session, HTTP request sent, server processes and returns a response, browser parses HTML and issues sub-requests for assets, page renders. A great candidate picks one layer — TLS session resumption, HTTP/2 multiplexing, CDN edge caching — and goes deeper rather than covering everything superficially.

What they're probing: Breadth of systems knowledge and the ability to organise a complex answer clearly.
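
Before any of those steps can happen, the browser has to split the URL into the pieces each layer needs: a host for DNS, a port for TCP, and a path for the HTTP request. Here is a minimal sketch of that split. url_parts is a hypothetical helper, and it assumes a plain scheme://host[:port]/path URL with no credentials and no IPv6 literal.

```shell
# url_parts is a hypothetical helper. It prints: scheme host port path
url_parts() (
    url=$1
    scheme=${url%%://*}
    rest=${url#*://}
    hostport=${rest%%/*}
    case $rest in
        */*) path=/${rest#*/} ;;
        *)   path=/ ;;                   # no path given, request the root
    esac
    case $hostport in
        *:*) host=${hostport%%:*}; port=${hostport##*:} ;;
        *)   host=$hostport
             case $scheme in             # fall back to the scheme's default port
                 https) port=443 ;;
                 http)  port=80 ;;
                 *)     port=unknown ;;
             esac ;;
    esac
    printf '%s %s %s %s\n' "$scheme" "$host" "$port" "$path"
)
```

url_parts https://example.com/docs/index.html prints "https example.com 443 /docs/index.html": the host goes to the resolver, the port to the TCP connect, and the path into the request line.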

Troubleshooting judgement

"A server has high load but low CPU — what's going on?"

Strong answer: On Linux, load average counts processes that are runnable or in uninterruptible sleep, not just those using CPU. High load with low CPU almost always means processes are blocked waiting for I/O, usually disk or network. I would check iostat -x 1 for disk utilisation and await, iotop to see the responsible process, and vmstat 1 for the I/O wait percentage (the wa column) and the count of blocked processes (the b column). It could also be NFS hangs or excessive swapping. The point to emphasise: CPU utilisation and load are related but distinct, and conflating them is a common mistake.

What they're probing: Whether you understand what load average actually measures — a great signal for experience level.
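
The reasoning in the answer can be written down as a small decision rule. This is a sketch only: triage_load is a hypothetical helper, and the thresholds (more than one runnable-or-blocked process per core, CPU under 50% busy) are illustrative assumptions rather than tuned values.

```shell
# triage_load is a hypothetical helper with illustrative thresholds.
#   $1 = 1-minute load average, $2 = CPU count, $3 = CPU busy % (user + system)
triage_load() {
    awk -v load="$1" -v cpus="$2" -v busy="$3" 'BEGIN {
        per_core = load / cpus
        if (per_core <= 1)  print "load is within capacity"
        else if (busy < 50) print "high load, low CPU: suspect blocked I/O"
        else                print "high load, high CPU: suspect CPU saturation"
    }'
}

# On a live Linux host the first two inputs could come from:
#   cut -d" " -f1 /proc/loadavg    # 1-minute load average
#   nproc                          # CPU count
```

If the rule points at I/O, ps -eo state,pid,comm | awk '$1 ~ /^D/' lists the processes currently in uninterruptible sleep, which are the ones inflating the load figure.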

"You get paged at 2am — the app is slow. What do you do?"

Strong answer:

uptime                  # load average trend
free -h                 # memory pressure
df -h                   # not out of disk?
vmstat 1 5              # cpu/io/memory in motion
ss -s                   # connection counts
tail -n 200 /var/log/app/error.log   # recent application errors

Before running any of these I would check whether there was a recent deployment, a known upstream outage, or a traffic spike in metrics. The worst thing you can do is start changing things before you understand the blast radius. I would communicate early: even "investigating, will update in 10 minutes" is valuable. Once I have a hypothesis I make one change at a time and observe the effect.

What they're probing: Incident-response discipline. The weak answer dives into commands without mentioning communication, a baseline, or recent changes.

Scripting and tooling

"How would you find what's using a specific port?"

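Strong answer: I reach for ss on modern Linux. It is faster than netstat and ships with iproute2, whereas netstat belongs to the deprecated net-tools package and is often not installed. ss -tlnp 'sport = :8080' shows the listening socket, including the owning process's PID and name (run it as root to see processes owned by other users). Alternatively lsof -i :8080. With a PID I can inspect further: ls -la /proc/<pid>/exe shows the binary, and tr '\0' ' ' < /proc/<pid>/cmdline prints the command line, whose arguments are NUL-separated, so I know exactly what is running.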

What they're probing: Practical tooling fluency. Knowing only netstat is not wrong, but demonstrating ss and /proc shows you keep your tools current.
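
To chain the two steps from the answer (find the socket, then inspect the process), you need the PID out of the ss output. A minimal sketch, assuming the users:(("name",pid=N,fd=M)) field that ss -p prints; pid_from_ss_line is a hypothetical helper and the sample line in the comment is illustrative, not captured from a real host.

```shell
# pid_from_ss_line is a hypothetical helper. It extracts the first PID from
# one line of `ss -p` output, for example (illustrative line):
#   LISTEN 0 511 0.0.0.0:8080 0.0.0.0:* users:(("nginx",pid=1234,fd=6))
pid_from_ss_line() {
    printf '%s\n' "$1" | grep -o 'pid=[0-9]*' | head -n 1 | cut -d= -f2
}

# Possible usage on a live host (-H drops the header row):
#   pid=$(pid_from_ss_line "$(ss -Htlnp 'sport = :8080')")
#   ls -la "/proc/$pid/exe"
#   tr '\0' ' ' < "/proc/$pid/cmdline"; echo
```

If several processes share the socket, this returns only the first one listed, so check the full line before acting on it.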

The meta-point: judgement over trivia

The strongest infrastructure candidates treat every question as a debugging problem. They state assumptions, reason out loud, catch themselves when they drift, and ask clarifying questions when the problem is ambiguous. "It depends" is a valid and often correct answer in infrastructure — what matters is being able to articulate what it depends on. Interviewers are not expecting encyclopaedic recall; they are looking for evidence that you will not panic when something breaks at 3am.

Practice every day

ShellQuest's daily challenge gives you one focused infrastructure question each day — command-line problems, troubleshooting scenarios, and architecture puzzles. Ten minutes a day compounds quickly. If you are preparing for a specific role, the interview prep track sequences exactly the skills hiring teams test for. Not sure where you stand? Take the diagnostic, then join the waitlist to be first in line when new tracks open.
