How TCP works: handshakes, windows and reliable delivery

The protocol that turns a lossy, out-of-order network into a tidy, in-order byte stream.

18 min read · updated 20 Jun 2026
This concept is explained in five layers — from a simple analogy up to a deep technical dive. Read top-to-bottom, or jump to your level.
Level 1 · Posting a story one numbered page at a time.

Explain like I'm 5

Imagine you want to send a long bedtime story to a friend, but you can only post one page at a time, and the post office sometimes loses pages or delivers them in the wrong order. How do you make sure your friend gets the whole story, in the right order?

First you phone them: "Can I send you a story?" They say "Yes, go ahead!" and you say "Great, here it comes!" Now you both know the other one is ready. That little back-and-forth is the handshake.

Then you number every page. When your friend gets page 4, they post back a note: "Got everything up to page 4, send me page 5." If a page never arrives, they keep asking for it, and you post that page again. Nobody reads the story until the pages are back in order.

The whole idea in one line

Number everything, confirm what arrived, resend what got lost, and only hand over the story once it is complete and in order. That is TCP.

Your friend can also say "Slow down, my desk is full!" so you do not bury them in pages faster than they can stack them. And when you reach the end, you both say a polite goodbye instead of just hanging up.

Level 2 · Ports, segments and what "reliable" actually means.

Beginner

TCP (Transmission Control Protocol) is one of the main transport protocols of the internet. Its job is to take a stream of bytes from an application and deliver them to another application on another machine reliably and in order, on top of IP, which itself makes no such promises.

To pick the right application, TCP uses ports. An IP address gets you to the right machine; a port gets you to the right program on it (for example, web servers usually listen on port 443 for HTTPS). An IP address plus a port names one endpoint, called a socket. The two sockets together (source IP, source port, destination IP, destination port), plus the protocol, identify one unique conversation.

  • Connection-oriented: a connection is set up before any data flows, and torn down afterwards.
  • Reliable: lost data is detected and resent; corrupted data is thrown away and resent.
  • Ordered: bytes are delivered to the application in exactly the order they were sent.
  • Flow-controlled: a fast sender will not overwhelm a slow receiver.

TCP chops the byte stream into chunks called segments, each wrapped in a TCP header (with ports, sequence numbers and flags) and handed to IP for delivery. The receiver reassembles the segments back into the original stream.
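
The chopping and reassembly described above can be sketched in a few lines. This is a toy model, not how a real stack is written: the names `segmentize` and `reassemble` and the tiny 8-byte segment size are assumptions made for illustration, and real headers carry far more than a sequence number.

```python
from typing import List, Tuple

# A toy "segment": (sequence number of its first byte, payload).
Segment = Tuple[int, bytes]


def segmentize(stream: bytes, mss: int, isn: int = 0) -> List[Segment]:
    """Split a byte stream into chunks of at most `mss` bytes, each tagged
    with the stream position of its first byte."""
    return [(isn + off, stream[off:off + mss]) for off in range(0, len(stream), mss)]


def reassemble(segments: List[Segment]) -> bytes:
    """Rebuild the stream from segments that arrived in any order."""
    return b"".join(payload for _, payload in sorted(segments))


message = b"hello, reliable world"
segments = segmentize(message, mss=8)

# Simulate the network delivering the segments out of order.
shuffled = [segments[2], segments[0], segments[1]]
restored = reassemble(shuffled)
```

Sorting by sequence number is enough here because nothing is lost or duplicated; handling those cases is what the later levels cover.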

TCP vs the network underneath

IP packets can be lost, duplicated, delayed or reordered. TCP hides all of that, so the application sees a clean, ordered pipe of bytes. That illusion is the entire point of TCP.

Its counterpart, UDP, skips all of this: no handshake, no acknowledgements, no ordering. That makes UDP faster and leaner, which suits things like live video and DNS lookups where a little loss beats waiting around.

Level 3 · The handshake, ACKs and the sliding window.

Intermediate

A TCP connection begins with the three-way handshake. Each side picks an unpredictable initial sequence number (ISN) and announces it, so both ends agree where the byte numbering starts before any data flows.

[Figure: the TCP three-way handshake establishes a connection. Client → Server: SYN (seq=x). Server → Client: SYN-ACK (seq=y, ack=x+1). Client → Server: ACK (ack=y+1). Connection ESTABLISHED.]
  1. SYN: the client sends a segment with the SYN flag set and its initial sequence number x.
  2. SYN-ACK: the server replies with SYN set (its own ISN y) and ACK set, acknowledging x+1.
  3. ACK: the client acknowledges the server's ISN with y+1. The connection is now ESTABLISHED and data can flow.
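
The three steps above reduce to simple arithmetic on the two ISNs. The sketch below only models the numbers exchanged; the `Segment` class and `three_way_handshake` function are illustrative names, not part of any real API. It assumes the standard 32-bit sequence space, so values wrap around at 2^32.

```python
from dataclasses import dataclass
from typing import List, Optional

SEQ_MOD = 2 ** 32  # sequence numbers are 32-bit and wrap around


@dataclass
class Segment:
    sender: str
    flags: str
    seq: int
    ack: Optional[int] = None


def three_way_handshake(x: int, y: int) -> List[Segment]:
    """Return the three handshake segments for client ISN x and server ISN y."""
    syn = Segment("client", "SYN", seq=x)
    # The SYN consumes one sequence number, so the server acknowledges x+1.
    syn_ack = Segment("server", "SYN-ACK", seq=y, ack=(x + 1) % SEQ_MOD)
    # The client's next sequence number is x+1; it acknowledges y+1.
    ack = Segment("client", "ACK", seq=(x + 1) % SEQ_MOD, ack=(y + 1) % SEQ_MOD)
    return [syn, syn_ack, ack]


for segment in three_way_handshake(x=1000, y=5000):
    print(segment)
```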

Sequence numbers count bytes, not segments: each segment's sequence number is the position of its first byte in the stream. Acknowledgement numbers are cumulative — an ACK of 1461 means "I have everything up to but not including byte 1461; send me that next."
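
As a rough illustration of cumulative acknowledgement, the function below computes the ACK number a receiver would send given the segments it holds. The name `cumulative_ack` and the `(seq, length)` representation are assumptions for this sketch, and sequence-number wraparound is ignored.

```python
from typing import Iterable, Tuple


def cumulative_ack(next_expected: int, received: Iterable[Tuple[int, int]]) -> int:
    """Return the ACK number for a set of received (seq, length) segments.

    The ACK advances only across contiguous data; it stops at the first gap.
    """
    ack = next_expected
    for seq, length in sorted(received):
        if seq > ack:  # gap: data beyond it cannot be acknowledged yet
            break
        ack = max(ack, seq + length)
    return ack


# Two 730-byte segments starting at byte 1: everything below 1461 has arrived.
print(cumulative_ack(1, [(1, 730), (731, 730)]))   # 1461
# The segment starting at 731 is missing, so the ACK stays at 731.
print(cumulative_ack(1, [(1, 730), (1461, 730)]))  # 731
```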

If a segment is lost, the receiver keeps acknowledging the last in-order byte it has. The sender notices the missing ACK (via a timer or duplicate ACKs) and retransmits. Every segment also carries a checksum; a segment that fails it is silently dropped and treated as lost.
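
The checksum in question is the 16-bit ones' complement "Internet checksum". A minimal sketch follows; note that real TCP computes it over a pseudo-header (addresses, protocol, length) plus the TCP header and payload, whereas this example covers only a payload, and the function name is an assumption.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold any carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


payload = b"hello"
sent_checksum = internet_checksum(payload)

# The receiver recomputes the checksum; a mismatch means the segment is dropped.
intact = internet_checksum(b"hello") == sent_checksum
corrupted = internet_checksum(b"hellp") == sent_checksum
```

Here `intact` is true and `corrupted` is false, so the damaged copy would be discarded and eventually retransmitted like any other lost segment.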

To avoid sending one segment and waiting for each ACK, TCP uses a sliding window: the sender may have several segments "in flight" (sent but not yet acknowledged) at once. The receiver advertises a receive window (rwnd) in every segment — how much buffer space it has free. The sender must never have more unacknowledged data outstanding than that window.

[Figure: the send window slides forward as ACKs arrive. Segments 1-2: sent and ACKed. Segments 3-4: in flight (unACKed). Segments 5-6: can send now. Segment 7: blocked (past window).]
Flow control vs congestion control

Flow control (the receive window) stops you overwhelming the receiver. Congestion control (covered next) stops you overwhelming the network in between. The amount you may send is the minimum of the two.
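
The "minimum of the two" rule can be written as one line of arithmetic. In this sketch `snd_una` (oldest unacknowledged byte) and `snd_nxt` (next byte to send) follow the variable names used in the TCP specification; the function name and the example numbers are made up for illustration.

```python
def usable_window(snd_una: int, snd_nxt: int, rwnd: int, cwnd: int) -> int:
    """Bytes the sender may still transmit right now.

    snd_una: oldest unacknowledged byte; snd_nxt: next byte to be sent.
    """
    in_flight = snd_nxt - snd_una
    return max(0, min(rwnd, cwnd) - in_flight)


# 2000 bytes in flight, receiver offers 4000, network limit is 8000.
print(usable_window(2000, 4000, rwnd=4000, cwnd=8000))  # 2000
# An ACK up to byte 3000 arrives: the window slides and frees 1000 more bytes.
print(usable_window(3000, 4000, rwnd=4000, cwnd=8000))  # 3000
# The congestion window is the tighter limit and is already exceeded: blocked.
print(usable_window(2000, 4000, rwnd=4000, cwnd=1500))  # 0
```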

Level 4 · Congestion control, teardown and TIME_WAIT.

Advanced

The receive window protects the receiver, but nothing in it protects the routers between the two endpoints. Congestion control adds a second, sender-side limit called the congestion window (cwnd). The sender may send no more than min(rwnd, cwnd) of unacknowledged data.

Classic TCP congestion control has two growth phases, separated by a slow start threshold (ssthresh), plus a back-off rule for loss:

  • Slow start: cwnd begins small (a few segments) and roughly doubles every round-trip — exponential growth — until it reaches ssthresh or a loss occurs.
  • Congestion avoidance: above ssthresh, cwnd grows linearly (about one segment per RTT). This is the additive increase half of AIMD.
  • On loss: the sender backs off multiplicatively — the multiplicative decrease half of AIMD — because loss is read as a sign of congestion.
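
The two growth phases can be traced with a small loss-free simulation. This is a simplified model counted in whole segments per round trip; capping slow start exactly at ssthresh and the function names are assumptions of the sketch, and real stacks update cwnd per ACK in bytes.

```python
from typing import List


def grow_cwnd(cwnd: int, ssthresh: int) -> int:
    """Advance cwnd (in segments) by one loss-free round trip."""
    if cwnd < ssthresh:
        return min(cwnd * 2, ssthresh)  # slow start: doubles each RTT
    return cwnd + 1                     # congestion avoidance: +1 per RTT


def trace(initial: int, ssthresh: int, rtts: int) -> List[int]:
    """Return cwnd at the start and after each of `rtts` round trips."""
    cwnd = initial
    history = [cwnd]
    for _ in range(rtts):
        cwnd = grow_cwnd(cwnd, ssthresh)
        history.append(cwnd)
    return history


print(trace(initial=1, ssthresh=16, rtts=7))
# [1, 2, 4, 8, 16, 17, 18, 19]
```

The output shows the knee at ssthresh: exponential growth up to 16 segments, then one extra segment per round trip.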

How it reacts depends on how loss was detected. A timeout is treated as serious: ssthresh drops to about half the data in flight, cwnd collapses back to one segment, and slow start restarts. But three duplicate ACKs (the receiver repeatedly asking for the same byte) trigger fast retransmit, which resends the missing segment immediately without waiting for the timer. Fast recovery follows and halves cwnd rather than collapsing it: ACKs are still arriving, so the pipe clearly is not dead.
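
The two reactions differ only in where cwnd lands. A minimal sketch in segment units, assuming Reno-style behaviour and using cwnd as a stand-in for the amount of data in flight (the function and signal names are hypothetical):

```python
from typing import Tuple


def react_to_loss(cwnd: int, signal: str) -> Tuple[int, int]:
    """Return (new_cwnd, new_ssthresh) in segments after a loss signal.

    signal is "timeout" or "triple_dup_ack".
    """
    ssthresh = max(cwnd // 2, 2)  # halve, but never below two segments
    if signal == "timeout":
        return 1, ssthresh        # collapse and restart slow start
    if signal == "triple_dup_ack":
        return ssthresh, ssthresh  # fast recovery: continue from half
    raise ValueError(f"unknown loss signal: {signal}")


print(react_to_loss(20, "timeout"))         # (1, 10)
print(react_to_loss(20, "triple_dup_ack"))  # (10, 10)
```

This leaves out the temporary window inflation that real fast recovery performs while duplicate ACKs keep arriving.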

Head-of-line blocking

Because TCP guarantees in-order delivery, a single lost segment stalls everything behind it: later segments may have arrived and sit in the receiver's buffer, but the application cannot read them until the gap is filled. This head-of-line blocking is a core motivation for QUIC (see the deep dive).
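
The stall is easy to see in a toy receiver that releases data to the application only when it is contiguous. The class name `InOrderReceiver` is an assumption, and the sketch presumes segments never overlap.

```python
from typing import Dict


class InOrderReceiver:
    """Buffers out-of-order segments and releases only contiguous bytes."""

    def __init__(self, next_seq: int = 0) -> None:
        self.next_seq = next_seq
        self.pending: Dict[int, bytes] = {}  # seq -> payload waiting behind a gap

    def receive(self, seq: int, payload: bytes) -> bytes:
        """Store a segment and return whatever the application can now read."""
        if seq >= self.next_seq:  # ignore data that was already delivered
            self.pending[seq] = payload
        ready = bytearray()
        while self.next_seq in self.pending:
            chunk = self.pending.pop(self.next_seq)
            ready += chunk
            self.next_seq += len(chunk)
        return bytes(ready)


rx = InOrderReceiver()
print(rx.receive(4, b"efgh"))  # b'' : arrived, but stuck behind missing bytes 0-3
print(rx.receive(8, b"ijkl"))  # b'' : still stuck
print(rx.receive(0, b"abcd"))  # b'abcdefghijkl' : gap filled, everything released
```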

Closing a connection is a four-way exchange, because each direction is shut down independently. One side sends FIN, the other ACKs it (that direction is now half-closed), then sends its own FIN, which is ACKed in turn.

[Figure: graceful connection teardown, ending in TIME_WAIT. Initiator → Peer: FIN. Peer → Initiator: ACK. Peer → Initiator: FIN. Initiator → Peer: ACK. The initiator then waits in TIME_WAIT.]

The side that sends the final ACK enters TIME_WAIT and lingers there for 2 × MSL, where MSL is the maximum segment lifetime; in practice the total wait is often around 60 seconds. This exists for two reasons: to re-send the final ACK if it was lost (otherwise the peer keeps retransmitting its FIN), and to ensure any stray, delayed segments from this connection die out before the same four-tuple could be reused, preventing them from corrupting a new connection.

Why TIME_WAIT piles up on busy servers

Whichever side initiates the close holds TIME_WAIT. On a busy proxy or load balancer that closes many short-lived connections, thousands of sockets sit in TIME_WAIT. Prefer connection reuse (keep-alive) over tuning kernel knobs; reusing connections avoids the churn entirely.
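
A back-of-the-envelope estimate shows why the numbers grow so quickly: in steady state, the count of sockets in TIME_WAIT is roughly the rate of active closes multiplied by how long each one lingers. The function name and the traffic figures below are made up for illustration, and the 60-second default is the "often around" figure quoted above.

```python
def time_wait_sockets(closes_per_second: float, time_wait_seconds: float = 60.0) -> float:
    """Steady-state estimate: active closes per second times the linger time."""
    return closes_per_second * time_wait_seconds


# A proxy actively closing 500 short-lived connections per second.
print(time_wait_sockets(500))  # 30000.0
# With keep-alive, suppose only 5 connections per second are actually closed.
print(time_wait_sockets(5))    # 300.0
```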

| Property | TCP | UDP |
| --- | --- | --- |
| Connection | Connection-oriented (handshake) | Connectionless |
| Reliability | Guaranteed delivery + retransmission | Best-effort; app handles loss |
| Ordering | In-order byte stream | No ordering |
| Flow / congestion control | Yes | No (app's problem) |
| Header size | 20 bytes (more with options) | 8 bytes |
| Boundaries | Stream (no message boundaries) | Preserves datagram boundaries |
| Typical use | HTTP(S), SSH, SMTP, file transfer | DNS, VoIP, gaming, video, QUIC |

Level 5 · State machine, SACK, Nagle, CUBIC/BBR and QUIC.

Deep dive

Underneath the friendly "ESTABLISHED" lies a formal state machine. Every socket is in exactly one state, and segments (or the application's connect/listen/close calls) drive transitions.

  1. CLOSED: no connection
  2. LISTEN: server awaiting SYN
  3. SYN-SENT / SYN-RECEIVED: handshake in progress
  4. ESTABLISHED: data flows freely
  5. FIN-WAIT / CLOSE-WAIT: one or both sides closing
  6. TIME-WAIT: 2×MSL drain
  7. CLOSED: socket released
Principal TCP states from listen to close.
  • LISTEN: server has called listen() and is waiting for incoming SYNs.
  • SYN-SENT: client has sent a SYN and awaits the SYN-ACK.
  • SYN-RECEIVED: server received a SYN and replied with SYN-ACK, awaiting the final ACK.
  • ESTABLISHED: the fully open, data-transfer state for both ends.
  • FIN-WAIT-1 / FIN-WAIT-2: the active closer has sent its FIN and is awaiting the ACK, then the peer's FIN.
  • CLOSE-WAIT: the passive closer received a FIN; it stays here until the application calls close(), which sends its own FIN.
  • TIME-WAIT: the active closer waits 2×MSL before fully releasing the four-tuple.
  • CLOSED: the connection no longer exists.
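
The states above can be modelled as a lookup table of transitions. This is a partial sketch: the event names are invented for illustration, the LAST-ACK state (where the passive closer waits for the ACK of its FIN) is part of the standard state machine although the list above does not mention it, and simultaneous open and close are left out.

```python
from typing import Dict, Iterable, Tuple

# (current state, event) -> next state. Event names are hypothetical labels.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("CLOSED", "passive_open"): "LISTEN",
    ("CLOSED", "active_open"): "SYN-SENT",         # client sends SYN
    ("LISTEN", "recv_syn"): "SYN-RECEIVED",        # server replies SYN-ACK
    ("SYN-SENT", "recv_syn_ack"): "ESTABLISHED",   # client sends final ACK
    ("SYN-RECEIVED", "recv_ack"): "ESTABLISHED",
    ("ESTABLISHED", "close"): "FIN-WAIT-1",        # active closer sends FIN
    ("ESTABLISHED", "recv_fin"): "CLOSE-WAIT",     # passive closer ACKs the FIN
    ("FIN-WAIT-1", "recv_ack"): "FIN-WAIT-2",
    ("FIN-WAIT-2", "recv_fin"): "TIME-WAIT",       # sends the final ACK
    ("CLOSE-WAIT", "close"): "LAST-ACK",           # passive closer sends its FIN
    ("LAST-ACK", "recv_ack"): "CLOSED",
    ("TIME-WAIT", "timeout_2msl"): "CLOSED",
}


def run(state: str, events: Iterable[str]) -> str:
    """Apply events in order; a KeyError means an illegal transition."""
    for event in events:
        state = TRANSITIONS[(state, event)]
    return state


client = run("CLOSED", ["active_open", "recv_syn_ack", "close",
                        "recv_ack", "recv_fin", "timeout_2msl"])
server = run("CLOSED", ["passive_open", "recv_syn", "recv_ack",
                        "recv_fin", "close", "recv_ack"])
```

Both walks end in CLOSED. Stopping the server walk before its `close` event leaves it in CLOSE-WAIT.
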
A pile of CLOSE_WAIT means a bug

Sockets stuck in CLOSE_WAIT mean the peer closed but your application never called close() on its end. That is an application file-descriptor leak, not a network problem — hunt down the unclosed socket.
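
One common way to avoid the leak in Python is to let a `with` block close the socket on every code path. The sketch below uses `socket.socketpair()` as a local stand-in for an accepted TCP connection so that it runs without a network; the `handle` function is a hypothetical request handler.

```python
import socket


def handle(conn: socket.socket) -> bytes:
    """Read one message; the with-block closes conn even if recv raises."""
    with conn:
        return conn.recv(1024)


# socketpair() gives two connected local sockets, standing in for client and server.
client, server = socket.socketpair()
client.sendall(b"bye")
client.close()                       # the peer closes first

data = handle(server)                # returns b"bye"
server_closed = server.fileno() == -1  # True: our end was closed too
```

Had `handle` returned without closing `conn`, the equivalent TCP socket would sit in CLOSE_WAIT until the process exited or the object was garbage-collected.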

SACK (Selective Acknowledgement) patches a weakness of cumulative ACKs. Without it, after a single loss the sender can only learn "everything up to byte N arrived" and may needlessly resend data the receiver already has. With the SACK option, the receiver reports the specific non-contiguous blocks it received, so the sender retransmits only the genuine gaps. Support for SACK is negotiated in the SYN segments of the handshake and is enabled by default on mainstream modern stacks.
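
To make the difference concrete, the sketch below derives SACK-style blocks from the out-of-order data a receiver holds, and then the gaps a sender would retransmit. Both function names are assumptions; blocks are half-open byte ranges, and the limit on how many blocks fit in one real TCP option is ignored.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # [left, right): first byte held, first byte not held


def sack_blocks(received: List[Tuple[int, int]]) -> List[Block]:
    """Merge out-of-order (seq, length) segments into contiguous blocks."""
    blocks: List[Block] = []
    for seq, length in sorted(received):
        left, right = seq, seq + length
        if blocks and left <= blocks[-1][1]:  # touches or overlaps previous block
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], right))
        else:
            blocks.append((left, right))
    return blocks


def gaps_to_retransmit(cum_ack: int, blocks: List[Block]) -> List[Block]:
    """Byte ranges between the cumulative ACK and the reported blocks."""
    gaps: List[Block] = []
    cursor = cum_ack
    for left, right in blocks:
        if left > cursor:
            gaps.append((cursor, left))
        cursor = max(cursor, right)
    return gaps


# Receiver has everything below 1000, plus three later 1000-byte segments.
blocks = sack_blocks([(2000, 1000), (3000, 1000), (5000, 1000)])
gaps = gaps_to_retransmit(1000, blocks)
print(blocks)  # [(2000, 4000), (5000, 6000)]
print(gaps)    # [(1000, 2000), (4000, 5000)]
```

Without the blocks, the sender would know only that byte 1000 is missing and could end up resending the 3000 bytes the receiver already holds.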

Nagle's algorithm reduces tiny-packet overhead by holding back small writes until either the outstanding data is ACKed or a full segment's worth has accumulated. Delayed ACK independently holds back ACKs (up to ~200 ms) hoping to piggyback them on return data. Together they can interact pathologically: Nagle waits for an ACK that delayed ACK is deliberately sitting on, adding latency spikes to small request/response traffic. Latency-sensitive apps set TCP_NODELAY to disable Nagle.
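
In Python, the option is set with the standard `socket` module. The helper name `make_low_latency_socket` is hypothetical; the socket is created but never connected, so the example runs offline.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 sends small writes immediately instead of coalescing them.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_low_latency_socket()
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```

The trade-off is more, smaller packets on the wire, so it suits chatty request/response traffic rather than bulk transfer.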

| Algorithm | Signal it reacts to | Notes |
| --- | --- | --- |
| Reno / NewReno | Packet loss (timeout or 3 dup-ACKs) | The classic AIMD baseline. |
| CUBIC | Packet loss | Linux default since 2.6.19; window grows on a cubic curve, better for high-bandwidth, high-latency paths. |
| BBR | Measured bandwidth + RTT | Models the bottleneck rather than treating loss as the only signal; performs well on lossy or buffer-bloated links. |
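
To give a feel for the "cubic curve", here is the window function from the published CUBIC specification, W(t) = C·(t − K)³ + W_max, with the constants C = 0.4 and β = 0.7 suggested there. Treat it as a sketch of the curve's shape only: kernel implementations use fixed-point arithmetic and extra rules that are not modelled, and the function names are assumptions.

```python
C = 0.4     # scaling constant from the CUBIC specification
BETA = 0.7  # multiplicative decrease factor: window drops to 70% on loss


def cubic_k(w_max: float) -> float:
    """Seconds needed to climb back to w_max after a loss."""
    return (w_max * (1 - BETA) / C) ** (1 / 3)


def cubic_window(t: float, w_max: float) -> float:
    """Window (in segments) t seconds after the last loss event."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max


w_max = 100.0
k = cubic_k(w_max)
for t in (0.0, k / 2, k, 1.5 * k):
    print(f"t={t:5.2f}s  cwnd={cubic_window(t, w_max):6.1f}")
```

The curve starts at 70% of the old maximum, flattens as it approaches that maximum at t = K, then accelerates again to probe for more bandwidth.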

Even with all of this, TCP cannot escape head-of-line blocking at the connection level, and its handshake plus TLS adds round-trips before data flows. QUIC (the transport under HTTP/3) tackles both: it runs over UDP, rebuilding reliability, ordering and congestion control in user space, and carries multiple independent streams in one connection so a loss in stream 3 does not stall streams 1 and 2. It also folds the cryptographic handshake into the transport handshake (see the TLS handshake), cutting setup latency to roughly one round-trip for new connections and zero for resumptions.

For troubleshooting, three tools cover most situations: ss, netstat and tcpdump.

# All TCP sockets + states (modern, fast)
ss -tan

# Add internal congestion info: cwnd, RTT, retransmits
ss -ti

# Listening sockets with owning process
ss -tlnp

# Count sockets by state (spot TIME_WAIT / CLOSE_WAIT pile-ups)
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c

# Older equivalent, present on almost every host
netstat -tan

# Watch the actual handshake and data on the wire
sudo tcpdump -ni eth0 'tcp port 443'

# Show only connection-lifecycle segments (SYN, FIN, RST)
sudo tcpdump -ni eth0 'tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0'
Inspecting TCP connections and the wire
Reading a tcpdump capture quickly

The flags column tells the story: [S] SYN, [S.] SYN-ACK, [.] bare ACK, [P.] push with data, [F.] FIN, [R] RST. A burst of duplicate [.] ACKs followed by a retransmission is fast retransmit in action. A flood of [R] resets usually means something is refusing connections or a firewall is killing them mid-stream.
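
When skimming long captures, a tiny decoder for the bracketed flags field can help. The mapping below follows the letters tcpdump uses for TCP flags; the function name is hypothetical and the input is just the bracketed field, not a full capture line.

```python
from typing import List

# tcpdump's single-character flag codes.
FLAG_NAMES = {
    "S": "SYN", "F": "FIN", "P": "PSH", "R": "RST",
    "U": "URG", "W": "CWR", "E": "ECE", ".": "ACK",
}


def decode_flags(field: str) -> List[str]:
    """Turn a flags field such as '[S.]' into a list of flag names."""
    inner = field.strip("[]")
    if inner == "none":  # tcpdump prints [none] when no flags are set
        return []
    return [FLAG_NAMES[ch] for ch in inner]


print(decode_flags("[S]"))   # ['SYN']
print(decode_flags("[S.]"))  # ['SYN', 'ACK']
print(decode_flags("[P.]"))  # ['PSH', 'ACK']
print(decode_flags("[R]"))   # ['RST']
```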

Note that DNS resolves the name to an IP address before any of this begins, and whether two IP addresses can even reach each other is an IP routing concern — see subnetting and CIDR.

The one-paragraph summary

TCP turns IP's unreliable, unordered packet delivery into a reliable, in-order byte stream between two sockets (IP:port pairs). A three-way handshake (SYN, SYN-ACK, ACK) synchronises starting sequence numbers; data is then numbered per byte and confirmed with cumulative ACKs, with lost or corrupt segments retransmitted. A sliding window does flow control (the receiver's advertised rwnd) while a congestion window with slow start and AIMD — today CUBIC or BBR on Linux — does congestion control, the sender limited by min(rwnd, cwnd). SACK lets the receiver report specific gaps so only genuine losses are retransmitted; disabling Nagle with TCP_NODELAY eliminates latency from its interaction with delayed ACK. Connections close with a four-way FIN exchange and the initiator lingers in TIME_WAIT for 2×MSL to absorb stray segments. Its core weakness — in-order head-of-line blocking — is precisely what QUIC/HTTP3 sidesteps by running multiplexed, independently-reliable streams over UDP. When things go wrong, ss -ti shows states and congestion windows, and tcpdump shows the raw handshake, retransmissions and resets.

Frequently asked questions

What is the difference between TCP and UDP?

TCP is connection-oriented and gives you reliable, in-order delivery with flow and congestion control, at the cost of handshake overhead. UDP is connectionless and best-effort: no handshake, no retransmission, no ordering. Use TCP for web, SSH and file transfer; use UDP for DNS, live video, VoIP and gaming where speed matters more than perfect delivery.

Why does TCP use a three-way handshake instead of two?

Both sides must agree on each other's initial sequence number before data flows. The client's SYN tells the server its ISN; the server's SYN-ACK both acknowledges that and announces the server's own ISN; the client's final ACK confirms the server's ISN. Two messages cannot reliably synchronise both directions, so three is the minimum.

What is TIME_WAIT and is it a problem?

TIME_WAIT is the state entered by the side that closes the connection first, and it lasts roughly 2×MSL (often ~60 s total). It lets a lost final ACK be re-sent and ensures stray old segments expire before the same four-tuple is reused. It is normal and protective. It only becomes a concern on hosts that open huge numbers of short-lived connections; the right fix is connection reuse (keep-alive), not aggressive kernel tuning.

What causes TCP head-of-line blocking?

TCP delivers bytes strictly in order, so if one segment is lost, every segment that arrived after it must wait in the receiver's buffer until the gap is retransmitted and filled. The application is blocked on that missing byte. QUIC/HTTP3 avoids this by carrying independent streams over UDP so a loss in one stream does not stall the others.

What is the default TCP congestion control algorithm on Linux?

CUBIC has been the Linux default since kernel 2.6.19; it grows the congestion window along a cubic curve to make better use of high-bandwidth, high-latency links. BBR is a popular alternative that models the path's bandwidth and round-trip time rather than reacting only to loss, often performing better on lossy or heavily buffered links.

How do I see TCP connection states and problems on Linux?

Use `ss -tan` for all TCP sockets and their states, `ss -ti` to add congestion-window and RTT internals, and `ss -tlnp` for listening sockets with owning processes. The older `netstat -tan` works everywhere. To watch the actual handshake, retransmissions and resets on the wire, run `sudo tcpdump -ni <iface> 'tcp port <port>'` and read the flags column.
