Three years after the "Attention Is All You Need" paper and a few months into the world that GPT-3 has reshaped, the transformer has quietly become the default substrate for almost every serious language and sequence model in production. Understanding the architecture is no longer optional for anyone building modern AI systems.
This piece works through the transformer from first principles, then turns to the design choices that actually matter once you start serving these models at scale.
Why attention replaced recurrence
Recurrent networks carried sequence information through hidden state, which made them inherently sequential and notoriously difficult to train on long contexts. Attention lets every token look at every other token directly, in parallel. The cost grows quadratically with sequence length, but the work is dense matrix multiplication that spreads cleanly across parallel hardware. On modern accelerators, parallelism beats algorithmic elegance.
The core operation is scaled dot-product attention: queries, keys, and values are projected from the input, the query-key inner product (divided by the square root of the key dimension, which is the "scaled" part) determines how much weight to place on each value, and a softmax normalizes the weights so they sum to one for each query. Multi-head attention runs several of these in parallel with smaller dimensions, giving the model room to attend to different kinds of structure at once.
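To make the operation concrete, here is a minimal NumPy sketch of scaled dot-product attention and a multi-head wrapper around it. The function names, the toy sizes, and the random weights are illustrative assumptions, not taken from any particular library; masking, dropout, and biases are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (..., n_q, d_k), k: (..., n_k, d_k), v: (..., n_k, d_v)."""
    d_k = q.shape[-1]
    # Similarity of every query to every key, scaled by sqrt(d_k).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    n_tokens, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_head)
        return t.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = (split_heads(x @ w) for w in (w_q, w_k, w_v))
    heads, _ = scaled_dot_product_attention(q, k, v)
    # Concatenate the heads back to (n_tokens, d_model), then mix them.
    merged = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return merged @ w_o

# Toy example: 4 tokens, model width 8, random (untrained) weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v, w_o = (rng.normal(size=(8, 8)) for _ in range(4))
single_out, weights = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
multi_out = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads=2)
print(single_out.shape, multi_out.shape)
```

Note that the multi-head version does the same total amount of projection work as the single-head one; it only splits the width into independent slices before the softmax.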
The block: attention plus a feed-forward network
A transformer block wraps multi-head attention and a position-wise feed-forward network, each inside a residual connection with layer normalization. This residual path is not incidental. It is what lets the network train to real depth. Stack enough blocks and you get the backbone of BERT, GPT-2, T5, and GPT-3. The architecture is strikingly uniform, and that uniformity is precisely why it scales.
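A rough NumPy sketch of one block, and of stacking several, might look like the following. It assumes the post-norm ordering of the original paper (normalize after adding the residual); GPT-2 and many later models instead normalize before each sublayer. It also uses a single attention head, a ReLU feed-forward network, and a layer norm without learned gain or bias, all simplifications chosen for brevity. Every name and size here is hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    # (Learned gain and bias are omitted in this sketch.)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise: the same two-layer MLP is applied to every token.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def transformer_block(x, p):
    # Each sublayer sits inside a residual connection, then layer norm.
    x = layer_norm(x + self_attention(x, p["w_q"], p["w_k"], p["w_v"]))
    x = layer_norm(x + feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"]))
    return x

def init_block(rng, d_model, d_ff):
    s = 0.1  # small random init, only for the demo
    return {
        "w_q": rng.normal(scale=s, size=(d_model, d_model)),
        "w_k": rng.normal(scale=s, size=(d_model, d_model)),
        "w_v": rng.normal(scale=s, size=(d_model, d_model)),
        "w1": rng.normal(scale=s, size=(d_model, d_ff)),
        "b1": np.zeros(d_ff),
        "w2": rng.normal(scale=s, size=(d_ff, d_model)),
        "b2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens, n_layers = 8, 32, 4, 6
layers = [init_block(rng, d_model, d_ff) for _ in range(n_layers)]

x = rng.normal(size=(n_tokens, d_model))
y = x
for params in layers:  # the "stack" is just the same block repeated
    y = transformer_block(y, params)
print(y.shape)
```

Because every block maps a (tokens, width) array to another array of the same shape, depth is a single loop. That is the uniformity the paragraph above refers to.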
Positional information
Self-attention has no built-in notion of order: permute the input tokens and the outputs are simply permuted the same way (strictly speaking, it is permutation equivariant). Position therefore has to be added back in explicitly. The original paper used sinusoidal encodings. Later work moved toward learned embeddings, and more recent designs explore relative and rotary schemes that tend to extrapolate better to contexts longer than those seen during training. The choice interacts with how far you expect to push context length in production.
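As a small illustration, the sinusoidal scheme from the original paper can be sketched in a few lines of NumPy. Even dimensions get a sine and odd dimensions a cosine, with wavelengths that grow geometrically across the width. The function name and the toy sizes are assumptions made for this example.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Return an (n_positions, d_model) array of fixed position encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even model width"
    pos = np.arange(n_positions)[:, None]   # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]    # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dimensions
    pe[:, 1::2] = np.cos(angles)            # odd dimensions
    return pe

pe = sinusoidal_positions(n_positions=16, d_model=8)

# The encodings are added to the token embeddings before the first block.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(16, 8))  # stand-in for a real embedding lookup
inputs = token_embeddings + pe
print(inputs.shape)
```

Nothing here is learned, so the table can be generated for any length on demand; whether the model behaves well at lengths it never trained on is a separate, empirical question.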
From research code to serving
The gap between a working research implementation and a transformer that serves real traffic is wider than it looks. Several concerns dominate.
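Memory is usually the first constraint. A naively materialized attention score matrix grows with the square of sequence length, and the key-value cache kept during autoregressive decoding grows linearly with both context length and batch size, to the point where it can rival or exceed the parameter memory of the model itself. KV caching is itself a trade: it spends memory to avoid recomputing keys and values for every generated token. Flash attention and other memory-efficient attention kernels attack the quadratic term by never materializing the full score matrix. Together these techniques turn what looks like an impossible serving bill into a tractable one.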
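A back-of-the-envelope calculation shows why memory bites first. The sketch below estimates KV cache size and the size of one layer's fully materialized attention scores. The model shape (32 layers, 32 heads of width 128, fp16 storage) is a hypothetical configuration chosen for round numbers, not a description of any specific model.

```python
GIB = 2 ** 30

def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch_size, bytes_per_elem=2):
    # One key vector and one value vector per token, per head, per layer.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch_size * bytes_per_elem

def attention_scores_bytes(n_heads, seq_len, batch_size, bytes_per_elem=2):
    # A full (seq_len x seq_len) score matrix per head, for a single layer,
    # if the kernel materializes it instead of computing it in tiles.
    return n_heads * seq_len * seq_len * batch_size * bytes_per_elem

# Hypothetical configuration, fp16 (2 bytes per element).
cfg = dict(n_layers=32, n_kv_heads=32, d_head=128)

cache = kv_cache_bytes(seq_len=4096, batch_size=8, **cfg)
scores = attention_scores_bytes(n_heads=32, seq_len=4096, batch_size=8)

print(f"KV cache: {cache / GIB:.1f} GiB")
print(f"Materialized scores, one layer: {scores / GIB:.1f} GiB")
```

Under these assumptions the cache alone comes to 16 GiB, and doubling either the context length or the batch size doubles it. The score matrix quadruples when the context doubles, which is the term that tiled attention kernels avoid storing.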
Throughput is the second. Naive batching leaves accelerators idle, because a static batch runs until its longest sequence finishes while the slots of shorter sequences sit empty. Continuous batching, token-level scheduling, and speculative decoding have become the standard toolkit for anyone running an inference fleet. These are not optimizations for the last ten percent. They often make the difference between a 100-millisecond response and a one-second response.
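The scheduling effect is easy to see in a toy simulation. The sketch below counts decode steps for a queue of requests under static batching versus a simplified form of continuous batching, where a freed slot is refilled on the next step. It is a deliberately crude model: it ignores prefill, assumes every decode step costs the same regardless of how full the batch is, and uses made-up request lengths.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    # Each batch runs until its longest request has finished.
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    queue = deque(lengths)
    active = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or active:
        # Refill any free slots before the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step produces one token per active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps

# Hypothetical workload: short and long generations interleaved.
lengths = [2, 8, 2, 8, 2, 8, 2, 8]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)
```

With two slots, this workload needs at least 20 steps to emit its 40 tokens. The static schedule is far from that bound because every short request waits on a long neighbor, while the continuous schedule gets close to it.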
Quantization is the third. Post-training quantization to eight or even four bits can cut memory and latency dramatically with modest quality loss, provided you measure carefully on the tasks that matter. The quality hit is almost never uniform across tasks, and this is where honest evaluation earns its keep.
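For intuition, here is a minimal sketch of the simplest form of post-training quantization: symmetric, per-tensor, eight-bit rounding of a weight matrix. Production schemes are usually finer-grained (per-channel or per-group scales, calibration data, outlier handling), so treat this as an illustration of the idea under simplifying assumptions rather than a recipe. The function names and the random weights are hypothetical.

```python
import numpy as np

def quantize_int8(w):
    # One scale for the whole tensor, chosen so the largest weight maps to 127.
    # The floor on the max guards against an all-zero tensor.
    scale = max(float(np.abs(w).max()), 1e-12) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # stand-in for a weight matrix

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

max_err = float(np.abs(w - w_hat).max())
print(f"bytes: {w.nbytes} -> {q.nbytes}, max abs error: {max_err:.4f}")
```

The storage drops by a factor of four relative to fp32, and the per-weight error is bounded by half the scale. What that rounding error does to downstream task quality is not something this arithmetic can tell you, which is why the measurement step matters.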
What the architecture does not solve
Transformers are not a free lunch. Long-context reasoning degrades in ways that are still not fully understood. Hallucination is an artifact of how these models are trained, not a bug that a bigger model quietly removes. And the economics of training from scratch now sit firmly outside the reach of almost every team, which is why the serious design questions have shifted toward fine-tuning, adaptation, and retrieval on top of pre-trained checkpoints.
The architecture has won. The interesting work is in everything that surrounds it.