Distributed Training at Scale: Data Parallelism to Pipeline Parallelism

The era when a single GPU could train a state-of-the-art model has ended. Every frontier result now involves dozens to thousands of accelerators, and the techniques used to coordinate them have become essential knowledge for anyone building serious ML systems. Megatron-LM, GShard, and DeepSpeed have each pushed the boundary over the past two years, and the design space is finally legible enough to navigate.

Data parallelism: the starting point

Data parallelism is the simplest distributed strategy and remains the workhorse. Each worker holds a full copy of the model, processes a different shard of the batch, and averages gradients at the end of each step. Collective operations like all-reduce make this efficient on modern interconnects, and frameworks handle most of the choreography for you.

The limit is memory. Once a model no longer fits on a single accelerator, data parallelism alone cannot help. That limit arrived for language models years ago, and the answer is to shard the model itself.

Tensor parallelism: splitting layers across devices

Tensor parallelism slices individual weight matrices across devices. A single matrix multiply becomes a coordinated computation where each device handles a portion of the rows or columns and communicates partial results. Megatron-LM demonstrated this at scale for transformer layers, and the pattern is now standard for large language model training.

The catch is communication cost. Tensor parallelism requires frequent, latency-sensitive collective operations, which means it only works well within a tightly coupled pod. Cross-node tensor parallelism is almost never a good idea.

Pipeline parallelism: splitting layers across stages

Pipeline parallelism takes a different slice. Different layers of the model live on different devices, and micro-batches flow through them like an assembly line. The naive version leaves devices idle while waiting for upstream outputs, which is why modern pipeline schedules like 1F1B and interleaved pipelines exist: they keep more stages busy simultaneously.

Pipeline parallelism tolerates higher communication latency than tensor parallelism, which makes it the right tool for spanning nodes or racks.

ZeRO and optimizer state sharding

DeepSpeed's ZeRO family of techniques attacks a different part of the memory problem: the optimizer states. For Adam on a large model, optimizer memory can exceed parameter memory by a factor of two or more. ZeRO shards optimizer states, gradients, and eventually parameters themselves across data-parallel workers, dramatically reducing per-device memory with modest communication overhead. For many training runs, ZeRO stages unlock model sizes that would otherwise require tensor or pipeline parallelism.

The 3D parallelism recipe

At frontier scale, the answer is almost always a combination. Tensor parallelism within a node to fit the model, pipeline parallelism across nodes to stack depth, and data parallelism over the whole thing for throughput. Picking the split ratios is a nontrivial optimization problem that depends on model shape, interconnect topology, and batch size.

The practical lesson is that the choice is not ideological. It is a hardware-aware decision, and benchmarking on the actual cluster beats any amount of theoretical analysis.

Operational realities

Distributed training introduces a new class of failures. Silent data corruption, stragglers, deadlocks during all-reduce, and NCCL errors that surface only at scale are all routine at a thousand accelerators. Checkpointing strategy becomes a production concern, not an afterthought. Runs that do not checkpoint frequently lose days of work to a single bad node.

Distributed training is where systems engineering and machine learning meet most directly. The teams that do it well are the ones who treat both disciplines with equal seriousness.