The LLM Engineering Field Guide: 45 Concepts Every Practitioner Needs

Every production LLM system sits at the intersection of GPU physics, systems engineering, and ML theory. Whether you are optimizing Time to First Token for a chatbot or configuring DeepSpeed ZeRO for a fine-tuning run, the same interconnected set of concepts keeps surfacing. This guide distills them into a single practical reference.

TL;DR: I organized the 45 essential concepts every LLM practitioner encounters into eight parts — hardware foundations, inference fundamentals, inference optimizations, model architecture, training and alignment, scaling and deployment, applications, and production operations. Each entry covers what the concept is, why it matters, the key numbers, and how it connects to everything else. All data reflects the 2024–early 2026 landscape with verified references.

This guide assumes familiarity with basic ML concepts (backpropagation, gradient descent, softmax) and some systems knowledge (memory hierarchies, networking basics).

A note on scope

This is the biggest post on this blog — and by a wide margin. You don't have to read it cover to cover. Use the table below to jump straight to the parts that interest you.

Part Topics Sections
I — Hardware Foundations Roofline model, GPU memory, hardware glossary 1–3
II — Inference Fundamentals Latency, throughput, KV cache, attention, quantization 4–9
III — Inference Optimizations CUDA kernels, FlashAttention, batching, PagedAttention, speculative decoding 10–17
IV — Model Architecture Transformer internals, decoder-only, MoE, tokenization, context windows 18–22
V — Training and Alignment Pretraining, LoRA, mixed precision, ZeRO, scaling laws, RLHF/DPO/GRPO, distillation 23–32
VI — Scaling and Deployment Parallelism, serving frameworks, GPU selection, routing 33–36
VII — Applications Embeddings, RAG, agents, prompt engineering 37–40
VIII — Production Operations Rate limiting, failure modes, monitoring, cost, capacity planning 41–45

Part I — Hardware Foundations

Before diving into inference or training, I find it essential to build a mental model of the hardware these systems run on. The concepts in this section — arithmetic intensity, the GPU memory hierarchy, and key hardware terminology — are prerequisites that surface repeatedly throughout this guide.

1. Memory-Bound vs Compute-Bound and the Roofline Model

To map out LLM performance, let's start with arithmetic intensity: for every byte of data the GPU loads from memory, how many useful calculations does it actually perform? This single metric determines whether an operation is compute-bound (waiting on the processor) or memory-bound (waiting on data to load).

Every GPU has a "critical intensity" threshold—the point where its compute capabilities exactly balance its memory bandwidth. For an NVIDIA H100, calculating this is straightforward (Datasheet, 2023):

\[\frac{989 \text{ TFLOPS}}{3.35 \text{ TB/s}} \approx 295 \text{ FLOPs/byte}\]

Roofline Model

The two phases of LLM inference sit on opposite sides of this threshold:

  • Decode is strictly memory-bound. Generating tokens one by one means loading the entire multi-gigabyte weight matrix from memory just to multiply it against a single new token. In 16-bit precision (2 bytes per parameter), this yields exactly 1 FLOP/byte—nearly 300x below the H100's threshold. The massive compute units sit idle >99% of the time waiting for memory to arrive.
  • Prefill is strictly compute-bound. When processing the input prompt, weights are loaded once but multiplied against hundreds or thousands of tokens simultaneously. This pushes the intensity well above 295, fully saturating the compute units.

The practical implication: To speed up decode, you must optimize memory bandwidth—shrink the weights with quantization, reduce KV memory overhead with GQA and PagedAttention, and raise intensity with batching. To speed up prefill, you must optimize raw computation—use faster GPUs or FP8 compute.
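The arithmetic above is worth keeping at your fingertips. A quick sanity check in Python (peak numbers from the H100 SXM datasheet cited above; the 2 FLOPs per weight counts one multiply and one add):

```python
# Roofline sanity check: is decode memory-bound on an H100?
PEAK_TFLOPS = 989   # H100 dense BF16 Tensor Core compute
PEAK_TB_S = 3.35    # H100 HBM3 bandwidth

critical_intensity = (PEAK_TFLOPS * 1e12) / (PEAK_TB_S * 1e12)  # FLOPs/byte
print(f"critical intensity: {critical_intensity:.0f} FLOPs/byte")  # ~295

# Decode (batch size 1): each 16-bit weight (2 bytes) is loaded from HBM
# and used for a single multiply-accumulate (2 FLOPs) against one token.
decode_intensity = 2 / 2  # 1 FLOP/byte
print(f"decode sits {critical_intensity / decode_intensity:.0f}x below the knee")
```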

2. GPU Memory Hierarchy

A GPU has four layers of memory, working like a pyramid: at the bottom is a massive but relatively slow main memory (HBM), and at the very top are tiny, lightning-fast registers. Moving data up and down this pyramid is the biggest traffic jam in AI. The crucial bottleneck happens between the main memory (HBM) and the fast working memory (SRAM), where the working memory is roughly 10x faster.

GPU Memory Hierarchy

Going from fastest to slowest on an H100 GPU:

  1. Registers: The absolute fastest memory, attached directly to the processing threads. This is where the actual math happens — data must be loaded here for the Tensor Cores to use it.
  2. SRAM (Shared Memory): The tiny working memory that delivers roughly 33 TB/s of aggregate bandwidth.
  3. L2 Cache: A middle ground (50 MB) providing ~12 TB/s. It acts as a safety buffer so that if multiple processors need the same weights, they don't all have to fetch them from the slow main memory.
  4. HBM3: The massive 80 GB main memory where model weights and KV cache live, delivering only ~3.35 TB/s.

Every major software trick in AI — like FlashAttention, kernel fusion, or PagedAttention — exists entirely to keep data up in the fast SRAM as long as possible, avoiding the 10x slower round-trip down to HBM.

3. GPU Hardware Glossary

The terms below appear throughout this guide. Each is a building block of the GPU hardware stack that determines LLM performance.

HBM (High Bandwidth Memory) — Stacked DRAM dies connected via through-silicon vias (TSVs), mounted on-package next to the GPU die. Generations: HBM2e (A100, 2 TB/s), HBM3 (H100, 3.35 TB/s), HBM3e (H200/B200, 4.8–8 TB/s). Why it matters for LLMs: decode is memory-bandwidth-bound, so HBM bandwidth directly determines TPOT.

GDDR (Graphics DDR) — Traditional graphics memory (GDDR6, GDDR6X) used in consumer GPUs (RTX 4090, L40S). Lower bandwidth than HBM but cheaper per GB. GDDR6X on RTX 4090 delivers ~1 TB/s versus H100's 3.35 TB/s HBM3.

SM (Streaming Multiprocessor) — The basic compute building block of NVIDIA GPUs. Each SM contains CUDA cores, Tensor Cores, shared memory (SRAM), and a warp scheduler. H100 has 132 SMs; A100 has 108.

Tensor Cores — Specialized matrix-multiply-accumulate units inside each SM. They accelerate mixed-precision matmuls (FP16, BF16, FP8, INT8) that dominate transformer computation. H100 Tensor Cores deliver 989 TFLOPS in dense BF16 versus ~67 TFLOPS of FP32 from the CUDA cores alone.

CUDA Cores — General-purpose floating-point and integer units. They handle element-wise operations, activation functions, and non-matmul work. Tensor Cores handle the heavy lifting for LLMs; CUDA cores handle everything else.

Warp — A group of 32 threads that execute in lockstep on an SM. The smallest scheduling unit on NVIDIA GPUs. Warp specialization assigns different warps to different tasks (data loading vs computation) for pipelining.

NVLink — High-speed GPU-to-GPU interconnect within a node. NVLink 4.0 (H100) delivers 900 GB/s bidirectional; NVLink 5.0 (B200) reaches 1.8 TB/s. Essential for tensor parallelism where GPUs must exchange activations every layer.

InfiniBand — High-speed network fabric for inter-node GPU communication. NVIDIA ConnectX-7 delivers 400 Gb/s per port. Used for pipeline parallelism and distributed training across nodes.

RDMA (Remote Direct Memory Access) — Allows one GPU to read/write another machine's memory without involving the CPU, minimizing latency. GPUDirect RDMA enables direct GPU-to-GPU transfers across nodes. Used in disaggregated serving for KV cache transfers.

NVMe (Non-Volatile Memory Express) — High-speed SSD interface used for KV cache offloading and ZeRO-Infinity parameter offloading when GPU/CPU memory is insufficient. Sequential read speeds of 5–7 GB/s per drive (PCIe Gen 4), with newer Gen 5 drives reaching 10–14 GB/s.

TFLOPS / PFLOPS — Tera/Peta floating-point operations per second. 1 TFLOPS = 10¹² FLOPS. The standard unit for measuring GPU compute throughput. H100 delivers 989 TFLOPS (dense BF16); FlashAttention-3 achieves ~1.2 PFLOPS in FP8.


Part II — Inference Fundamentals

Inference is where your model meets users. I've organized the foundational concepts here — latency metrics, the KV cache, the two-phase execution model, and the attention and quantization techniques that shape them — because understanding these determines how fast, how cheap, and how reliably you can serve LLM requests at scale.

4. Latency: TTFT, TPOT, and Percentiles

Time to First Token (TTFT) measures the delay from request submission to the first output token arriving. It is determined by the prefill phase — the model must process the entire input prompt before generating anything. Longer prompts mean higher TTFT. Production targets range from <100 ms for code completion to <500 ms for chatbots. MLPerf Inference v5.0 sets P99 TTFT at \(\leq\) 450 ms for Llama 2 70B.

Time Per Output Token (TPOT) is the average interval between consecutive tokens after the first. It maps directly to the decode phase, where each step is memory-bandwidth-bound. Formula:

\[\text{TPOT} = \frac{\text{E2E Latency} - \text{TTFT}}{\text{Output Tokens} - 1}\]

Average human silent reading speed is roughly 250 ms per word (~5 tokens/s) (Brysbaert, 2019), but systems target much faster rates so users never "wait" for text to appear. The threshold for perceived smooth streaming is roughly 25 ms per token (~40 tokens/s). MLPerf targets P99 TPOT \(\leq\) 40 ms for interactive workloads.
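The formula is easy to misapply off by one. A small helper (with hypothetical request timings) makes it concrete:

```python
def tpot_ms(e2e_latency_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Average time per output token after the first, per the formula above."""
    return (e2e_latency_ms - ttft_ms) / (output_tokens - 1)

# Hypothetical request: 400 ms TTFT, 4.4 s end-to-end, 201 output tokens.
print(tpot_ms(4400, 400, 201))  # 20.0 ms/token -> under the 25 ms streaming target
```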

P50 vs P99 latency matters because median latency hides the worst-case experience. P99 reveals the slowest 1% of requests — the ones that trigger user complaints and SLA violations. A system with great P50 but terrible P99 has a batching, preemption, or queue-depth problem.

5. Throughput: Tokens Per Second and the Latency Tradeoff

System-wide throughput is measured in output tokens per second across all concurrent requests. Requests per second (RPS) is a weaker metric because it ignores that a 10-token response and a 1,000-token response have vastly different costs. Typical numbers: Llama 3.1 8B on a single H100 reaches ~5,000-11,000 output tokens/s at high batch sizes (vLLM Benchmarks, 2024); Llama 3 70B FP8 on 4xH100 delivers competitive throughput across vLLM, SGLang, and TensorRT-LLM.

The core tradeoff: at low concurrency, each request gets excellent latency but the GPU is underutilized. Increasing batch size raises throughput nearly linearly until compute saturates, after which latency climbs sharply. The goodput metric — the fraction of requests meeting SLO targets — bridges raw throughput and actual user satisfaction.

6. KV Cache: The Memory Bottleneck That Rules Everything

During autoregressive generation, each new token must attend to all previous tokens. The KV cache stores the Key and Value projections from every token at every layer, avoiding catastrophic \(O(n^2)\) recomputation. Without it, generating token \(n\) would require re-running the model over all \(n-1\) previous tokens.

The KV cache is the dominant memory bottleneck because it grows linearly with sequence length, batch size, and layer count:

\[\text{KV cache} = 2 \times L \times h_{kv} \times d_h \times s \times B \times \text{bytes}\]

where:

  • \(2\) = the K and V tensors stored per token
  • \(L\) = number of layers
  • \(h_{kv}\) = number of KV heads
  • \(d_h\) = head dimension
  • \(s\) = sequence length
  • \(B\) = batch size
  • \(\text{bytes}\) = bytes per element (2 for FP16, 1 for FP8/INT8)

Concrete examples with FP16 and batch size 1: Llama 3 8B at 8,192 tokens uses ~1.0 GB of KV cache; at 128K tokens, 16 GB. Llama 3 70B at 128K tokens requires ~40 GB — half of an H100's total VRAM — for a single sequence. At production batch sizes, KV cache easily exceeds model weight memory. Naive implementations waste 60-80% of allocated KV memory to fragmentation, which is precisely the problem PagedAttention solves.
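A small calculator (a direct transcription of the formula; the Llama 3 shapes are taken from the public model configs) reproduces these numbers:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """KV cache size: 2 (K and V) x L x h_kv x d_h x s x B x bytes/element."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

GB = 1024**3
# Llama 3 8B: L=32, h_kv=8, d_h=128, FP16, batch 1
print(kv_cache_bytes(32, 8, 128, 8_192, 1) / GB)    # 1.0 GB at 8K context
# Llama 3 70B: L=80, h_kv=8, d_h=128 at 128K context
print(kv_cache_bytes(80, 8, 128, 131_072, 1) / GB)  # 40.0 GB for one sequence
```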

Key optimizations include GQA (reducing KV heads), KV cache quantization (FP8/INT8), PagedAttention (block-based allocation with <4% waste), and KV cache offloading to CPU or NVMe.

7. Prefill vs Decode: Two Phases, Two Bottlenecks

The prefill phase processes the entire input prompt in parallel, populating the KV cache. It is compute-bound — large matrix multiplications fully saturate GPU Tensor Cores. TTFT is determined here. The decode phase generates tokens one at a time, each step reading the full model weights and KV cache from HBM to produce a single token. It is memory-bandwidth-bound — the GPU's compute units sit mostly idle waiting for data. TPOT is determined here.

Prefill vs Decode Phases

Chunked prefill splits the prompt into fixed-size chunks (e.g., 512 tokens) instead of processing it all at once. This prevents a long prefill from blocking ongoing decode requests, co-schedules compute-bound and memory-bound work for better GPU utilization, and delivers +50% throughput in vLLM benchmarks (Agrawal et al., 2024). The tradeoff is slightly higher TTFT for the new request.

An emerging pattern called disaggregated serving (introduced by Splitwise and DistServe, now used by Perplexity and NVIDIA Dynamo) physically separates prefill and decode onto different GPU pools, each optimized for its respective bottleneck. KV cache transfers between pools via RDMA.

8. GQA and MQA: Shrinking the KV Cache

Attention Variants

Standard Multi-Head Attention (MHA) gives every query head its own K and V head. Multi-Query Attention (MQA) shares a single KV head across all query heads — an extreme reduction. Grouped-Query Attention (GQA) is the practical middle ground: groups of query heads share one KV head.

Llama 3 70B uses 64 query heads but only 8 KV heads, yielding an 8x KV cache reduction versus MHA. Llama 3.1 405B pushes this to 128 query heads with 8 KV heads for a 16x reduction (Meta, 2024). GQA retains MHA-level quality (within ~1% on benchmarks) while approaching MQA-level speed (Ainslie et al., 2023). The smaller KV cache directly enables larger batch sizes, higher throughput, and lower per-token decode latency.

9. Quantization: Trading Bits for Speed and Memory

Quantization reduces the precision of model weights and/or activations. The core tradeoffs:

Format Bits Memory (7B model) Quality Impact
FP16/BF16 16 ~14 GB Baseline
FP8 8 ~7 GB Essentially lossless on Hopper (H100)
INT8 8 ~7 GB 1-3% degradation with tuning
INT4 4 ~3.5 GB Stable for 70B+, risky for small models

AWQ (Activation-Aware Weight Quantization) identifies the <1% of salient weights by examining activation magnitudes and applies per-channel scaling to protect them. It uses only 128-1,024 calibration tokens and won the MLSys 2024 Best Paper Award. GPTQ uses second-order Hessian information for layer-wise quantization — better on coding benchmarks but needs more calibration data. bitsandbytes (a popular open-source Python library by Tim Dettmers) quantizes on-the-fly during model loading with zero preprocessing cost; its NF4 format powers QLoRA fine-tuning. FP8 on H100/H200 is becoming the production standard — near-lossless with 2x memory reduction.
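The internals of AWQ and GPTQ don't fit in a few lines, but the primitive they all build on does: mapping floats onto an integer grid with a scale factor. A minimal absmax INT8 round-trip, purely illustrative:

```python
def quantize_int8(weights):
    """Absmax symmetric quantization: scale so the largest |w| maps to 127."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 1.27]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Every weight is recovered to within half a quantization step (scale / 2).
print(max(abs(a - b) for a, b in zip(w, w_hat)) <= s / 2)  # True
```

Real schemes differ mainly in how they choose and group the scales (per-channel, per-group) and which weights they protect from rounding error.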

Keep in mind: the kernel matters enormously — often more than the quantization algorithm itself. The same quantized weights served through different kernels can show a 2.6x throughput difference purely from GPU resource utilization (Section 10).


Part III — Inference Optimizations

With the fundamentals in place, these sections cover the software techniques that turn a working inference system into a fast one. Starting with CUDA kernels — the software units that actually execute on the GPU — each optimization targets a specific bottleneck from the previous part: FlashAttention exploits the SRAM-HBM gap, PagedAttention eliminates KV cache fragmentation, and continuous batching maximizes GPU utilization.

10. CUDA Kernels and Kernel Fusion

A CUDA kernel is a function written for the GPU that runs in parallel across thousands of threads. When the CPU calls a kernel, the GPU distributes the work across its SMs — each SM runs multiple warps of 32 threads, and each thread processes a slice of the data. Every operation in LLM inference — from matrix multiplication to token sampling — is ultimately a kernel launch. A single forward pass through a 70B model triggers hundreds to thousands of kernel launches, and the gap between a naive kernel and an optimized one can determine whether your system meets its latency SLO.

The main kernel categories in LLM serving are: GEMM kernels for matrix multiplication, which dominate both prefill and decode compute; attention kernels like FlashAttention that tile computations to stay in SRAM rather than spilling to HBM; fused kernels that combine multiple operations (e.g., add + layer norm, or QKV projection) into a single launch to eliminate intermediate HBM round-trips; and sampling kernels that convert logits into token IDs via top-k, top-p, or temperature sampling.

Kernel quality matters enormously — often more than the quantization algorithm itself. The same INT4-quantized weights served through Marlin (an optimized FP16xINT4 kernel) achieve 712 tokens/s versus vanilla GPTQ's 276 tokens/s — a 2.6x throughput difference purely from better GPU resource utilization. Marlin accomplishes this through asynchronous memory fetching and shared memory queues that keep Tensor Cores fed instead of waiting on HBM. Frameworks like Triton lower the barrier to writing custom kernels by exposing GPU programming through Python rather than raw CUDA C++, making kernel-level optimization accessible to ML engineers rather than just GPU specialists. Every optimization that follows in this part — FlashAttention, fused kernels, PagedAttention — is, at its core, a better kernel or a smarter way to orchestrate kernel launches.

Kernel fusion combines multiple sequential operations into a single GPU kernel, eliminating intermediate HBM writes. Common fusions include QKV projection (one matmul instead of three), attention + softmax (FlashAttention itself), add + RMSNorm (FlashNorm), and SwiGLU activation (DeepFusionKernel). Without fusion, a 70B model issues thousands of kernel launches per token with the GPU at ~30% utilization; with fusion, each layer collapses to 1-2 optimized kernels running at 80-90% utilization.

11. FlashAttention: Tiling Attention to Live in SRAM

Standard attention materializes the full \(N \times N\) attention matrix in HBM — \(O(N^2)\) memory and massive memory traffic. The core idea behind FlashAttention is to never materialize this matrix at all. Instead, it tiles the Q, K, V matrices into blocks that fit in SRAM, computes partial attention within each tile, and merges results using an online softmax trick (incrementally tracking the running max and sum across blocks). Memory drops from \(O(N^2)\) to \(O(N)\), and HBM reads drop by an order of magnitude.
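The online softmax trick is the heart of the algorithm. A one-query sketch in pure Python (real kernels tile Q as well and operate on matrices, not scalars):

```python
import math

def online_softmax_attention(score_blocks, value_blocks):
    """Attention for one query row, consuming scores/values block by block.

    Keeps only a running max (m), normalizer (l), and weighted sum (acc);
    the full score vector is never materialized.
    """
    m, l, acc = float("-inf"), 0.0, 0.0
    for scores, values in zip(score_blocks, value_blocks):
        new_m = max(m, max(scores))
        correction = math.exp(m - new_m)  # rescale old partials to the new max
        l = l * correction + sum(math.exp(s - new_m) for s in scores)
        acc = acc * correction + sum(math.exp(s - new_m) * v
                                     for s, v in zip(scores, values))
        m = new_m
    return acc / l

scores = [1.0, 3.0, 0.5, 2.0]
values = [10.0, 20.0, 30.0, 40.0]
blocked = online_softmax_attention([scores[:2], scores[2:]],
                                   [values[:2], values[2:]])
print(blocked)  # identical to softmax over the full, unsplit score vector
```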

Each version has targeted the bottleneck specific to its GPU generation:

  • FlashAttention v1 (A100, 2022) — proved the tiling + online softmax approach works. 2-4x speedup over standard attention, but only 25-40% GPU utilization because the kernel scheduling left many SMs idle.
  • FlashAttention v2 (A100, 2023) — reworked the parallelism to split across the sequence dimension rather than batch/heads, reaching 50-73% utilization on A100 — roughly 2x faster than v1.
  • FlashAttention v3 (H100 Hopper, 2024) — introduced warp specialization (separate warps for data movement vs. math) and GEMM-softmax pipelining to overlap memory loads with computation. Reaches 75-85% utilization on H100 and up to ~1.2 PFLOPS in FP8. Accepted as a NeurIPS 2024 spotlight paper.
  • FlashAttention v4 (B200 Blackwell, 2026) — addresses a new bottleneck: on Blackwell, tensor core throughput scales so fast that non-matmul operations (softmax exponentials, rescaling) become the limiter. FA4 software-emulates the exponential with polynomial approximations on FMA units, uses conditional rescaling to reduce overhead, and stores intermediates in Blackwell's dedicated tensor memory (TMEM) instead of registers. Result: 1,605 TFLOPS on B200 in BF16 — 1.3x faster than cuDNN 9.13 and 2.7x faster than Triton.

As you can see, each generation hit a different hardware wall, and each FlashAttention version was redesigned from the kernel up to address it.

12. FlashDecoding: Parallelizing the Decode Bottleneck

Standard FlashAttention keeps the GPU busy by splitting work across batch size and query length. However, during the decode phase, the model generates exactly 1 token at a time (query length = 1). If your batch size multiplied by the model's attention heads is less than the GPU's total processing units (108 SMs on an A100), the GPU is severely underutilized. Most of the GPU sits idle while a few units grind sequentially through the entire token history.

FlashDecoding solves this by adding a new parallelization dimension: the KV sequence length itself. It chops the massive KV cache into smaller chunks and distributes them across all the otherwise-idle GPU processors to evaluate in parallel. It then mathematically merges their partial computations using a "log-sum-exp reduction."

The result: up to an 8x end-to-end decode speedup on long sequences (64K context) and near-constant decode time. By spreading the growing history across the GPU's massive parallel resources, generating token 60,000 remains almost as fast as generating token 100.
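The merge can be sketched as a map-reduce: each chunk worker returns a local (max, normalizer, weighted sum) triple, and the reduction rescales everything to the global max (hypothetical helper names; real kernels do this per query head):

```python
import math

def chunk_partial(scores, values):
    """What each parallel worker computes over its slice of the KV cache."""
    m = max(scores)
    l = sum(math.exp(s - m) for s in scores)
    acc = sum(math.exp(s - m) * v for s, v in zip(scores, values))
    return m, l, acc

def merge(partials):
    """Log-sum-exp reduction: rescale each partial to the global max, then sum."""
    g = max(m for m, _, _ in partials)
    l = sum(li * math.exp(m - g) for m, li, _ in partials)
    acc = sum(a * math.exp(m - g) for m, _, a in partials)
    return acc / l

scores, values = [0.5, 2.5, 1.0, 3.0], [1.0, 2.0, 3.0, 4.0]
parts = [chunk_partial(scores[:2], values[:2]),
         chunk_partial(scores[2:], values[2:])]
print(merge(parts))  # equals attention computed over the unsplit sequence
```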

13. Continuous Batching vs Static Batching

Static batching waits for all sequences in a batch to finish before starting the next — short sequences waste GPU cycles idling after their end-of-sequence token. Continuous batching (introduced by the Orca paper, OSDI 2022) operates at iteration-level granularity: at each decode step, completed sequences are removed and new ones inserted.

The impact is dramatic. Anyscale benchmarks on OPT-13B show: optimized static batching achieves 4x over naive, continuous batching reaches 8x, and vLLM with continuous batching + PagedAttention hits 23x over naive (Anyscale, 2023). However, continuous batching amplifies KV cache fragmentation — more concurrent sequences means more scattered memory allocations — which is exactly why PagedAttention was necessary.

14. PagedAttention: Virtual Memory for KV Cache

vLLM's PagedAttention applies operating system virtual memory concepts to KV cache management. It divides the KV cache into fixed-size blocks (typically 16 tokens), allocates blocks on-demand as tokens are generated, and maps logical (sequential) positions to physical (scattered) memory locations via block tables. Multiple requests sharing a prefix (system prompts, beam search) can point to the same physical blocks.

Previous systems wasted 60-80% of KV cache memory to fragmentation and pre-allocation. PagedAttention reduces waste to <4%. This enables 2-4x throughput improvement at the same latency, and up to 24x over HuggingFace Transformers (vLLM Blog, 2023).
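The block-table indirection fits in a few lines: logical block indices per sequence map to physical blocks in a shared pool, and a shared system prompt occupies one physical block referenced by multiple sequences (toy sizes; vLLM's real block size is typically 16 tokens):

```python
BLOCK_SIZE = 4     # tokens per physical block (toy value)

pool = {}          # physical block id -> cached tokens (stand-in for KV tensors)
block_tables = {}  # sequence id -> list of physical block ids (the "page table")
next_block = 0

def append_token(seq, token):
    """Allocate physical blocks on demand as a sequence grows."""
    global next_block
    table = block_tables.setdefault(seq, [])
    if not table or len(pool[table[-1]]) == BLOCK_SIZE:
        pool[next_block] = []  # grab a fresh physical block
        table.append(next_block)
        next_block += 1
    pool[table[-1]].append(token)

def share_prefix(src, dst, n_blocks):
    """Prefix sharing: dst's table points at src's physical blocks."""
    block_tables[dst] = block_tables[src][:n_blocks]

for tok in ["You", "are", "a", "helpful", "assistant"]:
    append_token("seq0", tok)
share_prefix("seq0", "seq1", 1)  # seq1 physically reuses seq0's first block
print(block_tables["seq0"], block_tables["seq1"])  # [0, 1] [0]
```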

15. Speculative Decoding: Multiple Tokens Per Forward Pass

Speculative Decoding

A small draft model generates \(K\) candidate tokens, then the large target model verifies all \(K\) tokens in a single forward pass. Correct tokens are accepted; the first incorrect one is rejected. The key property: output quality is mathematically identical to the target model alone — this is a lossless speedup.

It works because LLM decode is memory-bandwidth-bound: verifying \(K\) tokens costs roughly the same as generating 1 (both load all model weights once). Typical speedups range from 1.5x to 3x, with advanced methods like EAGLE-3 reaching up to 6.5x. Variants include Medusa (extra prediction heads, no separate model), prompt lookup decoding (n-gram matching against the input, free), and EAGLE (feature-level extrapolation).

The critical caveat: at high concurrency, the extra compute for draft/verification can cause 1.4-1.8x slowdown. Speculative decoding helps most at low batch sizes and for tasks where draft acceptance rates are high (summarization, QA).
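A greedy-verification sketch shows the mechanic. The original papers use a rejection-sampling scheme that preserves the full sampling distribution; this simplified greedy variant, with stand-in draft_model and target_model callables, only illustrates the accept-longest-prefix loop:

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    """One round of draft-then-verify under greedy decoding.

    draft_model(seq) -> next token; target_model(seq) -> the target's greedy
    token at every position of seq (one batched forward pass).
    """
    # 1. Draft k candidate tokens autoregressively with the cheap model.
    draft = []
    for _ in range(k):
        draft.append(draft_model(prefix + draft))
    # 2. Verify all k in a single target forward pass over prefix + draft.
    target_preds = target_model(prefix + draft)
    accepted = []
    for i, tok in enumerate(draft):
        if target_preds[len(prefix) + i - 1] != tok:
            break  # first mismatch: reject this token and everything after
        accepted.append(tok)
    # 3. Always gain at least one token: the target's own prediction.
    bonus = target_preds[len(prefix) + len(accepted) - 1]
    return accepted + [bonus]

# Toy models over a fixed character sequence; the draft errs at position 6.
TEXT = list("the quick brown fox")
draft_model = lambda seq: "X" if len(seq) == 6 else TEXT[len(seq)]
target_model = lambda seq: [TEXT[min(i + 1, len(TEXT) - 1)] for i in range(len(seq))]
out = speculative_step(TEXT[:4], draft_model, target_model, k=4)
print("".join(out))  # "qui": 3 tokens gained from one target forward pass
```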

16. Prefix Caching and KV Cache Reuse

Instead of discarding KV cache after a request completes, prefix caching retains it for reuse when new requests share the same prefix tokens. This avoids redundant prefill computation for system prompts, few-shot examples, RAG context, and multi-turn conversation history.

vLLM's Automatic Prefix Caching hashes each KV block and uses a global hash table for lookup, achieving cache hit rates of 87%+ with well-structured prompts. SGLang's RadixAttention maintains a radix tree of all cached KV tensors, offering token-level granularity and automatic discovery of caching opportunities — delivering up to 5x throughput improvement.
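The hashing scheme can be sketched as chained block hashes: each block's key includes its predecessor's hash, so a lookup hit guarantees the entire prefix up to that block matches (illustrative; the real implementation hashes token IDs plus extra metadata and manages eviction):

```python
BLOCK = 4   # tokens per cached KV block (illustrative size)
cache = {}  # chained block hash -> stand-in for the block's KV tensors

def block_keys(tokens):
    """Chained hashes: each block's key commits to the entire prefix before it."""
    keys, prev = [], 0
    full = len(tokens) - len(tokens) % BLOCK  # only full blocks are cacheable
    for i in range(0, full, BLOCK):
        prev = hash((prev, tuple(tokens[i:i + BLOCK])))
        keys.append(prev)
    return keys

def prefill_with_cache(tokens):
    """Return how many prompt tokens were served from cache vs recomputed."""
    hits = 0
    for key in block_keys(tokens):
        if key in cache:
            hits += BLOCK
        else:
            cache[key] = "kv-block"
    return hits

system = ["<sys>"] * 8
print(prefill_with_cache(system + ["hi"] * 3))   # 0  (cold cache)
print(prefill_with_cache(system + ["bye"] * 5))  # 8  (system prompt blocks reused)
```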

17. Streaming in Practice

Streaming sends tokens to the client as generated rather than waiting for the complete response — the standard delivery mechanism for all interactive LLM applications. Most serving frameworks expose streaming via Server-Sent Events (SSE): the client opens a long-lived HTTP connection, and the server pushes each token (or small token batch) as a data: event. TTFT determines when the user first sees output; TPOT determines perceived smoothness. Target: TPOT < 25 ms for smooth streaming (~40 tokens/s).

On the client side, streaming introduces buffering decisions. Rendering token-by-token can cause visual jitter, especially with markdown or code blocks that need multi-token context to format correctly. Common patterns include word-level buffering (accumulate tokens until a whitespace boundary), line-level buffering (wait for a newline before rendering), and adaptive buffering (render immediately for prose, buffer for code blocks). The stream_options: {"include_usage": true} parameter in OpenAI-compatible APIs returns token counts in the final SSE event, enabling accurate cost tracking for streamed responses.
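Word-level buffering is only a few lines: accumulate streamed chunks and flush at whitespace boundaries (a generator sketch; a real client would wrap this around the SSE event loop):

```python
def buffer_words(token_stream):
    """Re-chunk a token stream at whitespace boundaries to reduce jitter."""
    buf = ""
    for tok in token_stream:
        buf += tok
        # Flush everything up to (and including) the last whitespace.
        cut = max(buf.rfind(" "), buf.rfind("\n")) + 1
        if cut:
            yield buf[:cut]
            buf = buf[cut:]
    if buf:
        yield buf  # trailing partial word at end of stream

tokens = ["Hel", "lo ", "wor", "ld, ", "stream", "ing!"]
print(list(buffer_words(tokens)))  # ['Hello ', 'world, ', 'streaming!']
```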

Chunked prefill is the mechanism that makes streaming work well under load — without it, a single long prefill can stall token delivery for all other concurrent users.


Part IV — Model Architecture

These concepts cover how LLMs are built — from the transformer block itself to tokenization, context handling, and the architectural variants that have won the scaling race. I've grouped them here because they underpin both inference and training: you need to understand the architecture before optimizing either side.

18. Transformer Architecture Essentials

A modern decoder-only transformer (GPT, Llama) is a stack of identical layers, each containing two sub-blocks: attention and feed-forward. Every sub-block is wrapped with a residual connection and normalization. Here are the key components:

Multi-Head Attention — The mechanism that lets each token "look at" every other token to decide what's relevant. The input is projected into three matrices — Queries (what am I looking for?), Keys (what do I contain?), and Values (what information do I carry?) — and attention scores are computed as:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

The \(QK^T\) dot product measures similarity between every pair of tokens. Division by \(\sqrt{d_k}\) prevents the dot products from growing too large (which would push softmax into regions with vanishing gradients). The softmax converts scores to probabilities, and multiplication by \(V\) produces a weighted combination of value vectors. Running this across multiple heads in parallel lets the model attend to different types of relationships simultaneously (e.g., one head for syntax, another for coreference).
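The equation in code, for a single head (pure Python for clarity; no causal mask, and real kernels batch this over heads using the fused paths from Part III):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head, unmasked."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))  # the query attends more to the first key/value pair
```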

Feed-Forward Network (FFN) — After attention determines which tokens are relevant, the FFN processes what to do with that information. Modern LLMs use SwiGLU instead of the original two-matrix ReLU FFN:

\[\text{SwiGLU}(x) = \text{Swish}(xW_1) \otimes xW_2\]

SwiGLU uses three weight matrices instead of two and a smooth Swish activation instead of ReLU; to compensate, the intermediate dimension is typically scaled down to \(\frac{2}{3}\) of the original, keeping total FFN parameter count roughly constant. The quality improvement was convincing enough that all major models (Llama, Mistral, Qwen, Gemma) have adopted it. The FFN typically accounts for ~2/3 of total model parameters.

Residual Connections — Each sub-block adds its output back to its input: \(\text{Output} = \text{Input} + \text{Sublayer}(\text{Input})\). Without this skip connection, gradients would vanish when backpropagating through 80-128 layers. Residual connections create a "gradient highway" that allows information to flow directly from early layers to late layers.

RMSNorm — Has replaced LayerNorm in all modern LLMs. LayerNorm normalizes by both re-centering (subtracting the mean) and re-scaling (dividing by standard deviation). RMSNorm skips the mean subtraction and only re-scales, which halves the learned parameters and is 7-64% faster with no quality loss. Pre-norm placement (normalizing before attention/FFN rather than after) is now standard — it provides more stable gradients without requiring careful learning rate warmup.

Parameter count estimation for a decoder-only model:

\[\text{Total} \approx V \times d + 12 \times L \times d^2\]

where \(V\) = vocabulary size, \(d\) = hidden dimension, \(L\) = layer count. The \(V \times d\) term is the embedding matrix; the \(12 \times d^2\) per layer covers attention projections (Q, K, V, output = \(4d^2\)) and the SwiGLU FFN (\(8d^2\) with the typical \(\frac{8}{3}d\) intermediate size). For Llama 3 8B (\(V\)=128,256, \(d\)=4,096, \(L\)=32): \(128{,}256 \times 4{,}096 + 12 \times 32 \times 4{,}096^2 \approx 0.5B + 6.4B \approx 7.0B\) parameters (actual: 8.03B — the untied output embedding, Llama's wider \(3.5d\) FFN intermediate, and GQA's smaller K/V projections account for the difference).
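The estimate is easy to script against any model's config values (the Llama 3 8B shapes below are from its public config):

```python
def approx_params(vocab, d_model, n_layers):
    """Decoder-only estimate: embeddings + 12 d^2 per layer (attn + SwiGLU FFN)."""
    embed = vocab * d_model
    per_layer = 12 * d_model ** 2  # 4d^2 attention + 8d^2 SwiGLU FFN
    return embed + n_layers * per_layer

total = approx_params(128_256, 4_096, 32)
print(f"{total / 1e9:.2f}B")  # ~6.97B, in the right ballpark of the actual 8.03B
```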

19. Why Decoder-Only Architectures Dominate

The original Transformer (2017) had both an encoder and a decoder. Since then, the field split into three architectural families — and one decisively won the scaling race for generative AI.

Encoder-only models (BERT, RoBERTa) use bidirectional attention — every token attends to every other token in both directions. This produces rich representations for understanding tasks (classification, NER, semantic similarity) but cannot generate text autoregressively. Encoder-only models still dominate as the backbone for embedding models, rerankers, and lightweight classifiers (e.g., the BERT-based routers in RouteLLM).

Encoder-decoder models (T5, BART, the original Transformer) separate understanding from generation. The encoder processes the full input with bidirectional attention, then the decoder generates output autoregressively while attending to the encoder's representations via cross-attention. This architecture had a natural advantage for sequence-to-sequence tasks like translation, where the input and output are fundamentally different sequences. Google's T5 showed that any NLP task could be framed as text-to-text, and encoder-decoder models still power some specialized systems (e.g., Whisper for speech recognition, FLAN-T5 for instruction following).

Decoder-only models (GPT, Llama, Mistral, Gemini) use causal (unidirectional) attention — each token attends only to previous tokens. They frame everything as next-token prediction: the "input" is just the beginning of the sequence, and the "output" is its continuation. This architecture won the scaling race for four reinforcing reasons:

  1. KV cache efficiency. In a decoder-only model, the KV cache from previous tokens remains valid as new tokens are generated — you never discard or recompute it. In encoder-decoder models, the decoder must maintain two separate attention caches: the self-attention cache (like decoder-only) plus the cross-attention cache over the encoder output. This adds memory overhead and architectural complexity for KV management.

  2. Training simplicity. The training objective is just next-token prediction on raw text — no need for paired input-output data (as translation models require) or masked-token reconstruction (as BERT requires). This means you can train on virtually any text from the internet, books, and code without any special preprocessing, which is a massive advantage when scaling data to trillions of tokens.

  3. Architectural simplicity. One module handles everything — the same transformer block, repeated \(L\) times. No encoder-decoder cross-attention layers, no separate encoder stack. This simplifies implementation, enables more straightforward parallelism strategies (Section 33), and reduces the engineering surface for optimization (FlashAttention, quantization, speculative decoding all only need to target one attention pattern).

  4. In-context learning. Decoder-only models naturally excel at few-shot learning because the examples, instructions, and query are all just tokens in the same sequence. The model doesn't distinguish between "input" and "output" — it simply predicts the next token given everything before it. This emergent capability, first demonstrated at scale by GPT-3, made decoder-only models uniquely suited for the general-purpose assistant paradigm.

20. Mixture of Experts

MoE replaces the dense FFN in each transformer layer with multiple smaller expert FFNs plus a lightweight gating router. The router computes a score for each expert — typically a softmax over learned linear projections — and selects the top-\(k\) experts per token. Only activated experts compute, so a model can have enormous total capacity while keeping per-token cost low. This is sparse conditional computation: total parameters determine what the model can represent; active parameters determine what it costs to run.
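The gating step can be sketched in a few lines of NumPy — a minimal illustration of top-\(k\) routing, not any particular framework's router (the `route` helper and its shapes are invented for this example):

```python
import numpy as np

def route(x, W_gate, k=2):
    """Top-k gating: score every expert, keep the k best, renormalize.

    x: (d,) token hidden state; W_gate: (n_experts, d) learned router weights.
    Only the returned experts run their FFN for this token.
    """
    logits = W_gate @ x                      # one score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax over experts
    top = np.argsort(probs)[-k:][::-1]       # indices of the k highest scores
    weights = probs[top] / probs[top].sum()  # renormalize over the selected experts
    return top, weights

rng = np.random.default_rng(0)
ids, w = route(rng.normal(size=64), rng.normal(size=(8, 64)), k=2)
assert len(ids) == 2 and np.isclose(w.sum(), 1.0)   # 2 of 8 experts activated
```

The renormalization over the selected experts is the common convention (used by Mixtral); some models instead keep the unnormalized softmax weights.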

| Model | Total Params | Active Params | Experts (Routed + Shared) | Top-\(k\) |
|---|---|---|---|---|
| Mixtral 8x7B | 47B | ~13B | 8 + 0 | 2 |
| DeepSeek-V3 | 671B | 37B | 256 + 1 | 8 |

The shared expert in DeepSeek-V3 is always activated for every token, providing a baseline representation that routed experts can specialize on top of.

Training challenges include load balancing (tokens clustering on a few popular experts, leaving others undertrained), expert collapse (experts converging to identical representations, negating the benefit of having multiple experts), and communication overhead for Expert Parallelism. Traditional MoE models add an auxiliary loss to penalize imbalanced routing, but this loss directly competes with the primary training objective, degrading quality. DeepSeek-V3 solved this with auxiliary-loss-free bias terms: each expert has a bias added to its gating score, dynamically adjusted outside of backpropagation — decreased for overloaded experts, increased for underloaded ones. This achieves balanced routing without any quality tradeoff, a significant innovation over prior approaches.

21. Tokenization: BPE, SentencePiece, and tiktoken

LLMs don't see text — they see sequences of integer token IDs. A tokenizer splits raw text into tokens (subword pieces) and maps each to an ID. The choice of tokenizer affects model quality, inference speed, and multilingual fairness.

Byte Pair Encoding (BPE) is the dominant algorithm. It works by iteratively merging the most frequent adjacent pairs in the training corpus. A simplified example:

  1. Start with character-level vocabulary: [l, o, w, e, r, _]
  2. Most frequent pair is (l, o) → merge into lo → vocabulary: [l, o, w, e, r, _, lo]
  3. Next most frequent pair is (lo, w) → merge into low → vocabulary adds low
  4. Continue until the vocabulary reaches the target size (e.g., 128K tokens)

Common words like "the" become single tokens, while rare words like "defenestration" get split into subword pieces like ["def", "en", "est", "ration"]. This balances vocabulary size against sequence length.
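The merge loop above can be sketched as a toy implementation — real tokenizers operate on pre-counted word frequencies and far larger corpora, and `bpe_merge_step` is a name invented here:

```python
from collections import Counter

def bpe_merge_step(corpus):
    """One BPE iteration: find the most frequent adjacent pair and merge it.

    corpus: list of token sequences, e.g. [["l","o","w"], ["l","o","w","e","r"]].
    Returns (merged_pair, new_corpus).
    """
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair (first on ties)
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1])   # fuse the pair into one token
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return best, merged

corpus = [list("low"), list("lower"), list("low"), list("newest")]
pair, corpus = bpe_merge_step(corpus)   # ("l","o") occurs 3x -> merged first
```

Running this repeatedly, recording each merge, yields exactly the vocabulary-growing procedure in the numbered list above.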

Three major tokenizer implementations dominate:

  • SentencePiece — Treats input as a raw stream of Unicode characters with no language-specific preprocessing (no pre-tokenization by spaces or punctuation). This makes it truly language-agnostic, which is critical for non-Latin scripts. Supports both BPE and unigram algorithms. Used by Llama 1/2, T5, and Mistral.
  • tiktoken — OpenAI's Rust-based tokenizer using byte-level BPE. Its compiled Rust core makes it 3-6x faster than Python-based alternatives. Llama 3 switched from SentencePiece to tiktoken's algorithm.
  • HuggingFace Tokenizers — Rust-based library supporting BPE, WordPiece, and Unigram. The de facto standard for open-source model distribution.

Vocabulary sizes have grown steadily, with major implications for efficiency:

| Model | Vocab Size | English Fertility | Why It Matters |
|---|---|---|---|
| GPT-2 | 50,257 | ~1.3 tokens/word | Original BPE baseline |
| Llama 2 | 32,000 | ~1.4 tokens/word | Smaller vocab, longer sequences |
| GPT-4 | 100,256 | ~1.1 tokens/word | Better compression, fewer tokens per request |
| Llama 3 | 128,256 | ~1.0 tokens/word | 4x larger than Llama 2, major multilingual gain |
| GPT-4o | 200,000 | ~1.0 tokens/word | Largest production vocabulary |

Fertility (tokens per word) measures compression efficiency. Lower is better — fewer tokens means shorter sequences, lower cost, and more content within the context window. English typically achieves ~1.0–1.3 tokens/word, but non-Latin scripts (Chinese, Japanese, Korean, Arabic) can be 2-4x higher with English-centric vocabularies. This means the same content costs 2-4x more in tokens for non-English users — a persistent equity issue that larger, more balanced vocabularies partially address.

22. Context Windows and Positional Encodings

The context window is the maximum number of tokens a model can process in a single forward pass. It has grown dramatically:

| Model | Context Window | Year |
|---|---|---|
| Original Transformer | 512 | 2017 |
| GPT-4 Turbo | 128K | 2023 |
| Claude 3.5 | 200K | 2024 |
| Gemini 1.5 Pro | 1M+ | 2024 |
| Grok 4 Fast | 2M | 2025 |

But there's a fundamental problem: the attention mechanism treats its input as a set, not a sequence — it has no built-in notion of word order. Without positional information, "the cat sat on the mat" and "the mat sat on the cat" would produce identical representations. Positional encodings inject order information so the model knows where each token sits.

Three approaches dominate:

  • RoPE (Rotary Position Embeddings) — Encodes each token's position by rotating its query and key vectors by an angle proportional to the position. Two tokens close together get similar rotations, making their dot product (attention score) higher. Tokens far apart get very different rotations, naturally encoding relative distance. RoPE is the standard for nearly all modern open LLMs (Llama, Mistral, Qwen) because it handles relative positions gracefully and is computationally cheap.

  • ALiBi (Attention with Linear Biases) — Instead of modifying embeddings, ALiBi adds a simple penalty directly to attention scores: the farther apart two tokens are, the larger the negative bias. No learned parameters, no extra computation. It allows some extrapolation beyond training length but degrades noticeably at 2x+ the training context.

  • YaRN (Yet another RoPE extensioN) — The state-of-the-art method for stretching a model's context window beyond what it was trained on. It groups RoPE's frequency dimensions into three categories — high-frequency (don't scale), low-frequency (scale linearly), and mid-frequency (interpolate) — applying different scaling to each. This allows extending context with 10x fewer fine-tuning tokens and 2.5x fewer training steps than naive position interpolation.
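RoPE's defining property — attention scores that depend only on the relative distance between tokens — can be verified numerically with a small NumPy sketch (a simplified single-vector version; the `rope` helper is illustrative, not a production implementation):

```python
import numpy as np

def rope(vec, pos, base=10000.0):
    """Rotate each (even, odd) dimension pair of `vec` by pos * theta_i,
    where theta_i decays with the dimension index, as in the RoPE paper."""
    d = vec.shape[0]
    theta = base ** (-np.arange(d // 2) * 2.0 / d)   # per-pair frequency
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x, y = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = x * cos - y * sin                    # 2-D rotation per pair
    out[1::2] = x * sin + y * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# The attention score depends only on the relative offset (3 here),
# not on the absolute positions:
s1 = rope(q, 10) @ rope(k, 7)
s2 = rope(q, 110) @ rope(k, 107)
assert np.isclose(s1, s2)
```

Shifting both positions by 100 leaves the dot product unchanged — which is exactly why RoPE generalizes gracefully across sequence positions.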


Part V — Training and Alignment

Training is where LLM capabilities are built. The concepts here cover how models learn from data, how to fine-tune them efficiently with techniques like LoRA and mixed precision, and the scaling laws and alignment methods that guide resource allocation.

23. Pretraining, Fine-Tuning, and Alignment

Training Pipeline

Pretraining is self-supervised next-token prediction on trillions of tokens. It builds general language understanding and costs $500K-$100M+ (Llama 3 405B consumed \(3.8 \times 10^{25}\) FLOPs). Supervised fine-tuning (SFT) adapts the pretrained model to specific tasks using labeled data, costing hundreds to low thousands of dollars on a single GPU. RLHF (Reinforcement Learning from Human Feedback) / RLAIF (Reinforcement Learning from AI Feedback) teaches subjective qualities like helpfulness through preference learning — collect human comparisons, train a reward model, then fine-tune the policy with RL (typically PPO); DPO skips the reward model and optimizes preferences directly (Section 30). RLAIF replaces human annotators with AI feedback (Constitutional AI).

Compute spans roughly eight orders of magnitude: pretraining at \(10^{24}\)-\(10^{26}\) FLOPs, SFT at \(10^{18}\)-\(10^{21}\), RLHF similar to SFT but requiring 4 model copies in memory for PPO. I covered the full fine-tuning decision framework — when to fine-tune vs. use RAG vs. prompt engineering — in The Complete Guide to LLM Fine-Tuning.

24. LoRA and QLoRA: Parameter-Efficient Fine-Tuning

LoRA freezes pretrained weights and injects trainable low-rank matrices \(A\) (\(r \times k\)) and \(B\) (\(d \times r\)) such that the updated weight is \(W_0 + BA\). The key insight: weight updates during fine-tuning have low intrinsic rank. This reduces trainable parameters by 10,000x (GPT-3 175B: from 175B to ~18M) and GPU memory by ~3x. Typical ranks: \(r\)=8-16 for simple tasks, \(r\)=64-128 for complex scenarios. LoRA adapters can be merged post-training for zero inference overhead.
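The update can be sketched in NumPy for a single square projection — an illustrative sketch only (`lora_forward` is a made-up helper), but it shows both the forward pass and the parameter-count arithmetic:

```python
import numpy as np

def lora_forward(x, W0, A, B):
    """y = (W0 + B A) x, computed without materializing the merged weight."""
    return W0 @ x + B @ (A @ x)

d, k, r = 4096, 4096, 16
rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, k))          # frozen pretrained weight
A = 0.01 * rng.normal(size=(r, k))    # trainable, small random init
B = np.zeros((d, r))                  # trainable, zero init -> BA = 0 at start

x = rng.normal(size=k)
# Before training, the adapter is a no-op: output equals the base model's.
assert np.allclose(lora_forward(x, W0, A, B), W0 @ x)

frozen, trainable = d * k, r * (d + k)   # 16.8M frozen vs ~131K trainable (~0.8%)
```

The zero initialization of \(B\) is the standard LoRA trick: training starts exactly at the pretrained model and the adapter only gradually deviates from it.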

QLoRA loads the base model in 4-bit NF4 quantization while training LoRA adapters in BF16. NormalFloat4 places more quantization levels near zero where weight density is highest. QLoRA enables fine-tuning a 65B model on a single 48GB GPU with performance barely distinguishable from 16-bit full fine-tuning. The tradeoff: 39% increased training time for 33% GPU memory savings versus standard LoRA.

25. Mixed Precision Training

Every floating-point format allocates its bits across three fields — sign (always 1 bit), exponent (sets the dynamic range), and mantissa (sets the precision). More exponent bits mean a wider range of representable magnitudes; more mantissa bits mean finer distinctions between nearby values. Integer formats have no exponent at all — they represent only evenly spaced whole numbers within a fixed range.

| Format | Bits | Layout (S / E / M) | Range | Precision | Best For |
|---|---|---|---|---|---|
| FP32 | 32 | 1 / 8 / 23 | \(\pm 3.4 \times 10^{38}\) | ~7 decimal digits | Master weights, optimizer states (Adam momentum & variance) |
| BF16 | 16 | 1 / 8 / 7 | \(\pm 3.4 \times 10^{38}\) | ~2 decimal digits | Preferred training format — same range as FP32, no loss scaling needed |
| FP16 | 16 | 1 / 5 / 10 | \(\pm 65{,}504\) | ~3 decimal digits | Training with loss scaling (older GPUs); inference on pre-Hopper hardware |
| FP8 E4M3 | 8 | 1 / 4 / 3 | \(\pm 448\) | ~1 decimal digit | Forward pass on Hopper (H100) — more precision for weights & activations |
| FP8 E5M2 | 8 | 1 / 5 / 2 | \(\pm 57{,}344\) | ~0.6 decimal digits | Backward pass on Hopper — wider range for gradients |
| INT8 | 8 | fixed-point | \(-128\) to \(127\) | Exact integers | Post-training weight quantization for inference (W8A8); KV cache quantization |
| INT4 | 4 | fixed-point | \(-8\) to \(7\) | Exact integers | Aggressive weight-only quantization (AWQ, GPTQ) for inference on memory-constrained hardware |

BF16 has the same range as FP32 because range is determined by the exponent field, and BF16 keeps all 8 exponent bits from FP32. It sacrifices mantissa bits instead (7 vs 23), trading precision for a 2x memory reduction while eliminating the overflow/underflow problems that plague FP16 training. FP16 has only 5 exponent bits, capping its range at ~65K at the top and ~\(6 \times 10^{-5}\) for the smallest normal values — small gradients routinely underflow to zero, requiring loss scaling (multiplying the loss by a large constant before backprop, then dividing gradients afterward) to keep them representable. BF16 makes loss scaling unnecessary.
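NumPy's float16 makes these limits concrete (NumPy has no native bfloat16, so only the FP16 side is shown):

```python
import numpy as np

# FP16's range tops out at 65,504: a large value overflows to infinity...
assert np.isinf(np.float16(1e5))

# ...while BF16's 8 exponent bits would represent it fine. The practical
# training problem is the other end: tiny gradients vanish in FP16.
tiny_grad = 1e-8
assert np.float16(tiny_grad) == 0.0        # underflows to zero

# Loss scaling: multiply the loss (hence all gradients) by a constant,
# accumulate in FP16, then divide back in FP32 before the optimizer step.
scale = 1024.0
scaled = np.float16(tiny_grad * scale)     # now representable
recovered = np.float32(scaled) / scale
assert recovered != 0.0
```

Frameworks automate this (e.g., dynamic loss scaling in torch.cuda.amp), growing the scale until gradients start to overflow and backing off when they do.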

Integer formats are absent from training because integer quantization lacks dynamic range entirely — it cannot represent the wide spread of gradient magnitudes during backpropagation. However, INT8 and INT4 are highly effective for inference, where weights are frozen and can be mapped to a fixed range. INT4 quantization (via AWQ or GPTQ) cuts a 7B model from ~14 GB to ~3.5 GB, enabling deployment on consumer GPUs with only minor quality loss.

FP8 training on H100 via Transformer Engine uses E4M3 for forward (more mantissa bits → better precision for activations) and E5M2 for backward (more exponent bits → wider range for gradients), delivering up to 75% faster wall-clock time for 175B models. DeepSeek-V3 trained entirely with FP8 mixed precision at ~$5.6M — the marginal compute cost for the final training run, excluding R&D and infrastructure.

26. Gradient Checkpointing

During the forward pass, every layer produces an intermediate output called an activation:

\[\text{Input} \rightarrow [\text{Layer 1}] \rightarrow \text{activation}_1 \rightarrow [\text{Layer 2}] \rightarrow \text{activation}_2 \rightarrow [\text{Layer 3}] \rightarrow \text{output}\]

Normally, all activations must be kept in memory because backpropagation needs them to compute gradients. For a deep transformer, these stored activations can consume more memory than the model weights themselves.

Gradient checkpointing trades compute for memory by discarding most of these activations and recomputing them on-the-fly during backprop. The standard strategy (Chen et al., 2016) divides a network of \(n\) layers into \(\sqrt{n}\) evenly-spaced segments and saves only the boundary activation of each segment — these saved boundaries are the "checkpoints." All intermediate activations within a segment are dropped immediately. When the backward pass reaches a layer inside a segment, its activations are recomputed from the nearest checkpoint. This reduces activation memory from \(O(n)\) to \(O(\sqrt{n})\) — in practice a ~60-70% reduction — at a cost of roughly one extra forward pass (~20-33% more compute). FlashAttention inherently applies this principle to attention by not materializing the full attention matrix. Enable it trivially in HuggingFace with gradient_checkpointing=True.
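The \(\sqrt{n}\) strategy reduces to simple arithmetic; here is a back-of-envelope scheduler for illustration (`checkpoint_plan` is a made-up helper, not the HuggingFace API):

```python
import math

def checkpoint_plan(n_layers):
    """sqrt(n) checkpointing (Chen et al., 2016): pick segment boundaries.

    Returns (checkpoints, peak_stored): checkpoints are the saved boundary
    layers; peak_stored counts activations held at once -- the checkpoints
    plus one segment being recomputed -- O(sqrt(n)) instead of O(n).
    """
    seg = max(1, round(math.sqrt(n_layers)))      # segment length ~ sqrt(n)
    checkpoints = list(range(0, n_layers, seg))   # saved boundary activations
    peak = len(checkpoints) + seg                 # checkpoints + one live segment
    return checkpoints, peak

ckpts, peak = checkpoint_plan(48)   # e.g. a 48-layer transformer
# 7 checkpoints + a 7-layer segment in flight: ~14 activations stored vs 48
```

For the 48-layer example this cuts peak activation storage by roughly 3.5x, consistent with the ~60-70% reduction quoted above.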

27. DeepSpeed ZeRO Stages

In standard data parallelism, every GPU holds a complete copy of the model weights, gradients, and optimizer states. For Adam, each parameter needs: 2 bytes for the FP16 weight + 4 bytes for the FP32 master weight + 4 bytes for momentum + 4 bytes for variance + 2 bytes for the gradient = 16 bytes per parameter. A 7.5B-parameter model therefore needs ~120 GB per GPU — and every GPU stores the exact same thing. With 64 GPUs, that's 64 identical 120 GB copies, a massive waste.

DeepSpeed ZeRO (Zero Redundancy Optimizer) eliminates this duplication by partitioning (sharding) these components across GPUs instead of replicating them:

  • Stage 1 — Partition optimizer states. Each GPU stores only 1/N of the optimizer states (Adam's momentum and variance, 8 bytes/param). When a GPU needs to update a weight, it updates only its assigned slice and broadcasts the result. Memory drops from ~120 GB to ~31 GB per GPU.
  • Stage 2 — Also partition gradients. Gradients (2 bytes/param) are no longer all-reduced to every GPU. Instead, each GPU receives only the gradient slice it needs via reduce-scatter. Memory drops to ~16 GB per GPU.
  • Stage 3 — Also partition the model weights themselves. Each GPU holds only 1/N of the FP16 weights. Before each layer's forward or backward pass, the GPU calls all-gather to temporarily reconstruct the full layer weights from all other GPUs, computes, and discards the gathered weights. Memory drops to ~1.9 GB per GPU.

| Config | Optimizer States | Gradients | Weights | Per-GPU Memory (7.5B) |
|---|---|---|---|---|
| No ZeRO | Replicated | Replicated | Replicated | ~120 GB |
| Stage 1 | Partitioned | Replicated | Replicated | ~31 GB |
| Stage 2 | Partitioned | Partitioned | Replicated | ~16 GB |
| Stage 3 | Partitioned | Partitioned | Partitioned | ~1.9 GB |
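The table's numbers fall out of simple arithmetic — a small sketch assuming the 16-bytes-per-parameter Adam breakdown above and 64 GPUs:

```python
def zero_memory_gb(params, n_gpus, stage):
    """Per-GPU training memory for mixed-precision Adam, in GB.

    Per parameter: 2 bytes FP16 weight + 2 bytes FP16 gradient
    + 12 bytes optimizer state (FP32 master + momentum + variance).
    Each ZeRO stage shards one more component across n_gpus.
    """
    weights, grads, opt = 2.0, 2.0, 12.0
    if stage >= 1:
        opt /= n_gpus        # Stage 1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus      # Stage 2: also shard gradients
    if stage >= 3:
        weights /= n_gpus    # Stage 3: also shard the weights
    return params * (weights + grads + opt) / 1e9

for s in range(4):
    print(f"Stage {s}: {zero_memory_gb(7.5e9, 64, s):.1f} GB")
# Stage 0: 120.0 GB / Stage 1: 31.4 GB / Stage 2: 16.6 GB / Stage 3: 1.9 GB
```

Note that Stages 1 and 2 leave a floor of 4 bytes/param (the replicated FP16 weights and gradients) — only Stage 3 makes per-GPU memory shrink linearly with cluster size.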

The tradeoff is communication: Stage 1 adds minimal overhead, Stage 2 replaces all-reduce with reduce-scatter (similar cost), but Stage 3 requires all-gather calls before every layer in both forward and backward passes — roughly 1.5x communication volume versus standard data parallelism.

ZeRO-Infinity extends Stage 3 further by offloading partitioned states to CPU RAM and even NVMe SSDs, enabling training of models with trillions of parameters on limited GPU clusters. The cost is significant speed reduction — NVMe is ~500x slower than HBM — so ZeRO-Infinity is used when a model simply cannot fit in GPU + CPU memory otherwise.

28. FSDP: PyTorch-Native Sharding

Fully Sharded Data Parallel (FSDP) is PyTorch's built-in answer to DeepSpeed ZeRO-3 — it shards parameters, gradients, and optimizer states across GPUs using the same core idea. The mechanics work in a simple loop for each layer:

  1. All-gather the full parameters from all GPUs (temporarily reconstruct the complete layer).
  2. Compute the forward or backward pass for that layer.
  3. Free the gathered parameters immediately — each GPU keeps only its own shard.
  4. Reduce-scatter gradients so each GPU gets only its assigned gradient slice.

Because FSDP is native to PyTorch, it avoids the overhead of bridging between frameworks. This integration advantage makes FSDP up to 5x faster per iteration than DeepSpeed ZeRO-3 for mid-range models in multi-node setups, though results are workload-dependent. FSDP also works seamlessly with PyTorch's debugging tools, profilers, and torch.compile.

| Criteria | FSDP (PyTorch) | DeepSpeed ZeRO |
|---|---|---|
| Best for | Models up to ~70B, PyTorch-native workflows | Massive models (70B+), extreme memory constraints |
| Offloading | CPU offloading | CPU + NVMe offloading (ZeRO-Infinity) |
| ZeRO stages | Stage 3 only (full sharding) | Stages 1, 2, 3 (granular control) |
| Framework integration | Native PyTorch, torch.compile support | Separate library, own config system |
| Ecosystem | PyTorch-native | Larger feature set, more knobs to tune |

FSDP2 (2024-2025) is a major rewrite that improves torch.compile integration for better kernel fusion, adds FP8 training support via TorchAO, and simplifies the API. Both FSDP and DeepSpeed are accessible through HuggingFace Accelerate, which lets you switch between them with a single config change.

29. Scaling Laws and the Chinchilla Trap

Chinchilla scaling (DeepMind, 2022) showed that compute-optimal training uses ~20 tokens per parameter — meaning a 70B model should be trained on ~1.4T tokens. But following Chinchilla exactly produces models too large for efficient inference — the Chinchilla trap. A Chinchilla-optimal 70B model achieves great loss during training, but every single inference request must load and run all 70B parameters. If a smaller model trained on more data can reach similar quality, it will be dramatically cheaper to serve across the billions of requests it handles in production.

The solution: overtrain smaller models on vastly more data. The progression is striking:

| Model | Params | Training Tokens | Tokens/Param | Chinchilla × |
|---|---|---|---|---|
| Chinchilla | 70B | 1.4T | 20:1 | 1× |
| Llama 1 | 65B | 1.4T | 22:1 | 1.1× |
| Llama 2 | 70B | 2.0T | 29:1 | 1.4× |
| Llama 3 8B | 8B | 15T | 1,875:1 | 94× |
| Qwen3-0.6B | 0.6B | 36T | 60,000:1 | 3,000× |

When accounting for the lifetime cost of inference across billions of requests, training smaller and longer is overwhelmingly optimal. A model like Llama 3 8B costs more in training compute than Chinchilla would prescribe, but it costs a fraction to deploy — and inference cost dominates total spend. "Chinchilla-optimal" now means "training compute-optimal," which is a fundamentally different objective from "inference-aware optimal."
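The overtraining ratios reduce to one line of arithmetic (`chinchilla_factor` is an illustrative helper):

```python
def chinchilla_factor(params_b, tokens_t):
    """How far past compute-optimal a training run goes.

    params_b: parameters in billions; tokens_t: training tokens in trillions.
    ~20 tokens/parameter is Chinchilla-optimal, so 1.0 means compute-optimal
    and larger values mean overtraining for cheaper inference.
    """
    tokens_per_param = tokens_t * 1e12 / (params_b * 1e9)
    return tokens_per_param / 20.0

assert abs(chinchilla_factor(70, 1.4) - 1.0) < 1e-6     # Chinchilla itself
assert abs(chinchilla_factor(8, 15) - 93.75) < 1e-6     # Llama 3 8B, ~94x
assert abs(chinchilla_factor(0.6, 36) - 3000.0) < 1e-6  # Qwen3-0.6B
```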

30. RLHF, DPO, GRPO, and the Alignment Landscape

Alignment is the process of steering a pretrained model to follow instructions, refuse harmful requests, and produce truthful responses — bridging the gap between "can predict the next token" and "is actually useful and safe." A pretrained base model will happily generate toxic content, hallucinate confidently, or ignore your instructions entirely. The methods below are the main approaches for closing that gap, each trading off complexity, data requirements, and training stability.

The classic RLHF pipeline: SFT → collect human preference pairs → train a reward model on those pairs → fine-tune the policy with PPO (Proximal Policy Optimization). PPO requires 4 model copies in memory simultaneously (policy, reference, critic/value model, reward model), is hyperparameter-sensitive, and prone to reward hacking — the model learns to exploit quirks in the reward model (e.g., producing verbose, confident-sounding answers) rather than genuinely improving quality.

DPO (Direct Preference Optimization) eliminates the reward model and RL loop entirely, directly optimizing a contrastive loss on preference pairs. It reduces the pipeline to a single supervised training step, making it far simpler and more stable. The tradeoff: DPO is offline — it trains only on a fixed dataset of pre-collected preferences. The model never generates new responses during training, so it cannot explore behaviors beyond its initial distribution. This limits its effectiveness for tasks like reasoning where the model needs to discover novel solution strategies.

GRPO (Group Relative Policy Optimization, DeepSeek) combines the best of both worlds. It eliminates PPO's critic/value model by generating multiple completions per prompt and using the group's average reward as the baseline — storing only 2 LLM copies versus PPO's 4. Unlike DPO, GRPO is on-policy: the model generates fresh responses during training, enabling exploration. GRPO powered DeepSeek-R1's breakthrough in reasoning via RLVR (Reinforcement Learning from Verifiable Rewards) — replacing the learned reward model with rule-based verification (math correctness, code compilation, unit tests). Because verifiable rewards cannot be hacked, RLVR sidesteps reward hacking entirely.
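The group-relative baseline is simple to sketch — illustrative only, since real GRPO additionally applies a clipped policy-gradient loss and a KL penalty on top of these advantages:

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each completion's reward vs. the group's
    own mean, normalized by the group's std -- no learned critic needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# 4 completions for one prompt, scored by a rule-based verifier (1 = correct):
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Correct completions get positive advantage, incorrect ones negative,
# and the group's advantages always sum to ~zero.
```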

| Method | Models in Memory | Reward Signal | Online/Offline | Key Limitation |
|---|---|---|---|---|
| PPO | 4 (policy, ref, critic, reward) | Learned reward model | Online | Reward hacking, complex tuning |
| DPO | 2 (policy, reference) | Implicit (preference pairs) | Offline | No exploration, fixed data |
| GRPO | 2 (policy, reference) | Explicit (verifiable or learned) | Online | Needs verifiable rewards for full benefit |

31. Distillation: Compressing Knowledge Across Models

Knowledge distillation transfers capabilities from a large teacher to a smaller student. Two approaches exist: logit-based distillation trains the student to match the teacher's full output probability distribution (the "soft labels" that capture uncertainty and relationships between tokens), while data-based distillation has the teacher generate synthetic training data that the student fine-tunes on. For LLMs, data-based distillation dominates because it works across different architectures and tokenizers, requires only API access to the teacher (no internal weights needed), and scales to arbitrary amounts of training data.

DeepSeek-R1 generated 800,000 chain-of-thought reasoning examples and used them to distill Qwen2.5 and Llama 3 models ranging from 1.5B to 70B parameters. The results redefine what small models can achieve:

  • DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024 and 94.3% on MATH-500, outperforming OpenAI o1-mini
  • DeepSeek-R1-Distill-Qwen-7B scores 55.5% on AIME 2024, surpassing QwQ-32B-Preview — a purpose-built 32B reasoning model — with a model 4.5x smaller

Crucially, distillation proved superior to applying RL directly to smaller models. When DeepSeek applied GRPO to smaller base models without distillation, the results were significantly worse. The teacher's chain-of-thought examples effectively transfer reasoning patterns — how to decompose problems, when to backtrack, when to verify — that RL alone struggles to discover from scratch in smaller models.

32. Synthetic Data Generation

Using LLMs to generate training data has become standard practice. The key techniques form a progression of increasing sophistication:

  • Self-Instruct bootstraps from a small seed set of human-written instructions — the LLM generates new instructions, inputs, and outputs, which are filtered and added back to the pool. This powered the Alpaca dataset (52K examples from 175 seeds) and proved that a $600 fine-tune could approximate GPT-3.5 quality.
  • Evol-Instruct (WizardLM) takes existing instructions and iteratively evolves them along complexity axes — adding constraints, deepening reasoning, making problems more concrete — producing progressively harder training examples.
  • Microsoft's Phi-4 (14B) pushed this further by using synthetic data as the majority of pretraining data, employing multi-agent prompting (multiple LLMs collaborating to generate and critique solutions), self-revision workflows, and instruction reversal. Its STEM and coding performance exceeds models several times its size.

The critical risk is model collapse: when models are recursively trained on synthetic data from previous model generations, the tails of the original data distribution progressively vanish. The model overestimates common patterns and loses rare but important variations, causing consistent decrease in lexical, syntactic, and semantic diversity (Shumailov et al., 2024). This is compounded by web contamination — by April 2025, 74.2% of newly created webpages in their sample of 900K pages contained AI-generated text (Ahrefs Study, 2025), making it increasingly difficult to source clean human-generated pretraining data. Mitigation requires blending synthetic with real data (never training on synthetic alone), rigorous quality filtering, and data lineage tracking to prevent recursive contamination.


Part VI — Scaling and Deployment

Scaling from a single GPU to a cluster requires splitting work across devices. These sections cover the parallelism strategies, the frameworks that implement them, GPU selection, and routing techniques I've found most relevant for cost-effective deployment at scale.

33. Four Flavors of Parallelism

Parallelism Strategies

Tensor Parallelism (TP) splits individual weight matrices across GPUs, requiring all-reduce after each layer. It demands NVLink bandwidth (900 GB/s on H100) and works best within a single node. TP=2 or TP=4 is typical; going higher introduces diminishing returns from communication overhead. For inference, TP reduces per-request latency by splitting compute.

Pipeline Parallelism (PP) splits layers sequentially across GPUs, passing activations between stages. Lower bandwidth requirements make it suitable across nodes via InfiniBand, but it introduces pipeline bubbles (GPU idle time). A common pattern: Llama-405B uses TP=8 within nodes and PP=2 across nodes.

Data Parallelism (DP) replicates the full model on each GPU. For inference, it is the most cost-effective scaling strategy — each replica handles independent requests with zero inter-GPU communication. For training, it is the primary scaling axis, combined with ZeRO to shard optimizer states.

Expert Parallelism (EP) distributes MoE experts across GPUs using all-to-all communication for token routing. DeepSeek-V3 (671B total, 37B active, 256 experts) typically deploys with EP=8 per node. The all-to-all communication accounts for ~47% of forward-pass latency even on NVLink, making it the key MoE bottleneck.

The Parallelism Decision Framework:

  • Model fits on 1 GPU: Use DP only.
  • Model fits on 1 node: Use TP within the node + DP across nodes.
  • Model exceeds 1 node: Use TP + PP + DP.
  • Mixture of Experts (MoE): Add EP to any of the above.
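The framework above can be expressed as a toy decision function (the thresholds, FP16 memory estimates, and helper name are illustrative, not a production planner):

```python
def parallelism_plan(model_mem_gb, gpu_mem_gb=80, gpus_per_node=8, is_moe=False):
    """Map model size to a parallelism strategy list -- a rough heuristic."""
    plan = ["DP"]                    # data parallelism is always the base
    if model_mem_gb > gpu_mem_gb:
        plan.insert(0, "TP")         # doesn't fit on 1 GPU: shard within the node
    if model_mem_gb > gpu_mem_gb * gpus_per_node:
        plan.insert(1, "PP")         # doesn't fit on 1 node: pipeline across nodes
    if is_moe:
        plan.append("EP")            # MoE: add expert parallelism
    return plan

assert parallelism_plan(14) == ["DP"]                        # 7B in FP16
assert parallelism_plan(140) == ["TP", "DP"]                 # 70B in FP16
assert parallelism_plan(810) == ["TP", "PP", "DP"]           # 405B in FP16
assert parallelism_plan(1342, is_moe=True) == ["TP", "PP", "DP", "EP"]
```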

Parallelism Decision Framework

34. Serving Frameworks Compared

vLLM is the most widely adopted open-source framework, built on PagedAttention and continuous batching. It offers the broadest model support, an OpenAI-compatible API, and supports TP/PP/DP/EP. It is a PyTorch Foundation project with the largest community.

SGLang matches or exceeds vLLM in many scenarios (up to 3.1x throughput on Llama-70B) through RadixAttention for prefix reuse, a zero-overhead CPU scheduler, and native structured generation support. It was the first open-source framework to match DeepSeek's official inference throughput at scale on 96 H100s.

TensorRT-LLM delivers the best single-request latency (35-50 ms TTFT) through CUDA graph fusion and kernel optimization. FP8/FP4 native support. The tradeoff: steeper learning curve, Docker dependency, NVIDIA lock-in.

TGI (HuggingFace) offers seamless ecosystem integration with multi-backend support (NVIDIA, AMD, Intel). Note: as of early 2025, TGI has been placed in maintenance mode by HuggingFace.

Ollama targets developers wanting local LLM access in two commands. Not optimized for production throughput.

llama.cpp is the portable C/C++ option with the broadest hardware support (ARM, AVX, Metal, CUDA, ROCm, Vulkan). The GGUF format supports 1.5-bit through 8-bit quantization. CPU inference delivers 3-45 tokens/s; GPU (RTX 4090) reaches ~128 tokens/s on Llama 8B Q4_K_M.

35. GPU Selection for Inference

GPU pricing and availability as of March 2026:

| GPU | Memory | Bandwidth | Native Precision | Peak TFLOPS | NVLink | Cloud $/hr |
|---|---|---|---|---|---|---|
| B200 | 192 GB HBM3e | 8 TB/s | FP4, FP8, INT8 | ~4,500 (FP8) | 5.0 (1.8 TB/s) | ~$6.25 |
| H200 | 141 GB HBM3e | 4.8 TB/s | FP8, INT8 | 989 | 4.0 (900 GB/s) | $2.15-6.00 |
| H100 SXM | 80 GB HBM3 | 3.35 TB/s | FP8, INT8 | 989 | 4.0 (900 GB/s) | $1.49-3.90 |
| A100 SXM | 80 GB HBM2e | 2.0 TB/s | INT8, FP16 | 312 | 3.0 (600 GB/s) | $1.10-2.54 |
| L40S | 48 GB GDDR6 | 864 GB/s | FP8, INT8 | 362 (FP16) | None | $0.80-1.50 |
| A10G | 24 GB GDDR6 | 600 GB/s | INT8, FP16 | 70 | None | $1.00-1.50 |
| RTX 4090 | 24 GB GDDR6X | 1.0 TB/s | FP8, INT8 | 83 (FP32) | None | ~$0.35/hr |

H200 is the current sweet spot for large models — its 141 GB memory fits Llama 70B on a single GPU (previously requiring 2x H100), and 4.8 TB/s bandwidth delivers up to 2x faster inference than H100 on Llama 2 70B. B200 represents a generational leap: native FP4 support, 192 GB HBM3e, and NVLink 5.0 at 1.8 TB/s. The A10G is heavily utilized in cloud environments (like AWS G5) for serving 7B-8B parameter models due to its excellent cost-to-performance ratio. The RTX 4090 remains the consumer king at ~$1,600 MSRP.

Native hardware quantization support heavily influences performance. While algorithmic formats like AWQ and GPTQ (using INT4 weights) can be executed efficiently on any architecture by dequantizing into FP16 registers, true native acceleration via Tensor Cores is generation-dependent. Hopper (H100/H200) and Ada (L40S/4090) natively accelerate FP8, while Blackwell (B200) introduces native FP4 Tensor Cores for massive throughput gains. All listed GPUs support INT8 matrix operations.

Since LLM decode is memory-bandwidth-bound, bandwidth matters more than raw TFLOPS for most serving workloads. This is why H200 outperforms H100 despite identical compute architecture.

36. Model Cascading and Routing

Model Cascading vs Routing

Model routing dynamically selects which LLM handles each query based on predicted complexity. RouteLLM (LMSYS/UC Berkeley, ICLR 2025) uses matrix factorization routers trained on Chatbot Arena preference data to achieve 85% cost reduction on MT-Bench while maintaining 95% of GPT-4 quality. The economic rationale remains compelling today: a frontier model like GPT-4o costs ~$5.00/M tokens (blended) versus a fast open model like Llama 3 8B at ~$0.05/M — maintaining a roughly 100x cost gap that makes even imperfect routing enormously valuable.

Router architectures range from lightweight BERT classifiers (~1-5 ms overhead) to LLM-based judges. Cascading takes a sequential approach: the query runs through a sequence of models, starting with the fastest/cheapest. If a scoring function determines the generation is insufficient, it escalates to the next, more expensive model. Frameworks like FrugalGPT demonstrated that this sequential fallback can reduce inference costs by up to 98% while matching the best individual LLM's performance, or boost accuracy by up to 4% at the same cost. This ensures expensive frontier models are only engaged for the difficult tail of queries that actually require them.
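A minimal sketch of the cascading pattern, with stub model clients and a stub scorer standing in for real APIs:

```python
def cascade(query, models, scorer, threshold=0.8):
    """FrugalGPT-style cascade: try models cheapest-first and escalate
    whenever the scorer judges the answer insufficient.

    models: list of (name, generate_fn) ordered cheapest to most expensive;
    scorer: returns a confidence in [0, 1]. All callables here are stubs.
    """
    answer = None
    for name, generate in models:
        answer = generate(query)
        if scorer(query, answer) >= threshold:
            return name, answer        # cheap model was good enough -- stop
    return models[-1][0], answer       # fell through to the most expensive model

# Stub clients: a cheap model that hedges and a frontier model that answers.
small = ("llama-8b", lambda q: "uncertain answer")
big = ("gpt-4o", lambda q: "thorough answer")
score = lambda q, a: 0.9 if a.startswith("thorough") else 0.4

name, _ = cascade("hard query", [small, big], score)
assert name == "gpt-4o"                # escalated past the weak answer
assert cascade("easy query", [small, big], lambda q, a: 0.95)[0] == "llama-8b"
```

The threshold is the main tuning knob: set too low, quality drops on hard queries; set too high, everything escalates and the cost savings vanish.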


Part VII — Applications

This part covers the application-level patterns that turn raw model capabilities into useful systems. Embeddings power retrieval, RAG grounds responses in external knowledge, agents orchestrate multi-step workflows, and prompt engineering ties it all together. Each pattern builds on the infrastructure from previous parts — your embedding model choice sets the ceiling for RAG quality, and your RAG architecture shapes your capacity plan.

37. Embedding Models vs Generative Models

Embedding models encode text into fixed-dimensional vectors capturing semantic meaning. Unlike generative decoder models that produce token sequences, they output a single dense vector (768–4,096 dimensions) representing the entire input. Architecturally, most embedding models use encoder-only transformers (bidirectional attention) rather than decoders. The encoder processes all input tokens simultaneously, producing a contextualized representation for each token. A pooling layer then collapses these per-token representations into a single vector — typically mean pooling (averaging all token embeddings) or CLS pooling (using a special classification token's output). The model is then fine-tuned with contrastive learning: semantically similar texts are pushed closer together in vector space while dissimilar texts are pushed apart.
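
Mean pooling, the most common collapse step, is simple enough to show directly. A minimal NumPy sketch (names are illustrative; the masking convention mirrors typical sentence-embedding implementations):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Collapse per-token encoder outputs [seq_len, dim] into one vector,
    averaging only over real (non-padding) tokens."""
    mask = attention_mask[:, None].astype(float)       # [seq_len, 1]
    summed = (token_embeddings * mask).sum(axis=0)     # [dim]
    return summed / mask.sum()                         # mean over real tokens
```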

Top embedding models (2025-2026 landscape):

| Model | Dimensions | Architecture | MTEB Score |
|---|---|---|---|
| Qwen3-Embedding-8B | up to 4,096 | Decoder-based (Qwen3) | 70.6% |
| Gemini Embedding 2 | 3,072 | Multimodal (text/image/video/audio) | 68.2% |
| pplx-embed-v1-4B | 2,560 | Decoder-based (Qwen3), native INT8/binary | 69.7% |
| Voyage-3-large | 2,048 | Proprietary | 66.8% |
| OpenAI text-embedding-3-large | 3,072 | Proprietary | 64.6% |

Key trends in the 2025-2026 generation: top models now use decoder-only backbones (e.g., Qwen3, Mistral) with bidirectional attention and pooling layers added on top, challenging the traditional encoder-only paradigm. Perplexity's pplx-embed introduced native quantized embeddings — INT8 (4x storage reduction) and binary (32x reduction) produced directly during inference, eliminating post-hoc quantization loss. Google's Gemini Embedding 2 is the first natively multimodal embedding model, mapping text, images, video, and audio into a single unified vector space.

Matryoshka Representation Learning (MRL, Kusupati et al., NeurIPS 2022) is the key technique that makes embedding dimensions flexible. Named after Russian nesting dolls, MRL structures an embedding so that its first \(m\) dimensions are as informative as an independently trained \(m\)-dimensional model. The core idea: during training, instead of computing a single loss on the full embedding, MRL computes multiple losses in parallel at logarithmically spaced dimensions (e.g., 64, 128, 256, 512, 1024, 2048, 3072). All losses are aggregated and backpropagated together, forcing the model to pack the most critical semantic information into the leading dimensions, with each subsequent group of dimensions adding progressively finer detail — coarse-to-fine, like nested dolls.

After training, you can truncate the embedding to any of these trained dimensions by simply taking a prefix of the vector. The practical impact is striking: OpenAI's text-embedding-3-large truncated to just 256 dimensions outperforms the older text-embedding-ada-002 at its full 1,536 dimensions on MTEB — a 6x reduction in vector size with better quality. This translates directly to 6x less storage, 6x faster similarity search, and 6x lower vector database costs, with no model retraining required.
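
Truncation is literally a prefix slice plus re-normalization. A minimal sketch, assuming cosine-similarity search downstream:

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions of an MRL-trained embedding and
    re-normalize, so cosine similarity still behaves on the shorter vector."""
    prefix = vec[:dim]
    return prefix / np.linalg.norm(prefix)
```

This prefix-then-renormalize step is effectively what OpenAI's `dimensions` parameter does server-side.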

The embedding model is the single most important component choice in RAG pipelines — it determines retrieval quality, which sets the ceiling for generation quality. A poor embedding model cannot be compensated for by a better LLM.

38. RAG Architecture in Production

RAG Architecture

Retrieval-Augmented Generation grounds LLM responses in external knowledge by retrieving relevant documents at query time and injecting them into the prompt context. This addresses the core limitation of parametric-only models: their knowledge is frozen at training time and they hallucinate when asked about information not in their weights. A production RAG system is a multi-stage pipeline, and each stage significantly impacts final answer quality.

The ingestion pipeline runs offline. Raw documents (PDFs, HTML, Markdown, databases) are first parsed into clean text — a deceptively difficult step, since PDF parsing alone can lose tables, headers, and formatting. The text is then split into chunks, which are independently embedded and indexed. Chunking strategy has an outsized effect on retrieval quality: too small (< 100 tokens) and chunks lose context; too large (> 512 tokens) and they dilute relevance with off-topic content. The dominant approaches are fixed-size with overlap (e.g., 256 tokens with 10-15% overlap — simple and effective), recursive character splitting (splits by paragraph, then sentence, then word — respects natural boundaries), and semantic chunking (groups sentences by embedding similarity — highest quality but slower). Each chunk is then embedded via a model like those in Section 37 and stored in a vector database (Pinecone, Weaviate, Qdrant, pgvector, etc.).
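
Fixed-size chunking with overlap takes only a few lines. A minimal sketch operating on a pre-tokenized list (a real pipeline would use the embedding model's own tokenizer):

```python
def chunk_tokens(tokens: list[str], size: int = 256, overlap: int = 32) -> list[list[str]]:
    """Fixed-size chunking: each chunk shares `overlap` tokens with its
    predecessor so boundary sentences keep some surrounding context."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break   # last chunk already covers the tail
    return chunks
```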

The retrieval pipeline runs at query time. Naive RAG — embed the query, run a single vector search, stuff results into the prompt — often achieves disappointing accuracy in enterprise settings. Production-grade RAG adds multiple stages to close this gap:

  • Hybrid search combines dense vector retrieval (semantic similarity) with sparse retrieval like BM25 (exact keyword matching), merged via Reciprocal Rank Fusion (RRF). Dense search excels at semantic paraphrases ("cost of living" matches "expenses"), while sparse search catches exact terms that embeddings miss (product IDs, error codes, acronyms). Hybrid search delivers 15-30% accuracy improvement over pure vector search (Redis/Azure Benchmarks, 2024).
  • Reranking passes the top-\(k\) retrieved chunks (typically 20-50) through a cross-encoder model that jointly scores query-document relevance, far more accurately than the initial retrieval's bi-encoder cosine similarity. The cross-encoder sees the full query and document together, enabling fine-grained semantic matching. Reranking adds 23.4% improvement over hybrid search alone (Redis/Azure Benchmarks, 2024), at the cost of 50-200 ms additional latency. The reranked top-\(n\) results (typically 3-10) are then injected into the LLM prompt. I covered the full multi-stage search pipeline — from BM25 to cross-encoder reranking to LLM-powered relevance — in Building a Modern Search Ranking Stack.
  • Query transformation rewrites the user's query before retrieval to improve recall. HyDE (Hypothetical Document Embeddings) has the LLM generate a hypothetical answer, which is then embedded and used for retrieval. Multi-query expansion generates multiple phrasings of the same question. Step-back prompting asks a more general question first to retrieve broader context.
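
The RRF merge used in hybrid search is a one-pass score sum. A minimal sketch over ranked lists of document IDs, using the standard k=60 constant from the original RRF formulation:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge ranked lists (e.g. dense + BM25) by
    summing 1 / (k + rank) for each document across all lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked moderately well in both lists beats one ranked first in only one list, which is exactly the behavior you want from fusion.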

The most common failure modes in RAG are: retrieval failure (the correct document exists but isn't retrieved — fix with better chunking, hybrid search, or metadata filtering), context poisoning (irrelevant retrieved chunks mislead the LLM — fix with reranking and stricter top-\(k\) cutoffs), and lost-in-the-middle (the LLM ignores relevant context placed in the middle of a long prompt — Liu et al., 2024 showed models attend most to the beginning and end of context windows).

GraphRAG (Microsoft, 2024) augments vector search with knowledge graph traversal. During indexing, an LLM extracts entities and relationships from documents, building a graph that captures multi-hop connections invisible to flat vector search. At query time, GraphRAG traverses this graph for relationship-heavy queries ("which suppliers are connected to both companies?") where standard RAG fails entirely. The cost is substantially higher indexing compute and storage.

Typical end-to-end latency breakdown: embedding 5-20 ms, vector search 10-100 ms, reranking 50-200 ms, LLM generation 200-2,000 ms (Milvus RAG Optimization Guide, AIMultiple Reranker Benchmarks, 2026). The retrieval stages add only ~100-300 ms total overhead — modest compared to generation time — but their impact on answer quality is enormous: the difference between a hallucinated answer and a grounded one.

39. Agent Architectures and Tool Calling

LLM agents use models as reasoning engines that plan, invoke tools, observe results, and iterate. The architecture choice — how the agent reasons and when it acts — determines cost, latency, and reliability. Three core reasoning patterns dominate:

  • ReAct — Thought → Action → Observation loops. Adapts in real time as each observation updates reasoning, but history accumulates and must be re-processed every step, making it the most expensive pattern.
  • ReWOO — Plans all tool calls upfront in a single LLM pass using placeholders (#E1, #E2), executes them (potentially in parallel), then synthesizes. Up to 5x token savings over ReAct, but "blind" during execution — cannot recover if an early tool fails.
  • Plan-and-Execute — A planner generates a multi-step plan, an executor runs each step with the ability to re-plan on failure. Enables model specialization (strong model plans, cheap model executes). Best success rates on complex tasks.

| Pattern | Token Cost | Adaptability | Best For |
|---|---|---|---|
| ReAct | High | Excellent — pivots in real time | Exploratory tasks, debugging, chat |
| ReWOO | Low (~5x savings) | Poor — blind during execution | Predictable pipelines, dashboards |
| Plan-and-Execute | Medium | Good — re-plans on failure | Complex analysis, research tasks |

Function calling is the mechanism agents use to invoke tools. Native function calling (supported by GPT-4o, Claude, Gemini, Llama 3.1+) outputs structured JSON tool invocations validated against a provided schema, with significantly lower error rates than text-based parsing. The model sees tool definitions as part of its context and learns to emit well-formed calls. Parallel function calling (multiple tools in a single response) reduces round trips for independent operations.

Structured output and constrained decoding guarantee schema conformance by modifying token generation itself. Engines like xgrammar (used in vLLM and SGLang) apply grammar masks at each decoding step, achieving 100% valid JSON with near-zero overhead. Schema-Guided Reasoning (SGR) takes this further: because LLMs generate fields sequentially, placing analysis fields before decision fields in the schema forces the model to reason before deciding. Three SGR patterns — Cascade (sequential steps), Routing (Union types as semantic switches), and Cycle (bounded lists) — cover most production use cases.
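
The "reason before deciding" trick is purely a matter of field order in the schema. An illustrative JSON Schema fragment for an SGR Cascade (field names are hypothetical):

```python
# Because the model emits JSON fields in schema order, putting `analysis`
# and `severity_evidence` before `decision` forces the reasoning tokens
# to be generated before the model commits to an answer.
ticket_schema = {
    "type": "object",
    "properties": {
        "analysis": {
            "type": "string",
            "description": "Step-by-step reasoning about the ticket",
        },
        "severity_evidence": {"type": "array", "items": {"type": "string"}},
        "decision": {
            "type": "string",
            "enum": ["escalate", "auto_resolve", "request_info"],
        },
    },
    "required": ["analysis", "severity_evidence", "decision"],
}
```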

Production challenges: tool selection degrades beyond ~15-20 available tools, multi-step agents take 5-30+ seconds, and complex workflows cost 10-50x more tokens than single-prompt solutions. The 2024-2025 trend is hybrid approaches — ReAct reasoning with native function calling, orchestrated by frameworks like LangGraph.

40. Prompt Engineering for Production

Production prompt engineering is less about clever tricks and more about systematic reliability. The techniques below are ordered by impact — the first two alone solve most production issues.

Few-shot examples are the single most reliable way to control output format. Providing 3-5 diverse examples that cover edge cases (empty inputs, ambiguous queries, multi-part answers) dramatically improves format compliance and reduces the need for post-processing. The key is diversity — examples should span the distribution of real inputs, not just show the happy path. Diminishing returns set in beyond 5 examples, and each example consumes context tokens, so quality matters more than quantity.

Chain-of-thought (CoT) prompting asks the model to reason step by step before answering. Even the simple suffix "Let's think step by step" significantly boosts accuracy on math, logic, and multi-step reasoning tasks (Kojima et al., 2022). For production, few-shot CoT — providing examples that include the reasoning steps, not just the final answer — is more reliable than zero-shot. Self-consistency (Wang et al., 2023) takes this further: generate multiple CoT paths (temperature 0.5–0.7) and take the majority vote, reducing random errors at the cost of additional inference calls.
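
Self-consistency reduces to sampling plus a majority vote. A minimal sketch, where the hypothetical `sample` callable wraps one CoT generation at temperature ~0.7 and returns only the extracted final answer:

```python
from collections import Counter
from typing import Callable

def self_consistency(prompt: str,
                     sample: Callable[[str], str],
                     n: int = 5) -> str:
    """Draw n independent CoT samples and return the majority-vote answer."""
    answers = [sample(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```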

Structured output with explicit JSON schemas (Section 39) eliminates an entire class of parsing failures. When combined with constrained decoding engines like xgrammar, you get 100% valid output with near-zero overhead — no regex parsing, no retry loops.

Prompt chaining breaks complex tasks into a pipeline of focused steps — classify intent → retrieve context → generate response → validate output. Each step gets a simple, testable prompt with a single objective. This consistently outperforms "mega-prompts" that try to handle everything in one call, because each stage can use a different model (cheap classifier → expensive generator), failures are localized and debuggable, and intermediate outputs can be cached and reused. The tradeoff: chained prompts add latency from sequential LLM calls, so reserve chaining for quality-critical flows.

Temperature controls randomness in token sampling. Use 0.0–0.2 for deterministic tasks (classification, extraction, code generation) where consistency matters. Use 0.7–1.0 for creative tasks (writing, brainstorming) where diversity is desired. Avoid temperatures above 1.0 in production — outputs become incoherent. Note that temperature interacts with other sampling parameters (top_p, top_k) in non-obvious ways; in most production systems, setting temperature and leaving top_p at 1.0 is sufficient.

System vs user message separation matters more than most teams realize. System messages establish persistent behavior ("You are a medical coding assistant. Always cite ICD-10 codes.") while user messages carry per-request content. Models attend to system messages differently — they set the behavioral frame rather than contributing to the conversation content. Use hard constraints ("NEVER include patient names in output") in the system message, because models are more consistent at avoiding specific patterns than following general positive instructions.

The overarching shift in 2024-2025 is from "prompt engineering" to context engineering — the focus has moved from crafting clever instructions to dynamically assembling the right information (retrieved documents, tool results, conversation history, few-shot examples) in the right format at the right position within the context window. Given that models attend most strongly to the beginning and end of their context, placing critical instructions and the most relevant retrieved content at these positions yields measurable quality improvements. I explored this shift in depth in Context Engineering in the Agentic-AI Era.


Part VIII — Production Operations

The operational concerns in this part determine whether your system survives contact with real traffic. Rate limiting, failure modes, monitoring, cost optimization, and capacity planning are deeply interconnected — your monitoring signals drive your autoscaling decisions, your cost optimization strategy constrains your capacity plan, and understanding failure modes is prerequisite to building reliable alerting.

41. Rate Limiting for Variable-Cost Requests

Traditional requests-per-second (RPS) rate limiting assumes roughly equal cost per request. LLMs shatter that assumption. A 10-token classification prompt and a 100K-token document analysis hit the same API endpoint but differ in cost by four orders of magnitude. Rate-limiting by RPS either lets expensive requests through unchecked or starves cheap requests unnecessarily.

Production systems need token-based rate limiting across multiple dimensions. OpenAI enforces RPM (requests per minute) and TPM (tokens per minute) with tiered limits — GPT-5 Tier 1 starts at 500K TPM, scaling to 4M TPM at Tier 4. Anthropic takes a more granular approach with separate ITPM (input tokens per minute) and OTPM (output tokens per minute) limits, reflecting the reality that output tokens cost 3-5x more to generate than input tokens cost to process. Both use the token bucket algorithm, continuously replenishing capacity rather than resetting at fixed intervals.

The practical implementation pattern is a multi-dimensional limit hierarchy: user → application → organization → global, with priority tiers for premium access. At the request level, the key technique is token budget reservation: estimate total tokens (input + max_tokens) at admission time, deduct from the bucket, then adjust when the request completes with actual usage. This prevents a burst of long-generation requests from exhausting capacity before they even start producing output.
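
The reservation pattern can be sketched as a token bucket that admits on the worst-case budget and refunds on completion. A minimal single-threaded sketch; a production version would need locking and one bucket per limit dimension (user, application, organization):

```python
import time

class TokenBucket:
    """Token-based rate limiter with budget reservation: reserve the worst
    case (input + max_tokens) at admission, refund unused tokens on completion."""
    def __init__(self, tokens_per_minute: float):
        self.capacity = tokens_per_minute
        self.tokens = tokens_per_minute
        self.rate = tokens_per_minute / 60.0   # replenished per second
        self.last = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_reserve(self, input_tokens: int, max_tokens: int) -> bool:
        """Admit only if the worst-case token budget fits in the bucket."""
        self._refill()
        budget = input_tokens + max_tokens
        if budget > self.tokens:
            return False
        self.tokens -= budget
        return True

    def settle(self, reserved: int, actual: int) -> None:
        """Refund the difference between reserved and actual usage."""
        self.tokens = min(self.capacity, self.tokens + (reserved - actual))
```

Continuous replenishment (rather than resetting at minute boundaries) is what lets bursts borrow smoothly against future capacity.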

For self-hosted deployments, the equivalent is provisioned throughput — reserving dedicated GPU capacity for guaranteed token rates. Anthropic and Amazon Bedrock offer this as a product; for vLLM deployments, it means configuring admission control based on active decode slots and KV cache pressure rather than simple request counts. The core insight connects back to Section 5: throughput is measured in tokens, not requests, so rate limiting must be too.

42. Failure Modes to Design Against

LLM serving fails in ways that traditional web services do not. Understanding these failure modes — and designing defenses before they hit production — separates reliable systems from impressive demos.

Out-of-Memory (OOM) is the most common failure. A 70B FP16 model needs ~140 GB for weights alone, and KV cache for a single 128K-context sequence adds ~40 GB (Section 6). The gap between "fits in memory" and "OOM under load" is smaller than it looks: a batch of 8 long-context requests can require more memory for KV cache than the model weights themselves. Prevention starts with vLLM's gpu_memory_utilization parameter (default 0.9, conservative deployments use 0.85-0.90), combined with quantization and PagedAttention. For workloads with high KV cache pressure, LMCache enables offloading KV cache to CPU memory or disk, achieving 3-10x latency reduction by expanding effective cache capacity beyond GPU memory limits.

Preemption occurs when KV cache space runs out and the scheduler must evict active requests to make room for new ones. In vLLM, preempted requests are recomputed from scratch rather than swapped to CPU (recomputation is faster for most practical sequence lengths). From the user's perspective, this means their in-flight request silently restarts — TTFT appears normal, but end-to-end latency doubles or triples. Monitor preemption counts closely; rising preemption rates signal that you need more memory, shorter sequences, or more replicas.

Tail latency spikes when large prefills monopolize the GPU. A single 100K-token prefill can stall decode steps for all other requests in the batch. Chunked prefill (Section 7) is the primary defense, achieving 54% P99 TBT reduction. Beyond chunking, scheduling strategy matters: the Learning-to-Rank scheduler (NeurIPS 2024) predicts relative output lengths and approximates shortest-job-first ordering, delivering 6.9x mean latency reduction over FCFS with less than 2% overhead. Length-aware scheduling systems like CascadeInfer take this further by partitioning instances into length-specialized groups, reducing tail TTFT by 56-62%.

Cascading failures follow a predictable pattern: a slow request (large prefill or long generation) occupies a decode slot → queue depth grows → upstream clients hit timeouts → retries amplify load 2-3x → the system spirals into full degradation. The prevention hierarchy: admission control (reject requests when queue depth exceeds threshold) → per-request concurrency limits and max_tokens caps → circuit breakers at the gateway → disaggregated prefill/decode pools (Section 7) so that prefill-heavy requests cannot starve decoders.

43. Monitoring LLM Systems

LLM monitoring differs fundamentally from traditional API monitoring. Every request has variable cost, two distinct phases with different bottlenecks, and a memory footprint that depends on both input length and generation length. Standard metrics like request latency and error rate miss most of what matters.

The most important metric you are probably not tracking is goodput — the number of requests per second that meet all SLO thresholds (TTFT, TPOT, and total latency). The concept originated from Google Cloud's ML productivity measurement and has become the truest single measure of system health. Raw throughput can look excellent while goodput collapses: a system processing 100 requests/second but violating TTFT SLOs on 40% of them has a goodput of 60 — and 40% of your users are having a bad experience. Optimizing for goodput forces you to care about the distribution of performance, not just the mean.
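
Computing goodput is a filter, not an average. A minimal sketch with illustrative SLO thresholds:

```python
def goodput(requests: list[dict],
            ttft_slo_ms: float = 500.0,
            tpot_slo_ms: float = 50.0,
            window_s: float = 1.0) -> float:
    """Goodput: only requests meeting *all* SLO thresholds count.
    Each request dict carries its measured `ttft_ms` and `tpot_ms`."""
    ok = sum(1 for r in requests
             if r["ttft_ms"] <= ttft_slo_ms and r["tpot_ms"] <= tpot_slo_ms)
    return ok / window_s   # SLO-compliant requests per second
```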

vLLM exposes a comprehensive Prometheus endpoint at /metrics with metrics covering the full serving lifecycle: vllm:num_requests_running and vllm:num_requests_waiting (queue pressure), vllm:kv_cache_usage_perc (the fraction of KV cache blocks in use — above 90% predicts imminent preemption), generation length histograms, and prefix cache hit rates. The practical setup is straightforward: Prometheus scrapes the /metrics endpoint (typically every 1-5 seconds), Grafana visualizes dashboards, and OpenTelemetry with Jaeger handles distributed tracing across multi-service LLM pipelines.

The essential alert list, with context for why each threshold matters:

  • Preemption count spikes — requests are being evicted and restarted; users experience silent latency doubling
  • KV cache utilization >90% — you are one batch of long-context requests away from preemption cascades
  • Queue depth exceeding 2-3x your batch size — admission control should start rejecting or deprioritizing
  • TTFT rising while TPOT stays flat — this specific divergence pattern means the problem is queuing or network, not GPU compute. The prefill and decode phases have independent bottlenecks (Section 4), so when only one metric moves, you immediately know which subsystem to investigate. This single diagnostic saves hours of debugging.

44. Cost Optimization: A Compounding Strategy

Current API pricing (as of March 2026) spans three orders of magnitude: GPT-4o at $2.50/$10.00 per million input/output tokens versus Gemini Flash-Lite at $0.075/$0.30. The trajectory is remarkable — GPT-4-equivalent performance costs $0.40/M tokens today versus $20 in late 2022, costs declining roughly 10x annually. But even on this deflationary curve, the difference between naive and optimized deployment is 10-20x at any given price point.

Notice the asymmetry in every provider's pricing: output tokens cost 3-10x more than input tokens. GPT-4o charges $10/M output versus $2.50/M input (4x). Claude Sonnet 4 charges $15/M output versus $3/M input (5x). This asymmetry exists because output tokens require sequential decode steps (Section 7) while input tokens are processed in parallel. The practical implication: optimizing output length — through structured outputs, constrained generation, and concise system prompts — is higher leverage than optimizing input length.

The winning strategy stacks multiple approaches, and the effects compound multiplicatively:

  1. Quantization (FP16 → INT4) cuts serving memory 75%, enabling larger batches and 60-70% operational cost reduction (Section 9)
  2. Model routing directs 70% of traffic to cheaper models — real case: monthly bill from $42K to $29K (Maxim AI Case Study, 2024) — by matching request complexity to model capability (Section 36)
  3. Prompt caching saves up to 90% for workloads with shared prefixes — Anthropic charges 0.1x base price for cache reads, and cached tokens do not even count against ITPM rate limits (Section 16)
  4. Batch APIs offer 50% discounts for non-real-time work (evaluations, synthetic data generation, bulk classification)
  5. Self-hosting breaks even at roughly 2M+ tokens/day, but the breakeven analysis is treacherous — a $5K/month GPU budget easily becomes $25K after engineering time, infrastructure overhead, on-call burden, and sub-optimal utilization compared to providers who amortize across thousands of customers

Stack these conservatively: quantization (0.3x) × routing (0.7x) × caching (0.5x) × batching (0.5x) = ~0.05x, a 95% reduction from naive baseline. Not every workload qualifies for every optimization, but even partial stacking yields 5-10x savings.
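
The compounding arithmetic, spelled out (each factor is the fraction of baseline cost remaining after that optimization):

```python
from math import prod

# Each factor is the fraction of the naive baseline cost that remains.
factors = {"quantization": 0.3, "routing": 0.7, "caching": 0.5, "batching": 0.5}
remaining = prod(factors.values())      # 0.3 * 0.7 * 0.5 * 0.5 = 0.0525
reduction_pct = 100 * (1 - remaining)   # ~95% below the naive baseline
```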

The hidden cost that undermines the self-hosting math: ML engineering represents 70-80% of total deployment costs, not compute (TrueFoundry / MLOps Reports, 2024). The GPU bill is the visible part of the iceberg.

45. Capacity Planning and Autoscaling

Capacity planning for LLM serving differs from traditional web services in three fundamental ways: request costs vary by orders of magnitude (a 100-token request and a 100K-token request hit the same endpoint), decode is a long-running sequential process (not a fast request-response cycle), and the binding constraint is memory, not compute. Getting this wrong means either paying for idle GPUs or dropping requests under load.

The hard ceiling on concurrent requests is the KV cache memory budget, not FLOPS:

\[\text{max concurrent sequences} = \frac{\text{GPU memory} - \text{model weights} - \text{overhead}}{\text{per-sequence KV cache size}}\]

For a Llama 3 70B model in INT4 (~35 GB) on an 80 GB H100, roughly 40 GB remains for KV cache. With GQA and 4K average context, each sequence needs ~160 MB of KV cache, yielding ~250 concurrent sequences. At 128K context, that drops to ~5 sequences per GPU. This is why GPU selection and KV cache optimization directly determine your capacity plan.
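
The concurrency ceiling is a direct division. A minimal sketch plugging in the Llama 3 70B numbers above; the 5 GB runtime overhead is an illustrative assumption:

```python
def max_concurrent_sequences(gpu_mem_gb: float,
                             weights_gb: float,
                             overhead_gb: float,
                             kv_per_seq_mb: float) -> int:
    """Concurrency ceiling from the KV cache memory budget."""
    budget_mb = (gpu_mem_gb - weights_gb - overhead_gb) * 1000
    return int(budget_mb // kv_per_seq_mb)

# Llama 3 70B in INT4 (~35 GB) on an 80 GB H100, ~5 GB runtime overhead,
# ~160 MB of KV cache per sequence at 4K average context with GQA:
print(max_concurrent_sequences(80, 35, 5, 160))   # -> 250
```

Re-running this with 128K-context KV sizes shows why long-context workloads collapse to single-digit concurrency per GPU.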

The capacity formula for fleet sizing:

\[\text{required GPUs} = \frac{\text{peak tokens/min} \times \text{safety margin (1.3-1.5x)}}{\text{per-GPU throughput at target SLO}}\]

The critical detail is "at target SLO" — a single H100 can push 2,000+ tokens/second if you do not care about latency, but only 400-800 tokens/second while maintaining P99 TTFT under 500 ms. Always benchmark at your actual SLO targets, not theoretical maximums.
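
The fleet-sizing formula in code. A minimal sketch with illustrative numbers; the 600 tokens/second figure is an assumed SLO-constrained benchmark result, not a hardware spec:

```python
from math import ceil

def required_gpus(peak_tokens_per_min: float,
                  per_gpu_tokens_per_sec_at_slo: float,
                  safety_margin: float = 1.4) -> int:
    """Fleet size from peak demand; per-GPU throughput must be benchmarked
    at the target SLO, not the theoretical maximum."""
    demand = peak_tokens_per_min * safety_margin / 60.0   # tokens/sec
    return ceil(demand / per_gpu_tokens_per_sec_at_slo)

# 300K peak tokens/min, H100 sustaining 600 tok/s at P99 TTFT < 500 ms:
print(required_gpus(300_000, 600))   # -> 12
```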

Autoscaling signals require rethinking conventional wisdom. GPU utilization is nearly useless as a scaling signal because GPU utilization stays high even when the system is overloaded — the GPU is always busy, whether it is processing requests healthily or thrashing on preemptions. Better signals: queue depth (directly measures demand exceeding capacity), KV cache utilization (predicts preemption before it happens — scale up at 80%, not 95%), and goodput degradation (SLO violations are the ground truth that capacity is insufficient). These are exactly the metrics from Section 43.

For development and staging environments, scale-to-zero is essential economics. GPU instances cost $2-4/hour; a staging environment running 24/7 costs $1,500-3,000/month for a single GPU. Serverless inference platforms and Kubernetes-based autoscalers (like KEDA with custom metrics) can spin instances down to zero during idle periods and cold-start in 30-60 seconds — acceptable for non-production use. The tradeoff: model loading time (30-60 seconds for a 70B model) makes scale-to-zero impractical for production latency targets, but it cuts non-production GPU spend by 80-90%.


The Interconnected System

These 45 concepts form a deeply interconnected system rather than a loose collection. KV cache size dictates batch size, which determines arithmetic intensity, which controls whether decode is memory-bound, which drives TPOT, which defines throughput. GQA shrinks KV cache → enables larger batches → raises arithmetic intensity → better GPU utilization. FlashAttention exploits the SRAM-HBM bandwidth gap. Continuous batching solves compute utilization but creates memory fragmentation that PagedAttention resolves. Chunked prefill co-schedules compute-bound and memory-bound work — the roofline model explains why this complementarity works.

Inference Optimization Stack

On the training side, the Chinchilla trap pushed the field from "train big" to "train small and long" (Llama 3 8B at 1,875 tokens per parameter). GRPO eliminated PPO's four-model memory burden, enabling DeepSeek-R1's reasoning breakthroughs. Distillation from R1's chain-of-thought proved more effective than applying RL directly to smaller models — a finding that reshaped how the field builds compact reasoning systems.

The most important meta-pattern: inference costs are falling 10x per year while capabilities rise. Organizations that treat inference optimization as a strategic capability — combining routing, caching, quantization, and right-sized hardware — gain compounding advantages in a market where the cost floor keeps dropping.

Key Principles

  1. LLM inference has two distinct phases — prefill is compute-bound, decode is memory-bandwidth-bound. Every optimization targets one or both.
  2. The KV cache is the central bottleneck — its size determines batch capacity, memory pressure, and ultimately throughput. GQA, PagedAttention, and quantization all attack it.
  3. The GPU memory hierarchy drives everything — the 10x bandwidth gap between SRAM and HBM explains FlashAttention, kernel fusion, and why decode is memory-bound.
  4. Continuous batching + PagedAttention together deliver 23x throughput over naive serving — they are non-negotiable for production.
  5. Chinchilla-optimal is not inference-optimal — the field has moved to massive overtraining of smaller models (1,875 tokens/parameter for Llama 3 8B).
  6. GRPO and distillation reshaped alignment — DeepSeek-R1 showed that GRPO + verifiable rewards + distillation beats applying RL directly to smaller models.
  7. Cost optimization compounds — stacking quantization, routing, caching, and batch APIs achieves 5-10x cost reduction versus naive deployment.
  8. Bandwidth matters more than TFLOPS for serving — H200 outperforms H100 with identical compute, purely from memory bandwidth.

Further Reading

Related deep-dives from this blog, organized by topic:


References

Organized by topic area; section numbers in brackets where applicable.

Inference and Attention

Speculative Decoding

Quantization

Training and Fine-Tuning

Alignment

Scaling and Architecture

Embeddings

Agent Architectures

Routing

Benchmarks

Serving Architectures

Serving Frameworks

  • vLLM - PagedAttention-based serving engine
  • SGLang - RadixAttention and structured generation
  • TensorRT-LLM - NVIDIA optimized inference
  • llama.cpp - Portable C/C++ inference
  • DeepSpeed - Microsoft distributed training library
  • Ollama - Local LLM runner

Operations