2026-03-13 · Updated 2026-07-15

LLM Engineering Guide: 45 Concepts for Inference, Training, Architecture, and Operations

Production LLM systems pull from GPU hardware, systems engineering, and ML theory all at the same time. The same handful of concepts shows up whether you are tuning TTFT for a chatbot or configuring DeepSpeed ZeRO for a fine-tuning run. This guide collects them in one place.

TL;DR: This reference covers 45 concepts in eight parts, from hardware and inference through training, deployment, and operations. Each entry defines the concept, explains its practical effect, gives reported numbers where they clarify scale, and links to related sections. The cited data runs from 2024 through early 2026.

This guide assumes familiarity with basic ML (backpropagation, gradient descent, softmax) and some systems knowledge (memory hierarchies, networking basics).

[!NOTE] A note on scope

This is a long reference, not a linear tutorial. Use the table below to jump to the part that matches your current decision.

Part	Topics	Sections
I — Hardware foundations	Roofline model, GPU memory, hardware glossary	1–3
II — Inference fundamentals	Latency, throughput, KV cache, attention, quantization	4–9
III — Inference optimizations	CUDA kernels, FlashAttention, batching, PagedAttention, speculative decoding	10–17
IV — Model architecture	Transformer internals, decoder-only, MoE, tokenization, context windows	18–22
V — Training and alignment	Pretraining, LoRA, mixed precision, ZeRO, scaling laws, RLHF/DPO/GRPO, distillation	23–32
VI — Scaling and deployment	Parallelism, serving frameworks, GPU selection, routing	33–36
VII — Applications	Embeddings, RAG, agents, prompt engineering	37–40
VIII — Production operations	Rate limiting, failure modes, monitoring, cost, capacity planning	41–45

How to use this guide as a hub

This page is deliberately broad. Use it as the map, then jump to the deeper posts when the decision becomes concrete.

If you are deciding…	Start with	Then read
How to serve a model	Inference fundamentals and deployment	LoRAX Serving Guide
Whether to fine-tune	Training and alignment	LLM Fine-Tuning Guide
How retrieval fits into an app	Embeddings and RAG	RAG Evaluation Metrics
How agent systems work	Agents and prompt engineering	AI Agent Reasoning Loops
How to rank search results	Embeddings and reranking	Search Ranking Stack

The highest-return path is usually: understand the bottleneck, pick the smallest stack that exposes it, benchmark with the real workload, then add complexity only where the numbers justify it.

Part I — Hardware foundations

The concepts here — arithmetic intensity, the GPU memory hierarchy, and the hardware terms — show up everywhere else in this guide.

1. Memory-bound vs compute-bound and the roofline model

The starting point for LLM performance is arithmetic intensity: for every byte of data the GPU loads from memory, how many useful calculations does it perform? That ratio decides whether an operation is compute-bound (waiting on the processor) or memory-bound (waiting on data to load).

Every GPU has a “critical intensity” threshold where its compute capability exactly balances its memory bandwidth. For an NVIDIA H100 SXM, using dense FP16/BF16 Tensor Core throughput (Datasheet, 2023):

$\frac{989 \text{ TFLOPS}}{3.35 \text{ TB/s}} \approx 295 \text{ FLOPs/byte}$

Roofline Model

The two phases of LLM inference sit on opposite sides of this threshold:

Decode is memory-bound. Generating tokens one by one means loading the multi-gigabyte weight matrix from memory to multiply it against a single new token. In a simplified 16-bit, batch-one analysis, the operation has about 1 FLOP/byte, almost 300x below the H100 roofline threshold. That gap explains why decode cannot approach peak compute throughput in this regime.
Prefill is compute-bound. Processing the input prompt loads weights once but multiplies them against hundreds or thousands of tokens at the same time. The intensity climbs well above 295 and saturates the compute units.

So to speed up decode, work on memory bandwidth: shrink the weights with quantization, reduce KV memory overhead with GQA and PagedAttention, and raise intensity with batching. To speed up prefill, work on raw computation: faster GPUs, FP8 compute.
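The threshold arithmetic above can be sketched in a few lines. This is a minimal sketch under the same simplified assumptions as the text: weights are read once per matmul, activation traffic is ignored, and the helper names (`ridge_point`, `matmul_intensity`, `classify`) are made up for illustration.

```python
def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Critical arithmetic intensity (FLOPs/byte) where compute and memory balance."""
    return peak_flops / mem_bandwidth


def matmul_intensity(tokens: int, bytes_per_weight: float = 2.0) -> float:
    """Simplified intensity of multiplying `tokens` activations by a weight matrix.

    Each weight is read once and used for one multiply and one add per token.
    Activation and output traffic are ignored (assumption).
    """
    return 2.0 * tokens / bytes_per_weight


def classify(intensity: float, ridge: float) -> str:
    return "compute-bound" if intensity >= ridge else "memory-bound"


# H100 figures quoted in the text: 989 TFLOPS and 3.35 TB/s.
H100_RIDGE = ridge_point(989e12, 3.35e12)

print(round(H100_RIDGE))                               # 295
print(classify(matmul_intensity(1), H100_RIDGE))       # decode, batch 1
print(classify(matmul_intensity(2048), H100_RIDGE))    # prefill, 2,048 tokens
```

Under this model, a 16-bit matmul crosses the threshold once roughly 295 tokens share one weight read, which is why batching helps decode so much.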

2. GPU memory hierarchy

A GPU has four memory layers, arranged like a pyramid: a large but slow main memory (HBM) at the bottom, tiny but very fast registers at the top. Moving data up and down this pyramid is the main traffic jam. The hardest bottleneck sits between HBM and SRAM, where SRAM is roughly 10x faster.

GPU Memory Hierarchy

From fastest to slowest on an H100:

Registers — the fastest memory, attached directly to the processing threads. This is where the math actually runs; data has to be loaded here for the Tensor Cores to use it.
SRAM (Shared Memory) — the working memory at roughly 33 TB/s.
L2 Cache — a middle layer (50 MB) at around 12 TB/s. It acts as a buffer so that when multiple SMs need the same weights, they do not all have to fetch from HBM.
HBM3 — the 80 GB main memory holding model weights and KV cache, at ~3.35 TB/s.

FlashAttention, kernel fusion, and PagedAttention all reduce traffic between these layers. They keep data in fast SRAM longer and avoid repeated transfers to HBM.

3. GPU hardware glossary

The terms below show up throughout the rest of the guide.

HBM (High Bandwidth Memory) — Stacked DRAM dies connected via through-silicon vias (TSVs), mounted on-package next to the GPU die. Generations: HBM2e (A100, 2 TB/s), HBM3 (H100, 3.35 TB/s), HBM3e (H200/B200, 4.8–8 TB/s). Why it matters for LLMs: decode is memory-bandwidth-bound, so HBM bandwidth directly determines TPOT.

GDDR (Graphics DDR) — Traditional graphics memory (GDDR6, GDDR6X) used in consumer GPUs such as the RTX 4090 and in some data-center cards such as the L40S. Lower bandwidth than HBM but cheaper per GB. GDDR6X on RTX 4090 delivers ~1 TB/s versus H100’s 3.35 TB/s HBM3.

SM (Streaming Multiprocessor) — The basic compute building block of NVIDIA GPUs. Each SM contains CUDA cores, Tensor Cores, shared memory (SRAM), and a warp scheduler. H100 has 132 SMs; A100 has 108.

Tensor Cores — Specialized matrix-multiply-accumulate units inside each SM. They accelerate mixed-precision matmuls (FP16, BF16, FP8, INT8) that dominate transformer computation. H100 Tensor Cores deliver 989 TFLOPS in dense FP16/BF16 versus ~67 TFLOPS of FP32 from CUDA cores alone.

CUDA Cores — General-purpose floating-point and integer units. They handle element-wise operations, activation functions, and non-matmul work. Tensor Cores handle the heavy lifting for LLMs; CUDA cores handle everything else.

Warp — A group of 32 threads that execute in lockstep on an SM. The smallest scheduling unit on NVIDIA GPUs. Warp specialization assigns different warps to different tasks (data loading vs computation) for pipelining.

NVLink — High-speed GPU-to-GPU interconnect within a node. NVLink 4.0 (H100) delivers 900 GB/s bidirectional; NVLink 5.0 (B200) reaches 1.8 TB/s. Essential for tensor parallelism where GPUs must exchange activations every layer.

InfiniBand — High-speed network fabric for inter-node GPU communication. NVIDIA ConnectX-7 delivers 400 Gb/s per port. Used for pipeline parallelism and distributed training across nodes.

RDMA (Remote Direct Memory Access) — Allows one machine to read or write another machine’s memory without involving the remote CPU, minimizing latency. GPUDirect RDMA extends this to direct GPU-to-GPU transfers across nodes. Used in disaggregated serving for KV cache transfers.

NVMe (Non-Volatile Memory Express) — High-speed SSD interface used for KV cache offloading and ZeRO-Infinity parameter offloading when GPU/CPU memory is insufficient. Sequential read speeds of 5–7 GB/s per drive (PCIe Gen 4), with newer Gen 5 drives reaching 10–14 GB/s.

TFLOPS / PFLOPS — Tera/Peta floating-point operations per second. 1 TFLOPS = 10¹² FLOPS. The standard unit for measuring GPU compute throughput. H100 delivers 989 TFLOPS (dense FP16/BF16 Tensor Core); FlashAttention-3 achieves ~1.2 PFLOPS in FP8.

Part II — Inference fundamentals

Inference is the part of the system that users actually feel. Latency, the KV cache, the two-phase execution model, attention, and quantization between them set how fast, how cheap, and how reliably you can serve.

4. Latency: TTFT, TPOT, and percentiles

Time to First Token (TTFT) is the delay from request submission to the first output token. It is set by the prefill phase: the model has to process the whole prompt before generating anything, so longer prompts usually raise TTFT. Targets are product-specific; MLPerf Inference v5.0 uses a P99 TTFT limit of $\leq$ 450 ms for its Llama 2 70B interactive scenario.

Time Per Output Token (TPOT) is the average interval between consecutive tokens after the first. It maps to the decode phase, where each step is memory-bandwidth-bound:

$\text{TPOT} = \frac{\text{E2E Latency} - \text{TTFT}}{\text{Output Tokens} - 1}$

Average adult English silent reading speed is about 238 words per minute for non-fiction (Brysbaert, 2019). Streaming targets should still come from product testing; MLPerf uses a P99 TPOT limit of $\leq$ 40 ms for its interactive scenario.

P50 vs P99 latency matters because the median hides the tail. A system with a good P50 and a bad P99 may have batching, preemption, queueing, or workload-skew problems; traces are needed to distinguish them.
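A minimal sketch of how these metrics fall out of per-token timestamps. The function names and the nearest-rank percentile method are assumptions for illustration; benchmark harnesses may interpolate percentiles differently.

```python
import math


def ttft_and_tpot(request_sent: float, token_times: list[float]) -> tuple[float, float]:
    """Return (TTFT, TPOT) in seconds from a send time and per-token arrival times."""
    ttft = token_times[0] - request_sent
    e2e = token_times[-1] - request_sent
    n = len(token_times)
    # TPOT = (E2E latency - TTFT) / (output tokens - 1), as in the formula above.
    tpot = (e2e - ttft) / (n - 1) if n > 1 else 0.0
    return ttft, tpot


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# One request: first token after 300 ms, then one token every 30 ms.
ttft, tpot = ttft_and_tpot(0.0, [0.30, 0.33, 0.36, 0.39])

# 100 requests (TTFT in ms): the median looks healthy, the tail does not.
ttfts_ms = [100.0] * 98 + [900.0, 1200.0]
p50, p99 = percentile(ttfts_ms, 50), percentile(ttfts_ms, 99)
```

In the synthetic sample, P50 is 100 ms while P99 is 900 ms, which is the pattern that points to queueing or preemption rather than raw model speed.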

5. Throughput: tokens per second and the latency tradeoff

Throughput is measured in output tokens per second across concurrent requests. Requests per second is weaker on its own because a 10-token response and a 1,000-token response have very different costs. Published benchmark numbers vary with model, precision, hardware, prompt and output lengths, concurrency, and SLO. Compare vLLM, SGLang, and TensorRT-LLM with one harness rather than combining their headline results.

The tradeoff: at low concurrency, each request gets great latency but the GPU is underutilized. Increasing batch size raises throughput almost linearly until compute saturates, after which latency climbs sharply. Goodput, the fraction of requests that meet your SLO targets, is the metric that connects raw throughput to actual user satisfaction.

6. KV cache: the bottleneck behind most other bottlenecks

During autoregressive generation, each new token attends to every previous token. The KV cache stores the Key and Value projections from every token at every layer to avoid $O(n^2)$ recomputation. Without it, generating token $n$ would require re-running the model over all $n-1$ previous tokens.

The KV cache is usually the dominant memory pressure because it grows linearly with sequence length, batch size, and layer count:

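$\text{KV cache bytes} = 2 \times L \times h_{kv} \times d_h \times s \times B \times b$

where:

$L$ = number of layers
$h_{kv}$ = number of KV heads
$d_h$ = head dimension
$s$ = sequence length
$B$ = batch size
$b$ = bytes per element (2 for FP16/BF16, 1 for FP8/INT8)

The leading factor of 2 accounts for storing both keys and values.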

Concrete examples with FP16 and batch size 1: Llama 3 8B at 8,192 tokens uses ~1.0 GB of KV cache; at 128K tokens, 16 GB. Llama 3 70B at 128K tokens needs ~40 GB for a single sequence, half of an H100’s VRAM. At production batch sizes, KV cache easily exceeds model weight memory. Naive implementations waste 60–80% of the allocated KV memory to fragmentation, which is the problem PagedAttention was built to solve.
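The figures above can be reproduced directly from the formula. This is a sketch: the layer count, KV head count, and head dimension below are taken from the publicly released Llama 3 model configurations rather than from this guide, and `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 (K and V) x layers x KV heads x head dim x tokens x batch x bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


GIB = 1024 ** 3

# Assumed shapes (from the public model configs): both models use 8 KV heads of dim 128.
LLAMA3_8B = dict(layers=32, kv_heads=8, head_dim=128)
LLAMA3_70B = dict(layers=80, kv_heads=8, head_dim=128)

print(kv_cache_bytes(**LLAMA3_8B, seq_len=8192) / GIB)         # 1.0
print(kv_cache_bytes(**LLAMA3_8B, seq_len=128 * 1024) / GIB)   # 16.0
print(kv_cache_bytes(**LLAMA3_70B, seq_len=128 * 1024) / GIB)  # 40.0
```

Multiplying by a production batch size (the `batch` argument) shows quickly how the cache overtakes weight memory.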

The main optimizations are GQA (fewer KV heads), KV cache quantization (FP8/INT8), PagedAttention (block-based allocation with <4% waste), and offloading KV cache to CPU or NVMe.

7. Prefill vs decode: two phases, two bottlenecks

The prefill phase processes the input prompt in parallel and populates the KV cache. Its large matrix multiplications make it compute-bound, so prefill largely determines time to first token (TTFT). The decode phase generates one token at a time. Each step reads model weights and the KV cache from HBM, making decode memory-bandwidth-bound and the main driver of time per output token (TPOT).

Prefill vs Decode Phases

Chunked prefill splits the prompt into fixed-size chunks (for example, 512 tokens) instead of processing it all at once. A long prefill no longer blocks ongoing decode requests, compute-bound and memory-bound work get co-scheduled on the same GPU, and vLLM benchmarks show +50% throughput (Agrawal et al., 2024). The cost is slightly higher TTFT for the new request.

Disaggregated serving places prefill and decode on separate GPU pools, allowing each pool to target a different bottleneck. Splitwise and DistServe describe the pattern. The pools transfer KV-cache data over a fast interconnect such as RDMA, so communication cost becomes part of the design.

8. GQA and MQA: shrinking the KV cache

Attention Variants

Standard Multi-Head Attention (MHA) gives every query head its own K and V head. Multi-Query Attention (MQA) shares a single KV head across all query heads, which is an extreme reduction. Grouped-Query Attention (GQA) is the practical middle ground: groups of query heads share one KV head.

Llama 3 70B uses 64 query heads but only 8 KV heads, an 8x KV cache reduction versus the same architecture with one KV head per query head. Llama 3.1 405B uses 128 query heads and 8 KV heads, a 16x reduction by the same calculation (Meta, 2024). Ainslie et al. report GQA quality close to MHA in their tested models while approaching MQA speed. A smaller KV cache can support larger batches, but the realized latency and throughput gain still depends on the kernel and workload.
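The head sharing can be sketched as a simple index mapping. This assumes contiguous grouping (query heads 0 to 7 share KV head 0, and so on), which is a common layout but not the only possible one; the function names are illustrative.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Map a query head to the KV head it reads, assuming contiguous groups."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def kv_reduction(n_q_heads: int, n_kv_heads: int) -> float:
    """KV cache reduction versus one KV head per query head."""
    return n_q_heads / n_kv_heads


# MHA, GQA, and MQA are the same mapping with different n_kv_heads.
print(kv_reduction(64, 64))   # MHA: 1.0
print(kv_reduction(64, 8))    # GQA (Llama 3 70B head counts): 8.0
print(kv_reduction(64, 1))    # MQA: 64.0
```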

9. Quantization: trading bits for speed and memory

Quantization reduces the precision of model weights and/or activations. The core tradeoffs:

Format	Bits	Weight memory (7B model)	Quality note
FP16/BF16	16	~14 GB	Baseline for comparison
FP8	8	~7 GB	Hardware-native on Hopper; evaluate model
INT8	8	~7 GB	Calibration and kernel dependent
INT4	4	~3.5 GB	Largest compression; evaluate carefully

AWQ (Activation-Aware Weight Quantization) finds the <1% of salient weights by looking at activation magnitudes and applies per-channel scaling to protect them. It needs only 128–1,024 calibration tokens and won the MLSys 2024 Best Paper Award. GPTQ uses second-order Hessian information for layer-wise quantization and needs more calibration data. bitsandbytes (Tim Dettmers’ library) quantizes during model loading with no separate preprocessing step; its NF4 format powers QLoRA fine-tuning. FP8 on Hopper-class hardware halves weight memory relative to FP16/BF16, but quality and speed still depend on the model, calibration, and kernel.

The serving kernel can matter as much as the quantization algorithm. In the comparison discussed in Section 10, the same quantized weights differ by 2.6x in throughput across kernels.
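To make the memory column and the basic idea concrete, here is a minimal sketch of symmetric absmax INT8 quantization for one weight row. It is deliberately the simplest scheme, not AWQ or GPTQ, and the helper names are hypothetical.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Weight memory in GB (10^9 bytes), ignoring scales and other metadata."""
    return n_params * bits / 8 / 1e9


def quantize_int8(row: list[float]) -> tuple[list[int], float]:
    """Symmetric absmax quantization: map [-absmax, absmax] onto [-127, 127]."""
    absmax = max(abs(w) for w in row)
    scale = absmax / 127 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]


row = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(row)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))  # bounded by scale / 2

print(weight_memory_gb(7e9, 16), weight_memory_gb(7e9, 8), weight_memory_gb(7e9, 4))
```

A single large outlier in a row inflates `scale` and costs precision for every other weight in it, which is the problem that per-channel scaling and salient-weight protection try to contain.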

Part III — Inference optimizations

This part is about the software techniques that turn a working inference system into a fast one. Each one targets a specific bottleneck: FlashAttention exploits the SRAM-HBM gap, PagedAttention removes KV cache fragmentation, continuous batching keeps the GPU busy.

10. CUDA kernels and kernel fusion

A CUDA kernel is a function written for the GPU that runs in parallel across thousands of threads. When the CPU calls a kernel, the GPU distributes the work across its SMs: each SM runs multiple warps of 32 threads, and each thread processes a slice of the data. Every operation in LLM inference, from matrix multiplication to token sampling, is ultimately a kernel launch. A single forward pass through a 70B model triggers hundreds to thousands of kernel launches, and the gap between a naive kernel and an optimized one can decide whether your system meets its latency SLO.

The main kernel categories in LLM serving:

GEMM kernels for matrix multiplication, which dominate both prefill and decode compute.
Attention kernels like FlashAttention that tile computations to stay in SRAM instead of spilling to HBM.
Fused kernels that combine multiple operations (such as add + layer norm or QKV projection) into a single launch to skip the intermediate HBM round-trips.
Sampling kernels that convert logits to token IDs via top-k, top-p, or temperature sampling.

Kernel quality often matters more than the quantization algorithm. The same INT4-quantized weights served through Marlin (an optimized FP16xINT4 kernel) hit 712 tokens/s versus vanilla GPTQ’s 276 tokens/s, a 2.6x throughput difference purely from better GPU utilization. Marlin gets there through asynchronous memory fetching and shared memory queues that keep Tensor Cores fed instead of waiting on HBM. Triton lowers the barrier to writing custom kernels by exposing GPU programming through Python rather than raw CUDA C++, which makes kernel-level optimization accessible to ML engineers rather than only GPU specialists. Most of the optimizations later in this part (FlashAttention, fused kernels, PagedAttention) are at heart either better kernels or smarter ways to orchestrate kernel launches.

Kernel fusion combines sequential operations into one GPU kernel and skips intermediate HBM writes. Common fusions include QKV projection, attention plus softmax, add plus RMSNorm (FlashNorm), and SwiGLU activation (DeepFusionKernel). The exact launch-count and utilization gains depend on the model graph, compiler, GPU, and serving framework, so profile the deployed stack rather than relying on a universal percentage.

11. FlashAttention: tiling attention to live in SRAM

Standard attention materializes the full $N \times N$ attention matrix in HBM, which costs $O(N^2)$ memory and produces a lot of memory traffic. The idea behind FlashAttention is to never materialize this matrix at all. It tiles the Q, K, V matrices into blocks that fit in SRAM, computes partial attention within each tile, and merges results using an online softmax (incrementally tracking the running max and sum across blocks). Memory drops from $O(N^2)$ to $O(N)$ , and HBM reads drop by an order of magnitude.
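The online softmax is the piece that makes tiling possible, and it can be sketched for a single query row in plain Python. This is an illustration of the running max and sum bookkeeping only: it takes precomputed scores, whereas the real kernel computes each score tile in SRAM and also tiles over queries.

```python
import math


def naive_attention_row(scores: list[float], values: list[list[float]]) -> list[float]:
    """Reference: softmax over all scores at once, then a weighted sum of values."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention_row(scores: list[float], values: list[list[float]],
                        block: int = 2) -> list[float]:
    """Same result, visiting K/V one block at a time with a running max and sum."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale everything accumulated so far to the new max (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


scores = [0.1, 2.0, -1.0, 3.5, 0.7]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.3, 0.3]]
full = naive_attention_row(scores, values)
tiled = tiled_attention_row(scores, values, block=2)
```

The tiled version never holds more than one block of scores at a time, which is the property that lets the full $N \times N$ matrix stay out of HBM.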

Each version targets the bottleneck of its GPU generation:

FlashAttention v1 (A100, 2022) proved the tiling plus online-softmax idea works. A 2–4x speedup over standard attention, but only 25–40% GPU utilization because kernel scheduling left many SMs idle.
FlashAttention v2 (A100, 2023) reworked the parallelism to split work across the sequence dimension in addition to batch and heads. It reached 50–73% utilization on A100, roughly 2x faster than v1.
FlashAttention v3 (H100 Hopper, 2024) added warp specialization (separate warps for data movement vs. math) and GEMM-softmax pipelining to overlap memory loads with computation. 75–85% utilization on H100 and up to ~1.2 PFLOPS in FP8. NeurIPS 2024 spotlight.
FlashAttention v4 (B200 Blackwell, 2026) addresses a new bottleneck: on Blackwell, tensor core throughput scales so fast that non-matmul operations (softmax exponentials, rescaling) become the limiter. FA4 software-emulates the exponential with polynomial approximations on FMA units, uses conditional rescaling to reduce overhead, and stores intermediates in Blackwell’s dedicated tensor memory (TMEM) instead of registers. The result: 1,605 TFLOPS on B200 in BF16, 1.3x faster than cuDNN 9.13 and 2.7x faster than Triton.

Each generation hit a different hardware wall, and each FlashAttention version was redesigned from the kernel up to address it.

12. FlashDecoding: parallelizing the decode bottleneck

Standard FlashAttention keeps the GPU busy by splitting work across batch size and query length. During decode the model generates exactly 1 token at a time (query length = 1). If the batch size times the number of attention heads is less than the GPU’s total SM count (108 on an A100), most of the GPU sits idle while a few units grind sequentially through the token history.

FlashDecoding solves this by adding a new parallelization dimension: the KV sequence length itself. It chops the KV cache into smaller chunks and distributes them across all the otherwise-idle GPU processors to evaluate in parallel, then merges their partial computations with a log-sum-exp reduction.

The result is up to an 8x end-to-end decode speedup on long sequences (64K context) and roughly constant decode time per token. Generating token 60,000 stays almost as fast as generating token 100.
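The split-and-merge step can be sketched for one query in plain Python. Each chunk is processed independently (in a real kernel, on a different SM) and returns a locally normalized output plus its log-sum-exp; the merge reweights the chunk outputs by those values. Function names are illustrative, and scores are assumed to be precomputed.

```python
import math


def partial_attention(scores: list[float], values: list[list[float]]):
    """Attention over one KV chunk: (locally normalized output, log-sum-exp)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)


def merge_partials(partials) -> list[float]:
    """Combine chunk outputs, weighting each by exp(lse_chunk - lse_max)."""
    lse_max = max(lse for _, lse in partials)
    coeffs = [math.exp(lse - lse_max) for _, lse in partials]
    denom = sum(coeffs)
    dim = len(partials[0][0])
    return [sum(c * out[d] for c, (out, _) in zip(coeffs, partials)) / denom
            for d in range(dim)]


def split_kv_attention(scores, values, chunk: int) -> list[float]:
    partials = [partial_attention(scores[i:i + chunk], values[i:i + chunk])
                for i in range(0, len(scores), chunk)]  # independent, so parallelizable
    return merge_partials(partials)


scores = [0.1, 2.0, -1.0, 3.5, 0.7, 1.2]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.3, 0.3], [4.0, -2.0]]
single_chunk = split_kv_attention(scores, values, chunk=len(scores))
three_chunks = split_kv_attention(scores, values, chunk=2)
```

Because the chunks do not depend on each other, the amount of parallel work grows with context length instead of being capped by batch size times head count.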

13. Continuous batching vs static batching

Static batching waits for every sequence in a batch to finish before starting the next, so short sequences waste GPU cycles idling after they hit end-of-sequence. Continuous batching (introduced by the Orca paper, OSDI 2022) operates at iteration-level granularity: at each decode step, completed sequences are removed and new ones inserted.

In Anyscale’s OPT-13B benchmark, optimized static batching reached 4x its naive baseline, continuous batching reached 8x, and vLLM with continuous batching plus PagedAttention reached 23x (Anyscale, 2023). Continuous batching also increases pressure on KV allocation, which is why it is commonly paired with paged memory management.
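A toy simulation shows where the idle cycles go. It counts decode iterations only and assumes prefill is free and KV memory is unlimited, so the numbers illustrate the scheduling effect rather than predict any benchmark.

```python
def static_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Finished sequences leave and waiting ones join at every iteration."""
    waiting = list(output_lengths)
    running: list[int] = []   # remaining tokens per active sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # A sequence with one token left emits it this step and frees its slot.
        running = [r - 1 for r in running if r > 1]
    return steps


# Two long responses mixed with six short ones, four slots on the GPU.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, batch_size=4))      # 200
print(continuous_batching_steps(lengths, batch_size=4))  # 110
```

In the static case the second long request cannot start until the first batch drains; in the continuous case it starts at step 11, as soon as a short request frees a slot.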

14. PagedAttention: virtual memory for KV cache

vLLM’s PagedAttention applies the OS virtual-memory idea to KV cache management. The KV cache is split into fixed-size blocks (typically 16 tokens), blocks are allocated on demand as tokens are generated, and logical (sequential) positions map to physical (scattered) memory locations through block tables. Multiple requests that share a prefix (system prompts, beam search) can point to the same physical blocks.

Earlier systems wasted 60–80% of KV cache memory to fragmentation and pre-allocation. PagedAttention drops that to <4%, which lets throughput rise 2–4x at the same latency and up to 24x over HuggingFace Transformers (vLLM Blog, 2023).
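The block-table bookkeeping can be sketched as a small allocator. This is a hypothetical illustration of the idea, not vLLM's implementation: it omits prefix sharing (which needs reference counts and copy-on-write), preemption, and swapping.

```python
class BlockAllocator:
    """Maps each sequence's logical token positions to scattered physical blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, seq_id: str) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free:
                raise MemoryError("no free KV blocks; a real scheduler would preempt or swap")
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1

    def physical_slot(self, seq_id: str, position: int) -> tuple[int, int]:
        """Translate a logical position into (physical block, offset in block)."""
        block = self.block_tables[seq_id][position // self.block_size]
        return block, position % self.block_size

    def wasted_slots(self, seq_id: str) -> int:
        """Allocated but unused slots: always less than one block per sequence."""
        return len(self.block_tables[seq_id]) * self.block_size - self.lengths[seq_id]

    def release(self, seq_id: str) -> None:
        self.free.extend(self.block_tables.pop(seq_id))
        del self.lengths[seq_id]


alloc = BlockAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("req-a")
```

After 20 tokens the sequence holds two blocks and wastes 12 slots, instead of reserving space for its maximum possible length up front.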

15. Speculative decoding: multiple tokens per forward pass

Speculative Decoding

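A small draft model generates $K$ candidate tokens, then the large target model verifies all $K$ tokens in a single forward pass. Tokens are accepted up to the first one the target rejects; that token is replaced with one drawn from the target’s own distribution, and the remaining drafts are discarded. With the standard rejection-sampling acceptance rule, the output distribution is identical to that of the target model alone, so this is a lossless speedup.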

It works because LLM decode is memory-bandwidth-bound: verifying $K$ tokens costs about the same as generating 1, since both load all model weights once. Typical speedups land between 1.5x and 3x, with methods like EAGLE-3 reaching up to 6.5x. Variants include Medusa (extra prediction heads, no separate model), prompt lookup decoding (n-gram matching against the input, free), and EAGLE (feature-level extrapolation).

At high batch sizes, extra draft and verification work can erase the gain; one cited evaluation reports a 1.4–1.8x slowdown in that regime. Speculative decoding is most promising when the serving batch is small enough and draft acceptance is high.
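The draft-then-verify loop can be sketched with toy models. This is the greedy special case (accept a draft token only if it equals the target's argmax), not the rejection-sampling rule used with sampling. The toy "models" are plain functions, and the target is called once per position for clarity, where a real system scores all positions in one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # context -> greedy next token


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One draft-and-verify round; returns the tokens to append."""
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    accepted: List[int] = []
    for tok in proposal:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)   # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)
    # All k drafts matched: the verification pass also yields one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new]


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except after token 4, where it guesses wrong.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


result = generate(toy_target, toy_draft, [0], max_new=12, k=3)
```

Every round appends at least one target-approved token, so the output matches target-only greedy decoding regardless of draft quality; draft quality only changes how many tokens each round yields.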

16. Prefix caching and KV cache reuse

Instead of throwing away the KV cache when a request finishes, prefix caching keeps it around for reuse on new requests that share the same prefix tokens. That cuts redundant prefill for system prompts, few-shot examples, RAG context, and multi-turn conversation history.

vLLM’s Automatic Prefix Caching hashes KV blocks and uses a global hash table for lookup. SGLang’s RadixAttention maintains a radix tree of cached KV tensors with token-level granularity. Both depend on repeated token-identical prefixes, so report hit rate alongside latency or throughput.
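Hash-based block lookup can be sketched as follows. This is a hypothetical illustration, not vLLM's code: each block's hash covers its own tokens plus the previous block's hash, so a block only matches when the entire prefix before it matches too. The block size of 4 and the placeholder cache values are chosen for readability.

```python
import hashlib

BLOCK = 4  # tokens per block; small for illustration


def block_hashes(tokens: list[int], block: int = BLOCK) -> list[bytes]:
    """Chain-hash each full block with the hash of everything before it."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % block   # the trailing partial block is skipped
    for i in range(0, full, block):
        payload = prev + ",".join(map(str, tokens[i:i + block])).encode()
        prev = hashlib.sha256(payload).digest()
        hashes.append(prev)
    return hashes


class PrefixCache:
    def __init__(self) -> None:
        self.store: dict[bytes, str] = {}

    def insert(self, tokens: list[int]) -> None:
        for h in block_hashes(tokens):
            self.store.setdefault(h, "kv-block")  # placeholder for real KV tensors

    def cached_prefix_len(self, tokens: list[int]) -> int:
        """Number of leading tokens whose KV blocks are already cached."""
        n = 0
        for h in block_hashes(tokens):
            if h not in self.store:
                break
            n += BLOCK
        return n


cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])            # caches two full blocks
hit = cache.cached_prefix_len([1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102])
miss = cache.cached_prefix_len([9, 2, 3, 4, 5, 6, 7, 8])
```

The second lookup misses entirely even though its second block has the same tokens as a cached block, because a single changed token earlier in the prompt changes every later hash. That is why hit rate is sensitive to prompt templating.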

17. Streaming in practice

Streaming sends tokens to the client as they are generated instead of waiting for the full response. Many serving frameworks expose it via Server-Sent Events: the client opens a long-lived HTTP connection, and the server pushes each token or token batch as a data: event. TTFT determines when the user first sees output; TPOT helps determine how smooth it feels. Set the target through product testing and the chosen interaction model.

On the client side, streaming forces buffering decisions. Rendering token by token can cause visual jitter, especially with markdown or code blocks that need multi-token context to format correctly. Common patterns are word-level buffering (accumulate tokens until a whitespace boundary), line-level buffering (wait for a newline before rendering), and adaptive buffering (render immediately for prose, buffer for code blocks). The stream_options: {"include_usage": true} parameter in OpenAI-compatible APIs returns token counts in the final SSE event, which makes accurate cost tracking possible for streamed responses.
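Word-level buffering, the simplest of these patterns, can be sketched as a generator that sits between the token stream and the renderer. The token strings below are made up; a real client would feed it the text deltas from the SSE stream.

```python
from typing import Iterable, Iterator


def word_buffered(tokens: Iterable[str]) -> Iterator[str]:
    """Yield text only up to the last whitespace boundary seen so far."""
    pending = ""
    for tok in tokens:
        pending += tok
        cut = max(pending.rfind(" "), pending.rfind("\n"))
        if cut >= 0:
            yield pending[:cut + 1]
            pending = pending[cut + 1:]
    if pending:            # flush whatever is left when the stream ends
        yield pending


stream = ["Hel", "lo", " wor", "ld", "!", "\n", "Done"]
chunks = list(word_buffered(stream))
# chunks == ["Hello ", "world!\n", "Done"]
```

The trade-off is a small delay: a word is held back until the token that follows it arrives, so perceived smoothness improves at the cost of roughly one TPOT of extra latency per word.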

Chunked prefill is what makes streaming hold up under load. Without it, a single long prefill can stall token delivery for every other concurrent user.

Part IV — Model architecture

How LLMs are built: the transformer block, tokenization, context handling, and the architectural variants that became the default. These underpin both inference and training.

18. Transformer architecture essentials

A modern decoder-only transformer (GPT, Llama) is a stack of identical layers, each with two sub-blocks: attention and feed-forward. Every sub-block is wrapped with a residual connection and normalization. The key components:

Multi-Head Attention — the mechanism that lets each token look at every other token to decide what is relevant. The input is projected into three matrices: Queries (what am I looking for?), Keys (what do I contain?), and Values (what information do I carry?). Attention scores are then computed as:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

The $QK^T$ dot product measures similarity between every pair of tokens. Dividing by $\sqrt{d_k}$ keeps the dot products from growing too large (which would push softmax into regions with vanishing gradients). The softmax converts scores to probabilities, and multiplication by $V$ produces a weighted combination of value vectors. Running this across multiple heads in parallel lets the model attend to different relationships at the same time (one head for syntax, another for coreference, and so on).
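The formula translates almost line for line into NumPy. The sketch below covers a single head and adds the causal mask that decoder-only models apply on top of the formula; it assumes NumPy is available and makes no attempt at efficiency.

```python
import numpy as np


def causal_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray):
    """Single-head scaled dot-product attention. q, k, v: (seq_len, d_k)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                      # pairwise similarity, scaled
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)           # a token cannot see later tokens
    scores = scores - scores.max(axis=-1, keepdims=True) # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over each row
    return weights @ v, weights


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 8)) for _ in range(3))
out, weights = causal_attention(q, k, v)
```

Each row of `weights` sums to 1, and the first token can only attend to itself, so its output equals its own value vector.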

Feed-Forward Network (FFN) — once attention has decided which tokens are relevant, the FFN decides what to do with that information. Modern LLMs use SwiGLU instead of the original two-matrix ReLU FFN:

$\text{FFN}_{\text{SwiGLU}}(x) = \left(\text{Swish}(xW_1) \otimes xW_3\right)W_2$

SwiGLU uses three weight matrices instead of two and a smooth Swish activation instead of ReLU. It is used by model families including Llama, Mistral, Qwen, and Gemma. The FFN typically accounts for about two-thirds of total model parameters, although the exact share depends on the architecture.
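A minimal NumPy sketch of the three-matrix feed-forward block. The names `w_gate`, `w_up`, and `w_down` are illustrative labels for the three matrices, and the sizes are toy values.

```python
import numpy as np


def swish(x: np.ndarray) -> np.ndarray:
    """Swish (SiLU): x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def swiglu_ffn(x: np.ndarray, w_gate: np.ndarray,
               w_up: np.ndarray, w_down: np.ndarray) -> np.ndarray:
    """Gate the up-projection elementwise, then project back to the model width."""
    return (swish(x @ w_gate) * (x @ w_up)) @ w_down


d_model, d_hidden = 8, 16
rng = np.random.default_rng(0)
w_gate = rng.standard_normal((d_model, d_hidden))
w_up = rng.standard_normal((d_model, d_hidden))
w_down = rng.standard_normal((d_hidden, d_model))

x = rng.standard_normal((3, d_model))      # 3 tokens
y = swiglu_ffn(x, w_gate, w_up, w_down)    # same shape as x
ffn_params = w_gate.size + w_up.size + w_down.size  # 3 * d_model * d_hidden
```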

Residual Connections — each sub-block adds its output back to its input: $\text{Output} = \text{Input} + \text{Sublayer}(\text{Input})$ . Without this skip connection, gradients vanish when backpropagating through 80–128 layers. The residual creates a path that lets information and gradients flow directly from early layers to late ones.

RMSNorm — common in modern LLM families. LayerNorm normalizes by re-centering (subtracting the mean) and re-scaling (dividing by standard deviation). RMSNorm skips the mean subtraction and only re-scales; its paper reports 7–64% speedups across the tested models without a performance penalty in those experiments. Pre-norm placement, which normalizes before attention or the FFN, is also common because it improves gradient stability.

Parameter count estimation for a decoder-only model:

$\text{Total} \approx V \times d + 12 \times L \times d^2$

where $V$ is vocabulary size, $d$ is hidden dimension, and $L$ is layer count. The $V \times d$ term is the input embedding matrix; the $12 \times d^2$ term approximates the attention and FFN weights in each layer. For Llama 3 8B ( $V=128{,}256$ , $d=4{,}096$ , $L=32$ ), the estimate is about $0.53B + 6.44B = 6.97B$ parameters. The published 8.03B total is higher because the approximation omits architectural details such as the exact FFN width and the separate output projection.
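The estimate is quick to reproduce. The factor of 12 comes from counting $4d^2$ for the attention projections (Q, K, V, output) plus $8d^2$ for a classic two-matrix FFN with a $4d$ hidden width; the helper name is illustrative.

```python
def estimate_params(vocab: int, d_model: int, layers: int) -> tuple[int, int, int]:
    """Return (embedding, transformer blocks, total) using V*d + 12*L*d^2."""
    embedding = vocab * d_model
    blocks = 12 * layers * d_model ** 2
    return embedding, blocks, embedding + blocks


# Llama 3 8B values quoted in the text.
emb, blocks, total = estimate_params(vocab=128_256, d_model=4_096, layers=32)
print(f"{emb / 1e9:.2f}B + {blocks / 1e9:.2f}B = {total / 1e9:.2f}B")
# 0.53B + 6.44B = 6.97B
```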

19. Why decoder-only architectures dominate

The original Transformer (2017) had both an encoder and a decoder. Since then, the field split into three architectural families, and one became the default for generative AI.

Encoder-only models (BERT, RoBERTa) use bidirectional attention: every token attends to every other token in both directions. That produces rich representations for understanding tasks (classification, NER, semantic similarity) but cannot generate text autoregressively. Encoder-only models still dominate as the backbone for embedding models, rerankers, and lightweight classifiers (for example, the BERT-based routers in RouteLLM).

Encoder-decoder models (T5, BART, the original Transformer) separate understanding from generation. The encoder processes the full input with bidirectional attention, then the decoder generates output autoregressively while attending to the encoder’s representations through cross-attention. This had a natural advantage for sequence-to-sequence tasks like translation, where input and output are fundamentally different sequences. Google’s T5 showed that any NLP task could be framed as text-to-text, and encoder-decoder models still power some specialized systems (Whisper for speech recognition, FLAN-T5 for instruction following).

Decoder-only models (GPT, Llama, Mistral, Gemini) use causal (unidirectional) attention: each token attends only to previous tokens. They frame everything as next-token prediction: the “input” is the beginning of the sequence, and the “output” is its continuation. Four reasons this architecture took over:

KV cache efficiency. The KV cache from previous tokens stays valid as new tokens are generated, so you never discard or recompute it. Encoder-decoder models have to maintain two separate attention caches (self-attention plus cross-attention over the encoder output), which adds memory overhead and architectural complexity.
Training simplicity. The training objective is plain next-token prediction on raw text. No paired input-output data (as translation models need) or masked-token reconstruction (as BERT needs). You can train on essentially any text from the internet, books, and code with no special preprocessing, which is a huge advantage when scaling data to trillions of tokens.
Architectural simplicity. One module handles everything: the same transformer block, repeated $L$ times. No encoder-decoder cross-attention layers, no separate encoder stack. This makes parallelism strategies more straightforward (Section 33), and shrinks the engineering surface for optimization. FlashAttention, quantization, and speculative decoding only have to target one attention pattern.
In-context learning. Decoder-only models are naturally good at few-shot learning because examples, instructions, and the query are all just tokens in the same sequence. The model does not distinguish “input” from “output”; it predicts the next token given everything before it. GPT-3 first demonstrated this at scale, and it made decoder-only models a strong fit for the general-purpose assistant role.

20. Mixture of experts

MoE replaces the dense FFN in each transformer layer with multiple smaller expert FFNs plus a lightweight gating router. The router computes a score for each expert (typically a softmax over learned linear projections) and selects the top- $k$ experts per token. Only the activated experts compute, so a model can have enormous total capacity while keeping per-token cost low. This is sparse conditional computation: total parameters set what the model can represent, active parameters set what it costs to run.

Model	Total Params	Active Params	Experts (Routed + Shared)	Top- $k$
Mixtral 8x7B	47B	~13B	8 + 0	2
DeepSeek-V3	671B	37B	256 + 1	8

The shared expert in DeepSeek-V3 is activated for every token. It provides a baseline representation that the routed experts can specialize on top of.

Training MoE has three recurring problems: load imbalance, expert collapse, and communication overhead for expert parallelism. Traditional MoE models add an auxiliary loss to penalize imbalanced routing, but that loss can compete with the main objective. DeepSeek-V3 instead uses bias terms outside backpropagation: the system lowers the score of overloaded experts and raises the score of underused ones. The paper reports better routing balance without the auxiliary-loss trade-off in its setup.

21. Tokenization: BPE, SentencePiece, and tiktoken

LLMs do not see text. They see sequences of integer token IDs. A tokenizer splits raw text into tokens (subword pieces) and maps each to an ID. The tokenizer choice affects model quality, inference speed, and multilingual fairness.

Byte Pair Encoding (BPE) is the common algorithm. It iteratively merges the most frequent adjacent pairs in the training corpus. A simplified example:

Start with character-level vocabulary: [l, o, w, e, r, _]
Most frequent pair is (l, o) → merge into lo → vocabulary: [l, o, w, e, r, _, lo]
Next most frequent pair is (lo, w) → merge into low → vocabulary adds low
Continue until the vocabulary reaches the target size (e.g., 128K tokens)

Common words like “the” become single tokens, while rare words like “defenestration” get split into subword pieces like ["def", "en", "est", "ration"]. The trade is vocabulary size against sequence length.
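The merge loop above can be sketched as follows. The toy corpus and its word frequencies are invented for illustration, and production tokenizers add byte-level handling and pre-tokenization rules that this sketch leaves out.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency. "_" marks end of word.
words = {tuple("low_"): 5, tuple("lower_"): 2, tuple("lot_"): 1}
merges = []
for _ in range(2):  # a real tokenizer repeats until the target vocabulary size
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    merges.append(pair)
```

On this corpus the first two learned merges are (l, o) and then (lo, w), matching the walkthrough above.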

Three tokenizer implementations cover most production use:

SentencePiece treats input as a raw stream of Unicode characters, whitespace included, with no language-specific preprocessing (no pre-tokenization by spaces or punctuation). That makes it language-agnostic, which matters a lot for non-Latin scripts. It supports both BPE and unigram models and is used by Llama 1/2, T5, and Mistral.
tiktoken is OpenAI’s byte-level BPE tokenizer with a compiled Rust core and Python bindings. Its README reports 3–6x higher throughput than a comparable open-source tokenizer. Llama 3 switched from SentencePiece to a tiktoken-based BPE tokenizer.
Hugging Face Tokenizers is a widely used Rust-based library supporting BPE, WordPiece, and Unigram.

Vocabulary sizes have grown steadily, with major implications for efficiency:

Model	Vocab Size	English Fertility	Why It Matters
GPT-2	50,257	~1.3 tokens/word	Original BPE baseline
Llama 2	32,000	~1.4 tokens/word	Smaller vocab, longer sequences
GPT-4	100,256	~1.1 tokens/word	Better compression, fewer tokens per request
Llama 3	128,256	~1.0 tokens/word	4x larger than Llama 2, major multilingual gain
GPT-4o	~200,000	~1.0 tokens/word	Largest vocabulary in this table

Fertility (tokens per word) measures compression efficiency. Lower is better: fewer tokens means shorter sequences, lower cost, and more content fits in the context window. English typically lands at ~1.0–1.3 tokens/word, but non-Latin scripts (Chinese, Japanese, Korean, Arabic) can be 2–4x higher with English-centric vocabularies. The same content costs 2–4x more in tokens for non-English users, which is a persistent equity issue that larger, more balanced vocabularies only partly address.

22. Context windows and positional encodings

The context window is the maximum number of tokens a model can process in a single forward pass. It has grown by more than three orders of magnitude since 2017:

Model	Context Window	Year
Original Transformer	512	2017
GPT-4 Turbo	128K	2023
Claude 3.5	200K	2024
Gemini 1.5 Pro	1M+	2024
Grok 4 Fast	2M	2025

There is a basic problem here: the attention mechanism treats its input as a set, not a sequence. It has no built-in notion of word order. Without positional information, “the cat sat on the mat” and “the mat sat on the cat” would produce identical representations. Positional encodings inject order so the model knows where each token sits.

Three common approaches are:

RoPE (Rotary Position Embeddings) encodes each token’s position by rotating its query and key vectors by an angle proportional to the position. Tokens close together get similar rotations, so their dot product (attention score) stays high. Tokens far apart get very different rotations, which encodes relative distance. RoPE is the standard for almost all modern open LLMs (Llama, Mistral, Qwen) because it handles relative positions well and is computationally cheap.
ALiBi (Attention with Linear Biases) skips embedding modifications and adds a penalty directly to attention scores: the farther apart two tokens are, the larger the negative bias. It has no learned parameters and adds negligible computation. It allows some extrapolation beyond training length but degrades noticeably at 2x+ the training context.
YaRN (Yet another RoPE extensioN) extends a RoPE model beyond its training context. It groups frequency dimensions into three categories and scales each differently. The paper reports 10x fewer fine-tuning tokens and 2.5x fewer training steps than its position-interpolation baseline.
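To make the RoPE description concrete, here is a minimal sketch that rotates consecutive pairs of dimensions and checks that the attention score depends only on the relative offset between positions. The base of 10000 follows the usual RoPE convention; pairing adjacent dimensions is an assumption of this sketch, and real implementations differ in how they pair them. The query and key values are made up.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate pairs (x0, x1), (x2, x3), ... by angles proportional to position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]  # made-up query vector
k = [1.0, 0.4, -0.6, 0.9]  # made-up key vector

# Same relative offset (3) at two different absolute positions.
score_near_start = dot(rope(q, 5), rope(k, 2))
score_far_along = dot(rope(q, 105), rope(k, 102))
# A different offset gives a different score.
score_other_offset = dot(rope(q, 5), rope(k, 4))
```

The first two scores agree up to floating-point error, which is the relative-position property the text refers to.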

Part V — Training and alignment

Training is where capabilities are built. This part covers pretraining, efficient fine-tuning (LoRA, mixed precision), scaling laws, and alignment methods.

23. Pretraining, fine-tuning, and alignment

Training Pipeline

Pretraining is self-supervised next-token prediction on a large corpus. Its compute spans many orders of magnitude; Llama 3 405B, for example, used $3.8 \times 10^{25}$ FLOPs. Supervised fine-tuning (SFT) adapts the pretrained model to task-specific labeled data. RLHF / RLAIF uses preference data to shape behavior: a conventional RLHF pipeline collects comparisons, trains a reward model, then optimizes the policy. RLAIF substitutes AI-generated feedback for some human judgments.

Compute depends on model size, sequence length, data volume, optimizer, and method. PPO also carries more model state than SFT because a typical setup includes policy, reference, reward, and critic models. I covered the full fine-tuning decision framework in LLM Fine-Tuning Guide.

24. LoRA and QLoRA: parameter-efficient fine-tuning

LoRA freezes the pretrained weights and injects trainable low-rank matrices $A$ ( $r \times k$ ) and $B$ ( $d \times r$ ) so that the updated weight is $W_0 + BA$ . The LoRA paper reduced GPT-3 175B to about 18 million trainable parameters in its setup. Rank is a tuning parameter rather than a task-complexity rule; select it with a quality and memory sweep. LoRA adapters can be merged into the base weights after training to avoid a separate adapter path at inference.
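A small numeric sketch of the update $W_0 + BA$ may help. The matrices below are tiny invented values; initializing $B$ to zero follows the LoRA paper's convention, and the "trained" values of $B$ are simply made up to stand in for a training run.

```python
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(X, Y)]

d, k, r = 4, 3, 1
W0 = [[0.1 * (i + j) for j in range(k)] for i in range(d)]  # frozen weight, d x k
A = [[0.5, -0.5, 1.0]]                                       # trainable, r x k
B = [[0.0] for _ in range(d)]                                # trainable, d x r, zero init
x = [[1.0], [2.0], [3.0]]                                    # input column, k x 1

def lora_forward(x, W0, A, B):
    # Base path plus low-rank adapter path: W0 x + B (A x)
    return add(matmul(W0, x), matmul(B, matmul(A, x)))

y_init = lora_forward(x, W0, A, B)   # equals the base model output while B is zero

B = [[0.2], [0.0], [-0.1], [0.3]]    # made-up values standing in for training
y_adapter = lora_forward(x, W0, A, B)

# Merging folds the adapter into the base weight, so inference needs one matmul.
W_merged = add(W0, matmul(B, A))
y_merged = matmul(W_merged, x)

trainable_params = r * (d + k)       # 7 here, versus d * k = 12 for the full matrix
```

At realistic sizes the gap is far larger: with $d = k = 4096$ and $r = 8$, the adapter has 65,536 parameters against roughly 16.8 million in the full matrix.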

QLoRA loads the base model in 4-bit NF4 quantization while training LoRA adapters in BF16. NormalFloat4 places more quantization levels near zero, where weight density is highest. The paper fine-tuned a 65B model on a single 48 GB GPU and reported results close to its 16-bit baselines. Its runtime and memory trade-offs are specific to the tested stack.

25. Mixed-precision training

Every floating-point format allocates its bits across three fields: sign (always 1 bit), exponent (sets the dynamic range), and mantissa (sets the precision). More exponent bits mean a wider range of representable magnitudes; more mantissa bits mean finer distinctions between nearby values. Integer formats have no exponent at all and represent only evenly spaced whole numbers within a fixed range.

Format	Bits	Layout (S / E / M)	Range	Precision	Common use
FP32	32	1 / 8 / 23	$\pm 3.4 \times 10^{38}$	~7 decimal digits	Master weights, optimizer states (Adam momentum & variance)
BF16	16	1 / 8 / 7	$\pm 3.4 \times 10^{38}$	~2 decimal digits	Preferred training format — same range as FP32, no loss scaling needed
FP16	16	1 / 5 / 10	$\pm 65{,}504$	~3 decimal digits	Training with loss scaling (older GPUs); inference on pre-Hopper hardware
FP8 E4M3	8	1 / 4 / 3	$\pm 448$	~1 decimal digit	Forward pass on Hopper (H100) — more precision for weights & activations
FP8 E5M2	8	1 / 5 / 2	$\pm 57{,}344$	~0.6 decimal digits	Backward pass on Hopper — wider range for gradients
INT8	8	fixed-point	$-128$ to $127$	Exact integers	Post-training weight quantization for inference (W8A8); KV cache quantization
INT4	4	fixed-point	$-8$ to $7$	Exact integers	Aggressive weight-only quantization (AWQ, GPTQ) for inference on memory-constrained hardware

BF16 has the same range as FP32 because range is set by the exponent field, and BF16 keeps all 8 exponent bits from FP32. It gives up mantissa bits instead (7 vs 23), trading precision for a 2x memory reduction while avoiding the overflow and underflow problems that plague FP16 training. FP16 has only 5 exponent bits, which caps its largest value at 65,504 and puts its smallest positive value near $6 \times 10^{-8}$ . Small gradients routinely fall below that floor and flush to zero, which is why FP16 training needs loss scaling: multiply the loss by a large constant before backprop so the gradients shift into the representable range, then divide them by the same constant before the optimizer step. BF16 usually makes loss scaling unnecessary.
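The effect of loss scaling can be seen with Python's standard library, which can round a value to FP16 through the struct module's half-precision format. The gradient value and the scale factor of 1024 below are illustrative choices, not recommendations.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8           # a small gradient, below FP16's smallest positive value
LOSS_SCALE = 1024.0   # illustrative scale factor

unscaled = to_fp16(grad)              # rounds to zero: the update is lost
scaled = to_fp16(grad * LOSS_SCALE)   # now representable in FP16
recovered = scaled / LOSS_SCALE       # unscale in higher precision before the optimizer step

largest_fp16 = to_fp16(65504.0)       # the top of the FP16 range round-trips exactly
```

Without scaling the gradient becomes exactly 0.0; with scaling, the recovered value is within about 0.2% of the original.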

Integer formats are uncommon for the main training arithmetic because backpropagation needs a wide dynamic range. They are widely used for inference, where frozen weights can be mapped to calibrated scales. INT4 weight quantization cuts a 7B model from about 14 GB to 3.5 GB before runtime overhead; quality must be measured for the chosen model and method.

FP8 training on H100 via Transformer Engine uses E4M3 where precision matters and E5M2 where wider range matters. NVIDIA reports up to 75% faster wall-clock time in a tested 175B configuration. DeepSeek-V3 used FP8 mixed precision and reported about $5.6 million in rental-equivalent compute for its final training run, excluding R&D and infrastructure.

26. Gradient checkpointing

Each layer of the forward pass produces an intermediate output called an activation:

$\text{Input} \rightarrow [\text{Layer 1}] \rightarrow \text{activation}_1 \rightarrow [\text{Layer 2}] \rightarrow \text{activation}_2 \rightarrow [\text{Layer 3}] \rightarrow \text{output}$

Normally all activations have to stay in memory because backpropagation needs them to compute gradients. For a deep transformer, the stored activations can take more memory than the model weights themselves.

Gradient checkpointing trades compute for memory by throwing most of those activations away and recomputing them on the fly during backprop. The standard strategy (Chen et al., 2016) divides a network of $n$ layers into $\sqrt{n}$ evenly spaced segments and saves only the boundary activation of each segment. Those saved boundaries are the “checkpoints.” All intermediate activations within a segment are dropped immediately.

When the backward pass reaches a layer inside a segment, its activations are recomputed from the nearest checkpoint. This drops activation memory from $O(n)$ to $O(\sqrt{n})$ , which is a 60–70% reduction in practice, at a cost of roughly one extra forward pass (~20–33% more compute). FlashAttention applies the same principle inside attention by not materializing the full attention matrix. Enable it in Hugging Face Transformers by passing gradient_checkpointing=True to TrainingArguments or by calling model.gradient_checkpointing_enable().
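As a rough illustration of the $\sqrt{n}$ strategy, the sketch below counts how many activations are held at peak. It is an idealized model that treats every activation as one equally sized unit and ignores weights, optimizer state, and framework overhead, so its savings will not match measured numbers exactly.

```python
import math

def checkpoint_plan(n_layers):
    """Return (checkpointed layer indices, peak number of activations held)."""
    segment = max(1, round(math.sqrt(n_layers)))
    checkpoints = list(range(0, n_layers, segment))
    # Peak = all saved segment boundaries + the activations of the one
    # segment currently being recomputed during the backward pass.
    peak = len(checkpoints) + segment
    return checkpoints, peak

n_layers = 64
checkpoints, peak = checkpoint_plan(n_layers)
reduction = 1 - peak / n_layers  # fraction of activation slots saved in this model
```

For 64 layers this keeps 8 checkpoints and holds 16 activations at peak instead of 64.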

27. DeepSpeed ZeRO stages

In standard data parallelism, every GPU holds a complete copy of the model weights, gradients, and optimizer states. For Adam, each parameter takes 2 bytes for the FP16 weight + 4 bytes for the FP32 master weight + 4 bytes for momentum + 4 bytes for variance + 2 bytes for the gradient, which is 16 bytes per parameter. A 7.5B-parameter model needs ~120 GB per GPU, and every GPU stores the same thing. On 64 GPUs that is 64 identical 120 GB copies. A lot of waste.

DeepSpeed ZeRO (Zero Redundancy Optimizer) removes this duplication by sharding these components across GPUs instead of replicating them:

Stage 1 — partition optimizer states. Each GPU stores only 1/N of the optimizer states (the FP32 master weights plus Adam’s momentum and variance, 12 bytes/param). Each GPU updates only its own slice, and the updated parameters are then all-gathered so every GPU again holds the full FP16 weights. With N = 64, memory drops from ~120 GB to ~31.4 GB per GPU.
Stage 2 — also partition gradients. Gradients (2 bytes/param) are no longer all-reduced to every GPU. Each GPU receives only the gradient slice it needs via reduce-scatter. Memory drops to ~16.6 GB per GPU.
Stage 3 — also partition the model weights. Each GPU holds only 1/N of the FP16 weights. Before each layer’s forward or backward pass, the GPU calls all-gather to temporarily reconstruct the full layer weights from all other GPUs, computes, and discards the gathered weights. Memory drops to ~1.9 GB per GPU.

Config	Optimizer States	Gradients	Weights	Per-GPU Memory (7.5B, 64 GPUs)
No ZeRO	Replicated	Replicated	Replicated	~120 GB
Stage 1	Partitioned	Replicated	Replicated	~31.4 GB
Stage 2	Partitioned	Partitioned	Replicated	~16.6 GB
Stage 3	Partitioned	Partitioned	Partitioned	~1.9 GB
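The per-GPU figures can be reproduced with a short calculation. This sketch follows the ZeRO paper's accounting, in which the optimizer states are 12 bytes per parameter (FP32 master weight, momentum, and variance); it covers model state only and ignores activations, buffers, and fragmentation. The function name is hypothetical.

```python
def zero_memory_gb(params, n_gpus, stage):
    """Model-state memory per GPU in GB for a given ZeRO stage (0 = plain data parallel)."""
    weights = 2 * params   # FP16 weights, bytes
    grads = 2 * params     # FP16 gradients, bytes
    optim = 12 * params    # FP32 master weights + Adam momentum + variance, bytes
    if stage >= 1:
        optim /= n_gpus
    if stage >= 2:
        grads /= n_gpus
    if stage >= 3:
        weights /= n_gpus
    return (weights + grads + optim) / 1e9

per_stage = {stage: zero_memory_gb(7.5e9, 64, stage) for stage in range(4)}
```

For 7.5B parameters on 64 GPUs this gives about 120, 31.4, 16.6, and 1.9 GB for stages 0 through 3.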

The tradeoff is communication. Stage 1 adds minimal overhead, Stage 2 replaces all-reduce with reduce-scatter (similar cost), but Stage 3 needs all-gather calls before every layer in both forward and backward passes, roughly 1.5x communication volume versus standard data parallelism.

ZeRO-Infinity extends Stage 3 by offloading partitioned states to CPU RAM and even NVMe SSDs, which makes training models with trillions of parameters possible on limited GPU clusters. The cost is a big drop in speed (NVMe is ~500x slower than HBM), so ZeRO-Infinity is what you use when the model genuinely does not fit in GPU + CPU memory.

28. FSDP: PyTorch-native sharding

Fully Sharded Data Parallel (FSDP) is PyTorch’s built-in answer to DeepSpeed ZeRO-3. It shards parameters, gradients, and optimizer states across GPUs with the same core idea. The mechanics for each layer are a simple loop:

All-gather the full parameters from all GPUs (temporarily reconstruct the complete layer).
Compute the forward or backward pass for that layer.
Free the gathered parameters immediately. Each GPU keeps only its own shard.
Reduce-scatter gradients after the backward computation so each GPU gets only its assigned gradient slice.

Because FSDP is native to PyTorch, it integrates directly with PyTorch debugging tools, profilers, and torch.compile. Performance relative to DeepSpeed ZeRO-3 depends on wrapping policy, communication topology, offload settings, and model size, so compare them on the same cluster.

Criteria	FSDP (PyTorch)	DeepSpeed ZeRO
Control style	Full sharding through PyTorch APIs	Selectable ZeRO stages
Offloading	CPU offloading	CPU + NVMe with ZeRO-Infinity
Framework integration	Native PyTorch, torch.compile paths	Separate library and config system
Selection test	Profile the target PyTorch workload	Profile required features and offload

FSDP2 (2024–2025) is a rewrite that improves torch.compile integration for better kernel fusion, adds FP8 training support via TorchAO, and simplifies the API. Both FSDP and DeepSpeed are accessible through HuggingFace Accelerate, which lets you switch between them with a single config change.

29. Scaling laws and the Chinchilla trap

Chinchilla scaling (DeepMind, 2022) found a compute-optimal allocation near 20 training tokens per parameter under its assumptions. That objective does not include downstream serving cost. If a smaller model trained on more data reaches the required quality, it may cost less over a high-volume inference lifecycle.

The industry response has been to overtrain smaller models on far more data than the Chinchilla ratio suggests. In the table below, tokens per parameter rise from 20:1 to 60,000:1:

Model	Params	Training Tokens	Tokens/Param	Chinchilla ×
Chinchilla	70B	1.4T	20:1	1×
Llama 1	65B	1.4T	22:1	1.1×
Llama 2	70B	2.0T	29:1	1.4×
Llama 3 8B	8B	15T	1,875:1	94×
Qwen3-0.6B	0.6B	36T	60,000:1	3,000×

For a model served at high volume, spending more training compute on a smaller model can reduce lifecycle cost. Llama 3 8B illustrates the strategy, but whether it wins depends on the required quality and projected inference volume. “Chinchilla-optimal” refers to training-compute efficiency, which is a different objective from lifecycle cost.
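A back-of-the-envelope version of the lifecycle argument is sketched below. It relies on two common approximations that are assumptions here, not figures from the cited papers: training costs about $6ND$ FLOPs for $N$ parameters and $D$ tokens, and inference costs about $2N$ FLOPs per generated token. It also assumes both models meet the quality bar, which has to be verified separately.

```python
def train_flops(params, tokens):
    return 6 * params * tokens   # common approximation, assumed here

def inference_flops_per_token(params):
    return 2 * params            # common approximation, assumed here

big = {"params": 70e9, "tokens": 1.4e12}    # Chinchilla-style allocation
small = {"params": 8e9, "tokens": 15e12}    # Llama 3 8B-style allocation

tokens_per_param = small["tokens"] / small["params"]

extra_training = train_flops(**small) - train_flops(**big)
saving_per_token = (inference_flops_per_token(big["params"])
                    - inference_flops_per_token(small["params"]))

# Number of served tokens after which the smaller model has cost fewer total FLOPs.
break_even_tokens = extra_training / saving_per_token
```

Under these assumptions the smaller model's extra training compute is repaid after roughly 1.06 trillion served tokens; past that volume it is cheaper over its lifetime.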

30. RLHF, DPO, GRPO, and the alignment landscape

Alignment steers a pretrained model toward desired instructions, preferences, and safety policies. It does not by itself guarantee truthfulness or safe behavior. The methods below trade off implementation complexity, data requirements, exploration, and training stability.

The classic RLHF pipeline: SFT → collect human preference pairs → train a reward model on those pairs → fine-tune the policy with PPO (Proximal Policy Optimization). PPO holds 4 model copies in memory at once (policy, reference, critic/value model, reward model), is hyperparameter-sensitive, and is prone to reward hacking, where the model exploits quirks in the reward model (such as verbose, confident-sounding answers) instead of genuinely improving quality.

DPO (Direct Preference Optimization) skips the learned reward model and online RL loop by optimizing a loss on preference pairs directly. That simplifies the training pipeline. Standard DPO is offline: it trains on a fixed dataset and does not explore new responses during the update loop. Whether that limitation matters depends on the task and data coverage.
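For a single preference pair, the DPO loss can be written in a few lines. The inputs are summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model; the numbers and the beta of 0.1 are illustrative.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)) for one preference pair.

    margin measures how much more the policy prefers the chosen response
    over the rejected one, relative to the reference model.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: margin is 0, loss is log(2).
loss_equal = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has raised the chosen response and lowered the rejected one: loss falls.
loss_better = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

No reward model or sampling loop appears anywhere in the computation, which is the simplification the text describes.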

GRPO (Group Relative Policy Optimization, DeepSeek) removes PPO’s learned critic by generating multiple completions per prompt and using group-relative rewards as the baseline. This reduces the model-state burden relative to a typical PPO setup. Unlike DPO, GRPO is on-policy: the model generates fresh responses during training. DeepSeek-R1 combines GRPO with RLVR (reinforcement learning from verifiable rewards), using checks such as math answers, code compilation, and unit tests. These rewards are easier to audit than a learned preference score, but incomplete tests and proxy objectives can still be exploited.
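The group-relative baseline reduces to normalizing each completion's reward against the other completions for the same prompt. This is a minimal sketch of that one step, assuming mean and population standard deviation normalization with a small epsilon; the full GRPO objective also includes a clipped policy ratio and a KL term, which are omitted here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each completion relative to its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt, scored by a verifiable check (1 = passed).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every completion gets the same reward, there is no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

The group mean plays the role that the learned critic plays in PPO, so no value model has to be trained or held in memory.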

Method	Typical model state	Reward signal	Online/offline	Key limitation
PPO	4 (policy, ref, critic, reward)	Learned reward model	Online	Reward hacking, complex tuning
DPO	2 (policy, reference)	Implicit (preference pairs)	Offline	No exploration, fixed data
GRPO	2 (policy, reference)	Explicit (verifiable or learned)	Online	Needs verifiable rewards for full benefit

31. Distillation: compressing knowledge across models

Knowledge distillation transfers capabilities from a large teacher to a smaller student. Logit-based distillation trains the student to match the teacher’s output distribution. Data-based distillation has the teacher generate examples that the student fine-tunes on. Data-based methods are common for LLMs because they can work across architectures and with API-only teachers, but their value is bounded by teacher quality, data coverage, filtering, and generation cost.

DeepSeek-R1 curated about 800,000 training samples (roughly 600,000 reasoning and 200,000 non-reasoning) and used them to distill Qwen2.5 and Llama 3 models from 1.5B to 70B parameters. In the paper’s evaluation:

DeepSeek-R1-Distill-Qwen-32B scores 72.6% on AIME 2024 and 94.3% on MATH-500, above the paper’s reported OpenAI o1-mini numbers.
DeepSeek-R1-Distill-Qwen-7B scores 55.5% on AIME 2024, above the paper’s QwQ-32B-Preview result with a smaller model.

In DeepSeek-R1’s small-model experiments, distillation outperformed direct GRPO on the tested base models. That result supports distillation for this setup; it does not establish a universal ranking between distillation and RL.

32. Synthetic-data generation

LLM-generated training data is used in several recurring patterns:

Self-Instruct bootstraps from a small seed set of human-written instructions: the LLM generates new instructions, inputs, and outputs, which are filtered and added back to the pool. The Alpaca project generated 52,000 examples from 175 seeds and reported under $500 in generation cost (under $600 including fine-tuning); its GPT-3.5 comparison was a limited project evaluation, not broad equivalence.
Evol-Instruct (WizardLM) takes existing instructions and iteratively evolves them along complexity axes (adding constraints, deepening reasoning, making problems more concrete) to produce progressively harder training examples.
Microsoft’s Phi-4 (14B) used synthetic data for much of pretraining, including generation, critique, self-revision, and instruction reversal. Its technical report compares the resulting STEM and coding performance with larger models on the selected benchmarks.

The risk that matters here is model collapse: when models are recursively trained on synthetic data from previous generations, the tails of the original distribution progressively vanish. The model overestimates common patterns and loses rare but important variations (Shumailov et al., 2024). A separate Ahrefs study classified 74.2% of newly created webpages in its 900,000-page sample as containing AI-generated text; that is a vendor classifier result, not a census of the web. Mitigation starts with blending synthetic and real data, filtering, and lineage tracking so recursively generated material can be measured.

Part VI — Scaling and deployment

Scaling from one GPU to a cluster means splitting work across devices. This part covers parallelism strategies, serving frameworks, GPU selection, and routing.

33. Four forms of parallelism

Parallelism Strategies

Tensor Parallelism (TP) splits individual weight matrices across GPUs and usually communicates after each layer. Fast intra-node links such as NVLink make it most practical within a node. More shards reduce per-device memory and compute but increase communication, so select the degree with a latency benchmark.

Pipeline Parallelism (PP) splits layers sequentially across GPUs, passing activations between stages. Its communication pattern can work across nodes, but pipeline bubbles and uneven stage times reduce utilization. Large deployments often combine TP within a node and PP across nodes.

Data Parallelism (DP) replicates the serving model so each replica handles independent requests without per-request cross-replica communication. It is efficient when the model fits and traffic can be balanced. In training, DP is commonly combined with ZeRO or FSDP to shard state.

Expert Parallelism (EP) distributes MoE experts across GPUs using all-to-all communication for token routing. Its performance depends on token balance, expert placement, and interconnect topology; all-to-all traffic can become the dominant bottleneck.

A starting parallelism heuristic:

Model fits on one GPU: begin with independent replicas and measure DP scaling.
Model fits within one node: test TP within the node, then replicate the group if traffic requires it.
Model spans nodes: test a TP and PP combination against the interconnect and latency target.
Mixture of experts: add EP only when expert placement requires it.

Parallelism Decision Framework

34. Serving frameworks compared

vLLM provides paged KV allocation, continuous batching, an OpenAI-compatible API, and several parallelism modes. Its model and hardware support changes frequently, so verify the target model against the current compatibility matrix.

SGLang combines RadixAttention for prefix reuse, a custom scheduler, and structured generation. Its published throughput gains depend on workload and configuration; compare it with vLLM and TensorRT-LLM using identical prompts, outputs, hardware, and SLOs.

TensorRT-LLM targets low single-request latency through CUDA graphs, kernel fusion, and hand-tuned kernels, with native FP8/FP4 support. Published numbers are hardware- and model-specific, so compare it with the other frameworks on one harness. The tradeoff is a steeper learning curve and NVIDIA-specific deployment surface.

TGI integrates with the Hugging Face ecosystem and supports several hardware backends. Check the repository’s current maintenance and feature status before selecting it for a new deployment.

Ollama emphasizes a simple local model workflow. Use it for development convenience; benchmark another serving stack when high concurrency or explicit SLO control matters.

llama.cpp is a portable C/C++ runtime with ARM, x86, Metal, CUDA, ROCm, and Vulkan paths. GGUF supports several quantization levels. Performance varies widely by model, quantization, context, and backend, so use its local benchmark tool for the target machine.

35. GPU selection for inference

The table is a March 2026 hardware and cloud-price snapshot. Precision labels are vendor capabilities, and prices vary by provider, region, commitment, and availability; verify them before purchasing.

GPU	Memory	Bandwidth	Native Precision	Peak TFLOPS (dense FP16/BF16 unless noted)	NVLink	Cloud $/hr
B200	192 GB HBM3e	8 TB/s	FP4, FP8, INT8	~4,500 (FP8)	5.0 (1.8 TB/s)	~$6.25
H200	141 GB HBM3e	4.8 TB/s	FP8, INT8	989	4.0 (900 GB/s)	$2.15-6.00
H100 SXM	80 GB HBM3	3.35 TB/s	FP8, INT8	989	4.0 (900 GB/s)	$1.49-3.90
A100 SXM	80 GB HBM2e	2.0 TB/s	INT8, FP16	312	3.0 (600 GB/s)	$1.10-2.54
L40S	48 GB GDDR6	864 GB/s	FP8, INT8	362 (FP16)	None	$0.80-1.50
A10G	24 GB GDDR6	600 GB/s	INT8, FP16	70	None	$1.00-1.50
RTX 4090	24 GB GDDR6X	1.0 TB/s	FP8, INT8	83 (FP32)	None	~$0.35

Select first by memory fit, then by measured throughput at the latency target. H200’s 141 GB capacity can simplify some large-model deployments, while B200 adds FP4 support, 192 GB HBM3e, and a newer NVLink generation. Smaller GDDR-based GPUs can be economical for quantized models when their memory and interconnect limits match the workload.

Native hardware quantization support has a big influence on performance. AWQ and GPTQ (with INT4 weights) can run efficiently on any architecture by dequantizing into FP16 registers, but true native acceleration through Tensor Cores depends on the generation. Hopper (H100/H200) and Ada (L40S/4090) natively accelerate FP8, and Blackwell (B200) adds native FP4 Tensor Cores for large throughput gains. All listed GPUs support INT8 matrix operations.

LLM decode is often memory-bandwidth-bound, so HBM capacity and bandwidth can matter more than peak TFLOPS for serving workloads. Compare GPUs with the model, precision, batch distribution, context length, and latency target held constant.
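A quick upper bound shows why bandwidth dominates. At batch size 1, each decoded token has to read every weight once, so tokens per second cannot exceed memory bandwidth divided by model size in bytes. The sketch below uses bandwidth figures from the table and ignores KV cache reads, kernel overhead, and batching, so real throughput will be lower; the function name is hypothetical.

```python
def decode_tokens_per_second_bound(params, bytes_per_param, bandwidth_bytes_per_s):
    """Bandwidth-limited ceiling on single-stream decode speed."""
    model_bytes = params * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes

PARAMS_7B = 7e9
H100_BW = 3.35e12   # bytes/s, from the table
A100_BW = 2.0e12    # bytes/s, from the table

h100_fp16 = decode_tokens_per_second_bound(PARAMS_7B, 2.0, H100_BW)
a100_fp16 = decode_tokens_per_second_bound(PARAMS_7B, 2.0, A100_BW)
h100_int4 = decode_tokens_per_second_bound(PARAMS_7B, 0.5, H100_BW)
```

For a 7B model this gives a ceiling of about 239 tokens/s on H100 in FP16, about 143 on A100, and about 957 on H100 with INT4 weights. The ratios track bandwidth and bytes per weight, not peak TFLOPS.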

36. Model cascading and routing

Model Cascading vs Routing

Model routing picks which LLM handles each query based on predicted complexity or capability. RouteLLM (LMSYS/UC Berkeley, ICLR 2025) reports an 85% cost reduction on its MT-Bench setup while retaining 95% of its GPT-4 quality baseline. Whether routing pays off depends on current prices, the traffic mix, router errors, and the quality floor.

Routers range from lightweight classifiers to LLM-based judges. Cascading is the sequential variant: a query starts with a cheaper model and escalates when a scoring function rejects the answer. FrugalGPT reports up to 98% lower cost or up to 4% higher accuracy in its evaluated model pool. A production cascade needs calibrated escalation criteria and monitoring for the queries the cheap model accepts incorrectly.
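The cascade control flow is simple to sketch. Everything below is a hypothetical stand-in: the two "models" are stub functions, the costs are invented, and the acceptance check is a placeholder for a calibrated scoring function.

```python
def cascade(query, models, accept):
    """Try models cheapest-first and stop at the first accepted answer.

    models: list of (name, generate_fn, cost_per_call), cheapest first.
    Returns (model_name, answer, total_cost). The last model's answer is
    returned even if rejected, because there is nothing left to escalate to.
    """
    total_cost = 0.0
    name, answer = None, None
    for name, generate, cost in models:
        answer = generate(query)
        total_cost += cost
        if accept(query, answer):
            break
    return name, answer, total_cost

# Hypothetical stand-ins for real model calls and a real scoring function.
def small_model(query):
    return "42" if "6 * 7" in query else "not sure"

def large_model(query):
    return "Paris"

def accept(query, answer):
    return answer != "not sure"

models = [("small", small_model, 0.001), ("large", large_model, 0.03)]
easy = cascade("What is 6 * 7?", models, accept)
hard = cascade("Capital of France?", models, accept)
```

The escalated query pays for both calls, so the savings depend on how often the cheap model's answer is accepted and on how often that acceptance is wrong.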

Part VII — Applications

This part covers the application-level patterns that turn raw model capabilities into useful systems. Embeddings power retrieval, RAG grounds responses in external knowledge, agents orchestrate multi-step workflows, and prompt engineering ties it all together.

37. Embedding models vs generative models

Embedding models encode text into fixed-dimensional vectors that capture semantic meaning. Unlike generative decoder models that produce token sequences, they output a single dense vector (768–4,096 dimensions) for the entire input. Most embedding models use encoder-only transformers (bidirectional attention) rather than decoders. The encoder processes all input tokens at once and produces a contextualized representation for each token. A pooling layer then collapses those per-token representations into a single vector, typically mean pooling (averaging all token embeddings) or CLS pooling (using a special classification token’s output). The model is then fine-tuned with contrastive learning: semantically similar texts are pushed closer together in vector space and dissimilar texts are pushed apart.
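The pooling step can be illustrated with toy vectors. The per-token values below are made up, and the attention mask marks which positions are real tokens rather than padding; a real encoder would produce the token vectors.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padded positions."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]

def cls_pool(token_vectors):
    """Use the first (classification) token's vector as the sentence embedding."""
    return list(token_vectors[0])

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

token_vectors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]  # last position is padding
attention_mask = [1, 1, 0]

mean_embedding = mean_pool(token_vectors, attention_mask)
cls_embedding = cls_pool(token_vectors)
similarity = cosine(mean_embedding, [1.0, 1.0])
```

Skipping masked positions matters: without the mask, the padding vector would dominate the average.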

Selected embedding models and their reported scores (2025–2026):

Model	Dimensions	Architecture	MTEB Score
Qwen3-Embedding-8B	up to 4,096	Decoder-based (Qwen3)	70.6%
Gemini Embedding 2	3,072	Multimodal (text/image/video/audio)	68.2%
pplx-embed-v1-4B	2,560	Decoder-based (Qwen3), native INT8/binary	69.7%
Voyage-3-large	2,048	Proprietary	66.8%
OpenAI text-embedding-3-large	3,072	Proprietary	64.6%

The table mixes model reports and benchmark versions, so it is a shortlist rather than a strict ranking. Some newer embedding systems use decoder backbones with bidirectional attention and pooling. Others emit lower-precision or multimodal embeddings. Evaluate the required language, modality, task, dimension, and serving cost on one retrieval set.

Matryoshka Representation Learning (MRL, Kusupati et al., NeurIPS 2022) makes embedding dimensions flexible. Named after Russian nesting dolls, MRL structures an embedding so that its first $m$ dimensions are as informative as an independently trained $m$ -dimensional model. During training, instead of computing a single loss on the full embedding, MRL computes multiple losses in parallel at logarithmically spaced dimensions (64, 128, 256, 512, 1024, 2048, 3072). The aggregated loss pushes the leading dimensions to carry coarse semantic information, with later dimensions adding finer detail.

After training, an MRL embedding can be truncated to a supported prefix dimension. OpenAI reports that text-embedding-3-large at 256 dimensions outperforms text-embedding-ada-002 at 1,536 dimensions on its cited MTEB comparison. That gives a 6x reduction in raw vector storage; search latency and database cost also depend on the index, metadata, filtering, and hardware.
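Truncation itself is a slice followed by re-normalization, so that cosine or dot-product comparisons still behave as expected. The four-dimensional vector below is a toy stand-in for a real embedding, and the function name is hypothetical.

```python
import math

def truncate_embedding(vec, m):
    """Keep the first m dimensions and rescale to unit length."""
    head = vec[:m]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]          # toy unit-length embedding
short = truncate_embedding(full, 2)  # keep the leading 2 dimensions
short_norm = math.sqrt(sum(x * x for x in short))

storage_ratio = 1536 / 256           # raw vector storage saved in the example above
```

This only works well for models trained with an MRL-style objective; truncating an ordinary embedding discards information arbitrarily.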

The embedding model is one important component in a RAG pipeline, alongside parsing, chunking, search, reranking, and generation. If relevant evidence is not retrieved, a stronger generator cannot reliably recover it.

38. RAG architecture in production

RAG Architecture

Retrieval-Augmented Generation supplies an LLM with documents retrieved at query time. It can provide current or private evidence that is absent from model weights, but retrieval does not guarantee that the answer uses that evidence correctly. A production RAG system is a multi-stage pipeline whose stages need separate evaluation.

The ingestion pipeline runs offline. Raw documents (PDFs, HTML, Markdown, databases) are first parsed into clean text, which is harder than it sounds: PDF parsing alone can lose tables, headers, and formatting. The text is then split into chunks, which are embedded and indexed independently.

Chunking affects both retrieval recall and the context available to the generator. Useful sizes depend on document structure, query granularity, embedder limits, and reranker limits. Common approaches are fixed-size with overlap, recursive splitting along document boundaries, and semantic chunking by embedding similarity. Compare them on page- or section-level relevance labels instead of adopting one token range universally.

Each chunk is then embedded with a model like those in Section 37 and stored in a vector database (Pinecone, Weaviate, Qdrant, pgvector, and so on).

The retrieval pipeline runs at query time. Start with a measurable baseline, then add stages when error analysis shows they address a real miss:

Hybrid search combines dense vector retrieval with sparse retrieval such as BM25, often merged through Reciprocal Rank Fusion (RRF). Dense search handles semantic paraphrases; sparse search catches exact identifiers, error codes, and acronyms. Vendor benchmarks report gains over vector-only baselines, but the size depends on the corpus and relevance labels.
Reranking passes retrieved candidates through a model that scores the query and document together. This can improve fine-grained relevance at the cost of another model call. Candidate count, retained count, and latency should be tuned together. I covered the full multi-stage pipeline in Building a Modern Search Ranking Stack.
Query transformation rewrites the user’s query before retrieval to improve recall. HyDE (Hypothetical Document Embeddings) has the LLM generate a hypothetical answer, which is then embedded and used for retrieval. Multi-query expansion generates multiple phrasings of the same question. Step-back prompting asks a more general question first to pull in broader context.
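Reciprocal Rank Fusion, mentioned under hybrid search, is small enough to show in full. Each document scores the sum of 1 / (k + rank) over the lists it appears in; k = 60 is the constant commonly used with RRF, and the document IDs below are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_results = ["d3", "d1", "d2"]    # e.g. from vector search
sparse_results = ["d1", "d4", "d3"]   # e.g. from BM25
fused = reciprocal_rank_fusion([dense_results, sparse_results])
```

Because RRF uses only ranks, it sidesteps the problem that dense similarity scores and BM25 scores live on incompatible scales.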

The common failure modes:

Retrieval failure — the correct document exists but is not retrieved. Test chunking, query transformation, hybrid search, and metadata filtering against the miss.
Context poisoning — irrelevant retrieved chunks mislead the LLM. Test reranking, context filters, and smaller retained sets.
Lost-in-the-middle — the LLM ignores relevant context placed in the middle of a long prompt (Liu et al., 2024 showed models attend most to the beginning and end of the context window).

GraphRAG (Microsoft, 2024) augments vector retrieval with an extracted entity and relationship graph. It targets corpus-level and relationship-heavy questions that flat chunk retrieval may miss. The trade-off is additional extraction, indexing, storage, and evaluation work.

Practitioner guides publish latency ranges for embedding, search, reranking, and generation, but those figures vary by region, corpus, hardware, and model. Measure each stage in traces and evaluate the quality change before accepting the added latency.

39. Agent architectures and tool calling

LLM agents use models to select and sequence tool calls around an evolving state. Three useful orchestration patterns are:

ReAct — interleaves action selection with observations. It can adapt after each tool result, but a growing history raises token and latency cost.
ReWOO — plans tool calls with placeholders, executes independent work in parallel, and then synthesizes. Its paper reports token savings over ReAct, but the fixed plan needs an explicit recovery path when a tool fails.
Planner-executor — separates planning from execution and can add a re-planning policy after failure. It permits model specialization but adds orchestration state and another decision boundary.

Pattern	Token tendency	Adaptability	Useful starting point
ReAct	Higher	Updates after observations	Uncertain or exploratory tool use
ReWOO	Lower	Fixed plan unless extended	Predictable work with parallel steps
Planner-executor	Medium	Can revise the explicit plan	Longer tasks that benefit from control

Function calling is a common mechanism for tool invocation. APIs expose tool definitions and return structured arguments, reducing the need to parse free-form text. Schema-valid arguments can still select the wrong tool or contain invalid values. Parallel function calling can reduce round trips when operations are independent.
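Because arguments can parse cleanly and still be wrong, a dispatch layer usually checks values before running anything. The sketch below assumes a tool call arrives as JSON with `name` and `arguments` fields; that shape, the `get_weather` tool, and its limits are hypothetical and not tied to any provider's API.

```python
import json

def get_weather(city: str, days: int) -> str:
    """Hypothetical tool handler used only for this sketch."""
    return f"forecast for {city}, {days} day(s)"

def check_weather_args(args: dict) -> list:
    """Application-side value checks, applied even when the arguments parsed cleanly."""
    errors = []
    if not str(args.get("city", "")).strip():
        errors.append("city must be non-empty")
    days = args.get("days")
    if not isinstance(days, int) or not 1 <= days <= 14:
        errors.append("days must be an integer from 1 to 14")
    return errors

# Registry maps a tool name to its handler and its value checker.
TOOLS = {"get_weather": (get_weather, check_weather_args)}

def dispatch(tool_call_json: str) -> dict:
    call = json.loads(tool_call_json)
    name, args = call.get("name"), call.get("arguments", {})
    if name not in TOOLS:
        return {"ok": False, "errors": [f"unknown tool: {name}"]}
    handler, checker = TOOLS[name]
    errors = checker(args)
    if errors:
        # Return the errors so the caller can feed them back to the model or log them.
        return {"ok": False, "errors": errors}
    return {"ok": True, "result": handler(**args)}

good = dispatch('{"name": "get_weather", "arguments": {"city": "Oslo", "days": 3}}')
bad = dispatch('{"name": "get_weather", "arguments": {"city": "Oslo", "days": 30}}')
```

Returning structured errors instead of raising keeps the recovery path explicit, which matters once the agent loop decides whether to retry, re-plan, or stop.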

Structured output and constrained decoding enforce a schema by restricting the tokens available at each generation step. Engines such as xgrammar, used in vLLM and SGLang, can remove many syntax and parsing failures with low overhead in supported configurations. They do not guarantee that the extracted values or decisions are correct. Schema-Guided Reasoning (SGR) uses field order and schema structure to make intermediate state inspectable before the final decision. Its three patterns are Cascade (sequential steps), Routing (union types as semantic switches), and Cycle (bounded lists).

Tool-selection quality, end-to-end latency, and token cost usually worsen as the tool set and action depth grow. Measure those curves with the actual tool descriptions and failure distribution. Frameworks such as LangGraph can make state and recovery paths explicit, but they do not remove the evaluation burden.

40. Prompt engineering for production

Production prompting is an evaluation problem: change one part of the prompt or context, then measure task quality and failure modes. The techniques below are common starting points, not a universal order.

Few-shot examples are often effective for controlling output format. Start with 3–5 examples that cover empty inputs, ambiguous queries, and multi-part answers, then measure the result on a held-out set. Examples should span the real input distribution rather than only the happy path. More examples consume context and do not guarantee further gains.

Chain-of-thought (CoT) prompting asks a model to expose intermediate reasoning before answering. Kojima et al. reported gains from the suffix “Let’s think step by step” on their tested reasoning tasks, but the effect varies by model and newer reasoning APIs may not expose hidden traces. For production, prefer an inspectable task decomposition or concise rationale when it is useful to the evaluator. Self-consistency (Wang et al., 2023) samples several reasoning paths and aggregates the answers, trading extra inference cost for robustness on suitable tasks.
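The aggregation step in self-consistency is a majority vote over final answers. The sketch below assumes a caller-supplied `sample_answer` function that runs one sampled completion and returns only the extracted final answer; here it is replaced by canned values so the example runs without a model.

```python
from collections import Counter

def self_consistency(sample_answer, n_samples=5):
    """Sample several answers and return the most common one with its vote share."""
    answers = [sample_answer() for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

# Canned answers stand in for five sampled reasoning paths.
canned = iter(["42", "41", "42", "42", "40"])
answer, agreement = self_consistency(lambda: next(canned), n_samples=5)
print(answer, agreement)
```

The vote share is a cheap agreement signal, though it is not a calibrated confidence. Cost scales linearly with `n_samples`, so the sample count belongs in the same evaluation as the quality gain.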

Structured output with explicit JSON schemas (Section 39) removes many parsing failures. Constrained decoding engines such as xgrammar can enforce the supported grammar during generation; factual accuracy and semantic validity still require evaluation, and unsupported schema features may still need handling.

Prompt chaining breaks a task into focused stages, for example classify intent → retrieve context → generate response → validate output. It can localize failures, allow different models per stage, and expose cacheable intermediate state. It also adds interfaces and latency, so compare it with a single-call baseline.

Temperature changes the sampling distribution. Low values are a reasonable starting point for classification or extraction; higher values can increase diversity for ideation. Exact behavior differs across model APIs and interacts with top_p, top_k, and provider defaults, so sweep the supported settings on the task rather than copying one range.
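To make the effect concrete, the sketch below applies temperature to a toy set of logits before the softmax. It illustrates the standard formulation only; the logits are made up, and real APIs may clamp, rescale, or combine temperature with other sampling controls.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return token probabilities after dividing logits by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                        # toy values, not from a real model
sharp = softmax_with_temperature(logits, 0.2)   # mass concentrates on the top token
flat = softmax_with_temperature(logits, 2.0)    # mass spreads across tokens
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```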

System vs user message separation keeps persistent policy apart from per-request content. Chat templates and instruction tuning give these roles different priority, but they do not make a system message an enforcement boundary. Put stable behavior in the system message, keep untrusted data in user or tool content, and enforce hard constraints such as PII removal outside the model as well.

Context engineering expands prompt work to the assembly of retrieved documents, tool results, conversation history, and examples. Liu et al. found a lost-in-the-middle effect in the long-context models they tested, so position should be part of the evaluation rather than assumed irrelevant. I covered the broader workflow in Context Engineering for AI Agents.

Part VIII — Production operations

This part covers rate limiting, failure modes, monitoring, cost optimization, and capacity planning under real traffic.

41. Rate limiting for variable-cost requests

Traditional requests-per-second rate limiting assumes roughly equal cost per request. LLMs break that assumption. A 10-token classification prompt and a 100K-token document analysis hit the same API endpoint but differ in cost by four orders of magnitude. Rate-limiting by RPS either lets expensive requests through unchecked or starves cheap requests unnecessarily.

Production systems need token-based rate limiting across multiple dimensions. OpenAI documents request and token limits by usage tier. Anthropic separates input-token and output-token limits. Exact quotas and algorithms can change, so treat the provider documentation as the source of truth; the architectural point is to budget requests and tokens independently.

The practical implementation pattern is a multi-dimensional limit hierarchy (user → application → organization → global), with priority tiers for premium access. At the request level, the technique that matters is token budget reservation: estimate total tokens (input + max_tokens) at admission time, deduct from the bucket, then adjust when the request completes with actual usage. That prevents a burst of long-generation requests from exhausting capacity before they even start producing output.
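Token budget reservation can be sketched as a token bucket with a reserve step and a settle step. The class below is a single-process illustration under stated assumptions: the `TokenBudget` name, capacity, and refill rate are hypothetical, and a real limiter would need shared state and one bucket per level of the hierarchy.

```python
import time

class TokenBudget:
    """Token bucket that reserves an estimate at admission and settles on completion."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.available = float(capacity)
        self.last = clock()

    def _refill(self):
        now = self.clock()
        gained = (now - self.last) * self.refill_per_sec
        self.available = min(self.capacity, self.available + gained)
        self.last = now

    def reserve(self, input_tokens, max_tokens):
        """Deduct input + max_tokens up front; return the reservation or None if rejected."""
        self._refill()
        estimate = input_tokens + max_tokens
        if estimate > self.available:
            return None
        self.available -= estimate
        return estimate

    def settle(self, reserved, actual_tokens):
        """Refund unused tokens, or charge the difference if the estimate was low."""
        self.available = min(self.capacity, self.available + (reserved - actual_tokens))

budget = TokenBudget(capacity=10_000, refill_per_sec=0, clock=lambda: 0.0)
first = budget.reserve(input_tokens=1_000, max_tokens=4_000)   # admitted
second = budget.reserve(input_tokens=2_000, max_tokens=4_000)  # rejected for now
budget.settle(first, actual_tokens=1_800)                      # refund unused output budget
third = budget.reserve(input_tokens=2_000, max_tokens=4_000)   # admitted after the refund
```

Reserving against `max_tokens` is deliberately pessimistic. The settle step is what keeps that pessimism from permanently wasting capacity.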

For self-hosted deployments, the equivalent is provisioned throughput: reserving dedicated GPU capacity for target token rates. For vLLM deployments, that means configuring admission control around active decode slots and KV cache pressure rather than request count alone. As Section 5 explains, throughput and admission both need token-aware limits.

42. Failure modes to design against

LLM serving adds failure modes tied to variable sequence length, KV memory, and long-running decode work. Design and load-test the defenses before production traffic depends on them.

Out-of-Memory (OOM) is a common failure. A 70B FP16 model needs about 140 GB for weights alone, and KV cache for a single 128K-context sequence can add about 40 GB under the assumptions in Section 6. The gap between “fits in memory” and “OOM under load” is smaller than it looks because a batch of long-context requests can consume more KV memory than expected. Prevention combines a measured memory reserve with quantization and paged KV allocation. For workloads with high KV pressure, LMCache can offload KV data to CPU memory or disk; use its published results as a starting point and benchmark the local memory hierarchy.
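The two headline numbers can be reproduced with a few lines of arithmetic. The shape values below (80 layers, 8 KV heads under GQA, head dimension 128, FP16 cache) are assumptions chosen to match a Llama-style 70B model; they are not taken from Section 6, so check them against the configuration you deploy.

```python
def weight_bytes(params, bytes_per_param):
    """Memory for model weights only."""
    return params * bytes_per_param

def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value):
    """KV cache per token; the factor of 2 covers separate key and value tensors."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

weights_gb = weight_bytes(70e9, 2) / 1e9          # FP16 weights, in GB
per_token = kv_bytes_per_token(80, 8, 128, 2)     # assumed model shape, FP16 cache
kv_gib = per_token * 128 * 1024 / 2**30           # one 128K-token sequence, in GiB
print(weights_gb, per_token, kv_gib)
```

Under these assumptions the cache costs 320 KiB per token, so sequence length, not parameter count, decides how quickly a batch exhausts the remaining memory.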

Preemption happens when KV cache pressure forces the scheduler to evict or recompute work. The exact strategy depends on the serving version and configuration. From the user’s side, the symptom is higher end-to-end latency without an obvious application error. Watch preemption counts and correlate them with KV use, queue depth, and request lengths.

Tail latency can spike when large prefills delay decode work. Chunked prefill (Section 7) and length-aware scheduling target this interference. The Learning-to-Rank scheduler and CascadeInfer papers report substantial gains over their baselines, but the exact result depends on the request-length distribution and scheduler configuration.

Cascading failures can start when slow requests grow the queue, upstream clients time out, and retries add more load. Defenses include admission control, per-tenant concurrency limits, output caps, retry budgets, and gateway circuit breakers. Disaggregated prefill and decode pools may help when load tests show persistent phase interference.
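Of the defenses listed, a retry budget is the easiest to show in isolation. The sketch below caps retries at a percentage of observed requests; the `RetryBudget` name and the 10% figure are hypothetical, and it uses lifetime counters where a production version would use a sliding window.

```python
class RetryBudget:
    """Allow retries only up to a fixed percentage of the requests seen so far."""

    def __init__(self, retry_percent=10, min_retries=1):
        self.retry_percent = retry_percent
        self.min_retries = min_retries
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_acquire_retry(self):
        """Return True and count the retry if the budget allows it."""
        allowed = max(self.min_retries, self.requests * self.retry_percent // 100)
        if self.retries >= allowed:
            return False
        self.retries += 1
        return True

budget = RetryBudget(retry_percent=10)
for _ in range(20):
    budget.record_request()
decisions = [budget.try_acquire_retry() for _ in range(3)]
print(decisions)
```

The point of the budget is that retries stop amplifying load exactly when the backend is slowest, which is when per-request retry policies tend to fire all at once.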

43. Monitoring LLM systems

LLM monitoring differs from traditional API monitoring because every request has a variable cost, two distinct phases with different bottlenecks, and a memory footprint that depends on both input length and generation length. Aggregate request latency and error rate alone miss most of what matters.

Goodput is the number of requests per second that meet all defined SLO thresholds, such as TTFT, TPOT, and total latency. It is a useful combined measure because raw throughput can look healthy while latency SLOs fail: a system processing 100 requests per second but missing its thresholds on 40% of them has a goodput of 60. Optimizing for goodput keeps the performance distribution visible instead of reporting only the mean.
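Computed from traces, goodput is a filtered request rate. The sketch below reproduces the 100-request example; the `RequestTrace` fields and the SLO thresholds are hypothetical placeholders for whatever your tracing layer records.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    ttft_s: float    # time to first token
    tpot_s: float    # mean time per output token
    total_s: float   # end-to-end latency

def goodput(traces, window_s, max_ttft_s, max_tpot_s, max_total_s):
    """Requests per second that met every SLO threshold in the window."""
    ok = sum(
        1
        for t in traces
        if t.ttft_s <= max_ttft_s and t.tpot_s <= max_tpot_s and t.total_s <= max_total_s
    )
    return ok / window_s

# 100 requests in a one-second window; 40 of them miss the TTFT threshold.
traces = [RequestTrace(0.3, 0.04, 4.0)] * 60 + [RequestTrace(1.5, 0.04, 6.0)] * 40
rate = goodput(traces, window_s=1.0, max_ttft_s=0.5, max_tpot_s=0.05, max_total_s=10.0)
print(rate)
```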

vLLM exposes a Prometheus endpoint at /metrics with running and waiting requests, KV cache use, generation-length distributions, and prefix-cache statistics. Metric names can change across releases, so bind dashboards to the deployed version. A typical stack uses Prometheus for metrics, Grafana for visualization, and OpenTelemetry-compatible traces across application and serving components.

Useful alert patterns include the following. Derive their thresholds from load tests rather than copying these examples unchanged:

Preemption count spikes — requests are being evicted and recomputed; users see higher latency with no application error.
KV cache utilization approaching the tested preemption region — add capacity or shed load before evictions cascade.
Queue depth persistently above the tested batch envelope — admission control should start rejecting or deprioritizing.
TTFT rising while TPOT stays flat — this divergence points first to queuing, admission, network, or prefill pressure rather than decode throughput. Use traces and queue metrics to distinguish among them.

44. Cost optimization: a compounding strategy

As a dated snapshot, March 2026 API prices spanned orders of magnitude across model tiers. Exact prices change quickly, so use the provider’s current calculator for a purchasing decision. The durable point is that model choice and output length can dominate the bill even before infrastructure optimizations.

In the March 2026 snapshot used for this section, output tokens cost several times more than input tokens for the cited model tiers. That asymmetry reflects the sequential decode work described in Section 7. For those pricing structures, reducing unnecessary output can have more leverage than shaving the same number of input tokens.

Several approaches can be stacked, but only after measuring which ones apply to the workload:

Quantization from FP16 to INT4 cuts weight memory by 75%. Whether that reduces the bill depends on kernel speed, batch size, and hardware utilization (Section 9).
Model routing sends eligible traffic to cheaper models. A Maxim AI vendor case study reported a monthly bill reduction from $42,000 to $29,000; reproduce the quality gate before borrowing its routing share (Section 36).
Prompt caching reduces repeated-prefix work. Provider discounts and rate-limit treatment change over time, so combine measured hit rate with current terms (Section 16).
Batch APIs can discount non-real-time work such as evaluations, synthetic-data generation, and bulk classification. Check current prices and completion windows.
Self-hosting can win at sustained utilization, but there is no universal token-volume break-even. Include engineering, orchestration, observability, capacity slack, and on-call costs in addition to GPU rental.

Multiplying the illustrative factors gives a large theoretical reduction, but the inputs are not independent: quantization changes throughput, routing changes quality mix, and caching and batching apply only to eligible traffic. Build the estimate from measured traffic shares and validate it against the invoice.
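A share-weighted estimate shows how much eligibility changes the result. In the sketch below each optimization is a pair of (eligible traffic share, cost multiplier on that share); all numbers are hypothetical, and multiplying the terms still assumes the optimizations are independent, which the invoice has to confirm.

```python
def remaining_cost_fraction(optimizations):
    """Fraction of the original bill left after applying each optimization.

    Each item is (eligible_share, cost_multiplier): only eligible_share of
    spend is affected, and that part is scaled by cost_multiplier.
    """
    fraction = 1.0
    for share, multiplier in optimizations:
        fraction *= 1.0 - share * (1.0 - multiplier)
    return fraction

# Naive view: two 50% savings applied to all traffic.
naive = remaining_cost_fraction([(1.0, 0.5), (1.0, 0.5)])
# Scoped view: the same savings apply to 40% and 30% of spend.
scoped = remaining_cost_fraction([(0.4, 0.5), (0.3, 0.5)])
print(round(naive, 2), round(scoped, 2))
```

With these made-up shares the bill falls to 68% of the original rather than 25%, which is the gap between a theoretical stack and measured coverage.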

A 2024 vendor report from TrueFoundry attributes most deployment cost in its sample to ML engineering rather than compute. Treat that figure as a prompt to include labor and operations in the model, not as a universal ratio.

45. Capacity planning and autoscaling

Capacity planning for LLM serving must account for variable request cost, long-running decode work, and sequence-dependent memory. Depending on the workload, the limiting resource may be KV memory, memory bandwidth, compute, or the interconnect.

The hard ceiling on concurrent requests is the KV cache memory budget, not FLOPS:

$\text{max concurrent sequences} = \frac{\text{GPU memory} - \text{model weights} - \text{overhead}}{\text{per-sequence KV cache size}}$

In a simplified Llama 3 70B INT4 example, about 35 GB of weights on an 80 GB H100 leaves roughly 40 GB after reserving additional overhead. At 4K context, an assumed 160 MB per sequence gives a theoretical ceiling near 250 sequences; at 128K context, the per-sequence figure scales to about 5 GB and the same arithmetic falls to about seven. Real capacity is lower once allocator behavior, runtime buffers, request-length variance, and the latency SLO are included. This is why GPU selection and KV cache optimization directly drive the capacity plan.
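The ceiling formula is easy to keep as a small function so the inputs can be swapped for measured values. The sketch below uses the simplified figures from this example, takes the overhead reserve as 5 GB so that 40 GB remains, and assumes per-sequence KV size scales linearly with context length.

```python
GB = 10**9
MB = 10**6

def max_concurrent_sequences(gpu_bytes, weight_bytes, overhead_bytes, kv_bytes_per_seq):
    """Theoretical ceiling: whole sequences that fit in the remaining KV budget."""
    kv_budget = gpu_bytes - weight_bytes - overhead_bytes
    return kv_budget // kv_bytes_per_seq

kv_per_seq_4k = 160 * MB               # assumed per-sequence KV size at 4K context
kv_per_seq_128k = kv_per_seq_4k * 32   # linear scaling from 4K to 128K

ceiling_4k = max_concurrent_sequences(80 * GB, 35 * GB, 5 * GB, kv_per_seq_4k)
ceiling_128k = max_concurrent_sequences(80 * GB, 35 * GB, 5 * GB, kv_per_seq_128k)
print(ceiling_4k, ceiling_128k)
```

Treat the result as an upper bound for planning. The number to provision against comes from a load test at the target SLO.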

The capacity formula for fleet sizing:

$\text{required GPUs} = \frac{\text{peak tokens/min} \times \text{measured safety margin}}{\text{per-GPU throughput at target SLO}}$

The detail that matters is “at target SLO.” Peak token throughput and SLO-compliant throughput can differ sharply as concurrency rises. Benchmark with the real prompt and output-length distribution at the required TTFT and TPOT thresholds rather than using a theoretical maximum.
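The fleet-sizing formula translates directly into code once the throughput input is the SLO-compliant figure. The traffic, margin, and per-GPU numbers below are hypothetical and exist only to show the rounding.

```python
import math

def required_gpus(peak_tokens_per_min, safety_margin, per_gpu_tokens_per_min_at_slo):
    """GPUs needed to serve peak load, rounded up to whole devices."""
    if per_gpu_tokens_per_min_at_slo <= 0:
        raise ValueError("per-GPU throughput must be positive")
    return math.ceil(peak_tokens_per_min * safety_margin / per_gpu_tokens_per_min_at_slo)

# Hypothetical inputs: 1.2M tokens/min at peak, 30% margin,
# 90K tokens/min per GPU measured at the target TTFT and TPOT.
gpus = required_gpus(1_200_000, 1.3, 90_000)
print(gpus)
```

Substituting a peak-throughput benchmark for the SLO-compliant one lowers the result, which is how an undersized fleet usually gets approved.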

GPU utilization is insufficient as the only autoscaling signal because it can stay high during both healthy processing and overload. Combine it with queue depth, KV cache utilization, and goodput degradation. Tune thresholds from load tests; values such as 80% KV utilization are starting points, not universal limits. These metrics are introduced in Section 43.
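A combined scaling signal can be as simple as a list of tripped conditions. In the sketch below the metric names and the queue and goodput limits are hypothetical; the 0.80 KV threshold is the starting point mentioned above, and GPU utilization is left out of the trigger on purpose.

```python
def scale_out_reasons(queue_depth, kv_utilization, goodput_ratio, limits):
    """Return the signals that currently argue for adding capacity."""
    reasons = []
    if queue_depth > limits["max_queue_depth"]:
        reasons.append("queue_depth")
    if kv_utilization > limits["max_kv_utilization"]:
        reasons.append("kv_utilization")
    if goodput_ratio < limits["min_goodput_ratio"]:
        reasons.append("goodput")
    return reasons

# Limits should come from load tests; these values are placeholders.
limits = {"max_queue_depth": 32, "max_kv_utilization": 0.80, "min_goodput_ratio": 0.95}
reasons = scale_out_reasons(queue_depth=12, kv_utilization=0.86, goodput_ratio=0.97, limits=limits)
print(reasons)
```

Returning the reasons rather than a boolean makes scaling events explainable in the same dashboards that hold the underlying metrics.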

Scale-to-zero can fit development and staging environments with long idle periods. Serverless inference platforms and Kubernetes-based autoscalers such as KEDA can remove idle capacity, but the savings and cold-start time depend on model size, image and weight caching, and infrastructure. Measure startup time before using the same policy for latency-sensitive production traffic.

The interconnected system

These 45 concepts are not a loose collection. They form a connected system. KV cache size determines batch size, batch size determines arithmetic intensity, arithmetic intensity controls whether decode is memory-bound, that drives TPOT, and TPOT defines throughput. GQA shrinks KV cache, which enables larger batches, which raises arithmetic intensity, which improves GPU utilization. FlashAttention exploits the SRAM-HBM bandwidth gap. Continuous batching solves compute utilization but creates memory fragmentation, which is what PagedAttention resolves. Chunked prefill co-schedules compute-bound and memory-bound work; the roofline model explains why this complementarity works.

Figure: Inference Optimization Stack

On the training side, lifecycle cost can favor training a smaller model on more tokens, as Llama 3 8B illustrates. GRPO reduces the critic-state burden of PPO. In DeepSeek-R1’s tested small-model setup, distillation outperformed direct RL. These are design options to evaluate, not a single training recipe.

The operational pattern is more stable than any price curve: routing, caching, quantization, and right-sized hardware compound only when each is measured against the same quality and latency target.

Key principles

LLM inference has two distinct phases. In common serving regimes, prefill leans toward compute and decode toward memory bandwidth; workload shape can move the boundary.
The KV cache is often a central constraint. Its size affects batch capacity and memory pressure. GQA, paged allocation, and KV quantization target different parts of that constraint.
The GPU memory hierarchy explains many optimizations. FlashAttention and kernel fusion reduce expensive data movement rather than changing the model’s function.
Continuous batching and PagedAttention work together. One published comparison reports 23x throughput over its naive baseline; measure the gain in your serving stack.
Chinchilla-optimal is not inference-optimal. Llama 3 8B, for example, was trained at about 1,875 tokens per parameter to trade more training compute for cheaper inference.
GRPO and distillation changed the alignment toolkit. DeepSeek-R1 reported stronger small-model results from distillation than from its direct RL experiments.
Cost optimization compounds only across eligible traffic. Model quantization, routing, caching, and batch APIs need separate coverage and quality measurements before their savings can be multiplied.
Serving hardware must match the bottleneck. Decode-heavy workloads often benefit from HBM capacity and bandwidth; prefill-heavy workloads place more weight on compute.

References

Organized by topic area; section numbers in brackets where applicable.

Serving frameworks

vLLM - PagedAttention-based serving engine
SGLang - RadixAttention and structured generation
TensorRT-LLM - NVIDIA optimized inference
llama.cpp - Portable C/C++ inference
DeepSpeed - Microsoft distributed training library
Ollama - Local LLM runner

Operations

Efficient LLM Scheduling by Learning to Rank - Fu et al., NeurIPS 2024 (vLLM-LTR)
CascadeInfer: Low-Latency and Load-Balanced LLM Serving via Length-Aware Scheduling - 2024
Goodput metric as measure of ML productivity - Google Cloud, 2024
vLLM Optimization and Tuning - vLLM Documentation
vLLM Metrics - vLLM Documentation
LMCache: KV Cache Management for LLM Serving - KV cache offloading
OpenAI Rate Limits - OpenAI API Documentation
Anthropic Rate Limits - Anthropic API Documentation