Inference
What we call “an LLM thinking” is, at the serving level, almost the same machine every time. Template the prompt. Tokenize. Schedule into a batch. Prefill the model. Run a tight per-token loop wrapped in continuous batching and streaming. Send the bytes back. The wrapper is universal.
The mind is not. Peel open the model and four genuinely different cognitive interiors are now in production under the same chat-completion endpoint. Dense decoder-only transformers. Sparse mixture-of-experts. Recurrent state-space hybrids. Parallel diffusion denoisers. The shape of “thinking” inside each one is different enough that the same prompt can take a structurally different route through the math depending on which model you happened to call.
This hub is the front door to a seven-part walk through what happens between the moment a user presses Enter on a chat client and the moment the answer finishes typing itself out. Each phase page goes deep on one stretch of the pipeline with the diagrams the mechanics deserve. The walkthrough assumes comfort with REST APIs and the idea of a transformer, and assumes you are tired of treating the inside of the model as a black box. It does not assume you know what GQA is, what RoPE rotates, or why prefill is compute-bound while decode is memory-bound. Those are what the phase pages are for.

The two loops
Almost all of LLM inference reduces to two nested loops.
Once per request: prefill. The whole prompt gets templated, tokenized, scheduled into a batch, and pushed through every transformer layer in one big parallel pass. The point of prefill is to compute the K and V vectors for every token in the prompt and store them in a per-layer cache. Prefill is compute-bound. The GPU’s math units are the bottleneck, and you keep them busy by packing many tokens into the batch at once.
Once per output token: decode. With the prompt’s K/V cached, generating each new token is a much smaller forward pass. Attend over the cache, predict the next token’s distribution, sample, append, write one new K and V column to the cache, repeat. Decode is memory-bandwidth-bound. The GPU spends its time reading the growing cache out of HBM, not doing arithmetic. The KV cache, not the model weights, is what actually constrains LLM serving at scale.

This split shapes nearly every serving optimization you’ve seen in passing. Continuous batching exists because prefill and decode have different bottlenecks and can be packed together for higher GPU utilization. Speculative decoding exists because decode is memory-bound and you can verify several speculated tokens for nearly the cost of one. Paged attention exists because the KV cache fragments physical memory if you allocate it naively. Prefix caching exists because long shared prompt prefixes mean re-computing the same K/V over and over for no reason.
Deeper: why no KV cache means O(n²) per step.
A decoder-only transformer without a cache has to recompute K and V for every prior token at every step, since attention multiplies the current Q against all prior K’s. After n steps you’ve done O(n²) work just to keep going. The KV cache turns that into O(n) by storing K and V once and reading them back. The cost is the memory the cache eats. A 70B-class model serving 2k-token contexts at batch size 32 can easily spend 80+ GB on KV cache before the model weights are loaded, which is why production serving in 2026 leans on GQA, PagedAttention, and KV-cache quantization just to make the math fit on commodity accelerators.
The wrapper is universal
Read the SDK docs for OpenAI, Anthropic, Google, Mistral, or any of the open-source serving stacks (vLLM, SGLang, TensorRT-LLM) and the surface is suspiciously similar. A chat-completion endpoint. An array of messages with roles. A streamed delta or a final completion. A known set of generation parameters: temperature, top_p, max_tokens, stop, seed. Tool calling. Structured output via a JSON schema. Prompt caching, sometimes explicit, sometimes automatic.
The convergence is not a coincidence. The serving wrapper is the boring part. Auth, rate-limit, route, tokenize, schedule, prefill, decode, stream, bill. Every vendor has a version of it and the versions are nearly interchangeable. Swap providers and you change a few constants. The interior, where the model actually thinks, is a different problem.
This family covers the wrapper in depth anyway, because the wrapper is where most production failures originate. The wrong chat template silently degrades quality. A blown context window throws an error you have to handle. The KV cache eats your GPU memory. Speculative decoding interacts with your prompt cache. Streaming interacts with your Markdown parser. None of this is interesting in isolation. All of it is interesting when you are debugging at 3 a.m. with a customer on the line.
The mind is not
Open the model itself and four shapes are now in production.
Dense decoder-only. Llama, Mistral, GPT-class, Claude-class. Every token sees every layer’s full FFN. Compute per token scales with parameter count, which is why scaling these models up is expensive on inference, not just training.
Mixture-of-experts. Mixtral, DeepSeek-V3, Qwen-MoE, Grok. The FFN at each layer becomes a router and N expert FFNs. Each token activates only a small top-k subset of experts. DeepSeek-V3 has roughly 671B total parameters and activates roughly 37B per token, which is why MoE has eaten so much of the open-weights frontier: more knowledge capacity at the same per-token cost. The cost moves to engineering complexity. Load balancing across experts. Expert capacity per batch. The all-to-all GPU communication that dispatches tokens to the right device and gathers results back.
State-space and hybrid. Mamba, Jamba, Falcon-H1, Zamba. Attention is replaced (sometimes entirely, sometimes in a fraction of layers) by a recurrent scan over a fixed-size state. Linear in sequence length instead of quadratic. No growing KV cache on the SSM layers, because there is no cache to grow, just a state that gets updated at every position. This is one of the reasons “every LLM has a KV cache” is false.
Diffusion language models. Mercury, LLaDA, Gemini Diffusion. Generation is not autoregressive at all. The sequence starts fully masked. The model predicts every position in parallel using bidirectional attention. The most confident predictions get committed; the rest get re-masked; the process repeats for some number of denoising steps. Number of steps is decoupled from sequence length, which means the speed/quality tradeoff is a different shape than anything autoregressive serving optimizes for.

Below is the comparison at a glance. The Phase 7 page goes deep on each row.
| Family | Examples | Generation | KV cache | Distinctive step |
|---|---|---|---|---|
| Dense decoder-only | Llama, Mistral, GPT-class, Claude-class | Autoregressive | Yes | Dense FFN per layer |
| Mixture-of-experts | Mixtral, DeepSeek-V3, Qwen-MoE, Grok | Autoregressive | Yes | Router → top-k experts, expert parallelism |
| Reasoning / “thinking” | o-series, DeepSeek-R1, Claude extended thinking, Gemini thinking | Autoregressive, long internal CoT first | Yes | Hidden reasoning tokens under a thinking budget |
| Multimodal / any-to-any | GPT-4o, Gemini, Claude vision, Qwen-VL | Autoregressive (plus encoders) | Yes | Vision/audio encoder and projector before the LLM |
| State-space / hybrid | Mamba, Jamba, Falcon-H1, Zamba | Autoregressive recurrent scan | Partial / none | Fixed-size state update instead of attention |
| Diffusion LM | Mercury, LLaDA, Gemini Diffusion | Iterative parallel denoising | No | Refines the whole sequence over N steps |
| Encoder-decoder | T5 family | Encoder pass + autoregressive decoder | Decoder only | Cross-attention into encoder output |
A contestable claim worth sitting with. Most of the engineering work needed to ship LLM products in 2026 is wrapper work. Most of the interesting work, including the work that decides which models will still be competitive in five years, is interior work. Wrapper engineers and interior researchers are not the same team. They are not the same skill set. The gap between what each of them cares about has gotten wider, not narrower, since the dense-transformer monoculture cracked open.
The phases in order
Each phase page covers one stretch of the journey with the diagrams the topic deserves. They publish in order. If a link below 404s, that page hasn’t shipped yet.
- Phase 1 — From Enter Key to Datacenter — client validation, edge and WAF, API gateway, auth, rate limits, billing pre-check, prompt-cache lookup, model routing, queue admission.
- Phase 2 — Building the Prompt — chat templating, system-prompt injection, tool-schema serialization, multimodal encoding, tokenization, special tokens, context-window enforcement, position IDs and the causal attention mask.
- Phase 3 — Scheduling and Prefill — continuous batching, prefix caching, KV-cache allocation and PagedAttention, the transformer layer stack, RoPE, attention, FlashAttention. The compute-bound half of a request.
- Phase 4 — The Per-Token Loop — single-token forward passes, the logit processing pipeline (penalties, temperature, top-k/top-p/min-p, constrained decoding, tool forcing), sampling, speculative decoding. The memory-bound half.
- Phase 5 — Streaming Back and Rendering — incremental detokenization with the UTF-8 buffering gotcha, stop criteria, reasoning-channel routing, SSE and gRPC transport, output-side moderation, tool-call interception, client-side Markdown / code / table / LaTeX rendering.
- Phase 6 — Tool Loops, Termination, and What the User Actually Feels — the agentic loop, billing commit, prompt-cache write, observability. Then the latency model: TTFT versus TPOT, throughput-versus-latency tradeoffs, what the user does and does not perceive.
- Phase 7 — The Inside Is Not Universal — dense decoder-only, mixture-of-experts (DeepSeek-V3 as the worked example), state-space and hybrids, diffusion language models. The architectural fork in depth.
Read them in order. Earlier phases set up the vocabulary the later phases use without explanation.
Adjacent material on this site
- Artificial Intelligence — the broader category these pages live under.
- Cryptography — relevant for the TLS / transport portion of Phase 1, and increasingly relevant to model-card signing for supply chain trust on hosted weights.