Phase 2 — Building the Prompt
The request is in the scheduler’s queue. Nothing has touched the GPU yet. Between admission and the first compute operation lives a stretch of mostly-clerical work that converts a structured message array into the exact integer sequence the model’s first layer will consume. It is templating. It is tokenization. It is position-ID assignment. It is the most boring part of the pipeline, and the part that most often produces wrong-but-confident outputs without anyone noticing.
This is also the only stage where you can corrupt the model’s input in ways the model has no protection against. A wrong chat template doesn’t error. The model just produces lower-quality text. A blown context window may or may not error, depending on the gateway’s preferences. An inconsistent tokenizer doesn’t error either. You get nonsense.
The chat template is load-bearing
A modern chat model wasn’t trained on a JSON array of {role, content} objects. It was trained on a flat string with role markers, turn delimiters, and a handful of special tokens that carry semantic weight. Llama-class models use a format with <|begin_of_text|>, <|start_header_id|>, and <|end_header_id|> tokens. Mistral uses [INST]...[/INST] blocks. Qwen has its own convention. Anthropic’s hosted models use a format the public never sees. The exact shape for open-weights models is documented in the model’s chat template configuration on Hugging Face; the closed-weights ones the vendor handles for you.
The format is load-bearing. It is not cosmetic, decorative, or interchangeable. If you serve a Llama-3 fine-tune with the Llama-2 template, you get text that looks superficially fine but tends to score measurably worse on benchmarks. The output is grammatical, on-topic, and confident. It is also subtly degraded in ways your manual eyeball test will miss and a decent eval suite will catch.
Hosted API providers handle this for you; a developer running open-weights models on their own infrastructure has to get it right. vLLM and SGLang both ship default templates and let you override per model. Most production incidents on self-hosted LLMs in the year after a model’s release trace back to a template mismatch somewhere in the stack: wrong tokenizer special tokens, wrong system-prompt placement, wrong newline conventions inside an assistant turn.

System prompts get injected at this stage. The vendor’s hosted system prompt (the one that defines safety constraints and assistant persona) is prepended to the developer-supplied system prompt, which is prepended to the conversation. There are at least three system prompts active in a typical production request: the vendor’s, the developer’s, and any application-specific prompt that the developer’s framework adds on top. The model sees them concatenated in the templated string, with whichever role markers the model expects.
Tool schemas get serialized. If the request includes function or tool definitions, those get inlined into the prompt. The format varies. Anthropic uses XML tags. OpenAI uses pseudo-code. Some open-weights models use JSON. Some use special tokens that were introduced during fine-tuning. The model has to recognize the format from training, which means tool calls only work well when the schema-injection convention matches what the model was trained on. Cross-vendor tool-calling abstractions exist (LangChain, the OpenAI-compatible APIs that other vendors emulate) and they work by translating the developer’s tool definitions into whatever string convention the target model was actually trained on.
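As a minimal sketch of schema injection, assume a model that was fine-tuned to read tool definitions as JSON inside the system prompt. The function name inline_tools, the "Available tools (JSON):" wording, and the get_weather schema are all hypothetical; a real deployment should use whatever convention the model’s own chat template defines.

```python
import json


def inline_tools(system_prompt, tools):
    """Append JSON-serialized tool definitions to a system prompt.

    The wording and layout here are illustrative assumptions, not any
    vendor's actual injection format.
    """
    if not tools:
        return system_prompt
    # sort_keys keeps the serialized string stable across runs, so
    # identical requests produce identical prompts.
    schema_text = json.dumps(tools, indent=2, sort_keys=True)
    return f"{system_prompt}\n\nAvailable tools (JSON):\n{schema_text}"


# Hypothetical tool definition, for illustration only.
tools = [
    {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]

prompt_with_tools = inline_tools("You are a helpful assistant.", tools)
print(prompt_with_tools)
```

The point of the sketch is that the tool schema ends up as ordinary prompt text: it costs tokens, and the model only uses it well if the layout resembles what it saw in training.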
Deeper: what a real chat template actually looks like.
The Llama 3 instruction-tuned chat template, in its Jinja form, looks roughly like: <|begin_of_text|> followed by, for each message, <|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>, followed by (if generating) a trailing <|start_header_id|>assistant<|end_header_id|>\n\n. The double newlines after the header are part of the format. Missing them is a real bug class. The <|eot_id|> token is the end-of-turn marker and the model’s primary stop token: if the tokenizer and the runtime don’t agree on its integer ID, the runtime never recognizes the stop and generation runs on until it hits the max_tokens limit. Templates for closed-weights models follow similar shapes with different special-token names.
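The shape described above can be sketched in plain Python. This is an illustration of the string layout only, not the official Jinja template: the function name render_llama3 is hypothetical, and the template shipped with the model remains the authority on details such as whitespace trimming.

```python
def render_llama3(messages, add_generation_prompt=True):
    """Render {role, content} dicts into a Llama-3-style prompt string."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        # Header, exactly two newlines, content, end-of-turn marker.
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "Hi"},
]
prompt = render_llama3(messages)
print(repr(prompt))
```

Printing with repr makes the \n\n after each header visible, which is the detail most likely to go missing in a hand-rolled template.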
Multimodal inputs go through their own encoder first
Text-only models skip this section. Vision-language models, audio-language models, and the any-to-any models that have become common in 2026 all need a way to map non-text inputs into the same embedding space the language model operates in.
The standard pattern: a separate encoder (a Vision Transformer for images, a similar architecture for audio) processes the input independently of the language model. The encoder outputs a sequence of embedding vectors. A small projector (usually a couple of linear layers, sometimes a more elaborate cross-attention adapter) maps those vectors into the language model’s embedding dimension. The projected vectors are inserted into the token stream at the position where the image or audio “lived” in the original message. The language model then treats them as tokens, except they came from the encoder’s output rather than from the token embedding table.

The position where this happens matters. If the developer’s request had an image attached to the second user turn, the templated prompt has a placeholder there (often a special token like <|image|> or a region of <|reserved_special_token_N|> slots), and the projector’s output replaces or fills those slots before the input goes to the model’s first layer.
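A toy sketch of the slot-filling step, assuming one placeholder token per image vector. The placeholder ID 99, the function name splice_image_embeddings, and the two-dimensional vectors are all made up for illustration; real models differ in how many slots an image occupies and how those slots are marked.

```python
IMAGE_TOKEN_ID = 99  # hypothetical placeholder ID, not a real model's value


def splice_image_embeddings(token_ids, embed_table, image_vectors,
                            image_token_id=IMAGE_TOKEN_ID):
    """Build the first layer's input: one vector per sequence position.

    Text positions are looked up in the embedding table; placeholder
    positions are filled, in order, from the projector's output.
    """
    n_slots = token_ids.count(image_token_id)
    if n_slots != len(image_vectors):
        raise ValueError(
            f"{n_slots} placeholder slots but {len(image_vectors)} image vectors"
        )
    remaining = iter(image_vectors)
    return [
        next(remaining) if tok == image_token_id else embed_table[tok]
        for tok in token_ids
    ]


# Toy 2-dimensional embeddings; real models use thousands of dimensions.
embed_table = {0: [0.1, 0.2], 1: [0.3, 0.4]}
token_ids = [0, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 1]
image_vectors = [[9.0, 9.1], [9.2, 9.3]]  # stand-in for projector output

inputs = splice_image_embeddings(token_ids, embed_table, image_vectors)
print(inputs)
```

The slot-count check is the useful part: a mismatch between the number of placeholders the template emitted and the number of vectors the encoder produced should fail loudly rather than shift every later position.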
Two practical consequences. Images and audio occupy real tokens in the context window: a high-resolution image at the default encoder configuration can consume hundreds of token slots, sometimes more than the surrounding text in the conversation. And the encoder’s quality determines the model’s perception. A vision model with a weak encoder will misread small text in a screenshot regardless of how strong the language head is, which is why “the model is bad at OCR” is almost always a vision-encoder problem rather than a language-model problem.
Tokenization
Now the templated string, with any multimodal placeholders resolved, has to become a sequence of integers. The model’s first operation is an embedding lookup, and the embedding lookup is indexed by integer token IDs. The tokenizer is the function that converts text to those integers.
Three families of tokenizer are in widespread production use in 2026.
Byte-Pair Encoding (BPE) and byte-level BPE. The GPT family, Llama, and most OpenAI / Meta lineage models use byte-level BPE. The tokenizer starts from raw bytes (not characters) and was trained by repeatedly merging the most frequent adjacent byte pairs into a fixed vocabulary, typically 32,000 to 200,000+ entries. Byte-level matters: it means the tokenizer can represent any input, including arbitrary Unicode, emoji, and binary noise, without ever producing an “unknown token” symbol. The cost is that some sequences (a single emoji, certain ideographic characters) get split across multiple tokens.
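To make the merge procedure concrete, here is a deliberately tiny byte-level BPE sketch. It is a teaching toy under simplifying assumptions: production tokenizers pre-split text with a regular expression, train on large corpora, and use far more merges. The function names are hypothetical.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of pair in seq with new_id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to num_merges merges over the UTF-8 bytes of text."""
    seq = list(text.encode("utf-8"))  # start from raw bytes, IDs 0..255
    merges = {}                       # (left_id, right_id) -> new_id
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges gain nothing
        merges[pair] = next_id
        seq = merge_pair(seq, pair, next_id)
        next_id += 1
    return merges


def encode(text, merges):
    """Encode text by applying merges in the order they were learned."""
    seq = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        seq = merge_pair(seq, pair, new_id)
    return seq


text = "aaabdaaabac"
merges = train_bpe(text, num_merges=3)
ids = encode(text, merges)
print(len(text.encode("utf-8")), "bytes ->", len(ids), "tokens")

# No merge covers these bytes, so the emoji falls back to its raw UTF-8
# bytes: several tokens, but never an "unknown token".
emoji_ids = encode("🙂", merges)
print(emoji_ids)
```

The emoji case shows both halves of the trade-off in the paragraph above: the input is always representable, and an unmerged character costs one token per byte.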
SentencePiece unigram. Google models, including Gemini and the T5 lineage, generally use SentencePiece in unigram mode. The vocabulary is learned with a probabilistic model rather than greedy merges, which produces somewhat different splits and, in practice, slightly better behavior on multilingual corpora. The vocabulary sizes are comparable to BPE.
WordPiece. The BERT family used WordPiece. Modern decoder-only chat models do not.
The differences across the three families matter less than the broad behaviors all sub-word tokenizers share. Three of those behaviors create most of the production failures.
Token counts are not word counts. A common rule of thumb for English is about 0.75 words per token, which means a 1,000-word document tokenizes to roughly 1,300 tokens. The ratio is worse for code (more punctuation, more rare identifiers), worse for non-English Latin-script languages (some words tokenize as four or five pieces), and dramatically worse for non-Latin scripts. A paragraph of Chinese, Japanese, or Korean text in a byte-level BPE tokenizer often takes three to five times the tokens of equivalent English text. This is why per-token billing is implicitly more expensive for non-English use cases and why context windows feel smaller in those languages.
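The arithmetic behind the rule of thumb, plus one reason non-Latin scripts start at a disadvantage in a byte-level scheme, can be checked directly. The helper names are hypothetical, and the estimate is a rough planning number, not a substitute for running the model’s actual tokenizer.

```python
def estimate_english_tokens(word_count, words_per_token=0.75):
    """Rough token estimate from a word count; a rule of thumb, not a tokenizer."""
    return round(word_count / words_per_token)


def utf8_bytes_per_char(text):
    """Average UTF-8 bytes per character: the cost before any merges apply."""
    return len(text.encode("utf-8")) / len(text)


print(estimate_english_tokens(1000))  # 1333
print(utf8_bytes_per_char("hello"))   # 1.0
print(utf8_bytes_per_char("你好"))     # 3.0
```

Common CJK characters take three bytes each in UTF-8, so a byte-level tokenizer needs learned merges just to get back to one token per character, and a vocabulary trained mostly on English has fewer of those merges to offer.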
Digit splitting wrecks arithmetic. Most tokenizers split multi-digit numbers into sub-word pieces in ways that are not aligned with place value. The number 1,234,567 might tokenize as 1, 234, 567 or as 12, 345, 67 or as some other equally arbitrary split. The model is forced to learn to recognize and operate on the digit groupings rather than reading numbers as integers. This is why most LLMs are reliably bad at multi-digit arithmetic compared to their performance on every other reasoning task at the same parameter scale. A model that can write a publishable analysis of macroeconomic policy can’t reliably multiply two five-digit numbers, and the reason is the tokenizer, not the architecture.
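A small illustration of the misalignment, assuming a hypothetical tokenizer that chunks digit runs left to right in groups of three. The chunking rule is an assumption chosen for clarity, not a description of any specific model’s tokenizer.

```python
def chunk_left_to_right(digits, size=3):
    """Split a digit string into fixed-size groups starting from the left."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_by_place_value(digits, size=3):
    """Split a digit string into groups aligned to thousands, as a human would."""
    head = len(digits) % size
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return chunks


for number in ("1234567", "12345678"):
    print(number, chunk_left_to_right(number), chunk_by_place_value(number))
```

Under the left-to-right rule, 1234567 becomes 123, 456, 7, so the chunk "456" means four hundred fifty-six thousands plus something in one number and something else entirely when the length changes by one digit. The model has to infer place value from context instead of reading it off the token boundaries.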
Tokenizers are model-specific and not interchangeable. Each model was trained against exactly one tokenizer’s vocabulary and merge rules. Using the wrong tokenizer to encode the input produces a sequence of integers the model has never seen aligned with its training. The model still generates outputs, because the embedding table has rows for every integer in the vocabulary, but the outputs are random in their relationship to the input. This is a less common bug than chat-template mismatch, because most serving stacks load the tokenizer with the model from the same artifact. It still happens when developers manually wire up an inference pipeline against a model they downloaded from one source and a tokenizer they downloaded from another.
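A toy demonstration of the mismatch, along with one possible guard. The two three-entry vocabularies and the vocab_fingerprint helper are hypothetical; the idea is simply to compare a hash of the tokenizer’s vocabulary against one recorded alongside the model weights.

```python
import hashlib
import json


def vocab_fingerprint(vocab):
    """Stable SHA-256 hash of a piece -> ID mapping."""
    canonical = json.dumps(sorted(vocab.items()), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two toy vocabularies with the same pieces but different integer IDs.
vocab_trained = {"hello": 0, " world": 1, "!": 2}
vocab_wrong = {"!": 0, "hello": 1, " world": 2}

# Encode with the wrong tokenizer...
wrong_ids = [vocab_wrong[piece] for piece in ["hello", " world", "!"]]

# ...then read those IDs the way the model's embedding table would.
id_to_piece = {i: piece for piece, i in vocab_trained.items()}
seen_by_model = "".join(id_to_piece[i] for i in wrong_ids)
print(repr(seen_by_model))

# The guard: refuse to serve if the fingerprints disagree.
print(vocab_fingerprint(vocab_trained) == vocab_fingerprint(vocab_wrong))
```

Every ID is valid, so nothing errors. The model simply receives a different sentence from the one the developer wrote.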

Deeper: why “fix the tokenizer” is harder than it sounds.
Several papers have shown that tokenization choices materially affect model capability, particularly on multilingual and numerical tasks. The obvious fix, training models with better tokenizers, runs into a wall. The tokenizer is decided before training begins, the entire training run is conditioned on its choices, and the embedding table is sized to its vocabulary. Changing the tokenizer post-training requires re-training or extensive adaptation. Several research efforts (byte-level models that skip tokenization entirely, like Meta’s MEGABYTE; character-level models; tokenizer-distillation approaches) have tried to address this. None has displaced sub-word tokenization in production frontier models as of mid-2026. The economics of training a frontier model from scratch with a novel tokenizer are not yet attractive enough.
A contestable claim worth sitting with. Tokenization is the most underrated source of silent model failure in production. Most “prompt engineering” practice (the spacing tricks, the digit-grouping conventions, the strange capitalization patterns that some teams rely on for better outputs) is in significant part tokenizer adaptation that the practitioner does not recognize as such. The trick works because it shifts the input across a tokenizer boundary that the model handles better. Teams that understand this can replace fragile prompt-engineering folklore with predictable, tokenizer-aware formatting. Teams that don’t end up shipping superstition disguised as best practice.
Special tokens and the boundaries
Once the templated string is tokenized, a small number of additional special tokens get inserted at the boundaries. The beginning-of-sequence (BOS) token. The end-of-sequence (EOS) token. Turn-boundary markers that separate user turns from assistant turns. Control tokens that the model was trained to interpret as instruction markers. The exact set varies per model.
These tokens are integers like any other, but the model was trained to treat them differently. A misplaced EOS token can cause generation to stop early. A missing BOS can degrade output without leaving any obvious signature. The chat template usually handles all of this, which is one more reason the template is load-bearing.
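One boundary bug worth guarding against is a duplicated BOS, which can happen when the template writes one into the string and the tokenizer then prepends another. The sketch below assumes a model that expects exactly one leading BOS; the function name and the ID 1 are hypothetical, and some models expect no BOS at all.

```python
def ensure_single_bos(token_ids, bos_id):
    """Return token_ids with exactly one leading BOS token."""
    body = list(token_ids)
    # Drop any run of leading BOS tokens, then add a single one back.
    while body and body[0] == bos_id:
        body.pop(0)
    return [bos_id] + body


BOS = 1  # hypothetical BOS ID
print(ensure_single_bos([BOS, BOS, 5, 6], BOS))
print(ensure_single_bos([5, 6], BOS))
```

A normalizing guard like this is a stopgap. The cleaner fix is to decide in one place, template or tokenizer, who owns the BOS and to disable it in the other.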
Context-window enforcement
The model has a maximum context length: typically 128k, 200k, 1M, or for some 2026-era models, 2M tokens or more. After tokenization, the runtime knows how long the prompt is and how many output tokens the request has asked for. If prompt_tokens + max_tokens exceeds the model’s context window, the request gets rejected at this stage, or the prompt gets silently truncated. The silent truncation is the dangerous one, because some serving stacks do it by default and the developer never sees that their multi-turn conversation has lost its earliest turns.
The error message at this stage is usually friendly. The silent truncation is usually catastrophic. Audit your runtime’s truncation behavior before you ship.
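A minimal sketch of the loud version of this check, under the assumption that the runtime should reject rather than truncate. The exception class and function name are hypothetical; the inequality is the one described above.

```python
class ContextOverflow(ValueError):
    """Raised when a request cannot fit in the model's context window."""


def check_context_budget(prompt_tokens, max_tokens, context_window):
    """Return the number of spare token slots, or raise if the request won't fit."""
    needed = prompt_tokens + max_tokens
    if needed > context_window:
        raise ContextOverflow(
            f"prompt ({prompt_tokens}) + max_tokens ({max_tokens}) = {needed} "
            f"exceeds context window ({context_window})"
        )
    return context_window - needed


print(check_context_budget(prompt_tokens=1000, max_tokens=500, context_window=2000))

try:
    check_context_budget(prompt_tokens=1900, max_tokens=500, context_window=2000)
except ContextOverflow as err:
    print("rejected:", err)
```

Running the same check client-side, with the model’s real tokenizer, means the application decides what to drop instead of discovering later that the serving stack decided for it.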
Position IDs and the causal mask
The last two pieces of preprocessing are integer arrays the model’s attention layer needs to operate.
Position IDs are sequential integers (0, 1, 2, …) that label each token’s position in the sequence. For RoPE-based models, position IDs are cheap to assign here and the actual position-aware rotation happens inside attention per layer (Phase 3 covers this). For models with explicit positional embeddings, the position ID indexes into a learned embedding table at the input stage. Either way, the runtime builds this array deterministically.
The causal attention mask is the matrix that says the token at position t can attend to positions 0 through t but not to positions t+1 onward. The mask is what makes the model autoregressive: each token’s prediction can only depend on tokens that came before it. The mask is conceptually a triangular matrix of zeros and -inf values added to the attention scores just before softmax. In practice, the modern attention kernels (FlashAttention and friends) bake the mask into their tiling rather than materializing it explicitly, but the idea is the same.
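Both arrays are simple enough to write out for a short sequence. This sketch materializes the mask explicitly, which is the conceptual form only, and the function names are hypothetical.

```python
NEG_INF = float("-inf")


def position_ids(seq_len, start=0):
    """Sequential position labels for seq_len tokens."""
    return list(range(start, start + seq_len))


def causal_mask(seq_len):
    """Additive mask: row t has 0.0 for columns 0..t and -inf after that."""
    return [
        [0.0 if col <= row else NEG_INF for col in range(seq_len)]
        for row in range(seq_len)
    ]


print(position_ids(4))
for row in causal_mask(4):
    print(row)
```

Adding -inf to a score before softmax drives that position’s attention weight to zero, which is how the future tokens drop out of each row.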
For sequences where prefix caching is in play, the runtime also marks the boundary at which the cache hit ended and new computation begins. The K and V vectors for the cached portion get read from the prompt cache; only the tokens after the cache boundary get new K/V computation. Phase 3 covers the mechanics.
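The boundary itself can be sketched as a longest-common-prefix computation over token IDs. This is a simplification with hypothetical names: production runtimes typically match whole fixed-size blocks of tokens by hash rather than comparing token by token.

```python
def cache_boundary(cached_ids, prompt_ids):
    """Length of the longest shared prefix between a cached sequence and a new prompt."""
    boundary = 0
    for cached, new in zip(cached_ids, prompt_ids):
        if cached != new:
            break
        boundary += 1
    return boundary


cached = [1, 10, 11, 12, 13]            # tokens whose K/V are already stored
new_prompt = [1, 10, 11, 42, 43, 44]    # incoming request

boundary = cache_boundary(cached, new_prompt)
to_compute = new_prompt[boundary:]                      # needs fresh K/V
new_positions = list(range(boundary, len(new_prompt)))  # their position IDs

print(boundary, to_compute, new_positions)
```

Note that the position IDs for the uncached tokens continue from the boundary rather than restarting at zero, so the cached and freshly computed portions line up as one sequence.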
Then: into the scheduler’s batch
The request has now been converted from a JSON message array to:
- A string templated with role markers and special tokens.
- A sequence of integer token IDs.
- An array of position IDs.
- An attention mask.
- Optional projected vectors from a vision or audio encoder.
- Optional cache-boundary metadata for prefix caching.
The model has still not been touched. The scheduler now packs this prepared request into a batch with other requests in its queue. Phase 3 — Scheduling and Prefill covers how that batch gets assembled and what happens when it actually hits the GPU.
The wrapper is universal. So is the integer assignment that precedes it.