§ Trackr.Live

Phase 5 — Streaming Back and Rendering

The decode loop is running. New tokens are coming out one at a time. The user is staring at their browser, waiting for the response to start appearing. The bytes need to travel from the GPU through the gateway, across the wire, and into a renderer that turns them into formatted text as they arrive. This phase covers that whole back-half: detokenization, streaming transport, output-side moderation, tool-call interception, and the client-side Markdown rendering that most production chat UIs still get subtly wrong.

The model has done its work. From here it is plumbing and parsing.

Detokenization, with a UTF-8 problem

Each token coming out of the decode loop is an integer ID. Before any of those IDs leave the inference server, they need to be converted back to text. The conversion is the tokenizer’s decode() function, and on byte-level BPE tokenizers (which is most of them, including the GPT, Llama, and Mistral lineages) the conversion has an annoying property: a single character can span multiple tokens, and a single token can hold a fraction of a multi-byte character.

For most ASCII text the issue is invisible. The token for “the” emits “the” and you’re done. The issue shows up with anything outside ASCII. A common emoji like 🤔 is four bytes in UTF-8 (F0 9F A4 94), and a byte-level BPE tokenizer may have learned to split that emoji across two or three tokens depending on what was in the training corpus. The same goes for Chinese, Japanese, and Korean text, where most characters occupy three bytes in UTF-8 and the tokenizer’s merge rules don’t align with character boundaries.

If the server decodes each token independently and emits the result the moment the token is generated, it will sometimes emit half of a multi-byte character. The client receives a sequence of bytes that does not form valid UTF-8 mid-stream. Some renderers display a replacement glyph. Some try to re-decode on the next chunk and get the boundary wrong. Some fail silently. None of these outcomes is acceptable to users.

The fix is incremental detokenization with byte-level buffering. The server holds back any bytes that don’t form a complete UTF-8 sequence at the end of the current decode output and prepends them to the next token’s decoded bytes. Modern tokenizer libraries support this directly. Hugging Face’s tokenizers library has a DecodeStream helper that handles it. tiktoken exposes the raw bytes of each token (decode_single_token_bytes), which the caller can buffer in the same way. vLLM and SGLang both implement it in their default streaming paths.
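A minimal sketch of the hold-back behavior, using Python's standard incremental UTF-8 decoder rather than any particular tokenizer library. The class name and the per-token byte strings are invented for illustration; a real server would obtain each token's bytes from its tokenizer.

```python
import codecs


class StreamingDetokenizer:
    """Holds back incomplete UTF-8 sequences between tokens.

    Assumes the caller can obtain the raw bytes for each token ID;
    how that happens depends on the tokenizer and is not shown here.
    """

    def __init__(self) -> None:
        # The incremental decoder keeps trailing partial bytes internally.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def push(self, token_bytes: bytes) -> str:
        """Return whatever text is now complete; may be an empty string."""
        return self._decoder.decode(token_bytes, final=False)

    def finish(self) -> str:
        """Flush at end of stream; dangling bytes are replaced, not dropped."""
        return self._decoder.decode(b"", final=True)


# Hypothetical token split: the four bytes of the emoji land in two tokens.
token_stream = [b"Hmm ", b"\xf0\x9f", b"\xa4\x94", b" ok"]

detok = StreamingDetokenizer()
pieces = [detok.push(tok) for tok in token_stream]
pieces.append(detok.finish())
```

The second push returns an empty string because the emoji is only half there; the third returns the complete character. Nothing invalid ever reaches the wire.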

Deeper: what byte-level BPE actually emits and why it bites.
A byte-level BPE tokenizer’s vocabulary is over byte values, not character values. During training, the tokenizer merges frequent byte pairs into single tokens. A multi-byte UTF-8 character is just a sequence of bytes from the tokenizer’s perspective, and there’s no guarantee that the bytes for one character end up in one token. After enough training, the tokenizer usually does learn to keep frequent characters intact, but rare characters (ones the corpus has few examples of) can be split arbitrarily. The implication for streaming: the server cannot just call decode([new_token_id]) and emit the result. It has to maintain a small byte buffer across tokens, decode opportunistically when the buffer ends on a UTF-8 character boundary, and hold the rest for the next step.

The same buffering mechanic handles two further cases: tokens that decode to nothing (some special tokens, padding tokens, the BOS in some configurations) and tokens that decode to a substring of a larger sequence that hasn’t completed yet. In all of these cases the decoded output for a step might be empty, partial, or longer than the new token “should” contribute. The buffering layer absorbs all of it.

Channels, stops, and reasoning tokens

Once the token is detokenized, the server checks for stop conditions. There are four common ones:

The model emitted an end-of-sequence token (<|eot_id|> on Llama 3, the equivalent on every other model family).
The output contains a developer-supplied stop string, matched against detokenized text rather than token IDs.
The max_tokens limit has been reached.
A structured-decoding grammar engine has reached an accepting terminal state.

If any condition matches, the loop ends after the current token and the server moves to closing the stream. Otherwise the token is queued for emission to the client.
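The four checks can be sketched as a single function called once per decoded token. The names (StopConfig, check_stop, the reason strings) are hypothetical, and the grammar engine is reduced to a boolean the caller supplies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StopConfig:
    eos_token_id: int
    stop_strings: tuple[str, ...] = ()
    max_tokens: int = 1024


def check_stop(
    token_id: int,
    text_so_far: str,
    tokens_generated: int,
    grammar_accepting: bool,
    cfg: StopConfig,
) -> Optional[str]:
    """Return the reason generation should stop, or None to continue."""
    if token_id == cfg.eos_token_id:
        return "eos"
    # Stop strings are matched on detokenized text, so a stop string that
    # spans several tokens is still caught.
    if any(s in text_so_far for s in cfg.stop_strings):
        return "stop_string"
    if tokens_generated >= cfg.max_tokens:
        return "max_tokens"
    if grammar_accepting:
        return "grammar_complete"
    return None


cfg = StopConfig(eos_token_id=2, stop_strings=("\nUser:",), max_tokens=5)
reason = check_stop(7, "answer\nUser:", 3, False, cfg)
```

This sketch rescans the whole output on every step for clarity. A production server would likely check only the tail, and would also hold back any text that could be the beginning of a stop string so that the client never sees part of it.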

For models with reasoning modes (the o-series, DeepSeek-R1, Claude extended thinking, Gemini thinking, and the smaller open-weights models that have copied the pattern), there is a second step before emission: channel routing. These models emit two kinds of tokens. There are visible-answer tokens, which the client renders. There are hidden reasoning tokens, which the model uses internally and which the client either shows separately (collapsed by default) or does not show at all, depending on the vendor’s policy.

The split is delimited by special tokens the model was trained to emit, and vendors differ in how much of it they expose. Anthropic’s extended thinking surfaces the reasoning as a separately typed content block. The o-series and Gemini thinking keep the raw reasoning tokens hidden and expose them mainly as a token count for billing. DeepSeek-R1 emits the reasoning inline, wrapped in <think> tags that are visible in the raw output. Each scheme has different implications for client UIs that want to show the reasoning trace.

The channel router inside the server reads the token stream and assigns each token to either the “thinking” channel or the “answer” channel based on which delimiters have been seen so far. Answer-channel tokens are streamed to the client as the visible response. Reasoning-channel tokens are counted for the final billing report and, depending on the API, either streamed separately or returned in a separate field of the final response object.
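A minimal sketch of such a router follows. It assumes the delimiters arrive as whole decoded pieces and uses <think> tags purely as an example; the class and field names are hypothetical, and each model family defines its own delimiters.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChannelRouter:
    """Routes decoded pieces to a 'thinking' or 'answer' channel.

    The delimiter strings are an assumption for illustration; they are
    treated here as arriving in one piece, like single special tokens.
    """

    open_tag: str = "<think>"
    close_tag: str = "</think>"
    in_thinking: bool = False
    thinking: list[str] = field(default_factory=list)
    reasoning_tokens: int = 0

    def route(self, piece: str) -> Optional[str]:
        """Return the piece if it belongs to the answer channel, else None."""
        if piece == self.open_tag:
            self.in_thinking = True
            return None
        if piece == self.close_tag:
            self.in_thinking = False
            return None
        if self.in_thinking:
            # Accumulated for billing and for any separate reasoning field.
            self.thinking.append(piece)
            self.reasoning_tokens += 1
            return None
        return piece


router = ChannelRouter()
pieces = ["<think>", "2+2", " is 4", "</think>", "The answer", " is 4."]
answer = [out for p in pieces if (out := router.route(p)) is not None]
```

The router is a two-state machine, so it costs almost nothing per token. The delimiters themselves are consumed and never reach either channel.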

Once classification is done, the answer-channel token is emitted as a stream chunk. The chunk size is typically one token, but some implementations batch a few tokens together for efficiency when the network is the bottleneck.

The streaming transport

The chunk leaves the inference server and starts its trip back to the client. The transport in 2026 is overwhelmingly Server-Sent Events over HTTP. SSE is a one-way streaming protocol: the server sends a sequence of small text-encoded events with a known framing (data: prefix, blank-line separator) and the client reads them as they arrive. The choice is not exotic. SSE works through every reasonable load balancer and CDN, requires no protocol upgrade, debugs easily with curl, and is implemented in every modern HTTP client library. The vendor-side simplicity is part of why it won out over WebSockets for this use case.
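To make the framing concrete, here is a small sketch of a reader that turns raw SSE lines into event payloads. It handles only the data: field and comment lines, and the JSON payload shape in the sample is invented for illustration.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event from an iterable of SSE lines.

    Handles only the `data:` field and comment lines; `event:`, `id:` and
    `retry:` are ignored in this sketch.
    """
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        if line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            data_parts.append(value[1:] if value.startswith(" ") else value)


wire = [
    'data: {"delta": "Hel"}\n',
    "\n",
    ": keep-alive\n",
    'data: {"delta": "lo"}\n',
    "\n",
    "data: [DONE]\n",
    "\n",
]
events = list(iter_sse_data(wire))
```

Because the format is line-oriented text, the same stream can be read by eye in a terminal, which is a large part of why it is easy to debug.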

A subset of vendors offer gRPC streaming as an alternative, typically for enterprise customers running their own observability and connection management. gRPC streams have lower overhead per chunk and stronger ordering guarantees but require gRPC-aware infrastructure all the way to the client. They are rare on the public chat APIs and common on the internal RPC paths that vendor SDKs use under the hood.

[Figure: a horizontal pipe carrying small labeled token packets from a datacenter on the left to a laptop on the right, with a translucent moderation valve partway along the pipe that can redirect or halt the flow.]

The stream is not a dumb pipe. Several things happen to each chunk between the inference server and the client.

Streaming output moderation, optionally. A safety classifier watches the stream as it flows. If the model is producing content that violates the vendor’s policy, the moderator can redact specific tokens (replacing them with a placeholder), halt generation mid-stream (closing the stream with a policy error), or pass the chunks through unchanged. The moderation runs asynchronously with generation, which is the trick that makes it viable. The streaming throughput is not gated on the moderation latency, but a policy violation can still terminate the stream within a few hundred milliseconds.

Incremental usage accounting. The gateway counts tokens as they emit, updating the request’s token usage in something close to real time. This is what makes mid-stream cancellation accurate for billing. If a user closes the tab halfway through a 4,000-token response, the bill reflects the tokens that actually streamed, not the full intended output.

Tool-call interception. When the model is emitting a tool call rather than a normal response, the gateway parses the call as the tokens arrive. Different vendors expose this differently. OpenAI returns tool calls as structured fields in the streaming delta, with the function name and the argument JSON building up across chunks. Anthropic returns them as content blocks of a specific type. Either way, the streaming logic on the client has to assemble the tool call from chunks before it can be executed. Most client SDKs hide this complexity behind an event handler that fires when a complete tool call has been seen. Phase 6 covers what happens when the tool call exits the model side and the developer’s code takes over.
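The client-side assembly step might look like the sketch below. The delta shape is modeled loosely on the OpenAI-style streaming format (an index, an optional id, and fragments of the function name and argument JSON); the exact field names are an assumption and vary by vendor.

```python
import json


def assemble_tool_calls(deltas: list[dict]) -> list[dict]:
    """Merge streamed tool-call fragments into complete calls.

    Each delta is assumed to carry an `index`, and optionally an `id`, a
    function `name`, and a fragment of the JSON `arguments` string.
    """
    calls: dict[int, dict] = {}
    for delta in deltas:
        call = calls.setdefault(
            delta["index"], {"id": None, "name": "", "arguments": ""}
        )
        if delta.get("id"):
            call["id"] = delta["id"]
        fn = delta.get("function", {})
        call["name"] += fn.get("name") or ""
        call["arguments"] += fn.get("arguments") or ""

    # The argument string is only valid JSON once every fragment has arrived.
    return [
        {**call, "arguments": json.loads(call["arguments"])}
        for _, call in sorted(calls.items())
    ]


deltas = [
    {"index": 0, "id": "call_1", "function": {"name": "get_weather", "arguments": ""}},
    {"index": 0, "function": {"arguments": '{"city": "Os'}},
    {"index": 0, "function": {"arguments": 'lo"}'}},
]
calls = assemble_tool_calls(deltas)
```

Keying on the index lets several tool calls interleave in one stream. Parsing the arguments only at the end avoids running a JSON parser over text that is incomplete by construction.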

Buffering and back-pressure. Some clients are slower than the inference server. If the server emits tokens faster than the client can consume them, the bytes pile up somewhere in between. Most stacks rely on TCP back-pressure to throttle the server when the client’s buffer fills, which works but can stall the inference loop if it goes on too long. Production-quality vendor SDKs implement a small client-side buffer and a flow-control mechanism that keeps decode running smoothly even when the client is rendering slowly.
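The principle can be shown in-process with a bounded queue standing in for the network buffers. This is an analogy rather than a model of TCP itself; the queue size and the simulated render delay are arbitrary choices.

```python
import asyncio


async def produce(queue: asyncio.Queue, chunks: list[str]) -> None:
    for chunk in chunks:
        # put() suspends when the queue is full, so a slow consumer
        # throttles the producer instead of letting memory grow.
        await queue.put(chunk)
    await queue.put(None)  # sentinel: end of stream


async def consume(queue: asyncio.Queue, delay: float) -> list[str]:
    received = []
    while (chunk := await queue.get()) is not None:
        await asyncio.sleep(delay)  # simulate slow rendering
        received.append(chunk)
    return received


async def run_demo() -> list[str]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)
    chunks = [f"tok{i}" for i in range(20)]
    _, received = await asyncio.gather(
        produce(queue, chunks), consume(queue, delay=0.001)
    )
    return received


received = asyncio.run(run_demo())
```

The bounded queue caps memory at four chunks, but it also means the producer stalls whenever the consumer falls behind. That is the same trade-off the paragraph above describes for the inference loop.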

The chunks flow through the gateway, out to the edge, across the user’s connection, and into the client. Transit for each chunk is typically under 200 ms, and once the connection is warm, chunks arrive about as fast as the server emits them.

Deeper: SSE versus gRPC versus WebSocket for LLM streaming.
SSE is the default because it requires nothing exotic: it is plain HTTP with a specific content type and a streaming response. Every CDN, every proxy, every browser, every HTTP library handles it. It is one-way (server to client), which is exactly the shape an LLM stream needs. WebSockets offer bidirectional communication, which LLM streams don’t need, and the upgrade handshake adds a round trip and has to be supported by every proxy on the path. gRPC offers better framing and ordering but requires HTTP/2 end-to-end and is not friendly to most consumer-facing infrastructure. The vendor that picks SSE pays no infrastructure penalty for it. The vendor that picks WebSockets or gRPC inherits compatibility headaches without a corresponding upside, which is why every major chat API converged on SSE somewhere between 2023 and 2024.

The client renders incrementally

The chunks arrive at the client one at a time. Each chunk contains a small string of text that needs to be appended to the current message and, if the message includes Markdown, rendered into formatted output as it grows.

The naive approach is to concatenate the chunks into a string and re-render the whole string from Markdown every time a chunk arrives. This works at small scale and breaks at every scale beyond that. Each Markdown re-parse is a non-trivial chunk of work. Re-rendering on every chunk produces flicker as the parser’s interpretation of partial input changes. A code block that has not seen its closing fence yet renders as inline code until the closing fence arrives, at which point it suddenly converts to a code block, flickering the whole layout in the process.

[Figure: a chat bubble progressively filling in token by token, with a fenced code block and a Markdown table materializing as content arrives, while later text is still rendering.]

The right approach is an incremental Markdown parser that maintains its own state across chunks. Tokens of Markdown structure (a fence open, a list item, an emphasis marker) are tracked as state transitions, and the parser knows what to do when the next chunk extends or contradicts the partial state. Fenced code blocks are rendered as code from the moment the opening fence arrives, with the language hint applied immediately. Tables are rendered as the rows arrive rather than waiting for the closing pipe of the last row. LaTeX math expressions are detected by their delimiters and rendered with KaTeX or MathJax as they complete.

Several specific cases trip up production renderers in 2026.

Nested code fences. A fenced code block whose content includes a Markdown code fence (which happens often when an LLM is explaining Markdown itself) needs to track the outer fence’s delimiter to know when it has actually closed. In CommonMark, a closing fence must use the same character as the opening fence and be at least as long, so the outer fence has to be the longer one. Most renderers handle the common case of a quadruple-backtick outer fence with a triple-backtick inner one; many of them break on more exotic combinations.
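CommonMark's closing rule (same fence character, at least as long as the opener, nothing after it) can be sketched as a small predicate. The function names are hypothetical, and the sketch skips details such as the restriction on backticks inside an info string.

```python
import re
from typing import Optional

# Up to three spaces of indent, then a run of 3+ backticks or tildes.
FENCE_RE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")


def parse_fence(line: str) -> Optional[tuple[str, int, str]]:
    """Return (fence_char, fence_length, trailing_text), or None."""
    m = FENCE_RE.match(line.rstrip("\n"))
    if not m:
        return None
    run, rest = m.group(1), m.group(2)
    return run[0], len(run), rest.strip()


def closes(open_char: str, open_len: int, line: str) -> bool:
    """True if `line` closes a block opened with open_len * open_char."""
    fence = parse_fence(line)
    if fence is None:
        return False
    char, length, rest = fence
    return char == open_char and length >= open_len and rest == ""


# A four-backtick outer block is not closed by a three-backtick line,
# but a three-backtick outer block IS closed by a four-backtick line.
inner_survives = not closes("`", 4, "```")
outer_breaks = closes("`", 3, "````")
```

The last two lines show why the order of fence lengths matters: a longer inner fence terminates a shorter outer one, so a renderer has to remember the opener's length rather than just the fact that a block is open.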

Tables under streaming. A Markdown table renders correctly only after enough rows have arrived to know the column structure. In GFM-style pipe tables that means the delimiter row (the second line), which has to match the header row’s cell count before the block counts as a table at all. Rendering each row as it arrives produces a layout that re-flows when the second row’s cell count disagrees with what the first row suggested. Some renderers cope by hiding the table until the first blank line after it; others render incrementally and flicker. The right answer (render with a placeholder column count, then re-flow once) is rare.
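A sketch of the "placeholder, then confirm" idea for GFM-style pipe tables: the header row gives a provisional column count, and the delimiter row either confirms it or shows the block was not a table. The helper names are hypothetical and escaped pipes are not handled.

```python
import re
from typing import Optional

# A delimiter cell is dashes with optional alignment colons, e.g. --- or :-:
DELIM_CELL = re.compile(r"^:?-+:?$")


def split_row(line: str) -> list[str]:
    """Split a pipe-table row into trimmed cells (no escaped-pipe handling)."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def confirmed_columns(header: str, delimiter: str) -> Optional[int]:
    """Return the column count once the delimiter row confirms a table."""
    head, delim = split_row(header), split_row(delimiter)
    if len(head) == len(delim) and all(DELIM_CELL.match(c) for c in delim):
        return len(head)
    return None


header = "| model | tokens/s |"
provisional = len(split_row(header))             # layout placeholder
confirmed = confirmed_columns(header, "|---|--:|")
```

With this split, the renderer can reserve a two-column layout as soon as the header arrives and re-flow at most once, when the second line settles the question.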

LaTeX inside code, code inside LaTeX. A $ inside a code block should not start a math expression. A backtick inside a math expression should not start a code span. Renderers that treat the input as a flat token stream get this wrong. The correct behavior requires the parser to respect block-level context when interpreting inline markers.
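For the inline case, the context rule amounts to a scanner in which only the delimiter that opened a span can close it. The sketch below is deliberately simplified: single-backtick code spans and single-dollar math only, no escapes, and a hypothetical function name.

```python
def scan_inline(text: str) -> list[tuple[str, str]]:
    """Split one line into ('text' | 'code' | 'math', content) spans.

    Simplified: single-backtick code spans and single-dollar math only,
    with no escape handling.
    """
    spans: list[tuple[str, str]] = []
    mode, start = "text", 0
    closers = {"code": "`", "math": "$"}
    openers = {"`": "code", "$": "math"}

    for i, ch in enumerate(text):
        if mode == "text" and ch in openers:
            if i > start:
                spans.append(("text", text[start:i]))
            mode, start = openers[ch], i + 1
        elif mode != "text" and ch == closers[mode]:
            # Only the delimiter that opened the span can close it, so a
            # `$` inside code or a backtick inside math stays literal.
            spans.append((mode, text[start:i]))
            mode, start = "text", i + 1

    if mode != "text":
        # Unterminated span: fall back to plain text, delimiter included.
        spans.append(("text", text[start - 1:]))
    elif start < len(text):
        spans.append(("text", text[start:]))
    return spans


spans = scan_inline("cost `$5` and $x`y$ done")
```

A flat tokenizer would see three dollar signs and two backticks and have to guess at the pairing. Carrying the current mode makes the pairing unambiguous.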

Citations and link hydration. Chat clients that show inline citations (Perplexity-style, the various RAG products, Claude’s web-search responses) need to detect citation markers as they arrive, hold the surrounding text until the citation is complete, and replace the marker with a styled link. Doing this incrementally without flicker is harder than it looks because the marker pattern depends on the model’s specific formatting convention.
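One way to sketch the hold-back step, assuming markers of the form [12] and a lookup table from citation number to URL. Both the marker convention and the class are hypothetical, and the styled link is represented here as a Markdown link.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


class CitationHydrator:
    """Holds back text that may be the start of a citation marker.

    Assumes markers look like `[12]`; real products use model-specific
    conventions, so the pattern here is purely illustrative.
    """

    def __init__(self, urls: dict[int, str]) -> None:
        self._urls = urls
        self._held = ""

    def feed(self, chunk: str) -> str:
        text = self._held + chunk
        # Hold from the last '[' onward if it could still become a marker.
        cut = text.rfind("[")
        if cut != -1 and re.fullmatch(r"\[\d*", text[cut:]):
            text, self._held = text[:cut], text[cut:]
        else:
            self._held = ""
        return CITATION.sub(self._link, text)

    def flush(self) -> str:
        """End of stream: release anything still held, unmodified."""
        out, self._held = self._held, ""
        return out

    def _link(self, m: re.Match) -> str:
        n = int(m.group(1))
        url = self._urls.get(n)
        return f"[{n}]({url})" if url else m.group(0)


hydrator = CitationHydrator({1: "https://example.com/a"})
out = [hydrator.feed(c) for c in ["Paris is big [", "1", "]. Next"]]
out.append(hydrator.flush())
```

Only the few characters that might belong to a marker are delayed; everything before them is emitted immediately, which keeps the visible lag to a handful of characters.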

Artifacts and canvas rendering. Some clients (Claude artifacts, ChatGPT canvas) detect specific block types and render them in a side pane rather than inline. The detection has to happen during streaming so the side pane opens as soon as the artifact is identified, not after the whole response is done.

When the stream ends, the client receives a terminal signal (a data: [DONE] message on OpenAI-compatible APIs, a typed event in the various proprietary protocols), the message is committed to the conversation history, and the composer is unlocked for the next user input.

Deeper: how a real incremental Markdown parser handles an unclosed fence.
The parser maintains a stack of open block-level structures. When an opening fence arrives, the parser pushes a code-block frame onto the stack and switches its tokenizer mode to “literal” until a matching close fence arrives. While in literal mode, no further Markdown parsing happens to the content. If the chunk stream ends without a close fence (because the model crashed mid-response or the connection dropped), the parser closes the code block at the end of input rather than treating the whole content as inline text. The fallback behavior is what determines whether a partially-streamed response looks reasonable or looks broken. Most production parsers get this right in the happy path and ship subtle bugs on the edge cases, which is why testing your chat client against deliberate mid-stream failures is more useful than testing it against the common case.
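A reduced sketch of that behavior, with the stack collapsed to a single code-block frame and output expressed as a list of events. It works line by line and only recognizes backtick fences; a real renderer would also paint partial lines as they arrive. All names are hypothetical.

```python
import re

# Backtick fence with an optional one-word info string.
FENCE = re.compile(r"^ {0,3}(`{3,})\s*(\S*)\s*$")


class StreamingFenceParser:
    """Tracks fenced code blocks across arbitrarily split chunks."""

    def __init__(self) -> None:
        self._partial = ""   # text after the last newline seen so far
        self._open_len = 0   # 0 means we are not inside a code block
        self.events: list[tuple[str, str]] = []

    def feed(self, chunk: str) -> None:
        self._partial += chunk
        *lines, self._partial = self._partial.split("\n")
        for line in lines:
            self._handle_line(line)

    def finish(self) -> None:
        """End of stream: flush the last line and close any open block."""
        if self._partial:
            self._handle_line(self._partial)
            self._partial = ""
        if self._open_len:
            self.events.append(("code_close", "end_of_input"))
            self._open_len = 0

    def _handle_line(self, line: str) -> None:
        m = FENCE.match(line)
        if self._open_len:
            # Literal mode: only a long-enough bare fence ends the block.
            if m and len(m.group(1)) >= self._open_len and not m.group(2):
                self.events.append(("code_close", "fence"))
                self._open_len = 0
            else:
                self.events.append(("code_line", line))
        elif m:
            self._open_len = len(m.group(1))
            self.events.append(("code_open", m.group(2)))
        else:
            self.events.append(("text_line", line))


# The stream is cut off before the closing fence arrives.
parser = StreamingFenceParser()
for chunk in ["Here:\n``", "`py\nprint(", "1)\n"]:
    parser.feed(chunk)
parser.finish()
```

Note that the opening fence is split across two chunks and still recognized, and that the truncated stream ends with an explicit close event rather than leaving the block dangling. Feeding deliberately truncated input like this is a cheap way to exercise the failure path.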

A contestable claim worth sitting with. The model is the same. The token stream is the same. The user-perceived quality difference between chat clients in 2026 comes mostly from the client-side renderer’s behavior during streaming, and most renderers in production are still subtly broken on tables, nested code blocks, and citation hydration. The fix is not exotic. It is careful state-machine work on the client side, and it is too often owned by a team that thinks of it as low-priority polish work. The renderer is not low-priority polish work. It is the entire UI for the product.

Then: the request closes

When the stream ends, the request is essentially done. The tokens have been generated, transported, and rendered. There are still a few things to do on the server side: final billing accounting, prompt-cache writes (if the request added new cacheable prefixes), trace span closure, and connection cleanup. There is also the special case where the response was a tool call instead of a final answer, which means the request loops back through the model with the tool’s output appended.

Phase 6 — Tool Loops, Termination, and What the User Actually Feels covers all of that, plus the latency model that ties the whole pipeline together: time-to-first-token versus time-per-output-token, the throughput-versus-latency tradeoffs, and what users actually perceive from all of this.

The wrapper is universal, even when it ends.