Phase 1 — From Enter Key to Datacenter
A request goes through three layers of mostly-not-the-model machinery before the GPU sees a token. The client serializes the conversation. The edge terminates the connection and runs the standard CDN gauntlet. The API gateway authenticates the caller, decides whether the caller is allowed to spend money this minute, picks a model version, checks the prompt cache, and either admits the request to the inference scheduler or sends back an error.
None of this is the model. All of it can break the model.
Most of the work in this phase is identical across every vendor that ships a chat-completion endpoint, which is the main reason it’s worth understanding in detail. Almost every production failure with an LLM API originates somewhere in this stretch. Wrong chat template format (that’s Phase 2). Stale auth token (here). Rate limit hit (here). Wrong model alias resolution (here). Prompt cache miss that the developer assumed was a hit (here). A nontrivial fraction of the support tickets a hosted LLM provider sees are debugging events from this part of the pipeline.
On the laptop

The moment a user presses Enter, the chat client takes nine actions in rapid succession before any of those actions involve a network hop.
Capture and lock. The composer fires its submit handler. The UI immediately locks the input field so a fast user can’t double-submit while the request is in flight. This sounds obvious; it is also the most common source of duplicated billing in custom-built clients written by people who didn’t know about it.
Client-side validation. Empty messages get rejected. Character limits get checked. Attachment counts and sizes get checked. The validation is purely a UX shortcut (every check runs again server-side because the client cannot be trusted), but it saves the user a round trip when the input is obviously malformed.
Conversation assembly. The client gathers the prior turns of the conversation, any client-supplied system prompt, and the new user message into an ordered message array of [{role, content}, ...]. This is the canonical shape the major APIs converged on around 2023, and it is now load-bearing across every client library, retrieval-augmented prompting framework, and agent runtime in the ecosystem.
Attachment handling, if any. Images, PDFs, and audio get resized and re-encoded client-side, uploaded to blob storage (S3, GCS, or a vendor-managed bucket), and replaced in the payload with URLs or base64 strings. The reason for the upload-then-reference dance instead of just shipping the bytes inline is that an image inlined as base64 in a chat-completion request adds 30+% to the request size for the encoding overhead alone, which matters at scale and at edge-latency thresholds.
JSON serialization. The message array, the model identifier, and all the generation parameters get packed into one JSON payload. The generation parameters are the developer’s lever on the inside of the model: temperature, top_p, top_k, max_tokens, stop, frequency_penalty, presence_penalty, tools, tool_choice, response_format, seed, stream. Phase 4 covers what each of these does mechanically; here they are just keys in a JSON object.
Auth attached. API key as a bearer token, a session JWT, or an OAuth token, depending on whether this is a B2B API call, a logged-in consumer session, or a developer key. The auth header is checked at every layer below this point (edge, gateway, and sometimes again at the inference scheduler), so the same token traverses the entire path.
DNS. The API hostname resolves through DNS, typically to an Anycast IP that maps to the nearest edge PoP regardless of which datacenter the actual inference will run in. This is a critical detail. A user in Frankfurt resolving api.openai.com does not get a US-East IP. They get the closest edge.
Transport. TLS handshake on a cold connection, or reuse of a warm HTTP/2 or HTTP/3 connection from the keep-alive pool. SDKs from the major vendors maintain warm pools by default; ad-hoc clients written against curl or requests without session reuse pay the handshake cost on every request.
Bytes transmitted. The request leaves the laptop.
Deeper: how much TLS handshake actually costs.
A cold TLS 1.3 handshake adds one round trip on top of the underlying TCP handshake, which itself is one round trip (or zero with TCP Fast Open, which most public APIs do not honor). For a user 50 ms from the nearest edge, that’s roughly 100 ms before the first request byte goes over the encrypted channel. A warm HTTP/2 connection skips both. The difference is the gap between a chat app that feels “instant” and one that doesn’t. Production SDKs default to connection pooling for this reason.
Across the wire

The packet hits the nearest edge PoP within tens of milliseconds. From there, five things happen before it reaches anything that knows about LLMs.
Connection termination. The edge terminates the TLS connection. This is what Anycast routing was for: connecting the user to the nearest network endpoint, not the nearest application server. The application server is somewhere in the middle of a continent. The edge is somewhere in the city.
DDoS protection. Volumetric attacks get absorbed at the edge before they can saturate the path to the gateway. Hosted LLM APIs are attractive DDoS targets because each request is expensive on the receiving side; an attacker that can force a few thousand inference requests per second can run up a victim’s bill or starve legitimate users. Edge-level DDoS mitigation absorbs this.
WAF inspection. The Web Application Firewall scans the request for known attack patterns: SQL injection signatures, common XSS payloads, the usual OWASP top-ten suspects. WAFs were not designed for natural-language LLM payloads, which is why they have a higher false-positive rate on chat-completion traffic than on traditional REST traffic. A user asking “how do I escape single quotes in SQL” can trip a default WAF rule. Vendors selectively tune their WAFs for LLM endpoints to avoid this, but the tuning is recent and uneven in 2026.
Coarse rate limiting. Per-IP and per-key rate limits at the edge catch obvious abuse before it consumes any gateway compute. The gateway will rate-limit again more precisely. The edge layer exists to stop the floods that would overwhelm the gateway itself.
Geographic routing and load balancing. The request gets routed to a specific serving region. Data-residency requirements may pin this. A request from a user in Germany may be required by contract or regulation to route to an EU-only inference cluster. The L7 load balancer in the chosen region then hands the request off to a specific API gateway instance.
Deeper: Anycast versus DNS-based geo-routing.
Anycast advertises the same IP from multiple physical locations and lets the network’s routing protocols pick the closest one based on BGP hop count. The user’s TCP packets just naturally arrive at the nearest PoP. The alternative is DNS-based geo-routing, where the DNS resolver returns different IPs to different geographic resolvers. Anycast is simpler and faster (no DNS-level decision required), but it depends on stable BGP paths and is harder to manually override. Most major LLM APIs use Anycast for the production endpoints and reserve DNS-based geo-routing for specific cases like data-residency-pinned regional endpoints (api.eu.openai.com,api.eu.anthropic.com, and equivalents).Deeper: why edges run rate limits and gateways run rate limits.
The two layers solve different problems. The edge cares about floods: more requests per second than the gateway can ingest. The gateway cares about quotas: more tokens per minute than a specific paying customer is entitled to. Pushing token-aware quota enforcement out to the edge would require synchronizing quota state across every PoP in real time, which is operationally expensive and not worth it for the rare flood-without-quota case. So the layers coexist, with edge throttling at the IP and connection level and gateway throttling at the customer and key level.
At the gateway

By the time the request reaches the API gateway, the network plumbing is done and the LLM-specific work begins. The gateway is the orchestration boundary. Ten things happen here, in roughly this order.
Authentication. The API key or JWT is validated. The org, project, and user identity are resolved. This is the lookup that turns “Bearer sk-…” into “this is acme-corp, project foo, user 7.”
Authorization. Once the identity is known, the gateway checks scope and entitlement. Is this key allowed to call this model? Are reasoning-mode tokens permitted on this account? Is the requested context length within the customer’s tier? Some of these checks are cheap (a key-permission lookup) and some require database round trips.
Rate limiting and quota. This is the precise version of what the edge did coarsely. Per-key requests-per-minute (RPM) and tokens-per-minute (TPM) limits are enforced with token-bucket algorithms. Concurrency limits cap how many in-flight requests a single key may have. Tier-based limits enforce paid-vs-free differentiation. A request can pass auth and authorization and still get rejected here.
Billing pre-check. Before the gateway spends any GPU compute, it verifies that the caller has credit balance or available spend on their tier. For pay-as-you-go customers, this is a balance check. For metered enterprise contracts, this is a contract-limit check. For free tier, this is a “you have N requests left this day” check. The economic model is the same across all of them: never spend GPU minutes on a customer that can’t pay for them.
Schema validation. The request JSON is checked against the API contract. Are all required fields present? Are the generation parameter values in their permitted ranges? Is the requested model an actual model? Is the tool schema well-formed? This catches developer-side bugs before they consume inference cycles.
Input moderation, optionally. A safety classifier screens the prompt. Some vendors run this synchronously and reject obvious policy violations before the request reaches the model. Others run it asynchronously and rely on output-side moderation (covered in Phase 5). Vendors with stricter policy postures (Anthropic, Google) tend to do more here; vendors targeting developer-platform use cases (OpenAI’s API tier, Mistral) tend to do less. The synchronous-versus-asynchronous choice is a latency-versus-policy tradeoff. Synchronous moderation adds 20 to 100 ms to every request’s time-to-first-token but catches violations before the GPU is engaged. Asynchronous lets the request proceed and audits results in parallel, which is faster for the user and trades off some policy strictness.
Model routing. The gateway resolves a model alias to a concrete version and a target GPU/TPU fleet. gpt-4o is not a model; it is an alias that routes to whichever specific version the vendor has currently designated as gpt-4o. Similarly claude-sonnet-4 routes to whichever claude-sonnet-4-YYYYMMDD is currently live. Vendors use this layer to A/B test new versions, run canary deployments to a small fraction of traffic, and shadow candidate models against production responses without affecting the caller’s billing. Production deployments that depend on a moving alias get bitten when the alias rolls over to a new version with subtly different output distributions, usually noticed first by the integration’s evaluation suite and only later by its customers. Pin to dated identifiers in production.
Prompt-cache lookup, optionally. The gateway hashes a stable prefix of the prompt (the system prompt, the static parts of the tool schemas, often a chunk of conversation history) and checks whether the inference fleet has a precomputed KV cache for that prefix. Anthropic exposes this explicitly through cache_control markers. OpenAI runs automatic prompt caching on prefixes of a certain length. Gemini calls it context caching. A cache hit at this stage skips most of prefill, which is by far the most expensive part of the request. Phase 3 goes deep on what prefix caching actually does at the KV level.
Observability. A trace span is opened. Metrics are emitted (request count, queue depth, model selected, expected token counts). The request is logged subject to the vendor’s retention and privacy policy. Most production failures get diagnosed from this telemetry, which is one reason vendors that don’t expose request-trace IDs to developers (some still don’t, in 2026) frustrate the people building on them. The trace ID is the only stable identifier that bridges client-side error logs, gateway-side rejection reasons, and inference-side actual processing telemetry. Without it, debugging a flaky integration is a guessing game.
Admission to the scheduler. Finally, the request is admitted into the inference scheduler’s queue. This is where Phase 3 begins.
A contestable claim worth sitting with. Every hosted-LLM API gateway in 2026 does roughly the same ten things in roughly the same order. The features above are commodity. The actual differentiation between vendors does not start at the gateway. It starts at admission to the scheduler — the queueing discipline, the batching strategy, the way prompt-cache hits propagate into KV memory, the way concurrent requests share GPU time — and ends at the streamed delta the client receives. Two vendors with identical gateways and different scheduler implementations will produce visibly different latency distributions for the same workload.
Deeper: stable-prefix hashing for prompt caches.
Prompt cache lookup needs to identify whether a given prompt shares a usable prefix with a previously seen prompt. The hash is over a normalized representation of the message array up to some boundary (typically the last user message, or up to an explicit cache-control marker). The implementation has to handle whitespace normalization (so trivial reformatting doesn’t break the hit rate), serialization-order determinism (the same content in a different JSON key order has to produce the same hash), and boundary alignment (cache blocks have a fixed granularity, often 256 or 512 tokens, so the prefix has to round to a boundary). The cache key is then queried against the KV memory allocated to the target model’s fleet. Hit rates above 80% on production traffic are common when the system prompt and tool schemas are stable across requests, which is the main reason vendors push so hard on prompt structure stability.
Then: admission to the scheduler
The request is now sitting in a queue in the inference scheduler, waiting for the next batch to assemble. Three things have happened that the model has not yet seen.
The chat template has not been applied to the message array. The text has not been tokenized. The position IDs have not been built. The KV cache has not been allocated. The actual interior of the model has not been touched.
Phase 2 — Building the Prompt covers the templating and tokenization that happen between admission and the first GPU operation. Phase 3 — Scheduling and Prefill covers the batching, the KV cache allocation, and the first compute-bound pass through the layers.
The wrapper is universal, and it ends here.