Jun 26, 2026chatgptaisystem designarchitecture designhldlldsoftware engineering applied aimodelsclaudegeminibackend engineeringredisstreams

System Design for ChatGPT

A deep dive into ChatGPT's system design. Learn to serve 225M DAU, stream sub-500ms responses, optimize expensive GPUs, and efficiently manage context at scale.

System Design of ChatGPT. How it handles 225 daily active users and 20k prompts/sec.

What is ChatGPT ???

Unless you've been living under a rock, you know what ChatGPT is. It's a conversational AI product where users send prompts in natural language and get responses streamed back from a large language model. Conversations are saved, so users can come back to an old chat and pick up right where they left off.

Functional Requirements

Users should be able to send a prompt int a chat and receive a response.
Users should be able to view past chats and resume a convo, with the chat's prior context carried into the prompt.

Non-functional Requirements

Non-functional requirements cover the properties of the system that matter to the user and the business. This is where the interviewers like to loose their balls.

ChatGPT servers over 225 million daily active users at the time of writing this article, so that's the scale we will design against.

With that in mind, here're the requirements that actually shape the design:

The system should have low time-to-first-token (< ~500ms), with continuous, smooth streaming after that. This is important for user experience as users don't like to stare at a blank screen after sending the prompt for too long.
The system should prioritize high availability over strong consistency for conversation state (~99.9%+). It's better to return an error or a degraded experience than to block the whole system on perfectly synchronized chat state.
The system should scale to: 225M DAU, ~20k prompts/sec at peak, ~120k concurrent in-flight streams).

Defining the Core Entities

To satisfy our requirements, we'll need the following entities:

User: 😆 obviously. what else you thought? It carries the user info like (free and paid tier), which is going to matter a lot once we get to fairness and scheduling.
Chat: A single convo thread. Belongs to one user, groups an ordered sequence of Messages.
Message: One turn in a chat, either a user prompt or an assistant response. Carries the chatId, a role, that content and a token count.

API Interface

Here, we'll define one or two endpoints as per requirements and move on.

First, a user starts a new chat. We use POST because we're creating a Chat entity.

POST /chat -> { chatId }

Body: {}

Next.the user sends a promtp gets a response back. This is the one endpoint that isn't a plain request/response. The assistant message is streamed back token by token, an wd we return a runId, a handle for this in-flight response tha the client uses to follow the stream.

We use POST, because we're creating a Message in the server.

POST /chats/{chatId}/messages -> Message (streamed via SSE) Body: { content
}For the second functional requirement, we list a user's chats for the sidebar and load the messages for one chat. Both are GETs with cursor pagination, since a heavy user can have thousands of chats and a long chat can have thousands of messages.

GET /chats?cursor={cursor}&limit={n} -> Chat[] GET /chats/{chatId}/messages?cursor={cursor}&limit={n} -> Message[]

Notice the userId never shows up in a path or body. It comes from the session token or JWT, and chat ownership is checked server-side on every request.

High-Level Design

We'll go one by one through the functional requirements. Both are short, and we're going to keep the design deliberately naive, with synchronous calls and no streaming or queues. The plan is to go simple and then layer on the complexities.

1) Users should be able to send a prompt and receive an AI-generated response

When a user opens a chat, types a prompt and hits enter, the client sends that prompt to our backend and eventually gets a response back. Here're the minimum set of components to make that happend:

Client: Basically the user's mode of interaction, could be a browser, or a mobile app.
API Gateway: The entry point of request. It handles auth, rate limiting and routes requests to the right services.
Chat Service: A stateless service that owns chat and message persistence and orchestrates the call to the model. It's cheap to run and easy to scale horizontally because of the stateless nature.
Database: I'll go with Postgres here, it will hold our Chat and Message .
Inference Service: Owns the GPU model workers that actaully run the LLM. We trea the model itself as a black box that takes in a prompt and returns a completion.

Here's how these interact when a user sends a prompt:

The client sends a POST request to /chats/{chatId}/messages.
Gateway authenticates the request and forwards it to the Chat Service.
The Chat Serivce writes the user's message to the messages table.
The Chat Service makes a synchronous call to the Inference Service, which runs the prompt through the model and returns the full completion once it's done.
The Chat Service writes the assistant message back to the messages table and returns it to the client.

Let's briefly acknowledge the elephant in the room. This is fully synchronous, so the client sits on that HTTP call until the entire response is generated, and a long response can take up to 30 seconds. That's 30 seconds of blank screen, which violates our TTFT requirement and feels broken. On top of that, the Chat Service is calling a GPU worker directly with no admission control, which falls apart the moment GPUs become the bottleneck. We'll fix the first problem with streaming and the second with a scheduling layer later on.

2) Users should be able to view past chats and resume a conversation.

Users expect to come back tomorrow, scroll their old conversations, open one, and keep going as if the model remembers everything. Two things have to happen here, a read path for past chats and context carry-over on the next turn.

We add the read endpoints off the existing Postgres tables and a context-loading step inside the Chat Service.

For the read path:

GET /chats returns the user's chats ordered by recent activity, cursor-paginated for the sidebar.
GET /chats/{chatId}/messages returns one chat's messages, cursor-paginated so a long conversation doesn't load all at once.

For context carry-over, when the user sends a follow-up prompt on an existing chat:

The Chat Service queries the messages table for the previous messages in that chatId, ordered by creation time.
It builds the prompt by concatenating those messages (with their roles, user vs assistant) followed by the new user message.
It sends that combined prompt to the Inference Service, just like the first turn.
The new assistant message gets written back to the messages table, so the next turn can read it too.

This is the simplest thing that works. But sending full history every turn has two obvious problems:

it breaks once a conversation grows past the model's context window
it gets more expensive every turn as input tokens are billed per call.

We'll fix them later

Diving Deep and fixing the bottlenecks

With the functional requirements met, it's time to go back and earn the non-functional requirements and make your interviewers wet.

How do we stream tokens back fast, and keep the stream smooth?

Our synchronous design makes the user wait up to 30 seconds for a blank screen to turn into a full answer.

As stated earlier, how fast the client receives a response is determined by the TTFT (time to first token) of the model which is purely a latency problem.

So, what is the best way to stream the resopnse from the server to the client???

Server Sent Events is purpose built for one way to server to client streaming. The client opens an ordinary HTTP request with an EventSource, the server holds that response open and keeps writing data: events to it as tokens are produced, and the browser fires an event for each one. It runs over plain HTTP with no protocol upgrade, so every proxy and load balancer in the path already handles it, and the browser starts rendering on the very first event.

Alright, SSE gets that first token onto the screen in milliseconds, which settles the responsiveness half of the requirement. All good... right?

The server holds the response open and writes tokens to it, as if one fixed server reliably sits between this user and the model for the full 30 seconds. Our Chat Service tier is stateless, horizontally scaled behind a load balancer, and redeployed all day long. The moment you take that seriously, two questions appear that the transport choice never touched. How does a token actually get from the worker that produced it over to whichever Chat Service instance is holding this user's SSE connection right now? And what happens to the stream when that instance is replaced mid-generation?

We can decouple the Chat Service and the model workers with a Redis stream between them, keyed by runId we mint when the generation starts.
The worker no longer cares which Chat Service instance is connected, and the Chat Service no longer cares which worker is generating. They rendezvous at runId.

The worker XADDs each token delta to the stream for that runId. The Chat Service instance reads with a blocking XREAD starting from the last entry ID this client has already seen, forwards new entries down the SSE connection, and keeps track of the latest ID as it goes.

We can keep the cost bounded with MAXLEN , so each stream retains only a recent window of tokens rather than growing without limit, and we can give the stream a short TTL so it's reclaimed once the generation is done.

For durability, when the generation finishes, the worker writes the complete message to the database and that persisted message is the durable copy a client can always refetch.

How do we route and schedule requests across GPU workers?

GPUs are the bottleneck, full stop. They're the most expensive resource in the system and the one in shortest supply, so how we route work to them decides both our cost and our latency under load. It's worth pausing on just how expensive. A frontier model is far too big to fit on a single GPU, so its weights get split across a whole box of them, and serving 120k concurrent streams means standing up thousands of those boxes. That puts you at tens of thousands of GPUs for this one model, and the labs running systems at this scale spend staggering amounts on compute, easily hundreds of millions to billions of dollars a year. When the hardware costs that much, every percentage point of utilization you leave on the table is real money, which is exactly what makes the scheduling decisions in this section worth the effort.

At peak, the traffic can spike upto 20k prompts/sec, and a very good pattern to handle spikes / bursty traffics is to introduce a queue between the Chat Service and the workers. The ChatService enqueues a generatoin request (prompt plus runId ) and returns fast.

But, the queue still treats each generation as a independent job on a worker, which leaves a lot of GPU performance unclaimed. GPUs are most efficient when they process many sequences together, and one-request-per-worker-slot doesn't exploit that.

So, we add continuous batching.Instead of running one sequence at a time, the worker generates for many sequences together, advancing every sequence in the batch by one token per forward pass, and it adds and drops sequences from the batch on the fly so a finished generation is immediately replaced by a queued one.

Keeping the batch full is what keeps the GPU busy, and it's the single biggest lever on utilization, enough that one replica can hold dozens of sequences in flight instead of one. Why batching pays off this much comes down to how a GPU actually spends its time, which is worth its own aside just after these options.

Then add backpressure so the system degrades predictably instead of melting. The queue has a bounded depth and an admission policy, so when it's too deep, we stop pretending we can keep up and start rejecting, deferring, or shedding requests rather than letting latency grow without bound. The nice property here is that capping the depth also caps the wait, since an admitted request can only ever be a bounded distance from the front. So a user is either served within that bound or, if we're past it, gets a fast "we're at capacity, try again" rather than being left on a spinner forever. A quick honest no beats an endless maybe.

Putting it together:

The Chat Service enqueues a generation request with its runId and a token-cost estimate.
Admission control checks queue depth. If we're over the limit, the request is shed or deferred.
A GPU worker pulls the request and folds it into its running batch via continuous batching.
The worker streams tokens to the runId stream as the batch generates.
When generation finishes (or is cancelled), the worker drops the sequence from its batch and pulls the next request.

How do we keep heavy users from monopolizing GPUs while giving paid tiers a better experience?

We can't do just traditional rate limiting, like using a tocket bucket in Redis for individual user using the userId as the key maintaining a count of in-flight requests. While it does solve the starvation problem, one user can no longer monopolize the pool, but it's still counting requests. It treats five 50-token replies the same as five 30k-token monsters, even though the latter is orders of magnitude more GPU.

The fix is to meter what's actually scarce, which is tokens (or estimated compute), not request count. When a generation comes in, we estimate how expensive it'll be from the prompt length and the requested max output, check that against how much of the user's token budget is left, and reject or delay it if they're over. The budget refills over time, the way an API quota does. Now the thing we're counting is the thing we're actually paying for.

Then layer tier priority on top. Paid users get bigger budgets, higher concurrency limits, and higher priority in the queue. Under normal load everyone's served fast and nobody notices; the tiers only diverge when capacity is tight, which is exactly when paying customers should feel the difference. Finally, make degradation explicit. When demand still exceeds capacity even after admission control, free users are throttled, deferred, or downgraded (routed to a smaller, cheaper model) before paid users feel anything.

Upgrading to Priority Queue for 'tier aware' routing

The challenge here is that estimating the cost up front is imperfect, you don't know the true output length until generation finishes, so budgets work off an estimate and reconcile against actuals afterwards.

As conversations get longer, how do we control inference cost without making the assistant feel forgetful?

A 50-turn chat at ~500 tokens per turn means we're shipping ~25k input tokens on the next prompt, and since input tokens are billed per call, cost and latency climb with every single turn. It has a hard ceiling, once the conversation grows past the model's context window the request simply can't fit.

The simplest fix is to keep the most recent N turns and drop everything older. It bounds cost, but the assistant becomes obviously forgetful. Reference something from earlier in a long chat and it has no idea what you're talking about.

The biggest lever here is prefix caching. Across turns in a single conversation, most of the prompt is identical from one turn to the next, since the system prompt and everything already said in the chat all repeat. Modern inference servers can cache the model's intermediate state (the KV cache) for a stable prompt prefix and reuse it instead of recomputing it from scratch every turn. Only the tail of the prompt, the newest message, actually changes, so prefix caching cuts both the cost and the latency of processing the input, which helps our TTFT goal directly. It pairs naturally with the sticky-ish routing we'd already want, sending a conversation's turns to a worker that already has its prefix warm.

The catch is that the cache has to be managed. A worker has finite memory and can only keep so many conversations' prefixes warm at once, so prefixes are evicted on something like an LRU basis.

Prefix caching makes re-reading the history cheap, but it doesn't move the hard ceiling, because a conversation can still outgrow the context window no matter how cheaply we process it. That's where a rolling summary comes in. Keep the most recent turns verbatim and compress older history into a running summary, so the prompt becomes the system prompt, then the summary of older turns, then the last few turns word-for-word, and finally the new user message. Recent context, where most follow-ups point, stays exact while older context is preserved in compressed form, so the assistant still remembers the gist of a long conversation without carrying every word. The summary updates in the background as the conversation grows, folding the oldest verbatim turns in as they age out. The two levers complement each other, with the summary keeping the prompt inside the window and prefix caching keeping the stable part cheap to process.

This is where another lever called RAG (Retrieval Augemented Generation) comes in. You can retrieve only the older facts relevant to the current turn instead of summarizing everything.

Challenges:

Summarization isn't free, it's an extra call to a cheaper model.
it can lose details.

Cancelling a run and reclaiming the GPU

When the user hits stop, the client makes a plain HTTP call to POST /chats/{chatId}/runs/{runId}/cancel.

The Chat Service flips that Generation's status to cancelled and publishes a cancel signal on a control channel keyed by the runId. The worker checks that channel between token batches, and the moment it sees the signal it drops the sequence and stops generating.

A cancelled 30-second generation that keeps running is pure wasted GPU, and GPU is the scarcest, most expensive thing in the whole system, so reclaiming it the instant the user stops caring is real money back.

Worth being clear that closing the tab is not a cancel. We built the Redis Stream and SSE reconnect precisely so a dropped connection isn't read as the end of a run, and like ChatGPT we keep generating in the background. The user can reopen the chat and reconnect to the stream, or just refetch the finished message from Postgres once it's done. Cancellation has to be an explicit signal from the user, never an accident of the network.

Alright, we went from the naive synchronous version into something that streams fast and stays smooth, schedules GPU efficiently and keeps cost in check

Here's roughly how it all fits together:

Things we didn't cover

There's plenty of things that we didn't do, that's there in ChatGPT:

Safety and moderation
How are the weights split across different GPUs and why does a model need a whole box of GPUs.
Speculative decoding.
Multi modal input and cross-chat memory.

Well, that's it for this one.

BTW, I'm looking for opportunities as a Software Engineer / Applied AI Engineer, if you're hiring or know someone who's hiring, you can reach me out @moyezrabbbani.work@gmail.com or DM me on X and LinkedIn.

Thanks for reading, you can check more of my blogs here.