What AI chatbots are actually doing under the hood
This article is the first of three "state of play" posts that explain how Large Language Models work, aimed at readers with the level of understanding I had in mid-2022: techies with no deep AI knowledge. It grows out of part 19 in my series working through Sebastian Raschka's book "Build a Large Language Model (from Scratch)".
As a techie, I'm sometimes asked by less-geeky friends how tools like ChatGPT, Claude, Grok, or DeepSeek actually work under the hood. Over time I've refined and improved my answer, partly because I found better ways to explain it, and partly because my own understanding got better as I started to learn how to build one myself from scratch.
This post is an attempt to describe in plain English what is going on. It's made up of a series of descriptions, each one building on the one before but refining it a bit, until we reach something that's strictly accurate, but which might have been a bit overwhelming if presented in one go.
If you're reading this as a techie who wants to learn about AI, then I definitely recommend that you read until the end. But if you're reading for general interest, you can safely stop reading at any time -- hopefully you'll wind up with a more solid understanding of what's going on, even if it doesn't include all of the tiny details.
Before we kick off, though: AI bots aren't just chatbots any more; many are multimodal -- for example, ChatGPT can analyse images you provide or generate its own. They also have "thinking" modes, which allow them to ponder their answers before replying. I won't cover those aspects -- just the text-only systems that we had 12 months or so ago, in 2024. And I'll only be talking about inference -- that is, using an existing AI, rather than the training process used to create them.
So, with all that said, let's kick off! We can start by looking at next-word prediction.
Next-word prediction
AI chatbots are based on Large Language Models: LLMs. Most of the ones we use are based on some version of Generative Pretrained Transformers (GPT), a design that OpenAI came up with[1] -- hence "ChatGPT".
These LLMs work by predicting the next token -- for now, let's treat that as meaning "word" -- in a text, given the text so far.
Let's say that you fed one of them this text:
The fat cat sat on
It might predict that the next word is "the".
You can use this to generate longer texts -- if you just added the previous prediction onto the original text and fed it back in:
The fat cat sat on the
...then it might predict "mat" as the next one. That's the "Generative" bit in GPT: they generate text.
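In code, that loop is about as simple as it sounds. Here's a minimal sketch in Python, assuming a hypothetical `predict_next_word` function (standing in for the whole LLM, which we'll treat as a black box for now):

```python
def generate(predict_next_word, text, max_new_words=10):
    """Repeatedly predict the next word and tack it onto the end of the text."""
    for _ in range(max_new_words):
        next_word = predict_next_word(text)  # hypothetical call into the LLM
        text = text + " " + next_word        # append the prediction...
    return text                              # ...and the longer text becomes the input next time round

# With a real model behind predict_next_word, "The fat cat sat on"
# might grow into "The fat cat sat on the mat", one word per loop iteration.
```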
Chatbots work by using a clever trick to make use of this next-word prediction. Imagine that you have a bot, and the user starts a conversation with the question "What is the capital of France?". You put that into a template, and you feed your LLM something like this:
The following is a transcript of a conversation between a user, "User", and an
AI bot, "Bot". The bot is helpful and friendly.
User: What is the capital of France?
Bot:
...and then ask it to predict the next word. Remember that the LLM will look at all of that text, and then try to work out what the most likely word would be in that specific context.
It might come up with "The" -- after all, that's a reasonable word for the imaginary bot in this transcript to use to start its reply. You would feed in the whole thing with that "The" tacked onto the end, and it might predict that the word after that should be "capital". Repeat again and again, and you might get, word-by-word, the rest of the bot's response: "of France is Paris. Is there anything else I can help you with?".
But there's nothing making it stop there; it's predicting a transcript, so it would then predict "User: " and would start trying to guess what the user might ask next! So you stop it when it predicts "User:" and turn control over to the human who is using it, so that they can ask their next question.
When they've typed in their question and clicked the button to send it to our chatbot, we need to start the prediction process again. LLMs don't remember anything -- each time they're starting from scratch with a fresh text, trying to work out what the next token might be. So we provide the LLM with the whole transcript so far -- the intro, the previous user question and bot response, the new message from the user, finishing with "Bot: " again so that it knows that it's time to start generating the bot's answer.
The LLMs we use day-to-day essentially do that -- there are lots of differences (for example, they've been trained on thousands of transcripts so they don't need the whole "The following is..." first sentence) and there are advances beyond that, but that's the core idea.
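As a very rough sketch of that trick in Python -- leaning on the same hypothetical `predict_next_word` function as before, and ignoring real-world details like token limits and exact stop conditions -- a chatbot wrapper might look something like this:

```python
TEMPLATE = (
    'The following is a transcript of a conversation between a user, "User", '
    'and an AI bot, "Bot". The bot is helpful and friendly.\n\n'
)

def chat_turn(predict_next_word, transcript, user_message, max_words=200):
    """Append the user's message, then generate the bot's reply word by word,
    stopping when the model starts predicting the user's next turn."""
    transcript += f"User: {user_message}\nBot:"
    reply_words = []
    for _ in range(max_words):
        next_word = predict_next_word(TEMPLATE + transcript)  # hypothetical LLM call
        if next_word.strip() == "User:":   # hand control back to the human
            break
        transcript += " " + next_word
        reply_words.append(next_word)
    return transcript + "\n", " ".join(reply_words)

# Each turn re-sends the whole transcript so far: the LLM itself remembers nothing
# between calls, so the "memory" of the conversation lives entirely in that text.
```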
So, what are these LLMs? They are created by setting up a deep neural network -- loosely speaking, a program inspired by how brains work -- with a specific design called "Transformers". It's then trained on a vast amount of data -- imagine all of the text on the Internet, downloaded and tidied up (at least to some degree).
It starts with a large number of random numbers that control how the network works (known as parameters or weights -- hundreds of billions of them for models like ChatGPT), and the process of showing it text and using that to refine those numbers to make it better at predicting the next word is called (logically enough) training -- specifically pre-training, which is the P in GPT.[2]
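To put a concrete face on "parameters": if you have the Hugging Face transformers library installed (my choice for illustration, not something this post depends on), you can count them for the smallest open GPT-2 model, which is tiny by modern standards at roughly 124 million:

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")  # the smallest GPT-2 variant

# Every parameter is just a number that was nudged, again and again, during training.
total = sum(p.numel() for p in model.parameters())
print(f"{total:,} parameters")  # roughly 124 million for this model
```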
Given all of that, we have our first working description: an LLM receives some text and works out what the next word is most likely to be. Usefully, we can then stick that on at the end of the original text and repeat the process to generate more text. And we can turn it into a chatbot by making it try to complete a transcript of a bot's conversation with a user, putting the real user's messages and any previous responses it's generated in as part of that transcript.
Tokens
There are a lot of words in the English language, and we don't want our LLM to need to think about all of them. We also don't want to limit it to English, and we want it to be able to handle random stuff -- if the user types skjflksdjflsdfjsd in as their question, it wouldn't be great if the LLM crashed saying "Unknown word".
So we break our text down into tokens, which are sequences of letters that appear commonly in the training data. There can be quite a lot of them -- the tokeniser created for version 2 of the original GPT from OpenAI has a list of over 50,000! So a lot of short words wind up having their own tokens, at least in English. For example, using that GPT-2 tokeniser, "The fat cat" breaks down into these three tokens:
'The', ' fat', ' cat'
The Portuguese equivalent "O gato gordo" (presumably less-represented in the training data) breaks down to more tokens:
'O', ' g', 'ato', ' g', 'ord', 'o'
Longer and rarer words, regardless of the language, can wind up being split across several tokens. For example, "Pseudoscientists prevaricate habitually" becomes the following GPT-2 tokens:
'P', 'se', 'udos', 'cient', 'ists', ' pre', 'var', 'icate', ' habit', 'ually'
(One thing that might stand out here is that the tokens include the spaces before the words where necessary -- there are different tokens for 'fat' and ' fat'. Best to consider that as "one of those things" for the purposes of this post.)
The LLM has a vocabulary of the different tokens it knows about -- 50,000 tokens for GPT-2. They're common words or bits of words, plus all of the letters in the alphabet (including "special" ones like "é") so that it can "spell things out" letter by letter, or understand inputs letter by letter if it is provided with something unusual. The tokeniser gives each token a numerical ID; for example, "The fat cat"'s tokens above have IDs 464, 3735 and 3797 respectively for GPT-2.
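If you'd like to poke at this yourself, OpenAI's tiktoken library ships that GPT-2 tokeniser. A quick sketch, assuming tiktoken is installed:

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")    # the GPT-2 tokeniser
print(enc.n_vocab)                     # vocabulary size: 50,257 tokens

ids = enc.encode("The fat cat")
print(ids)                             # [464, 3735, 3797]

# Decode each ID on its own to see how the text was split -- spaces and all
print([enc.decode([i]) for i in ids])  # ['The', ' fat', ' cat']

print(enc.encode("O gato gordo"))      # the Portuguese version needs more tokens
```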
So, to refine what the LLM is doing a little: it's receiving a list of token IDs that represent the text that we want it to process, and it's predicting the ID of the most likely next token.
Logits and probability
There are obviously quite a lot of different tokens that could come after "The fat cat sat on the". For example, "desk" might be a valid one:
(Photo, taken while writing this: tail blocking the mouse, as usual. Why are cats like this?)
The LLM needs to reflect this. So it doesn't just output the predicted next token -- instead, it outputs a vector (i.e. a list) of numbers, one for every token in its vocabulary. The higher the number in the nth position in the list, the more likely it thinks it is that the token with ID n is the next one.
"The fat cat sat on the mat" is a cliché that it will probably have encountered lots of times when it was being trained on its huge pile of text, so the position relating to the token ID for "mat" will have a very large number in the vector. But "desk" is still fairly plausible, so that token's ID will have quite a high number too. On the other hand, "eat" is really unlikely ("The fat cat sat on the eat" makes no sense), so that would be a smaller number.
These numbers are called logits.[3] That's a neural-network term of art, but it basically means "numbers produced by an AI that, if we normalised them in some way, would make sense as probabilities".
So our next refinement is that the LLM receives a list of token IDs that represent the text that we want to process, and outputs a set of logits: one number for every token in the vocabulary, each of which represents the likelihood of that token being the next one.
Tying that back to the chatbot: we could just pick the most likely token each time around, but in practice it's good to inject a bit of randomness into the process -- we pick one based on those probabilities, so that we'll mostly say that cats are sitting on mats, but sometimes they sit on desks or laps, rarely on dogs, and (because the probability is so low) never on ungrammatical choices like "eat".
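Here's a hedged sketch of that whole step, using the real GPT-2 model through the Hugging Face transformers library (my choice for illustration -- it's not what any particular chatbot actually runs). It fetches the logits for "The fat cat sat on the", normalises them into probabilities with softmax, and then samples a next token weighted by those probabilities rather than always taking the single most likely one:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tok("The fat cat sat on the", return_tensors="pt")["input_ids"]
with torch.no_grad():
    logits = model(ids).logits       # one vector of logits per input token
last = logits[0, -1]                 # the predictions for the sequence as a whole

probs = torch.softmax(last, dim=-1)  # normalise the logits into probabilities

# Peek at the five most likely next tokens and their probabilities...
top = torch.topk(probs, 5)
for p, i in zip(top.values, top.indices):
    print(f"{tok.decode([i.item()])!r}: {p.item():.3f}")

# ...then pick one at random, weighted by probability, instead of always taking the top.
choice = torch.multinomial(probs, num_samples=1).item()
print("sampled:", tok.decode([choice]))
```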
Predict ALL teh tokens!!!
That's still not quite accurate, but we're almost there. One extra wrinkle is that the LLM doesn't just predict the next token for the sequence as a whole. Instead, for every token we provide it in the input sequence, it will predict a next one based on that token and the ones that came before it.
For example, if we feed it "The fat cat sat on the", then it will produce (in parallel) predictions for the next token for all of these sequences:
- "The"
- "The fat"
- "The fat cat"
- "The fat cat sat"
- "The fat cat sat on"
- "The fat cat sat on the"
So, you feed it a sequence of tokens, and it produces a sequence of vectors of logits, one vector for each of the input tokens. The logits for each token are the next-token predictions given the text up to and including that token. Let's make that a bit more concrete.
In its predictions for the first sequence above, where it is just "The", there will probably be a ton of possibilities, as lots of tokens could come next -- so in the logits vector there will be lots of high numbers, for "cat" but also "dog" or "naming".
For the second sequence, "The fat", things will be a little tighter, as there's a limit to the kinds of things that can be fat, but still "man" or "controller" might be quite likely.
Likewise, for all of the other sequences, there will be options, but in general, the longer the input sequence gets, the more it will "home in" on a particular completion. By the time it comes to predicting the logits for the full sequence, it's looking at a very recognisable cliché, so they will probably be pretty tightly centred on "mat".
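You can see all of those per-prefix predictions coming out of a real model in one go. This sketch reuses the GPT-2-via-transformers setup from the earlier one, again purely as an illustration:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tok("The fat cat sat on the", return_tensors="pt")["input_ids"]
with torch.no_grad():
    logits = model(ids).logits

print(logits.shape)  # (1, number of input tokens, vocabulary size of 50,257)

# Row i is the next-token prediction for the prefix ending at token i
for i in range(ids.shape[1]):
    prefix = tok.decode(ids[0, : i + 1])
    best_id = logits[0, i].argmax().item()
    print(f"{prefix!r} -> {tok.decode([best_id])!r}")
```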
When you're using an LLM in its normal next-token-prediction mode, doing all of these extra predictions might sound wasteful -- you're going to ignore all but the last logits vector, the predictions for the full sequence. But the predictions for the shortened sequences are really kind of side-effects of the prediction of the next token for the sequence as a whole. There's only a tiny bit of wastage in practice.
And when the LLM is being trained, they're actually kind of useful -- because you'll give it the sequence of token IDs representing "The fat cat sat on the" as its input, and tell it that its target output is the sequence representing "fat cat sat on the mat" -- that is, the input sequence shifted one token forward. It will learn that "fat" is a reasonable thing to follow the one-token sequence "The", that "cat" is a good follow-on to "The fat", and so on.[4]
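To make that shift-by-one idea concrete, here's a rough sketch of how the inputs and targets line up during training, with random numbers standing in for a real model's logits (so the loss value itself is meaningless -- it's just showing the plumbing). It assumes tiktoken and PyTorch are installed:

```python
import tiktoken
import torch
import torch.nn.functional as F

enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("The fat cat sat on the mat")

input_ids  = torch.tensor([ids[:-1]])  # everything up to (but not including) the last token
target_ids = torch.tensor([ids[1:]])   # the same sequence shifted one token forward

# Stand-in for the model's output: one vector of logits per input token
vocab_size = enc.n_vocab
logits = torch.randn(1, input_ids.shape[1], vocab_size)

# Each position's logits are scored against the token that really came next;
# training nudges the weights to push this number down.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), target_ids.reshape(-1))
print(loss.item())
```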
So, here's our final refinement of what the LLM does: it receives a sequence of token IDs that represent the text that we want to process, and outputs a sequence of vectors of logits, one for each token in the input sequence. The logits for each token are the prediction for what comes next, based only on the tokens from the start up to and including that specific token.
For our chatbot, we just throw away all but the last of those logits vectors, and then use it to work out a sensible next token for the sequence as a whole.
And that's it!
You're all set -- you know what's going into a GPT-style LLM, and what's coming out, and how that can be used to create a chatbot.
If you're reading this out of general interest, I hope it actually was (generally) interesting. Any comments would be much appreciated -- is there anything that's unclear or confusing? Or on the other hand, is there anything simple and obvious that I overexplained?
If you're reading it as a techie who wants to learn more, you're hopefully in a much better position than I was when I started on my project to learn how to build an LLM from scratch, when I pretty much only understood the second "Tokens" level above.
If you're curious about how those hundreds of billions of parameters actually take a sequence of numbers and turn them into a list of logits, that's exactly what the next post in this mini-series will start to unpack. It will be on the mathematical concepts that I had to learn -- understanding LLMs, at least at the level needed to build one (as opposed to the level needed to come up with an idea like GPT in the first place), doesn't need much more than high-school maths. So the post will go over that "not much more".
Hope you’ll join me for that one!
Thanks to Michael Mangion and Ricardo Guimarães for commenting on earlier versions of this post.
[1] GPT is an architecture that was created at OpenAI, based on work by many others -- notably the authors of the foundational paper "Attention Is All You Need".

[2] Post-training is when you show it task-specific things later on to make it better at a given job -- for example, taking an LLM trained on the Internet in general and showing it a whole bunch of chat transcripts so that it learns to specialise in completing them -- that is, making it a good chatbot.

[3] Pronounced with a soft "g", kind of like "lodge-its".
[4] This took me a while to understand: back in December, in my series on building an LLM, I thought it was odd that the sequence predicted during training was not just the single next token, and I only understood why in May. In my defence, I was quite busy.