How do LLMs work?
This article is the last of three "state of play" posts that explain how Large Language Models work, aimed at readers with the level of understanding I had in mid-2022: techies with no deep AI knowledge. It grows out of part 19 in my series working through Sebastian Raschka's book "Build a Large Language Model (from Scratch)".
In my last two posts, I've described what LLMs do -- what goes in, what comes out, and how we use that to create things like chatbots -- and covered the maths you need to start understanding what goes on inside them. Now the tough bit: how do we use that maths to do that work? This post will give you at least a rough understanding of what's going on -- and I'll link to more detailed posts throughout if you want to read more.
As in my last posts, though, some caveats before we start: what I'm covering here is what you need to know to understand inference -- that is, what goes on inside an existing LLM when you use it to generate text, rather than the training process used to create it. I'll write about training in the future. (This also means that I'm skipping the dropout portions of the code that I've covered previously; I'll bring those back in when I get on to training.)
I'll also be ignoring batching. I'll be talking about giving an LLM a single input sequence and getting the outputs for that sequence. In reality they're sent a whole bunch of input sequences at once, and work out outputs for all of them in parallel. It's actually not all that hard to add that on, but I felt it would muddy the waters a bit to include it here.
Finally, just to set expectations up-front: when I say "how do LLMs work", I'm talking about the structure that they have. They're essentially a series of mathematical operations, and these are meaningful and comprehensible. However, the specific numbers -- the parameters, aka weights -- that are used in these operations are learned through the training process: essentially, showing an LLM a huge pile of text (think "all of the Internet") and adjusting its weights until it gets really good at predicting the next token in a sequence.
Why one set of parameters might be better at that job than another is not something we understand in depth, and it's the subject of a highly active research area: AI interpretability. Back in 2024, Anthropic managed to find which parts of their LLM represented the concept of the Golden Gate Bridge, and put a demo version of their Claude chatbot online with those parts "strengthened", which gave surprisingly funny results. But doing that was really hard. We can't just look at an LLM and say "ah, that's where it thinks about such-and-such" -- people need to do things like ablations, where they remove part of it and see what effect that has on the results (a technique which has its own problems).
But while the meanings of the specific parameters that come out of the training process are hard to work out, there's still something important that we can understand -- the specific set of calculations that we use those parameters in.
People often say of LLMs that they are just large arrays of impenetrable numbers that no-one understands, and there's an element of truth in that. But it would be more accurate to say that each LLM is made up of a set of arrays of numbers -- yes, impenetrable ones -- that are used in a process whose specific details might be unclear, but whose overall shape is something we can understand.
Perhaps a metaphor is useful here: with a human brain, we don't know where the concept of "cat" is held, or what goes on when someone thinks about a cat. But we do know the general layout of the brain -- visual processing goes on in one place, audio in another, memories are controlled by this bit, and so on.
LLMs are a specific series of calculations; which calculations work well was determined by humans thinking hard about the problem, and we can understand what those calculations are doing. They're not just completely random neural networks that somehow magically do their work, having learned what to do through training.
So, with all that said, let's take a look at what those calculations are, and how they work.
An addendum to 'the maths you need to start understanding LLMs'
My last post, about the maths you need to start understanding LLMs, took off on Hacker News over the weekend.
It's always nice to see lots of people reading and -- I hope! -- enjoying something that you've written. But there's another benefit. If enough people read something, some of them will spot errors or confusing bits -- "given enough eyeballs, all bugs are shallow".
Commenter bad_ash made the excellent point that in the phrasing I originally had, a naive reader might think that activation functions are optional in neural networks in general, which of course isn't the case. What I was trying to say was that we can use one without an activation function for other purposes (and we do in LLMs). I've fixed the wording to (hopefully) make that a bit clearer.
ThankYouGodBless made a thoughtful comment about vector normalisation and cosine similarity, which was a great point in itself, but it also made something clear: although the post linked to an article I wrote back in February that covered the dot product of vectors, it really needed its own section on that. Without understanding what the dot product is, and how it relates to similarity, it's hard to get your head around how attention mechanisms work. I've added a section to the post, but for the convenience of anyone following along over RSS, here's what I said:
The dot product
The dot product is an operation that works on two vectors of the same length. It simply means that you multiply the corresponding elements, then add up the results of those multiplications:

a · b = a₁b₁ + a₂b₂ + … + aₙbₙ
Or, more concretely:

(1, 2, 3) · (4, 5, 6) = (1 × 4) + (2 × 5) + (3 × 6) = 4 + 10 + 18 = 32
This is useful for a number of things, but the most interesting is that the dot product of two vectors of roughly the same length is quite a good measure of how close they are to pointing in the same direction -- that is, it's a measure of similarity. If you want a perfect comparison, you can scale them both so that they have a length of one, and then the dot product is exactly equal to the cosine of the angle between them (which is logically enough called cosine similarity).
But even without that kind of precise normalisation (which requires calculating squares and roots, so it's kind of expensive), so long as the vectors are close in length, it gives us meaningful numbers -- so, for example, it can give us a quick-and-dirty way to see how similar two embeddings are.
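To make that concrete, here's a minimal sketch in plain Python -- a toy example of my own, not code from the book -- showing the dot product and cosine similarity side by side:

```python
import math

def dot(a, b):
    # Multiply corresponding elements, then sum the results
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Scale both vectors to length one, then take the dot product
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1, 2, 3]
b = [2, 4, 6]     # same direction as a, but twice as long
c = [-1, -2, -3]  # opposite direction to a

print(dot(a, b))                # 28 -- large and positive: similar directions
print(cosine_similarity(a, b))  # very close to 1.0: same direction
print(cosine_similarity(a, c))  # very close to -1.0: opposite directions
```

Note how cosine similarity ignores length entirely -- b points the same way as a, so it scores 1.0 even though it's twice as long -- while the raw dot product is affected by length, which is why it's only a trustworthy similarity measure when the vectors are roughly the same size.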
Unfortunately the proof of why the dot product is a measure of similarity is a bit tricky, but this thread by Tivadar Danka is reasonably accessible if you want to get into the details.
See you next time!
As promised, up next: how do we put all of that together, along with the high-level stuff I described about LLMs in my last post, to understand how an LLM works?
The maths you need to start understanding LLMs
This article is the second of three "state of play" posts that explain how Large Language Models work, aimed at readers with the level of understanding I had in mid-2022: techies with no deep AI knowledge. It grows out of part 19 in my series working through Sebastian Raschka's book "Build a Large Language Model (from Scratch)". You can read the first post in this mini-series here.
Actually coming up with ideas like GPT-based LLMs and doing serious AI research requires serious maths. But the good news is that just understanding how they work requires much less: if you studied maths at high school at any time since the 1960s, you did all of the groundwork then -- vectors, matrices, and so on.
One thing to note -- what I'm covering here is what you need to know to understand inference -- that is, using an existing AI, rather than the training process used to create it. That's also not much beyond high-school maths, but I'll be writing about it later on.
So, with that caveat, let's dig in!