Writing an LLM from scratch, part 12 -- multi-head attention
In this post, I'm wrapping up chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered batches, which -- somewhat to my disappointment -- didn't involve completely new (to me) high-order tensor multiplication, but instead relied on batched and broadcast matrix multiplication. That was still interesting on its own, however, and at least was easy enough to grasp that I didn't disappear down a mathematical rabbit hole.
The last section of chapter 3 is about multi-head attention, and while it wasn't too hard to understand, there were a couple of oddities that I want to write down -- as always, primarily to get it all straight in my own head, but also just in case it's useful for anyone else.
So, the first question is, what is multi-head attention?
Writing an LLM from scratch, part 11 -- batches
I'm still working through chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered dropout, which was nice and easy.
This time I'm moving on to batches. Batches allow you to run a bunch of different input sequences through an LLM at the same time, generating outputs for each in parallel, which can make training and inference more efficient -- if you've read my series on fine-tuning LLMs you'll probably remember I spent a lot of time trying to find exactly the right batch sizes for speed and for the memory I had available.
This was something I was originally planning to go into in some depth, because there's some fundamental maths there that I really wanted to understand better. But the more time I spent reading into it, the more of a rabbit hole it became -- and I had decided on a strict "no side quests" rule when working through this book.
So in this post I'll just present the basic stuff, the stuff that was necessary for me to feel comfortable with the code and the operations described in the book. A full treatment of linear algebra and higher-order tensor operations will, sadly, have to wait for another day...
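To make "batched and broadcast matrix multiplication" a bit more concrete, here is a minimal sketch in plain Python rather than PyTorch, so that it runs without any dependencies. The names matmul and batched_matmul, and all of the numbers, are made up for illustration and are not from the book; the idea is just that the same weight matrix is reused for every sequence in the batch.

```python
def matmul(a, b):
    """Ordinary matrix multiplication on nested lists: (n, k) @ (k, m) -> (n, m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def batched_matmul(batch, w):
    """Multiply every (tokens, dim) matrix in the batch by the same weights w.

    This mimics broadcasting: w has no batch dimension, so it is reused
    for each sequence in the batch.
    """
    return [matmul(seq, w) for seq in batch]


# Hypothetical batch of shape (2, 3, 2): 2 sequences, 3 tokens each, embedding size 2.
batch = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 1.0], [0.0, 0.0], [1.0, -1.0]],
]
# Hypothetical weight matrix of shape (2, 3), shared across the whole batch.
w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

out = batched_matmul(batch, w)  # shape (2, 3, 3)
print(out[0])
print(out[1])
```

Each sequence is handled independently of the others, which is why the work can be done in parallel on suitable hardware.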
Let's start off with the fundamental problem of why batches are a bit tricky in an LLM.
Writing an LLM from scratch, part 10 -- dropout
I'm still chugging through chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered causal attention, which was pretty simple when it came down to it. Today it's another quick and easy one -- dropout.
The concept is pretty simple: you want knowledge to be spread broadly across your model, not concentrated in a few places. Doing that means that all of your parameters are pulling their weight, and you don't have a bunch of them sitting there doing nothing.
So, while you're training (but, importantly, not during inference) you randomly ignore certain parts -- neurons, weights, whatever -- each time around, so that their "knowledge" gets spread over to other bits.
Simple enough! But the implementation is a little more fun, and there were a couple of oddities that I needed to think through.
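As a rough illustration of the idea, here is a sketch in plain Python (not the book's PyTorch code). The function name dropout and its parameters are hypothetical. One detail the summary above doesn't spell out is the rescaling: the values that survive are multiplied by 1 / (1 - p), which is the usual "inverted dropout" approach and keeps the expected total the same during training.

```python
import random


def dropout(values, p, training=True, rng=None):
    """Zero each value with probability p; scale the survivors by 1 / (1 - p).

    When training is False the values are passed through unchanged,
    because dropout is only applied while training.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(values)
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


values = [1.0] * 8
train_out = dropout(values, p=0.5, rng=random.Random(0))  # some zeros, the rest 2.0
eval_out = dropout(values, p=0.5, training=False)         # unchanged
print(train_out)
print(eval_out)
```

Which positions get zeroed changes on every call during training, so no single value can be relied on too heavily.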
Writing an LLM from scratch, part 9 -- causal attention
My trek through Sebastian Raschka's "Build a Large Language Model (from Scratch)" continues... Self-attention was a hard nut to crack, but now things feel a bit smoother -- fingers crossed that lasts! So, for today: a quick dive into causal attention, the next part of chapter 3.
Causal attention sounds complicated, but it just means that when we're looking at a word, we don't pay any attention to the words that come later. That's pretty natural -- after all, when we're reading something, we don't need to look at words later on to understand what the word we're reading now means (unless the text is really badly written).
It's "causal" in the sense of causality -- something can't have an effect on something that came before it, just like causes can't come after effects in reality. One big plus about getting that is that I finally now understand why I was using a class called AutoModelForCausalLM for my earlier experiments in fine-tuning LLMs.
Let's take a look at how it's done.
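Here is a minimal sketch of the masking idea in plain Python, with made-up attention scores. It is one common way to do it, not necessarily exactly how the book does: scores for later positions are set to negative infinity before the softmax, so they end up with a weight of zero. The function name causal_attention_weights is hypothetical.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores, ignoring every position after the current one."""
    weights = []
    for i, row in enumerate(scores):
        # Token i may only attend to tokens 0..i; later ones are masked out.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)  # always finite, because position i itself is never masked
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical raw attention scores for a three-token sequence.
scores = [
    [0.5, 1.0, 0.2],
    [0.3, 0.8, 0.1],
    [0.9, 0.4, 0.7],
]
weights = causal_attention_weights(scores)
for row in weights:
    print(row)
```

The first token can only attend to itself, so its row comes out as [1.0, 0.0, 0.0], and every row still sums to 1.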
Writing an LLM from scratch, part 8 -- trainable self-attention
This is the eighth post in my trek through Sebastian Raschka's book "Build a Large Language Model (from Scratch)". I'm blogging about bits that grab my interest, and things I had to rack my brains over, as a way to get things straight in my own head -- and perhaps to help anyone else that is working through it too. It's been almost a month since my last update -- and if you were suspecting that I was blogging about blogging and spending time getting LaTeX working on this site as procrastination because this next section was always going to be a hard one, then you were 100% right! The good news is that -- as so often happens with these things -- it turned out to not be all that tough when I really got down to it. Momentum regained.
If you found this blog through the blogging-about-blogging, welcome! Those posts were not all that typical, though, and I hope you'll enjoy this return to my normal form.
This time I'm covering section 3.4, "Implementing self-attention with trainable weights". How do we create a system that can learn how to interpret how much attention to pay to words in a sentence, when looking at other words -- for example, that learns that in "the fat cat sat on the mat", when you're looking at "cat", the word "fat" is important, but when you're looking at "mat", "fat" doesn't matter as much?
Writing an LLM from scratch, part 7 -- wrapping up non-trainable self-attention
This is the seventh post in my series of notes on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Each time I read part of it, I'm posting about what I found interesting or needed to think hard about, as a way to help get things straight in my own head -- and perhaps to help anyone else that is working through it too.
This post is a quick one, covering just section 3.3.2, "Computing attention weights for all input tokens". I'm covering it in a post on its own because it gets things in place for what feels like the hardest part to grasp at an intuitive level -- how we actually design a system that can learn how to generate attention weights, which is the subject of the next section, 3.4. My linear algebra is super-rusty, and while going through this one, I needed to relearn some stuff that I think I must have forgotten sometime late last century...
Writing an LLM from scratch, part 6b -- a correction
This is a correction to the sixth in my series of notes on Sebastian Raschka's book "Build a Large Language Model (from Scratch)".
I realised while writing the next part that I'd made a mistake -- while trying to get an intuitive understanding of attention mechanisms, I'd forgotten an important point from the end of my third post. When we convert our tokens into embeddings, we generate two for each one:
- A token embedding that represents the meaning of the token in isolation
- A position embedding that represents where it is in the input sequence.
These two are added element-wise to get an input embedding, which is what is fed into the attention mechanism. However, in my last post I'd forgotten completely about the position embedding and had been talking entirely in terms of token embeddings.
Surprisingly, though, this doesn't actually change very much in that post -- so I've made a few updates there to reflect the change. The most important difference, at least to my mind, is that the fake non-trainable attention mechanism used -- the dot product of the input embeddings -- is, while still excessively basic, not quite as bad as it was. My old example was that in
the fat cat sat on the mat
...the token embeddings for the two "the"s would be the same, so they'd have super-high attention scores for each other. When we consider that it would be the dot product of the input embeddings instead, they'd no longer be identical because they would have different position embeddings. However, the underlying point holds that they would be too closely attending to each other.
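A small numeric sketch of that point, in plain Python. All of the vectors here are invented for illustration, and it assumes one token per word, so the two "the"s sit at positions 0 and 5 and "cat" at position 2.

```python
def add(a, b):
    """Element-wise sum of two vectors."""
    return [x + y for x, y in zip(a, b)]


def dot(a, b):
    """Dot product of two vectors."""
    return sum(x * y for x, y in zip(a, b))


# Made-up token embeddings: both "the"s share the same one.
the_token = [1.0, 2.0, 0.5]
cat_token = [-1.0, 0.5, 2.0]

# Made-up position embeddings for positions 0, 2 and 5.
pos_0 = [0.1, 0.0, -0.1]
pos_2 = [0.0, 0.2, 0.1]
pos_5 = [-0.2, 0.1, 0.3]

# Input embedding = token embedding + position embedding.
first_the = add(the_token, pos_0)
second_the = add(the_token, pos_5)
cat = add(cat_token, pos_2)

print(first_the != second_the)     # the two "the"s are no longer identical
print(dot(first_the, second_the))  # but they still score highly against each other
print(dot(first_the, cat))         # compared with a different word
```

With these particular numbers the two input embeddings differ, but their dot product is still much larger than the one between the first "the" and "cat", which is the "too closely attending to each other" problem.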
Anyway, if you're reading along, I don't think you need to go back and re-read it (unless you particularly want to!). I'm just posting this here for the record :-)
Writing an LLM from scratch, part 6 -- starting to code self-attention
This is the sixth in my series of notes on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Each time I read part of it, I'm posting about what I found interesting as a way to help get things straight in my own head -- and perhaps to help anyone else that is working through it too. This post covers just one subsection of the trickiest chapter in the book -- subsection 3.3.1, "A simple self-attention mechanism without trainable weights". I feel that there's enough in there to make up a post on its own. It certainly gave me one key intuition that I think is a critical part of how everything fits together.
As always, there may be errors in my understanding below -- I've cross-checked and run the whole post through Claude, ChatGPT o1, and DeepSeek r1, so I'm reasonably confident, but caveat lector :-) With all that said, let's go!
Writing an LLM from scratch, part 5 -- more on self-attention
I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and posting about what I found interesting every day that I read some of it. In retrospect, it was kind of adorable that I thought I could get it all done over my Christmas break, given that I managed just the first two-and-a-half chapters! However, now that the start-of-year stuff is out of the way at work, hopefully I can continue. And at least the two-week break since my last post in this series has given things some time to stew.
In the last post I was reading about attention mechanisms and how they work, and was a little thrown by the move from attention to self-attention. In this blog post I hope to get that all fully sorted so that I can move on to the rest of chapter 3, and then the rest of the book. Raschka himself said on X that this chapter "might be the most technical one (like building the engine of a car) but it gets easier from here!" That's reassuring, and hopefully it means that my blog posts will speed up too once I'm done with it.
But first: on to attention and what it means in the LLM sense.
Writing an LLM from scratch, part 4
I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and posting about what I found interesting every day that I read some of it.
Here's a link to the previous post in this series.
Today I read through chapter 3, which introduces and explains attention mechanisms -- the core architecture that allows LLMs to "understand" the meaning of text in terms of the relationships between words. This feels like the core of the book; at least, for me, it's the part of the underlying workings of LLMs that I understand the least. I knew it was something to do with the LLM learning which other words to pay attention to when looking at a particular one, but that's pretty much it.
And it's a tough chapter. I finished with what I felt was a good understanding at a high level of how the calculations that make up self-attention in an LLM work -- but not of how self-attention itself works. That is, I understood how to write one, in terms of the steps to follow mathematically, but not why that specific code would be what I would write or why we would perform those mathematical operations.
I think this was because I tried to devour it all in a day, so I'm going to go through much more slowly, writing up notes on each section each day.
Today, I think, I can at least cover the historical explanation of how attention mechanisms came to be in the first place, because that seems reasonably easy to understand.