Writing an LLM from scratch, part 14 -- the complexity of self-attention at scale

Posted on 14 May 2025 in AI, Python, LLM from scratch, TIL deep dives

Between reading chapters 3 and 4 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", I'm taking a break to solidify a few things that have been buzzing through my head as I've worked through it. Last time I posted about how I currently understand the "why" of the calculations we do for self-attention. This time, I want to start working through my budding intuition about how this algorithm behaves as we scale up context length. As always, this is to try to get my own thoughts clear in my head, with the potential benefit of helping out anyone else at the same stage as me -- if you want expert explanations, I'm afraid you'll need to look elsewhere :-)

The particular itch I want to scratch is around the incredible increases in context lengths over the last few years. When ChatGPT first came out in late 2022, it was pretty clear that it had a context length of a couple of thousand tokens; conversations longer than that became increasingly surreal. But now it's much better -- OpenAI's GPT-4.1 model has a context window of 1,047,576 tokens, and Google's Gemini 1.5 Pro is double that. Long conversations just work -- and the only downside is that you hit rate limits faster if they get too long.

It's pretty clear that some impressive engineering has gone into achieving that. And while understanding those enhancements to the basic LLM recipe is one of the side quests I'm trying to avoid while reading this book, I think it's important to make sure I'm clear in my head about what the problems are, even if I don't look into the solutions.

So: why is context length a problem?
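A hint at where this is going: the attention-score matrix has one entry per pair of tokens in the context, so it grows quadratically with context length. Here's a back-of-the-envelope sketch (my own numbers, not taken from the post):

    # Rough illustration: the attention-score matrix has n_tokens x n_tokens
    # entries, so at float32 (4 bytes each) memory grows quadratically.
    for n_tokens in (1_000, 10_000, 1_000_000):
        gigabytes = n_tokens ** 2 * 4 / 1e9  # per head, per layer
        print(f"{n_tokens:>9,} tokens -> {gigabytes:,.1f} GB of attention scores")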

[ Read more ]


Writing an LLM from scratch, part 13 -- the 'why' of attention, or: attention heads are dumb

Posted on 8 May 2025 in AI, Python, LLM from scratch, TIL deep dives

Now that I've finished chapter 3 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)" -- having worked my way through multi-head attention in the last post -- I thought it would be worth pausing to take stock before moving on to Chapter 4.

There are two things I want to cover: the "why" of self-attention, and some thoughts on context lengths. This post is on the "why" -- that is, why do the particular matrix multiplications described in the book do what we want them to do?

As always, this is something I'm doing primarily to get things clear in my own head -- with the possible extra benefit of it being of use to other people out there. I will, of course, run it past multiple LLMs to make sure I'm not posting total nonsense, but caveat lector!

Let's get into it. As I wrote in part 8 of this series:

I think it's also worth noting that [what's in the book is] very much a "mechanistic" explanation -- it says how we do these calculations without saying why. I think that the "why" is actually out of scope for this book, but it's something that fascinates me, and I'll blog about it soon.

That "soon" is now :-)

[ Read more ]


Writing an LLM from scratch, part 12 -- multi-head attention

Posted on 21 April 2025 in AI, Python, LLM from scratch, TIL deep dives

In this post, I'm wrapping up chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered batches, which -- somewhat to my disappointment -- didn't involve completely new (to me) higher-order tensor multiplication, but instead relied on batched and broadcast matrix multiplication. That was still interesting on its own, however, and at least was easy enough to grasp that I didn't disappear down a mathematical rabbit hole.

The last section of chapter 3 is about multi-head attention, and while it wasn't too hard to understand, there were a couple of oddities that I want to write down -- as always, primarily to get it all straight in my own head, but also just in case it's useful for anyone else.

So, the first question is, what is multi-head attention?
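Before diving in, here's a rough sketch of the shape of the answer -- several attention "heads" run in parallel, each with its own weights, and their outputs are concatenated. This is my own illustrative code, not the book's:

    import torch

    # Illustrative multi-head attention: independent heads, concatenated outputs.
    n_tokens, d_in, d_head, n_heads = 6, 16, 4, 4
    x = torch.randn(n_tokens, d_in)          # one embedding per token
    head_outputs = []
    for _ in range(n_heads):
        W_q, W_k, W_v = (torch.randn(d_in, d_head) for _ in range(3))
        q, k, v = x @ W_q, x @ W_k, x @ W_v
        attn = torch.softmax(q @ k.T / d_head ** 0.5, dim=-1)
        head_outputs.append(attn @ v)        # (n_tokens, d_head) per head
    out = torch.cat(head_outputs, dim=-1)    # (n_tokens, n_heads * d_head)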

[ Read more ]


Writing an LLM from scratch, part 11 -- batches

Posted on 19 April 2025 in AI, Python, LLM from scratch, TIL deep dives

I'm still working through chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered dropout, which was nice and easy.

This time I'm moving on to batches. Batches allow you to run a bunch of different input sequences through an LLM at the same time, generating outputs for each in parallel, which can make training and inference more efficient -- if you've read my series on fine-tuning LLMs you'll probably remember I spent a lot of time trying to find exactly the right batch sizes for speed and for the memory I had available.
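As a quick illustration (my sketch, not the book's code), batching just adds a leading dimension to the tensors, and PyTorch broadcasts matrix operations over it:

    import torch

    # A batch of 4 sequences, each 10 tokens long, with embedding dimension 16.
    batch = torch.randn(4, 10, 16)             # (batch, tokens, embedding)
    W_query = torch.randn(16, 16)
    W_key = torch.randn(16, 16)
    queries = batch @ W_query                  # broadcast over batch: (4, 10, 16)
    keys = batch @ W_key                       # likewise (4, 10, 16)
    scores = queries @ keys.transpose(-2, -1)  # batched matmul: (4, 10, 10)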

This was something I was originally planning to go into in some depth, because there's some fundamental maths there that I really wanted to understand better. But the more time I spent reading into it, the more of a rabbit hole it became -- and I had decided on a strict "no side quests" rule when working through this book.

So in this post I'll just present the basic stuff, the stuff that was necessary for me to feel comfortable with the code and the operations described in the book. A full treatment of linear algebra and higher-order tensor operations will, sadly, have to wait for another day...

Let's start off with the fundamental problem of why batches are a bit tricky in an LLM.

[ Read more ]


Writing an LLM from scratch, part 10 -- dropout

Posted on 19 March 2025 in AI, Python, LLM from scratch, TIL deep dives

I'm still chugging through chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered causal attention, which was pretty simple when it came down to it. Today it's another quick and easy one -- dropout.

The concept is pretty simple: you want knowledge to be spread broadly across your model, not concentrated in a few places. Doing that means that all of your parameters are pulling their weight, and you don't have a bunch of them sitting there doing nothing.

So, while you're training (but, importantly, not during inference) you randomly ignore certain parts -- neurons, weights, whatever -- each time around, so that their "knowledge" gets spread over to other bits.
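To make that concrete, here's a minimal sketch (mine, not the book's code) using PyTorch's built-in dropout, which is active in training mode and a no-op during evaluation:

    import torch

    drop = torch.nn.Dropout(p=0.5)  # each element has a 50% chance of being zeroed
    x = torch.ones(8)

    drop.train()    # training mode: dropout active
    print(drop(x))  # roughly half the entries zeroed, the rest scaled by 1/(1-p)

    drop.eval()     # inference mode: dropout does nothing
    print(drop(x))  # unchanged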

Simple enough! But the implementation is a little more fun, and there were a couple of oddities that I needed to think through.

[ Read more ]


Writing an LLM from scratch, part 9 -- causal attention

Posted on 9 March 2025 in AI, Python, LLM from scratch, TIL deep dives

My trek through Sebastian Raschka's "Build a Large Language Model (from Scratch)" continues... Self-attention was a hard nut to crack, but now things feel a bit smoother -- fingers crossed that lasts! So, for today: a quick dive into causal attention, the next part of chapter 3.

Causal attention sounds complicated, but it just means that when we're looking at a word, we don't pay any attention to the words that come later. That's pretty natural -- after all, when we're reading something, we don't need to look at words later on to understand what the word we're reading now means (unless the text is really badly written).

It's "causal" in the sense of causality -- something can't have an effect on something that came before it, just like causes can't some after effects in reality. One big plus about getting that is that I finally now understand why I was using a class called AutoModelForCausalLM for my earlier experiments in fine-tuning LLMs.

Let's take a look at how it's done.

[ Read more ]


Writing an LLM from scratch, part 8 -- trainable self-attention

Posted on 4 March 2025 in AI, Python, LLM from scratch, TIL deep dives

This is the eighth post in my trek through Sebastian Raschka's book "Build a Large Language Model (from Scratch)". I'm blogging about bits that grab my interest, and things I had to rack my brains over, as a way to get things straight in my own head -- and perhaps to help anyone else that is working through it too. It's been almost a month since my last update -- and if you were suspecting that I was blogging about blogging and spending time getting LaTeX working on this site as procrastination because this next section was always going to be a hard one, then you were 100% right! The good news is that -- as so often happens with these things -- it turned out to not be all that tough when I really got down to it. Momentum regained.

If you found this blog through the blogging-about-blogging, welcome! Those posts were not all that typical, though, and I hope you'll enjoy this return to my normal form.

This time I'm covering section 3.4, "Implementing self-attention with trainable weights". How do we create a system that can learn how much attention to pay to the words in a sentence when looking at another word -- for example, one that learns that in "the fat cat sat on the mat", when you're looking at "cat", the word "fat" is important, but when you're looking at "mat", "fat" doesn't matter as much?
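For a preview of where the section goes, a minimal single-head version looks something like this (a sketch with my own naming, not the book's exact code):

    import torch

    # Trainable self-attention: three learnable matrices project each token's
    # embedding into queries, keys and values.
    n_tokens, d_in, d_out = 7, 16, 16  # e.g. "the fat cat sat on the mat"
    x = torch.randn(n_tokens, d_in)    # one embedding per token
    W_q = torch.nn.Parameter(torch.randn(d_in, d_out))
    W_k = torch.nn.Parameter(torch.randn(d_in, d_out))
    W_v = torch.nn.Parameter(torch.randn(d_in, d_out))

    queries, keys, values = x @ W_q, x @ W_k, x @ W_v
    scores = queries @ keys.T / d_out ** 0.5      # scaled dot-product scores
    attn_weights = torch.softmax(scores, dim=-1)  # one row of weights per token
    context = attn_weights @ values               # one context vector per token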

[ Read more ]


Basic matrix maths for neural networks: in practice

Posted on 22 February 2025 in AI, TIL deep dives

This is the second post in my short series of tutorials on matrix operations for neural networks, targeted at beginners, and at people who have some practical experience, but who haven't yet dug into the underlying theory. Again, if you're an experienced ML practitioner, you should skip this post -- though if you want to read it anyway, any comments or suggestions for improvements would be much appreciated!

In my last post in the series, I showed how to derive the formulae to run a neural network from the basic principles of matrix maths. I gave two formulae that are generally used in mathematical treatments of NNs -- one with a separate bias matrix:

$$ \hat{Z} = WX + B $$

...and one with the bias terms baked into the weights matrix, and the inputs matrix extended with a row of 1s at the bottom:

$$ \hat{Z} = WX $$

However, I finished off by saying that in real production implementations, people normally use this instead:

$$ \hat{Z} = XW^T + B $$

...which you might have seen in production PyTorch code looking like this:

Z = X @ W.T + B
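As a quick sanity check (my own snippet, not from the post), you can verify that this form computes the same pre-activations as the textbook one, up to a transpose:

    import torch

    torch.manual_seed(0)
    X = torch.randn(4, 3)              # 4 samples as rows, 3 features each
    W = torch.randn(2, 3)              # 2 neurons, 3 weights each
    B = torch.randn(2)                 # one bias per neuron
    Z_rows = X @ W.T + B               # production-style form
    Z_cols = (W @ X.T + B[:, None]).T  # textbook form, transposed back
    assert torch.allclose(Z_rows, Z_cols)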

This post explores why that form of the equation works better in practice.

[ Read more ]


Basic matrix maths for neural networks: the theory

Posted on 20 February 2025 in AI, TIL deep dives

I thought it would be worth writing a post on how matrix multiplication is used to calculate the output of neural networks. We use matrices because they make the maths easier, and because GPUs can work with them efficiently, allowing us to do a whole bunch of calculations with a single step -- so it's really worth having a solid grounding in what the underlying operations are.

If you're an experienced ML practitioner, you should skip this post. But you might find it useful if you're a beginner -- or if, like me until I started working through this, you've coded neural networks and used matrix operations for them, but apart from working through an example or two by hand, you've never thought through the details.

In terms of maths, I'll assume that you know what a vector is, what a matrix is, and have some vague memories of matrix multiplication from your schooldays, but that's it -- everything else I will define.

In terms of neural networks, I'll assume that you are aware of their basic layout and how they work in a general sense -- but there will be diagrams for clarity and I'll define specific terms.

So, with expectations set, let's go!

[ Read more ]


Writing an LLM from scratch, part 7 -- wrapping up non-trainable self-attention

Posted on 7 February 2025 in AI, Python, LLM from scratch, TIL deep dives

This is the seventh post in my series of notes on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Each time I read part of it, I'm posting about what I found interesting or needed to think hard about, as a way to help get things straight in my own head -- and perhaps to help anyone else that is working through it too.

This post is a quick one, covering just section 3.3.2, "Computing attention weights for all input tokens". I'm covering it in a post on its own because it gets things in place for what feels like the hardest part to grasp at an intuitive level -- how we actually design a system that can learn how to generate attention weights, which is the subject of the next section, 3.4. My linear algebra is super-rusty, and while going through this one, I needed to relearn some stuff that I think I must have forgotten sometime late last century...
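The core computation in that section boils down to something like this (my condensed sketch, not the book's exact listing):

    import torch

    # Simple (non-trainable) self-attention for all tokens at once: pairwise
    # dot products, a softmax per row, then weighted sums of the embeddings.
    inputs = torch.randn(6, 3)                         # six 3-d token embeddings
    attn_scores = inputs @ inputs.T                    # all pairwise dot products
    attn_weights = torch.softmax(attn_scores, dim=-1)  # each row sums to 1
    context_vecs = attn_weights @ inputs               # one context vector per token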

[ Read more ]