Writing an LLM from scratch, part 9 -- causal attention

Posted on 9 March 2025 in AI, Python, LLM from scratch, TIL deep dives |

My trek through Sebastian Raschka's "Build a Large Language Model (from Scratch)" continues... Self-attention was a hard nut to crack, but now things feel a bit smoother -- fingers crossed that lasts! So, for today: a quick dive into causal attention, the next part of chapter 3.

Causal attention sounds complicated, but it just means that when we're looking at a word, we don't pay any attention to the words that come later. That's pretty natural -- after all, when we're reading something, we don't need to look at words later on to understand what the word we're reading now means (unless the text is really badly written).

It's "causal" in the sense of causality -- something can't have an effect on something that came before it, just as causes can't come after their effects in reality. One nice side-effect of understanding that is that I finally know why I was using a class called AutoModelForCausalLM for my earlier experiments in fine-tuning LLMs.
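To make the idea concrete: masking out the "later" words can be done by setting their attention scores to negative infinity before the softmax, so they end up with a weight of zero. Here's a minimal numpy sketch of that trick, using made-up scores rather than anything from the book:

```python
import numpy as np

# A hypothetical 4x4 attention-score matrix for a 4-token sequence
# (row = the token doing the attending, column = the token attended to).
scores = np.arange(16, dtype=float).reshape(4, 4)

# Causal mask: every position above the diagonal is a token attending
# to a *later* token, so set those scores to -inf.
mask = np.triu(np.ones((4, 4), dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax row by row: exp(-inf) is 0, so the masked positions get
# zero attention weight and each row still sums to 1.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

print(weights)  # upper triangle is all zeros; each row sums to 1
```

The first token can only attend to itself, so its row comes out as [1, 0, 0, 0]; the last token attends over the whole sequence.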

Let's take a look at how it's done.

[ Read more ]


Writing an LLM from scratch, part 8 -- trainable self-attention

Posted on 4 March 2025 in AI, Python, LLM from scratch, TIL deep dives |

This is the eighth post in my trek through Sebastian Raschka's book "Build a Large Language Model (from Scratch)". I'm blogging about bits that grab my interest, and things I had to rack my brains over, as a way to get things straight in my own head -- and perhaps to help anyone else that is working through it too. It's been almost a month since my last update -- and if you were suspecting that I was blogging about blogging and spending time getting LaTeX working on this site as procrastination because this next section was always going to be a hard one, then you were 100% right! The good news is that -- as so often happens with these things -- it turned out to not be all that tough when I really got down to it. Momentum regained.

If you found this blog through the blogging-about-blogging, welcome! Those posts were not all that typical, though, and I hope you'll enjoy this return to my normal form.

This time I'm covering section 3.4, "Implementing self-attention with trainable weights". How do we create a system that can learn how to interpret how much attention to pay to words in a sentence, when looking at other words -- for example, that learns that in "the fat cat sat on the mat", when you're looking at "cat", the word "fat" is important, but when you're looking at "mat", "fat" doesn't matter as much?
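As a rough preview of where the section ends up, the trainable part is three weight matrices that project each word's embedding into queries, keys and values. This is my own numpy sketch with arbitrary toy dimensions and random placeholder weights, not the book's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_in, d_out = 6, 3, 2               # toy sizes, chosen arbitrarily

X = rng.normal(size=(seq_len, d_in))         # one embedding row per token

# The trainable parts: three projection matrices (random stand-ins here;
# in a real model these are learned during training).
W_q = rng.normal(size=(d_in, d_out))
W_k = rng.normal(size=(d_in, d_out))
W_v = rng.normal(size=(d_in, d_out))

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values

scores = Q @ K.T / np.sqrt(d_out)            # scaled dot-product scores

# Softmax each row so the attention weights sum to 1.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

context = weights @ V                        # one context vector per token
print(context.shape)  # (6, 2)
```

Because the projections are learned, the model can discover for itself that "fat" matters when looking at "cat" but not so much when looking at "mat".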

[ Read more ]


Basic matrix maths for neural networks: in practice

Posted on 22 February 2025 in AI, TIL deep dives |

This is the second post in my short series of tutorials on matrix operations for neural networks, targeted at beginners, and at people who have some practical experience, but who haven't yet dug into the underlying theory. Again, if you're an experienced ML practitioner, you should skip this post -- though if you want to read it anyway, any comments or suggestions for improvements would be much appreciated!

In my last post in the series, I showed how to derive the formulae to run a neural network from the basic principles of matrix maths. I gave two formulae that are generally used in mathematical treatments of NNs -- one with a separate bias matrix:

Ẑ = WX + B

...and one with the bias terms baked into the weights matrix, and the inputs matrix extended with a row of 1s at the bottom:

Ẑ = WX
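The two formulations give identical results; here's a quick numpy check of the baked-in-bias trick, with toy shapes and random numbers of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out, n_samples = 3, 2, 4
W = rng.normal(size=(n_out, n_in))
X = rng.normal(size=(n_in, n_samples))    # one column per input sample
b = rng.normal(size=(n_out, 1))
B = np.tile(b, (1, n_samples))            # bias repeated across samples

# Form 1: separate bias matrix.
Z1 = W @ X + B

# Form 2: append the bias column to W, and a row of 1s to the bottom of X.
W_aug = np.hstack([W, b])
X_aug = np.vstack([X, np.ones((1, n_samples))])
Z2 = W_aug @ X_aug

print(np.allclose(Z1, Z2))  # True
```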

However, I finished off by saying that in real production implementations, people normally use this instead:

Ẑ = XWᵀ + B

...which you might have seen in production PyTorch code looking like this:

Z = X @ W.T + B

This post explores why that form of the equation works better in practice.
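As a quick sanity check that the two conventions really compute the same thing, up to a transpose (my own toy example, not code from the post):

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_out, batch = 3, 2, 4
W = rng.normal(size=(n_out, n_in))
b = rng.normal(size=n_out)

X_cols = rng.normal(size=(n_in, batch))   # maths convention: samples as columns
X_rows = X_cols.T                         # production convention: samples as rows

Z_maths = (W @ X_cols + b[:, None]).T     # WX + B, transposed for comparison
Z_prod = X_rows @ W.T + b                 # X W^T + B, as in PyTorch's nn.Linear

print(np.allclose(Z_maths, Z_prod))  # True
```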

[ Read more ]


Basic matrix maths for neural networks: the theory

Posted on 20 February 2025 in AI, TIL deep dives |

I thought it would be worth writing a post on how matrix multiplication is used to calculate the output of neural networks. We use matrices because they make the maths easier, and because GPUs can work with them efficiently, allowing us to do a whole bunch of calculations in a single step -- so it's really worth having a solid grounding in what the underlying operations are.

If you're an experienced ML practitioner, you should skip this post. But you might find it useful if you're a beginner -- or if, like me until I started working through this, you've coded neural networks and used matrix operations for them, but apart from working through an example or two by hand, you've never thought through the details.

In terms of maths, I'll assume that you know what a vector is, what a matrix is, and have some vague memories of matrix multiplication from your schooldays, but that's it -- everything else I will define.

In terms of neural networks, I'll assume that you are aware of their basic layout and how they work in a general sense -- but there will be diagrams for clarity and I'll define specific terms.
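For a taste of what's to come, here's a single layer computed both ways -- neuron by neuron, and as one matrix operation. This is my own toy example with random numbers, not taken from the post itself:

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_out = 3, 2
W = rng.normal(size=(n_out, n_in))   # one row of weights per neuron
b = rng.normal(size=n_out)           # one bias per neuron
x = rng.normal(size=n_in)            # a single input vector

# Neuron by neuron: each output is a weighted sum of the inputs plus a bias.
z_loop = np.array([W[i] @ x + b[i] for i in range(n_out)])

# The same thing as a single matrix-vector multiplication.
z_matrix = W @ x + b

print(np.allclose(z_loop, z_matrix))  # True
```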

So, with expectations set, let's go!

[ Read more ]


Writing an LLM from scratch, part 7 -- wrapping up non-trainable self-attention

Posted on 7 February 2025 in AI, Python, LLM from scratch, TIL deep dives |

This is the seventh post in my series of notes on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Each time I read part of it, I'm posting about what I found interesting or needed to think hard about, as a way to help get things straight in my own head -- and perhaps to help anyone else that is working through it too.

This post is a quick one, covering just section 3.3.2, "Computing attention weights for all input tokens". I'm covering it in a post on its own because it gets things in place for what feels like the hardest part to grasp at an intuitive level -- how we actually design a system that can learn how to generate attention weights, which is the subject of the next section, 3.4. My linear algebra is super-rusty, and while going through this one, I needed to relearn some stuff that I think I must have forgotten sometime late last century...

[ Read more ]


Writing an LLM from scratch, part 6 -- starting to code self-attention

Posted on 21 January 2025 in AI, LLM from scratch, TIL deep dives |

This is the sixth in my series of notes on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Each time I read part of it, I'm posting about what I found interesting as a way to help get things straight in my own head -- and perhaps to help anyone else that is working through it too. This post covers just one subsection of the trickiest chapter in the book -- subsection 3.3.1, "A simple self-attention mechanism without trainable weights". I feel there's enough in there to make a post on its own; it certainly gave me one key intuition that I think is a critical part of how everything fits together.
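For anyone who wants the gist before clicking through: with no trainable weights, the whole mechanism fits in a few lines of numpy. This is my own sketch with random embeddings, not the book's code:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 3))   # 5 tokens, each with a 3-dimensional embedding

# Attention scores: the dot product of every token's embedding with
# every other token's embedding -- no learned parameters involved.
scores = X @ X.T

# Normalise each row with softmax so the weights sum to 1.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# Each context vector is a weighted sum of all the token embeddings.
context = weights @ X

print(context.shape)  # (5, 3)
```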

As always, there may be errors in my understanding below -- I've cross-checked and run the whole post through Claude, ChatGPT o1, and DeepSeek r1, so I'm reasonably confident, but caveat lector :-) With all that said, let's go!

[ Read more ]


Writing an LLM from scratch, part 5 -- more on self-attention

Posted on 11 January 2025 in AI, LLM from scratch, TIL deep dives |

I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and posting about what I found interesting every day that I read some of it. In retrospect, it was kind of adorable that I thought I could get it all done over my Christmas break, given that I managed just the first two-and-a-half chapters! However, now that the start-of-year stuff is out of the way at work, hopefully I can continue. And at least the two-week break since my last post in this series has given things some time to stew.

In the last post I was reading about attention mechanisms and how they work, and was a little thrown by the move from attention to self-attention. In this blog post I hope to get all of that fully sorted so that I can move on to the rest of chapter 3, and then the rest of the book. Raschka himself said on X that this chapter "might be the most technical one (like building the engine of a car) but it gets easier from here!" That's reassuring, and hopefully it means that my blog posts will speed up too once I'm done with it.

But first: on to attention and what it means in the LLM sense.

[ Read more ]


Writing an LLM from scratch, part 4

Posted on 28 December 2024 in AI, LLM from scratch, TIL deep dives |

I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and posting about what I found interesting every day that I read some of it.

Here's a link to the previous post in this series.

Today I read through chapter 3, which introduces and explains attention mechanisms -- the core architecture that allows LLMs to "understand" the meaning of text in terms of the relationships between words. This feels like the core of the book; at least, for me, it's the part of the underlying workings of LLMs that I understand the least. I knew it was something to do with the LLM learning which other words to pay attention to when looking at a particular one, but that's pretty much it.

And it's a tough chapter. I finished with what I felt was a good understanding at a high level of how the calculations that make up self-attention in an LLM work -- but not of how self-attention itself works. That is, I understood how to write one, in terms of the steps to follow mathematically, but not why that specific code would be what I would write or why we would perform those mathematical operations.

I think this was because I tried to devour it all in a day, so I'm going to go through it much more slowly, writing up notes on each section day by day.

Today, I think, I can at least cover the historical explanation of how attention mechanisms came to be in the first place, because that seems reasonably easy to understand.

[ Read more ]


Writing an LLM from scratch, part 3

Posted on 26 December 2024 in AI, Python, LLM from scratch, TIL deep dives |

I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and posting about what I found interesting every day that I read some of it.

Here's a link to the previous post in this series.

Today I was working through the second half of Chapter 2, "Working with text data", which I'd started just before Christmas. Only two days off, so it was reasonably fresh in my mind :-)

[ Read more ]


Writing an LLM from scratch, part 2

Posted on 23 December 2024 in AI, Python, LLM from scratch, TIL deep dives |

I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and planning to post every day (or at least, every day I read some of it -- Christmas day I suspect I'll not be posting) with notes on what I found interesting.

Here's a link to the previous post in this series.

I had been planning to do a chapter a day, but that is looking optimistic for such a dense book! So today, I've read the first half or so of Chapter 2, "Working with text data". This gives an overview of the pre-processing that happens to text before it hits the LLM, goes on to describe a simple tokenization system (complete with source code), and then briefly covers the byte pair encoding method that we'll actually be using for the LLM.

[ Read more ]