Writing an LLM from scratch, part 12 -- multi-head attention

Posted on 21 April 2025 in AI, Python, LLM from scratch, TIL deep dives

In this post, I'm wrapping up chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered batches, which -- somewhat to my disappointment -- didn't involve any completely new (to me) higher-order tensor multiplication, but instead relied on batched and broadcast matrix multiplication. That was still interesting in its own right, and at least it was easy enough to grasp that I didn't disappear down a mathematical rabbit hole.
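To make that concrete, here's a quick PyTorch sketch of the two operations I mean -- the shapes here are invented for illustration, not taken from the book:

```python
import torch

# A batch of 4 sequences, each 6 tokens long, with 3-dimensional
# embeddings (made-up shapes, purely for illustration).
batch = torch.randn(4, 6, 3)

# Broadcast matrix multiplication: a single (3, 2) weight matrix is
# reused across every sequence in the batch.
w = torch.randn(3, 2)
projected = batch @ w
print(projected.shape)  # torch.Size([4, 6, 2])

# Batched matrix multiplication: two stacks of matrices are multiplied
# pairwise along the leading batch dimension -- the kind of thing that
# happens when computing attention scores per batch element.
scores = batch @ batch.transpose(1, 2)
print(scores.shape)  # torch.Size([4, 6, 6])
```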

The last section of chapter 3 is about multi-head attention, and while it wasn't too hard to understand, there were a couple of oddities that I want to write down -- as always, primarily to get it all straight in my own head, but also just in case it's useful for anyone else.

So, the first question is, what is multi-head attention?
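Before the full answer, here's a rough sketch of the general idea in PyTorch -- toy dimensions and code of my own, not the book's implementation. Several attention heads each run scaled dot-product attention over their own slice of each token's embedding, and their outputs are concatenated back together:

```python
import torch

# Toy sizes, purely for illustration: 2 sequences of 6 tokens, an
# 8-dimensional embedding split across 2 heads of 4 dimensions each.
batch_size, seq_len, d_model, num_heads = 2, 6, 8, 2
head_dim = d_model // num_heads

x = torch.randn(batch_size, seq_len, d_model)
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)

def split_heads(t):
    # (batch, seq, d_model) -> (batch, heads, seq, head_dim)
    return t.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)

q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)

# Scaled dot-product attention runs independently in each head,
# broadcast over the (batch, head) dimensions.
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
weights = torch.softmax(scores, dim=-1)
context = weights @ v  # (batch, heads, seq, head_dim)

# Concatenate the heads back into one d_model-sized vector per token.
out = context.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
print(out.shape)  # torch.Size([2, 6, 8])
```

A real implementation would also add a causal mask, dropout, and an output projection, but this shows the head-splitting mechanics.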

[ Read more ]


Writing an LLM from scratch, part 11 -- batches

Posted on 19 April 2025 in AI, Python, LLM from scratch, TIL deep dives

I'm still working through chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered dropout, which was nice and easy.

This time I'm moving on to batches. Batches allow you to run a bunch of different input sequences through an LLM at the same time, generating outputs for each in parallel, which can make both training and inference more efficient. If you've read my series on fine-tuning LLMs, you'll probably remember that I spent a lot of time trying to find the batch sizes that gave the best speed within the memory I had available.
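As a minimal illustration -- the token IDs and dimensions here are invented for the example -- batching just means stacking sequences along a new leading dimension so that every layer processes the whole stack at once:

```python
import torch

# Hypothetical token IDs for three sequences of equal length.
seq_1 = torch.tensor([101, 2009, 2003, 1037])
seq_2 = torch.tensor([101, 2129, 2079, 2017])
seq_3 = torch.tensor([101, 2054, 2003, 2023])

# Stack them along a leading batch dimension: shape (3, 4).
batch = torch.stack([seq_1, seq_2, seq_3])
print(batch.shape)  # torch.Size([3, 4])

# Layers that operate on the trailing dimensions handle the whole batch
# in one pass -- e.g. an embedding layer maps it to (3, 4, embedding_dim).
embedding = torch.nn.Embedding(num_embeddings=30000, embedding_dim=8)
print(embedding(batch).shape)  # torch.Size([3, 4, 8])
```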

This was something I originally planned to go into in some depth, because there's some fundamental maths there that I really wanted to understand better. But the more time I spent reading about it, the more of a rabbit hole it became -- and I had decided on a strict "no side quests" rule when working through this book.

So in this post I'll just present the basics -- the material I needed in order to feel comfortable with the code and the operations described in the book. A full treatment of linear algebra and higher-order tensor operations will, sadly, have to wait for another day...

Let's start off with the fundamental problem that makes batches a bit tricky in an LLM.

[ Read more ]