Writing an LLM from scratch, part 12 -- multi-head attention

Posted on 21 April 2025 in AI, Python, LLM from scratch, TIL deep dives |

In this post, I'm wrapping up chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered batches, which -- somewhat to my disappointment -- didn't involve completely new (to me) high-order tensor multiplication, but instead relied on batched and broadcast matrix multiplication. That was still interesting on its own, however, and at least was easy enough to grasp that I didn't disappear down a mathematical rabbit hole.

The last section of chapter 3 is about multi-head attention, and while it wasn't too hard to understand, there were a couple of oddities that I want to write down -- as always, primarily to get it all straight in my own head, but also just in case it's useful for anyone else.

So, the first question is, what is multi-head attention?
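The short version: instead of computing one set of attention weights, you run several attention "heads" in parallel, each working on its own slice of the projection dimensions, and concatenate their outputs. As a rough sketch of the idea -- in NumPy rather than the PyTorch the book uses, with made-up shapes and weight matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, n_heads):
    """Split d_out across n_heads, attend per head, concatenate."""
    n_tokens, _ = x.shape
    d_out = w_q.shape[1]
    head_dim = d_out // n_heads
    # Project, then reshape so each head gets its own slice of the dims
    q = (x @ w_q).reshape(n_tokens, n_heads, head_dim).transpose(1, 0, 2)
    k = (x @ w_k).reshape(n_tokens, n_heads, head_dim).transpose(1, 0, 2)
    v = (x @ w_v).reshape(n_tokens, n_heads, head_dim).transpose(1, 0, 2)
    # Per-head scaled dot-product attention, as one batched matmul over heads
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = softmax(scores, axis=-1)
    context = weights @ v               # (n_heads, n_tokens, head_dim)
    # Concatenate the heads back into one context vector per token
    return context.transpose(1, 0, 2).reshape(n_tokens, d_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))             # 6 tokens, embedding dim 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = multi_head_attention(x, w_q, w_k, w_v, n_heads=2)
print(out.shape)  # (6, 8)
```

Note that the batched matmul over the heads axis is the same trick as the batched matmul over sequences from the last post.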

[ Read more ]


Writing an LLM from scratch, part 11 -- batches

Posted on 19 April 2025 in AI, Python, LLM from scratch, TIL deep dives |

I'm still working through chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered dropout, which was nice and easy.

This time I'm moving on to batches. Batches allow you to run a bunch of different input sequences through an LLM at the same time, generating outputs for each in parallel, which can make training and inference more efficient -- if you've read my series on fine-tuning LLMs you'll probably remember I spent a lot of time trying to find exactly the right batch sizes for speed and for the memory I had available.

This was something I was originally planning to go into in some depth, because there's some fundamental maths there that I really wanted to understand better. But the more time I spent reading into it, the more of a rabbit hole it became -- and I had decided on a strict "no side quests" rule when working through this book.

So in this post I'll just present the basic stuff, the stuff that was necessary for me to feel comfortable with the code and the operations described in the book. A full treatment of linear algebra and higher-order tensor operations will, sadly, have to wait for another day...

Let's start off with the fundamental problem of why batches are a bit tricky in an LLM.
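The core mechanism, for the impatient: NumPy (like PyTorch) treats leading axes as batch dimensions in a matrix multiplication, so one weight matrix can be applied to a whole stack of sequences in a single op. A toy example with invented shapes:

```python
import numpy as np

# A batch of 4 sequences, each with 6 tokens of dimension 8
batch = np.random.default_rng(1).normal(size=(4, 6, 8))
w_key = np.random.default_rng(2).normal(size=(8, 8))

# The leading axis is broadcast: the same (8, 8) weight matrix
# is applied to every sequence in the batch at once.
keys = batch @ w_key
print(keys.shape)   # (4, 6, 8)

# Equivalent to looping over the batch, but done in one vectorised op:
looped = np.stack([seq @ w_key for seq in batch])
print(np.allclose(keys, looped))  # True
```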

[ Read more ]


Dropout and mandatory vacation

Posted on 24 March 2025 in AI, Musings |

As I was dozing off the other night, after my post on dropout, it popped into my mind that it's not dissimilar to something many financial firms do. They require certain key employees to take at least two consecutive weeks of holiday every year -- not because they're kind employers who believe in a healthy work-life balance (source: I worked for one) but because it makes sure the firm is functioning safely and effectively, at a small cost in performance.

There are two reasons this helps:

  1. It reduces key-person risk. By enforcing vacation like this, they make absolutely sure that the business can continue to operate even if some people are out. If stuff goes wrong while they're out, then obviously processes are broken or other people don't have the knowledge they need to pick up the slack. So long as it's well-managed, those problems can be fixed, which means that if the key people quit, there's less damage done. Think of it as being like increasing the bus factor of a dev team.
  2. It can uncover misbehaviour. Let's imagine a trader is doing something they shouldn't -- maybe fraud, or perhaps just covering up for their mistakes so that they don't get fired. They might be able to manage that by shuffling balances around if they're in the office every day, but two weeks out should mean that whoever is covering for them will work out that something isn't right.

Now, I should admit that the second of those (a) doesn't really apply to dropout [1] and (b) is probably the more important of the two from a bank's perspective.

But the first, I think, is a great metaphor for dropout during training. What we want to do is make sure that no particular parameter is "key"; we want the knowledge and intelligence to be spread across the model as a whole.

The metaphor also cleared up a couple of questions I had about dropout.

Now, is this a perfect metaphor, or even a great one? Maybe not. But it works for me, and I thought I'd share it in case it's useful for anyone else. And I'm going to be looking for similar organisational metaphors for other ML techniques -- I think they are a useful way to clarify things, especially for those of us who (for better or for worse) have spent time in the trenches of actual organisations.


  [1] There might be some applicability to alignment training of multi-agent models, though?


Writing an LLM from scratch, part 10 -- dropout

Posted on 19 March 2025 in AI, Python, LLM from scratch, TIL deep dives |

I'm still chugging through chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered causal attention, which was pretty simple when it came down to it. Today it's another quick and easy one -- dropout.

The concept is pretty simple: you want knowledge to be spread broadly across your model, not concentrated in a few places. Doing that means that all of your parameters are pulling their weight, and you don't have a bunch of them sitting there doing nothing.

So, while you're training (but, importantly, not during inference) you randomly ignore certain parts -- neurons, weights, whatever -- each time around, so that their "knowledge" gets spread over to other bits.

Simple enough! But the implementation is a little more fun, and there were a couple of oddities that I needed to think through.
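To make the concept concrete, here's a minimal sketch of "inverted" dropout in NumPy -- my own illustration, not the book's PyTorch implementation, which uses `torch.nn.Dropout`:

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero each activation with probability p during
    training, and scale the survivors by 1/(1-p) so that the expected
    value of the output matches the input."""
    if not training or p == 0.0:
        return x   # at inference time, dropout is a no-op
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(42)
acts = np.ones((4, 4))
dropped = dropout(acts, p=0.5, training=True, rng=rng)
# Surviving entries are scaled up to 2.0; roughly half are zeroed
print(dropped)
```

The 1/(1-p) rescaling is one of the "oddities" worth thinking through: it's what lets you switch dropout off at inference time without changing the scale of the activations.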

[ Read more ]


Adding /llms.txt

Posted on 18 March 2025 in Blogkeeping, Website design, AI |

The /llms.txt file is an idea from Jeremy Howard. Rather than making LLMs parse websites with HTML designed to make it look pretty for humans, why not publish the same content separately as Markdown? It's generally not much extra effort, and could make your content more discoverable and useful for people using AIs.

I think it's most useful for things like software documentation; Stripe and Anthropic seem to think so too, having both recently added it for theirs.

It's less obviously useful for a blog like this. But I write everything here in Markdown anyway, and just run it through markdown2 and some Jinja2 templates to generate the HTML, so I thought adding support would be a bit of fun; here it is.

One thing that isn't covered by the proposal, at least as far as I could see, is how LLMs should know that there is a special version of the site just for them. A link tag with rel set to alternate seemed like a good idea for that; I already had one to help RSS readers find the feed URL:

<link rel="alternate" type="application/rss+xml" title="Giles Thomas" href="/feed/rss.xml" />

...so with a quick check of the docs to make sure I wasn't doing anything really stupid, I decided on this:

<link rel="alternate" type="text/markdown" title="LLM-friendly version" href="/llms.txt" />

There were a couple of other mildly interesting implementation details.

[ Read more ]


The RSS feed now has the full text

Posted on 18 March 2025 in Blogkeeping, Website design |

The other day I asked whether the RSS feed for this blog should have the full text of a post. The majority view was that I should, so I've updated the feed just now. Please do let me know if you see any problems!

[ Read more ]


Should RSS feeds contain the full blog post?

Posted on 16 March 2025 in Blogkeeping, Website design |

A question for people who read this blog over RSS -- should the feed contain the full posts? I split every article into "above the fold" and "below the fold" sections. On the site itself, this serves a useful function -- you can scan through the front page or a topic page and see the intros to different posts so that you can work out what might be of interest to you.

Right now, I only put the "above the fold" content into the RSS feed. There's no particular reason for that, I was just copying certain other blogs I read. But I've noticed recently that some particularly good blogs use the full post text. Perhaps I should switch to that?

Pros: you can read everything in a post without having to click a link and switch app.

Cons: some of my lab notes posts can get absurdly long, and perhaps that wouldn't work well in RSS readers.

What do you think?

[ Read more ]


Writing an LLM from scratch, part 9 -- causal attention

Posted on 9 March 2025 in AI, Python, LLM from scratch, TIL deep dives |

My trek through Sebastian Raschka's "Build a Large Language Model (from Scratch)" continues... Self-attention was a hard nut to crack, but now things feel a bit smoother -- fingers crossed that lasts! So, for today: a quick dive into causal attention, the next part of chapter 3.

Causal attention sounds complicated, but it just means that when we're looking at a word, we don't pay any attention to the words that come later. That's pretty natural -- after all, when we're reading something, we don't need to look at words later on to understand what the word we're reading now means (unless the text is really badly written).

It's "causal" in the sense of causality -- something can't have an effect on anything that came before it, just as causes can't come after their effects in reality. A nice bonus of understanding that: I finally know why I was using a class called AutoModelForCausalLM in my earlier experiments in fine-tuning LLMs.

Let's take a look at how it's done.
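The essence of the trick, sketched in NumPy rather than the book's PyTorch: mask out the attention scores above the diagonal before the softmax, so each token can only attend to itself and to earlier tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n_tokens = 4
scores = np.random.default_rng(0).normal(size=(n_tokens, n_tokens))

# Mask everything above the diagonal: token i may only attend
# to tokens 0..i, i.e. to itself and to the words before it.
mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = softmax(scores, axis=-1)
print(np.round(weights, 2))
# Row i has zeros in columns i+1 onwards, and each row still sums to 1
```

Setting the masked scores to minus infinity (rather than zeroing the weights afterwards) means the softmax renormalises the remaining weights for free.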

[ Read more ]


Writing an LLM from scratch, part 8 -- trainable self-attention

Posted on 4 March 2025 in AI, Python, LLM from scratch, TIL deep dives |

This is the eighth post in my trek through Sebastian Raschka's book "Build a Large Language Model (from Scratch)". I'm blogging about bits that grab my interest, and things I had to rack my brains over, as a way to get things straight in my own head -- and perhaps to help anyone else that is working through it too. It's been almost a month since my last update -- and if you were suspecting that I was blogging about blogging and spending time getting LaTeX working on this site as procrastination because this next section was always going to be a hard one, then you were 100% right! The good news is that -- as so often happens with these things -- it turned out to not be all that tough when I really got down to it. Momentum regained.

If you found this blog through the blogging-about-blogging, welcome! Those posts were not all that typical, though, and I hope you'll enjoy this return to my normal form.

This time I'm covering section 3.4, "Implementing self-attention with trainable weights". How do we create a system that can learn how to interpret how much attention to pay to words in a sentence, when looking at other words -- for example, that learns that in "the fat cat sat on the mat", when you're looking at "cat", the word "fat" is important, but when you're looking at "mat", "fat" doesn't matter as much?
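The book builds this up carefully, step by step; as a rough preview of where section 3.4 ends up -- in NumPy rather than the book's PyTorch, with illustrative shapes I've made up -- the trainable parts are the query, key and value projection matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention; w_q, w_k, w_v are the learned weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # similarity of queries to keys
    weights = softmax(scores, axis=-1)        # one row of weights per token
    return weights @ v                        # weighted sum of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 16))   # "the fat cat sat on the mat": 7 tokens
w_q, w_k, w_v = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
ctx = self_attention(x, w_q, w_k, w_v)
print(ctx.shape)  # (7, 16)
```

During training, gradient descent adjusts the three weight matrices so that the attention weights end up encoding which words matter for which.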

[ Read more ]


It's still worth blogging in the age of AI

Posted on 24 February 2025 in Blogkeeping, Musings |

My post about blogging as writing the tutorial that you wished you'd found really took off on Hacker News. There were a lot of excellent comments, but one thing kept coming up: what's the point in blogging if people are using ChatGPT, Claude and DeepSeek to spoon-feed them answers? Who, apart from the AIs, will read what you write?

I was asking myself the same question when I started blogging semi-regularly again last year, and this post is an attempt to summarise why I decided that it was worthwhile. The TL;DR: blogging isn't just about being read -- it's about learning and thinking, and having a durable proof that you can do both.

[ Read more ]