Dropout and mandatory vacation

Posted on 24 March 2025 in AI, Musings

As I was dozing off the other night, after my post on dropout, it popped into my mind that it's not dissimilar to something many financial firms do. They require certain key employees to take at least two consecutive weeks of holiday every year -- not because they're kind employers who believe in a healthy work-life balance (source: I worked for one) but because it makes sure the firm is functioning safely and effectively, at a small cost in performance.

There are two reasons this helps:

  1. It reduces key-person risk. By enforcing vacation like this, they make absolutely sure that the business can continue to operate even if some people are out. If stuff goes wrong while they're out, then obviously processes are broken or other people don't have the knowledge they need to pick up the slack. So long as it's well-managed, those problems can be fixed, which means that if the key people quit, there's less damage done. Think of it as being like increasing the bus factor of a dev team.
  2. It can uncover misbehaviour. Let's imagine a trader is doing something they shouldn't -- maybe fraud, or perhaps just covering up for their mistakes so that they don't get fired. They might be able to manage that by shuffling balances around if they're in the office every day, but two weeks out should mean that whoever is covering for them will work out that something isn't right.

Now, I should admit that the second of those (a) doesn't really apply to dropout[1] and (b) is probably the more important of the two from a bank's perspective.

But the first, I think, is a great metaphor for dropout during training. What we want to do is make sure that no particular parameter is "key"; we want the knowledge and intelligence to be spread across the model as a whole.

That also clears up a couple of questions I had about dropout.

Now, is this a perfect metaphor, or even a great one? Maybe not. But it works for me, and I thought I'd share it in case it's useful for anyone else. And I'm going to be looking for similar organisational metaphors for other ML techniques -- I think they are a useful way to clarify things, especially for those of us who (for better or for worse) have spent time in the trenches of actual organisations.


[1] There might be some applicability to alignment training of multi-agent models, though?


Writing an LLM from scratch, part 10 -- dropout

Posted on 19 March 2025 in AI, Python, LLM from scratch, TIL deep dives

I'm still chugging through chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered causal attention, which was pretty simple when it came down to it. Today it's another quick and easy one -- dropout.

The concept is pretty simple: you want knowledge to be spread broadly across your model, not concentrated in a few places. Doing that means that all of your parameters are pulling their weight, and you don't have a bunch of them sitting there doing nothing.

So, while you're training (but, importantly, not during inference), you randomly ignore certain parts -- neurons, weights, whatever -- each time around, so that their "knowledge" gets spread over to other bits.
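Here's a minimal sketch of those mechanics in PyTorch, using the "inverted dropout" scaling that frameworks like PyTorch apply during training -- the sizes and values are made up for illustration:

import torch

torch.manual_seed(0)
x = torch.ones(4, 6)  # a made-up batch of activations

p = 0.5  # probability of dropping each element
mask = (torch.rand_like(x) >= p).float()  # 1 = keep, 0 = drop

# Scale the survivors by 1/(1-p) so the expected activation stays the
# same -- which is why dropout can simply be a no-op at inference time.
dropped = x * mask / (1.0 - p)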

Simple enough! But the implementation is a little more fun, and there were a couple of oddities that I needed to think through.

[ Read more ]


Adding /llms.txt

Posted on 18 March 2025 in Blogkeeping, Website design, AI

The /llms.txt file is an idea from Jeremy Howard. Rather than making LLMs parse HTML that's designed to look pretty for humans, why not publish the same content separately as Markdown? It's generally not much extra effort, and could make your content more discoverable and useful for people using AIs.

I think it's most useful for things like software documentation; Stripe and Anthropic seem to think so too, having both recently added it for theirs.

It's less obviously useful for a blog like this. But I write everything here in Markdown anyway, and just run it through markdown2 and some Jinja2 templates to generate the HTML, so I thought adding support would be a bit of fun; here it is.
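Roughly speaking, the pipeline is something like this sketch -- the file and template names here are hypothetical, not the site's actual build script:

import markdown2
from jinja2 import Environment, FileSystemLoader

# Render a post's Markdown source into an HTML page (hypothetical names).
env = Environment(loader=FileSystemLoader("templates"))
with open("posts/adding-llms-txt.md") as f:
    html_body = markdown2.markdown(f.read())
page = env.get_template("post.html").render(body=html_body)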

One thing that isn't covered by the proposal, at least as far as I could see, is how LLMs should know that there is a special version of the site just for them. A link tag with rel set to alternate seemed like a good idea for that; I already had one to help RSS readers find the feed URL:

<link rel="alternate" type="application/rss+xml" title="Giles Thomas" href="/feed/rss.xml" />

...so with a quick check of the docs to make sure I wasn't doing anything really stupid, I decided on this:

<link rel="alternate" type="text/markdown" title="LLM-friendly version" href="/llms.txt" />

There were a couple of other mildly interesting implementation details.

[ Read more ]


The RSS feed now has the full text

Posted on 18 March 2025 in Blogkeeping, Website design

The other day I asked whether the RSS feed for this blog should have the full text of a post. The majority view was that I should, so I've updated the feed just now. Please do let me know if you see any problems!

[ Read more ]


Should RSS feeds contain the full blog post?

Posted on 16 March 2025 in Blogkeeping, Website design

A question for people who read this blog over RSS -- should the feed contain the full posts? I split every article into "above the fold" and "below the fold" sections. On the site itself, this serves a useful function -- you can scan through the front page or a topic page and see the intros to different posts so that you can work out what might be of interest to you.

Right now, I only put the "above the fold" content into the RSS feed. There's no particular reason for that; I was just copying certain other blogs I read. But I've noticed recently that some particularly good blogs use the full post text. Perhaps I should switch to that?

Pros: you can read everything in a post without having to click a link and switch app.

Cons: some of my lab notes posts can get absurdly long, and perhaps that wouldn't work well in RSS readers.

What do you think?

[ Read more ]


Writing an LLM from scratch, part 9 -- causal attention

Posted on 9 March 2025 in AI, Python, LLM from scratch, TIL deep dives

My trek through Sebastian Raschka's "Build a Large Language Model (from Scratch)" continues... Self-attention was a hard nut to crack, but now things feel a bit smoother -- fingers crossed that lasts! So, for today: a quick dive into causal attention, the next part of chapter 3.

Causal attention sounds complicated, but it just means that when we're looking at a word, we don't pay any attention to the words that come later. That's pretty natural -- after all, when we're reading something, we don't need to look at words later on to understand what the word we're reading now means (unless the text is really badly written).
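Here's a minimal sketch of what that masking looks like in PyTorch -- toy sizes and random scores for illustration, not the book's actual code:

import torch

torch.manual_seed(0)
scores = torch.randn(4, 4)  # toy attention scores for a 4-token sequence

# Mask out everything above the diagonal: token i may only attend to
# tokens 0..i, that is, to itself and the words that came before it.
mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

# Softmax turns the -inf entries into zero attention weights.
weights = torch.softmax(scores, dim=-1)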

It's "causal" in the sense of causality -- something can't have an effect on something that came before it, just as causes can't come after their effects in reality. A nice bonus of getting that straight is that I finally understand why I was using a class called AutoModelForCausalLM for my earlier experiments in fine-tuning LLMs.

Let's take a look at how it's done.

[ Read more ]


Writing an LLM from scratch, part 8 -- trainable self-attention

Posted on 4 March 2025 in AI, Python, LLM from scratch, TIL deep dives

This is the eighth post in my trek through Sebastian Raschka's book "Build a Large Language Model (from Scratch)". I'm blogging about bits that grab my interest, and things I had to rack my brains over, as a way to get things straight in my own head -- and perhaps to help anyone else who is working through it too.

It's been almost a month since my last update -- and if you were suspecting that I was blogging about blogging and spending time getting LaTeX working on this site as procrastination because this next section was always going to be a hard one, then you were 100% right! The good news is that -- as so often happens with these things -- it turned out not to be all that tough when I really got down to it. Momentum regained.

If you found this blog through the blogging-about-blogging, welcome! Those posts were not all that typical, though, and I hope you'll enjoy this return to my normal form.

This time I'm covering section 3.4, "Implementing self-attention with trainable weights". How do we create a system that can learn how much attention to pay to the other words in a sentence when looking at a given one -- for example, that learns that in "the fat cat sat on the mat", when you're looking at "cat", the word "fat" is important, but when you're looking at "mat", "fat" doesn't matter as much?
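To make that concrete, here's a minimal sketch of self-attention with trainable weights in PyTorch -- the shapes and variable names are made up for illustration, not taken from the book:

import torch

torch.manual_seed(0)
x = torch.randn(6, 8)  # toy embeddings for a 6-token sentence, dim 8

# The trainable bits: projections from embeddings to queries/keys/values.
W_q = torch.nn.Parameter(torch.randn(8, 4))
W_k = torch.nn.Parameter(torch.randn(8, 4))
W_v = torch.nn.Parameter(torch.randn(8, 4))

queries, keys, values = x @ W_q, x @ W_k, x @ W_v
scores = queries @ keys.T / keys.shape[-1] ** 0.5  # scaled dot product
weights = torch.softmax(scores, dim=-1)  # how much each token attends to the others
context = weights @ values  # each token's weighted mix of the values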

[ Read more ]