Moving from Fabric3 to Fabric
I decided to see how long I could go without coding anything after starting my sabbatical -- I was thinking I could manage a month or so, but it turns out that the answer was one week. Ah well. This post is a little less deep-tech than normal -- it's just the story of some cruft removal I've wanted to do for some time -- and I'm posting it because I'm pretty sure I know some people who are planning to do the same upgrade; hopefully it will be useful for them :-)
I have a new laptop on the way, and wanted to polish up the script I have to install the OS. I like all of my machines to have a pretty consistent Arch install, with a few per-machine tweaks for different screen sizes and whatnot. The process is:
- Boot the new machine from the Arch install media, and get it connected to the LAN.
- Run the script on another machine on the same network -- provide it with the root password (having temporarily enabled root login via SSH) and the IP.
- Let it do its thing, providing it with extra information every now and then -- for example, after reboots it will ask whether the IP address is still the same.
This works pretty well, and I've been using it since 2017. It interacts with the machine using Fabric. I've never taken to declarative machine setup systems like Ansible -- I always find you wind up re-inventing procedural logic in them eventually, and it turns into a mess -- so a tool like Fabric is ideal. You can just run commands over the network, upload and download files, and so on.
The problem was that I was using Fabric3. When I started writing these scripts in 2017, it was a bit of a weird time for Fabric. It didn't support Python 3 yet, so the only way to work in a modern Python was to use Fabric3, a fork that just added that.
I think the reason behind the delay in Python 3 support for the main project was that the team behind it were in the process of redesigning it with a new API, and wanted to batch the changes together; when Fabric 2.0.0 came out in 2018, with a completely different usage model, it was Python 3 compatible. (It does look like they backported the Python 3 stuff to the 1.x series later -- at least on PyPI there is a release of 1.15.0 in 2022 that added it.)
So, I was locked in to an old dependency, Fabric3, which hadn't been updated since 2018. This felt like something I should fix, just to keep things reasonably tidy. But that meant completely changing the model of how my scripts ran -- this blog post is a summary of what I had to do. The good news is: it was actually really simple, and the new API is definitely an improvement.
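For anyone planning the same migration, here's a rough before-and-after sketch of the two styles. This isn't my actual script -- the IP address, password handling, filenames and commands are all placeholders -- but the API shapes are the real Fabric ones.

```python
# Fabric 1.x / Fabric3 style: module-level functions acting on implicit
# global state stored in `env`.
from fabric.api import env, put, run

env.host_string = "root@192.168.1.50"    # placeholder IP
env.password = "the-root-password"       # placeholder -- don't hard-code this!

def install_base_system():
    put("mirrorlist", "/etc/pacman.d/mirrorlist")
    run("pacstrap /mnt base linux linux-firmware")

install_base_system()
```

With Fabric 2 and later, the global `env` is gone; you create an explicit `Connection` object and call methods on it, which makes it much easier to pass around and reason about:

```python
# Fabric 2.x+ style: an explicit Connection object instead of global state.
from fabric import Connection

def install_base_system(conn):
    conn.put("mirrorlist", remote="/etc/pacman.d/mirrorlist")
    conn.run("pacstrap /mnt base linux linux-firmware")

conn = Connection(
    "192.168.1.50",                                      # placeholder IP
    user="root",
    connect_kwargs={"password": "the-root-password"},    # placeholder
)
install_base_system(conn)
```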
Leaving PythonAnywhere
Today was my last day at PythonAnywhere, the platform I co-founded back in 2011. We sold the business to Anaconda in 2022, and I'm confident that it's in good hands -- in particular, Glenn, Filip and Piotr from the original team are all staying on, and with the help of Nina, Lee and Alex who joined after the acquisition, plus the support of the larger Anaconda organisation, there's a fantastic future ahead for the platform and the community we've built around it.
It's been quite the ride! Back in late 2005, Robert, Patrick and I founded a company called Resolver Systems. Our goal was to free the world from messy Excel spreadsheets, and we built an amazing team 1 that created an amazing tool, called Resolver One -- a spreadsheet designed to integrate Python from the ground up. Unfortunately we made all of the textbook startup mistakes -- in particular, building in secret for more than a year and only getting it in front of potential users when that was done. Added to that, we had the bad luck of releasing a product targeting financial users in 2008, just in time for many of our potential customers to start going bust. Ooops.
We pushed ahead for a while, but eventually in 2010 we pivoted to Dirigible, a cloud-based version of Resolver One -- a Pythonic spreadsheet, but web-based. And we'd learned our lesson: we got it from idea to first release in less than a month, and started building up a group of active users.
Within six months or so, we realised that our users were using Dirigible differently to how we imagined. They were ignoring the spreadsheet grid entirely and just using it as a Python cloud IDE. "It would be great if I could host a website through this thing", they told us. "Any chance of adding on a MySQL database so that I can store stuff there?"
We had a direction to take it -- one that people really seemed keen on! So in early 2011, it was time to pivot again, and Dirigible-with-the-spreadsheet-grid-removed became PythonAnywhere.
From there, things started taking off; I posted about the first ten years on the company blog with the details, but I remember the milestones: cocktails under Smithfield Market when we hit 1,000 customers; the day our revenue covered our AWS bill; the day it covered everything except salaries, then everyone’s salaries except mine -- and then, the day we broke even, independent and sustainable.
A few years after that, towards the end of 2021, a mutual friend put me in touch with Peter Wang at Anaconda. We talked about possibilities for working together; after a while, the idea of an acquisition came up, and it was obviously the right move. We had a popular and growing platform with technology that allowed us to host hundreds of thousands of users efficiently enough to offer free accounts; with Anaconda's leading position in the Python market -- especially in data science -- and its resources, not only could PythonAnywhere grow faster and better, but we could also help Anaconda build new products to cement its position. So, after some negotiations, in June 2022 we became part of the Anaconda family.
There were sad moments too, of course; of the original team, only Glenn and I remained (both from the Resolver days). Jonathan had moved on not long after we pivoted to PythonAnywhere, then Hansel after we'd been running for a few years, then Harry 2, and then Conrad.
Now, finally, it's time for me to move on too. It's hard to leave -- I feel like I've been finding excuses to delay it for months!
But fourteen years is a long time to work on something, and the pressure of running a startup, followed by the stress of due diligence and everything else that goes into an acquisition, has all built up. I stayed on at Anaconda as PythonAnywhere team lead for three years after the sale to help with the transition and make sure everything was in good shape. And -- as far as one can be in a changing world -- I'm confident that it is now.
It's been an amazing fourteen years, and I'd like to thank everyone who was involved -- the team, of course, but also the community around the platform. To everyone who's ever signed up, subscribed, filed a bug, made a suggestion, or told a friend -- to all of our paying customers, and everyone who's supported us in other ways -- thank you! PythonAnywhere exists thanks to you.
So, what next? My plan is to take a year off to clear my head, reset, and relax. Sara, my wife, is betting that I'll last less than six months before I get bored and start something else.
She may be right.
In the meantime, you can expect posting to continue here as normal -- maybe even more than in the past, what with all the extra time :-)
And finally -- if you have any PythonAnywhere stories to share, please do leave a comment below. I'd love to hear from you!
1. Including the inimitable Michael Foord, who was sadly taken from us before his time earlier this year.
2. Not only an awesome developer, but also a gonzo marketer par excellence, and the author of what is either the best or the worst company newsletter ever written.
Writing an LLM from scratch, part 15 -- from context vectors to logits; or, can it really be that simple?!
Having worked through chapter 3 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and spent some time digesting the concepts it introduced (most recently in my post on the complexity of self-attention at scale), it's time for chapter 4.
I've read it through in its entirety, and rather than working through it section-by-section in order, like I did with the last one, I think I'm going to jump around a bit, covering each new concept and how I wrapped my head around it separately. This chapter is a lot easier conceptually than the last, but there were still some "yes, but why do we do that?" moments.
The first of those is the answer to a question I'd been wondering about since at least part 6 in this series, and probably before. The attention mechanism is working through the (tokenised, embedded) input sequence and generating these rich context vectors, each of which expresses the "meaning" of its respective token in the context of the words that came before it. How do we go from there to predicting the next word in the sequence?
The answer, at least in the form of code showing how it happens, leaped out at me the first time I looked at the first listing in this chapter, for the initial `DummyGPTModel` that will be filled in as we go through it. In its `__init__`, we create our token and position embedding mappings, and an object to handle dropout, then the multiple layers of attention heads (which are a bit more complex than the heads we've been working with so far, but more on that later), then some kind of normalisation layer, then:
self.out_head = nn.Linear(
    cfg["emb_dim"], cfg["vocab_size"], bias=False
)
...and then in the `forward` method, we run our tokens through all of that and then:
logits = self.out_head(x)
return logits
The `x` in that second bit of code is our context vectors from all of that hard work the attention layers did -- folded, spindled and mutilated a little by things like layer normalisation and being run through feed-forward networks with GELU (about both of which I'll go into in future posts) -- but ultimately just the context vectors.
And all we do to convert it into these logits, the output of the LLM, is run it through a single neural network layer. There's not even a bias, or an activation function -- it's basically just a single matrix multiplication!
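If it helps to see the shapes involved, here's a tiny standalone sketch -- my own illustration, not the book's code, with made-up but GPT-2-ish sizes -- showing just the transformation that the output head performs:

```python
import torch
import torch.nn as nn

emb_dim, vocab_size = 768, 50257   # illustrative GPT-2-ish sizes
batch_size, seq_len = 2, 5

# Pretend these are the context vectors coming out of the attention stack:
# one emb_dim-sized vector per token, per sequence in the batch.
x = torch.randn(batch_size, seq_len, emb_dim)

# The output head: a single bias-free linear layer, i.e. one weight matrix
# of shape (vocab_size, emb_dim).
out_head = nn.Linear(emb_dim, vocab_size, bias=False)

logits = out_head(x)
print(logits.shape)  # torch.Size([2, 5, 50257]) -- one score per vocab entry, per token
```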
My initial response was, essentially, WTF. Possibly WTFF. Gradient descent over neural networks is amazingly capable at learning things, but this seemed quite a heavy lift. Why would something so simple work? (And also, what are "logits"?)
Unpicking that took a bit of thought, and that's what I'll cover in this post.
Writing an LLM from scratch, part 14 -- the complexity of self-attention at scale
Between reading chapters 3 and 4 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", I'm taking a break to solidify a few things that have been buzzing through my head as I've worked through it. Last time I posted about how I currently understand the "why" of the calculations we do for self-attention. This time, I want to start working through my budding intuition on how this algorithm behaves as we scale up context length. As always, this is to try to get my own thoughts clear in my head, with the potential benefit of helping out anyone else at the same stage as me -- if you want expert explanations, I'm afraid you'll need to look elsewhere :-)
The particular itch I want to scratch is around the incredible increases in context lengths over the last few years. When ChatGPT first came out in late 2022, it was pretty clear that it had a context length of a couple of thousand tokens; conversations longer than that became increasingly surreal. But now it's much better -- OpenAI's GPT-4.1 model has a context window of 1,047,576 tokens, and Google's Gemini 1.5 Pro is double that. Long conversations just work -- and the only downside is that you hit rate limits faster if they get too long.
It's pretty clear that some impressive engineering has gone into achieving that. And while understanding those enhancements to the basic LLM recipe is one of the side quests I'm trying to avoid while reading this book, I think it's important to make sure I'm clear in my head about what the problems are, even if I don't look into the solutions.
So: why is context length a problem?
Writing an LLM from scratch, part 13 -- the 'why' of attention, or: attention heads are dumb
Now that I've finished chapter 3 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)" -- having worked my way through multi-head attention in the last post -- I thought it would be worth pausing to take stock before moving on to Chapter 4.
There are two things I want to cover: the "why" of self-attention, and some thoughts on context lengths. This post is on the "why" -- that is, why do the particular matrix multiplications described in the book do what we want them to do?
As always, this is something I'm doing primarily to get things clear in my own head -- with the possible extra benefit of it being of use to other people out there. I will, of course, run it past multiple LLMs to make sure I'm not posting total nonsense, but caveat lector!
Let's get into it. As I wrote in part 8 of this series:
I think it's also worth noting that [what's in the book is] very much a "mechanistic" explanation -- it says how we do these calculations without saying why. I think that the "why" is actually out of scope for this book, but it's something that fascinates me, and I'll blog about it soon.
That "soon" is now :-)
Writing an LLM from scratch, part 12 -- multi-head attention
In this post, I'm wrapping up chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered batches, which -- somewhat to my disappointment -- didn't involve completely new (to me) high-order tensor multiplication, but instead relied on batched and broadcast matrix multiplication. That was still interesting on its own, however, and at least was easy enough to grasp that I didn't disappear down a mathematical rabbit hole.
The last section of chapter 3 is about multi-head attention, and while it wasn't too hard to understand, there were a couple of oddities that I want to write down -- as always, primarily to get it all straight in my own head, but also just in case it's useful for anyone else.
So, the first question is, what is multi-head attention?
Writing an LLM from scratch, part 11 -- batches
I'm still working through chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered dropout, which was nice and easy.
This time I'm moving on to batches. Batches allow you to run a bunch of different input sequences through an LLM at the same time, generating outputs for each in parallel, which can make training and inference more efficient -- if you've read my series on fine-tuning LLMs you'll probably remember I spent a lot of time trying to find exactly the right batch sizes for speed and for the memory I had available.
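As a tiny illustration of what a batch actually is -- my own sketch, not the book's code, with made-up token IDs:

```python
import torch

# Three tokenised input sequences, each four tokens long, stacked into a
# single (batch_size, seq_len) tensor so they can flow through the model
# in parallel.
batch = torch.tensor([
    [101, 2009, 2003, 4201],   # arbitrary token IDs, purely for illustration
    [101, 1045, 2066, 3256],
    [101, 2023, 2001, 9999],
])
print(batch.shape)  # torch.Size([3, 4])
```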
This was something I was originally planning to go into in some depth, because there's some fundamental maths there that I really wanted to understand better. But the more time I spent reading into it, the more of a rabbit hole it became -- and I had decided on a strict "no side quests" rule when working through this book.
So in this post I'll just present the basic stuff, the stuff that was necessary for me to feel comfortable with the code and the operations described in the book. A full treatment of linear algebra and higher-order tensor operations will, sadly, have to wait for another day...
Let's start off with the fundamental problem of why batches are a bit tricky in an LLM.
Dropout and mandatory vacation
As I was dozing off the other night, after my post on dropout, it popped into my mind that dropout is not dissimilar to something many financial firms do. They require certain key employees to take at least two consecutive weeks of holiday every year -- not because they're kind employers who believe in a healthy work-life balance (source: I worked for one) but because it makes sure the firm is functioning safely and effectively, at a small cost in performance.
There are two reasons this helps:
- It reduces key-person risk. By enforcing vacation like this, they make absolutely sure that the business can continue to operate even if some people are out. If stuff goes wrong while they're out, then obviously processes are broken or other people don't have the knowledge they need to pick up the slack. So long as it's well-managed, those problems can be fixed, which means that if the key people quit, there's less damage done. Think of it as being like increasing the bus factor of a dev team.
- It can uncover misbehaviour. Let's imagine a trader is doing something they shouldn't -- maybe fraud, or perhaps just covering up for their mistakes so that they don't get fired. They might be able to manage that by shuffling balances around if they're in the office every day, but two weeks out should mean that whoever is covering for them will work out that something isn't right.
Now, I should admit that the second of those (a) doesn't really apply to dropout 1, and (b) is probably the more important of the two from a bank's perspective.
But the first, I think, is a great metaphor for dropout during training. What we want to do is make sure that no particular parameter is "key"; we want the knowledge and intelligence to be spread across the model as a whole.
That also clears up a couple of questions I had about dropout:
- It slows down training. Yes, if you're doing dropout, you'll see your error falling more slowly than if you don't -- just like the trading desk sees their performance drop a bit when their top trader is on mandatory vacation. But that's a cost you pay to gain performance at other times -- the rest of the year for the bank, or at inference time for the model.
- Do you keep gradients for, and back-prop to, the dropped-out parameters? No, just like the bank wouldn't put the people who were out of the office through training for issues that came up during their absence. They'd train the people or fix the systems that had problems instead.
Now, is this a perfect metaphor, or even a great one? Maybe not. But it works for me, and I thought I'd share it in case it's useful for anyone else. And I'm going to be looking for similar organisational metaphors for other ML techniques -- I think they are a useful way to clarify things, especially for those of us who (for better or for worse) have spent time in the trenches of actual organisations.
1. There might be some applicability to alignment training of multi-agent models, though?
Writing an LLM from scratch, part 10 -- dropout
I'm still chugging through chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered causal attention, which was pretty simple when it came down to it. Today it's another quick and easy one -- dropout.
The concept is pretty simple: you want knowledge to be spread broadly across your model, not concentrated in a few places. Doing that means that all of your parameters are pulling their weight, and you don't have a bunch of them sitting there doing nothing.
So, while you're training (but, importantly, not during inference) you randomly ignore certain parts -- neurons, weights, whatever -- each time around, so that their "knowledge" gets spread over to other bits.
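Here's a minimal sketch of that behaviour in PyTorch -- my own illustration, not the book's code:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # each element is zeroed with probability 0.5
x = torch.ones(1, 8)

drop.train()               # training mode: dropout is active
print(drop(x))             # roughly half the values are zeroed; the survivors
                           # are scaled up by 1 / (1 - p) to compensate

drop.eval()                # inference mode: dropout is a no-op
print(drop(x))             # all ones, unchanged
```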
Simple enough! But the implementation is a little more fun, and there were a couple of oddities that I needed to think through.
Adding /llms.txt
The /llms.txt file is an idea from Jeremy Howard. Rather than making LLMs parse websites with HTML designed to make it look pretty for humans, why not publish the same content separately as Markdown? It's generally not much extra effort, and could make your content more discoverable and useful for people using AIs.
I think it's most useful for things like software documentation; Stripe and Anthropic seem to think so too, having both recently added it for theirs.
It's less obviously useful for a blog like this. But I write everything here in Markdown anyway, and just run it through markdown2 and some Jinja2 templates to generate the HTML, so I thought adding support would be a bit of fun; here it is.
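For what it's worth, the core of it can be as simple as something like this -- a hypothetical sketch, not my actual build code, with made-up paths and directory layout:

```python
from pathlib import Path

def write_llms_txt(posts_dir: Path, output_dir: Path) -> None:
    """Concatenate the site's raw Markdown posts into a single /llms.txt."""
    chunks = ["# Giles Thomas\n"]   # the proposal suggests starting with an H1 naming the site
    for post in sorted(posts_dir.glob("*.md")):      # hypothetical source layout
        chunks.append(post.read_text(encoding="utf-8"))
    (output_dir / "llms.txt").write_text("\n\n---\n\n".join(chunks), encoding="utf-8")
```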
One thing that isn't covered by the proposal, at least as far as I could see, is how LLMs should know that there is a special version of the site just for them. A `link` tag with `rel` set to `alternate` seemed like a good idea for that; I already had one to help RSS readers find the feed URL:
<link rel="alternate" type="application/rss+xml" title="Giles Thomas" href="/feed/rss.xml" />
...so with a quick check of the docs to make sure I wasn't doing anything really stupid, I decided on this:
<link rel="alternate" type="text/markdown" title="LLM-friendly version" href="/llms.txt" />
There were a couple of other mildly interesting implementation details.