Writing an LLM from scratch, part 30 -- digging into the LLM-as-a-judge results
I'm still working on my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Last time around, I trained four base models, using the GPT-2 architecture from the book, on Lambda Labs machines. I used two ways to compare them against each other, against three models I'd trained locally, and against the original GPT-2 weights from OpenAI:
- A simple cross entropy loss over a fixed test set.
- The results for an instruction fine-tune test that's covered in the book.
Here were the results I got, sorted by the loss:
| Model | Test loss | IFT score |
|---|---|---|
| OpenAI weights: medium | 3.231 | 38.53 |
| OpenAI weights: small | 3.500 | 22.98 |
| Cloud FineWeb, 8x A100 40 GiB | 3.674 | 17.09 |
| Cloud FineWeb, 8x H100 80 GiB | 3.725 | 11.98 |
| Cloud FineWeb, 8x A100 80 GiB | 3.730 | 11.71 |
| Cloud FineWeb, 8x B200 160 GiB | 3.771 | 13.89 |
| Local FineWeb train | 3.944 | 16.01 |
| Local FineWeb-Edu extended train | 4.135 | 14.55 |
| Local FineWeb-Edu train | 4.167 | 16.86 |
Now, you'd expect there to be at least a loose correlation: the lower the loss, the higher the IFT score. But, while we can see a difference of that kind between the OpenAI weights and our own, within our own models there doesn't seem to be any such pattern.
I think that the problem is that the results from the GPT-5.1 LLM-as-a-judge are not consistent between models. That's not a complaint about the code or its original design, of course -- it was originally written as part of the LLM book as a way of doing a quick test on an instruction fine-tuned model that we'd spent the previous 238 pages writing -- just something that was a bit more efficient than reading hundreds of input/output pairs ourselves. It was never meant to be a tool to compare models in the way I'm using it now.
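For reference, the judging step works along these lines. This is a paraphrased sketch from memory, not the book's exact prompt or code, and `parse_score` is a hypothetical helper of my own:

```python
import re

def build_judge_prompt(instruction, reference, model_output):
    # Hypothetical wording, loosely following the book's approach:
    # ask the judge LLM for a single 0-100 integer score per response.
    return (
        f"Given the input `{instruction}` "
        f"and the correct output `{reference}`, "
        f"score the model response `{model_output}` "
        f"on a scale from 0 to 100, where 100 is the best score. "
        f"Respond with the integer number only."
    )

def parse_score(judge_reply, default=0):
    # Judges don't always obey "integer only", so grab the first number we see.
    match = re.search(r"\d+", judge_reply)
    return int(match.group()) if match else default
```

The IFT score in the table is then (as I understand the book's setup) the average of these per-response scores across the test set -- which is exactly where per-model inconsistency in the judge can creep in.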
In this post I'll dig into why it doesn't work for this kind of thing, and see if that's something we can change.
Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud
I'm carrying on with my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Having proven that I could train a GPT-2 small scale base model from scratch on my RTX 3090 in 48 hours, I wanted to try training it on a multi-GPU machine on Lambda Labs. There are two benefits I see in doing that:
- I can learn what you need to change in a simple single-GPU training loop to make it multi-GPU.
- If I can get the training time for a full base model down from 48 hours to something more manageable (and hopefully not too expensive) -- then I can try a few experiments to see how I can improve the quality of the trained model. I have a bunch of ideas about why my own base model wasn't as good as the original OpenAI one, and it would be good to know which (if any) of them are right.
In addition, I wanted to see if anything unexpected dropped out of it; after all, there were four different sizes of machines that I wanted to try, so I'd be doing four from-scratch trains on the same dataset. Does the machine size affect the quality of the model in some way?
Here's what happened. As with the last post, this is a set of tidied-up lab notes, so you can see the full journey. There's a lot to it! I was considering splitting it into multiple posts -- "writing the code", "building the datasets", "running the trains" -- but they're interleaved. Each train taught me something about how to structure the code to make it easier to use, so the code kept changing.
So I think it's worth documenting the process as it really was. If at some point I want to write a how-to document on porting single-GPU code to multi-GPU, I'll be able to mine this for resources, and in the meantime, hopefully this will be of use to readers -- even if it's just at the level of "I got this error message, how do I fix it?"
Anyway, once again I don't want to bury the lede, so: after spending US$215.16 on various trains on various servers, I was able to find that a reasonably cheap instance on Lambda Labs, with 8x A100 GPUs, each of which has 40 GiB of VRAM, is the sweet spot for this particular 163M-parameter, ~Chinchilla-optimal single-epoch run. They can train the model in less than four hours, they happen to be the right size for batches that minimise loss (more on that later), and can do that train for about US$35, excluding validation.
If you'd like to read the gory details of what I did, then read on -- but if you prefer, you can jump straight to the results.
Writing an LLM from scratch, part 28 -- training a base model from scratch on an RTX 3090
Having worked through the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", I wanted to try an experiment: is it possible to train a base model of my own, on my own hardware?
The book shows you how to train your LLM, does a basic training run on a small dataset, and then we switch to downloading the "pre-cooked" weights from OpenAI. That makes sense given that not every reader will have access to enough hardware to really train from scratch. And right back at the start of this series, I did some naive scaling of numbers I'd got when fine-tuning LLMs and came to the conclusion that it would be impossible in a reasonable time.
But the speed I got with my RTX 3090 on the book's small training run made me think that perhaps -- just perhaps! -- it might actually be possible to train a model of this size -- about 163M parameters -- on my own hardware. Not, perhaps, on a small laptop, but at least on a reasonably high-end "gaming" PC.
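A quick back-of-the-envelope check on that parameter count, from the standard GPT-2-small config (50,257-token vocabulary, 1,024 context length, 768-dim embeddings, 12 layers), assuming untied input/output embeddings as in the book's version of the model:

```python
vocab, ctx, d, layers = 50257, 1024, 768, 12

tok_emb = vocab * d          # token embedding table
pos_emb = ctx * d            # position embedding table
per_layer = (
    3 * (d * d + d)          # Q, K, V projections (with biases)
    + (d * d + d)            # attention output projection
    + (d * 4 * d + 4 * d)    # FFN up-projection (4x width)
    + (4 * d * d + d)        # FFN down-projection
    + 2 * 2 * d              # two LayerNorms (scale + shift each)
)
final_ln = 2 * d
out_head = vocab * d         # untied output projection, bias=False

total = tok_emb + pos_emb + layers * per_layer + final_ln + out_head
tied = total - out_head      # GPT-2 proper ties out_head to tok_emb
print(f"{total:,} untied, {tied:,} tied")  # ~163M untied vs the famous "124M" tied
```

So the "163M" comes from leaving the output head untied from the token embeddings; tie them and you get OpenAI's quoted 124M.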
Additionally, Andrej Karpathy recently announced nanochat, "the best ChatGPT that $100 can buy". He mentions on the main page that he's trained a model called d32, with 32 Transformer layers, which has 1.9B parameters, for about $800. His smaller 20-layer d20 model, with 561M parameters, he says should be trainable in about four hours on an 8x H100 GPU node, which costs about $24/hour -- hence the $100 total price.
What's even more interesting about nanochat is that it's built with PyTorch; initially I'd got the impression that it was based on his pure C/CUDA llm.c, which I would imagine would give a huge speedup. But no -- he's using the same stack as I have been in this series!
Karpathy's models are both larger than my 163M parameters, and he trains them quickly and cheaply, so it definitely sounded like this might be doable. Obviously, I'm nowhere near as experienced an AI developer, and he's using a larger machine (8 GPUs, each with more than 3x the VRAM of mine), but he's also including the time to train a tokeniser and to instruction fine-tune within that four hours -- and even his smaller model is more than three times larger than mine. So that should all help.
This post is a little less structured than the others in my LLM from scratch series, as it's essentially a tidied version of the notes I kept as I worked through the project.
But so as not to bury the lede: using the Hugging Face FineWeb-series datasets, I was able to train a GPT-2 small sized base model to a level where it was almost as good as the original in just over 48 hours on my own hardware! Base models: not just for the big AI labs.
Here's the full story.
Retro Language Models: Rebuilding Karpathy’s RNN in PyTorch
I recently posted about Andrej Karpathy's classic 2015 essay, "The Unreasonable Effectiveness of Recurrent Neural Networks". In that post, I went through what the essay said, and gave a few hints on how the RNNs he was working with at the time differ from the Transformers-based LLMs I've been learning about.
This post is a bit more hands-on. To understand how these RNNs really work, it's best to write some actual code, so I've implemented a version of Karpathy's original code using PyTorch's built-in LSTM class -- here's the repo. I've tried to stay as close as possible to the original, but I believe it's reasonably PyTorch-native in style too. (Which is maybe not all that surprising, given that he wrote it using Torch, the Lua-based predecessor to PyTorch.) In this post, I'll walk through how it works, as of commit daab2e1. In follow-up posts, I'll dig in further, actually implementing my own RNNs rather than relying on PyTorch's.
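The overall shape of such a model looks roughly like this -- a simplified sketch rather than the repo's actual code, with names and sizes of my own choosing:

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Character-level language model: embed, run through LSTM, project to vocab."""
    def __init__(self, vocab_size, emb_dim=64, hidden=128, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x, state=None):
        emb = self.embed(x)               # (batch, seq) -> (batch, seq, emb_dim)
        out, state = self.lstm(emb, state)
        return self.head(out), state      # logits (batch, seq, vocab), carried state
```

Sampling is then a loop: feed in the last character, pick from the logits, and carry the (hidden, cell) state forward -- the state is what replaces the Transformer's attention over the whole context.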
All set?
The fixed length bottleneck and the feed forward network
This post is a kind of note-to-self of a hitch I'm having in my understanding of the mechanics of LLMs at this point in my journey. Please treat it as the musings of a learner, and if you have suggestions on ways around this minor roadblock, comments below would be very welcome!
Having read about and come to the seeds of a working understanding of the role of the feed-forward network in a GPT-style LLM, something has come to mind that I'm still working my way through. It's likely due to a bug in at least one of the mental models I've constructed so far, so what I'd like to do in this post is express the issue as clearly as I can. Hopefully having done that I'll be able to work my way through it in the future, and will be able to post about the solution.
The core of the issue is that the feed-forward network operates on a per-context-vector basis -- that is, the context vectors for each and every token are processed by the same one-hidden-layer neural network in parallel, with no crosstalk between them -- the inter-token communication is all happening in the attention mechanism.
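That per-position independence is easy to demonstrate. Here's a small sketch (sizes from GPT-2 small, but this is not the book's actual module):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One-hidden-layer FFN, applied identically at every position:
# nn.Linear acts on the last dimension, so (batch, seq, 768) works directly.
ffn = nn.Sequential(nn.Linear(768, 4 * 768), nn.GELU(), nn.Linear(4 * 768, 768))

x = torch.randn(1, 10, 768)   # context vectors for 10 tokens
y = ffn(x)                    # (1, 10, 768)

# Perturb one token's context vector; only that position's output changes.
x2 = x.clone()
x2[0, 3] += 1.0
y2 = ffn(x2)
```

Every other position's output is untouched -- there really is no crosstalk inside the FFN.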
But this means that the amount of data that the FFN is handling is fixed -- it's a vector of numbers, with a dimensionality determined by the LLM's architecture -- 768 for the 124M parameter GPT-2 model I'm studying.
Here's the issue: in my mental model of the LLM, the attention mechanism is working out what to think about, but the FFN is what's doing the thinking (for hand-wavy values of "thinking"). So, given that it's thinking about one context vector at a time, there's a limit to how much it can think about -- just whatever can be represented in those 768 dimensions for this size of GPT-2.
This reminds me very much of the fixed-length bottleneck that plagued early encoder-decoder translation systems. There's a limit to how much data you can jam into a single vector.
Now, this is an error of some kind on my side -- I'm nowhere near knowledgeable enough about LLMs or AI in general to have genuinely spotted a problem like this. And I'm pretty sure that the answer lies in one of my mental models being erroneous.
It seems likely that it's related to the interplay between the attention mechanism and the FFNs; that's certainly what's come through in my discussions with various AIs about it. But none of the explanations I've read has been quite enough to gel for me, so in this post I'll detail the issue as well as I can, so that later on I can explain the error in my ways :-)
Moving from Fabric3 to Fabric
I decided to see how long I could go without coding anything after starting my sabbatical -- I was thinking I could manage a month or so, but it turns out that the answer was one week. Ah well. This post is the result, and it's a little less deep-tech than normal -- just the story of some cruft removal I've wanted to do for some time. I'm posting it because I'm pretty sure I know some people who are planning to do the same upgrade -- hopefully it will be useful for them :-)
I have a new laptop on the way, and wanted to polish up the script I have to install the OS. I like all of my machines to have a pretty consistent Arch install, with a few per-machine tweaks for different screen sizes and whatnot. The process is:
- Boot the new machine from the Arch install media, and get it connected to the LAN.
- Run the script on another machine on the same network -- provide it with the root password (having temporarily enabled root login via SSH) and the IP.
- Let it do its thing, providing it with extra information every now and then -- for example, after reboots it will ask whether the IP address is still the same.
This works pretty well, and I've been using it since 2017. The way it interacts with the machine is by using Fabric. I've never taken to declarative machine setup systems like Ansible -- I always find you wind up re-inventing procedural logic in them eventually, and it winds up being a mess -- so a tool like Fabric is ideal. You can just run commands over the network, upload and download files, and so on.
The problem was that I was using Fabric3. When I started writing these scripts in 2017, it was a bit of a weird time for Fabric. It didn't support Python 3 yet, so the only way to work in a modern Python was to use Fabric3, a fork that just added that.
I think the reason behind the delay in Python 3 support for the main project was that the team behind it were in the process of redesigning it with a new API, and wanted to batch the changes together; when Fabric 2.0.0 came out in 2018, with a completely different usage model, it was Python 3 compatible. (It does look like they backported the Python 3 stuff to the 1.x series later -- at least on PyPI there is a release of 1.15.0 in 2022 that added it.)
So, I was locked in to an old dependency, Fabric3, which hadn't been updated since 2018. This felt like something I should fix, just to keep things reasonably tidy. But that meant completely changing the model of how my scripts ran -- this blog post is a summary of what I had to do. The good news is: it was actually really simple, and the new API is definitely an improvement.
Writing an LLM from scratch, part 15 -- from context vectors to logits; or, can it really be that simple?!
Having worked through chapter 3 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and spent some time digesting the concepts it introduced (most recently in my post on the complexity of self-attention at scale), it's time for chapter 4.
I've read it through in its entirety, and rather than working through it section-by-section in order, like I did with the last one, I think I'm going to jump around a bit, covering each new concept and how I wrapped my head around it separately. This chapter is a lot easier conceptually than the last, but there were still some "yes, but why do we do that?" moments.
The first of those is the answer to a question I'd been wondering about since at least part 6 in this series, and probably before. The attention mechanism is working through the (tokenised, embedded) input sequence and generating these rich context vectors, each of which expresses the "meaning" of its respective token in the context of the words that came before it. How do we go from there to predicting the next word in the sequence?
The answer, at least in the form of code showing how it happens, leaped out at me the first time I looked at the first listing in this chapter, for the initial DummyGPTModel that will be filled in as we go through it. In its __init__, we create our token and position embedding mappings, and an object to handle dropout, then the multiple layers of attention heads (which are a bit more complex than the heads we've been working with so far, but more on that later), then some kind of normalisation layer, then:

```python
self.out_head = nn.Linear(
    cfg["emb_dim"], cfg["vocab_size"], bias=False
)
```
...and then in the forward method, we run our tokens through all of that and then:

```python
logits = self.out_head(x)
return logits
```
The x in that second bit of code is our context vectors from all of that hard work the attention layers did -- folded, spindled and mutilated a little by things like layer normalisation and being run through feed-forward networks with GELU (about both of which I'll go into in future posts) -- but ultimately just the context vectors.
And all we do to convert it into these logits, the output of the LLM, is run it through a single neural network layer. There's not even a bias, or an activation function -- it's basically just a single matrix multiplication!
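To see just how simple that is, here's a self-contained sketch of my own (with the real GPT-2-small sizes) of the whole journey from context vectors to next-token prediction:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_dim, vocab_size = 768, 50257

out_head = nn.Linear(emb_dim, vocab_size, bias=False)

x = torch.randn(1, 5, emb_dim)   # context vectors for 5 tokens
logits = out_head(x)             # (1, 5, 50257): one score per vocab entry, per position

# No bias, no activation: it really is just a matrix multiplication.
assert torch.allclose(logits, x @ out_head.weight.T, atol=1e-5)

# To predict the next token, we only need the logits for the *last* position.
next_token = logits[0, -1].argmax()
```

During training, the weights of that one matrix have to learn to map "meaning" directions in the 768-dim space onto scores for 50,257 tokens -- which is the part that feels like a heavy lift.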
My initial response was, essentially, WTF. Possibly WTFF. Gradient descent over neural networks is amazingly capable at learning things, but this seemed quite a heavy lift. Why would something so simple work? (And also, what are "logits"?)
Unpicking that took a bit of thought, and that's what I'll cover in this post.
Writing an LLM from scratch, part 12 -- multi-head attention
In this post, I'm wrapping up chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered batches, which -- somewhat to my disappointment -- didn't involve completely new (to me) high-order tensor multiplication, but instead relied on batched and broadcast matrix multiplication. That was still interesting on its own, however, and at least was easy enough to grasp that I didn't disappear down a mathematical rabbit hole.
The last section of chapter 3 is about multi-head attention, and while it wasn't too hard to understand, there were a couple of oddities that I want to write down -- as always, primarily to get it all straight in my own head, but also just in case it's useful for anyone else.
So, the first question is, what is multi-head attention?
Writing an LLM from scratch, part 11 -- batches
I'm still working through chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered dropout, which was nice and easy.
This time I'm moving on to batches. Batches allow you to run a bunch of different input sequences through an LLM at the same time, generating outputs for each in parallel, which can make training and inference more efficient -- if you've read my series on fine-tuning LLMs you'll probably remember I spent a lot of time trying to find exactly the right batch sizes for speed and for the memory I had available.
This was something I was originally planning to go into in some depth, because there's some fundamental maths there that I really wanted to understand better. But the more time I spent reading into it, the more of a rabbit hole it became -- and I had decided on a strict "no side quests" rule when working through this book.
So in this post I'll just present the basic stuff, the stuff that was necessary for me to feel comfortable with the code and the operations described in the book. A full treatment of linear algebra and higher-order tensor operations will, sadly, have to wait for another day...
Let's start off with the fundamental problem of why batches are a bit tricky in an LLM.
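As a taste of the machinery involved, here's a sketch of the two matmul patterns the chapter relies on, with small made-up sizes:

```python
import torch

B, T, d = 4, 8, 16               # batch size, sequence length, embedding dim
q = torch.randn(B, T, d)
k = torch.randn(B, T, d)

# Batched matmul: each of the B sequences gets its own (T, T) attention-score matrix.
scores = q @ k.transpose(1, 2)   # (B, T, d) @ (B, d, T) -> (B, T, T)

# Broadcast matmul: one shared weight matrix applied across the whole batch.
W = torch.randn(d, d)
projected = q @ W                # (B, T, d) @ (d, d) -> (B, T, d)
```

The first case multiplies matrix-by-matrix within each batch element; the second broadcasts a single weight matrix over all of them -- and keeping track of which is happening where is most of the battle.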
Writing an LLM from scratch, part 9 -- causal attention
My trek through Sebastian Raschka's "Build a Large Language Model (from Scratch)" continues... Self-attention was a hard nut to crack, but now things feel a bit smoother -- fingers crossed that lasts! So, for today: a quick dive into causal attention, the next part of chapter 3.
Causal attention sounds complicated, but it just means that when we're looking at a word, we don't pay any attention to the words that come later. That's pretty natural -- after all, when we're reading something, we don't need to look at words later on to understand what the word we're reading now means (unless the text is really badly written).
It's "causal" in the sense of causality -- something can't have an effect on something that came before it, just like causes can't come after effects in reality. One big plus about getting that is that I finally now understand why I was using a class called AutoModelForCausalLM for my earlier experiments in fine-tuning LLMs.
Let's take a look at how it's done.
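The core trick, as a sketch (mirroring the approach the book takes, though not its exact code): mask out the attention scores for future positions before the softmax, so they end up with zero weight:

```python
import torch

torch.manual_seed(0)
T = 5                              # sequence length
scores = torch.randn(T, T)         # raw attention scores, one row per query token

# True above the diagonal = "this key is in the future for this query token"
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

masked = scores.masked_fill(future, float("-inf"))
weights = torch.softmax(masked, dim=-1)
# Each row still sums to 1, but all the weight sits on current-and-earlier tokens.
```

Masking with -inf (rather than zeroing after the softmax) keeps each row a proper probability distribution, since exp(-inf) is exactly 0.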