Writing an LLM from scratch, part 32e -- Interventions: the learning rate
I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)".
In my training code, I have this code to create the optimiser:
```python
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.0004, weight_decay=0.1
)
```
The values in there -- 0.0004 for the learning rate, and 0.1 for the weight
decay -- were just copied from the tiny training run that we do in section 5.2 of
the book.
What do those values actually mean, and are those really the right values for them?
I felt I had a good handle on the learning rate, at least -- it's one of the first things you learn when you start looking at machine learning of any kind -- but how would you go about working out what the correct value for it was? On top of that, when I was reading the Chinchilla paper a while back, I noticed they repeatedly referred to a "cosine cycle" for the learning rate, which didn't fit into anything I'd learned about before.
The weight decay was pretty much an unknown for me -- I know it is a parameter controlling the behaviour of the optimiser, but I don't know how it does that.
In this post I want to look into the learning rate, and these mysterious cosines; I'll write a follow-up about the weight decay later.
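As a preview of where this is going, the Chinchilla-style "cosine cycle" is just a linear warmup followed by a single half-cosine decay of the learning rate. Here's a minimal sketch in plain Python -- the `max_lr` matches the 0.0004 from the code above, but `min_lr`, `warmup_steps` and the shape of the warmup are illustrative values I've made up, not numbers from the paper or the book:

```python
import math

def lr_at_step(step, max_steps, max_lr=4e-4, min_lr=4e-5, warmup_steps=100):
    """Linear warmup to max_lr, then one cosine decay ("cosine cycle") to min_lr."""
    if step < warmup_steps:
        # Ramp linearly from ~0 up to max_lr over the warmup steps.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    # cos goes 1 -> -1 over [0, pi], so this factor goes 1 -> 0.
    return min_lr + (max_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))
```

In a training loop you'd call this each step and poke the result into `optimizer.param_groups`, or use PyTorch's built-in schedulers to the same effect.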
Writing an LLM from scratch, part 32d -- Interventions: adding attention bias
I'm still seeing what I can do to improve the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". This is the third intervention I'm trying: adding bias to the attention weight matrices.
In the code from the book, we have this:
```python
class MultiHeadAttention(nn.Module):
    def __init__(
        self,
        d_in, d_out,
        context_length,
        dropout,
        num_heads,
        qkv_bias=False
    ):
        ...
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        ...

    def forward(self, x):
        ...
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
```
So: we initialise the weights W_query, W_key and W_value as linear layers rather than simple matrices of weights, and have a parameter qkv_bias to say whether or not we should add bias to those. In all of our trains so far we've set that to False.
Why do we have this parameter, and where did it come from?
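To make concrete what the flag actually changes: with `bias=True`, `nn.Linear` computes `y = x @ W.T + b` instead of just `y = x @ W.T`, where `b` is a learnable per-output-dimension offset. A tiny sketch (the sizes here are made up for illustration, not the model's real dimensions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 8)  # two tokens with d_in = 8 (illustrative sizes)

no_bias = nn.Linear(8, 4, bias=False)   # y = x @ W.T
with_bias = nn.Linear(8, 4, bias=True)  # y = x @ W.T + b

# Copy the weights across so the only difference is the bias vector b.
with torch.no_grad():
    with_bias.weight.copy_(no_bias.weight)

# Every row of the difference between the two outputs is exactly b:
# the bias adds the same learnable offset to every token's projection.
diff = with_bias(x) - no_bias(x)
```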
Writing an LLM from scratch, part 32b -- Interventions: gradient clipping
I'm still working on training the best GPT-2 small sized base model that I can with a number of FLOPs roughly equal to two days on my own machine -- my "extra credit" exercise after having worked through Sebastian Raschka's book "Build a Large Language Model (from Scratch)".
In the last post I trained a baseline model -- one with the same architecture and almost the same training code as in the minimal training run in the book, just modified to run using DDP on an 8x A100 40 GiB/GPU machine in the cloud. There are a bunch of "interventions" I want to try to see if they'll make it better, as measured by the loss they get on a test set. I'll do a post for each intervention, and this is the first: gradient clipping.
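For context, gradient clipping in PyTorch is a one-line addition between `backward()` and `step()`. Here's a minimal sketch using a toy model standing in for the GPT -- the clipping call is the only line that matters, and the `max_norm=1.0` is an illustrative value, not necessarily what I used in the actual runs:

```python
import torch
import torch.nn as nn

# A toy stand-in for the real model and data.
model = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)
x, y = torch.randn(4, 10), torch.randn(4, 1)

loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
# Rescale all gradients in place so their combined L2 norm is at most
# 1.0; returns the norm as it was *before* clipping.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```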
Writing an LLM from scratch, part 32a -- Interventions: training a baseline model
I'm rounding out my series of posts on Sebastian Raschka's book "Build a Large Language Model (from Scratch)" by seeing how good a base model I can train from scratch on my own hardware. I started by training one in two days on my RTX 3090, and found that while it was a decent little model, it wasn't as good as the original GPT-2 small, either in terms of the loss it got on my test dataset, or in terms of how good it was at following instruction prompts after fine-tuning on them. I decided that I wanted to see what levers I could pull -- dropout, attention weight biases, and so on -- to make it better.
For that, I didn't want to have my PC tied up for days at a time with multiple long training runs, so I learned how to train faster in the cloud. That led to some refinements in the prompt-following test I was using, and I also spent a bit of time on a side quest getting the various models I'd trained onto Hugging Face Hub.
Now it's time to try the various "interventions", as I'll call them -- the levers to pull to see if I can make the model better. This post is to recap what they are, and to describe what I did to establish a baseline model to compare to.
Getting a custom PyTorch LLM onto the Hugging Face Hub (Transformers: AutoModel, pipeline, and Trainer)
I spent some time recently getting some models uploaded onto the Hugging Face Hub. I'd trained a bunch of GPT-2 small sized base models from scratch as part of my LLM from scratch series, and wanted to share them with anyone that was interested. I managed to get it done, but it was kind of tricky to get right.
The Hugging Face documentation is great if you're using the built-in models, but the coverage of custom architectures is... not quite as comprehensive. There are scattered examples, but they're all a bit vague and there's nothing really bringing them all together. But with what I could find, plus a lot of running things repeatedly, seeing how they failed, tweaking changes, banging my head against obscure stacktraces, and talking to various LLMs, I got there in the end.
This post is the tutorial I wish I'd found before I started, and I hope it's useful for people in a similar position. The one warning I'd give is that I did not dig into tokenisers in any depth. My own models use the standard GPT-2 one, and so I could just use the version that is built into Transformers. The setup you need to do with custom tokenisers doesn't look all that different to what you need to do for custom models, but as I haven't spent lots of time looking into it, I won't try to write a tutorial for something I've not done :-)
Firstly, why would you want to upload a model you've trained to Hugging Face? Well, let's say you've written and trained your own LLM -- you're learning how they work, or you've got a brilliant idea about how to tweak transformers to get that one step closer to AGI using the old gaming PC in your basement. You have some PyTorch code and a bunch of weights. How do you share it?
You could, of course, just dump the code on GitHub and share the weights somewhere. If people want to play with your model, they just need to download everything, install the dependencies, and then write code to load the weights and talk to your LLM -- run inference, fine-tune it, and so on.
That's quite a big "just", though. Not everyone who is going to want to look at your model will have the relatively deep knowledge required to do all of that. Speaking for myself, I spent quite some time fine-tuning and running inference on models long before I knew how the internals worked. I was able to do this because of the easy-to-use abstraction layer in Hugging Face's Transformers library, using models that had been uploaded to their hub.
What it would be nice to do is share the model within the Hugging Face ecosystem in a way that works smoothly. Let people run inference on it like this:
from transformers import pipeline
pipe = pipeline(task="text-generation", model="some-hf-user/some-model-name", trust_remote_code=True)
out = pipe(
"Every effort moves you",
max_new_tokens=20,
do_sample=True,
temperature=1.4,
top_k=25,
)
print(out[0]["generated_text"])
...rather than something daunting like this code with its 24 lines just to sample a few tokens from the model. Or to train it using code like what you see in this notebook -- a bit of config then trainer.train -- rather than like this, with its >100-line train function.
Here's what I had to do to get it working.
Writing an LLM from scratch, part 31 -- the models are now on Hugging Face
As part of my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", I've trained seven base models completely from scratch based on the book's GPT-2 code -- three locally, and four in the cloud. I plan to train more as I work on ways to improve the quality of the trained models, in the hope that I can get to something closer to the original OpenAI weights' loss on my own hardware, or at least on something I can rent without breaking the bank.
It makes sense to share these models somewhere, both so that other people can take a look if they like, and also to build the knowledge of how to do it so that if I produce something more interesting in the future, I'll know how to share that too.
Raschka's code is all released under the Apache v2 open source license, so I can share my stuff under the same license without worrying about triggering any legal issues. So: I've put all of the models I've trained so far on Hugging Face under that license, and made them reasonably HF-native (I'll explain what I mean by that later).
From the post where I trained the models locally, we have:
- gpjt/1xrtx3090m24-fineweb -- the first model in that post, trained on a roughly Chinchilla-optimal number of tokens (20x the number of parameters) from FineWeb.
- gpjt/1xrtx3090m24-fineweb-edu -- the second model, trained on the same number of tokens from FineWeb-Edu.
- gpjt/1xrtx3090m24-fineweb-edu-2x -- the third one, which is the gpjt/1xrtx3090m24-fineweb-edu model trained further on another roughly Chinchilla-optimal number of tokens from the same dataset.
Then, from the post where I trained on a bunch of different kinds of machines on Lambda Labs, four models (with two checkpoints from one of them):
- gpjt/8xa100m40 -- trained on an 8x A100, 40 GiB/GPU machine.
- gpjt/8xb200m160 -- trained on an 8x B200, 160 GiB/GPU machine.
- gpjt/8xh100m80-best -- trained on an 8x H100, 80 GiB/GPU machine. The best validation loss for this train was not in the last iteration, so this is the checkpoint with the best loss.
- gpjt/8xh100m80-latest -- the final checkpoint from the train above.
- gpjt/8xa100m80 -- trained on an 8x A100, 80 GiB/GPU machine.
You can see how they compare on my evals at the bottom of this post.
I wanted to make them all usable within the Hugging Face ecosystem -- that is, I didn't want to just dump a bunch of weights and code into repos there, but rather to have something that someone coming to them without much context could make sense of. Let's dig into that.
Writing an LLM from scratch, part 30 -- digging into the LLM-as-a-judge results
I'm still working on my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Last time around, I trained four base models, using the GPT-2 architecture from the book, on Lambda Labs machines. I was using two ways to compare them with each other, with three models that I'd trained locally, and with the original GPT-2 weights from OpenAI:
- A simple cross entropy loss over a fixed test set.
- The results for an instruction fine-tune test that's covered in the book.
Here were the results I got, sorted by the loss:
| | Test loss | IFT score |
|---|---|---|
| OpenAI weights: medium | 3.231 | 38.53 |
| OpenAI weights: small | 3.500 | 22.98 |
| Cloud FineWeb, 8x A100 40 GiB | 3.674 | 17.09 |
| Cloud FineWeb, 8x H100 80 GiB | 3.725 | 11.98 |
| Cloud FineWeb, 8x A100 80 GiB | 3.730 | 11.71 |
| Cloud FineWeb, 8x B200 160 GiB | 3.771 | 13.89 |
| Local FineWeb train | 3.944 | 16.01 |
| Local FineWeb-Edu extended train | 4.135 | 14.55 |
| Local FineWeb-Edu train | 4.167 | 16.86 |
Now, you'd expect there to be at least a loose correlation; the lower the loss, the higher the IFT score. But, while we can see a difference between the OpenAI weights and our own, within our own there doesn't seem to be a logical pattern.
I think that the problem is that the results from the GPT-5.1 LLM-as-a-judge are not consistent between models. That's not a complaint about the code or its original design, of course -- it was originally written as part of the LLM book as a way of doing a quick test on an instruction fine-tuned model that we'd spent the previous 238 pages writing -- just something that was a bit more efficient than reading hundreds of input/output pairs ourselves. It was never meant to be a tool to compare models in the way I'm using it now.
In this post I'll dig into why it doesn't work for this kind of thing, and see if that's something we can change.
Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud
I'm carrying on with my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Having proven that I could train a GPT-2 small scale base model from scratch on my RTX 3090 in 48 hours, I wanted to try training it on a multi-GPU machine on Lambda Labs. There are two benefits I see in doing that:
- I can learn what you need to change in a simple single-GPU training loop to make it multi-GPU.
- If I can get the training time for a full base model down from 48 hours to something more manageable (and hopefully not too expensive) -- then I can try a few experiments to see how I can improve the quality of the trained model. I have a bunch of ideas about why my own base model wasn't as good as the original OpenAI one, and it would be good to know which (if any) of them are right.
In addition, I wanted to see if anything unexpected dropped out of it; after all, there were four different sizes of machines that I wanted to try, so I'd be doing four from-scratch trains on the same dataset. Does the machine size affect the quality of the model in some way?
Here's what happened. As with the last post, this is a set of tidied-up lab notes, so you can see the full journey. There's a lot to it! I was considering splitting it into multiple posts -- "writing the code", "building the datasets", "running the trains" -- but they're interleaved. Each train taught me something about how to structure the code to make it easier to use, so the code kept changing.
So I think it's worth documenting the process as it really was. If at some point I want to write a how-to document on porting single-GPU code to multi-GPU, I'll be able to mine this for resources, and in the meantime, hopefully this will be of use to readers -- even if it's just at the level of "I got this error message, how do I fix it?"
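For a taste of what the porting involves: the core DDP change is to join a process group and wrap the model, after which `backward()` averages gradients across ranks automatically. A minimal sketch -- in a real run `torchrun` sets the environment variables and launches one process per GPU, and you'd use the `nccl` backend; here I fake a one-process group with the CPU-friendly `gloo` backend purely for illustration:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_ddp(model, rank=0, world_size=1):
    # Join the process group, then wrap the model. With the wrap in
    # place, the training loop itself barely changes: backward() now
    # also all-reduces the gradients across the processes.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    return DDP(model)
```

The other big pieces -- a `DistributedSampler` so each rank sees a different shard of the data, and doing logging/checkpointing only on rank 0 -- are in the post itself.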
Anyway, once again I don't want to bury the lede, so: after spending US$215.16 on various trains on various servers, I was able to find that a reasonably cheap instance on Lambda Labs, with 8x A100 GPUs, each of which has 40 GiB of VRAM, is the sweet spot for this particular 163M-parameter, ~Chinchilla-optimal single-epoch run. It can train the model in less than four hours, its GPUs happen to be the right size for the batches that minimise loss (more on that later), and the whole train costs about US$35, excluding validation.
If you'd like to read the gory details of what I did, then read on -- but if you prefer, you can jump straight to the results.
Writing an LLM from scratch, part 28 -- training a base model from scratch on an RTX 3090
Having worked through the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", I wanted to try an experiment: is it possible to train a base model of my own, on my own hardware?
The book shows you how to train your LLM, does a basic training run on a small dataset, and then we switch to downloading the "pre-cooked" weights from OpenAI. That makes sense given that not every reader will have access to enough hardware to really train from scratch. And right back at the start of this series, I did some naive scaling of numbers I'd got when fine-tuning LLMs and came to the conclusion that it would be impossible in a reasonable time.
But the speed I got with my RTX 3090 on the book's small training run made me think that perhaps -- just perhaps! -- it might actually be possible to train a model of this size -- about 163M parameters -- on my own hardware. Not, perhaps, on a small laptop, but at least on a reasonably high-end "gaming" PC.
Additionally, Andrej Karpathy recently announced nanochat,
"the best ChatGPT that $100 can buy". He mentions on the main page that he's trained
a model called d32, with 32 Transformer layers, which has 1.9B parameters, for about $800.
His smaller 20-layer d20 model, with 561M parameters, he says should be trainable
in about four hours on an 8x H100 GPU node, which costs about $24/hour -- hence the
$100 total price.
What's even more interesting about nanochat is that it's built with PyTorch; initially
I'd got the impression that it was based on his pure C/CUDA llm.c,
which I would imagine would give a huge speedup. But no -- he's using the same stack
as I have been in this series!
Karpathy's models are both larger than 163M parameters, so it definitely sounded like this might be doable. Obviously, I'm nowhere near as experienced an AI developer, and he's using a larger machine (8 GPUs and each of them has > 3x more VRAM than mine), but he's also including the time to train a tokeniser and instruction fine-tune into that four hours -- and his smaller model is more than three times larger than mine. So that should all help.
This post is a little less structured than the others in my LLM from scratch series, as it's essentially a tidied version of the notes I kept as I worked through the project.
But so as not to bury the lede: using the Hugging Face FineWeb-series datasets, I was able to train a GPT-2 small sized base model to a level where it was almost as good as the original in just over 48 hours on my own hardware! Base models: not just for the big AI labs.
Here's the full story.
Retro Language Models: Rebuilding Karpathy’s RNN in PyTorch
I recently posted about Andrej Karpathy's classic 2015 essay, "The Unreasonable Effectiveness of Recurrent Neural Networks". In that post, I went through what the essay said, and gave a few hints on how the RNNs he was working with at the time differ from the Transformers-based LLMs I've been learning about.
This post is a bit more hands-on. To understand how these RNNs really work, it's
best to write some actual code, so I've implemented a version of Karpathy's
original code using PyTorch's built-in
LSTM
class -- here's the repo. I've tried
to stay as close as possible to the original, but I believe
it's reasonably PyTorch-native in style too. (Which is maybe not all that surprising,
given that he wrote it using Torch, the Lua-based predecessor to PyTorch.)
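To give a flavour of the shape of the thing, a character-level model built on the built-in LSTM class looks roughly like this -- a minimal sketch of the architecture, not the repo's actual code, with made-up default sizes:

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Embed characters, run a stacked LSTM, project back to vocab logits."""
    def __init__(self, vocab_size, hidden_size=128, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, state=None):
        # x: (B, T) character ids; state carries the hidden/cell tensors
        # between calls, which is how you sample one character at a time.
        h, state = self.lstm(self.embed(x), state)
        return self.head(h), state  # logits: (B, T, vocab_size)
```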
In this
post, I'll walk through how it works, as of commit daab2e1. In follow-up posts, I'll dig in further,
actually implementing my own RNNs rather than relying on PyTorch's.
All set?