Writing an LLM from scratch, part 28 -- training a base model from scratch on an RTX 3090
Having worked through the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", I wanted to try an experiment: is it possible to train a base model of my own, on my own hardware?
The book shows you how to build your LLM, does a basic training run on a small dataset, and then switches to downloading the "pre-cooked" weights from OpenAI. That makes sense given that not every reader will have access to enough hardware to really train from scratch. And right back at the start of this series, I did some naive scaling of numbers I'd got when fine-tuning LLMs, and came to the conclusion that training from scratch would be impossible in any reasonable time.
But the speed I got with my RTX 3090 on the book's small training run made me think that perhaps -- just perhaps! -- it might actually be possible to train a model of this size -- about 163M parameters -- on my own hardware. Not, perhaps, on a small laptop, but at least on a reasonably high-end "gaming" PC.
Additionally, Andrej Karpathy recently announced nanochat,
"the best ChatGPT that $100 can buy". He mentions on the main page that he's trained
a model called d32, with 32 Transformer layers, which has 1.9B parameters, for about $800.
His smaller 20-layer d20 model, with 561M parameters, he says should be trainable
in about four hours on an 8x H100 GPU node, which costs about $24/hour -- hence the
$100 total price.
What's even more interesting about nanochat is that it's built with PyTorch; initially
I'd got the impression that it was based on his pure C/CUDA llm.c,
which I would imagine would give a huge speedup. But no -- he's using the same stack
as I have been in this series!
Karpathy's models are both larger than 163M parameters, so it definitely sounded like this might be doable. Obviously, I'm nowhere near as experienced an AI developer as he is, and he's using a larger machine (8 GPUs, each with more than 3x the VRAM of mine), but he's also including the time to train a tokeniser and to instruction fine-tune in that four hours -- and his smaller model is more than three times the size of mine. So that should all help.
This post is a little less structured than the others in my LLM from scratch series, as it's essentially a tidied version of the notes I kept as I worked through the project.
But so as not to bury the lede: using the Hugging Face FineWeb-series datasets, I was able to train a GPT-2 small sized base model to a level where it was almost as good as the original in just over 48 hours on my own hardware! Base models: not just for the big AI labs.
Here's the full story.
Writing an LLM from scratch, part 27 -- what's left, and what's next?
On 22 December 2024, I wrote:
Over the Christmas break (and probably beyond) I'm planning to work through Sebastian Raschka's book "Build a Large Language Model (from Scratch)". I'm expecting to get through a chapter or less a day, in order to give things time to percolate properly. Each day, or perhaps each chapter, I'll post here about anything I find particularly interesting.
More than ten months and 26 blog posts later, I've reached the end of the main body of the book -- there's just the appendices to go. Even allowing for the hedging, my optimism was adorable.
I don't want to put anyone else off the book by saying that, though! I expect most people will get through it much faster. I made a deliberate decision at the start to write up everything I learned as I worked through it, and that, I think, has helped me solidify things in my mind much better than I would have done if I'd only been reading it and doing the exercises. But on the other hand, writing things up does take a lot of time, much more than the actual learning does. It's worth it for me, but probably isn't for everyone.
So, what next? I've finished the main body of the book, and built up a decent backlog as I did so. What do I need to do before I can treat my "LLM from scratch" journey as done? And what other ideas have come up while I worked through it that might be good bases for future, similar series?
Writing an LLM from scratch, part 26 -- evaluating the fine-tuned model
This post is on the second half of chapter 7 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". In the last post I covered the first half of the chapter, on instruction fine-tuning; this time round, we evaluate our model -- particularly interestingly, we try using another, smarter, model to judge how good its responses are.
Once again, Raschka's explanation in this section is very clear, and there's not that much that was conceptually new to me, so I don't have that many notes -- in fact, this post is probably the shortest one in my series so far!
Writing an LLM from scratch, part 25 -- instruction fine-tuning
This post is on the first part of chapter 7 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", which covers instruction fine-tuning.
In my last post, I went through a technique which I'd found could sometimes make it possible to turn non-fine-tuned models into reasonable chatbots; perhaps unsurprisingly, the GPT-2 model isn't powerful enough to work that way.
So, with that proven, it was time to do the work :-) This post covers the first half of the chapter, where we actually do the fine-tuning; I'll post later about the second part, where we start evaluating the model that we get.
Just as with the last chapter, what we're doing here is essentially plugging together the various things we've built so far, and Raschka's explanation is very clear, so I don't have that much in the way of notes -- but here are the bits that made me pause.
Writing an LLM from scratch, part 24 -- the transcript hack
Chapter 7 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)" explains how we fine-tune our LLM to follow instructions -- essentially turning a model that can do next-token completion for text generation into something we can use for a chatbot.
Back when I first started looking into LLMs, I used a setup that didn't require that, and got surprisingly good results, at least with later OpenAI models.
The trick was to present the text as something that made sense in the context of next-token prediction. Instead of just throwing something like this at the LLM:
User: Provide a synonym for 'bright'
Bot:
...you would instead prepare it with an introductory paragraph, like this:
This is a transcript of a conversation between a helpful bot, 'Bot', and a human,
'User'. The bot is very intelligent and always answers the human's questions
with a useful reply.
User: Provide a synonym for 'bright'
Bot:
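The wrapping above can be sketched as a small helper -- the function name and exact wording here are mine, not from my original setup:

```python
def transcript_prompt(question):
    # Hypothetical helper: wrap the user's question in a "transcript"
    # preamble, so that plain next-token prediction naturally continues
    # the text as the bot's reply.
    preamble = (
        "This is a transcript of a conversation between a helpful bot, "
        "'Bot', and a human, 'User'. The bot is very intelligent and "
        "always answers the human's questions with a useful reply.\n\n"
    )
    return f"{preamble}User: {question}\nBot:"

print(transcript_prompt("Provide a synonym for 'bright'"))
```

You'd then feed the resulting string to the model and generate tokens until it starts a new "User:" line (or hits some length limit).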
Earlier OpenAI models couldn't do this when I accessed them through the API, but later ones could.
How does our GPT-2 model stack up with this kind of thing -- and for comparison, how about a newer, more sophisticated base (as in, not instruction fine-tuned) model?
Retro Language Models: Rebuilding Karpathy’s RNN in PyTorch
I recently posted about Andrej Karpathy's classic 2015 essay, "The Unreasonable Effectiveness of Recurrent Neural Networks". In that post, I went through what the essay said, and gave a few hints on how the RNNs he was working with at the time differ from the Transformers-based LLMs I've been learning about.
This post is a bit more hands-on. To understand how these RNNs really work, it's
best to write some actual code, so I've implemented a version of Karpathy's
original code using PyTorch's built-in
LSTM
class -- here's the repo. I've tried
to stay as close as possible to the original, but I believe
it's reasonably PyTorch-native in style too. (Which is maybe not all that surprising,
given that he wrote it using Torch, the Lua-based predecessor to PyTorch.)
In this
post, I'll walk through how it works, as of commit daab2e1. In follow-up posts, I'll dig in further,
actually implementing my own RNNs rather than relying on PyTorch's.
All set?
Writing an LLM from scratch, part 23 -- fine-tuning for classification
In chapter 5 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", we finally trained our LLM (having learned essential aspects like cross entropy loss and perplexity along the way). This is amazing -- we've gone from essentially zero to a full pretrained model. But pretrained models aren't all that useful in and of themselves -- we normally do further training to specialise them on a particular task, like being a chatbot.
Chapter 6 explains a -- to me -- slightly surprising thing that we can do with this kind of fine-tuning. We take our LLM and convert it into a classifier that assesses whether or not a given piece of text is spam. That's simple enough that I can cover everything in one post -- so here it is :-)
Writing an LLM from scratch, part 22 -- finally training our LLM!
This post wraps up my notes on chapter 5 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Cross entropy loss and perplexity were the hard bits for me in this chapter -- the remaining 28 pages were more a case of plugging bits together and running the code, to see what happens.
The shortness of this post almost feels like a damp squib. After writing so much in the last 22 posts, there's really not all that much to say -- but that hides the fact that this part of the book is probably the most exciting to work through. All these pieces developed with such care, and with so much to learn, over the preceding 140 pages, with not all that much to show -- and suddenly, we have a codebase that we can let rip on a training set -- and our model starts talking to us!
I trained my model on the sample dataset that we use in the book, the 20,000 characters of "The Verdict" by Edith Wharton, and then ran it to predict next tokens after "Every effort moves you". I got:
Every effort moves you in," was down surprise a was one of lo "I quote.
Not bad for a model trained on such a small amount of data (in just over ten seconds).
The next step was to download the weights for the original 124M-parameter version of GPT-2 from OpenAI, following the instructions in the book, and then to load them into my model. With those weights, against the same prompt, I got this:
Every effort moves you as far as the hand can go until the end of your turn unless something interrupts your control flow. As you may observe I
That's amazingly cool. Coherent enough that you could believe it's part of the instructions for a game.
Now, I won't go through the remainder of the chapter in detail -- as I said, it's essentially just plugging together the various bits that we've gone through so far, even though the results are brilliant. In this post I'm just going to make a few brief notes on the things that I found interesting.
Revisiting Karpathy’s 'The Unreasonable Effectiveness of Recurrent Neural Networks'
Being on a sabbatical means having a bit more time on my hands than I'm used to, and I wanted to broaden my horizons a little. I've been learning how current LLMs work by going through Sebastian Raschka's book "Build a Large Language Model (from Scratch)", but how about the history -- where did this design come from? What did people do before Transformers?
Back when it was published in 2015, Andrej Karpathy's blog post "The Unreasonable Effectiveness of Recurrent Neural Networks" went viral.
It's easy to see why. While interesting stuff had been coming out of AI labs for some time, for those of us in the broader tech community, it still felt like we were in an AI winter. Karpathy's post showed that things were in fact moving pretty fast -- he showed that he could train recurrent neural networks (RNNs) on text, and get them to generate surprisingly readable results.
For example, he trained one on the complete works of Shakespeare, and got output like this:
KING LEAR:
O, if you were a feeble sight, the courtesy of your law,
Your sight and several breath, will wear the gods
With his heads, and my hands are wonder'd at the deeds,
So drop upon your lordship's head, and your opinion
Shall be against your honour.
As he says, you could almost (if not quite) mistake it for a real quote! And this is from a network that had to learn everything from scratch -- no tokenising, just bytes. It went from generating random junk like this:
bo.+\x94G5YFM,}Hx'E{*T]v>>,2pw\nRb/f{a(3n.\xe2K5OGc
...to learning that there was such a thing as words, to learning English words, to learning the rules of layout required for a play.
This was amazing enough that it even hit the mainstream. A meme template you still see everywhere is "I forced a bot to watch 10,000 episodes of $TV_SHOW and here's what it came up with" -- followed by some crazy parody of the TV show in question. (A personal favourite is this one by Keaton Patti for "Queer Eye".)
The source of that meme template was actually a real thing -- a developer called Andy Herd trained an RNN on scripts from "Friends", and generated an almost-coherent but delightfully quirky script fragment. Sadly I can't find it on the Internet any more (if anyone has a copy, please share!) -- Herd is no longer on X/Twitter, and there seems to be no trace of the fragment, just news stories about it. But that was in early 2016, just after Karpathy's blog post. People saw it, thought it was funny, and (slightly ironically) discovered that humans could do better.
So, this was a post that showed techies in general how impressive the results you could get from then-recent AI were, and that had a viral impact on Internet culture. It came out in 2015, two years before "Attention Is All You Need", which introduced the Transformers architecture that powers essentially all mainstream AI these days. (It's certainly worth mentioning that the underlying idea wasn't exactly unknown, though -- near the end of the post, Karpathy explicitly highlights that the "concept of attention is the most interesting recent architectural innovation in neural networks".)
I didn't have time to go through it and try to play with the code when it came out, but now that I'm on sabbatical, it's the perfect time to fix that! I've implemented my own version using PyTorch, and you can clone and run it. Some sample output after training on the Project Gutenberg Complete Works of Shakespeare:
SOLANIO.
Not anything
With her own calling bids me, I look down,
That we attend for letters—are a sovereign,
And so, that love have so as yours; you rogue.
We are hax on me but the way to stop.
[_Stabs John of London. But fearful, Mercutio as the Dromio sleeps
fallen._]
ANTONIO.
Yes, then, it stands, and is the love in thy life.
There's a README.md in the repo with full instructions about how to use it --
I wrote the code myself (with some AI guidance on how to use the APIs), but Claude
was invaluable for taking a look at the codebase and generating much better and
more useful instructions on how to use it than I would have done :-)
This code is actually "cheating" a bit, because Karpathy's original repo
has a full implementation of several kinds of RNNs (in Lua, which is what the
original Torch framework was based on), while I'm using PyTorch's
built-in LSTM class, which implements a Long Short-Term Memory network -- the specific
kind of RNN used to generate the samples in the post (though not in the code snippets,
which are from "vanilla" RNNs).
Over the next few posts in this series (which I'll interleave with "LLM from scratch" ones), I'll cover:
- A writeup of the PyTorch code as it currently is.
- Implementation of a regular RNN in PyTorch, showing why it's not as good as an LSTM.
- Implementation of an LSTM in PyTorch, which (hopefully) will work as well as the built-in one.
However, in this first post I want to talk about the original article and highlight how the techniques differ from what I've seen while learning about modern LLMs.
If you're interested (and haven't already zoomed off to start generating your own version of "War and Peace" using that repo), then read on!
Writing an LLM from scratch, part 21 -- perplexed by perplexity
I'm continuing through chapter 5 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", which covers training the LLM. Last time I wrote about cross entropy loss. Before moving on to the next section, I wanted to post about something that the book only covers briefly in a sidebar: perplexity.
Back in May, I thought I had understood it:
Just as I was finishing this off, I found myself thinking that logits were interesting because you could take some measure of how certain the LLM was about the next token from them. For example, if all of the logits were the same number, it would mean that the LLM has absolutely no idea what token might come back -- it's giving an equal chance to all of them. If all of them were zero apart from one, which was a positive number, then it would be 100% sure about what the next one was going to be. If you could represent that in a single number -- let's say, 0 means that it has only one candidate and 1 means that it hasn't even the slightest idea what is most likely -- then it would be an interesting measure of how certain the LLM was about its choice.
Turns out (unsurprisingly) that I'd re-invented something that's been around for a long time. That number is called perplexity, and I imagine that's why the largest AI-enabled web search engine borrowed that name.
I'd misunderstood. From the post on cross entropy, you can see that the measure that I was talking about in May was something more like the simple Shannon entropy of the LLM's output probabilities. That's a useful number, but perplexity is something different.
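To make the distinction concrete, here's a minimal sketch (the toy numbers are mine): Shannon entropy is a property of the model's own predicted distribution, while cross entropy compares that distribution against the token that actually came next.

```python
import math

# A model's predicted distribution over a tiny three-token vocabulary.
probs = [0.7, 0.2, 0.1]

# Shannon entropy: how spread out the model's own prediction is --
# low when the model is confident, regardless of whether it's right.
shannon = -sum(p * math.log(p) for p in probs)

# Cross entropy loss: -log of the probability the model assigned to
# the token that actually occurred (say, token 1 here, with p = 0.2).
cross_entropy = -math.log(probs[1])

print(shannon)        # ≈ 0.802 -- the model is fairly confident...
print(cross_entropy)  # ≈ 1.609 -- ...but confident in the wrong token
```

So a model can have low entropy (very sure of itself) and still suffer a high cross entropy loss if what it was sure of turns out to be wrong.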
Its actual calculation is really simple -- you just raise the base of the logarithms you were using in your cross entropy loss to the power of that loss. So if you were using the natural logarithm to work out your loss L, perplexity would be e^L; if you were using the base-2 logarithm, then it would be 2^L, and so on.
PyTorch uses the natural logarithm, so you'd use the matching torch.exp function.
Raschka says that perplexity "measures how well the probability distribution predicted by the model matches the actual distribution of the words in the dataset", and that it "is often considered more interpretable than the raw [cross entropy] loss value because it signifies the effective vocabulary size about which the model is uncertain at each step."
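That "effective vocabulary size" interpretation is easy to check with a toy calculation in plain Python (the helper function here is my own sketch, not from the book):

```python
import math

def perplexity(probs_for_correct_tokens):
    # Average cross entropy loss (natural logs) over the correct
    # tokens, then exponentiate: perplexity = e^loss.
    loss = -sum(math.log(p) for p in probs_for_correct_tokens)
    loss /= len(probs_for_correct_tokens)
    return math.exp(loss)

# A model that is uniformly unsure over GPT-2's 50,257-token
# vocabulary has perplexity equal to the vocabulary size...
vocab_size = 50257
print(perplexity([1 / vocab_size] * 3))  # ≈ 50257

# ...while one that puts probability 0.5 on each correct token is,
# in effect, choosing between just two tokens at every step.
print(perplexity([0.5, 0.5, 0.5]))  # ≈ 2.0
```

In real PyTorch code you'd of course just call `torch.exp` on the loss tensor, as above, rather than computing it by hand.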
This felt like something I would like to dig into a bit.