Writing an LLM from scratch, part 25 -- instruction fine-tuning

Posted on 29 October 2025 in AI, LLM from scratch, TIL deep dives |

This post is on the first part of chapter 7 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", which covers instruction fine-tuning.

In my last post, I went through a technique which I'd found could sometimes make it possible to turn non-fine-tuned models into reasonable chatbots; perhaps unsurprisingly, the GPT-2 model isn't powerful enough to work that way.

So, with that proven, it was time to do the work :-) This post covers the first half of the chapter, where we actually do the fine-tuning; I'll post later about the second part, where we start evaluating the model that we get.

Just as with the last chapter, what we're doing here is essentially plugging together the various things we've built so far, and Raschka's explanation is very clear, so I don't have that much in the way of notes -- but here are the bits that made me pause.

[ Read more ]


Writing an LLM from scratch, part 24 -- the transcript hack

Posted on 28 October 2025 in AI, LLM from scratch, TIL deep dives |

Chapter 7 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)" explains how we fine-tune our LLM to follow instructions -- essentially turning a model that can do next-token completion for text generation into something we can use for a chatbot.

Back when I first started looking into LLMs, I used a setup that didn't require that, and got surprisingly good results, at least with later OpenAI models.

The trick was to present the text as something that made sense in the context of next-token prediction. Instead of just throwing something like this at the LLM:

User: Provide a synonym for 'bright'

Bot:

...you would instead prepare it with an introductory paragraph, like this:

This is a transcript of a conversation between a helpful bot, 'Bot', and a human,
'User'.  The bot is very intelligent and always answers the human's questions
with a useful reply.

User: Provide a synonym for 'bright'

Bot:

Earlier OpenAI models couldn't do this when I accessed them through the API, but later ones could.

How does our GPT-2 model stack up with this kind of thing -- and for comparison, how about a newer, more sophisticated base (as in, not instruction fine-tuned) model?

[ Read more ]


A classifier using Qwen3

Posted on 24 October 2025 in AI |

I wanted to build on what I'd learned in chapter 6 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". That chapter takes the LLM that we've built, and then turns it into a spam/ham classifier. I wanted to see how easy it would be to take another LLM -- say, one from Hugging Face -- and do the same "decapitation" trick on it: removing the output head and replacing it with a small linear layer that outputs class logits.

Turns out it was really easy! I used Qwen/Qwen3-0.6B-Base, and you can see the code here.

The only real difference between our normal PyTorch LLMs and one based on Hugging Face is that the return value when you call your model is a ModelOutput object with more to it than just the output from the model itself. But it has a logits field to get the raw output, and with that one update, the code works largely unchanged. The only other change I needed to make was to switch the padding token from the fixed 50256 that the code from the book uses to tokenizer.pad_token_id.
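The "decapitation" itself can be sketched in a few lines. This is a toy stand-in rather than the real Qwen3 code -- TinyLM and its attribute names are illustrative, not the actual Hugging Face module layout -- but it shows the core move: swap the vocab-sized output head for a small classification head and read logits off the final token position, as in the book.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Stand-in for a pretrained causal LM (names are illustrative)."""
    def __init__(self, vocab_size=100, emb_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.out_head = nn.Linear(emb_dim, vocab_size)  # the head we'll remove

    def forward(self, token_ids):
        return self.out_head(self.embed(token_ids))

model = TinyLM()

# "Decapitation": replace the vocab-sized head with a 2-class (spam/ham) one.
model.out_head = nn.Linear(32, 2)

logits = model(torch.tensor([[1, 2, 3]]))  # (batch, seq_len, num_classes)
cls_logits = logits[:, -1, :]              # classify from the last token
print(cls_logits.shape)                    # torch.Size([1, 2])
```

With a real Hugging Face model, the forward pass would instead return a ModelOutput, so you'd take `outputs.logits` before indexing the last position.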

ChatGPT wrote a nice, detailed README for it, so hopefully it's a useful standalone artifact.


Retro Language Models: Rebuilding Karpathy’s RNN in PyTorch

Posted on 24 October 2025 in AI, Retro Language Models, Python, TIL deep dives |

I recently posted about Andrej Karpathy's classic 2015 essay, "The Unreasonable Effectiveness of Recurrent Neural Networks". In that post, I went through what the essay said, and gave a few hints on how the RNNs he was working with at the time differ from the Transformers-based LLMs I've been learning about.

This post is a bit more hands-on. To understand how these RNNs really work, it's best to write some actual code, so I've implemented a version of Karpathy's original code using PyTorch's built-in LSTM class -- here's the repo. I've tried to stay as close as possible to the original, but I believe it's reasonably PyTorch-native in style too. (Which is maybe not all that surprising, given that he wrote it using Torch, the Lua-based predecessor to PyTorch.)

In this post, I'll walk through how it works, as of commit daab2e1. In follow-up posts, I'll dig in further, actually implementing my own RNNs rather than relying on PyTorch's.
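The skeleton of that kind of model is small enough to sketch here. This is a minimal illustrative version, not the actual code from my repo: a character-level language model that embeds input characters, runs them through PyTorch's built-in LSTM, and projects the hidden states back to character logits.

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Minimal char-level language model in the spirit of char-rnn (illustrative)."""
    def __init__(self, vocab_size, emb_dim=64, hidden_size=128, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_size, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, state=None):
        # state carries the (hidden, cell) tensors between calls, which is
        # what lets an RNN generate one character at a time.
        out, state = self.lstm(self.embed(x), state)
        return self.head(out), state

model = CharLSTM(vocab_size=65)  # ~65 distinct chars in the Shakespeare corpus
logits, state = model(torch.zeros(1, 10, dtype=torch.long))
print(logits.shape)  # torch.Size([1, 10, 65])
```

Training is then the familiar next-token setup: cross entropy between each position's logits and the following character.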

All set?

[ Read more ]


Writing an LLM from scratch, part 23 -- fine-tuning for classification

Posted on 22 October 2025 in AI, LLM from scratch, TIL deep dives |

In chapter 5 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", we finally trained our LLM (having learned essential aspects like cross entropy loss and perplexity along the way). This is amazing -- we've gone from essentially zero to a full pretrained model. But pretrained models aren't all that useful in and of themselves -- we normally do further training to specialise them on a particular task, like being a chatbot.

Chapter 6 explains a -- to me -- slightly surprising thing that we can do with this kind of fine-tuning. We take our LLM and convert it into a classifier that assesses whether or not a given piece of text is spam. That's simple enough that I can cover everything in one post -- so here it is :-)

[ Read more ]


Writing an LLM from scratch, part 22 -- finally training our LLM!

Posted on 15 October 2025 in AI, LLM from scratch, TIL deep dives |

This post wraps up my notes on chapter 5 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Understanding cross entropy loss and perplexity were the hard bits for me in this chapter -- the remaining 28 pages were more a case of plugging bits together and running the code, to see what happens.

The shortness of this post almost feels like a damp squib. After writing so much in the last 22 posts, there's really not all that much to say -- but that hides the fact that this part of the book is probably the most exciting to work through. All these pieces developed with such care, and with so much to learn, over the preceding 140 pages, with not all that much to show -- and suddenly, we have a codebase that we can let rip on a training set -- and our model starts talking to us!

I trained my model on the sample dataset that we use in the book, the 20,000 characters of "The Verdict" by Edith Wharton, and then ran it to predict next tokens after "Every effort moves you". I got:

Every effort moves you in," was down surprise a was one of lo "I quote.

Not bad for a model trained on such a small amount of data (in just over ten seconds).

The next step was to download the weights for the original 124M-parameter version of GPT-2 from OpenAI, following the instructions in the book, and then to load them into my model. With those weights, against the same prompt, I got this:

Every effort moves you as far as the hand can go until the end of your turn unless something interrupts your control flow. As you may observe I

That's amazingly cool. Coherent enough that you could believe it's part of the instructions for a game.

Now, I won't go through the remainder of the chapter in detail -- as I said, it's essentially just plugging together the various bits that we've gone through so far, even though the results are brilliant. In this post I'm just going to make a few brief notes on the things that I found interesting.

[ Read more ]


Revisiting Karpathy’s 'The Unreasonable Effectiveness of Recurrent Neural Networks'

Posted on 11 October 2025 in AI, Retro Language Models, TIL deep dives |

Being on a sabbatical means having a bit more time on my hands than I'm used to, and I wanted to broaden my horizons a little. I've been learning how current LLMs work by going through Sebastian Raschka's book "Build a Large Language Model (from Scratch)", but how about the history -- where did this design come from? What did people do before Transformers?

Back when it was published in 2015, Andrej Karpathy's blog post "The Unreasonable Effectiveness of Recurrent Neural Networks" went viral.

It's easy to see why. While interesting stuff had been coming out of AI labs for some time, for those of us in the broader tech community, it still felt like we were in an AI winter. Karpathy's post showed that things were in fact moving pretty fast -- he showed that he could train recurrent neural networks (RNNs) on text, and get them to generate surprisingly readable results.

For example, he trained one on the complete works of Shakespeare, and got output like this:

KING LEAR:
O, if you were a feeble sight, the courtesy of your law,
Your sight and several breath, will wear the gods
With his heads, and my hands are wonder'd at the deeds,
So drop upon your lordship's head, and your opinion
Shall be against your honour.

As he says, you could almost (if not quite) mistake it for a real quote! And this is from a network that had to learn everything from scratch -- no tokenising, just bytes. It went from generating random junk like this:

bo.+\x94G5YFM,}Hx'E{*T]v>>,2pw\nRb/f{a(3n.\xe2K5OGc

...to learning that there was such a thing as words, to learning English words, to learning the rules of layout required for a play.

This was amazing enough that it even hit the mainstream. A meme template you still see everywhere is "I forced a bot to watch 10,000 episodes of $TV_SHOW and here's what it came up with" -- followed by some crazy parody of the TV show in question. (A personal favourite is this one by Keaton Patti for "Queer Eye".)

The source of that meme template was actually a real thing -- a developer called Andy Herd trained an RNN on scripts from "Friends", and generated an almost-coherent but delightfully quirky script fragment. Sadly I can't find it on the Internet any more (if anyone has a copy, please share!) -- Herd is no longer on X/Twitter, and there seems to be no trace of the fragment, just news stories about it. But that was in early 2016, just after Karpathy's blog post. People saw it, thought it was funny, and (slightly ironically) discovered that humans could do better.

So, this was a post that showed techies in general how impressive the results you could get from then-recent AI were, and that had a viral impact on Internet culture. It came out in 2015, two years before "Attention Is All You Need", which introduced the Transformers architecture that powers essentially all mainstream AI these days. (It's certainly worth mentioning that the underlying idea wasn't exactly unknown, though -- near the end of the post, Karpathy explicitly highlights that the "concept of attention is the most interesting recent architectural innovation in neural networks".)

I didn't have time to go through it and try to play with the code when it came out, but now that I'm on sabbatical, it's the perfect time to fix that! I've implemented my own version using PyTorch, and you can clone and run it. Some sample output after training on the Project Gutenberg Complete Works of Shakespeare:

SOLANIO.
Not anything
With her own calling bids me, I look down,
That we attend for letters—are a sovereign,
And so, that love have so as yours; you rogue.
We are hax on me but the way to stop.

[_Stabs John of London. But fearful, Mercutio as the Dromio sleeps
fallen._]

ANTONIO.
Yes, then, it stands, and is the love in thy life.

There's a README.md in the repo with full instructions about how to use it -- I wrote the code myself (with some AI guidance on how to use the APIs), but Claude was invaluable for taking a look at the codebase and generating much better and more useful instructions on how to use it than I would have done :-)

This code is actually "cheating" a bit: Karpathy's original repo has a full implementation of several kinds of RNNs (in Lua, the language the original Torch framework was built on), while I'm using PyTorch's built-in LSTM class, which implements a Long Short-Term Memory network -- the specific kind of RNN used to generate the samples in the post (though not in the code snippets, which are from "vanilla" RNNs).

Over the next few posts in this series (which I'll interleave with "LLM from scratch" ones), I'll cover:

  1. A writeup of the PyTorch code as it currently is.
  2. Implementation of a regular RNN in PyTorch, showing why it's not as good as an LSTM.
  3. Implementation of an LSTM in PyTorch, which (hopefully) will work as well as the built-in one.

However, in this first post I want to talk about the original article and highlight how the techniques differ from what I've seen while learning about modern LLMs.

If you're interested (and haven't already zoomed off to start generating your own version of "War and Peace" using that repo), then read on!

[ Read more ]


Writing an LLM from scratch, part 21 -- perplexed by perplexity

Posted on 7 October 2025 in AI, LLM from scratch, TIL deep dives |

I'm continuing through chapter 5 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", which covers training the LLM. Last time I wrote about cross entropy loss. Before moving on to the next section, I wanted to post about something that the book only covers briefly in a sidebar: perplexity.

Back in May, I thought I had understood it:

Just as I was finishing this off, I found myself thinking that logits were interesting because you could take some measure of how certain the LLM was about the next token from them. For example, if all of the logits were the same number, it would mean that the LLM has absolutely no idea what token might come back -- it's giving an equal chance to all of them. If all of them were zero apart from one, which was a positive number, then it would be 100% sure about what the next one was going to be. If you could represent that in a single number -- let's say, 0 means that it has only one candidate and 1 means that it hasn't even the slightest idea what is most likely -- then it would be an interesting measure of how certain the LLM was about its choice.

Turns out (unsurprisingly) that I'd re-invented something that's been around for a long time. That number is called perplexity, and I imagine that's why the largest AI-enabled web search engine borrowed that name.

I'd misunderstood. From the post on cross entropy, you can see that the measure that I was talking about in May was something more like the simple Shannon entropy of the LLM's output probabilities. That's a useful number, but perplexity is something different.
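To make that distinction concrete, here's what I was actually describing back in May -- the Shannon entropy of a single output distribution (a hypothetical three-token vocabulary, just for illustration). It's zero when the model is certain and maximal when all tokens are equally likely, which matches my "0 to 1"-style certainty measure much more closely than perplexity does.

```python
import torch

# Shannon entropy (in nats) of a model's next-token distribution.
# A one-hot distribution gives entropy 0; a uniform one over V tokens
# gives the maximum, ln(V).
logits = torch.tensor([2.0, 0.5, 0.1])   # hypothetical raw model outputs
probs = torch.softmax(logits, dim=0)
entropy = -(probs * probs.log()).sum()
print(entropy)  # somewhere between 0 and ln(3)
```

Crucially, this only looks at the model's own output distribution -- it says nothing about whether the model put high probability on the *correct* next token, which is what cross entropy loss (and hence perplexity) measures.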

Its actual calculation is really simple -- you just raise the base of the logarithms you were using in your cross entropy loss to the power of that loss. So if you were using the natural logarithm to work out your loss L, perplexity would be e^L; if you were using the base-2 logarithm log₂, it would be 2^L, and so on. PyTorch uses the natural logarithm, so you'd use the matching torch.exp function.
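In code, that's a one-liner on top of the loss. Here's a toy sketch (the logits and targets are made-up numbers, just to have something to compute), plus a sanity check of the "effective vocabulary size" reading: a model that's completely uniform over V tokens has perplexity V.

```python
import torch
import torch.nn.functional as F

# Toy next-token predictions: (batch, vocab) logits and the true token ids.
logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 3.0, 0.3]])
targets = torch.tensor([0, 1])

# PyTorch's cross entropy uses the natural log, so perplexity = e^loss.
loss = F.cross_entropy(logits, targets)
perplexity = torch.exp(loss)
print(perplexity)

# Sanity check: uniform logits over a 3-token vocab -> perplexity ≈ 3.
uniform = torch.zeros(1, 3)
print(torch.exp(F.cross_entropy(uniform, torch.tensor([0]))))  # ≈ 3.0
```

Perplexity is always at least 1 (a perfect model, putting probability 1 on every correct token, has loss 0 and so perplexity e^0 = 1).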

Raschka says that perplexity "measures how well the probability distribution predicted by the model matches the actual distribution of the words in the dataset", and that it "is often considered more interpretable than the raw [cross entropy] loss value because it signifies the effective vocabulary size about which the model is uncertain at each step."

This felt like something I would like to dig into a bit.

[ Read more ]


Writing an LLM from scratch, part 20 -- starting training, and cross entropy loss

Posted on 2 October 2025 in AI, LLM from scratch, TIL deep dives |

Chapter 5 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)" explains how to train the LLM. There are a number of things in there that required a bit of thought, so I'll post about each of them in turn.

The chapter starts off easily, with a few bits of code to generate some sample text. Because we have a call to torch.manual_seed at the start to make the random number generator deterministic, you can run the code and get exactly the same results as appear in the book, which is an excellent sanity check.
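The reproducibility trick is simply that re-seeding the generator replays the same sequence of "random" numbers, so anyone running the book's code with the same seed gets byte-for-byte identical samples. A minimal demonstration (the seed value here is arbitrary):

```python
import torch

# Seeding PyTorch's RNG makes subsequent draws deterministic...
torch.manual_seed(123)
a = torch.rand(3)

# ...and re-seeding with the same value replays the exact same draws.
torch.manual_seed(123)
b = torch.rand(3)

print(torch.equal(a, b))  # True
```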

Once that's covered, we get into the core of the first section: how do we write our loss function?

[ Read more ]