Writing an LLM from scratch, part 25 -- instruction fine-tuning

Posted on 29 October 2025 in AI, LLM from scratch, TIL deep dives

This post is on the first part of chapter 7 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", which covers instruction fine-tuning.

In my last post, I went through a technique which I'd found could sometimes make it possible to turn non-fine-tuned models into reasonable chatbots; perhaps unsurprisingly, the GPT-2 model isn't powerful enough to work that way.

So, with that proven, it was time to do the work :-) This post covers the first half of the chapter, where we actually do the fine-tuning; I'll post later about the second part, where we start evaluating the model that we get.

Just as with the last chapter, what we're doing here is essentially plugging together the various things we've built so far, and Raschka's explanation is very clear, so I don't have that much in the way of notes -- but here are the bits that made me pause.

The input format

This was quite interesting. In the past, all of the templates I've seen for instruction following have been designed for chatbots -- that's what we tend to use LLMs for, after all. There's a system prompt and then a format for "message from user", and another for "message from bot".

In my series on fine-tuning, where I learned how to fine-tune an 8B-parameter Llama 3 base model to work as a chatbot, I used the format for Llama 2, which is not dissimilar to the Phi-3 one that's given as an example in the book. The Alpaca-style one is quite different; it's designed for a one-shot interaction rather than for chat:

Below is an instruction that describes a task. Write a response that
appropriately completes the request.

### Instruction:
<some instructions>

### Input:
<optional, some input>

### Response:
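
To make that concrete, here's a minimal sketch of how you might assemble a prompt in that format from a dataset entry -- the dict keys are my assumption about what an entry looks like, not necessarily what the book uses:

def build_alpaca_prompt(entry):
    # `entry` is assumed to be a dict with "instruction" and (optional) "input" keys.
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # The "### Input:" section is only added when there's actually an input.
    if entry.get("input"):
        prompt += f"\n\n### Input:\n{entry['input']}"
    prompt += "\n\n### Response:\n"
    return prompt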

Now, Alpaca dates from early 2023, and it looks like they used that prompt following a paper "Self-Instruct: Aligning Language Models with Self-Generated Instructions".

I had to think a bit about why one would use that, and I think the core is that this was early days (all of two years ago!) and LLMs had very short context lengths and weren't very smart. Chat uses a lot of tokens! You need the system prompt, and then every conversational turn so far. With our GPT-2 model we have just 1024 tokens to play with -- and Alpaca wasn't much better, as it was built as a fine-tune of Meta's original Llama model, which (according to the model card) had a context length of 4096 tokens.

Chat is a good way to interact with a model, as the multiple conversational turns allow you to build up large amounts of context for the model to play with, meaning that (hopefully) it will be able to give good answers. But if that context doesn't fit into the context length, then it's not so good. Early chatbots, I believe, worked around this by replacing the "transcript" with a summary, but there's only so much you can fit into a 4k-token one.[1] Maybe modern ones do this too, but with GPT-5 having a 400,000-token context window it's not so important.

So, in Alpaca times, people were thinking in terms of one-shot interactions with LLMs, and the pattern they chose was targeted at that, so that you could get all of the interesting information and a reply into one sequence.

An interesting bit of history! (Again, two years ago is history. Cripes.)

The custom collation

This was explained well in the book, but it's an interesting enough point that I thought it was worth going over.

Last time around we had a bunch of text messages as our inputs to the model. We found the longest one, and then padded them all out to the same length with end-of-sequence tokens, which meant that we could construct batches -- naturally, every input in a batch has to be the same size.

This time around we're being a bit smarter. Although every item in a given batch needs to be the same length, batches themselves can be of different lengths -- that is, if our batch size was 8, and the longest sequence in our first batch was 73 tokens long, then we would make our first batch 8×73 -- but then, if the longest sequence in our second batch was only 60 tokens long, then the second batch could be 8×60. We only need to pad out sequences to match the longest sequence in their batch, and that saves us time when running the model.
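
In code, the only change needed is to compute the maximum length per batch rather than across the whole dataset. A minimal sketch, assuming we're using GPT-2's <|endoftext|> token (ID 50256) for padding:

import torch

def pad_to_longest_in_batch(batch, pad_token_id=50256):
    # `batch` is a list of token-ID lists of varying lengths.
    batch_max_len = max(len(item) for item in batch)
    padded = [item + [pad_token_id] * (batch_max_len - len(item)) for item in batch]
    return torch.tensor(padded)  # shape: (batch_size, batch_max_len)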

That got me thinking about inference at scale -- the kind of thing that LLM providers like OpenAI or Anthropic do. They're going to be receiving very large numbers of sequences to complete, and of course they are going to be running them through in batches. But padding tokens are kind of a waste of inference GPU cycles.

They'll have a bunch of different instances of their models running on different machines to handle all of these requests, and they almost certainly have some kind of code to try to route sequences of similar length to the same instances. To take a toy example, if you had a batch size of two and received six sequences, with lengths of 2, 9, 100, 11, 3 and 120, then you'd want to route them so that one instance received the (2, 3) pair, another the (9, 11), and another (100, 120) -- that minimises the amount of padding required and saves wasted cycles.

Following on from that, it looks like we could actually improve the book's code by doing something similar here, grouping similarly-sized inputs together. That would be quite complicated, though, so probably not worth it in an educational context like this.
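
For what it's worth, a crude version of that grouping isn't much code -- here's a hypothetical sketch (the bookkeeping to shuffle batch order, handle remainders and so on is what would make a real version fiddly):

def batches_by_length(token_id_lists, batch_size):
    # Sort sequences by length, then slice off consecutive batches, so each
    # batch contains similarly-sized sequences and needs minimal padding.
    ordered = sorted(token_id_lists, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

With the toy example above -- lengths 2, 9, 100, 11, 3 and 120 and a batch size of two -- that gives exactly the (2, 3), (9, 11) and (100, 120) pairs.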

Anyway, our collator needs to handle the variable-length batches, and through various drafts we converge on one that does it, with one tweak.

Masking out padding tokens for loss

This was a really important and interesting bit. Let's say that we're feeding in a 20-token input, in a batch where the longest of the other sequences is 30 tokens long. That means that we have ten padding tokens at the end. Let's represent that input sequence like this:

01234567890123456789xxxxxxxxxx

The numbers are our token IDs, and I've used x to represent the end-of-sequence token that we use for padding.

Now, we need our target sequence to predict for that. The first version that we come up with in the book looks like this:

1234567890123456789xxxxxxxxxxx

So, we've just done the normal trick of shifting left by one character, and we've added an extra end-of-sequence token at the end to make the lengths match.

But as the next step, we replace all of the padding tokens, apart from the one right at the end of the "real" part of the sequence, with an invalid token ID, -100. Using I to represent that, we have:

1234567890123456789xIIIIIIIIII

The core thing to remember here is that we honestly don't care what the model generates after it's done the real, unpadded sequence, plus an end-of-sequence token. It could generate random junk and it wouldn't matter, because it's already done the important part of predicting next tokens for all of the input sequence.

The -100 is a magic number that PyTorch's cross_entropy function uses in target sequences to say "ignore this position" -- it's the default value of its ignore_index argument. I must admit that as a software engineer, it gives me a bit of an "ick" -- magic numbers are never nice -- but it does make sense. Negative numbers are invalid targets when you're comparing predictions across tokens, which have indexes from zero upwards. In general, if you're predicting categories -- which essentially we are with tokens -- then the "minus one'th" category doesn't make sense. You could use any other negative number as the default, but -1 might cause confusion (being used heavily in ML code to get the last element of a sequence), and if you're going to use any other negative number it might as well be -100.
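
Here's a quick demonstration of that behaviour -- the numbers are made up, but the point is that positions whose target is -100 simply drop out of the loss:

import torch
import torch.nn.functional as F

# Toy case: four positions, vocabulary of five tokens.
logits = torch.randn(4, 5)
targets = torch.tensor([2, 4, 1, -100])  # the last position is padding we want ignored

# cross_entropy skips positions whose target equals ignore_index (default -100),
# so the loss over all four positions equals the loss over just the first three.
loss_all = F.cross_entropy(logits, targets)
loss_unpadded = F.cross_entropy(logits[:3], targets[:3])
print(torch.isclose(loss_all, loss_unpadded))  # tensor(True)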

"Purer" solutions would be hard, anyway. We're working with a PyTorch tensor here, so it has to be a number -- which rules out using something like None or some kind of special object. You could keep an "ignore after this index" number, but you'd need as many of them as you have items in the batch and it would be just another thing to keep track of. You could even keep a tensor of boolean "ignore these tokens" of the same size as your batch -- a mask -- but that would have the same problem of being something to pass around in your code.

As I understand it, those last two solutions are actually used in some systems -- imagine that your outputs were not logits to create a probability distribution across categories or tokens, but were meaningful numbers in and of themselves. Pretty much any number you picked might be a valid output from the model. You wouldn't be using cross entropy loss in those cases anyway, of course, but you'd need to keep some record of where the padding starts so that you can ignore it.
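
A hypothetical sketch of that mask-based approach, for a regression-style model where the outputs are plain numbers and there's no invalid value to hijack:

import torch

# Predictions and targets are real numbers, so every value is potentially valid;
# instead of an ignore_index we carry a boolean mask marking the padded positions.
predictions = torch.tensor([0.9, 1.8, 3.1, 0.0, 0.0])
targets = torch.tensor([1.0, 2.0, 3.0, 0.0, 0.0])
mask = torch.tensor([True, True, True, False, False])  # False = padding

# Only the unmasked positions contribute to the (mean squared error) loss.
loss = ((predictions[mask] - targets[mask]) ** 2).mean()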

One final thing worth noting is that we only add the -100s to the targets. This makes sense: all of the inputs get fed into the LLM, so anything that isn't a valid token ID would make the embedding layer very unhappy. It also explains why we first add the padding as regular end-of-sequence tokens and only later convert them to -100 in the targets: doing it that way lets us pad the sequence, take the input as everything but the last token, take the targets as tokens 1 through to the end, and then, as a final step, replace all but the first end-of-sequence padding token in the targets with -100.
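
Putting all of that together, here's a sketch of a collate function along those lines -- it follows the steps described above but isn't the book's exact code, and it assumes the pad token is GPT-2's <|endoftext|> (ID 50256):

import torch

def collate_with_masking(batch, pad_token_id=50256, ignore_index=-100):
    # `batch` is a list of token-ID lists of varying lengths.
    max_len = max(len(item) for item in batch) + 1  # +1 for the extra pad token

    inputs, targets = [], []
    for item in batch:
        # Pad with end-of-sequence tokens, then build the shifted input/target pair.
        padded = item + [pad_token_id] * (max_len - len(item))
        input_ids = torch.tensor(padded[:-1])   # everything but the last token
        target_ids = torch.tensor(padded[1:])   # shifted left by one

        # In the targets, replace every padding token except the first one
        # after the real sequence with ignore_index.
        pad_positions = (target_ids == pad_token_id).nonzero().squeeze(-1)
        if pad_positions.numel() > 1:
            target_ids[pad_positions[1:]] = ignore_index

        inputs.append(input_ids)
        targets.append(target_ids)

    return torch.stack(inputs), torch.stack(targets)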

Randomness revisited

As with the last chapter, I got different results to the ones in the book; something different about the order of execution in my version of the code when compared to Raschka's meant that despite all of the careful use of torch.manual_seed, the numbers didn't quite match up. But, again as before, they were close and the trends in -- for example -- loss were the same, so the right things were happening.

Training

When I finally ran the training on my RTX 3090, it took 48 seconds; I watched it in nvtop and saw that it was using 9 GiB of VRAM. Due to the usual differences in randomness, I got slightly different results to the book -- but similar enough not to be any cause for concern:

[Chart: loss over time, training for two epochs]

Also, due to a typo, I accidentally ran it with five epochs -- that took two minutes. I noticed that validation loss started rising fairly steadily after epoch 2, with train loss dropping -- clearly overfitting.

[Chart: loss over time, training for five epochs]

Presumably Raschka chose two epochs for exactly that reason :-)

Nits

A couple of things that I noticed while working through the code. When I first ran the download script, I got "module 'urllib' has no attribute 'request'". That's because of a typo in the import at the start -- instead of

import urllib

...it should be:

import urllib.request

The other thing that tripped me up was the original custom_collate_draft_1. We add on a padding token, then pad out the sequence with more padding tokens, then remove the last one. I found that confusing -- why not just add on the required number in the first place rather than adding on an extra one and then deleting it?

It became clear later on; it's to make it mirror the next function, which adds on an extra end-of-sequence token for our targets, but having this anticipatory code in there with no explanation in the first draft made me start doubting my sanity for a little while...

Minor points, though.

Wrapping up

So, that was it for the first half of chapter 7 in the book. The next bit looks like fun -- we're going to use a smart model to evaluate our relatively dumb one on how well it follows instructions. Definitely looking forward to that :-)

Here's a link to the next post in this series.


  1. I experimented with ChatGPT 3.5 at around the time Alpaca came out and came to the conclusion that it had a similar context length of about 4k tokens. It looked like it worked around it by, when the transcript started approaching the context length, spinning off a separate instance to summarise it into a "story so far" kind of thing, which was then injected into the start of the chat instead of the full context.

    My experiment was to say "my favourite colour is green, please remember that", then to send a quote of about 4,000 words from "Moby Dick", prefacing that with either "this is unimportant, please ignore" or "this is important, please remember". Next, I'd ask what my favourite colour was again.

    If I told it that the quote was unimportant, then it would remember, but if I told it that it was important, it would think my favourite colour was blue.

    Asking it for transcripts of the conversation so far would give a reasonable one, skipping the quote, if the quote was tagged as unimportant, but would give a completely hallucinated one if the quote was tagged important.