Writing an LLM from scratch, part 30 -- digging into the LLM-as-a-judge results

Posted on 9 January 2026 in AI, LLM from scratch, TIL deep dives, Python

I'm still working on my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Last time around, I trained four base models on Lambda Labs machines, using the GPT-2 architecture from the book. I used two ways to compare them with each other, with three models that I'd trained locally, and with the original GPT-2 weights from OpenAI:

  1. A simple cross entropy loss over a fixed test set.
  2. The results for an instruction fine-tune test that's covered in the book.
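For concreteness, the first of those measures can be sketched like this. This is a minimal sketch, not the post's actual code: it assumes the model maps `(batch, seq)` token IDs to `(batch, seq, vocab)` logits, as the book's GPT-2 implementation does, and the function name and batch format are my own.

```python
import torch


@torch.no_grad()
def mean_test_loss(model, batches):
    """Average next-token cross entropy over a fixed list of
    (inputs, targets) batches, each of shape (batch, seq).
    Assumes the model returns (batch, seq, vocab) logits."""
    model.eval()
    total, n = 0.0, 0
    for inputs, targets in batches:
        logits = model(inputs)  # (B, T, V)
        # Flatten to (B*T, V) logits vs (B*T,) target ids for cross_entropy
        loss = torch.nn.functional.cross_entropy(
            logits.flatten(0, 1), targets.flatten()
        )
        total += loss.item()
        n += 1
    return total / n


if __name__ == "__main__":
    # Tiny stand-in "model": an embedding that maps ids straight to logits
    toy_model = torch.nn.Embedding(12, 12)
    x = torch.randint(0, 12, (2, 4))
    print(f"mean loss: {mean_test_loss(toy_model, [(x, x)]):.3f}")
```

The same fixed test set is fed to every model, so the numbers are comparable across runs.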

Here were the results I got, sorted by the loss:

Model                               Test loss   IFT score
OpenAI weights: medium                  3.231       38.53
OpenAI weights: small                   3.500       22.98
Cloud FineWeb, 8x A100 40 GiB           3.674       17.09
Cloud FineWeb, 8x H100 80 GiB           3.725       11.98
Cloud FineWeb, 8x A100 80 GiB           3.730       11.71
Cloud FineWeb, 8x B200 160 GiB          3.771       13.89
Local FineWeb train                     3.944       16.01
Local FineWeb-Edu extended train        4.135       14.55
Local FineWeb-Edu train                 4.167       16.86

Now, you'd expect there to be at least a loose correlation: the lower the loss, the higher the IFT score. But while we can see a clear difference between the OpenAI weights and our own, among our own models there doesn't seem to be any logical pattern.
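To put a rough number on that lack of pattern, here's a quick stdlib-only sketch computing the Spearman rank correlation between test loss and IFT score for the seven self-trained models, using the figures from the table above. A strong "lower loss, higher score" relationship would give a value near -1:

```python
# Test losses and IFT scores for the seven self-trained models,
# copied from the results table (OpenAI weights excluded).
losses = [3.674, 3.725, 3.730, 3.771, 3.944, 4.135, 4.167]
scores = [17.09, 11.98, 11.71, 13.89, 16.01, 14.55, 16.86]


def ranks(xs):
    """Rank each value from 1 (smallest) to n (largest)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r


rl, rs = ranks(losses), ranks(scores)
n = len(losses)
d2 = sum((a - b) ** 2 for a, b in zip(rl, rs))
# Spearman's rho for untied ranks: 1 - 6*sum(d^2) / (n*(n^2 - 1))
rho = 1 - 6 * d2 / (n * (n * n - 1))
print(f"rho = {rho:.2f}")  # rho = 0.18
```

The result is weakly positive rather than strongly negative, which matches the eyeball impression: within the self-trained models, loss and IFT score barely agree at all.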

I think that the problem is that the results from the GPT-5.1 LLM-as-a-judge are not consistent between models. That's not a complaint about the code or its original design, of course -- it was originally written as part of the LLM book as a way of doing a quick test on an instruction fine-tuned model that we'd spent the previous 238 pages writing -- just something that was a bit more efficient than reading hundreds of input/output pairs ourselves. It was never meant to be a tool to compare models in the way I'm using it now.

In this post I'll dig into why it doesn't work for this kind of thing, and see if that's something we can change.

[ Read more ]


Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud

Posted on 7 January 2026 in AI, LLM from scratch, TIL deep dives, Python

I'm carrying on with my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Having proven that I could train a GPT-2 small-scale base model from scratch on my RTX 3090 in 48 hours, I wanted to try training it on a multi-GPU machine on Lambda Labs. There are two benefits I see in doing that:

  1. I can learn what you need to change in a simple single-GPU training loop to make it multi-GPU.
  2. If I can get the training time for a full base model down from 48 hours to something more manageable (and hopefully not too expensive), then I can try a few experiments to see how I can improve the quality of the trained model. I have a bunch of ideas about why my own base model wasn't as good as the original OpenAI one, and it would be good to know which (if any) of them are right.
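The core change that the first benefit refers to is wrapping the model in PyTorch's DistributedDataParallel. Here's a minimal, hedged sketch, not the post's actual code: it runs as a single CPU process with the "gloo" backend purely to illustrate the API, whereas a real multi-GPU run would launch one process per GPU via torchrun and use "nccl", and the Linear model is a toy stand-in for the GPT-2 model.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def ddp_train_step():
    """One training step with the model wrapped in DDP.

    Single-process, CPU, "gloo" backend -- just enough to show the
    wrapping; a real run uses torchrun + "nccl", one process per GPU.
    """
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = torch.nn.Linear(8, 2)  # toy stand-in for the GPT-2 model
    ddp_model = DDP(model)  # gradients are all-reduced across ranks in backward()

    opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-3)
    x = torch.randn(4, 8)
    y = torch.tensor([0, 1, 0, 1])
    loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
    loss.backward()
    opt.step()

    dist.destroy_process_group()
    return loss.item()


if __name__ == "__main__":
    print(f"loss after one DDP step: {ddp_train_step():.3f}")
```

With world_size > 1, each rank would also need its own data shard (typically via DistributedSampler), which is part of what the post works through.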

In addition, I wanted to see if anything unexpected dropped out of it; after all, there were four different sizes of machines that I wanted to try, so I'd be doing four from-scratch trains on the same dataset. Does the machine size affect the quality of the model in some way?

Here's what happened. As with the last post, this is a set of tidied-up lab notes, so you can see the full journey. There's a lot to it! I was considering splitting it into multiple posts -- "writing the code", "building the datasets", "running the trains" -- but they're interleaved. Each train taught me something about how to structure the code to make it easier to use, so the code kept changing.

So I think it's worth documenting the process as it really was. If at some point I want to write a how-to document on porting single-GPU code to multi-GPU, I'll be able to mine this for resources, and in the meantime, hopefully this will be of use to readers -- even if it's just at the level of "I got this error message, how do I fix it?"

Anyway, once again I don't want to bury the lede, so: after spending US$215.16 on various trains on various servers, I found that a reasonably cheap instance on Lambda Labs, with 8x A100 GPUs, each of which has 40 GiB of VRAM, is the sweet spot for this particular 163M-parameter, ~Chinchilla-optimal single-epoch run. It can train the model in less than four hours, its GPUs happen to be the right size for batches that minimise loss (more on that later), and it can do that train for about US$35, excluding validation.
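As a rough check on what "~Chinchilla-optimal" implies here, assuming the commonly quoted rule of thumb from the Chinchilla paper of roughly 20 training tokens per model parameter (my gloss, not a figure from the post):

```python
# Back-of-the-envelope: Chinchilla's ~20 tokens-per-parameter heuristic
# applied to this run's 163M-parameter model.
params = 163e6
tokens = 20 * params
print(f"~{tokens / 1e9:.1f}B training tokens")  # ~3.3B
```

So a single-epoch run over a dataset of about that many tokens is in the right ballpark for this model size.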

If you'd like to read the gory details of what I did, then read on -- but if you prefer, you can jump straight to the results.

[ Read more ]