Writing an LLM from scratch, part 32a -- Interventions: training a baseline model

Posted on 4 February 2026 in AI, LLM from scratch, TIL deep dives, Python

I'm rounding out my series of posts on Sebastian Raschka's book "Build a Large Language Model (from Scratch)" by seeing how good a base model I can train from scratch on my own hardware. I started by training one in two days on my RTX 3090, and found that while it was a decent little model, it wasn't as good as the original GPT-2 small, either in terms of the loss it got on my test dataset or in terms of how well it followed instruction prompts after fine-tuning on them. I decided that I wanted to see what levers I could pull -- dropout, attention weight biases, and so on -- to make it better.

For that, I didn't want to have my PC tied up for days at a time with multiple long training runs, so I learned how to train faster in the cloud. That led to some refinements in the prompt-following test I was using, and I also spent a bit of time on a side quest getting the various models I'd trained onto Hugging Face Hub.

Now it's time to try the various "interventions", as I'll call them -- the levers to pull to see if I can make the model better. This post is to recap what they are, and to describe what I did to establish a baseline model to compare to.

The interventions

I listed a number of possible interventions at the end of the RTX 3090 post; I'm not going to do them all, but for completeness, here's the full list:

I'm going to work through each of those apart from the first two and the batch size (and will retrospectively add links to the list above when I do), trying a train with just that intervention and nothing else, on a cloud machine. Once that's done, I'll bake all of the things that helped into the training loop, and do another local train -- with gradient accumulation to make the batch size match the cloud instances'.

The cloud machine size that I decided to use for this was the one that came out the most cost-effective (and due to its VRAM size, had the best loss) in my earlier cloud training test: an 8x A100 machine with 40 GiB VRAM per GPU.

But first, we need a baseline model.

Why a new baseline?

I've already done a train on an 8x A100 40 GiB machine -- why do we need a new one?

In my cloud training post, I came to the conclusion that the training-time cost of running a periodic validation loop was not really worth it, at least in this case. Two of the biggest reasons to have validation during training are to work out when you're overfitting on a multi-epoch train, and to see how your model handles data it has not been trained on.

In a single-epoch train like this, you're not going to overfit -- every sample the model sees will be new to it -- and for the same reason, the training loss itself is measured on samples the model has not yet been trained on at the point it's calculated (though of course it will be trained on them as soon as we do the backward pass starting from that loss).

Of course, it's not perfect -- a big benefit of the validation loss is that it's over the same held-back dataset on every run -- and there are arguments for keeping it (albeit, perhaps doing full runs less frequently than I was). But for these experiments, I decided that I'd simply drop it.

I also wanted to introduce a consistent random seed at the start of the training loop. I didn't have that in my cloud trains, and of course if we want to have solid results on whether each intervention really does improve matters, then we need one so that we can be sure they're all starting from the same point.

Both of those meant that I couldn't use the earlier train on the 8x A100 40 GiB machine as a baseline; I'd need a new one, introducing those two changes: no validation during the training run (using training loss as a proxy), and setting a random seed at the start for reproducibility.
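The seeding itself is the simple part; here's a minimal sketch of what I mean by it (assuming PyTorch, plus the standard-library and NumPy RNGs -- not necessarily the exact code in the repo):

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed every RNG the training setup might touch, so repeated runs
    start from identical model weights and identical data shuffles."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds CPU RNG
    torch.cuda.manual_seed_all(seed)  # no-op if there's no GPU


set_seed(42)
first = torch.rand(3)
set_seed(42)
second = torch.rand(3)
assert torch.equal(first, second)  # same seed, same "random" initial values
```

With this called at the top of the training loop, every run -- baseline or intervention -- starts from the same initial weights.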

So: what was the baseline train going to look like?

Creating the baseline

The first step was to strip out the validation code, replacing it with code that just took periodic checkpoints, keeping track of which one had the best average training loss over the period since the previous checkpoint. Next, I decided that the training chart generated during the run should plot not just the training loss, but also an indicator of the maximum and minimum training loss over the steps in each period. Then I added the random seed, which I set to 42.
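The "best checkpoint" bookkeeping boils down to something like this pure-Python sketch (the function name and shape are hypothetical, and the real code saves a checkpoint for each period rather than just tracking numbers):

```python
def best_checkpoint_period(step_losses, checkpoint_every):
    """Given per-step training losses, return (period_index, avg_loss) for
    the checkpoint period with the lowest average training loss -- the
    same bookkeeping used to pick the "best" checkpoint to keep."""
    best = None
    for start in range(0, len(step_losses), checkpoint_every):
        period = step_losses[start : start + checkpoint_every]
        avg = sum(period) / len(period)
        # In the real loop, a checkpoint would be written out here.
        if best is None or avg < best[1]:
            best = (start // checkpoint_every, avg)
    return best


# Toy example: losses over six steps, checkpointing every two.
print(best_checkpoint_period([4.0, 3.0, 2.0, 1.0, 5.0, 6.0], 2))  # (1, 1.5)
```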

A couple of bugfixes, and we were left with this version of the code.

One thing to highlight: in the train.json file that specifies the various training parameters, I set the per-GPU micro-batch size to 12 rather than the 13 I'd used on this size of machine earlier. Two reasons for that:

Firstly, I'm going to want to do a local run with gradient accumulation later, using all of the helpful interventions. With gradient accumulation, you run a number of steps with micro-batches that fit into your memory, but you don't update the parameters each time; the gradients from each backward pass simply accumulate. After a set number of those steps, you do one big parameter update based on the accumulated gradients -- hence the name. The full batch is all of those smaller batches taken together.

If I want that to closely match the cloud train, I'll want the accumulated batches to be the same size as each global batch in the cloud.

Now, on my local machine, I can fit a batch of 6 into VRAM. So that means that the full batch needs to be divisible by 6 [1]. On the cloud train, with a micro-batch of 13 and 8 GPUs, we had an overall batch size of 104. 104 is not divisible by 6: no joy. But with a micro-batch size of 12, we have an overall batch of 12×8=96, which means we'd be able to do gradient accumulation with one parameter update every 96÷6=16 local steps.

Secondly, while my estimate of the ideal overall batch size was based on a rather arbitrary bit of curve-fitting, it did suggest that 97 was the ideal size -- and 96 is much closer to that than 104 was, so it could be interesting to see whether that helped!
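The gradient accumulation scheme described above can be sketched like this -- a toy example with a tiny stand-in model, not the real training loop. The key details are scaling each micro-batch loss by the number of accumulation steps, and only calling the optimiser once per full batch:

```python
import torch
from torch import nn

# Hypothetical tiny model and data, standing in for the real setup.
torch.manual_seed(0)
model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

micro_batch, accum_steps = 6, 16  # 6 × 16 = 96, matching the cloud global batch
data = torch.randn(micro_batch * accum_steps, 8)
targets = torch.randn(micro_batch * accum_steps, 1)

opt.zero_grad()
for i in range(accum_steps):
    xb = data[i * micro_batch : (i + 1) * micro_batch]
    yb = targets[i * micro_batch : (i + 1) * micro_batch]
    # Scale each micro-batch loss so the summed gradients match those of
    # one big batch of 96 averaged together.
    loss = loss_fn(model(xb), yb) / accum_steps
    loss.backward()  # gradients accumulate in each parameter's .grad
opt.step()           # one parameter update per 96 samples
opt.zero_grad()
```

Because the losses are scaled by `1/accum_steps`, the accumulated gradients are (up to floating-point error) identical to those from a single batch of 96.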

So, having coded that up and set up the configuration, it was time to run it.

Here's the training chart it came up with:

Baseline training run on an 8x A100 with 40 GiB/GPU

Note the loss spikes at around global steps 4,200, 13,000 and 23,000. Those are important; I'll explain why later.

The training run reported this at the end:

Training complete in 12,243.523 seconds
Tokens seen: 3,260,252,160
Throughput: 266,284 tokens/second
Final train loss: 3.743

So it took about 3h24m to train -- even less than the previous cloud experiments' estimates had suggested for a run without validation. The cost was about US$35.

Here is the model on Hugging Face Hub.

Let's see how it looks.

Evals

For these intervention posts, I won't run the instruction-following tests, as they can only be run against a batch of models in one go to get results that are consistent with each other.

But the smoke test -- seeing how it completes the sequence Every effort moves you -- is worthwhile:

giles@perry:~/Dev/ddp-base-model-from-scratch (main)$ uv run test_smoke.py runs/8xa100m40-baseline/model.json runs/8xa100m40-baseline/checkpoints/best/model.safetensors
Every effort moves you in on a good cause.
If it doesn’t work you would like to join the

Looks good! Reasonably coherent.

Now we can find the loss on our held-back test set:

giles@perry:~/Dev/ddp-base-model-from-scratch (main)$ uv run test_loss.py datasets/ runs/8xa100m40-baseline/model.json runs/8xa100m40-baseline/checkpoints/best/model.safetensors
Fetching 4 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 990.57it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3200/3200 [04:53<00:00, 10.91it/s]
Loss against our test dataset: 3.692

That's a bit worse than the 3.674 we got for the original cloud train. Either the calculations of the optimal batch size I did were not quite right (entirely likely, they were very ad-hoc) or the model weights we started with, given the random seed we're using, just happened to lead us in a slightly worse direction (also plausible). Either way, it's in line with what we expected, and is still better than the test loss of 3.725 that we got with the second-best machine in the cloud comparison post (the 8x H100 80 GiB with a global batch size of 216).
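For reference, the core of this kind of test-loss computation -- mean next-token cross-entropy over the held-back set -- can be sketched as follows. The names here are hypothetical and the real test_loss.py may differ in its details:

```python
import torch
from torch import nn


def mean_test_loss(model, batches):
    """Average next-token cross-entropy over a held-back dataset.
    `batches` yields (inputs, targets) pairs of token-ID tensors,
    with targets being the inputs shifted one position left."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for inputs, targets in batches:
            logits = model(inputs)  # shape: (batch, seq_len, vocab_size)
            loss = nn.functional.cross_entropy(
                logits.flatten(0, 1),  # (batch * seq_len, vocab_size)
                targets.flatten(),     # (batch * seq_len,)
            )
            total += loss.item()
            count += 1
    return total / count
```

The script's reported 3.692 would then just be this average over all 3,200 test batches.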

So: we have a solid baseline model -- before we wrap up, let's consider those spikes in the loss that I called out in the training chart.

The loss spikes

Random spikes in the loss are a Bad Thing, right? Certainly they're a bad thing for a train in general, especially if you don't know for sure what's causing them. But my working assumption has been that they're caused by exploding gradients -- for some specific sample in the dataset, the gradients have gone up to some insanely high value, and we've had a bad update to our parameters as a result. It hasn't completely knocked the model back to its starting point, but it does take some time to recover, so we lose the benefit of some of our training.

If that is the case -- and it's not just something like a batch happening to have stuff that's wildly different to the rest of the training data, or something weird in the optimiser -- then gradient clipping is the solution. I wanted to see if it would help the model quality in general, but of course if we hadn't had any loss spikes in this baseline train it would have been hard to see if that was the case!
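In PyTorch, that fix is a one-liner with `torch.nn.utils.clip_grad_norm_`, called between the backward pass and the optimiser step. Here's a toy sketch (the huge loss multiplier just simulates a pathological sample):

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(4, 8), torch.randn(4, 1)
loss = nn.functional.mse_loss(model(x), y) * 1e6  # simulated "bad" sample
loss.backward()

# Rescale all gradients so their global L2 norm is at most 1.0. The
# return value is the norm *before* clipping -- handy for logging spikes.
pre_clip_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

opt.step()  # the update is now bounded, however wild the gradients were
opt.zero_grad()
```

The choice of `max_norm` is a hyperparameter; 1.0 is a common default, but that's something the clipping experiment itself would need to explore.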

So I was very glad to see them here: if there had been none, I would either have had to do a gradient clipping experiment with no real expectation of it helping -- or do another baseline train with a different random seed in the hope that that caused some spikes, which would have cost another US$35.

All in all, it was good to see them there, as it sets us up well for that experiment.

Wrapping up

So, we've trained a baseline model that we can make changes to -- the interventions I listed at the start -- and get a pretty reliable understanding of whether or not they help the quality of the final model. With that in place, we're in a good position to start running those intervention tests!

Given the loss spike situation in that chart, I think that a solid first one to go for -- even though it was the last in that list at the top of this post -- is gradient clipping. Where are those loss spikes coming from, and if it's exploding gradients, what happens if we limit the damage they do with gradient clipping?

Stay tuned! I've already done the training run for that (while I wrote this one up), so I should be able to post about it tomorrow.

Here's a link to the next post in this series.


  [1] Well, you could potentially do something with batches of different sizes, but that would be fiddly.