10Gb/s Ethernet: what I actually did to get it working in my home

Posted on 29 April 2026 in TIL, Gadgets |

Having learned enough about 10Gb/s Ethernet to be comfortable about setting it up in my house, it was time to bite the bullet: order it from the ISP, buy some kit, and get started.

I already had 2.5Gb/s working. The apartment has structured cabling -- each room has one or more RJ45 sockets in the wall, and there's a patch panel downstairs by our front door that has a matching patch socket for each wall socket. So when we moved in, I simply set things up so that there was a 2.5Gb/s switch down by the patch panel, and wired everything together there. Most of our stuff works over WiFi, of course, but I needed a wired backbone to connect the excessive number of computers in my study both to each other, and to the outside world.

What did I need to do?

[ Read more ]


10Gb/s Ethernet: what I had to (re)learn

Posted on 28 April 2026 in TIL, Gadgets |

My ISP recently started offering a 10Gb/s option, and my "shiny new thing!" Pavlovian response immediately kicked in. So of course, I had to upgrade the wired networking in my home -- which meant I had to learn a few things to get it all working, and relearn a bunch of stuff I'd forgotten over the years.

Wired networking for homes and small offices hasn't really moved forward that much in the last 20-odd years. Back in 2006, gigabit Ethernet was standard for businesses, and most home users moved to it not long after. Perhaps due to the rise of WiFi for most "last few metres" connections, it's pretty much stagnated there, apart from a bit of a push towards 2.5Gb/s more recently.

But with faster ISP connections arriving, I think things are starting to become a bit more interesting. Even the fastest WiFi 7 connections are only able to get up to around 6Gb/s to a single device -- and that's in an ideal "super-fast machine sitting right next to the AP in a shielded lab" setup.

Here's what I had to drag up from my memory, and the new stuff I had to learn, in order to get this all working. I'll write about the background in this post, and then tomorrow I'll post about what I actually put in place.

[ Read more ]


Writing an LLM from scratch, part 33 -- what I learned from finally getting round to the appendices

Posted on 22 April 2026 in AI, LLM from scratch, TIL deep dives |

After finishing the main body of "Build a Large Language Model (from Scratch)", I set myself three follow-on goals.

The first was training a full GPT-2-small-style base model myself. That was reasonably easy to do but unlocked a bunch of irresistible side quests; having finally got to the end of those, it's time to move on to the others: reading through the book's appendices, and building my own GPT-2 style model in JAX.

This post is about the appendices. The TL;DR: there was stuff in there that could have saved me time in my side-questing, but I think that having to work those things out from scratch probably helped me learn them better.

[ Read more ]


Writing an LLM from scratch, part 32m -- Interventions: conclusion

Posted on 21 April 2026 in AI, LLM from scratch, TIL deep dives, Python |

Last November, when I finished the main body of "Build a Large Language Model (from Scratch)", I set myself a number of follow-on goals. One was "training the full GPT-2 base model myself".

I've reached the end of that journey, with a model that is almost -- if not quite -- as good as GPT-2 small, trained in 44 hours on my own machine, so I thought it would be worth summarising how it went.

[ Read more ]


Writing an LLM from scratch, part 32l -- Interventions: updated instruction fine-tuning results

Posted on 20 April 2026 in AI, LLM from scratch, TIL deep dives |

I've been working on a GPT-2-small-style LLM based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and have tried a bunch of different things to see if I could get it to approach the quality of the original OpenAI GPT-2-small, measured in terms of loss on a held-back test dataset. After working through them, in my last post, I managed to train one that was almost (if not quite) there.

Now, back before I started digging into these interventions, I was doing three evals for each model I built: a smoke test (to see if it could give a coherent completion to "Every effort moves you"), a test for that test set loss, and an instruction-following test that fine-tuned the model on the Alpaca dataset, got it to generate results for a test set of instructions, and then used an LLM as a judge to score them.

The idea behind this was that the loss on the test set was an interesting technical measure of the quality of a model, but it didn't really tell us much about how useful it might be in reality.

Unfortunately, in January, I realised that my methodology was bad; because I was asking the LLM to score a model in isolation, the LLM's natural randomness would mean that results were not really comparable, at least for models that were reasonably close in quality.

For example, if two models both replied to

Name the author of 'Pride and Prejudice'.

with:

The author of 'Pride and Prejudice' is Sarah Palin.

...then one run of the instruction-following test might "find the judge LLM in a good mood" and get, say, 5% -- after all, the model tried to answer, and actually used a real person's name, even if the answer was totally wrong. But in another run, the judge might be in a "worse mood" and score it at 0%.

My fix was to have two scripts:

The details are here.

Because doing it that way was significantly more work, I've not been doing these tests as part of the interventions mini-series. I felt it would make more sense to wait until I'd tried a bunch of interventions and got a number of models to try.

Now I have those, so let's give it a go!

[ Read more ]


How an LLM becomes more coherent as we train it

Posted on 17 April 2026 in AI |

I remember finding it interesting when, back in 2015, Andrej Karpathy posted about RNNs and gave an example of how their output improves over the course of a training run. What might that look like for a (relatively) modern transformer-based LLM?

I recently trained a GPT-2-small-style LLM, with 163 million parameters, on about 3.2 billion tokens (that's about 12.8 GiB of text) from the Hugging Face FineWeb dataset, and over the course of that training run, I saved the current model periodically -- 57 checkpoints over two days.
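As a quick sanity check on those figures, here is a minimal sketch of the arithmetic. It only uses the rounded numbers quoted above, so the result is approximate, and the variable names are just for illustration.

```python
# Rounded figures quoted in the text above.
TOKENS = 3.2e9   # about 3.2 billion training tokens
TEXT_GIB = 12.8  # about 12.8 GiB of text

# One GiB is 2**30 bytes.
text_bytes = TEXT_GIB * 2**30
bytes_per_token = text_bytes / TOKENS

print(f"Roughly {bytes_per_token:.2f} bytes of text per token")
```

In other words, each token covers a little over four bytes of text on average.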

Here's what it looked like -- the start, the end, and some interesting waypoints in between.

[ Read more ]


Writing an LLM from scratch, part 32k -- Interventions: training a better model locally with gradient accumulation

Posted on 15 April 2026 in AI, LLM from scratch, TIL deep dives, Python |

I've been working on a GPT-2-small-style LLM based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". I've trained various versions of it in the cloud to work out which interventions to the model and training code had the best effects on the loss it gets on a specific test dataset, and now I wanted to do a training run locally to match the best of those. For that, I wanted to match the batch size I was using for the cloud training runs.

When I first started learning this stuff, batching seemed like a performance thing -- with highly parallel systems like GPUs, it generally turned out that you could run a batch of (say) two inputs through a model in less than twice the time you could run one, so it made sense to batch them up.

For inference, that is exactly the advantage you get, but when training, it's become increasingly clear to me that you can also get an improvement in the quality of the model from batching. The best intuitive model I have is that if you run inputs through one-by-one, adjusting parameters after each, then it's easy for the model to "overcorrect" each time. With batches, you get an average set of gradients across all of the items -- which smooths things out and stabilises the training.

Of course, it's possible to overdo it. As an extreme example, imagine that you were somehow able to fit your whole training set into one batch -- then you could train by running that single batch through, doing a single backward pass, and then adjusting the parameters once. It's pretty clear that that would not work very well -- just one single update of the initially-random parameters.

When training on my local machine, I could fit a batch of six sequences into my RTX 3090. I'd found that when I moved to cloud machines, the larger batch sizes they allowed had a very positive effect on the loss I got out of the models when I tested them. From a quick-and-dirty bit of curve-fitting, I estimated that the optimal batch size for this model, with that training run, was somewhere around 97. Conveniently, that was close to the maximum I could fit onto an 8x A100 40 GiB/GPU machine, so I used a batch size of 96 to test the different interventions I was trying.

And when I finally put all of the interventions that helped with training together, I found (somewhat to my surprise) that their combined effect -- an improvement in loss of 0.113765 -- was less than half of the loss improvement of 0.252474 that I had got from increasing the batch size.

What that all made clear was that if I wanted to do a local training run that matched the quality of the cloud-trained model, I'd need to not only add on the interventions that I'd been testing in detail, but I'd need to match the cloud batch size. And for that, I needed to learn about gradient accumulation.
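To give a feel for the idea, here is a minimal sketch of gradient accumulation in plain Python. It is not the training code from the post: the toy linear model, the function names and the synthetic data are all assumptions made for illustration. It shows that averaging the gradients from 16 micro-batches of 6 gives the same result as one batch of 96, which is what lets a small GPU imitate a larger batch size.

```python
import math
import random

def mean_gradients(w, b, xs, ys):
    """Gradients of the mean squared error of y = w*x + b over one batch."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

def accumulated_gradients(w, b, xs, ys, micro_batch_size):
    """Average the gradients of equally sized micro-batches.

    Assumes len(xs) is an exact multiple of micro_batch_size.
    """
    n_micro = len(xs) // micro_batch_size
    acc_dw = acc_db = 0.0
    for i in range(n_micro):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        dw, db = mean_gradients(w, b, xs[lo:hi], ys[lo:hi])
        # Scale each micro-batch so the running total ends up as the overall mean.
        acc_dw += dw / n_micro
        acc_db += db / n_micro
    return acc_dw, acc_db

# Synthetic data: 96 points scattered around y = 3x + 0.5.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(96)]
ys = [3 * x + 0.5 + rng.gauss(0, 0.1) for x in xs]

full = mean_gradients(0.0, 0.0, xs, ys)
accumulated = accumulated_gradients(0.0, 0.0, xs, ys, micro_batch_size=6)
print(full, accumulated)
```

In a real training loop the same pattern would mean running several forward and backward passes, scaling each loss by the number of micro-batches, and only stepping the optimiser once the gradients for the whole effective batch have been summed.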

[ Read more ]


Writing an LLM from scratch, part 32j -- Interventions: trying to train a better model in the cloud

Posted on 9 April 2026 in AI, LLM from scratch, TIL deep dives, Python |

Since early February, I've been trying various interventions on a 163M-parameter GPT-2-style model that I trained from scratch on my local RTX 3090, using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

My original model got a loss of 3.944 on my test set, while the original GPT-2 weights got 3.500 on the same dataset. I wanted to see if I could close that gap, and had a list of potential changes to the training setup, and to the model itself. Which of them would help?

I found a list of solid-looking interventions, and in my last post I came to the conclusion that the improvements in loss I had seen with all of them -- with two possible exceptions -- seemed unlikely to be in the noise. What would happen if I tried to put them into a new model?

[ Read more ]


Writing an LLM from scratch, part 32i -- Interventions: what is in the noise?

Posted on 7 April 2026 in AI, LLM from scratch, TIL deep dives, Python |

Towards the end of last year, I trained a 163M-parameter GPT-2-style model from scratch on my local RTX 3090, using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

The result was a pretty decent little model, but it wasn't as good as the original GPT-2-small, despite having more parameters (because it wasn't using weight-tying). Specifically: on a particular test set, my model gave a loss of 3.944 -- quite a lot more than the original GPT-2's 3.500 on the same dataset.
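The parameter-count gap can be reproduced with a little arithmetic. The sketch below assumes the standard GPT-2 small sizes (vocabulary 50,257, context length 1,024, embedding dimension 768, 12 layers) and no bias on the query/key/value projections; those hyperparameters are not stated in this excerpt, so treat them as assumptions rather than the post's exact configuration.

```python
# Assumed hyperparameters: standard GPT-2 small sizes, no QKV bias.
VOCAB_SIZE = 50257
CONTEXT_LENGTH = 1024
EMB_DIM = 768
N_LAYERS = 12

def linear(n_in, n_out, bias=True):
    """Parameter count of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

def block_params(emb_dim, qkv_bias=False):
    """Parameter count of one transformer block."""
    # Query, key and value projections plus the attention output projection.
    attention = 3 * linear(emb_dim, emb_dim, bias=qkv_bias) + linear(emb_dim, emb_dim)
    # Two-layer feed-forward network with a 4x hidden expansion.
    feed_forward = linear(emb_dim, 4 * emb_dim) + linear(4 * emb_dim, emb_dim)
    # Two layer norms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * emb_dim)
    return attention + feed_forward + layer_norms

token_embedding = VOCAB_SIZE * EMB_DIM
position_embedding = CONTEXT_LENGTH * EMB_DIM
blocks = N_LAYERS * block_params(EMB_DIM)
final_norm = 2 * EMB_DIM
output_head = linear(EMB_DIM, VOCAB_SIZE, bias=False)

untied_total = token_embedding + position_embedding + blocks + final_norm + output_head
# With weight tying, the output head reuses the token embedding matrix.
tied_total = untied_total - output_head

print(f"Without weight tying: {untied_total:,}")
print(f"With weight tying:    {tied_total:,}")
```

Under those assumptions, the untied output head accounts for about 38.6 million of the roughly 163 million parameters.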

I wanted to see whether I could train a model on my own hardware (or on something that didn't cost too much to rent in the cloud) that got closer to the original model's performance. So over the last few months, I've done a bunch of further training runs, each one testing a specific intervention -- a stand-alone change that I expected to change the loss, either for better or for worse. Specifically:

At the end of all of that, I had this table showing the effect of each intervention in terms of loss on the test set. They're sorted from least-effective to most-effective, and you can see the baseline in there too:

Run                                Test set loss   Improvement vs baseline
8xa100m40-weight-tying             3.874           -0.182
8xa100m40-weight-decay-cerebras    3.867           -0.175
8xa100m40-baseline                 3.692           -
8xa100m80-no-amp                   3.679           0.013
8xa100m40-gradient-clipping        3.678           0.014
8xa100m40-qkv-bias                 3.669           0.023
8xa100m40-weight-decay-gpt2        3.643           0.049
8xa100m40-remove-dropout           3.641           0.051
8xa100m40-schedule-learning-rate   3.602           0.090

Winners and losers are reasonably clear:

So, for an optimal train, we'd just use the effective interventions, right? Well, not quite.

I decided that full-fat float32 wasn't worth the effort: it meant that the train took more than twice as long and, because it required a larger machine, cost more than three times as much.

The others did look like solid changes, but there was one concern. The effect of each intervention is actually pretty small. For example, gradient clipping reduced the loss by 0.014, from 3.692 to 3.678. That's an improvement of roughly 0.4%. Even the best intervention, scheduling the learning rate, only improved things by about 2.4%.
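For reference, the sizes of these effects can be recomputed directly from the table. This is a small sketch using only the loss values listed above; the dictionary and variable names are made up for illustration.

```python
# Test set losses copied from the table above.
BASELINE = 3.692
losses = {
    "8xa100m40-weight-tying": 3.874,
    "8xa100m40-weight-decay-cerebras": 3.867,
    "8xa100m80-no-amp": 3.679,
    "8xa100m40-gradient-clipping": 3.678,
    "8xa100m40-qkv-bias": 3.669,
    "8xa100m40-weight-decay-gpt2": 3.643,
    "8xa100m40-remove-dropout": 3.641,
    "8xa100m40-schedule-learning-rate": 3.602,
}

# Positive numbers mean the loss went down relative to the baseline.
absolute = {name: BASELINE - loss for name, loss in losses.items()}
relative_pct = {name: 100 * delta / BASELINE for name, delta in absolute.items()}

for name in losses:
    print(f"{name:34s} {absolute[name]:+.3f} {relative_pct[name]:+.2f}%")
```

Every one of the improvements is well under 3% of the baseline loss, which is why the question of noise matters.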

Could it be that some or all of these improvements were not real, but just a result of the random nature of training deep neural networks? Could the differences just be in the noise? They seemed small enough for that to be possible.

I've trained seven more models over the last few days to try to get a feel for how big an effect noise has for this kind of training run. The results appear to show that variations in the initial weights matter quite a lot, but randomness in the training loop (given the same initial weights) actually has a fairly minimal impact. That surprised me a bit!

Let's go through the details.

[ Read more ]


Writing an LLM from scratch, part 32h -- Interventions: full fat float32

Posted on 3 April 2026 in AI, LLM from scratch, TIL deep dives, Python |

This is the last of the interventions I'm trying out to see if I can improve the test loss for a from-scratch GPT-2 small base model, trained with code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

Back when I did my first training run for a base model, on my local RTX 3090, I used two optimisations:

The first of those boosted training speed from 12,599 tokens per second to 15,402 in my test harness, while AMP on its own boosted it to 19,921 tps (and also allowed me to increase the batch size from 5 to 6). Doing both appeared to hit some kind of diminishing returns -- it maxed out at 19,997 tps, only a little better than AMP on its own.
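To put those throughput numbers on a common scale, here is a minimal sketch that turns them into speedups over the unoptimised run. It uses only the tokens-per-second figures quoted above; the dictionary keys are illustrative labels, not names from the original code.

```python
# Tokens per second quoted in the text above.
BASELINE_TPS = 12_599
throughput = {
    "first optimisation only": 15_402,
    "AMP only": 19_921,
    "both together": 19_997,
}

# Speedup of each configuration relative to the unoptimised baseline.
speedups = {name: tps / BASELINE_TPS for name, tps in throughput.items()}
for name, factor in speedups.items():
    print(f"{name:24s} {factor:.2f}x")

# How much the first optimisation adds once AMP is already switched on.
extra_pct = 100 * (throughput["both together"] / throughput["AMP only"] - 1)
print(f"Extra gain on top of AMP: {extra_pct:.2f}%")
```

On these figures, adding the first optimisation on top of AMP is worth well under 1%, which matches the diminishing returns described above.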

But intuitively, you'd expect that might come at a cost. While I'm sure the PyTorch developers have a solid understanding of where switching to 16-bit will have a minimal impact on training quality, it seems too good to be true that it would have no impact at all.

Let's see what happens if we switch both of these optimisations off!

[ Read more ]