<?xml version='1.0' encoding='UTF-8'?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0"><channel><title>Giles' blog</title><link>https://www.gilesthomas.com/</link><description>Giles' blog</description><docs>http://www.rssboard.org/rss-specification</docs><generator>python-feedgen</generator><lastBuildDate>Thu, 07 May 2026 18:45:06 +0000</lastBuildDate><item><title>Writing an LLM from scratch, part 32h -- Interventions: full fat float32</title><link>https://www.gilesthomas.com/2026/04/llm-from-scratch-32h-interventions-full-fat-float32</link><description>&lt;p&gt;This is the last of the interventions I'm trying out to see if I can improve the
test loss for a from-scratch GPT-2 small base model, trained on code based on
&lt;a href="https://sebastianraschka.com/"&gt;Sebastian Raschka&lt;/a&gt;'s book
"&lt;a href="https://www.manning.com/books/build-a-large-language-model-from-scratch"&gt;Build a Large Language Model (from Scratch)&lt;/a&gt;".&lt;/p&gt;

&lt;p&gt;Back when I did my first training run for a base model, &lt;a href="/2025/12/llm-from-scratch-28-training-a-base-model-from-scratch"&gt;on my local RTX 3090&lt;/a&gt;,
I used two optimisations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting the
&lt;a href="https://docs.pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html"&gt;32-bit floating point matrix multiplication precision to "high" rather than to "highest"&lt;/a&gt;,
which means that it uses lower-precision (but still technically 32-bit) TF32
for those operations rather than normal float32.&lt;/li&gt;
&lt;li&gt;Using &lt;a href="https://docs.pytorch.org/tutorials/recipes/recipes/amp_recipe.html"&gt;PyTorch's Automatic Mixed Precision (AMP)&lt;/a&gt;,
which allows it to use 16-bit calculations rather than 32-bit in places where it
makes sense to do so.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The first of those boosted training speed from 12,599 tokens per second to 15,402
in my test harness, while AMP on its own boosted it to 19,921 tps (and also allowed me
to increase the batch size from 5 to 6).  Doing both appeared to hit some kind of
diminishing returns -- it maxed out at 19,997 tps, only a little better than AMP on its
own.&lt;/p&gt;
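
&lt;p&gt;To put those figures in relative terms, here's a quick sketch that turns them into
speedups over the unoptimised run.  The variable and function names are made up for
illustration; the numbers are the ones quoted above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Tokens per second measured in the test harness (figures quoted above).
baseline_tps = 12_599
optimised_tps = {
    "TF32 matmuls only": 15_402,
    "AMP only": 19_921,
    "TF32 matmuls and AMP": 19_997,
}


def speedup(tps, baseline=baseline_tps):
    """Return how many times faster than the baseline a run was."""
    return tps / baseline


for name, tps in optimised_tps.items():
    print(f"{name}: {speedup(tps):.2f}x the baseline throughput")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That works out at roughly 1.22x for TF32 alone, 1.58x for AMP alone, and 1.59x for both together.&lt;/p&gt;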

&lt;p&gt;But intuitively, you'd expect that speed-up to come at a cost.  While I'm sure the PyTorch
developers have a solid understanding of where switching to 16-bit will have a minimal
impact on training quality, it seems too good to be true that it would have no impact
at all.&lt;/p&gt;

&lt;p&gt;Let's see what happens if we switch both of these optimisations off!&lt;/p&gt;
&lt;p&gt;I added a new flag to the &lt;code&gt;train.json&lt;/code&gt; config file for the training harness,
&lt;code&gt;use_amp&lt;/code&gt; with a default of &lt;code&gt;True&lt;/code&gt; &lt;sup class="footnote-ref" id="fnref-1"&gt;&lt;a href="#fn-1"&gt;1&lt;/a&gt;&lt;/sup&gt;.  The core implementation was pretty simple;
where we had the call to &lt;code&gt;torch.set_float32_matmul_precision&lt;/code&gt;, we needed to guard it:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;use_amp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_float32_matmul_precision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;high&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;...and where we did the forward pass and the loss calculation, we had to skip the
&lt;code&gt;with torch.amp.autocast&lt;/code&gt; wrapper when the flag was off:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;use_amp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;autocast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;train_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;train_loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;We also had to avoid &lt;a href="/2026/02/llm-from-scratch-32b-interventions-gradient-clipping#how"&gt;unscaling when clipping gradients&lt;/a&gt;; I did that by just not
creating a scaler when in non-AMP mode, and then:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unscale_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;...and likewise, instead of using the scaler to step the optimiser, we step it
directly if we don't have one:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
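
&lt;p&gt;Putting those fragments together, a single training step looks something like the
sketch below.  This is a simplified, hypothetical &lt;code&gt;train_step&lt;/code&gt; rather than the real harness
code: the function name, its arguments and the &lt;code&gt;max_norm&lt;/code&gt; handling are made up for
illustration, and it assumes the caller passes in a gradient scaler when AMP is on and
&lt;code&gt;None&lt;/code&gt; when it is off.  It also steps the optimiser directly in the non-AMP branch, so
it does not yet deal with the non-finite gradient problem discussed next.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch


def train_step(model, optimizer, inputs, targets, loss_fn, scaler=None, max_norm=None):
    """Run one optimisation step; AMP is used only if a scaler is passed in."""
    optimizer.zero_grad()

    if scaler is not None:
        # AMP path: forward pass and loss under autocast, backward on the scaled loss.
        with torch.amp.autocast(device_type=inputs.device.type, dtype=torch.float16):
            loss = loss_fn(model(inputs), targets)
        scaler.scale(loss).backward()
    else:
        # Plain float32 path: no autocast, no loss scaling.
        loss = loss_fn(model(inputs), targets)
        loss.backward()

    if max_norm is not None:
        if scaler is not None:
            # Gradients must be unscaled before their norm means anything.
            scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

    if scaler is not None:
        scaler.step(optimizer)
        scaler.update()
    else:
        optimizer.step()

    return loss.item()
&lt;/code&gt;&lt;/pre&gt;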

&lt;p&gt;However, there was an issue: non-finite gradients.  As I discovered when
&lt;a href="/2026/02/llm-from-scratch-32b-interventions-gradient-clipping#chasing-infinity"&gt;looking into gradient clipping&lt;/a&gt;,
the scaler was actually doing something quite useful for us.   Somewhat buried in
the &lt;a href="https://docs.pytorch.org/tutorials/recipes/recipes/amp_recipe.html#adding-gradscaler"&gt;AMP recipes page&lt;/a&gt;
is a comment:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="c1"&gt;# ``scaler.step()`` first unscales the gradients of the optimizer&amp;#39;s assigned parameters.&lt;/span&gt;
&lt;span class="c1"&gt;# If these gradients do not contain ``inf``s or ``NaN``s, optimizer.step() is then called,&lt;/span&gt;
&lt;span class="c1"&gt;# otherwise, optimizer.step() is skipped.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Now, from the gradient clipping train, I'd come to the conclusion that we were occasionally
getting non-finite gradients, and the scaler was saving us from applying junk updates
when that happened.&lt;/p&gt;

&lt;p&gt;If our new code was stepping the optimiser directly, we'd not have that safety net.  We'd
need something to save us from that.&lt;/p&gt;

&lt;p&gt;My first cut at this was to use the one other API feature I'd seen that handled
non-finite gradients for you: &lt;a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html"&gt;&lt;code&gt;torch.nn.utils.clip_grad_norm_&lt;/code&gt;&lt;/a&gt;
has an &lt;code&gt;error_if_nonfinite&lt;/code&gt; parameter, so we could set that to &lt;code&gt;True&lt;/code&gt;
and skip stepping the optimiser whenever it raised an exception.  If gradient clipping
was not explicitly enabled, we could set &lt;code&gt;max_norm&lt;/code&gt; to infinity, so that the call
would check for non-finite gradients without actually clipping anything.&lt;/p&gt;

&lt;p&gt;Here's &lt;a href="https://github.com/gpjt/ddp-base-model-from-scratch/blob/f1b7b4acad6170c156e4b6caab0e6385f98489e2/ddp_train.py"&gt;the code for that version&lt;/a&gt;.
I wasn't very happy with it, though.  The use of a gradient clipping API just for
its side-effect of telling us about non-finite gradients felt a bit ugly, and even worse,
the exception it raised was just a generic &lt;code&gt;RuntimeError&lt;/code&gt;, not a custom exception type,
which meant that I had to distinguish between it and other &lt;code&gt;RuntimeError&lt;/code&gt;s by looking
at the exception message -- not terribly safe, as that's something that could easily change
in the future.&lt;/p&gt;
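
&lt;p&gt;For illustration, the idea can be sketched roughly as follows.  This is not the linked
code: &lt;code&gt;step_if_finite&lt;/code&gt; is a hypothetical helper, and the check for "non-finite" in the
exception message is an assumption about the wording PyTorch currently uses, which is
exactly the fragile part.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

import torch


def step_if_finite(model, optimizer, max_norm=None):
    """Step the optimiser unless the gradients contain inf or NaN.

    Returns True if the step was taken, False if it was skipped.
    """
    # With no clipping requested, an infinite max_norm leaves the gradients
    # untouched, so the call is only there for its non-finite check.
    effective_max_norm = math.inf if max_norm is None else max_norm
    try:
        torch.nn.utils.clip_grad_norm_(
            model.parameters(),
            max_norm=effective_max_norm,
            error_if_nonfinite=True,
        )
    except RuntimeError as exc:
        # Assumption: the error text mentions "non-finite".  Anything else is
        # some other RuntimeError and is re-raised.
        if "non-finite" not in str(exc):
            raise
        return False
    optimizer.step()
    return True
&lt;/code&gt;&lt;/pre&gt;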

&lt;p&gt;So I switched to a more explicit, simpler version: scan through the parameters looking for
non-finite gradients, and skip the optimiser step if any are found:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;update&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# The scaler skips non-finite gradients, but if we&amp;#39;re not using it we have&lt;/span&gt;
            &lt;span class="c1"&gt;# to do that for ourselves.&lt;/span&gt;
            &lt;span class="n"&gt;found_nonfinite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isfinite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;grad&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;all&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                    &lt;span class="n"&gt;found_nonfinite&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
                    &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;found_nonfinite&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;I did have some concerns about the performance impact of that; on my local machine,
scanning all of the parameters like that took about 0.13 seconds per step.
However, that's better than failing to train the model at all due to garbage updates!&lt;/p&gt;
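
&lt;p&gt;One possible way to reduce that cost (a sketch only, not benchmarked against the figure
above) would be to cut down on GPU-to-CPU synchronisation.  In the loop version, each
&lt;code&gt;if&lt;/code&gt; on a CUDA tensor has to wait for the GPU and copy a boolean back to the host.
The hypothetical &lt;code&gt;grads_are_finite&lt;/code&gt; helper below combines the per-parameter results
on the device and reads the answer back once.  The trade-off is that it loses the early
&lt;code&gt;break&lt;/code&gt;, so it always looks at every parameter.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch


def grads_are_finite(parameters):
    """Return True if every gradient present is free of inf and NaN."""
    checks = [
        torch.isfinite(p.grad).all()
        for p in parameters
        if p.grad is not None
    ]
    if not checks:
        # No gradients at all, so there is nothing non-finite to find.
        return True
    # Combine the per-parameter results on the device, then read the answer
    # back to the host once rather than once per parameter.
    return bool(torch.stack(checks).all().item())
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The non-AMP branch would then call &lt;code&gt;optimizer.step()&lt;/code&gt; only when
&lt;code&gt;grads_are_finite(model.parameters())&lt;/code&gt; returns &lt;code&gt;True&lt;/code&gt;.&lt;/p&gt;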

&lt;p&gt;So with that, it was time to do the training run.&lt;/p&gt;

&lt;h3 id="the-train"&gt;The train&lt;/h3&gt;

&lt;p&gt;It was pretty clear that I would not be able to run this with my normal microbatch
size of 12 on the 8x A100 40 GiB machines that I'd been using so far for these intervention
tests -- AMP and the lower-precision matrix multiplications save a bit of VRAM, and
I was already pretty much at the limit of what would fit in there.&lt;/p&gt;

&lt;p&gt;Changing the batch size would make this a poor test of the effects of removing the
FP precision stuff in isolation, so I decided that the safest minimal change was to
use a machine with more VRAM -- specifically an 8x A100 80 GiB, as that was the closest
to what I was using (switching to e.g. H100s would add all kinds of confounding changes).&lt;/p&gt;

&lt;p&gt;The next problem was getting any kind of machine at all!  &lt;a href="https://lambda.ai/"&gt;Lambda&lt;/a&gt; (they
appear to have rebranded away from "Lambda Labs") very rarely seemed to have any
available instances, never mind the specific type that I wanted.  Eventually,
I &lt;a href="/2026/04/automating-starting-lambda-instances"&gt;put together a system to poll their API and launch an instance&lt;/a&gt;
when one was available.  At 3:25am today &lt;sup class="footnote-ref" id="fnref-2"&gt;&lt;a href="#fn-2"&gt;2&lt;/a&gt;&lt;/sup&gt;, I got a Telegram message from the script saying
that it had managed to find and start one.&lt;/p&gt;

&lt;p&gt;I kicked off the training run, and watched as it got started.  I could see it was using
43.8 GiB/GPU, so it definitely did need the larger instance type.  And it quickly became clear
that this was going to be a long one -- it was estimating 8 hours to do the complete
run!&lt;/p&gt;

&lt;p&gt;In a way that was good news, though, as I could just set an alarm and go to bed.
When I woke up, it was done:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="go"&gt;Training complete in 29,230.838 seconds&lt;/span&gt;
&lt;span class="go"&gt;Tokens seen: 3,260,252,160&lt;/span&gt;
&lt;span class="go"&gt;Throughput: 111,535 tokens/second&lt;/span&gt;
&lt;span class="go"&gt;Final train loss: 3.729&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;That's 8h7m.  For comparison, the baseline train took 3h24m, so we're taking more
than double the time.&lt;/p&gt;
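
&lt;p&gt;As a sanity check on those numbers, here's a small sketch that converts the logged
duration into hours and minutes, recomputes the throughput, and compares the run with
the 3h24m baseline.  The helper name is made up for illustration, and the baseline
duration is taken from the rounded figure in the text rather than from a log.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def hours_and_minutes(seconds):
    """Split a duration in seconds into whole hours and whole minutes."""
    total_minutes = int(seconds // 60)
    return divmod(total_minutes, 60)


no_amp_seconds = 29_230.838  # from the log above
tokens_seen = 3_260_252_160  # from the log above

hours, minutes = hours_and_minutes(no_amp_seconds)
print(f"Duration: {hours}h{minutes}m")
print(f"Throughput: {tokens_seen / no_amp_seconds:,.0f} tokens/second")

# Baseline duration of 3h24m as quoted in the text, converted to seconds.
baseline_seconds = (3 * 60 + 24) * 60
print(f"Slowdown: {no_amp_seconds / baseline_seconds:.2f}x")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The slowdown comes out at roughly 2.4x.&lt;/p&gt;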

&lt;p&gt;Cost-wise, things were even worse -- more than US$135 in server costs, because as
well as needing the server for much longer, the larger machine cost US$16.48/hour rather
than US$11.84.  So that's more than three times as expensive as the US$42 that a typical
recent train has cost me (Lambda raised their prices, so it went up from about US$35 in February).&lt;/p&gt;

&lt;p&gt;Still, at least it looked like a solid run:&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32h-interventions-full-fat-float32/loss-chart.png" alt="Full train run loss without AMP" title="Full train run loss without AMP" /&gt;&lt;/p&gt;

&lt;p&gt;Very similar to the others we've seen in this series.&lt;/p&gt;

&lt;p&gt;Time to upload it to &lt;a href="https://huggingface.co/gpjt/8xa100m80-no-amp"&gt;Hugging Face Hub&lt;/a&gt;, and
on to the evals to see if all of this extra cost was worthwhile.&lt;/p&gt;

&lt;h3 id="evals"&gt;Evals&lt;/h3&gt;

&lt;p&gt;Firstly, the smoke test -- how did it complete &lt;code&gt;Every effort moves you&lt;/code&gt;?&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you towards greater success. And even then, they’re on your way to winning a prize and
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Not bad at all!  But the important metric is the loss on the test set, and for that
I got 3.679.  Let's add it to the table to see how that compares to the other training runs:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;Test set loss&lt;/th&gt;
  &lt;th&gt;Improvement vs baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-weight-tying&lt;/td&gt;
  &lt;td&gt;3.874&lt;/td&gt;
  &lt;td&gt;-0.182&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-weight-decay-cerebras&lt;/td&gt;
  &lt;td&gt;3.867&lt;/td&gt;
  &lt;td&gt;-0.175&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-baseline&lt;/td&gt;
  &lt;td&gt;3.692&lt;/td&gt;
  &lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m80-no-amp&lt;/td&gt;
  &lt;td&gt;3.679&lt;/td&gt;
  &lt;td&gt;0.013&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-gradient-clipping&lt;/td&gt;
  &lt;td&gt;3.678&lt;/td&gt;
  &lt;td&gt;0.014&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-qkv-bias&lt;/td&gt;
  &lt;td&gt;3.669&lt;/td&gt;
  &lt;td&gt;0.023&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-weight-decay-gpt2&lt;/td&gt;
  &lt;td&gt;3.643&lt;/td&gt;
  &lt;td&gt;0.049&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-remove-dropout&lt;/td&gt;
  &lt;td&gt;3.641&lt;/td&gt;
  &lt;td&gt;0.051&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-schedule-learning-rate&lt;/td&gt;
  &lt;td&gt;3.602&lt;/td&gt;
  &lt;td&gt;0.090&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;So, a &lt;em&gt;tiny&lt;/em&gt; improvement over our baseline.  Taking more than twice as long on the training run, and
spending three times as much, gained us a loss improvement that's smaller than any other
successful intervention.&lt;/p&gt;

&lt;h3 id="conclusion"&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;The first question is, did removing AMP and lower-precision matrix multiplications
lead to a better model?  The answer appears to be "yes" -- but it's a tiny enough
difference that it could well be in the noise.&lt;/p&gt;

&lt;p&gt;But the follow-up has to be, was it worth the extra cost in time and money?  And for
that I'm certain that the answer is "no".  If we'd spent twice the time
training with AMP -- on an extra 3B-odd tokens, or on a second epoch with the same
3B -- it seems implausible that the resulting loss would not have been better.&lt;/p&gt;

&lt;p&gt;And anyway, given that my goal with these interventions is to train the best model
I can in two days locally (or 3h30m or so on an 8x A100 40 GiB), it's pretty clear that
if we'd cut this run off about halfway through it would have been worse -- and that's not
even accounting for it being more memory-hungry.&lt;/p&gt;

&lt;p&gt;So, I think the takeaway from this is that AMP appears to be a huge win, at least for
this model.  It has a tiny cost (if any) in model quality, and a huge benefit in training
speed, plus a smallish but still useful benefit in training VRAM requirements. &lt;sup class="footnote-ref" id="fnref-3"&gt;&lt;a href="#fn-3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;And with that, I've reached the end of &lt;a href="/2026/02/llm-from-scratch-32a-interventions-baseline-model#the-interventions"&gt;the interventions that I wanted to try&lt;/a&gt;!  Next,
I'll need to think through what we need to do to try to stack them up.  In particular,
is there any easy way to work out whether any of the improvements I've seen might
be due to random noise?  After all, even though I've been carefully using explicit
seeds, each intervention will have changed the way the training run uses the random
number stream, and that could easily have an effect.&lt;/p&gt;

&lt;p&gt;Stay tuned!&lt;/p&gt;

&lt;p&gt;&lt;a href="/2026/04/llm-from-scratch-32i-interventions-what-is-in-the-noise"&gt;Here's a link to the next post in this series&lt;/a&gt;.&lt;/p&gt;

&lt;div class="footnotes"&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id="fn-1"&gt;
&lt;p&gt;The name of the flag is not quite right, as of course we're switching off
not just AMP but the matrix multiplication precision, but it's a decent shorthand.&amp;#160;&lt;a href="#fnref-1" class="footnoteBackLink" title="Jump back to footnote 1 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-2"&gt;
&lt;p&gt;I'm a night owl, so luckily I was still awake.&amp;#160;&lt;a href="#fnref-2" class="footnoteBackLink" title="Jump back to footnote 2 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-3"&gt;
&lt;p&gt;I have to admit that I'm very tempted to see what effect even bigger moves in
the low-precision direction might have.  What if I moved to some kind of 16-bit
training, like &lt;code&gt;bfloat16&lt;/code&gt;?  After all, most of the open weights models like
Qwen are at least released at that kind of bittedness.  But that's one to look into
later, I think.&amp;#160;&lt;a href="#fnref-3" class="footnoteBackLink" title="Jump back to footnote 3 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description><guid isPermaLink="false">/2026/04/llm-from-scratch-32h-interventions-full-fat-float32</guid><pubDate>Fri, 03 Apr 2026 23:50:00 +0000</pubDate></item><item><title>Writing an LLM from scratch, part 32i -- Interventions: what is in the noise?</title><link>https://www.gilesthomas.com/2026/04/llm-from-scratch-32i-interventions-what-is-in-the-noise</link><description>&lt;p&gt;Towards the end of last year, I
&lt;a href="/2025/12/llm-from-scratch-28-training-a-base-model-from-scratch"&gt;trained a 163M-parameter GPT-2-style model from scratch on my local RTX 3090&lt;/a&gt;,
using code based on
&lt;a href="https://sebastianraschka.com/"&gt;Sebastian Raschka&lt;/a&gt;'s book
"&lt;a href="https://www.manning.com/books/build-a-large-language-model-from-scratch"&gt;Build a Large Language Model (from Scratch)&lt;/a&gt;".&lt;/p&gt;

&lt;p&gt;The result was a pretty decent little model, but it wasn't as good as the original
GPT-2-small, despite having more parameters (because it wasn't using weight-tying).
Specifically: on a particular test set, my model gave a loss of 3.944 -- quite a lot
more than the original GPT-2's 3.500 on the same dataset.&lt;/p&gt;

&lt;p&gt;I wanted to see whether I could train a model on my own hardware (or on something that
didn't cost too much to rent in the cloud) that got closer to the original model's
performance.  So over the last few months, I've done a bunch of further training runs, each one testing
a specific intervention -- a stand-alone change that I expected to change the loss, either
for better or for worse.  Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I trained &lt;a href="/2026/02/llm-from-scratch-32a-interventions-baseline-model"&gt;a baseline model&lt;/a&gt; on
an 8x A100 40 GiB per GPU machine on Lambda (which
was better than my original locally-trained model, I believe due to the larger batch size
that the larger machine made possible).&lt;/li&gt;
&lt;li&gt;I tried &lt;a href="/2026/02/llm-from-scratch-32b-interventions-gradient-clipping"&gt;adding gradient clipping&lt;/a&gt;
to see if that would help by limiting the effects of loss spikes.&lt;/li&gt;
&lt;li&gt;I tried &lt;a href="/2026/02/llm-from-scratch-32c-interventions-removing-dropout"&gt;removing dropout&lt;/a&gt;, given
that these days people tend not to use it (because we're doing single-epoch training runs).&lt;/li&gt;
&lt;li&gt;I tried &lt;a href="/2026/02/llm-from-scratch-32d-interventions-adding-attention-bias"&gt;adding bias to the attention weight matrices&lt;/a&gt; --
something that was popular back in the GPT-2 era, and was used by the original weights, but which my code did not use.&lt;/li&gt;
&lt;li&gt;Instead of just using the learning rate of 0.0004 that was used in the code from
the book, I looked into what values people use these days, and learned how
to &lt;a href="/2026/03/llm-from-scratch-32e-interventions-learning-rate"&gt;schedule it over the course of the training run&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Similarly, I learned more about &lt;a href="/2026/03/llm-from-scratch-32f-interventions-weight-decay"&gt;weight decay&lt;/a&gt; and
tried some alternative values.&lt;/li&gt;
&lt;li&gt;Then I tried making my model more like the original GPT-2 one by introducing
&lt;a href="/2026/03/llm-from-scratch-32g-interventions-weight-tying"&gt;weight tying&lt;/a&gt; to
see if that would help.&lt;/li&gt;
&lt;li&gt;Finally, I decided to try &lt;a href="/2026/04/llm-from-scratch-32h-interventions-full-fat-float32"&gt;training in "full-fat" float32&lt;/a&gt;
instead of using PyTorch's AMP and TF32 matrix multiplication performance
enhancements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the end of all of that, I had this table showing the effect of each intervention
in terms of loss on the test set.  They're sorted from least-effective to most-effective,
and you can see the baseline in there too:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;Test set loss&lt;/th&gt;
  &lt;th&gt;Improvement vs baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-weight-tying&lt;/td&gt;
  &lt;td&gt;3.874&lt;/td&gt;
  &lt;td&gt;-0.182&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-weight-decay-cerebras&lt;/td&gt;
  &lt;td&gt;3.867&lt;/td&gt;
  &lt;td&gt;-0.175&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-baseline&lt;/td&gt;
  &lt;td&gt;3.692&lt;/td&gt;
  &lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m80-no-amp&lt;/td&gt;
  &lt;td&gt;3.679&lt;/td&gt;
  &lt;td&gt;0.013&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-gradient-clipping&lt;/td&gt;
  &lt;td&gt;3.678&lt;/td&gt;
  &lt;td&gt;0.014&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-qkv-bias&lt;/td&gt;
  &lt;td&gt;3.669&lt;/td&gt;
  &lt;td&gt;0.023&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-weight-decay-gpt2&lt;/td&gt;
  &lt;td&gt;3.643&lt;/td&gt;
  &lt;td&gt;0.049&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-remove-dropout&lt;/td&gt;
  &lt;td&gt;3.641&lt;/td&gt;
  &lt;td&gt;0.051&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-schedule-learning-rate&lt;/td&gt;
  &lt;td&gt;3.602&lt;/td&gt;
  &lt;td&gt;0.090&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Winners and losers are reasonably clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weight tying and the number for weight decay I derived from a paper by
Cerebras Research (probably without understanding it properly) were negatives.&lt;/li&gt;
&lt;li&gt;Full-fat float32, gradient clipping, attention biases, the GPT-2 weight decay
parameter, removing dropout, and scheduling (and updating) the learning rate
were positives.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, for an optimal train, we'd just use the effective interventions, right?
Well, not quite.&lt;/p&gt;

&lt;p&gt;Full-fat float32 I decided wasn't worth the effort, as it meant that the
train took more than twice as long, and (because it required a larger machine), cost
more than three times as much.&lt;/p&gt;

&lt;p&gt;The others did look like solid changes, but there was one concern.  The effect of
each intervention is actually pretty small.  For example, gradient clipping reduced
the loss by 0.014, from 3.692 to 3.678.  That's an improvement of about 0.4%.  Even the best
intervention, scheduling the learning rate, only improved things by about 2.4%.&lt;/p&gt;
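
&lt;p&gt;Those percentages are just the loss reduction divided by the baseline loss.  A minimal
sketch of the calculation, using values from the table above (the names are made up for
illustration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;baseline_loss = 3.692

# Test set losses from the table above.
intervention_losses = {
    "gradient-clipping": 3.678,
    "schedule-learning-rate": 3.602,
}


def relative_improvement(loss, baseline=baseline_loss):
    """Fractional reduction in loss compared with the baseline."""
    return (baseline - loss) / baseline


for name, loss in intervention_losses.items():
    print(f"{name}: {relative_improvement(loss):.2%}")
&lt;/code&gt;&lt;/pre&gt;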

&lt;p&gt;Could it be that some or all of these improvements were not real, but just a result
of the random nature of training deep neural networks?  Could the differences just
be in the noise?  They seemed small enough for that to be possible.&lt;/p&gt;

&lt;p&gt;I've trained seven more models over the last few days to try to get a feel for how
big an effect noise has on this kind of training run.  The results appear to show
that variations in the initial weights matter quite a lot, but randomness in the
training loop (given the same initial weights) actually has a fairly minimal impact.
That surprised me a bit!&lt;/p&gt;

&lt;p&gt;Let's go through the details.&lt;/p&gt;
&lt;h3 id="is-my-random-seed-code-working"&gt;Is my random seed code working?&lt;/h3&gt;

&lt;p&gt;When I did the original baseline training run -- creating the model that was the
comparison point for all of the interventions --
I wanted to minimise the amount of random number-induced differences
between the training runs in this interventions series.  I did this by setting the random seed
at the start -- specifically, I had this code:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;    &lt;span class="n"&gt;seed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;
    &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manual_seed_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;At the time I wrote it, this seemed pretty complete -- the seed is set on Python's
own random number generator, on PyTorch's, and on the separate ones it uses for CUDA.&lt;/p&gt;

&lt;p&gt;However, in a separate project, where I was fine-tuning a Qwen model as a classifier,
I'd found that this wasn't enough.  In order to get full reproducibility, I'd had to
lock things down a bit more, with this additional code:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backends&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cudnn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deterministic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;True&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backends&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cudnn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;benchmark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;False&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;use_deterministic_algorithms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;So: was my random number seed code enough for this case?  Or would I get a different
model if I ran the same code a second time?&lt;/p&gt;

&lt;p&gt;That was easy enough to check; I spun up a machine and ran the "baseline" train
again.  3 hours 24 minutes later:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Training complete in 12,276.306 seconds
Tokens seen: 3,260,252,160
Throughput: 265,573 tokens/second
Final train loss: 3.743
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Interestingly, that was exactly the same final train loss as the original baseline
train.  &lt;a href="https://huggingface.co/gpjt/8xa100m40-baseline-2"&gt;Here's the model&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I ran my normal smoke test, asking it to complete "Every effort moves you"&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you in the way of getting into that.
I will let you down as you move, not by
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;...so that was OK -- the model was generating reasonably coherent text.
Then I ran the eval to find its loss on the test set:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loss against our test dataset: 3.692
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Exactly the same as the original baseline!  That was certainly promising.  Now, the
use of three decimal places for the output from the loss eval is just a formatting thing,
so I bumped it up to 6 dps, and the new model got this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loss against our test dataset: 3.691526
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Running that against the original baseline model:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loss against our test dataset: 3.691526
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Again, exactly the same.  Finally, more out of idle interest than anything else,
I decided to see if the models were at least different:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;giles@perry:~/Dev/ddp-base-model-from-scratch (main)$ diff runs/8xa100m40-baseline/checkpoints/best/model.safetensors runs/8xa100m40-baseline-2/checkpoints/best/model.safetensors
giles@perry:~/Dev/ddp-base-model-from-scratch (main)$
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That is, quite frankly, amazing to me.  I was expecting pretty close results, but
what we're seeing here is that two separate models, trained on the same data, but
on different machines more than a month apart, have weights that are bit-wise identical.
No random noise at all.&lt;/p&gt;

&lt;p&gt;That's actually really reassuring!  It makes me much more comfortable that we're
standing on a stable foundation here.&lt;/p&gt;
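
&lt;p&gt;As an aside, the same byte-for-byte check that &lt;code&gt;diff&lt;/code&gt; did can be scripted.  Here's a minimal
sketch using only Python's standard library; the &lt;code&gt;checkpoints_identical&lt;/code&gt; helper and the tiny stand-in
files are made up for illustration, and are not part of the training repo:&lt;/p&gt;

```python
import filecmp
import tempfile
from pathlib import Path


def checkpoints_identical(path_a, path_b):
    """Return True if the two files contain exactly the same bytes."""
    # shallow=False forces a comparison of the file contents rather than
    # trusting os.stat() metadata such as size and modification time.
    return filecmp.cmp(path_a, path_b, shallow=False)


# Demo with small stand-in files rather than real checkpoints.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "first.bin"
    second = Path(tmp) / "second.bin"
    third = Path(tmp) / "third.bin"
    first.write_bytes(b"\x00\x01\x02\x03")
    second.write_bytes(b"\x00\x01\x02\x03")
    third.write_bytes(b"\x00\x01\x02\x04")
    same = checkpoints_identical(first, second)
    different = checkpoints_identical(first, third)

print(same, different)
```

&lt;p&gt;Pointing the helper at two &lt;code&gt;model.safetensors&lt;/code&gt; files should answer the same question as the shell command above.&lt;/p&gt;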

&lt;p&gt;Now it was time to see what effect changing that random seed would have.&lt;/p&gt;

&lt;h3 id="changing-the-random-seed"&gt;Changing the random seed&lt;/h3&gt;

&lt;p&gt;Let's think about what the random seed does.  When we call &lt;code&gt;random.seed(42)&lt;/code&gt;, we're
initialising Python's pseudo-random number generator so that it will start at a particular
point -- after we've called it, it will generate the same sequence of "random" numbers
each time it's asked for a new one.  So the effect of this code:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;    &lt;span class="n"&gt;seed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;
    &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manual_seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manual_seed_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;...is to initialise three separate pseudo-random number generators to be in a known
deterministic state, so they'll all generate the same sequence in every run.&lt;/p&gt;
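
&lt;p&gt;To make that concrete, here's a small sketch using just Python's own generator (the PyTorch and CUDA
ones behave analogously, but I'm leaving them out so that this runs anywhere).  The &lt;code&gt;draw&lt;/code&gt; helper
is just something made up for the illustration:&lt;/p&gt;

```python
import random


def draw(seed, count=5):
    """Seed Python's global PRNG, then draw `count` floats from it."""
    random.seed(seed)
    return [random.random() for _ in range(count)]


first = draw(42)
second = draw(42)
other = draw(23)

print(first == second)  # same seed, so the same sequence
print(first == other)   # different seed, so a different sequence
```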

&lt;p&gt;So, the first thing to do was to see what happened if we changed that number.&lt;/p&gt;

&lt;p&gt;I decided to do two training runs, each with exactly the same code as the baseline,
but with different random seeds.  Firstly, I changed it from 42 to 22 &lt;sup class="footnote-ref" id="fnref-1"&gt;&lt;a href="#fn-1"&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;ubuntu@130-61-223-25:~/ddp-base-model-from-scratch$ &lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;diff
&lt;span class="go"&gt;diff --git a/ddp_train.py b/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;index 5519353..16a8be5 100644&lt;/span&gt;
&lt;span class="go"&gt;--- a/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;+++ b/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;@@ -623,7 +623,7 @@ def main(run, datasets_dir_path, checkpoint, find_max_microbatch_size):&lt;/span&gt;
&lt;span class="go"&gt;     dist.init_process_group(backend, device_id=local_rank)&lt;/span&gt;

&lt;span class="gp"&gt;     # &lt;/span&gt;Set&lt;span class="w"&gt; &lt;/span&gt;all&lt;span class="w"&gt; &lt;/span&gt;of&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;random&lt;span class="w"&gt; &lt;/span&gt;seeds
&lt;span class="go"&gt;-    seed = 42&lt;/span&gt;
&lt;span class="go"&gt;+    seed = 22&lt;/span&gt;
&lt;span class="go"&gt;     random.seed(seed)&lt;/span&gt;
&lt;span class="go"&gt;     torch.manual_seed(seed)&lt;/span&gt;
&lt;span class="go"&gt;     torch.cuda.manual_seed_all(seed)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;That training run completed:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Training complete in 12,287.950 seconds
Tokens seen: 3,260,252,160
Throughput: 265,321 tokens/second
Final train loss: 3.724
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/gpjt/8xa100m40-baseline-3"&gt;Here's the model&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Time for the evals; the smoke test:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you a few spots in your path. Sooner, just think twice how many steps you take. If
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;...and the loss test:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loss against our test dataset: 3.673453
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So, that's 3.673453 compared to 3.691526, an improvement of 0.018 over the run with a seed
of 42.  That's more than
the 0.014 improvement we got from gradient clipping (and indeed, the 0.013 from full-fat float32
training), and quite close to the 0.023 improvement from adding attention weight bias.&lt;/p&gt;

&lt;p&gt;Time for another training run:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;ubuntu@141-148-168-241:~/ddp-base-model-from-scratch$ &lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;diff
&lt;span class="go"&gt;diff --git a/ddp_train.py b/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;index 5519353..1e9b5bc 100644&lt;/span&gt;
&lt;span class="go"&gt;--- a/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;+++ b/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;@@ -623,7 +623,7 @@ def main(run, datasets_dir_path, checkpoint, find_max_microbatch_size):&lt;/span&gt;
&lt;span class="go"&gt;     dist.init_process_group(backend, device_id=local_rank)&lt;/span&gt;

&lt;span class="gp"&gt;     # &lt;/span&gt;Set&lt;span class="w"&gt; &lt;/span&gt;all&lt;span class="w"&gt; &lt;/span&gt;of&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;random&lt;span class="w"&gt; &lt;/span&gt;seeds
&lt;span class="go"&gt;-    seed = 42&lt;/span&gt;
&lt;span class="go"&gt;+    seed = 67&lt;/span&gt;
&lt;span class="go"&gt;     random.seed(seed)&lt;/span&gt;
&lt;span class="go"&gt;     torch.manual_seed(seed)&lt;/span&gt;
&lt;span class="go"&gt;     torch.cuda.manual_seed_all(seed)&lt;/span&gt;
&lt;span class="gp"&gt;ubuntu@141-148-168-241:~/ddp-base-model-from-scratch$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Another 3h24m later:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Training complete in 12,263.076 seconds
Tokens seen: 3,260,252,160
Throughput: 265,859 tokens/second
Final train loss: 3.704
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/gpjt/8xa100m40-baseline-4"&gt;Here's the model&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The smoke test:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you to a new level. The next phase is a one-way street in which you are free to
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;...and the test set loss:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loss against our test dataset: 3.653593
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A further improvement!  That's 0.038 better than our original baseline, which beats
adding on attention weight bias (though it's worse than the weight decay update).&lt;/p&gt;

&lt;p&gt;Now, three data points is rather a small number for any kind of statistical
analysis, but just out of interest, let's do the basics.
&lt;a href="https://www.geeksforgeeks.org/maths/variance/"&gt;GeeksForGeeks has a good refresher here&lt;/a&gt; if you're a bit
rusty.&lt;/p&gt;

&lt;p&gt;Firstly, our mean is&lt;/p&gt;

&lt;math xmlns="http://www.w3.org/1998/Math/MathML" display="block"&gt;&lt;mrow&gt;&lt;mi&gt;&amp;#x003BC;&lt;/mi&gt;&lt;mo&gt;&amp;#x0003D;&lt;/mo&gt;&lt;mfrac&gt;&lt;mrow&gt;&lt;mn&gt;3.691526&lt;/mn&gt;&lt;mo&gt;&amp;#x0002B;&lt;/mo&gt;&lt;mn&gt;3.673453&lt;/mn&gt;&lt;mo&gt;&amp;#x0002B;&lt;/mo&gt;&lt;mn&gt;3.653593&lt;/mn&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;mn&gt;3&lt;/mn&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;mo&gt;&amp;#x02248;&lt;/mo&gt;&lt;mn&gt;3.672857&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt;

&lt;p&gt;...and our variance&lt;sup class="footnote-ref" id="fnref-2"&gt;&lt;a href="#fn-2"&gt;2&lt;/a&gt;&lt;/sup&gt; is:&lt;/p&gt;

&lt;math xmlns="http://www.w3.org/1998/Math/MathML" display="block"&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mi&gt;&amp;#x003C3;&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mo&gt;&amp;#x0003D;&lt;/mo&gt;&lt;mfrac&gt;&lt;mrow&gt;&lt;mo stretchy="false"&gt;&amp;#x00028;&lt;/mo&gt;&lt;mn&gt;3.672857&lt;/mn&gt;&lt;mo&gt;&amp;#x02212;&lt;/mo&gt;&lt;mn&gt;3.691526&lt;/mn&gt;&lt;msup&gt;&lt;mo stretchy="false"&gt;&amp;#x00029;&lt;/mo&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mo&gt;&amp;#x0002B;&lt;/mo&gt;&lt;mo stretchy="false"&gt;&amp;#x00028;&lt;/mo&gt;&lt;mn&gt;3.672857&lt;/mn&gt;&lt;mo&gt;&amp;#x02212;&lt;/mo&gt;&lt;mn&gt;3.673453&lt;/mn&gt;&lt;msup&gt;&lt;mo stretchy="false"&gt;&amp;#x00029;&lt;/mo&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;mo&gt;&amp;#x0002B;&lt;/mo&gt;&lt;mo stretchy="false"&gt;&amp;#x00028;&lt;/mo&gt;&lt;mn&gt;3.672857&lt;/mn&gt;&lt;mo&gt;&amp;#x02212;&lt;/mo&gt;&lt;mn&gt;3.653593&lt;/mn&gt;&lt;msup&gt;&lt;mo stretchy="false"&gt;&amp;#x00029;&lt;/mo&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;mn&gt;3&lt;/mn&gt;&lt;/mrow&gt;&lt;/mfrac&gt;&lt;mo&gt;&amp;#x02248;&lt;/mo&gt;&lt;mn&gt;0.000240&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt;

&lt;p&gt;If we take the square root of that, we get the standard deviation (SD):&lt;/p&gt;

&lt;math xmlns="http://www.w3.org/1998/Math/MathML" display="block"&gt;&lt;mrow&gt;&lt;mi&gt;&amp;#x003C3;&lt;/mi&gt;&lt;mo&gt;&amp;#x0003D;&lt;/mo&gt;&lt;msqrt&gt;&lt;mrow&gt;&lt;mn&gt;0.000240&lt;/mn&gt;&lt;/mrow&gt;&lt;/msqrt&gt;&lt;mo&gt;&amp;#x02248;&lt;/mo&gt;&lt;mn&gt;0.0154919&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt;
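
&lt;p&gt;If you'd rather not do that by hand, the standard library's &lt;code&gt;statistics&lt;/code&gt; module will do it.
This is a quick sketch that assumes the population form of the variance (dividing by the number of data points, as in
the formula above), which is what &lt;code&gt;pvariance&lt;/code&gt; and &lt;code&gt;pstdev&lt;/code&gt; compute:&lt;/p&gt;

```python
import statistics

# The three test-set losses reported above, one per random seed.
losses = [3.691526, 3.673453, 3.653593]

mean = statistics.fmean(losses)
variance = statistics.pvariance(losses)  # population variance: divide by N
std_dev = statistics.pstdev(losses)      # square root of the population variance

print(f"mean     = {mean:.6f}")
print(f"variance = {variance:.6f}")
print(f"SD       = {std_dev:.6f}")
```

&lt;p&gt;The SD from this can differ from the hand-calculated one in the last digit or so, because the
calculation above takes the square root of the already-rounded variance.&lt;/p&gt;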

&lt;p&gt;So, if we assume a normal distribution, what would that say about our results?  Here's
the results table again.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;Test set loss&lt;/th&gt;
  &lt;th&gt;Improvement vs baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-weight-tying&lt;/td&gt;
  &lt;td&gt;3.874&lt;/td&gt;
  &lt;td&gt;-0.182&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-weight-decay-cerebras&lt;/td&gt;
  &lt;td&gt;3.867&lt;/td&gt;
  &lt;td&gt;-0.175&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-baseline&lt;/td&gt;
  &lt;td&gt;3.692&lt;/td&gt;
  &lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m80-no-amp&lt;/td&gt;
  &lt;td&gt;3.679&lt;/td&gt;
  &lt;td&gt;0.013&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-gradient-clipping&lt;/td&gt;
  &lt;td&gt;3.678&lt;/td&gt;
  &lt;td&gt;0.014&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-qkv-bias&lt;/td&gt;
  &lt;td&gt;3.669&lt;/td&gt;
  &lt;td&gt;0.023&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-weight-decay-gpt2&lt;/td&gt;
  &lt;td&gt;3.643&lt;/td&gt;
  &lt;td&gt;0.049&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-remove-dropout&lt;/td&gt;
  &lt;td&gt;3.641&lt;/td&gt;
  &lt;td&gt;0.051&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-schedule-learning-rate&lt;/td&gt;
  &lt;td&gt;3.602&lt;/td&gt;
  &lt;td&gt;0.09&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;If we assume that the results are on a normal distribution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We would expect ~68.3% of results to be within one SD of the mean -- that is, between
3.6573651 and 3.6883489.  Interestingly, our actual baseline result is outside that
range!  But it does include both the gradient clipping and the
QKV bias results.&lt;/li&gt;
&lt;li&gt;We would additionally expect ~95.4% of the results to be within two SDs, which is
3.6418732 to 3.7038408.  That includes our baseline and our weight decay result (though not our experiment
removing dropout -- the six-DP loss number for that is 3.641282).&lt;/li&gt;
&lt;li&gt;Finally, we'd expect ~99.7% of results to be within three SDs, which is a range from
3.6263813 to 3.7193327.  That covers all of our positive results apart from scheduling learning
rate!&lt;/li&gt;
&lt;/ul&gt;
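
&lt;p&gt;Another way to look at the same thing is to express each intervention's test loss as a number of SDs away
from the mean of the three seed runs.  Here's a sketch of that; the short dictionary keys are just labels I've made up
for the runs in the table, and the losses are the ones quoted above (three decimal places, apart from the six-DP
number for removing dropout):&lt;/p&gt;

```python
import statistics

seed_losses = [3.691526, 3.673453, 3.653593]
mean = statistics.fmean(seed_losses)
sd = statistics.pstdev(seed_losses)

# Test losses for the interventions that improved on the baseline.
interventions = {
    "no-amp": 3.679,
    "gradient-clipping": 3.678,
    "qkv-bias": 3.669,
    "weight-decay-gpt2": 3.643,
    "remove-dropout": 3.641282,
    "schedule-learning-rate": 3.602,
}


def sds_from_mean(loss):
    """Distance from the mean, in units of the standard deviation."""
    return abs(loss - mean) / sd


for name, loss in interventions.items():
    print(f"{name:24s} {loss:.6f}  {sds_from_mean(loss):.2f} SDs from mean")
```

&lt;p&gt;With only three data points behind the mean and SD, those distances are as rough as the ranges in the list above.&lt;/p&gt;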

&lt;p&gt;That seemed a bit saddening -- were all of the results apart from scheduling the learning
rate within the noise?
Well, as I said, three data points is too small a number to take those results
without a fistful of salt.  I was thinking of perhaps trying another few random seeds
to see what would happen, and perhaps to tighten those numbers up a bit, but then something
occurred to me -- randomness was being used in two different ways in the training
run, and perhaps we could separate them?&lt;/p&gt;

&lt;h3 id="breaking-the-randomness-apart"&gt;Breaking the randomness apart&lt;/h3&gt;

&lt;p&gt;Where do we use the random numbers?  Well, immediately after we set the seeds, we create
our untrained model:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GPTModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_conf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;local_rank&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;One of the random number generators -- Python's, PyTorch's, or one of the CUDA ones -- will be used to generate the initial weights
that we're going to start training from.  That means that &lt;em&gt;for the same model setup&lt;/em&gt;, we'll
always start with exactly the same weights.  But if the model settings change
such that we initialise different things in a different order, then we'll have different weights.&lt;/p&gt;
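
&lt;p&gt;Here's a toy illustration of that ordering effect.  It isn't how PyTorch actually initialises anything;
the &lt;code&gt;init_layers&lt;/code&gt; helper and the layer names are invented, and it just draws each "layer" in turn from a
single seeded generator:&lt;/p&gt;

```python
import random


def init_layers(layer_sizes, seed=42):
    """Draw each layer's weights, in order, from one seeded generator."""
    rng = random.Random(seed)
    return {
        name: [rng.gauss(0.0, 0.02) for _ in range(size)]
        for name, size in layer_sizes.items()
    }


# Hypothetical toy "models": the second adds a bias before the output layer.
plain = init_layers({"attn_weight": 4, "out_weight": 4})
with_bias = init_layers({"attn_weight": 4, "attn_bias": 2, "out_weight": 4})

print(plain["attn_weight"] == with_bias["attn_weight"])  # drawn first in both
print(plain["out_weight"] == with_bias["out_weight"])    # shifted by the extra draws
```

&lt;p&gt;The layer that is drawn before the extra one matches; everything drawn after it does not, even though
the seed is the same.&lt;/p&gt;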

&lt;p&gt;After we've done that, we go into the training loop.  That &lt;em&gt;can have&lt;/em&gt; randomness in it;
although the AdamW optimiser itself is deterministic, we are (in all but one of these training
runs) using dropout, which drops a random bunch of activations at various points -- 10% of
them with our config.  And it seems entirely possible that each of the interventions
could change the order of execution of different steps in non-obvious ways, which would
lead to dropout being applied in different ways in different runs.&lt;/p&gt;
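
&lt;p&gt;As a rough sketch of why the generator's state matters for dropout: the mask below zeroes each element with
probability 10% and scales the survivors by 1 / (1 - p), which is the usual "inverted dropout" scheme.  The
&lt;code&gt;dropout_mask&lt;/code&gt; function is a pure-Python stand-in I've made up, not PyTorch's implementation:&lt;/p&gt;

```python
import random


def dropout_mask(size, rng, p=0.1):
    """Per-element multipliers: 0 with probability p, else 1 / (1 - p)."""
    keep_scale = 1.0 / (1.0 - p)
    return rng.choices([0.0, keep_scale], weights=[p, 1.0 - p], k=size)


mask_a = dropout_mask(1000, random.Random(23))
mask_b = dropout_mask(1000, random.Random(23))
mask_c = dropout_mask(1000, random.Random(67))

print(mask_a == mask_b)  # same generator state, so the same mask
print(mask_a == mask_c)  # different state, so a different mask

# The fraction dropped should be near p in both cases.
print(mask_a.count(0.0) / 1000, mask_c.count(0.0) / 1000)
```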

&lt;p&gt;So, the question was: what kinds of randomness -- in terms of the initial weights, or
in terms of the training run -- did each intervention potentially change vs the baseline?
Disregarding the full-fat float32 run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gradient clipping: randomness only affected the training run -- the weights it started with
would have been exactly the same as the baseline model's.&lt;/li&gt;
&lt;li&gt;Removing dropout: although this is a parameter on the model, I don't think it changes
the initial weights.  But in the training run, it certainly does affect randomness
by removing its use of the random number generator.&lt;/li&gt;
&lt;li&gt;Adding bias to the attention weights.  This will change both the initial weights -- because
we have those bias weights, things will be initialised differently -- and as a result, the training run,
as the random number generator will have been sampled a different number of times
prior to the run.&lt;/li&gt;
&lt;li&gt;Changing and scheduling the learning rate certainly should not change the initial
weights, but it might conceivably have a non-obvious effect on training.&lt;/li&gt;
&lt;li&gt;Likewise weight decay; no effect I can see on the initial weights, but it could well
change training dynamics.&lt;/li&gt;
&lt;li&gt;Weight-tying.  When I &lt;a href="https://github.com/gpjt/ddp-base-model-from-scratch/commit/c7d6256192ac2206bc16bcf71a56fa4f9c2115b6"&gt;added it to the code&lt;/a&gt;, I
tried to do so in such a way that the other weights would be unaffected -- I created
exactly the same weights as I would without weight tying, then threw away the output
head and replaced it with a reference to the input embedding weights.  So I think
that in theory, this one won't have changed the other model weights (apart from ignoring
the initialised-but-thrown-away output head), but it could well have
changed the training run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Given that, I wanted to get two measures of how sensitive to noise each phase of
the training run was: the initialisation of weights at the start, and the training
run itself.&lt;/p&gt;

&lt;p&gt;I decided to start by nailing down exactly what the training run started with.&lt;/p&gt;

&lt;h4 id="loss-changes-with-the-same-weights-but-different-training-run-seeds"&gt;Loss changes with the same weights but different training run seeds&lt;/h4&gt;

&lt;p&gt;We already had a baseline training run with a specific state of the random number
generator at the start; in our "real" baseline, we seeded with 42 at the start,
and then initialised our weights.  After that, the random number generator would
have reached some specific state based on its initial seed and how many numbers had been generated so far.&lt;/p&gt;

&lt;p&gt;Now, in theory, we could get the RNG into that specific state by seeding it with some
number &lt;math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"&gt;&lt;mrow&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/mrow&gt;&lt;/math&gt; at that point.  We don't know what &lt;math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"&gt;&lt;mrow&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/mrow&gt;&lt;/math&gt; is, of course.  But it seems vanishingly
unlikely that it would be something we'd come up with -- specifically, we can be
pretty sure that &lt;math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"&gt;&lt;mrow&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#x02260;&lt;/mo&gt;&lt;mn&gt;23&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt; and &lt;math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"&gt;&lt;mrow&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;mo&gt;&amp;#x02260;&lt;/mo&gt;&lt;mn&gt;67&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt;.&lt;/p&gt;
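
&lt;p&gt;Here's a toy sketch of what seeding again at that point buys us, using Python's generator as a stand-in for all
three.  The &lt;code&gt;training_draws&lt;/code&gt; helper and the draw counts are invented for the example: it seeds, burns some
numbers to simulate weight initialisation, optionally re-seeds, and then returns the first few "training" numbers:&lt;/p&gt;

```python
import random


def training_draws(init_seed, init_draws, reseed=None, count=3):
    """Seed, consume `init_draws` numbers for weight init, optionally
    re-seed, then draw `count` numbers for the training phase."""
    rng = random.Random(init_seed)
    for _ in range(init_draws):
        rng.random()
    if reseed is not None:
        rng.seed(reseed)
    return [rng.random() for _ in range(count)]


# Without a re-seed, the training-phase numbers depend on how much
# randomness initialisation consumed.
print(training_draws(42, 100) == training_draws(42, 101))

# With a re-seed, they depend only on the re-seed value.
print(training_draws(42, 100, reseed=23) == training_draws(67, 250, reseed=23))
```

&lt;p&gt;So a re-seed after initialisation gives a known starting state for the training loop, whatever happened before it.&lt;/p&gt;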

&lt;p&gt;So, I put the old initial seed of 42 back in, but re-seeded after the model
had been initialised.&lt;/p&gt;

&lt;p&gt;Firstly, with a re-seed value of 23:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;ubuntu@167-234-217-254:~/ddp-base-model-from-scratch$ &lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;diff
&lt;span class="go"&gt;diff --git a/ddp_train.py b/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;index 5519353..7b7993c 100644&lt;/span&gt;
&lt;span class="go"&gt;--- a/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;+++ b/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;@@ -643,6 +643,12 @@ def main(run, datasets_dir_path, checkpoint, find_max_microbatch_size):&lt;/span&gt;
&lt;span class="go"&gt;     else:&lt;/span&gt;
&lt;span class="go"&gt;         scaler = None&lt;/span&gt;

&lt;span class="go"&gt;+    # Set all of the random seeds again&lt;/span&gt;
&lt;span class="go"&gt;+    seed = 23&lt;/span&gt;
&lt;span class="go"&gt;+    random.seed(seed)&lt;/span&gt;
&lt;span class="go"&gt;+    torch.manual_seed(seed)&lt;/span&gt;
&lt;span class="go"&gt;+    torch.cuda.manual_seed_all(seed)&lt;/span&gt;
&lt;span class="go"&gt;+&lt;/span&gt;
&lt;span class="go"&gt;     datasets_dir = Path(datasets_dir_path)&lt;/span&gt;
&lt;span class="go"&gt;     dataset_name = train_conf[&amp;quot;dataset&amp;quot;]&lt;/span&gt;
&lt;span class="go"&gt;     dataset_dir = datasets_dir / dataset_name&lt;/span&gt;
&lt;span class="gp"&gt;ubuntu@167-234-217-254:~/ddp-base-model-from-scratch$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;I let that run...&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Training complete in 12,263.247 seconds
Tokens seen: 3,260,252,160
Throughput: 265,856 tokens/second
Final train loss: 3.731
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;...and got &lt;a href="https://huggingface.co/gpjt/8xa100m40-baseline-5"&gt;this model&lt;/a&gt;.  Time
for the normal evals:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you into a relationship in a world full of meaning and the beauty of your character that you will never forget
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;...and:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loss against our test dataset: 3.681356
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next, I did another training run, the same as the previous one, but with 67 instead of 23
for the re-seed:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;ubuntu@141-148-168-241:~/ddp-base-model-from-scratch$ &lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;diff
&lt;span class="go"&gt;diff --git a/ddp_train.py b/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;index 5519353..90b902a 100644&lt;/span&gt;
&lt;span class="go"&gt;--- a/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;+++ b/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;@@ -643,6 +643,12 @@ def main(run, datasets_dir_path, checkpoint, find_max_microbatch_size):&lt;/span&gt;
&lt;span class="go"&gt;     else:&lt;/span&gt;
&lt;span class="go"&gt;         scaler = None&lt;/span&gt;

&lt;span class="go"&gt;+    # Set all of the random seeds again&lt;/span&gt;
&lt;span class="go"&gt;+    seed = 67&lt;/span&gt;
&lt;span class="go"&gt;+    random.seed(seed)&lt;/span&gt;
&lt;span class="go"&gt;+    torch.manual_seed(seed)&lt;/span&gt;
&lt;span class="go"&gt;+    torch.cuda.manual_seed_all(seed)&lt;/span&gt;
&lt;span class="go"&gt;+&lt;/span&gt;
&lt;span class="go"&gt;     datasets_dir = Path(datasets_dir_path)&lt;/span&gt;
&lt;span class="go"&gt;     dataset_name = train_conf[&amp;quot;dataset&amp;quot;]&lt;/span&gt;
&lt;span class="go"&gt;     dataset_dir = datasets_dir / dataset_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;That one ran:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Training complete in 12,245.932 seconds
Tokens seen: 3,260,252,160
Throughput: 266,231 tokens/second
Final train loss: 3.732
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;...producing &lt;a href="https://huggingface.co/gpjt/8xa100m40-baseline-6"&gt;this model&lt;/a&gt;,
which eval'ed like this &lt;sup class="footnote-ref" id="fnref-3"&gt;&lt;a href="#fn-3"&gt;3&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you and I back home, to ensure the security of these children, but in this age transition to a
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;...and...&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loss against our test dataset: 3.680505
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let's bring those together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our normal baseline: weights initialised with seed 42, and training run starts
with a "seed" of our imaginary &lt;math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"&gt;&lt;mrow&gt;&lt;mi&gt;A&lt;/mi&gt;&lt;/mrow&gt;&lt;/math&gt; value from above: 3.691526&lt;/li&gt;
&lt;li&gt;The first run above: weights initialised with seed 42, and training run starts
with a seed of 23: 3.681356&lt;/li&gt;
&lt;li&gt;The second run above: weights initialised with seed 42, and training run starts
with a seed of 67: 3.680505&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's a mean of ~3.684462, with a variance of ~0.0000251 and a standard deviation
of ~0.005007.  Those are much smaller than the numbers from the runs where we varied
the seed prior to the model initialisation: the variance is about a tenth of the
~0.000240 we saw there, and the standard deviation about a third of the ~0.0154919.&lt;/p&gt;

&lt;p&gt;That actually surprised
me a bit; we're using dropout in all of these training runs, and it's dropping a random
10% of activations in every forward training pass.  With our different training run
starting seeds, they should be getting very different dropout patterns.  Hand-wavingly,
perhaps over the three million or so sequences we're training on, it averages out?
Still a little counterintuitive, though.&lt;/p&gt;

&lt;p&gt;Anyway, let's take a look at the intervention results again, this time
highlighting the ones that we believe will be starting with the same weights:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;Test set loss&lt;/th&gt;
  &lt;th&gt;Improvement vs baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-weight-tying&lt;/td&gt;
  &lt;td&gt;3.874&lt;/td&gt;
  &lt;td&gt;-0.182&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;strong&gt;8xa100m40-weight-decay-cerebras&lt;/strong&gt;&lt;/td&gt;
  &lt;td&gt;3.867&lt;/td&gt;
  &lt;td&gt;-0.175&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;strong&gt;8xa100m40-baseline&lt;/strong&gt;&lt;/td&gt;
  &lt;td&gt;3.692&lt;/td&gt;
  &lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;strong&gt;8xa100m80-no-amp&lt;/strong&gt;&lt;/td&gt;
  &lt;td&gt;3.679&lt;/td&gt;
  &lt;td&gt;0.013&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;strong&gt;8xa100m40-gradient-clipping&lt;/strong&gt;&lt;/td&gt;
  &lt;td&gt;3.678&lt;/td&gt;
  &lt;td&gt;0.014&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-qkv-bias&lt;/td&gt;
  &lt;td&gt;3.669&lt;/td&gt;
  &lt;td&gt;0.023&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;strong&gt;8xa100m40-weight-decay-gpt2&lt;/strong&gt;&lt;/td&gt;
  &lt;td&gt;3.643&lt;/td&gt;
  &lt;td&gt;0.049&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;strong&gt;8xa100m40-remove-dropout&lt;/strong&gt;&lt;/td&gt;
  &lt;td&gt;3.641&lt;/td&gt;
  &lt;td&gt;0.051&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;strong&gt;8xa100m40-schedule-learning-rate&lt;/strong&gt;&lt;/td&gt;
  &lt;td&gt;3.602&lt;/td&gt;
  &lt;td&gt;0.09&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Using the "99.7% should be within three SDs" heuristic, we get a range of
3.669441 - 3.699483.  Of the intervention runs with (I believe) stable weights, only the no-AMP
and the gradient clipping ones are within that range.&lt;/p&gt;

&lt;p&gt;That made me feel quite positive.  If my beliefs are correct about which runs have the same
weights, then noise in the training runs seems unlikely to be causing
the differences -- that is, perhaps the results from the interventions for those
same-weight training runs are real signal and not just noise.&lt;/p&gt;

&lt;p&gt;What would happen if instead of pinning the seed for generating the weights and
varying the starting seed for the training run, we varied the weight seed and
pinned the training one?&lt;/p&gt;

&lt;h4 id="loss-changes-with-different-weights-but-the-training-run-seed-nailed-down"&gt;Loss changes with different weights but the training run seed nailed down&lt;/h4&gt;

&lt;p&gt;We'd already done a training run with a seed of 42 before generating the weights
and a re-seed to 23 after that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first run above: weights initialised with seed 42, and training run starts
with a seed of 23: 3.681356&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I decided to see what would happen if I varied the pre-weights initialisation seed.&lt;/p&gt;

&lt;p&gt;Firstly:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;ubuntu@167-234-217-254:~/ddp-base-model-from-scratch$ &lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;diff
&lt;span class="go"&gt;diff --git a/ddp_train.py b/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;index 5519353..932acc4 100644&lt;/span&gt;
&lt;span class="go"&gt;--- a/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;+++ b/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;@@ -623,7 +623,7 @@ def main(run, datasets_dir_path, checkpoint, find_max_microbatch_size):&lt;/span&gt;
&lt;span class="go"&gt;     dist.init_process_group(backend, device_id=local_rank)&lt;/span&gt;

&lt;span class="gp"&gt;     # &lt;/span&gt;Set&lt;span class="w"&gt; &lt;/span&gt;all&lt;span class="w"&gt; &lt;/span&gt;of&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;random&lt;span class="w"&gt; &lt;/span&gt;seeds
&lt;span class="go"&gt;-    seed = 42&lt;/span&gt;
&lt;span class="go"&gt;+    seed = 23&lt;/span&gt;
&lt;span class="go"&gt;     random.seed(seed)&lt;/span&gt;
&lt;span class="go"&gt;     torch.manual_seed(seed)&lt;/span&gt;
&lt;span class="go"&gt;     torch.cuda.manual_seed_all(seed)&lt;/span&gt;
&lt;span class="go"&gt;@@ -643,6 +643,12 @@ def main(run, datasets_dir_path, checkpoint, find_max_microbatch_size):&lt;/span&gt;
&lt;span class="go"&gt;     else:&lt;/span&gt;
&lt;span class="go"&gt;         scaler = None&lt;/span&gt;

&lt;span class="go"&gt;+    # Set all of the random seeds again&lt;/span&gt;
&lt;span class="go"&gt;+    seed = 23&lt;/span&gt;
&lt;span class="go"&gt;+    random.seed(seed)&lt;/span&gt;
&lt;span class="go"&gt;+    torch.manual_seed(seed)&lt;/span&gt;
&lt;span class="go"&gt;+    torch.cuda.manual_seed_all(seed)&lt;/span&gt;
&lt;span class="go"&gt;+&lt;/span&gt;
&lt;span class="go"&gt;     datasets_dir = Path(datasets_dir_path)&lt;/span&gt;
&lt;span class="go"&gt;     dataset_name = train_conf[&amp;quot;dataset&amp;quot;]&lt;/span&gt;
&lt;span class="go"&gt;     dataset_dir = datasets_dir / dataset_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Let that train:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Training complete in 12,249.079 seconds
Tokens seen: 3,260,252,160
Throughput: 266,163 tokens/second
Final train loss: 3.725
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;...getting &lt;a href="https://huggingface.co/gpjt/8xa100m40-baseline-7"&gt;this model&lt;/a&gt;.  Evals:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you from an easy and comfortable one and can be used up to 90% or even more and the result
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;...and...&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loss against our test dataset: 3.673943
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Next, one with 67 as the weights initialisation seed:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="gp"&gt;ubuntu@141-148-168-241:~/ddp-base-model-from-scratch$ &lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;diff
&lt;span class="go"&gt;diff --git a/ddp_train.py b/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;index 5519353..a64b423 100644&lt;/span&gt;
&lt;span class="go"&gt;--- a/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;+++ b/ddp_train.py&lt;/span&gt;
&lt;span class="go"&gt;@@ -623,7 +623,7 @@ def main(run, datasets_dir_path, checkpoint, find_max_microbatch_size):&lt;/span&gt;
&lt;span class="go"&gt;     dist.init_process_group(backend, device_id=local_rank)&lt;/span&gt;

&lt;span class="gp"&gt;     # &lt;/span&gt;Set&lt;span class="w"&gt; &lt;/span&gt;all&lt;span class="w"&gt; &lt;/span&gt;of&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;random&lt;span class="w"&gt; &lt;/span&gt;seeds
&lt;span class="go"&gt;-    seed = 42&lt;/span&gt;
&lt;span class="go"&gt;+    seed = 67&lt;/span&gt;
&lt;span class="go"&gt;     random.seed(seed)&lt;/span&gt;
&lt;span class="go"&gt;     torch.manual_seed(seed)&lt;/span&gt;
&lt;span class="go"&gt;     torch.cuda.manual_seed_all(seed)&lt;/span&gt;
&lt;span class="go"&gt;@@ -643,6 +643,12 @@ def main(run, datasets_dir_path, checkpoint, find_max_microbatch_size):&lt;/span&gt;
&lt;span class="go"&gt;     else:&lt;/span&gt;
&lt;span class="go"&gt;         scaler = None&lt;/span&gt;

&lt;span class="go"&gt;+    # Set all of the random seeds again&lt;/span&gt;
&lt;span class="go"&gt;+    seed = 23&lt;/span&gt;
&lt;span class="go"&gt;+    random.seed(seed)&lt;/span&gt;
&lt;span class="go"&gt;+    torch.manual_seed(seed)&lt;/span&gt;
&lt;span class="go"&gt;+    torch.cuda.manual_seed_all(seed)&lt;/span&gt;
&lt;span class="go"&gt;+&lt;/span&gt;
&lt;span class="go"&gt;     datasets_dir = Path(datasets_dir_path)&lt;/span&gt;
&lt;span class="go"&gt;     dataset_name = train_conf[&amp;quot;dataset&amp;quot;]&lt;/span&gt;
&lt;span class="go"&gt;     dataset_dir = datasets_dir / dataset_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;That trained:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Training complete in 12,255.283 seconds
Tokens seen: 3,260,252,160
Throughput: 266,028 tokens/second
Final train loss: 3.714
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;...getting &lt;a href="https://huggingface.co/gpjt/8xa100m40-baseline-8"&gt;this model&lt;/a&gt;, and &lt;sup class="footnote-ref" id="fnref-4"&gt;&lt;a href="#fn-4"&gt;4&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you, in an attempt to protect it’s rights and the rights of the people in any respect
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;...and&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loss against our test dataset: 3.664345
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;OK, so here we have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mean: ~3.673215&lt;/li&gt;
&lt;li&gt;Variance: ~0.000145&lt;/li&gt;
&lt;li&gt;SD: ~0.012062&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compared to the SD we got when we varied just the initial seed, 0.0154919, it's not
too far off.  Using the 3-SD rule, we get a range of 3.637030 - 3.709400, and looking
at the table again, this time with the ones that we &lt;em&gt;don't&lt;/em&gt; expect to have the same
weights highlighted:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;Test set loss&lt;/th&gt;
  &lt;th&gt;Improvement vs baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;strong&gt;8xa100m40-weight-tying&lt;/strong&gt;&lt;/td&gt;
  &lt;td&gt;3.874&lt;/td&gt;
  &lt;td&gt;-0.182&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-weight-decay-cerebras&lt;/td&gt;
  &lt;td&gt;3.867&lt;/td&gt;
  &lt;td&gt;-0.175&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-baseline&lt;/td&gt;
  &lt;td&gt;3.692&lt;/td&gt;
  &lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m80-no-amp&lt;/td&gt;
  &lt;td&gt;3.679&lt;/td&gt;
  &lt;td&gt;0.013&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-gradient-clipping&lt;/td&gt;
  &lt;td&gt;3.678&lt;/td&gt;
  &lt;td&gt;0.014&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;strong&gt;8xa100m40-qkv-bias&lt;/strong&gt;&lt;/td&gt;
  &lt;td&gt;3.669&lt;/td&gt;
  &lt;td&gt;0.023&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-weight-decay-gpt2&lt;/td&gt;
  &lt;td&gt;3.643&lt;/td&gt;
  &lt;td&gt;0.049&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-remove-dropout&lt;/td&gt;
  &lt;td&gt;3.641&lt;/td&gt;
  &lt;td&gt;0.051&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;8xa100m40-schedule-learning-rate&lt;/td&gt;
  &lt;td&gt;3.602&lt;/td&gt;
  &lt;td&gt;0.09&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;...we can see that the QKV bias is well within that range (as are all of the interventions
apart from the two negative-effect ones and scheduling the learning rate).&lt;/p&gt;
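
&lt;p&gt;For anyone who wants to reproduce that check, here's a minimal sketch.  It uses the
rounded mean and SD quoted above and the three-decimal-place losses from the table, so the printed
bounds may differ from the ones in the text in the last digit.  The &lt;code&gt;in_band&lt;/code&gt; helper
is just an illustrative name, not something from the training code.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Mean and SD from the runs with varied weight-initialisation seeds (quoted above).
mean = 3.673215
sd = 0.012062

low, high = mean - 3 * sd, mean + 3 * sd

# Test set losses from the table above.
losses = {
    "8xa100m40-weight-tying": 3.874,
    "8xa100m40-weight-decay-cerebras": 3.867,
    "8xa100m40-baseline": 3.692,
    "8xa100m80-no-amp": 3.679,
    "8xa100m40-gradient-clipping": 3.678,
    "8xa100m40-qkv-bias": 3.669,
    "8xa100m40-weight-decay-gpt2": 3.643,
    "8xa100m40-remove-dropout": 3.641,
    "8xa100m40-schedule-learning-rate": 3.602,
}


def in_band(value, low, high):
    # Clamping a value to [low, high] leaves it unchanged only if it was already inside.
    return min(max(value, low), high) == value


inside = [name for name, loss in losses.items() if in_band(loss, low, high)]
outside = [name for name in losses if name not in inside]

print(f"3-SD band: {low:.6f} - {high:.6f}")
print("Inside:", inside)
print("Outside:", outside)
&lt;/code&gt;&lt;/pre&gt;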

&lt;p&gt;Right, what does all of that tell us?&lt;/p&gt;

&lt;h3 id="conclusion"&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;This post obviously isn't even trying to be statistically rigorous.  The number of
training runs I've done and the amount of data is way too small for that.  However,
training runs are expensive (Lambda have raised their prices again, so these cost
more than US$50 each!), so there's a limit to how much I can do.&lt;/p&gt;

&lt;p&gt;But even with the limited amount of data, something seems pretty clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Varying the random seed at the start, prior to initialising weights, and not
constraining the starting point for the training runs, gave a mean of 3.672857,
with an SD of 0.0154919.&lt;/li&gt;
&lt;li&gt;Keeping the same seed for model weights (so that they all started with the same
weights), and varying the seed for the training run, gave a mean of 3.684462,
with an SD of 0.008672.&lt;/li&gt;
&lt;li&gt;Varying the seed for the model weights (so that they all started with different
weights), and keeping the training run seed pinned, gave a mean of 3.673215
and an SD of 0.012062.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;"One of these things is not like the others".  Keeping the model weights stable
and only allowing variation in randomness across the training run itself gave a
noticeably tighter spread: the SD was only a bit over half of what we saw when the
initial seed varied everything.  Could this be
a result of the small number of samples?  I guess conceivably it might, but it seems
unlikely.&lt;/p&gt;

&lt;p&gt;So I feel reasonably confident in saying that the bulk of the variation in results that
we can chalk up to random noise in these training runs comes from variations in the
model weights' initialisation.&lt;/p&gt;

&lt;p&gt;Additionally, the first training run in this post -- the re-run of the baseline model
with no changes -- gave exactly the same numbers as the original baseline run.  So we
can be confident that all of the models with no changes to the weight initialisation
started with the same weights.  Of course, I could be wrong about which models really
did have the same weights, but given that they were running the same code with the same
seed, I'm pretty much sure.&lt;/p&gt;

&lt;p&gt;That makes me fairly confident that the intervention runs that had the same initial
weights gave a real signal about whether or not the intervention in question actually
helped.  The only exception is gradient clipping, which fell within the three-SD
range for the same-weights tests -- and it's essentially free, adding just 100 seconds
to a three hour training run.&lt;/p&gt;

&lt;p&gt;That's a really interesting result!  As I said earlier, given that dropout is making
us ignore a random 10% of activations during the training run, I would have thought
that changing &lt;em&gt;which&lt;/em&gt; random 10% were being ignored would have a much larger effect.
And that's not even considering other sources of random noise in the training run.&lt;/p&gt;

&lt;p&gt;I was less surprised that model weight initialisation was important, though.  It's pretty
obvious that your starting position in the loss landscape is going to affect where you
end up at the end of the training run.&lt;/p&gt;

&lt;p&gt;Still, we now have a reasonable level of trust that our interventions gave a real signal,
so I think we have everything in place to see how they stack
together, and do a best-effort training run.  Can we approach the original GPT-2 small
weights' performance on our test set loss?&lt;/p&gt;

&lt;p&gt;It should be fun to find out :-)&lt;/p&gt;

&lt;p&gt;&lt;a href="/2026/04/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud"&gt;Here's a link to the next post in this series&lt;/a&gt;.&lt;/p&gt;

&lt;div class="footnotes"&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id="fn-1"&gt;
&lt;p&gt;Numbers chosen based on a misremembering of &lt;a href="https://xkcd.com/3184/"&gt;this XKCD&lt;/a&gt;.
For some reason (perhaps because it rhymes) I thought that the old-timey funny
number thing was "22 skidoo" rather than "23 skidoo".&amp;#160;&lt;a href="#fnref-1" class="footnoteBackLink" title="Jump back to footnote 1 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-2"&gt;
&lt;p&gt;On working through this later: with &lt;math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"&gt;&lt;mrow&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;/mrow&gt;&lt;/math&gt; samples from a dataset, it is (as I understand it) best to
use &lt;math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"&gt;&lt;mrow&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;mo&gt;&amp;#x02212;&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt; as the denominator here (Bessel's correction) for the "sample variance".
If we had every possible value, then it would be correct to use &lt;math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"&gt;&lt;mrow&gt;&lt;mi&gt;n&lt;/mi&gt;&lt;/mrow&gt;&lt;/math&gt;.  However, while
this changes a few details in the analysis, I don't think it changes the final conclusion
of the post meaningfully (it would just bump up the SDs by 22% or so), so I've
left it as-is.&amp;#160;&lt;a href="#fnref-2" class="footnoteBackLink" title="Jump back to footnote 2 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-3"&gt;
&lt;p&gt;I found it interesting that this model does the "you and I" hypercorrection that so
many people do when trying to write formally!  Based on the (correct) correction
of "me and you move back home" to "you and I move back home", I think as a result of
excessive pattern-matching.&amp;#160;&lt;a href="#fnref-3" class="footnoteBackLink" title="Jump back to footnote 3 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-4"&gt;
&lt;p&gt;Another grammatical error based on pattern-matching -- it would make
sense that the possessive form of "it" in English was "it's", just like the possessive
form of "John" is "John's".&amp;#160;&lt;a href="#fnref-4" class="footnoteBackLink" title="Jump back to footnote 4 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description><guid isPermaLink="false">/2026/04/llm-from-scratch-32i-interventions-what-is-in-the-noise</guid><pubDate>Tue, 07 Apr 2026 21:00:00 +0000</pubDate></item><item><title>Writing an LLM from scratch, part 32j -- Interventions: trying to train a better model in the cloud</title><link>https://www.gilesthomas.com/2026/04/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud</link><description>&lt;p&gt;Since early February, I've been trying various interventions on a 163M-parameter GPT-2-style model that I
&lt;a href="/2025/12/llm-from-scratch-28-training-a-base-model-from-scratch"&gt;trained from scratch on my local RTX 3090&lt;/a&gt;,
using code based on
&lt;a href="https://sebastianraschka.com/"&gt;Sebastian Raschka&lt;/a&gt;'s book
"&lt;a href="https://www.manning.com/books/build-a-large-language-model-from-scratch"&gt;Build a Large Language Model (from Scratch)&lt;/a&gt;".&lt;/p&gt;

&lt;p&gt;My original model got a loss of 3.944 on my test set, while the original GPT-2
weights got 3.500 on the same dataset.  I wanted to see if I could close that gap,
and had a list of potential changes to the
training setup, and to the model itself.  Which of them would help?&lt;/p&gt;

&lt;p&gt;I found a list of solid-looking interventions, and in &lt;a href="/2026/04/llm-from-scratch-32i-interventions-what-is-in-the-noise"&gt;my last post&lt;/a&gt;
I came to the conclusion that the improvements in loss I had seen with all of them -- with
two possible exceptions -- seemed unlikely to be in the noise.  What would happen
if I tried to put them into a new model?&lt;/p&gt;
&lt;h3 id="the-interventions"&gt;The interventions&lt;/h3&gt;

&lt;p&gt;Let's start by looking at the results that we have for the interventions so far --
this is the table I've been using as I go through them, but I've updated it to contain
the loss figures for each model to six decimal places instead of three, and
made each model name link to the associated post.  I've also corrected the loss for
the &lt;code&gt;8xa100m40-weight-decay-cerebras&lt;/code&gt; model, which was mistakenly using the training loss at the
end of the run rather than the loss on the test set &lt;sup class="footnote-ref" id="fnref-1"&gt;&lt;a href="#fn-1"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;Test set loss&lt;/th&gt;
  &lt;th&gt;Improvement vs baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/03/llm-from-scratch-32g-interventions-weight-tying"&gt;8xa100m40-weight-tying&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.874305&lt;/td&gt;
  &lt;td&gt;-0.182779&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/03/llm-from-scratch-32f-interventions-weight-decay"&gt;8xa100m40-weight-decay-cerebras&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.813856&lt;/td&gt;
  &lt;td&gt;-0.122330&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/02/llm-from-scratch-32a-interventions-baseline-model"&gt;8xa100m40-baseline&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.691526&lt;/td&gt;
  &lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/04/llm-from-scratch-32h-interventions-full-fat-float32"&gt;8xa100m80-no-amp&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.678968&lt;/td&gt;
  &lt;td&gt;0.012558&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/02/llm-from-scratch-32b-interventions-gradient-clipping"&gt;8xa100m40-gradient-clipping&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.678317&lt;/td&gt;
  &lt;td&gt;0.013209&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/02/llm-from-scratch-32d-interventions-adding-attention-bias"&gt;8xa100m40-qkv-bias&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.669385&lt;/td&gt;
  &lt;td&gt;0.022141&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/03/llm-from-scratch-32f-interventions-weight-decay"&gt;8xa100m40-weight-decay-gpt2&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.642940&lt;/td&gt;
  &lt;td&gt;0.048586&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/02/llm-from-scratch-32c-interventions-removing-dropout"&gt;8xa100m40-remove-dropout&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.641282&lt;/td&gt;
  &lt;td&gt;0.050244&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/03/llm-from-scratch-32e-interventions-learning-rate"&gt;8xa100m40-schedule-learning-rate&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.601917&lt;/td&gt;
  &lt;td&gt;0.089609&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As I've mentioned before, simply moving to training in the cloud improved things markedly,
getting loss down from 3.944 to 3.691526; I suspect this was due to having a closer-to-optimal
batch size (more about that in my next post).  What to do about the other interventions, though?&lt;/p&gt;

&lt;p&gt;It seemed clear that two of them were not helping: weight tying, and the one using the figure
for weight decay that I'd (I suspect incorrectly) derived from a paper by Cerebras
Research.  The "no-AMP" run (which would be better described
as "full-fat float32") had a small positive effect, but was so costly in terms of both time and money
that it wasn't worthwhile.&lt;/p&gt;

&lt;p&gt;So we had five interventions to try:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Gradient clipping.&lt;/li&gt;
&lt;li&gt;QKV bias (that is, adding bias to the attention weight matrices).&lt;/li&gt;
&lt;li&gt;Changing weight decay to the GPT-2 value (0.01 rather than the 0.1 that is typical nowadays).&lt;/li&gt;
&lt;li&gt;Removing dropout.&lt;/li&gt;
&lt;li&gt;Updating the learning rate from 0.0004 to 0.0014, but also scheduling it so that it varies
over the course of the training run.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;How would they stack up?  It seemed pretty unlikely that their independent contributions
would just sum up neatly so that we got a total improvement
of 0.013209 + 0.022141 + 0.048586 + 0.050244 + 0.089609 = 0.223789 (though that would
certainly be nice!).&lt;/p&gt;
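
&lt;p&gt;As a quick sanity check on that arithmetic, here's a small sketch that sums the
individual improvements from the table and works out what the test loss would be in
the (unlikely) perfectly-additive case.  The variable names are purely illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;baseline_loss = 3.691526

# Improvement vs baseline for each of the five interventions, from the table above.
improvements = {
    "gradient clipping": 0.013209,
    "QKV bias": 0.022141,
    "weight decay (GPT-2 value)": 0.048586,
    "remove dropout": 0.050244,
    "scheduled learning rate": 0.089609,
}

total = sum(improvements.values())
hypothetical_loss = baseline_loss - total

print(f"Summed improvement: {total:.6f}")
print(f"Loss if perfectly additive: {hypothetical_loss:.6f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If the effects really did add up like that, the result would come in below the 3.500
that the original GPT-2 weights got, which is another reason to be sceptical that they would.&lt;/p&gt;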

&lt;p&gt;One question to consider was how independent they were.  For any set of interventions,
you can imagine them being independent and adding up nicely, or pulling in separate
directions so that the combined effect is worse than the sum, or pulling in the same
direction so that they amplify each other.&lt;/p&gt;

&lt;p&gt;My intuition was that gradient clipping and removing dropout were pretty independent,
at least conceptually.  They might affect other interventions indirectly (eg. via changing the
training run's use of the random number generator) but they'd be unlikely to have a direct
effect.  QKV bias I was less sure about, but it seemed -- again, just intuitively --
at least reasonably independent of the others, with one important exception (which I'll get into below).&lt;/p&gt;

&lt;p&gt;By contrast, weight decay and the learning rate interact quite strongly,
at least in standard gradient descent, and I'd tested them in isolation.  The result for changing the weight
decay to 0.01 was based on a fixed learning rate of 0.0004, and the result for scheduling
the learning rate was based on a weight decay of 0.1.&lt;/p&gt;
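
&lt;p&gt;To make that interaction a bit more concrete, here's a rough sketch.  It assumes
AdamW-style decoupled weight decay (as in PyTorch's &lt;code&gt;torch.optim.AdamW&lt;/code&gt;), where each step
scales the weights by &lt;code&gt;1 - lr * weight_decay&lt;/code&gt; before the gradient update is applied.
That assumption, and the labels in the dictionary, are just for illustration; the numbers
are the learning rates and weight decay values mentioned above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Assumption: AdamW-style decoupled weight decay, where each step scales the
# weights by (1 - lr * weight_decay) before applying the gradient update.
configs = {
    "baseline (lr 0.0004, wd 0.1)": (0.0004, 0.1),
    "weight-decay test (lr 0.0004, wd 0.01)": (0.0004, 0.01),
    "learning-rate test (peak lr 0.0014, wd 0.1)": (0.0014, 0.1),
    "both stacked (peak lr 0.0014, wd 0.01)": (0.0014, 0.01),
}

# The per-step shrinkage depends on the product of the two hyperparameters.
per_step_decay = {name: lr * wd for name, (lr, wd) in configs.items()}

for name, decay in per_step_decay.items():
    print(f"{name}: weights scaled by {1 - decay:.6f} per step")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Under that assumption, the stacked combination has a different product of learning rate and
weight decay from either of the isolated tests, so neither of those results tells us directly
how it will behave.&lt;/p&gt;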

&lt;p&gt;That felt like an issue, and definitely needed some thought.&lt;/p&gt;

&lt;p&gt;Additionally, some of the interventions might not have had a real effect at all;
their apparent improvements could just have been down to random noise.  While my analysis of how that might have affected
things was somewhat limited by the number of test runs I could
afford to do, it did show up two plausible issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Adding gradient clipping looked like it might have been within the training run noise.&lt;/li&gt;
&lt;li&gt;Adding QKV bias would have had a large effect on the model's initial weights.  All
of the others would have started with essentially the same weights (apart from
weight tying, though even that would have had the same values for the initial
weights apart from the tied ones).  But adding the bias would have completely changed
them, and its effect size was comfortably within the range of differences you might
expect from that.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After some thought, I came up with a plan.  If I were doing this properly and scientifically,
I suppose I'd try every combination of interventions, but that would be ruinously expensive &lt;sup class="footnote-ref" id="fnref-2"&gt;&lt;a href="#fn-2"&gt;2&lt;/a&gt;&lt;/sup&gt;, so a sensible
minimal set of training runs felt like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start a training run with all of the interventions apart from QKV bias.&lt;/li&gt;
&lt;li&gt;In parallel (Lambda instance availability permitting) run another one, with
all of the interventions &lt;em&gt;including&lt;/em&gt; QKV bias.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When those completed, I'd find the test set loss for both models.  I'd choose the best
run, and then do another run with those settings, but with weight decay switched
back to the original value of 0.1.  I chose to revert weight decay rather than the learning rate stuff because this was
the one I was least sure about -- the updated "GPT-2" value of 0.01 is very unusual
by today's standards, and I'd come to it via a rather circuitous route -- see
&lt;a href="/2026/03/llm-from-scratch-32f-interventions-weight-decay"&gt;the post&lt;/a&gt; for more details.&lt;/p&gt;

&lt;p&gt;The best of the three runs would be the winning combination of interventions.&lt;/p&gt;

&lt;p&gt;Again, this was not an exhaustive plan &lt;sup class="footnote-ref" id="fnref-3"&gt;&lt;a href="#fn-3"&gt;3&lt;/a&gt;&lt;/sup&gt;.  But it seemed to make sense.  Let's see
how it turned out.&lt;/p&gt;

&lt;h3 id="training-run-1-without-qkv-bias"&gt;Training run 1: without QKV bias&lt;/h3&gt;

&lt;p&gt;Just to recap, this one had these interventions against the baseline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gradient clipping at 3.5&lt;/li&gt;
&lt;li&gt;Weight decay changed from 0.1 to 0.01&lt;/li&gt;
&lt;li&gt;Dropout removed&lt;/li&gt;
&lt;li&gt;Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run
then a cosine decay to 0.00014.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It did not have QKV bias.  You can see &lt;a href="https://github.com/gpjt/ddp-base-model-from-scratch/tree/main/runs/8xa100m40-stacked-interventions-1"&gt;the config here&lt;/a&gt;.&lt;/p&gt;
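
&lt;p&gt;For reference, the learning rate schedule in that list can be sketched roughly like this.
It's a minimal illustration rather than the code from the repo: it assumes a linear warmup from zero,
and the function name, its parameters and the step count are all made up for the example.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math


def scheduled_lr(step, total_steps, peak_lr=0.0014, min_lr=0.00014, warmup_fraction=0.05):
    # Assumption: linear warmup from zero; the real run's warmup shape may differ.
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    warmup_lr = peak_lr * step / warmup_steps

    # Fraction of the post-warmup phase completed (zero during warmup, capped at one).
    progress = min(1.0, max(0, step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine_lr = min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

    # During warmup the ramp is the smaller value; afterwards the cosine curve is.
    return min(warmup_lr, cosine_lr)


total_steps = 1000  # hypothetical step count, for illustration only
for step in (0, 25, 50, 525, 1000):
    print(step, f"{scheduled_lr(step, total_steps):.6f}")
&lt;/code&gt;&lt;/pre&gt;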

&lt;p&gt;Here's the loss chart over the course of the training run:&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud/stacked-interventions-1-loss-chart.png" alt="The loss chart without QKV bias" title="The loss chart without QKV bias" /&gt;&lt;/p&gt;

&lt;p&gt;As normal with learning rate scheduling, I also charted that to make sure it was
doing the right thing (you can see that it was):&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud/stacked-interventions-1-learning-rate-chart.png" alt="The learning rate chart without QKV bias" title="The learning rate chart without QKV bias" /&gt;&lt;/p&gt;

&lt;p&gt;And I also tracked the gradient norms -- you can see that there was some clipping
happening near the start of the run:&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud/stacked-interventions-1-grad-norm-chart.png" alt="The gradient norm chart without QKV bias" title="The gradient norm chart without QKV bias" /&gt;&lt;/p&gt;
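
&lt;p&gt;As a reminder of what "clipping at 3.5" means, here's a small pure-Python sketch.  It assumes
the threshold is applied to the global L2 norm across all parameters, which is how PyTorch's
&lt;code&gt;torch.nn.utils.clip_grad_norm_&lt;/code&gt; behaves; the helper name and the gradient values are
invented for the example, and plain lists stand in for tensors.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math


def clip_by_global_norm(grads, max_norm=3.5, eps=1e-6):
    # grads: one list of floats per parameter tensor (a stand-in for real tensors).
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # The scale factor is capped at 1.0, so gradients under the threshold are untouched.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm


grads = [[3.0, 4.0], [12.0]]  # made-up values; the global norm is 13.0
clipped, norm_before = clip_by_global_norm(grads)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(f"norm before: {norm_before:.3f}, after: {norm_after:.3f}")
&lt;/code&gt;&lt;/pre&gt;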

&lt;p&gt;At the end of the run, it reported this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Training complete in 11,437.717 seconds
Tokens seen: 3,260,252,160
Throughput: 285,044 tokens/second
Final train loss: 3.557
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That's a slightly lower final train loss than normal, and it took 3h10m, which is
faster than usual, but about the same as the other training run we did without dropout --
that makes sense, as the process of zeroing out random activations isn't free.&lt;/p&gt;
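
&lt;p&gt;The 3h10m figure and the throughput both follow directly from the printout above; a quick sketch
of the conversion, with illustrative variable names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;seconds = 11_437.717
tokens = 3_260_252_160

# Split the elapsed time into whole hours and minutes.
hours, remainder = divmod(int(seconds), 3600)
minutes = remainder // 60
throughput = tokens / seconds

print(f"{hours}h{minutes:02d}m, {throughput:,.0f} tokens/second")
&lt;/code&gt;&lt;/pre&gt;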

&lt;p&gt;I downloaded the model -- &lt;a href="https://huggingface.co/gpjt/8xa100m40-stacked-interventions-1"&gt;here it is&lt;/a&gt; -- and then ran the smoke test:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you back to writing. In addition, there is room in your notebook so this will be an opportunity that
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;...and got its loss on the test set:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loss against our test dataset: 3.577761
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Not bad at all -- the best result we've had so far, albeit not quite
up to the standard of the original GPT-2 weights.&lt;/p&gt;

&lt;p&gt;Now the next one, with QKV bias.&lt;/p&gt;

&lt;h3 id="training-run-2-all-five-interventions-including-qkv-bias"&gt;Training run 2: all five interventions including QKV bias&lt;/h3&gt;

&lt;p&gt;This one had these interventions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gradient clipping at 3.5&lt;/li&gt;
&lt;li&gt;Weight decay changed from 0.1 to 0.01&lt;/li&gt;
&lt;li&gt;Dropout removed&lt;/li&gt;
&lt;li&gt;Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run
then a cosine decay to 0.00014.&lt;/li&gt;
&lt;li&gt;QKV bias switched on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can see &lt;a href="https://github.com/gpjt/ddp-base-model-from-scratch/tree/main/runs/8xa100m40-stacked-interventions-2"&gt;the config here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's the loss chart:&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud/stacked-interventions-2-loss-chart.png" alt="The loss chart with QKV bias" title="The loss chart with QKV bias" /&gt;&lt;/p&gt;

&lt;p&gt;...the learning rate:&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud/stacked-interventions-2-learning-rate-chart.png" alt="The learning rate chart with QKV bias" title="The learning rate chart with QKV bias" /&gt;&lt;/p&gt;

&lt;p&gt;...the gradient norms (note that we had more clipping, about halfway through):&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud/stacked-interventions-2-grad-norm-chart.png" alt="The gradient norm chart with QKV bias" title="The gradient norm chart with QKV bias" /&gt;&lt;/p&gt;

&lt;p&gt;...and the final printout at the end.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Training complete in 11,487.294 seconds
Tokens seen: 3,260,252,160
Throughput: 283,814 tokens/second
Final train loss: 3.584
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That final train loss is slightly higher, which is normally an indicator that the
test loss will be higher, but we'll have to see.&lt;/p&gt;

&lt;p&gt;Time to download the model --
&lt;a href="https://huggingface.co/gpjt/8xa100m40-stacked-interventions-2"&gt;here it is&lt;/a&gt; --
and on to the smoke test:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you to change the color in your image.
Your image’s color will need to match the
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;...and then the moment of truth -- what was its loss on the test set?&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loss against our test dataset: 3.604342
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As I suspected from the training loss at the end, slightly worse than the run without
QKV bias. So, that meant that we should do the next run, with a weight decay of 0.1, with
no QKV bias.&lt;/p&gt;

&lt;h3 id="training-run-3-just-dropout-removal-gradient-clipping-and-a-higher-but-scheduled-learning-rate"&gt;Training run 3: just dropout removal, gradient clipping, and a higher but scheduled learning rate&lt;/h3&gt;

&lt;p&gt;Given the above results, this one had these interventions vs the baseline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gradient clipping at 3.5&lt;/li&gt;
&lt;li&gt;Dropout removed&lt;/li&gt;
&lt;li&gt;Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run
then a cosine decay to 0.00014.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Weight decay was back to the baseline value of 0.1, rather than the value of 0.01
used in the previous two runs, and QKV bias was switched back off.&lt;/p&gt;

&lt;p&gt;You can see &lt;a href="https://github.com/gpjt/ddp-base-model-from-scratch/tree/main/runs/8xa100m40-stacked-interventions-3"&gt;the config here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's the loss chart:&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud/stacked-interventions-3-loss-chart.png" alt="The loss chart without QKV bias and with weight decay back to 0.1" title="The loss chart without QKV bias and with weight decay back to 0.1" /&gt;&lt;/p&gt;

&lt;p&gt;You can see that it's much choppier than the previous two runs; that initially surprised me,
as the higher weight decay means that we're regularising the model more than we were with
those, which I thought would "calm things down".  But on reflection, I had it backward.
Hand-waving a bit, a more regularised model fits the details of the data it has seen less closely, weighting
the typical stuff more heavily than the outliers.
That means that when something a bit more out-of-distribution appears, it might not
have yet learned how to integrate it into its model of the world.&lt;/p&gt;

&lt;p&gt;Well, it sounds plausible, anyway :-)&lt;/p&gt;

&lt;p&gt;On to the learning rate (just to double-check), and it's fine:&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud/stacked-interventions-3-learning-rate-chart.png" alt="The learning rate chart without QKV bias and with weight decay back to 0.1" title="The learning rate chart without QKV bias and with weight decay back to 0.1" /&gt;&lt;/p&gt;

&lt;p&gt;And again, the gradient norms:&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud/stacked-interventions-3-grad-norm-chart.png" alt="The gradient norm chart without QKV bias and with weight decay back to 0.1" title="The gradient norm chart without QKV bias and with weight decay back to 0.1" /&gt;&lt;/p&gt;

&lt;p&gt;...which, like the loss chart, show more occasions where gradients spiked and
had to be clipped -- even towards the end of the training run this time.&lt;/p&gt;

&lt;p&gt;The final printout at the end:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Training complete in 11,422.638 seconds
Tokens seen: 3,260,252,160
Throughput: 285,420 tokens/second
Final train loss: 3.570
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Once again, although the final train loss is not definitive, it tends to be indicative
of the test loss.  It's in between the last two runs, so we'd expect the test loss to
be likewise in between theirs.&lt;/p&gt;

&lt;p&gt;Time to download the model --
&lt;a href="https://huggingface.co/gpjt/8xa100m40-stacked-interventions-3"&gt;here it is&lt;/a&gt; --
and on to the smoke test:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you to make, our service has helped millions of women around the world who were recently hit by car wreck
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Hmm.  At least vaguely coherent, though I'm not 100% convinced.  It looks like ads for
personal injury lawyers have crept into FineWeb somehow...&lt;/p&gt;

&lt;p&gt;Still, it's time for the test loss (drumroll):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loss against our test dataset: 3.590266
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As predicted from the train loss, it's in between the two runs above.&lt;/p&gt;

&lt;h3 id="conclusion"&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;Let's put these three runs into the results table:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;Test set loss&lt;/th&gt;
  &lt;th&gt;Improvement vs baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/03/llm-from-scratch-32g-interventions-weight-tying"&gt;8xa100m40-weight-tying&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.874305&lt;/td&gt;
  &lt;td&gt;-0.182779&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/03/llm-from-scratch-32f-interventions-weight-decay"&gt;8xa100m40-weight-decay-cerebras&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.813856&lt;/td&gt;
  &lt;td&gt;-0.122330&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/02/llm-from-scratch-32a-interventions-baseline-model"&gt;8xa100m40-baseline&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.691526&lt;/td&gt;
  &lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/04/llm-from-scratch-32h-interventions-full-fat-float32"&gt;8xa100m80-no-amp&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.678968&lt;/td&gt;
  &lt;td&gt;0.012558&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/02/llm-from-scratch-32b-interventions-gradient-clipping"&gt;8xa100m40-gradient-clipping&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.678317&lt;/td&gt;
  &lt;td&gt;0.013209&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/02/llm-from-scratch-32d-interventions-adding-attention-bias"&gt;8xa100m40-qkv-bias&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.669385&lt;/td&gt;
  &lt;td&gt;0.022141&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/03/llm-from-scratch-32f-interventions-weight-decay"&gt;8xa100m40-weight-decay-gpt2&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.642940&lt;/td&gt;
  &lt;td&gt;0.048586&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/02/llm-from-scratch-32c-interventions-removing-dropout"&gt;8xa100m40-remove-dropout&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.641282&lt;/td&gt;
  &lt;td&gt;0.050244&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/04/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud"&gt;8xa100m40-stacked-interventions-2&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.604342&lt;/td&gt;
  &lt;td&gt;0.087184&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/03/llm-from-scratch-32e-interventions-learning-rate"&gt;8xa100m40-schedule-learning-rate&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.601917&lt;/td&gt;
  &lt;td&gt;0.089609&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/04/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud"&gt;8xa100m40-stacked-interventions-3&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.590266&lt;/td&gt;
  &lt;td&gt;0.101260&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/04/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud"&gt;8xa100m40-stacked-interventions-1&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.577761&lt;/td&gt;
  &lt;td&gt;0.113765&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As a reminder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt; was gradient clipping at 3.5, weight decay changed from 0.1 to 0.01,
dropout removed, and the learning rate intervention, but &lt;em&gt;no&lt;/em&gt; QKV bias&lt;/li&gt;
&lt;li&gt;&lt;code&gt;8xa100m40-stacked-interventions-2&lt;/code&gt; was gradient clipping at 3.5, weight decay changed from 0.1 to 0.01,
dropout removed, and the learning rate intervention, &lt;em&gt;with&lt;/em&gt; QKV bias&lt;/li&gt;
&lt;li&gt;&lt;code&gt;8xa100m40-stacked-interventions-3&lt;/code&gt; was gradient clipping at 3.5,
dropout removed, and the learning rate intervention, but &lt;em&gt;no&lt;/em&gt; QKV bias, and &lt;em&gt;no change to weight decay&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can see that adding on QKV bias actually made the model &lt;em&gt;worse&lt;/em&gt; than the learning-rate-only
intervention.  That pushes me slightly away from the "it's all about the initial weights"
direction; perhaps instead the bias adds some kind of stability that the learning rate
scheduling also provides, and they fight against each other?  Unfortunately I think the
only way to pick it apart would be to do a full set of runs, switching each intervention
on and off independently, and that would be too costly.&lt;/p&gt;

&lt;p&gt;The fact that the weight decay change from 0.1 to 0.01 actually did help when combined
with the learning rate change and scheduling was a bit of a surprise; because weight decay
and the learning rate are coupled when we think about standard gradient descent, I was expecting them to be too intertwined for my tests
of them in isolation to have been valid.  I'm quite pleased that it didn't work out that way,
though, because sweeping across values for different parameters is much easier than
it would be if they were connected.&lt;/p&gt;

&lt;p&gt;However, at this point it occurs to me that it
might be because we're using the AdamW optimiser.  As I
understand it, its big difference versus Adam is that it decouples weight decay.  I don't
have a solid mental model of what that means exactly (will read up and post about
it eventually), but it certainly seems pertinent here.&lt;/p&gt;

&lt;p&gt;Anyway, I have to say, I'm both pleased with and disappointed by these results.
Pleased because we got a result by putting interventions together that was better
than any of them in isolation, but disappointed that the end result wasn't even
better.&lt;/p&gt;

&lt;p&gt;The difference between &lt;code&gt;8xa100m40-baseline&lt;/code&gt;'s loss, at 3.691526, and original GPT-2 small's,
at 3.5, was 0.191526.  Our best result, for &lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt;,
was 3.577761, so an improvement of 0.113765.  That's about 60% of the way there.&lt;/p&gt;

&lt;p&gt;That said, by sheer chance, while trying out the different sizes of cloud machines,
I'd got from a loss of 3.944 training locally to the baseline's value of 3.691526 --
I suspect due to the fact that training in the cloud meant that I could use
batch sizes of 96.&lt;/p&gt;

&lt;p&gt;So a different way of looking at it is that we should include that in the calculations too.
From 3.944 to 3.5, the gap with GPT-2 small was 0.444.  And we went from 3.944 to 3.577761,
an improvement of 0.366239.  And that means that we managed to get 82% of the improvement
we needed.&lt;/p&gt;

&lt;p&gt;On the other hand, it means that in terms of my improvements, 0.252474 came from a happy
accident, while all of my careful work on interventions only got me 0.113765.  :-(&lt;/p&gt;
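
&lt;p&gt;As a sanity check on that arithmetic, here's a small sketch that recomputes the numbers
from the loss figures quoted above.  The variable names are purely illustrative, and the
3.5 figure for GPT-2 small is the rounded value used in this post:&lt;/p&gt;

```python
# Loss figures quoted in this post; the variable names are illustrative only.
local_loss = 3.944         # the original local training run
baseline_loss = 3.691526   # 8xa100m40-baseline
best_loss = 3.577761       # 8xa100m40-stacked-interventions-1
gpt2_small_loss = 3.5      # original GPT-2 small (rounded, as quoted above)

# Improvement from moving to the cloud baseline, and from the interventions.
from_cloud_move = local_loss - baseline_loss
from_interventions = baseline_loss - best_loss

# Fraction of the gap to GPT-2 small that was closed, from each starting point.
fraction_vs_baseline = from_interventions / (baseline_loss - gpt2_small_loss)
fraction_vs_local = (local_loss - best_loss) / (local_loss - gpt2_small_loss)

print(f"Local to cloud baseline: {from_cloud_move:.6f}")
print(f"Interventions:           {from_interventions:.6f}")
print(f"Gap closed vs baseline:  {fraction_vs_baseline:.1%}")
print(f"Gap closed vs local:     {fraction_vs_local:.1%}")
```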

&lt;p&gt;Anyway, I think that for now, I'll have to rest happy with that as a result -- and next time around,
let's see if we can get to the same level of improvement locally, using gradient accumulation.&lt;/p&gt;

&lt;p&gt;&lt;a href="/2026/04/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation"&gt;Here's a link to the next post in this series&lt;/a&gt;.&lt;/p&gt;

&lt;div class="footnotes"&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id="fn-1"&gt;
&lt;p&gt;Luckily the difference was small enough that it doesn't change any of the
conclusions I'd made about it.&amp;#160;&lt;a href="#fnref-1" class="footnoteBackLink" title="Jump back to footnote 1 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-2"&gt;
&lt;p&gt;Because there are five interventions, and each can be on or off, then it's
equivalent to a 5-digit binary number.  So that's &lt;math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"&gt;&lt;mrow&gt;&lt;msup&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;mn&gt;5&lt;/mn&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;/math&gt; trains, less the five ones I'd already done and the baseline,
for a total of &lt;math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"&gt;&lt;mrow&gt;&lt;mn&gt;32&lt;/mn&gt;&lt;mo&gt;&amp;#x02212;&lt;/mo&gt;&lt;mn&gt;6&lt;/mn&gt;&lt;mo&gt;&amp;#x0003D;&lt;/mo&gt;&lt;mn&gt;26&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt;.
At US$50-odd for a train, that's definitely a no-go.&amp;#160;&lt;a href="#fnref-2" class="footnoteBackLink" title="Jump back to footnote 2 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-3"&gt;
&lt;p&gt;I did also consider changing the random seed at the start of the code to 67
rather than 42, given that it seemed to provide better initial weights when I was
exploring the effects of random noise on the training.  I even started the first two
training runs with that in place.  However, on reflection I realised that it would be
one step too far away from scientific rigour.  I'm not trying to be 100% rigorous in
these posts, but it seemed like a step too far to diligently test all of the
interventions against one seed, and then YOLO
in a different one for the final training runs.&amp;#160;&lt;a href="#fnref-3" class="footnoteBackLink" title="Jump back to footnote 3 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description><guid isPermaLink="false">/2026/04/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud</guid><pubDate>Thu, 09 Apr 2026 20:00:00 +0000</pubDate></item><item><title>Writing an LLM from scratch, part 32k -- Interventions: training a better model locally with gradient accumulation</title><link>https://www.gilesthomas.com/2026/04/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation</link><description>&lt;p&gt;I've been working on a GPT-2-small-style LLM based on
&lt;a href="https://sebastianraschka.com/"&gt;Sebastian Raschka&lt;/a&gt;'s book
"&lt;a href="https://www.manning.com/books/build-a-large-language-model-from-scratch"&gt;Build a Large Language Model (from Scratch)&lt;/a&gt;".
I've trained various versions of it in the cloud to
&lt;a href="/2026/04/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud"&gt;work out which interventions to the model and training code&lt;/a&gt;
had the best effects on the loss it gets on a specific test dataset, and now
I wanted to do a training run locally to match the best of those.
For that, I wanted to match the batch size I was using for the cloud training
runs.&lt;/p&gt;

&lt;p&gt;When I first started learning this stuff, batching seemed like a performance
thing -- with highly parallel systems like GPUs, it generally turned out that you could run
a batch of (say) two inputs through a model in less than twice the time you could run one, so it made
sense to batch them up.&lt;/p&gt;

&lt;p&gt;For inference, that is exactly the advantage you get, but when training, it's become increasingly
clear to me that you can also get an improvement in the quality of the model
from batching.  The best intuitive model I have is that if you run inputs
through one-by-one, adjusting parameters after each, then it's easy for the model to
"overcorrect" each time.  With batches, you get an average set of gradients across
all of the items -- which smooths things out and stabilises the training.&lt;/p&gt;

&lt;p&gt;Of course, it's possible to overdo it.   As an extreme example, imagine that you
were somehow able to fit your whole training set into one batch -- then you could
train by running that single batch through, doing a single backward pass, and then adjusting the parameters
once.  It's pretty clear that that would not work very well -- just one single update
of the initially-random parameters.&lt;/p&gt;

&lt;p&gt;When training on my local machine, I could fit a batch of six sequences into my RTX 3090.  I'd found that
when I moved to cloud machines and could use larger batches, it had a very positive effect on the loss I got
out of the models when I tested them.  From &lt;a href="/2026/01/llm-from-scratch-29-ddp-training-a-base-model-in-the-cloud#the-results"&gt;a quick-and-dirty bit of curve-fitting&lt;/a&gt;,
I estimated that the optimal batch size for this model, with that training run,
was somewhere around 97.  Conveniently, that was close to the maximum I could fit
onto an 8x A100 40 GiB/GPU machine, so I used a batch size of 96 to test the different
interventions I was trying.&lt;/p&gt;

&lt;p&gt;And when I finally &lt;a href="/2026/04/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud"&gt;put all of the interventions that helped with training together&lt;/a&gt;,
I found (somewhat to my surprise) that their combined effect -- an improvement in
loss of 0.113765 -- was less than half of the loss improvement of 0.252474 that I had got from
increasing the batch size.&lt;/p&gt;

&lt;p&gt;What that all made clear was that if I wanted to do a local training run that matched
the quality of the cloud-trained model, I'd need to not only add on the interventions
that I'd been testing in detail, but I'd need to match the cloud batch size.  And for
that, I needed to learn about gradient accumulation.&lt;/p&gt;
&lt;h3 id="gradient-accumulation-basics"&gt;Gradient accumulation basics&lt;/h3&gt;

&lt;p&gt;Gradient accumulation is pretty much what it sounds like; instead of the normal technique
of doing a forward pass, working out the loss, getting gradients with a backward pass,
and then applying them by stepping the optimiser, you do multiple forward-backward phases,
letting the gradients accumulate, and then do one optimiser step after that.&lt;/p&gt;

&lt;p&gt;When you do that, you're getting the training stabilisation benefits of a larger batch
size, even though you're not getting the performance boost.  Sounds simple enough,
and it is, in theory, but the implementation got a little more complicated.&lt;/p&gt;

&lt;p&gt;Let's work through it step-by-step.&lt;/p&gt;

&lt;p&gt;To start with, imagine you have a really simple training loop:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batched_dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Other stuff&lt;/span&gt;

    &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Other stuff&lt;/span&gt;

    &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zero_grad&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Other stuff&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Adding gradient accumulation to that is really simple!&lt;/p&gt;

&lt;p&gt;Let's assume that &lt;code&gt;batched_dataset&lt;/code&gt; has a length divisible by &lt;code&gt;gradient_accumulation_steps&lt;/code&gt;,
the number of steps we want to run through before we step the optimiser.  As a first (not quite correct)
cut, you could just do this:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batched_dataset&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Other stuff&lt;/span&gt;

    &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Other stuff&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batched_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zero_grad&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Other stuff&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;You can see that we're just stepping the optimiser every &lt;code&gt;gradient_accumulation_steps&lt;/code&gt; steps.
An alternative way to do it would be with an inner loop:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;dataset_ix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;dataset_ix&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batched_dataset&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Other stuff&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ii&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;dataset_ix&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batched_dataset&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;batched_dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;dataset_ix&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;dataset_ix&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

    &lt;span class="c1"&gt;# Other stuff&lt;/span&gt;

    &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zero_grad&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Other stuff&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Which of those is better would depend on the details of the training loop -- in general,
if you wanted the "other stuff" to be done once per training batch, then you'd want to
use the first option, whereas if you wanted it to be done once per optimiser step,
the second would be easier.  As you'll see in a bit, I went for the second one for my
code.&lt;/p&gt;
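
&lt;p&gt;To check that the two loop shapes really do group the batches in the same way, here's
a minimal sketch that strips out the model entirely and just records which batch indices
feed into each optimiser step.  The function names are hypothetical, and appending to a
list stands in for the forward-backward pass and for &lt;code&gt;optimizer.step()&lt;/code&gt;:&lt;/p&gt;

```python
def groups_with_modulus(num_batches, accumulation_steps):
    """Group batch indices per optimiser step using the modulus check."""
    groups, current = [], []
    for step in range(num_batches):
        current.append(step)  # stands in for the forward and backward pass
        if (step + 1) % accumulation_steps == 0 or step == num_batches - 1:
            groups.append(current)  # stands in for optimizer.step()
            current = []
    return groups


def groups_with_inner_loop(num_batches, accumulation_steps):
    """Group batch indices per optimiser step using the inner loop."""
    groups = []
    dataset_ix = 0
    # dataset_ix only ever goes up by one, so an equality check is enough here.
    while dataset_ix != num_batches:
        current = []
        for _ in range(accumulation_steps):
            if dataset_ix == num_batches:
                break
            current.append(dataset_ix)  # forward and backward pass
            dataset_ix += 1
        groups.append(current)  # optimizer.step()
    return groups


# Seven batches with three accumulation steps leaves a short final group.
print(groups_with_modulus(7, 3))
print(groups_with_inner_loop(7, 3))
```

&lt;p&gt;Both versions give the same groups, including the short one at the end when the
dataset length isn't a multiple of the number of accumulation steps.&lt;/p&gt;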

&lt;p&gt;However, there's one small correction that we need to make for either of these loops to work properly.  Remember
that when you calculate loss across a batch -- for example, cross entropy loss like this:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;functional&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;...you're getting the &lt;em&gt;average&lt;/em&gt; loss across the batch, so when you do the backward
pass, you're getting the average gradients.  By contrast, in the code above, we're
doing a backward pass on the complete loss at each step, so the gradients that are
being generated in each backward pass are being added to each other -- you wind up
with the sum of all of them rather than the average.  So the gradients that the optimiser
applied would be &lt;code&gt;gradient_accumulation_steps&lt;/code&gt; times larger than they should be --
it would be as if we'd multiplied the learning rate by that number!&lt;/p&gt;

&lt;p&gt;But that's easy enough to fix.  The average gradients over a number of steps are the sum divided by the
number of steps, and we can do that division ahead of time just by scaling the loss down.
Adding that into the first example above:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batched_dataset&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;calculate_loss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loss&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batched_dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;optimizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;zero_grad&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;And that's basically it; with those changes, the original basic training loop
becomes one that uses gradient accumulation.  The effective batch size is whatever
the real batch size is, times the number of gradient accumulation steps.&lt;/p&gt;
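
&lt;p&gt;Here's a toy check of that scaling argument.  It's only a sketch: instead of a real network and
autograd, it uses a hypothetical one-parameter model &lt;code&gt;y = weight * x&lt;/code&gt; with a mean squared
error loss, and works out the gradient by hand.  Dividing the loss by the number of steps divides
its gradient by the same number, so that's what the loop does directly.  It also assumes that every
microbatch is the same size:&lt;/p&gt;

```python
import math


def mean_gradient(weight, examples):
    """Gradient w.r.t. weight of the mean squared error for y = weight * x."""
    return sum(2 * (weight * x - y) * x for x, y in examples) / len(examples)


# Made-up data and starting weight, purely for illustration.
weight = 0.5
examples = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0), (5.0, 9.0), (6.0, 13.0)]
microbatch_size = 2
gradient_accumulation_steps = len(examples) // microbatch_size

# What we would get from one big batch holding every example.
full_batch_gradient = mean_gradient(weight, examples)

accumulated_unscaled = 0.0
accumulated_scaled = 0.0
for step in range(gradient_accumulation_steps):
    start = step * microbatch_size
    microbatch = examples[start:start + microbatch_size]
    gradient = mean_gradient(weight, microbatch)
    # Without scaling, the per-microbatch averages simply add up.
    accumulated_unscaled += gradient
    # Scaling the loss by 1 / steps scales its gradient by the same factor.
    accumulated_scaled += gradient / gradient_accumulation_steps

# The scaled version matches the big batch...
assert math.isclose(accumulated_scaled, full_batch_gradient)
# ...and the unscaled one is gradient_accumulation_steps times too large.
assert math.isclose(
    accumulated_unscaled, full_batch_gradient * gradient_accumulation_steps
)
print(full_batch_gradient, accumulated_scaled, accumulated_unscaled)
```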

&lt;h3 id="gradient-accumulation-in-practice"&gt;Gradient accumulation in practice&lt;/h3&gt;

&lt;p&gt;However, the real training loop that I'm using for these experiments is a bit more
complicated than that simple example.  There's checkpointing, AMP, and -- most importantly
-- it can handle multi-GPU training using DistributedDataParallel.  That meant there
was a little more to change.&lt;/p&gt;

&lt;p&gt;The first thing was to look into the way I was selecting the data to train on.  My dataset was already
in batches, but we had to split those batches up between GPUs.  The solution in the code
was to work out how many global steps there were -- each global step being one batch
going through each GPU on the machine -- like this:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;total_global_steps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_ds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;world_size&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;code&gt;world_size&lt;/code&gt;, if you remember from &lt;a href="/2026/01/llm-from-scratch-29-ddp-training-a-base-model-in-the-cloud"&gt;the DDP post&lt;/a&gt;,
is the number of processes running in a multi-GPU training run -- one per GPU.&lt;/p&gt;

&lt;p&gt;Next, in the training loop, I iterated over the global steps:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;progress_bar&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start_global_step&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_global_steps&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;disable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;global_step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;progress_bar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Get the data&lt;/span&gt;
    &lt;span class="c1"&gt;# Forward and backward pass&lt;/span&gt;
    &lt;span class="c1"&gt;# Step the optimiser.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;...for each one, getting the appropriate batch out for the specific GPU that was
running the code:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_ds&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;global_step&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;world_size&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;code&gt;rank&lt;/code&gt; is a zero-indexed number, unique to each of the per-GPU processes.  So this
basically split &lt;code&gt;train_ds&lt;/code&gt; into chunks of length &lt;code&gt;world_size&lt;/code&gt;, and then each GPU
was fed the batch at its &lt;code&gt;rank&lt;/code&gt;'s offset into the chunk.&lt;/p&gt;
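
&lt;p&gt;To make that indexing concrete, here's a small sketch with made-up sizes -- four GPUs and
twelve batches, neither of which are the real values from my runs -- that lists which batch
indices each rank would pick up:&lt;/p&gt;

```python
# Hypothetical sizes, chosen only to keep the output short.
world_size = 4     # number of per-GPU processes
num_batches = 12   # stands in for len(train_ds)

total_global_steps = num_batches // world_size

# For each rank, the batch index it reads at each global step.
batches_per_rank = {
    rank: [
        global_step * world_size + rank
        for global_step in range(total_global_steps)
    ]
    for rank in range(world_size)
}

for rank, indices in batches_per_rank.items():
    print(f"rank {rank}: {indices}")
```

&lt;p&gt;Each rank strides through the dataset in steps of &lt;code&gt;world_size&lt;/code&gt;, and between
them the ranks cover every batch exactly once.&lt;/p&gt;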

&lt;p&gt;I wanted to keep things shaped such that when I was running with gradient accumulation
locally, it would be similar to a cloud run with per-GPU batching.  Specifically: when
I was training in the cloud, I had eight GPUs with a per-GPU microbatch size of 12, giving
a total batch size of 96.  Locally, I could fit a batch size of six on my GPU, so I needed
to do gradient accumulation over 96 / 6 = 16 steps.&lt;/p&gt;

&lt;p&gt;To keep things as similar as possible, I decided that I wanted the concept of a
"global step" to match between the runs.  In other words, it would expand slightly,
from meaning "one batch per GPU" to being "one optimiser step per GPU".
So, each time through that &lt;code&gt;global_step&lt;/code&gt; loop, we'd do multiple forward-backward
passes, and then one optimiser step.  That would mean that the best way to do things
would be with something much more like the second of the two bits of sample code
above -- the one with the inner loop rather than the modulus.&lt;/p&gt;

&lt;p&gt;Maybe that's easier to show in code:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;total_global_steps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_ds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;world_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;

&lt;span class="o"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;global_step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;progress_bar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="o"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;accumulation_step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Forward and backward passes&lt;/span&gt;
        &lt;span class="o"&gt;...&lt;/span&gt;

    &lt;span class="c1"&gt;# Step the optimiser&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;That required a change to the data lookup; I decided that &lt;code&gt;train_ds&lt;/code&gt; would be split
into chunks of size &lt;code&gt;gradient_accumulation_steps * world_size&lt;/code&gt;, and then each of those
would be split into chunks of size &lt;code&gt;world_size&lt;/code&gt;, so the code to get the appropriate
batch for a given run through the loop became this:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;targets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_ds&lt;/span&gt;&lt;span class="p"&gt;[((&lt;/span&gt;&lt;span class="n"&gt;global_step&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;accumulation_step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;world_size&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
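
&lt;p&gt;As a sanity check on that indexing, here's a minimal sketch with made-up sizes (two GPUs,
three accumulation steps, four global steps).  The helper name &lt;code&gt;sample_index&lt;/code&gt; is hypothetical
and isn't in the real training code; it just wraps the same arithmetic as the lookup above, so that
we can confirm that every index gets visited exactly once:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;def sample_index(global_step, accumulation_step, rank,
                 gradient_accumulation_steps, world_size):
    # Same arithmetic as the lookup above, pulled out into a function.
    return (
        (global_step * gradient_accumulation_steps) + accumulation_step
    ) * world_size + rank


# Illustrative sizes only: 2 GPUs, 3 accumulation steps, 4 global steps.
world_size = 2
gradient_accumulation_steps = 3
total_global_steps = 4

seen = []
for global_step in range(total_global_steps):
    for accumulation_step in range(gradient_accumulation_steps):
        for rank in range(world_size):
            seen.append(
                sample_index(
                    global_step, accumulation_step, rank,
                    gradient_accumulation_steps, world_size,
                )
            )

# Every index from 0 to 23 should turn up exactly once, in order.
assert seen == list(range(24))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;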

&lt;p&gt;That required a corresponding change in &lt;code&gt;load_dataset&lt;/code&gt; to make sure that &lt;code&gt;train_ds&lt;/code&gt; was
divisible by all of the world size, the per-GPU batch ("microbatch") size, and the number of gradient accumulation steps, but that was easy.  This line:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;one_full_batch_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;world_size&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;microbatch_size&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;seq_length&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;...became this:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;one_full_batch_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;world_size&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;microbatch_size&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;seq_length&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;That was enough to get the gradient accumulation happening!  Next, I needed to change
the backward pass code to scale down the loss so that we got averaged rather than summed
gradients.  Because we might be using AMP with a scaler, the code wasn't just a simple
&lt;code&gt;loss.backward&lt;/code&gt;:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_loss&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;train_loss&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;...but the change was obvious enough:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_loss&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_loss&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
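
&lt;p&gt;To see why dividing the loss gives averaged gradients, here's a toy sketch in plain Python.
It assumes a hypothetical one-parameter model that predicts &lt;code&gt;w * x&lt;/code&gt; with a mean squared error loss,
and equal-sized microbatches; none of the names come from the real training code.  Dividing a loss
by a constant divides its gradient by the same constant, so the accumulated result matches the
gradient from one big batch:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import math


def mean_loss_grad(w, xs, ys):
    # Gradient with respect to w of the mean of (w * x - y) ** 2.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 3.0, 5.0, 4.0, 6.0, 8.0]

# One big batch of six samples.
full_batch_grad = mean_loss_grad(w, xs, ys)

# Three microbatches of two samples each, accumulated.
gradient_accumulation_steps = 3
microbatch_size = 2
accumulated_grad = 0.0
summed_grad = 0.0
for step in range(gradient_accumulation_steps):
    start = step * microbatch_size
    micro_xs = xs[start:start + microbatch_size]
    micro_ys = ys[start:start + microbatch_size]
    micro_grad = mean_loss_grad(w, micro_xs, micro_ys)
    # Scaled version: what we get when the loss is divided before backward.
    accumulated_grad += micro_grad / gradient_accumulation_steps
    # Unscaled version: what plain accumulation would give us.
    summed_grad += micro_grad

# Scaled accumulation matches the big batch; unscaled is three times too large.
assert math.isclose(accumulated_grad, full_batch_grad)
assert math.isclose(summed_grad, full_batch_grad * gradient_accumulation_steps)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;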

&lt;p&gt;All of those changes put together, plus a bit of shuffling around of code,
were enough to get a correct gradient accumulation
training loop!  But there was one small tweak I needed to add.&lt;/p&gt;

&lt;p&gt;When you're using DDP, gradients need to be synchronised between the different
per-GPU processes.  As a reminder, what happens is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each process does a forward pass.&lt;/li&gt;
&lt;li&gt;Each process does a backward pass.&lt;/li&gt;
&lt;li&gt;When they have the gradients, they essentially share them so that each process has
an average of the gradients from all of those backward passes.&lt;/li&gt;
&lt;li&gt;Then they all step their optimisers to apply the average gradients to each process's
copy of the model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, with my first cut of the gradient accumulation code above, what would have happened is this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For each gradient accumulation step:
&lt;ul&gt;
&lt;li&gt;Each process does a forward pass.&lt;/li&gt;
&lt;li&gt;Each process does a backward pass.&lt;/li&gt;
&lt;li&gt;The average is worked out.&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;They all step their optimisers based on the most recent average.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That would be correct, but not very efficient.  We're sending out gradients and
averaging on every accumulation step.  But because each of our per-GPU processes
is keeping its own "local" average (by accumulating the scaled-down gradients), we only
really need to send those local averages out and get a global average once, just before we step
the optimiser.  If we do that, we can save quite a lot of work.&lt;/p&gt;

&lt;p&gt;The trick to avoid that redundant synchronisation was to use the &lt;code&gt;no_sync&lt;/code&gt; method on the &lt;code&gt;DistributedDataParallel&lt;/code&gt; object
that our own model is wrapped in.  What we wanted to do was suppress the gradient
synchronisation for each of the accumulation steps apart from the last one.&lt;/p&gt;

&lt;p&gt;It was easy to work out whether we were on the last gradient accumulation step:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;is_last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;accumulation_step&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Now, what we needed to do was to wrap this:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_loss&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_loss&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;...in &lt;code&gt;model.no_sync&lt;/code&gt;, but &lt;em&gt;only if&lt;/em&gt; &lt;code&gt;is_last&lt;/code&gt; was false.&lt;/p&gt;

&lt;p&gt;Conditional &lt;code&gt;with&lt;/code&gt; statements can be a little fiddly, but Python has a "do-nothing"
&lt;a href="https://docs.python.org/3/library/contextlib.html#contextlib.nullcontext"&gt;&lt;code&gt;nullcontext&lt;/code&gt; context manager in &lt;code&gt;contextlib&lt;/code&gt;&lt;/a&gt; -- that is,&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;nullcontext&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;...is identical to just:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="n"&gt;do_something&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;So we can combine that with the ternary operator like this:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;no_sync&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;is_last&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;nullcontext&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_loss&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_loss&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;gradient_accumulation_steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;...which does exactly what we want &lt;sup class="footnote-ref" id="fnref-1"&gt;&lt;a href="#fn-1"&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
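
&lt;p&gt;To see that pattern working without needing multiple GPUs, here's a toy sketch.
&lt;code&gt;FakeDDPModel&lt;/code&gt; and &lt;code&gt;run_global_step&lt;/code&gt; are hypothetical stand-ins that just count how often
a sync would happen; this is not how &lt;code&gt;DistributedDataParallel&lt;/code&gt; is implemented, and only the
&lt;code&gt;no_sync&lt;/code&gt; method name matches the real API:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;from contextlib import contextmanager, nullcontext


class FakeDDPModel:
    """A stand-in for a DDP-wrapped model; it only counts gradient syncs."""

    def __init__(self):
        self.sync_enabled = True
        self.sync_count = 0

    @contextmanager
    def no_sync(self):
        # Imitation of suppressing synchronisation inside the with block.
        previous = self.sync_enabled
        self.sync_enabled = False
        try:
            yield
        finally:
            self.sync_enabled = previous

    def backward(self):
        # Pretend backward pass: "synchronise" only when allowed to.
        if self.sync_enabled:
            self.sync_count += 1


def run_global_step(model, gradient_accumulation_steps):
    for accumulation_step in range(gradient_accumulation_steps):
        is_last = accumulation_step == gradient_accumulation_steps - 1
        with model.no_sync() if not is_last else nullcontext():
            model.backward()


model = FakeDDPModel()
run_global_step(model, gradient_accumulation_steps=16)

# Sixteen backward passes, but only the last one synchronised.
assert model.sync_count == 1
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;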

&lt;p&gt;With that change, I had something I was happy with; you can
&lt;a href="https://github.com/gpjt/ddp-base-model-from-scratch/commit/c5d860e29848287abd2c1f9af11a8b849d260aba"&gt;see the diff here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So now it was time to do a training run!&lt;/p&gt;

&lt;h3 id="the-new-local-baseline"&gt;The new local baseline&lt;/h3&gt;

&lt;p&gt;I'd originally been planning to jump right in and do a training run based on my
&lt;a href="/2026/04/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud"&gt;last cloud run&lt;/a&gt;,
with all of the interventions I'd decided were worth using, but locally with
gradient accumulation.&lt;/p&gt;

&lt;p&gt;However, I decided that it would be interesting to try doing a new "baseline" train
first.  I'd done my local training runs, and then established a baseline version in
the cloud by taking exactly the same configuration and doing the training run on
an 8x A100 40 GiB machine with an overall batch size of 96.  So I could repeat that locally
with gradient accumulation, and that would show two things (or perhaps, the same
thing but in different lights):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether the increased effective batch size had as positive an effect on the loss
as the increased real batch size did when I did my cloud runs.&lt;/li&gt;
&lt;li&gt;Whether the locally-trained gradient accumulation model was similar to the cloud-trained
big-batch model in terms of its loss.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That would help confirm my understanding that it was the
increased batch size that helped in the cloud, and not, say, some difference in the hardware architecture -- and would also act as a good test
of the gradient accumulation code.&lt;/p&gt;

&lt;p&gt;Here's &lt;a href="https://github.com/gpjt/ddp-base-model-from-scratch/tree/main/runs/1xrtx3090-baseline"&gt;the training run config&lt;/a&gt;.
I kicked it off:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;giles@perry:~/Dev/ddp-base-model-from-scratch&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;main&lt;span class="o"&gt;)&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;uv&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;torchrun&lt;span class="w"&gt; &lt;/span&gt;--nproc_per_node&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ddp_train.py&lt;span class="w"&gt; &lt;/span&gt;1xrtx3090-baseline&lt;span class="w"&gt; &lt;/span&gt;datasets/
Fetching&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;files:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;%&lt;span class="p"&gt;|&lt;/span&gt;████████████████████████████████████████████████████████████████████████████████████████████████████████████████████&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;/4&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;00&lt;/span&gt;:00&amp;lt;&lt;span class="m"&gt;00&lt;/span&gt;:00,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;810&lt;/span&gt;.65it/s&lt;span class="o"&gt;]&lt;/span&gt;
Starting&lt;span class="w"&gt; &lt;/span&gt;rank&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;training&lt;span class="w"&gt; &lt;/span&gt;at&lt;span class="w"&gt; &lt;/span&gt;global&lt;span class="w"&gt; &lt;/span&gt;step&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;%&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;                                                                                                                  &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;/33165&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;00&lt;/span&gt;:04&amp;lt;?,&lt;span class="w"&gt; &lt;/span&gt;?it/s,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;.991,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;tps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;,451&lt;span class="o"&gt;]&lt;/span&gt;


Checkpoint

Continuing&lt;span class="w"&gt; &lt;/span&gt;training
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;%&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;                                                                                                       &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;23&lt;/span&gt;/33165&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;01&lt;/span&gt;:50&amp;lt;&lt;span class="m"&gt;43&lt;/span&gt;:29:50,&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;.72s/it,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;7&lt;/span&gt;.625,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;tps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;20&lt;/span&gt;,549&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;That looked like the right number of global steps; it matched the numbers I saw when
training in the cloud.  And 44 hours for the training run seemed correct: my original
local runs took 48, but with them I was spending quite a lot of time on validation,
which this code didn't do.&lt;/p&gt;

&lt;p&gt;Just less than two days later:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Training complete in 155,402.289 seconds
Tokens seen: 3,260,252,160
Throughput: 20,979 tokens/second
Final train loss: 3.738
&lt;/code&gt;&lt;/pre&gt;
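
&lt;p&gt;Those numbers can be cross-checked with a little arithmetic.  This sketch assumes a sequence
length of 1,024 tokens, which isn't stated in this section but is consistent with the token count
above; the step count comes from the progress bar, and the batch size of 96 is the total across
all accumulation steps:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;global_steps = 33165       # from the progress bar
total_batch_size = 96      # sequences per optimiser step
seq_length = 1024          # assumption: not stated in this section

tokens_seen = global_steps * total_batch_size * seq_length
assert tokens_seen == 3_260_252_160

training_seconds = 155_402.289
throughput = tokens_seen / training_seconds
assert round(throughput) == 20_979

# A bit more than 43 hours, so just under two days.
hours = training_seconds / 3600
assert int(hours) == 43
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;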

&lt;p&gt;That all looked good.  The loss chart looked like this:&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation/baseline-loss-chart.png" alt="The loss chart for the local baseline train" title="The loss chart for the local baseline train" /&gt;&lt;/p&gt;

&lt;p&gt;For comparison, here's the one from the cloud training run with the same config
(but using larger batches rather than gradient accumulation):&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32a-interventions-baseline-model/baseline-training-run-chart.png" alt="Baseline training run on an 8x A100 with 40 GiB/GPU" title="Baseline training run on an 8x A100 with 40 GiB/GPU" /&gt;&lt;/p&gt;

&lt;p&gt;You can see that they're similar, but not identical.  That's pretty much what you'd expect!
The two training runs were on different architectures -- RTX 3090 vs A100 -- and so there
will probably be differences in the CUDA kernels, and also PyTorch's AMP (which uses
16-bit instead of 32-bit in cases where it makes sense) might make different decisions.
I think that if we'd run it on a machine with one A100, then the results of using gradient accumulation
would be even closer (perhaps even identical) to a larger batch size, especially if we
were training without AMP.&lt;/p&gt;

&lt;p&gt;I &lt;a href="https://huggingface.co/gpjt/1xrtx3090-baseline"&gt;uploaded the model to Hugging Face&lt;/a&gt; and it
was time for the evals.  The smoke test first:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you to a more realistic approach and is always welcome, especially for a younger child!
I’
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As usual, reasonably coherent.  But the important one was the loss on the test set:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loss against our test dataset: 3.683835
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That's solid!  The cloud-trained baseline model got 3.691526, so this local one was
actually very slightly better, by 0.007691.  But that's very close indeed, which is
what we wanted to see :-)&lt;/p&gt;

&lt;p&gt;It was time to see what effect adding on the interventions would have.&lt;/p&gt;

&lt;h3 id="the-local-run-with-the-interventions"&gt;The local run with the interventions&lt;/h3&gt;

&lt;p&gt;As a reminder, here are the changes I made to the config for this run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gradient clipping at 3.5&lt;/li&gt;
&lt;li&gt;Learning rate changed from 0.0004 to 0.0014, with a warmup over 5% of the run then a cosine decay to 0.00014.&lt;/li&gt;
&lt;li&gt;Weight decay changed from 0.1 to 0.01&lt;/li&gt;
&lt;li&gt;Dropout removed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It did not include QKV bias.&lt;/p&gt;
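
&lt;p&gt;For a feel of what that learning rate schedule looks like, here's a rough sketch.  It makes
several assumptions that aren't confirmed in this section: the warmup is linear, it starts from
close to zero, and the decay runs over all of the remaining steps.  The function name
&lt;code&gt;learning_rate&lt;/code&gt; and its parameters are hypothetical, and the real training code may well differ:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;import math


def learning_rate(step, total_steps, peak_lr=0.0014, min_lr=0.00014,
                  warmup_fraction=0.05):
    warmup_steps = int(total_steps * warmup_fraction)
    # Assumed linear warmup: ramps up to 1.0 and then stays there.
    warmup_scale = min(1.0, (step + 1) / warmup_steps)
    # Cosine decay from peak_lr to min_lr over the post-warmup steps;
    # during warmup the progress is pinned at zero, so this is just peak_lr.
    decay_progress = max(0, step - warmup_steps) / (total_steps - warmup_steps)
    cosine_lr = min_lr + 0.5 * (peak_lr - min_lr) * (
        1 + math.cos(math.pi * decay_progress)
    )
    return warmup_scale * cosine_lr


total_steps = 33165
warmup_steps = int(total_steps * 0.05)

# The peak is reached at the end of the warmup...
assert math.isclose(learning_rate(warmup_steps - 1, total_steps), 0.0014)
# ...and the schedule bottoms out at a tenth of that.
assert math.isclose(learning_rate(total_steps, total_steps), 0.00014)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;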

&lt;p&gt;&lt;a href="https://github.com/gpjt/ddp-base-model-from-scratch/tree/main/runs/1xrtx3090-stacked-interventions"&gt;Here's the config&lt;/a&gt;.
I kicked it off, and:&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;giles@perry:~/Dev/ddp-base-model-from-scratch&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;main&lt;span class="o"&gt;)&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;uv&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;torchrun&lt;span class="w"&gt; &lt;/span&gt;--nproc_per_node&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ddp_train.py&lt;span class="w"&gt; &lt;/span&gt;1xrtx3090-stacked-interventions&lt;span class="w"&gt; &lt;/span&gt;datasets
Fetching&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;files:&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;%&lt;span class="p"&gt;|&lt;/span&gt;████████████████████████████████████████████████████████████████████████████████████████████████████████████████████&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;/4&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;00&lt;/span&gt;:00&amp;lt;&lt;span class="m"&gt;00&lt;/span&gt;:00,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;749&lt;/span&gt;.32it/s&lt;span class="o"&gt;]&lt;/span&gt;
Starting&lt;span class="w"&gt; &lt;/span&gt;rank&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;training&lt;span class="w"&gt; &lt;/span&gt;at&lt;span class="w"&gt; &lt;/span&gt;global&lt;span class="w"&gt; &lt;/span&gt;step&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;%&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;                                                                                                                  &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;/33165&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;00&lt;/span&gt;:04&amp;lt;?,&lt;span class="w"&gt; &lt;/span&gt;?it/s,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;.994,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;tps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;21&lt;/span&gt;,589&lt;span class="o"&gt;]&lt;/span&gt;


Checkpoint

Continuing&lt;span class="w"&gt; &lt;/span&gt;training
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;%&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt;                                                                                                      &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;19&lt;/span&gt;/33165&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="m"&gt;01&lt;/span&gt;:25&amp;lt;&lt;span class="m"&gt;40&lt;/span&gt;:52:06,&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="m"&gt;4&lt;/span&gt;.44s/it,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;loss&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;.318,&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;tps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;21&lt;/span&gt;,867&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;It looked like it was going to take about 41 hours, a bit quicker than the baseline; that matched what happened in the cloud
runs, as removing dropout speeds things up quite a lot.&lt;/p&gt;

&lt;p&gt;Just less than two days later:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Training complete in 146,299.816 seconds
Tokens seen: 3,260,252,160
Throughput: 22,285 tokens/second
Final train loss: 3.519
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The loss chart over the training run looked like this:&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation/interventions-loss-chart.png" alt="The loss chart for the local interventions train" title="The loss chart for the local interventions train" /&gt;&lt;/p&gt;

&lt;p&gt;That's very smooth, with no loss spikes.  For comparison, here's the chart when we did the same training run in the cloud;
you can see that it was a bit choppier than the local one.&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud/stacked-interventions-1-loss-chart.png" alt="The loss chart for the cloud interventions run" title="The loss chart for the cloud interventions run" /&gt;&lt;/p&gt;

&lt;p&gt;The gradient norm chart was also interesting:&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation/interventions-grad-norm-chart.png" alt="The gradient norm chart for the local interventions train" title="The gradient norm chart for the local interventions train" /&gt;&lt;/p&gt;

&lt;p&gt;If you compare it to the one from the cloud training run below, you can see that
the local one was actually noisier -- the cloud run has a few gradient spikes near
the start but calms down from around global step 6,000 or so, whereas the local one is
spiky up to about 3,000, then calm, but has a massive spike at around 10,000.&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud/stacked-interventions-1-grad-norm-chart.png" alt="The gradient norm chart for the cloud interventions run" title="The gradient norm chart for the cloud interventions run" /&gt;&lt;/p&gt;

&lt;p&gt;The learning rate we don't need to compare, but it was worth sanity checking to make sure
we really did train the right way:&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation/interventions-learning-rate-chart.png" alt="The learning rate chart for the local interventions train" title="The learning rate chart for the local interventions train" /&gt;&lt;/p&gt;

&lt;p&gt;So that all looked good.  The training run did have some differences to the cloud one,
but (as with the previous baseline train) it looked similar enough.  Architectural
differences between the A100s in the cloud and the local RTX 3090 seemed like a plausible
cause.&lt;/p&gt;

&lt;h3 id="evals"&gt;Evals&lt;/h3&gt;

&lt;p&gt;I &lt;a href="https://huggingface.co/gpjt/1xrtx3090-stacked-interventions"&gt;uploaded the model to Hugging Face&lt;/a&gt;,
and it was time to run the evals.  The smoke test first:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you to give your customers the opportunity you give them in the event of a failure.&amp;lt;|endoftext|&amp;gt;A couple friends
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Reasonably coherent -- and I think that's the first time I've seen an &lt;code&gt;&amp;lt;|endoftext|&amp;gt;&lt;/code&gt; token
in a smoke test output!  But the important one is, as ever, the loss, and:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Loss against our test dataset: 3.538161
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let's add both this one and the local baseline to the results table for all interventions:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;Test set loss&lt;/th&gt;
  &lt;th&gt;Improvement vs baseline&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/03/llm-from-scratch-32g-interventions-weight-tying"&gt;8xa100m40-weight-tying&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.874305&lt;/td&gt;
  &lt;td&gt;-0.182779&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/03/llm-from-scratch-32f-interventions-weight-decay"&gt;8xa100m40-weight-decay-cerebras&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.813856&lt;/td&gt;
  &lt;td&gt;-0.122330&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/02/llm-from-scratch-32a-interventions-baseline-model"&gt;8xa100m40-baseline&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.691526&lt;/td&gt;
  &lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/04/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation"&gt;1xrtx3090-baseline&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.683835&lt;/td&gt;
  &lt;td&gt;0.007691&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/04/llm-from-scratch-32h-interventions-full-fat-float32"&gt;8xa100m80-no-amp&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.678968&lt;/td&gt;
  &lt;td&gt;0.012558&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/02/llm-from-scratch-32b-interventions-gradient-clipping"&gt;8xa100m40-gradient-clipping&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.678317&lt;/td&gt;
  &lt;td&gt;0.013209&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/02/llm-from-scratch-32d-interventions-adding-attention-bias"&gt;8xa100m40-qkv-bias&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.669385&lt;/td&gt;
  &lt;td&gt;0.022141&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/03/llm-from-scratch-32f-interventions-weight-decay"&gt;8xa100m40-weight-decay-gpt2&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.642940&lt;/td&gt;
  &lt;td&gt;0.048586&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/02/llm-from-scratch-32c-interventions-removing-dropout"&gt;8xa100m40-remove-dropout&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.641282&lt;/td&gt;
  &lt;td&gt;0.050244&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/04/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud"&gt;8xa100m40-stacked-interventions-2&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.604342&lt;/td&gt;
  &lt;td&gt;0.087184&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/03/llm-from-scratch-32e-interventions-learning-rate"&gt;8xa100m40-schedule-learning-rate&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.601917&lt;/td&gt;
  &lt;td&gt;0.089609&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/04/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud"&gt;8xa100m40-stacked-interventions-3&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.590266&lt;/td&gt;
  &lt;td&gt;0.101260&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/04/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud"&gt;8xa100m40-stacked-interventions-1&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.577761&lt;/td&gt;
  &lt;td&gt;0.113765&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="/2026/04/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation"&gt;1xrtx3090-stacked-interventions&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;3.538161&lt;/td&gt;
  &lt;td&gt;0.153365&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;That's really weird!  The local run with the interventions, &lt;code&gt;1xrtx3090-stacked-interventions&lt;/code&gt;,
is 0.039600 points better than the cloud version of the
same training run, &lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt;.  That's nice, in that lower loss is always better, but it's also
rather confusing -- that's a bigger loss improvement than some of the interventions.&lt;/p&gt;

&lt;p&gt;In theory, all that we changed between the cloud version of this training run
and the local one
was the architecture.  I was expecting that to have an effect, but thought that
it would be small -- as, indeed, it was with the baseline trains &lt;code&gt;8xa100m40-baseline&lt;/code&gt;
and &lt;code&gt;1xrtx3090-baseline&lt;/code&gt;, where you can see the loss difference was just 0.007691 --
about five times smaller.&lt;/p&gt;
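
&lt;p&gt;As a quick check on that arithmetic, here's a minimal sketch that recomputes both gaps from the six-decimal-place test losses reported for these four runs.  The &lt;code&gt;local_gain&lt;/code&gt; helper is just an illustrative name, not something from the training code.&lt;/p&gt;

```python
# Test losses for the four runs, to six decimal places.
losses = {
    "8xa100m40-baseline": 3.691526,
    "1xrtx3090-baseline": 3.683835,
    "8xa100m40-stacked-interventions-1": 3.577761,
    "1xrtx3090-stacked-interventions": 3.538161,
}


def local_gain(cloud_name, local_name):
    """How much lower the local run's test loss is than its cloud twin's."""
    return losses[cloud_name] - losses[local_name]


baseline_gap = local_gain("8xa100m40-baseline", "1xrtx3090-baseline")
stacked_gap = local_gain(
    "8xa100m40-stacked-interventions-1", "1xrtx3090-stacked-interventions"
)

print(f"baseline gap: {baseline_gap:.6f}")
print(f"stacked gap:  {stacked_gap:.6f}")
print(f"ratio:        {stacked_gap / baseline_gap:.2f}")
```

&lt;p&gt;The ratio comes out a little over five, which is where "about five times smaller" comes from.&lt;/p&gt;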

&lt;p&gt;Now, when I was looking into
&lt;a href="/2026/04/llm-from-scratch-32i-interventions-what-is-in-the-noise"&gt;the effects of noise on training loss&lt;/a&gt;,
I found that changing the random seed that was used to initialise the weights (but starting
the training run itself at the same random seed) had
a much bigger effect on the resulting model quality than keeping the weights identical
but varying the seed at the start of the post-initialisation phase of the training run.
The standard deviation of the varied-weights, same-train models was about double the SD of the
same-weights, varied-train.&lt;/p&gt;

&lt;p&gt;That was interesting, though not directly comparable -- those tests all used
the same training run, with the architecture held constant -- an 8x A100 40 GiB machine
for each test.&lt;/p&gt;

&lt;p&gt;However, it felt like it would be a good idea to at least see whether we started with
the same weights locally and when training in the cloud.  My suspicion was that we probably
would; the weight initialisation uses deterministic non-GPU code, so with the same seed
we'd expect the same weights regardless of the computer.  The similarity of the loss
results for the local and cloud baseline training runs also seemed to point in that
direction.&lt;/p&gt;

&lt;p&gt;But it was worth testing.  I created a throwaway branch of the training code, which
-- after creating the model --
just dumped the model weights to a file, then exited.
I ran it locally using
the &lt;code&gt;1xrtx3090-stacked-interventions&lt;/code&gt; config, and
then I fired up yet another 8x A100 40 GiB machine on Lambda, ran the same code there,
this time with the &lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt; config, and
then &lt;code&gt;scp&lt;/code&gt;ed down the weights.&lt;/p&gt;

&lt;div class="codehilite"&gt;
&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;giles@perry:~/Dev/ddp-base-model-from-scratch&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;dump-weights&lt;span class="o"&gt;)&lt;/span&gt;$&lt;span class="w"&gt; &lt;/span&gt;diff&lt;span class="w"&gt; &lt;/span&gt;cloud-init-weights.safetensors&lt;span class="w"&gt; &lt;/span&gt;local-init-weights.safetensors
giles@perry:~/Dev/ddp-base-model-from-scratch&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;dump-weights&lt;span class="o"&gt;)&lt;/span&gt;$
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Identical.  That was reassuring!&lt;/p&gt;
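
&lt;p&gt;If you'd rather do that check from Python than with &lt;code&gt;diff&lt;/code&gt;, something like the following should work.  It's a minimal sketch using only the standard library; the function names are my own for illustration, and it only compares raw bytes.  Byte-identical files mean identical weights, but the reverse isn't guaranteed: two dumps could in principle differ in header metadata while holding the same tensors.&lt;/p&gt;

```python
import filecmp
import hashlib


def files_identical(path_a, path_b):
    """Byte-for-byte comparison, like running diff on two binary files."""
    # shallow=False forces a real content comparison rather than trusting
    # os.stat() metadata such as size and modification time.
    return filecmp.cmp(path_a, path_b, shallow=False)


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks, so a large weights dump need not fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()
```

&lt;p&gt;Calling &lt;code&gt;files_identical("cloud-init-weights.safetensors", "local-init-weights.safetensors")&lt;/code&gt; would be the equivalent of the &lt;code&gt;diff&lt;/code&gt; above.  The hash version is handy when the two files live on different machines, as you can compare the digests without copying anything.&lt;/p&gt;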

&lt;p&gt;I considered doing more analysis on this; for example, in my investigations into noise,
I found that when I kept the same weights but altered the random seed for the rest of the
training run, I got results with a standard deviation of 0.008672 -- more than four times
smaller than the difference between the local and cloud trains with the interventions.
Might that be a number I could use for some kind of comparison?&lt;/p&gt;

&lt;p&gt;However, I decided that it's not really comparable.  That number was from varying the random seed, but
keeping the same architecture.  There's not really any solid reason to believe that
keeping the seed constant but changing the architecture would cause the same kind of differences.
They might be more similar, they might be less.&lt;/p&gt;

&lt;p&gt;I think that all we can really say here is that the change of machine changed some
aspects of the training dynamics in a way that happened to get us a lower loss.  I
can easily imagine that if I'd done something slightly different -- used a local
RTX 4090, for example -- it could equally well have gone in the other direction.&lt;/p&gt;

&lt;p&gt;And at least it's reassuring that the improvement was smaller than the
interventions I was most convinced by; the only smaller ones were full-fat float32, gradient clipping, and
QKV bias -- ones that I'd already decided might have only been beneficial due to noise.
Most importantly, it was more than six times smaller than the 0.252474 improvement
I originally saw when I moved from local training to larger-batch cloud training.&lt;/p&gt;

&lt;h3 id="conclusion"&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;So, I think that that brings me to the end of this set of training experiments.
We started with a locally-trained model that got a loss of 3.943522 on our test set,
compared to the original GPT-2 small model, which got 3.499677 &lt;sup class="footnote-ref" id="fnref-2"&gt;&lt;a href="#fn-2"&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;I've tried a bunch of interventions to try to get my model closer, and finally I've
managed to get almost all of the way there, to 3.538161.  That's really pleasing!&lt;/p&gt;

&lt;p&gt;I think that there are two things to do before I can fully wrap up this "interventions"
mini-series, and get back to the main-line LLM from scratch stuff.&lt;/p&gt;

&lt;p&gt;Firstly, I should
revisit the instruction fine-tuning tests, which I put on hold while doing these
training runs.  That would give us some indication as to whether the loss improvement was just a technical improvement
that made a number go down, or whether it actually improved the usefulness of the model.&lt;/p&gt;

&lt;p&gt;Secondly, I think I really need to write a wrap-up.  I've been working on this stuff
on and off since December, and I think a summary of what I did would be quite nice!&lt;/p&gt;

&lt;p&gt;I'll post soon; don't touch that dial :-)&lt;/p&gt;

&lt;p&gt;&lt;a href="/2026/04/llm-from-scratch-32l-interventions-instruction-fine-tuning-tests"&gt;Here's a link to the next post in this series&lt;/a&gt;.&lt;/p&gt;

&lt;div class="footnotes"&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id="fn-1"&gt;
&lt;p&gt;Thanks to &lt;a href="https://stackoverflow.com/a/34798330"&gt;this Stack Overflow answer&lt;/a&gt; for that trick.&amp;#160;&lt;a href="#fnref-1" class="footnoteBackLink" title="Jump back to footnote 1 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-2"&gt;
&lt;p&gt;I'm going to switch to six decimal places from now on -- previously I was
rounding it to three, hence 3.500.&amp;#160;&lt;a href="#fnref-2" class="footnoteBackLink" title="Jump back to footnote 2 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description><guid isPermaLink="false">/2026/04/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation</guid><pubDate>Wed, 15 Apr 2026 20:00:00 +0000</pubDate></item><item><title>How an LLM becomes more coherent as we train it</title><link>https://www.gilesthomas.com/2026/04/how-an-llm-becomes-more-coherent-over-training</link><description>&lt;p&gt;I remember finding it interesting when, back
in 2015, Andrej Karpathy posted about RNNs and
&lt;a href="https://karpathy.github.io/2015/05/21/rnn-effectiveness/#the-evolution-of-samples-while-training"&gt;gave an example of how their output improves over the course of a training run&lt;/a&gt;.
What might that look like for a (relatively) modern transformer-based LLM?&lt;/p&gt;

&lt;p&gt;I recently trained a GPT-2-small-style LLM, with 163 million parameters, on about
3.2 billion tokens (that's about 12.8 GiB of text) from the Hugging Face
&lt;a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"&gt;FineWeb&lt;/a&gt; dataset, and
over the course of that training run, I saved the current model periodically
-- 57 checkpoints over two days.&lt;/p&gt;

&lt;p&gt;Here's what it looked like -- the start, the end, and some interesting waypoints
in between.&lt;/p&gt;
&lt;p&gt;For each checkpoint, I asked it to generate a completion to the words "Every effort
moves you". &lt;sup class="footnote-ref" id="fnref-1"&gt;&lt;a href="#fn-1"&gt;1&lt;/a&gt;&lt;/sup&gt;  When the model was first created, before any training had been done, it
came up with this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves youhhhh esoteric Suns 1896ricia enormous initially
speculative arenaelse anth Zimmerman Insight Sketch demonstr despicable
capitalists clamp flung condemnation
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If you've read the Karpathy essay, you'll see one important difference -- it's already got words in there.
His RNNs were generating complete noise at this stage.
Even by the 100th iteration, he gives an example like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;tyntd-iafhatawiaoihrdemot  lytdws  e ,tfti, astai f ogoh eoase rrranbyne
'nhthnee e plia tklrgd t o idoe ns,smtt   h ne etie h,hregtrs nigtike,aoaenns lng
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That's an important difference between the RNNs he was talking about, which were
character-based and had to learn about words and the like, and LLMs like this one,
where the text is input and then output one token at a time.
(&lt;a href="/2025/08/what-ai-chatbots-are-doing-under-the-hood"&gt;More info here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Still, even though it has what looks like words, it's essentially content-free token
salad with no structure or coherence &lt;sup class="footnote-ref" id="fnref-2"&gt;&lt;a href="#fn-2"&gt;2&lt;/a&gt;&lt;/sup&gt;.  Let's see what happens if we train it more.&lt;/p&gt;

&lt;p&gt;In my training loop, it sees 96 sequences of 1,024 tokens, and then we update it
based on its loss (an index of how wrong it was at predicting next tokens), so that's 98,304 tokens for each step.
After 617 of these &lt;sup class="footnote-ref" id="fnref-3"&gt;&lt;a href="#fn-3"&gt;3&lt;/a&gt;&lt;/sup&gt; it seems to have mostly learned something about which
tokens are most common:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you and to was, in the, a, The
 your of- and
| to the The
&lt;/code&gt;&lt;/pre&gt;
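
&lt;p&gt;The token counts here are just multiplication, but for anyone who wants to check them, this is a minimal sketch.  The constant and function names are purely illustrative.&lt;/p&gt;

```python
SEQUENCES_PER_STEP = 96     # sequences the model sees before each update
TOKENS_PER_SEQUENCE = 1024  # tokens in each of those sequences

TOKENS_PER_STEP = SEQUENCES_PER_STEP * TOKENS_PER_SEQUENCE


def tokens_seen(step):
    """Total training tokens consumed once `step` updates have been done."""
    return step * TOKENS_PER_STEP


print(f"{TOKENS_PER_STEP:,} tokens per step")
print(f"{tokens_seen(617):,} tokens seen after 617 steps")
```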

&lt;p&gt;By the next checkpoint at step 1234, we've got
something that's starting to come together.
It doesn't make sense, but there's some kind of glimmering of meaning:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you’ll take the rest of the mainstay in all of his team. This
year with a
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And just a little while later, at the checkpoint at step 2468, we have something
that actually makes some kind of sense (at least at the start)!&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you to a different country. For all the most part, a world
map can only see the world map
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now, the training data I'm using was scraped from the Internet, and unsurprisingly
there's a lot of somewhat cheesy business content there.  By step 9255, we're starting
to get a lot of stuff like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you forward and it is important to make sure that your
clients are satisfied. A number of people have
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;...or even more cheesy self-help stuff (step 10489):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you to be the best that you will ever have. To be your best,
you should be able to
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To be fair, the starting point of "Every effort moves you" is probably biasing things
a bit there.&lt;/p&gt;

&lt;p&gt;But let's be clear: by this point it's seen 1,031,110,656 tokens -- that is, it's about
one third trained.  And it's coming up with pretty coherent text!  The rest of the training
run is more about refining things -- the loss chart for this training run looks like this:&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation/interventions-loss-chart.png" alt="The loss chart for the local interventions train" title="The loss chart for the local interventions train" /&gt;&lt;/p&gt;

&lt;p&gt;Loosely speaking, the lower the loss number, the better the model is, so you can see that the bulk of the improvement had happened by this point.
From here on, I'll just give a few of the more interesting samples:&lt;/p&gt;

&lt;p&gt;By step 14191, it's started using bullet points...&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you towards your goals.
- Develop meaningful habits or habits that promote your business
- Keep personal and
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Step 24680 -- more motivational stuff:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you forward and keeps you motivated. You make sure you don’t
leave it alone.
A
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Step 25297 -- small models like this do like repeating themselves.  You might remember
seeing ChatGPT output back in 2023 or so that had tics like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you from a simple position to a complex issue of complexity
and complexity.
As soon as the book takes
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And again at step 26531:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you, the company, the company, the community and all those
involved. I will be pleased to say
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;At step 27765 it decides that it has had enough after generating just a couple of
words and tries to start a new document:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you to the next level.&amp;lt;|endoftext|&amp;gt;Hip Hop: The New York
Times, April 23, 2017
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;But step 28382 is actually rather good.  I particularly like the "however":&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you, however, towards a better future, and that’s what counts
as a win.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And finally, the training run finishes at step 33164 with these wise words of caution:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Every effort moves you, and you’re rewarded, but not to your potential. You’ve
got to
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Well worth remembering, I'm sure we can all agree.  I wonder what deep wisdom
we'd have gained if I had asked it to generate more than 20 new tokens...&lt;/p&gt;

&lt;p&gt;What I found most surprising when I first
started playing with this is how &lt;em&gt;fast&lt;/em&gt; even simple LLMs got to a stage where they could generate
plausible text.  Just one third of the way through the training run, this model
was making some kind of sense.&lt;/p&gt;
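
&lt;p&gt;To put a number on "one third": taking step 10489 (the self-help sample) as the point in question, and step 33164 as the end of the run, a minimal sketch of the calculation looks like this.  The &lt;code&gt;progress&lt;/code&gt; helper is an illustrative name only.&lt;/p&gt;

```python
TOKENS_PER_STEP = 96 * 1024  # 96 sequences of 1,024 tokens per update
FINAL_STEP = 33164           # the step at which the training run finished


def progress(step):
    """Return (tokens seen, fraction of the run completed) at a given step."""
    return step * TOKENS_PER_STEP, step / FINAL_STEP


tokens, fraction = progress(10489)
print(f"{tokens:,} tokens seen, {fraction:.1%} of the run")
```

&lt;p&gt;That works out at a little under 32% of the way through.&lt;/p&gt;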

&lt;p&gt;The problem, of course, is that we don't just want generators of plausible content --
we want that content to make sense and be correct.  And that's why it's worth
grinding through the other two thirds -- in the hope that when you ask it to complete
"The capital of France is", it will reply with "Paris" rather than a coherent but
wrong answer like "Rouen".&lt;/p&gt;

&lt;div class="footnotes"&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id="fn-1"&gt;
&lt;p&gt;Technical details: 20 GPT-2 tokens generated on top of the initial text, with
a temperature of 1.  I've added line breaks to make it easier to read the samples.&amp;#160;&lt;a href="#fnref-1" class="footnoteBackLink" title="Jump back to footnote 1 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-2"&gt;
&lt;p&gt;Well, it mentions " despicable capitalists", but I suspect that's just randomness
rather than some kind of primitive political consciousness.  Including the
space at the start, that's tokens 47034 and 32663 in the GPT-2 tokeniser.&amp;#160;&lt;a href="#fnref-2" class="footnoteBackLink" title="Jump back to footnote 2 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-3"&gt;
&lt;p&gt;So, 60,653,568 tokens seen.&amp;#160;&lt;a href="#fnref-3" class="footnoteBackLink" title="Jump back to footnote 3 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description><guid isPermaLink="false">/2026/04/how-an-llm-becomes-more-coherent-over-training</guid><pubDate>Fri, 17 Apr 2026 23:30:00 +0000</pubDate></item><item><title>Writing an LLM from scratch, part 32l -- Interventions: updated instruction fine-tuning results</title><link>https://www.gilesthomas.com/2026/04/llm-from-scratch-32l-interventions-instruction-fine-tuning-tests</link><description>&lt;p&gt;I've been working on a GPT-2-small-style LLM based on
&lt;a href="https://sebastianraschka.com/"&gt;Sebastian Raschka&lt;/a&gt;'s book
"&lt;a href="https://www.manning.com/books/build-a-large-language-model-from-scratch"&gt;Build a Large Language Model (from Scratch)&lt;/a&gt;",
and have tried a bunch of different things to see if I could get it to approach
the quality of the original OpenAI GPT-2-small, measured in terms of loss on a
held-back test dataset.  After working through them, in my
&lt;a href="/2026/04/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation"&gt;last post&lt;/a&gt;,
I managed to train one that was almost (if not quite) there.&lt;/p&gt;

&lt;p&gt;Now, back before I started digging into these interventions, I was doing three evals
for each model I built: a smoke test (to see if it could give a coherent completion
to "Every effort moves you"), a test for that test set loss, and
an instruction-following test that fine-tuned the model on the &lt;a href="https://crfm.stanford.edu/2023/03/13/alpaca.html"&gt;Alpaca&lt;/a&gt;
dataset, got it to generate results for a test set of instructions, and then used
an LLM as a judge to score them.&lt;/p&gt;

&lt;p&gt;The idea behind this was that the loss on the test set was an interesting technical measure
of the quality of a model, but it didn't really tell us much about how useful it might
be in reality.&lt;/p&gt;

&lt;p&gt;Unfortunately, in January, &lt;a href="/2026/01/llm-from-scratch-30-digging-into-llm-as-a-judge"&gt;I realised that my methodology was bad&lt;/a&gt;;
because I was asking the LLM
to score a model in isolation, the LLM's natural randomness would mean that results were
not really comparable, at least for models that were reasonably close in quality.&lt;/p&gt;

&lt;p&gt;For example, if two models both replied to&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Name the author of 'Pride and Prejudice'.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;The author of 'Pride and Prejudice' is Sarah Palin.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;...then one run of the instruction-following test might "find the judge LLM in a good mood"
and get, say, 5% -- after all, the model &lt;em&gt;tried&lt;/em&gt; to answer, and actually used a real person's
name, even if the answer was totally wrong.  But in another run, the judge might be in a
"worse mood" and score it at 0%.&lt;/p&gt;

&lt;p&gt;My fix was to have two scripts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One that fine-tuned the model then got it to generate responses, then saved those responses in a file.&lt;/li&gt;
&lt;li&gt;One that took a bunch of files generated by the above, one for each of a set of different models, and presented them to the
LLM together, so that it would (hopefully) be consistent in how it rated them relative to each other.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The details are &lt;a href="/2026/01/llm-from-scratch-30-digging-into-llm-as-a-judge"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Because doing it that way was significantly more work, I've not been doing these tests as part of the interventions mini-series.
I felt it would make more sense to wait until I'd tried a bunch of interventions
and got a number of models to try.&lt;/p&gt;

&lt;p&gt;Now I have those, so let's give it a go!&lt;/p&gt;
&lt;h3 id="the-background-and-the-last-test"&gt;The background, and the last test&lt;/h3&gt;

&lt;p&gt;At the end of the previous round of IFT tests, I had this table.  It's sorted
by the loss on the test set (shown to 3 decimal places), and has the score that
the model got from an instruction fine-tuning run:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;Test loss&lt;/th&gt;
  &lt;th&gt;IFT score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;OpenAI weights: medium&lt;/td&gt;
  &lt;td&gt;3.231&lt;/td&gt;
  &lt;td&gt;39.64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;OpenAI weights: small&lt;/td&gt;
  &lt;td&gt;3.500&lt;/td&gt;
  &lt;td&gt;16.66&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x A100 40 GiB&lt;/td&gt;
  &lt;td&gt;3.674&lt;/td&gt;
  &lt;td&gt;16.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x H100 80 GiB&lt;/td&gt;
  &lt;td&gt;3.725&lt;/td&gt;
  &lt;td&gt;11.59&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x A100 80 GiB&lt;/td&gt;
  &lt;td&gt;3.730&lt;/td&gt;
  &lt;td&gt;11.23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x B200 160 GiB&lt;/td&gt;
  &lt;td&gt;3.771&lt;/td&gt;
  &lt;td&gt;11.59&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb train&lt;/td&gt;
  &lt;td&gt;3.944&lt;/td&gt;
  &lt;td&gt;11.32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb-Edu extended train&lt;/td&gt;
  &lt;td&gt;4.135&lt;/td&gt;
  &lt;td&gt;16.41&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb-Edu train&lt;/td&gt;
  &lt;td&gt;4.167&lt;/td&gt;
  &lt;td&gt;15.77&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;There's a loose correlation where lower loss means a higher IFT score, with two
weird exceptions: the two FineWeb-Edu training runs, which got much higher results
than you'd expect from the loss.&lt;/p&gt;

&lt;p&gt;My working hypothesis was that there were two components that led to a model getting a good
score:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Its raw intelligence: lower-loss models were smarter, so they were better at
instruction-following after the fine-tune.&lt;/li&gt;
&lt;li&gt;Its knowledge.  All of the models -- mine and OpenAI's -- apart from the FineWeb-Edu ones were trained on
what amounted to minimally-curated data from the Internet.  But FineWeb-Edu is
meant to be "the most educational" subset of FineWeb, so it presumably is more
dense in useful facts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So in those terms, the OpenAI models and Cloud FineWeb, 8x A100 40 GiB might be smart but not know very much, and the FineWeb-Edu ones
might be dumb but knowledgeable.  The ones in between, by contrast, could be relatively dumb too, but
also not know very much.&lt;/p&gt;

&lt;p&gt;There was one other oddity: the Cloud FineWeb, 8x A100 40 GiB
model seemed surprisingly good on the IFT results when considering its loss -- but perhaps
there was some kind of step function, where as soon as a model got better than (say) 3.7
on the loss, it suddenly became smart in whatever way mattered.&lt;/p&gt;

&lt;p&gt;All very hand-wavy, of course, but it was a hypothesis of sorts.
Would the new models fit that pattern?  It was time to find out.&lt;/p&gt;

&lt;h3 id="the-initial-run-and-the-mystery"&gt;The initial run, and the mystery&lt;/h3&gt;

&lt;p&gt;I didn't think it was worth adding all 14 models that I've trained in my intervention-testing
to that table, so I decided to just add four of them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;8xa100m40-baseline&lt;/code&gt;, the &lt;a href="https://www.gilesthomas.com/2026/02/llm-from-scratch-32a-interventions-baseline-model"&gt;baseline cloud-trained model for all of the interventions&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;1xrtx3090-baseline&lt;/code&gt;, the locally-trained version of the same -- the first model from &lt;a href="/2026/04/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation"&gt;this post&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt;, &lt;a href="/2026/04/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud"&gt;the best model we managed to get in the cloud&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;1xrtx3090-stacked-interventions&lt;/code&gt;, the best local model -- the second from &lt;a href="/2026/04/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation"&gt;this post&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, I already had files containing responses from fine-tuned versions of the
other models, so I just needed to run the first of my two fine-tuning scripts against
all four of the new models.&lt;/p&gt;

&lt;p&gt;I did that, and then also tweaked the judge script so that instead of using GPT-5.1, it
used GPT-5.4.  If you run the script multiple times, each time will normally give you different
scores anyway; hopefully the ranking will remain roughly the same.  So given that I
was going to have to re-run the script to get new aggregate results, and those would
not really be comparable to the original ones anyway, this seemed like a reasonable
price to pay for (hopefully) a smarter judge.&lt;/p&gt;

&lt;p&gt;I ran that once, and got some results that surprised me -- so much that I decided to
do three runs and see if the results stood up.  They did; here's the new table,
with scores for each run, the average, and the rank that each one got based on the average.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;Test loss&lt;/th&gt;
  &lt;th&gt;IFT score 1&lt;/th&gt;
  &lt;th&gt;IFT score 2&lt;/th&gt;
  &lt;th&gt;IFT score 3&lt;/th&gt;
  &lt;th&gt;IFT average&lt;/th&gt;
  &lt;th&gt;IFT rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;OpenAI weights: medium&lt;/td&gt;
  &lt;td&gt;3.231442&lt;/td&gt;
  &lt;td&gt;43.44&lt;/td&gt;
  &lt;td&gt;41.83&lt;/td&gt;
  &lt;td&gt;41.30&lt;/td&gt;
  &lt;td&gt;42.19&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;OpenAI weights: small&lt;/td&gt;
  &lt;td&gt;3.499677&lt;/td&gt;
  &lt;td&gt;19.27&lt;/td&gt;
  &lt;td&gt;19.37&lt;/td&gt;
  &lt;td&gt;18.36&lt;/td&gt;
  &lt;td&gt;19.00&lt;/td&gt;
  &lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;1xrtx3090-stacked-interventions&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3.538161&lt;/td&gt;
  &lt;td&gt;19.20&lt;/td&gt;
  &lt;td&gt;18.60&lt;/td&gt;
  &lt;td&gt;18.15&lt;/td&gt;
  &lt;td&gt;18.65&lt;/td&gt;
  &lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3.577761&lt;/td&gt;
  &lt;td&gt;11.70&lt;/td&gt;
  &lt;td&gt;12.74&lt;/td&gt;
  &lt;td&gt;11.28&lt;/td&gt;
  &lt;td&gt;11.91&lt;/td&gt;
  &lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x A100 40 GiB&lt;/td&gt;
  &lt;td&gt;3.673623&lt;/td&gt;
  &lt;td&gt;18.25&lt;/td&gt;
  &lt;td&gt;18.40&lt;/td&gt;
  &lt;td&gt;17.83&lt;/td&gt;
  &lt;td&gt;18.16&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;1xrtx3090-baseline&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3.683835&lt;/td&gt;
  &lt;td&gt;13.59&lt;/td&gt;
  &lt;td&gt;13.93&lt;/td&gt;
  &lt;td&gt;12.56&lt;/td&gt;
  &lt;td&gt;13.36&lt;/td&gt;
  &lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;8xa100m40-baseline&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3.691526&lt;/td&gt;
  &lt;td&gt;17.72&lt;/td&gt;
  &lt;td&gt;17.33&lt;/td&gt;
  &lt;td&gt;16.26&lt;/td&gt;
  &lt;td&gt;17.10&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x H100 80 GiB&lt;/td&gt;
  &lt;td&gt;3.724507&lt;/td&gt;
  &lt;td&gt;14.87&lt;/td&gt;
  &lt;td&gt;15.05&lt;/td&gt;
  &lt;td&gt;13.68&lt;/td&gt;
  &lt;td&gt;14.53&lt;/td&gt;
  &lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x A100 80 GiB&lt;/td&gt;
  &lt;td&gt;3.729900&lt;/td&gt;
  &lt;td&gt;12.65&lt;/td&gt;
  &lt;td&gt;13.34&lt;/td&gt;
  &lt;td&gt;12.55&lt;/td&gt;
  &lt;td&gt;12.85&lt;/td&gt;
  &lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x B200 160 GiB&lt;/td&gt;
  &lt;td&gt;3.771478&lt;/td&gt;
  &lt;td&gt;14.39&lt;/td&gt;
  &lt;td&gt;14.72&lt;/td&gt;
  &lt;td&gt;12.87&lt;/td&gt;
  &lt;td&gt;13.99&lt;/td&gt;
  &lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb train&lt;/td&gt;
  &lt;td&gt;3.943522&lt;/td&gt;
  &lt;td&gt;12.66&lt;/td&gt;
  &lt;td&gt;13.06&lt;/td&gt;
  &lt;td&gt;11.67&lt;/td&gt;
  &lt;td&gt;12.46&lt;/td&gt;
  &lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb-Edu extended train&lt;/td&gt;
  &lt;td&gt;4.134991&lt;/td&gt;
  &lt;td&gt;17.64&lt;/td&gt;
  &lt;td&gt;16.93&lt;/td&gt;
  &lt;td&gt;16.29&lt;/td&gt;
  &lt;td&gt;16.95&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb-Edu train&lt;/td&gt;
  &lt;td&gt;4.166892&lt;/td&gt;
  &lt;td&gt;17.94&lt;/td&gt;
  &lt;td&gt;18.92&lt;/td&gt;
  &lt;td&gt;17.05&lt;/td&gt;
  &lt;td&gt;17.97&lt;/td&gt;
  &lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;You can see that relative rankings are fairly consistent across the IFT runs.  But while in general
the lower-loss runs get better IFT results, now there are even more exceptions to that
trend than there were before.&lt;/p&gt;
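
&lt;p&gt;For reference, the "IFT average" and "IFT rank" columns can be reproduced from the three per-run scores.  This is a minimal sketch with the numbers copied from the table above, not the judge script itself.&lt;/p&gt;

```python
# IFT scores from the three judge runs, copied from the table above.
scores = {
    "OpenAI weights: medium": (43.44, 41.83, 41.30),
    "OpenAI weights: small": (19.27, 19.37, 18.36),
    "1xrtx3090-stacked-interventions": (19.20, 18.60, 18.15),
    "8xa100m40-stacked-interventions-1": (11.70, 12.74, 11.28),
    "Cloud FineWeb, 8x A100 40 GiB": (18.25, 18.40, 17.83),
    "1xrtx3090-baseline": (13.59, 13.93, 12.56),
    "8xa100m40-baseline": (17.72, 17.33, 16.26),
    "Cloud FineWeb, 8x H100 80 GiB": (14.87, 15.05, 13.68),
    "Cloud FineWeb, 8x A100 80 GiB": (12.65, 13.34, 12.55),
    "Cloud FineWeb, 8x B200 160 GiB": (14.39, 14.72, 12.87),
    "Local FineWeb train": (12.66, 13.06, 11.67),
    "Local FineWeb-Edu extended train": (17.64, 16.93, 16.29),
    "Local FineWeb-Edu train": (17.94, 18.92, 17.05),
}

# Mean of the three runs for each model.
averages = {name: sum(runs) / len(runs) for name, runs in scores.items()}

# Rank 1 goes to the highest average; there are no ties in this data.
ordered = sorted(averages, key=averages.get, reverse=True)
ranks = {name: position for position, name in enumerate(ordered, start=1)}

# Print in the table's order (sorted by test loss), not by rank.
for name in scores:
    print(f"{ranks[name]:2d}  {averages[name]:6.2f}  {name}")
```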

&lt;p&gt;Let's look down the "IFT rank" column, which is based on the IFT average:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first surprise is &lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt;.  It has the fourth-best
loss, but it's the worst model out of all of them on the instruction fine-tuning
test!  It was trained on exactly the same data as all of the others apart from
the OpenAI ones and the FineWeb-Edu ones.  Even more perplexingly, it was as close
a match to &lt;code&gt;1xrtx3090-stacked-interventions&lt;/code&gt;  as I could make it, but got
completely different results.  You might
remember from &lt;a href="/2026/04/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation"&gt;the post&lt;/a&gt;
that those two runs started with the same weights and had exactly the same
training config; the only difference was that they were trained on different
architectures, and one used DDP with a real global batch size of 96, while the
other used gradient accumulation to get the same batch size.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;1xrtx3090-baseline&lt;/code&gt; also does much worse than you'd expect from its loss numbers;
it's only a tiny bit worse than Cloud FineWeb, 8x A100 40 GiB in loss terms, but
much worse on the IFT test.  Again, this one is essentially a clone of another:
&lt;code&gt;8xa100m40-baseline&lt;/code&gt;, which was the same training run but using DDP rather than
gradient accumulation.  The same problem -- one of a pair of closely-matched
models has worse results on the IFT test.  But in this case, it's the gradient
accumulation model that turned out bad.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's a really odd situation.  If the training runs using gradient accumulation rather than
DDP had been consistently worse -- or vice versa -- then we could imagine some kind
of connection.  But in the first case GA beat DDP, while in the second it was
the other way around.&lt;/p&gt;
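
&lt;p&gt;As a reminder of why the two setups are meant to be interchangeable in the first place: in exact arithmetic, averaging the gradients from equal-sized micro-batches gives the same result as taking the gradient over the whole batch at once.  Here's a minimal sketch of that with a toy one-parameter model and made-up data.  It is not the training code from this series, and it says nothing about what DDP does internally.&lt;/p&gt;

```python
import math


def mean_gradient(w, batch):
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - t) * x for x, t in batch) / len(batch)


w = 0.5
# Eight made-up (input, target) pairs standing in for a global batch.
batch = [(1.0, 2.0), (2.0, 3.5), (3.0, 7.0), (4.0, 8.5),
         (5.0, 9.0), (6.0, 13.0), (7.0, 13.5), (8.0, 17.0)]

# One big batch: every example contributes to a single gradient.
full_gradient = mean_gradient(w, batch)

# Gradient accumulation: four micro-batches of two, each scaled by
# 1 / number_of_micro_batches and summed before the single update.
micro_batches = [batch[i:i + 2] for i in range(0, len(batch), 2)]
accumulated = 0.0
for micro_batch in micro_batches:
    accumulated += mean_gradient(w, micro_batch) / len(micro_batches)

print(full_gradient, accumulated)
print(math.isclose(full_gradient, accumulated, rel_tol=1e-12))
```

&lt;p&gt;The comparison uses a tolerance rather than exact equality on purpose: floating-point addition isn't associative, so summing the same terms in a different order can change the last few bits of the result.&lt;/p&gt;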

&lt;p&gt;Apart from that, we do still see that the two FineWeb-Edu models are doing much
better than the others.  And the remaining models are all pretty close together, both
in terms of loss and in terms of their ranking, apart from the Local FineWeb train,
which is bad in both.&lt;/p&gt;

&lt;p&gt;It is, however, interesting that Local FineWeb-Edu extended train, which was trained
on twice as much data as Local FineWeb-Edu train, is consistently worse in terms of
the IFT numbers.  That wasn't the case in my tests previously.&lt;/p&gt;

&lt;p&gt;All of this puzzled me.  The "lots of knowledge makes a model better at this" idea
seemed to be weakened by the relative ranks of the two FineWeb-Edu models (after all,
if it were true, you'd expect the model trained on more data to be consistently better).
And the "smart, low-loss models are better" side seemed to be contradicted by
&lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt; and &lt;code&gt;1xrtx3090-baseline&lt;/code&gt;'s bad results.&lt;/p&gt;

&lt;p&gt;What might be going on here?&lt;/p&gt;

&lt;h3 id="epochs-of-fine-tuning"&gt;Epochs of fine-tuning&lt;/h3&gt;

&lt;p&gt;Looking at the training code, one thing stood out to me.  The process was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-tune the model for a maximum of 100 epochs over the training set.&lt;/li&gt;
&lt;li&gt;If loss on a held-back validation set went above the result for the previous
epoch, we did an early exit and used the previous epoch's model for the generation
of the responses.&lt;/li&gt;
&lt;/ul&gt;
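
&lt;p&gt;That early-exit logic can be sketched roughly as follows.  This is a minimal
illustration rather than the actual fine-tuning script: &lt;code&gt;train_one_epoch&lt;/code&gt; and
&lt;code&gt;validation_loss&lt;/code&gt; are hypothetical callables, the loss values are made up, and the
toy "model" is just a dict so that the example runs without PyTorch.&lt;/p&gt;

```python
import copy


def fine_tune_with_early_exit(model, train_one_epoch, validation_loss, max_epochs=100):
    """Train until validation loss rises, then return the previous epoch's model.

    Returns (model_to_use, epochs_kept).  All argument names are hypothetical.
    """
    previous_model = None
    previous_loss = None
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model)
        loss = validation_loss(model)
        if previous_loss is not None and loss > previous_loss:
            # Validation loss went up: discard this epoch, keep the one before.
            return previous_model, epoch - 1
        previous_model = copy.deepcopy(model)
        previous_loss = loss
    return previous_model, max_epochs


# Toy stand-ins: made-up validation losses per epoch, not real results.
fake_losses = [2.0, 1.5, 1.2, 1.3, 1.1]
toy_model = {"epochs_trained": 0}


def toy_train_one_epoch(model):
    model["epochs_trained"] += 1


def toy_validation_loss(model):
    return fake_losses[model["epochs_trained"] - 1]


kept_model, epochs_kept = fine_tune_with_early_exit(
    toy_model, toy_train_one_epoch, toy_validation_loss
)
print(kept_model, epochs_kept)
```

&lt;p&gt;With a rule like this, a single uptick in validation loss ends the run, even if
the loss would have dropped again an epoch later (as it does in the made-up numbers
above).&lt;/p&gt;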

&lt;p&gt;In practice, the early-exit code always cut in pretty quickly.
I'd noticed that during my original generation of the results for the new models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;8xa100m40-baseline&lt;/code&gt; took 6 epochs until validation loss started rising.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;1xrtx3090-baseline&lt;/code&gt; took 5.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt; took 4.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;1xrtx3090-stacked-interventions&lt;/code&gt; took 5.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I decided to regenerate responses for all of the models, and then run the new
responses past the LLM judge again.  But this time I would keep a record of how
many epochs of training we got before the exit:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;Test loss&lt;/th&gt;
  &lt;th&gt;IFT score&lt;/th&gt;
  &lt;th&gt;Epochs&lt;/th&gt;
  &lt;th&gt;IFT rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;OpenAI weights: medium&lt;/td&gt;
  &lt;td&gt;3.231442&lt;/td&gt;
  &lt;td&gt;39.14&lt;/td&gt;
  &lt;td&gt;2&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;OpenAI weights: small&lt;/td&gt;
  &lt;td&gt;3.499677&lt;/td&gt;
  &lt;td&gt;24.93&lt;/td&gt;
  &lt;td&gt;2&lt;/td&gt;
  &lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;1xrtx3090-stacked-interventions&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3.538161&lt;/td&gt;
  &lt;td&gt;16.97&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3.577761&lt;/td&gt;
  &lt;td&gt;10.40&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x A100 40 GiB&lt;/td&gt;
  &lt;td&gt;3.673623&lt;/td&gt;
  &lt;td&gt;20.73&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;1xrtx3090-baseline&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3.683835&lt;/td&gt;
  &lt;td&gt;13.61&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
  &lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;8xa100m40-baseline&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3.691526&lt;/td&gt;
  &lt;td&gt;13.57&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x H100 80 GiB&lt;/td&gt;
  &lt;td&gt;3.724507&lt;/td&gt;
  &lt;td&gt;14.25&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x A100 80 GiB&lt;/td&gt;
  &lt;td&gt;3.729900&lt;/td&gt;
  &lt;td&gt;11.66&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x B200 160 GiB&lt;/td&gt;
  &lt;td&gt;3.771478&lt;/td&gt;
  &lt;td&gt;15.17&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb train&lt;/td&gt;
  &lt;td&gt;3.943522&lt;/td&gt;
  &lt;td&gt;13.25&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb-Edu extended train&lt;/td&gt;
  &lt;td&gt;4.134991&lt;/td&gt;
  &lt;td&gt;16.39&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb-Edu train&lt;/td&gt;
  &lt;td&gt;4.166892&lt;/td&gt;
  &lt;td&gt;17.80&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;It was getting even harder to see any useful pattern!  One thing that did
stand out, though, was that the still oddly-high Cloud FineWeb, 8x A100 40 GiB
model was being instruction-trained for seven epochs.  It was also rather noticeable
that the two FineWeb-Edu models had the same "advantage", if that's what it was.  But
the Local FineWeb train had seven epochs too, and got a poor score; the OpenAI models
only got two each, and led the pack; and &lt;code&gt;1xrtx3090-baseline&lt;/code&gt; got a pretty
poor result given its six epochs of training.&lt;/p&gt;

&lt;p&gt;Still, what would happen if we got rid of that confounder?  I did yet another
set of runs; this time, I changed the fine-tuning/generation script to always do
four epochs -- no early exit.  I chose four because it was the modal number in the
previous trains -- no strong reason for it beyond that.&lt;/p&gt;

&lt;h4 id="training-for-four-epochs"&gt;Training for four epochs&lt;/h4&gt;

&lt;p&gt;Here's what came out at the end:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;Test loss&lt;/th&gt;
  &lt;th&gt;IFT score&lt;/th&gt;
  &lt;th&gt;Epochs&lt;/th&gt;
  &lt;th&gt;IFT rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;OpenAI weights: medium&lt;/td&gt;
  &lt;td&gt;3.231442&lt;/td&gt;
  &lt;td&gt;43.99&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;OpenAI weights: small&lt;/td&gt;
  &lt;td&gt;3.499677&lt;/td&gt;
  &lt;td&gt;25.70&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;1xrtx3090-stacked-interventions&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3.538161&lt;/td&gt;
  &lt;td&gt;14.46&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3.577761&lt;/td&gt;
  &lt;td&gt;10.07&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;11=&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x A100 40 GiB&lt;/td&gt;
  &lt;td&gt;3.673623&lt;/td&gt;
  &lt;td&gt;13.51&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;1xrtx3090-baseline&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3.683835&lt;/td&gt;
  &lt;td&gt;10.65&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;8xa100m40-baseline&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3.691526&lt;/td&gt;
  &lt;td&gt;12.55&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x H100 80 GiB&lt;/td&gt;
  &lt;td&gt;3.724507&lt;/td&gt;
  &lt;td&gt;11.41&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x A100 80 GiB&lt;/td&gt;
  &lt;td&gt;3.729900&lt;/td&gt;
  &lt;td&gt;9.48&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x B200 160 GiB&lt;/td&gt;
  &lt;td&gt;3.771478&lt;/td&gt;
  &lt;td&gt;10.07&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;11=&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb train&lt;/td&gt;
  &lt;td&gt;3.943522&lt;/td&gt;
  &lt;td&gt;10.16&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb-Edu extended train&lt;/td&gt;
  &lt;td&gt;4.134991&lt;/td&gt;
  &lt;td&gt;10.54&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb-Edu train&lt;/td&gt;
  &lt;td&gt;4.166892&lt;/td&gt;
  &lt;td&gt;15.09&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Still no obvious pattern.&lt;/p&gt;

&lt;h4 id="training-for-seven-epochs"&gt;Training for seven epochs&lt;/h4&gt;

&lt;p&gt;What if we try seven epochs of training for all of them, so that they all get as much
"benefit" (if that's what it is) as the FineWeb-Edu models?&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;Test loss&lt;/th&gt;
  &lt;th&gt;IFT score&lt;/th&gt;
  &lt;th&gt;Epochs&lt;/th&gt;
  &lt;th&gt;IFT rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;OpenAI weights: medium&lt;/td&gt;
  &lt;td&gt;3.231442&lt;/td&gt;
  &lt;td&gt;40.74&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;OpenAI weights: small&lt;/td&gt;
  &lt;td&gt;3.499677&lt;/td&gt;
  &lt;td&gt;24.87&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;1xrtx3090-stacked-interventions&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3.538161&lt;/td&gt;
  &lt;td&gt;16.91&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3.577761&lt;/td&gt;
  &lt;td&gt;10.59&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x A100 40 GiB&lt;/td&gt;
  &lt;td&gt;3.673623&lt;/td&gt;
  &lt;td&gt;15.94&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;1xrtx3090-baseline&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3.683835&lt;/td&gt;
  &lt;td&gt;13.68&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;8xa100m40-baseline&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3.691526&lt;/td&gt;
  &lt;td&gt;14.82&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x H100 80 GiB&lt;/td&gt;
  &lt;td&gt;3.724507&lt;/td&gt;
  &lt;td&gt;10.82&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x A100 80 GiB&lt;/td&gt;
  &lt;td&gt;3.729900&lt;/td&gt;
  &lt;td&gt;10.70&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x B200 160 GiB&lt;/td&gt;
  &lt;td&gt;3.771478&lt;/td&gt;
  &lt;td&gt;13.81&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb train&lt;/td&gt;
  &lt;td&gt;3.943522&lt;/td&gt;
  &lt;td&gt;13.09&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb-Edu extended train&lt;/td&gt;
  &lt;td&gt;4.134991&lt;/td&gt;
  &lt;td&gt;16.27&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb-Edu train&lt;/td&gt;
  &lt;td&gt;4.166892&lt;/td&gt;
  &lt;td&gt;15.54&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Just as confused as ever...&lt;/p&gt;

&lt;h3 id="putting-it-all-together"&gt;Putting it all together&lt;/h3&gt;

&lt;p&gt;Here's a table with all of the ranks we got from these tests:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;&lt;/th&gt;
  &lt;th&gt;Initial rank&lt;/th&gt;
  &lt;th&gt;Updated script rank&lt;/th&gt;
  &lt;th&gt;4-epoch rank&lt;/th&gt;
  &lt;th&gt;7-epoch rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;OpenAI weights: medium&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
  &lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;OpenAI weights: small&lt;/td&gt;
  &lt;td&gt;2&lt;/td&gt;
  &lt;td&gt;2&lt;/td&gt;
  &lt;td&gt;2&lt;/td&gt;
  &lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;1xrtx3090-stacked-interventions&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;3&lt;/td&gt;
  &lt;td&gt;5&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;13&lt;/td&gt;
  &lt;td&gt;13&lt;/td&gt;
  &lt;td&gt;11=&lt;/td&gt;
  &lt;td&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x A100 40 GiB&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;3&lt;/td&gt;
  &lt;td&gt;5&lt;/td&gt;
  &lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;1xrtx3090-baseline&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;10&lt;/td&gt;
  &lt;td&gt;9&lt;/td&gt;
  &lt;td&gt;8&lt;/td&gt;
  &lt;td&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;code&gt;8xa100m40-baseline&lt;/code&gt;&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
  &lt;td&gt;10&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x H100 80 GiB&lt;/td&gt;
  &lt;td&gt;8&lt;/td&gt;
  &lt;td&gt;8&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x A100 80 GiB&lt;/td&gt;
  &lt;td&gt;11&lt;/td&gt;
  &lt;td&gt;12&lt;/td&gt;
  &lt;td&gt;13&lt;/td&gt;
  &lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Cloud FineWeb, 8x B200 160 GiB&lt;/td&gt;
  &lt;td&gt;9&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;11=&lt;/td&gt;
  &lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb train&lt;/td&gt;
  &lt;td&gt;12&lt;/td&gt;
  &lt;td&gt;11&lt;/td&gt;
  &lt;td&gt;10&lt;/td&gt;
  &lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb-Edu extended train&lt;/td&gt;
  &lt;td&gt;7&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
  &lt;td&gt;9&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;Local FineWeb-Edu train&lt;/td&gt;
  &lt;td&gt;5&lt;/td&gt;
  &lt;td&gt;4&lt;/td&gt;
  &lt;td&gt;3&lt;/td&gt;
  &lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;It's hard to make much sense of this, but a few things are clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Performance on this test is correlated with loss, but loss is far from the only factor.&lt;/li&gt;
&lt;li&gt;The OpenAI weights consistently lead the pack.&lt;/li&gt;
&lt;li&gt;Of our own models, &lt;code&gt;1xrtx3090-stacked-interventions&lt;/code&gt;, Cloud FineWeb, 8x A100 40 GiB,
and Local FineWeb-Edu train do pretty well.&lt;/li&gt;
&lt;li&gt;Strangely, Local FineWeb-Edu extended train, which is just Local FineWeb-Edu train
that has been trained on a further 3B tokens of the FineWeb-Edu dataset, ranks below
the model it was based on in three of the four tests; only the seven-epoch run puts it ahead.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt; and &lt;code&gt;1xrtx3090-baseline&lt;/code&gt; are consistently bad.  Cloud FineWeb, 8x A100 80 GiB is
also not great.&lt;/li&gt;
&lt;/ul&gt;
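
&lt;p&gt;One rough way to put a number on "correlated with loss" is a Spearman rank
correlation between each model's position by test loss and its IFT rank.  The sketch
below uses the ranks from the table above; the helper name is my own, and I've left
out the four-epoch column because its tied rank would need the more general formula.
With only 13 models, treat the output as indicative at best.&lt;/p&gt;

```python
# Rows are in the same order as the table above, which is sorted by test loss,
# so the loss rank of each model is just its position (1 to 13).
ift_ranks = {
    "initial": [1, 2, 3, 13, 4, 10, 6, 8, 11, 9, 12, 7, 5],
    "updated script": [1, 2, 5, 13, 3, 9, 10, 8, 12, 7, 11, 6, 4],
    "7-epoch": [1, 2, 3, 13, 5, 9, 7, 11, 12, 8, 10, 4, 6],
}


def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


loss_ranks = list(range(1, 14))
rho_by_test = {name: spearman_rho(loss_ranks, ranks) for name, ranks in ift_ranks.items()}
for name, rho in rho_by_test.items():
    print(f"{name}: rho = {rho:.3f}")
```

&lt;p&gt;A value of 1 would mean the IFT ranking exactly followed the loss ranking, and 0
would mean no monotonic relationship at all.&lt;/p&gt;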

&lt;p&gt;On the one hand, training different models for different numbers of epochs feels wrong
for an evaluation like this, as they're being "treated differently".  On the other hand,
if it's meant to be a good evaluation of model usefulness in the real world, then individual models would
be fine-tuned for different amounts of time, depending on validation loss.  So perhaps
it is better?&lt;/p&gt;

&lt;p&gt;But the differing results are still quite a puzzle.  I figured that a modern AI could
easily build me a data exploration interface, specifically for the original results
and seven-epoch ones, so I asked Claude and got &lt;a href="/post-assets/llm-from-scratch-32l-interventions-instruction-fine-tuning-tests/ift-eval-explorer.html"&gt;this rather nice one&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After poring over that, though, I couldn't find a smoking gun -- for example, some
kind of systematic error that &lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt; was always making
that pulled its score down.&lt;/p&gt;

&lt;p&gt;I think that the best -- albeit hand-wavy and incomplete -- mental model that I have right now is something like this.  If
we consider the loss landscape that these models are all in, they've all been trained
to try to get to a place with as low loss as we could manage.  When we do the instruction
fine-tune on them, we're changing the landscape -- the objective of "be better at
following instructions" is different to "be better at minimising loss".&lt;/p&gt;

&lt;p&gt;Now, those two landscapes could be completely different!  You can imagine a task that we
might set instead of instruction-following that could be completely uncorrelated with
loss minimisation, or even inversely correlated.&lt;/p&gt;

&lt;p&gt;But instruction-following is relatively close; it at least shares features like "generate
coherent text".  So when we do the instruction fine-tuning, what we're trying to do is
to move from the place where the model ended up after its pre-training, to a place where
performance on the new goal -- instruction-following -- is best.&lt;/p&gt;

&lt;p&gt;Here's where I'm going to get more than a bit hand-wavy.  You can easily imagine that in some
places where the loss was low, there might be downhill slopes pointing towards good
locations in the new instruction-following landscape.  Starting from one of those, instruction
fine-tuning would be able to get you to a good IFT model.&lt;/p&gt;

&lt;p&gt;But other places with low loss might not have that advantage; maybe they're at or
near a poor "local minimum" in the IFT landscape -- that is, a place where there is
no downhill route to a better place.  So simple fine-tuning like this
might never get a good result!&lt;/p&gt;

&lt;p&gt;With this mindset, we might say that the OpenAI weights are pretty well-positioned,
not just in the loss landscape but also in the IFT landscape.  The FineWeb-Edu
models happened to get lucky, and wound up in a place that, despite having poor loss,
is well-positioned for the IFT objective.  And by contrast, &lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt;
and &lt;code&gt;1xrtx3090-baseline&lt;/code&gt; were just unlucky: they got to a place where the loss
landscape was not well-correlated with the IFT landscape.&lt;/p&gt;

&lt;p&gt;This seems plausible enough for me to
use it as my working model for now, and see if I can work out some way to test it.
Keeping track of the validation loss during the instruction fine-tuning process would
certainly be a good start; unfortunately I only realised that after doing all
of the tests above, and re-doing them would be quite a lot of work.&lt;/p&gt;

&lt;p&gt;One final thing is worth repeating.  Our two "unlucky" models,
&lt;code&gt;8xa100m40-stacked-interventions-1&lt;/code&gt; and &lt;code&gt;1xrtx3090-baseline&lt;/code&gt;, each had a twin.
The former was the DDP-trained counterpart of the gradient-accumulated
&lt;code&gt;1xrtx3090-stacked-interventions&lt;/code&gt;, while the latter was the gradient-accumulated counterpart
of &lt;code&gt;8xa100m40-baseline&lt;/code&gt;.  So while something odd clearly happened, it doesn't look like
DDP or gradient accumulation by themselves are the culprit.&lt;/p&gt;

&lt;p&gt;I think that at this point, it's best for me to draw a line under this -- I have a
bunch of other things I'd like to get to, and this has become a bit of a side quest.&lt;/p&gt;

&lt;p&gt;Still, I have one main takeaway from this: chasing lower loss is technically interesting but is not the only goal.
In some cases, it seems likely that lower-loss models can be worse for actual use.&lt;/p&gt;

&lt;p&gt;Coming up next: I'm going to wrap up this "interventions" mini-series, and move
on to the final steps in my LLM from scratch journey.  See you then!&lt;/p&gt;

&lt;p&gt;&lt;a href="/2026/04/llm-from-scratch-32m-interventions-conclusion"&gt;Here's a link to the next post in this series&lt;/a&gt;.&lt;/p&gt;
</description><guid isPermaLink="false">/2026/04/llm-from-scratch-32l-interventions-instruction-fine-tuning-tests</guid><pubDate>Mon, 20 Apr 2026 20:00:00 +0000</pubDate></item><item><title>Writing an LLM from scratch, part 32m -- Interventions: conclusion</title><link>https://www.gilesthomas.com/2026/04/llm-from-scratch-32m-interventions-conclusion</link><description>&lt;p&gt;Last November, when I finished the main body of
"&lt;a href="https://www.manning.com/books/build-a-large-language-model-from-scratch"&gt;Build a Large Language Model (from Scratch)&lt;/a&gt;",
I &lt;a href="/2025/11/llm-from-scratch-27-whats-left-and-whats-next"&gt;set myself a number of follow-on goals&lt;/a&gt;.
One was "training the full GPT-2 base model myself".&lt;/p&gt;

&lt;p&gt;I've reached the end of that journey, with a model that is almost -- if not quite
-- as good as GPT-2 small, trained in 44 hours on my own machine,
so I thought it would be worth summarising
how it went.&lt;/p&gt;
&lt;p&gt;In December, &lt;a href="/2025/12/llm-from-scratch-28-training-a-base-model-from-scratch"&gt;I trained my first model&lt;/a&gt;,
taking two days,
but was disappointed to see that it was worse in terms of loss, and in terms
of how well it could be fine-tuned to follow instructions, than the original GPT-2 model.&lt;/p&gt;

&lt;p&gt;I expected that a chunk of that difference was likely to be due to the original model having
been trained for longer, but also noticed that there were a number of changes -- interventions -- that
I could make to the model and the training run, and I thought they might help.&lt;/p&gt;

&lt;p&gt;In January, I &lt;a href="/2026/01/llm-from-scratch-29-ddp-training-a-base-model-in-the-cloud"&gt;got a DDP training system together&lt;/a&gt;
that would allow me to iterate on those interventions without having to wait for
two days for each result.&lt;/p&gt;

&lt;p&gt;In February, I got started by &lt;a href="/2026/02/llm-from-scratch-32a-interventions-baseline-model"&gt;training a baseline model in the cloud&lt;/a&gt;,
and
I've since ground through all of the interventions, and come up with
a set that lowered the loss nicely, both &lt;a href="/2026/04/llm-from-scratch-32j-interventions-trying-to-train-a-better-model-in-the-cloud"&gt;in the cloud&lt;/a&gt;,
and &lt;a href="/2026/04/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation"&gt;locally&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Along the way, I've learned about, or refined my knowledge of, a bunch of ML concepts.
In increasing order of how they helped with the loss (with the first two actually making
it slightly worse):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="/2026/03/llm-from-scratch-32g-interventions-weight-tying"&gt;Weight tying&lt;/a&gt;, which
I found made the loss worse, but it was interesting how simple it was to implement.&lt;/li&gt;
&lt;li&gt;PyTorch's &lt;a href="/2026/04/llm-from-scratch-32h-interventions-full-fat-float32"&gt;Automated Mixed Precision&lt;/a&gt;,
which also harmed the loss a tiny bit, but
had the benefit of making training twice as fast, and 66% cheaper in the cloud
-- well worth the loss penalty.&lt;/li&gt;
&lt;li&gt;&lt;a href="/2026/02/llm-from-scratch-32b-interventions-gradient-clipping"&gt;Gradient clipping&lt;/a&gt; -- a cheap,
but (somewhat to my surprise) not particularly effective intervention for this model.&lt;/li&gt;
&lt;li&gt;&lt;a href="/2026/02/llm-from-scratch-32d-interventions-adding-attention-bias"&gt;QKV bias&lt;/a&gt; -- that is,
adding bias to the attention weight matrices -- which also helped a tiny bit, though
I later felt that this might have been in the noise.&lt;/li&gt;
&lt;li&gt;&lt;a href="/2026/03/llm-from-scratch-32f-interventions-weight-decay"&gt;Weight decay&lt;/a&gt; -- more effective,
and something that's easy enough to understand in the context of plain gradient descent.  I still need to
learn more about how it interacts with optimisers, though -- particularly AdamW.&lt;/li&gt;
&lt;li&gt;&lt;a href="/2026/02/llm-from-scratch-32c-interventions-removing-dropout"&gt;Dropout&lt;/a&gt;, which
seems to be less than useful for single-epoch training: removing it
helped the model quite a lot.&lt;/li&gt;
&lt;li&gt;&lt;a href="/2026/03/llm-from-scratch-32e-interventions-learning-rate"&gt;The learning rate&lt;/a&gt;,
which I built up quite a lot of new knowledge about; by both increasing it
and scheduling it, I got the biggest bang for the buck.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've also learned how to &lt;a href="/2026/01/custom-automodelforcausallm-frompretrained-models-on-hugging-face"&gt;upload my custom models to Hugging Face&lt;/a&gt;,
found out some &lt;a href="/2026/04/llm-from-scratch-32i-interventions-what-is-in-the-noise"&gt;interesting things about how random noise affects training&lt;/a&gt;,
and come up with improvements in the setup I have for using an
&lt;a href="/2026/01/llm-from-scratch-30-digging-into-llm-as-a-judge"&gt;LLM as a judge for instruction fine-tuned models&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There was a bit of a mystery when I tried out the instruction fine-tuning tests,
though.  Although two of my models were very close to GPT-2 small in terms of loss,
&lt;a href="/2026/04/llm-from-scratch-32l-interventions-instruction-fine-tuning-tests"&gt;I found&lt;/a&gt;
that while one of them had an instruction fine-tuning result that was likewise close
to GPT-2 small, the other was &lt;em&gt;much&lt;/em&gt; worse!  A mystery to dig into later, I think.&lt;/p&gt;

&lt;p&gt;But it was still very satisfying that my best model -- trained locally in 44 hours
-- was almost as good as GPT-2 small, even if it did fall somewhat short.  So on that
positive note, I'm going to wrap up this "Interventions" series-within-a-series, and move
on to the two other things I wanted to do before wrapping up the "LLM from scratch" series
as a whole:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Going through the appendices in the book to see if there's anything I want
to highlight there.&lt;/li&gt;
&lt;li&gt;The final test as to whether I've really understood everything: building my own
LLM from scratch without reference to the book.  I want to do that in a different
framework, not PyTorch, to minimise the risk of just regurgitating code --
I asked people on X/Twitter which one I should use, and
&lt;a href="https://x.com/gpjt/status/1985434030880293004"&gt;the winner was JAX&lt;/a&gt; -- so it should be interesting to see how that goes!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The appendices first, I think -- I'll post about them shortly.  But I think the big
one will be the JAX implementation -- really looking forward to that.&lt;/p&gt;

&lt;p&gt;&lt;a href="/2026/04/llm-from-scratch-33-what-i-learned-from-the-appendices"&gt;Here's a link to the next post in this series&lt;/a&gt;.&lt;/p&gt;
</description><guid isPermaLink="false">/2026/04/llm-from-scratch-32m-interventions-conclusion</guid><pubDate>Tue, 21 Apr 2026 18:15:00 +0000</pubDate></item><item><title>Writing an LLM from scratch, part 33 -- what I learned from finally getting round to the appendices</title><link>https://www.gilesthomas.com/2026/04/llm-from-scratch-33-what-i-learned-from-the-appendices</link><description>&lt;p&gt;After finishing the main body of
"&lt;a href="https://www.manning.com/books/build-a-large-language-model-from-scratch"&gt;Build a Large Language Model (from Scratch)&lt;/a&gt;",
I &lt;a href="/2025/11/llm-from-scratch-27-whats-left-and-whats-next"&gt;set myself three follow-on goals&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The first was training a full GPT-2-small-style base model myself.  That was
&lt;a href="/2025/12/llm-from-scratch-28-training-a-base-model-from-scratch"&gt;reasonably easy to do&lt;/a&gt;
but unlocked a &lt;a href="/2026/04/llm-from-scratch-32m-interventions-conclusion"&gt;bunch of irresistible side quests&lt;/a&gt;;
having finally got to the end of those, it's time to move on to the others: reading
through the book's appendices, and building my own GPT-2 style model in JAX.&lt;/p&gt;

&lt;p&gt;This post is about the appendices.  The TL;DR: there was stuff in there that could have saved me time
in my side-questing, but I think that having to work those things out from scratch probably
helped me learn them better.&lt;/p&gt;
&lt;h3 id="appendix-a-introduction-to-pytorch"&gt;Appendix A: Introduction to PyTorch&lt;/h3&gt;

&lt;p&gt;This is an excellent overview of PyTorch, and given that I'm writing for people
who are reading the book too, all I can really say is that it's well worth reading,
even if you have some experience with it.  He gives an intro to what it is, some details
on how to choose to use GPUs (or Apple Silicon) if you have them, and an overview of tensors.&lt;/p&gt;

&lt;p&gt;He then goes on to explain the basics of automated differentiation and back-propagation,
with a bit of background detail about the chain rule.  I think this bit is useful at a
"how-to" level, but the mathematical details felt like they were summarised too briefly
to be all that useful.  I can see why -- this is an appendix to a book on an adjacent
subject, not a textbook on the mathematics of training ML models.  But something this
brief feels like it would be confusing for people who don't know the material already, and not
really useful for those who do.&lt;/p&gt;

&lt;p&gt;Perhaps I'm underestimating the typical reader, but if and when I write up my own
explanation of how this works (perhaps as a follow-up to
"&lt;a href="/2025/09/maths-for-llms"&gt;The maths you need to start understanding LLMs&lt;/a&gt;"), I'll
go quite a lot slower and try to explain things in more detail.&lt;/p&gt;

&lt;p&gt;Anyway, as I said, the explanation is more of a bonus in this book, quite
far from its main focus, so this is a nit.&lt;/p&gt;

&lt;p&gt;He then goes on to a high-level explanation of PyTorch's &lt;code&gt;Dataset&lt;/code&gt;s and &lt;code&gt;DataLoader&lt;/code&gt;s.
This was quite useful for me.  I must admit that I've been struggling a bit to see
the value of DataLoaders -- indexing directly into Datasets has worked very nicely
for me.  I suspect this is a question of scale more than anything; even my big
training runs, 44 hours of training a 163M-parameter model on 3 billion tokens, worked
fine without a DataLoader.  But after reading this section, I felt I was getting
some way towards having more of a handle on how they might help.  I'm not quite
there yet, but hopefully soon...&lt;/p&gt;

&lt;p&gt;Next, there are sections on training loops, both with and without GPU support.  Nothing
new there for me, at least.&lt;/p&gt;

&lt;p&gt;Then came the real surprise: a really solid walkthrough on training models across
multiple GPUs with DistributedDataParallel!  That's something I learned from the documentation
and various online tutorials &lt;a href="/2026/01/llm-from-scratch-29-ddp-training-a-base-model-in-the-cloud"&gt;back in January&lt;/a&gt;,
and reading this appendix first would have saved some time.&lt;/p&gt;

&lt;p&gt;But thinking back on it, I think that the way I did it was better pedagogically for
me.  By having to grind through it from first principles -- following the docs, coding
something, seeing it break, trying again, and eventually getting there -- I think
I internalised the knowledge much better.&lt;/p&gt;

&lt;p&gt;It's a trade-off, really.  If I read explanations, I learn faster, but
the knowledge is shallower.  Learning by doing is slower but deeper.  Working out
the right mix is hard.  It feels like I've struck a good balance on this one, but
I suppose it's difficult to know for sure.&lt;/p&gt;

&lt;p&gt;The one thing in the DDP section that did stand out for me, though, was the use of a
&lt;code&gt;DistributedSampler&lt;/code&gt; for the &lt;code&gt;DataLoader&lt;/code&gt;.  That might have made some of my DDP code
a bit simpler!&lt;/p&gt;
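
&lt;p&gt;As I understand it, the idea behind &lt;code&gt;DistributedSampler&lt;/code&gt; is that each
process only iterates over its own slice of the dataset's indices.  Here's a pure-Python
sketch of that idea with shuffling turned off; &lt;code&gt;shard_indices&lt;/code&gt; is a hypothetical
helper of my own, not part of PyTorch, and it assumes the dataset has at least as many
items as there are processes.&lt;/p&gt;

```python
import math


def shard_indices(dataset_len, num_replicas, rank):
    """Return the dataset indices that one process would see (no shuffling)."""
    per_replica = math.ceil(dataset_len / num_replicas)
    total_size = per_replica * num_replicas
    indices = list(range(dataset_len))
    # Pad by wrapping around to the start so every rank gets the same count.
    indices += indices[: total_size - dataset_len]
    # Each rank takes every num_replicas-th index, starting at its own rank.
    return indices[rank:total_size:num_replicas]


shards = [shard_indices(10, num_replicas=4, rank=r) for r in range(4)]
for r, shard in enumerate(shards):
    print(f"rank {r}: {shard}")
```

&lt;p&gt;The padding means that a couple of items can be seen twice in one pass, which is the
price of keeping every process in step with the others.&lt;/p&gt;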

&lt;p&gt;On to the next appendix.&lt;/p&gt;

&lt;h3 id="appendix-b-references-and-further-reading"&gt;Appendix B: References and further reading&lt;/h3&gt;

&lt;p&gt;I won't go through this in detail; it does what it says on the tin, and there's
a bunch of interesting stuff in there.  I scanned through and nothing felt like
a must-read right now, but I'll be checking it in the future if I'm looking for
suggestions for things to read about.&lt;/p&gt;

&lt;h3 id="appendix-c-exercise-solutions"&gt;Appendix C: Exercise solutions&lt;/h3&gt;

&lt;p&gt;Another one that is exactly what it says it is.&lt;/p&gt;

&lt;h3 id="appendix-d-adding-bells-and-whistles-to-the-training-loop"&gt;Appendix D: Adding bells and whistles to the training loop&lt;/h3&gt;

&lt;p&gt;Once again, something I could have saved time by reading first!  In it, he covers
gradient clipping, which I went over back &lt;a href="/2026/02/llm-from-scratch-32b-interventions-gradient-clipping"&gt;in February&lt;/a&gt;,
and warming up and then doing a cosine decay on the learning rate, which was something
I looked into &lt;a href="/2026/03/llm-from-scratch-32e-interventions-learning-rate"&gt;in March&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Just like with DDP, I think that having to learn about these from resources I could
find on the Internet meant that I got to a deeper understanding than I would have if
I'd just been following the book.  This is not a point against the book, of course!
Again, it's one of those balancing acts: do it yourself and learn more, or read about
it and learn faster.&lt;/p&gt;

&lt;p&gt;Still well worth reading though.&lt;/p&gt;

&lt;h3 id="appendix-e-parameter-efficient-fine-tuning-with-lora"&gt;Appendix E: Parameter-efficient fine-tuning with LoRA&lt;/h3&gt;

&lt;p&gt;This was a really interesting read.  I've been reading about LoRA on the side, but
most treatments I've seen started with an explanation of the maths and then essentially
said "now, to do it, install PEFT" (or Unsloth, or something similar).&lt;/p&gt;

&lt;p&gt;Raschka gives the full code, showing how you can write your own LoRA stuff, and I
think this is excellent.  Digging into it right now would be a side quest, but I'm
inspired by it and might do my own LoRA writeup after finishing this LLM from scratch arc.&lt;/p&gt;

&lt;p&gt;Let's see if I manage that or if I get distracted by something shiny first...&lt;/p&gt;

&lt;h3 id="and-thats-it"&gt;...and that's it!&lt;/h3&gt;

&lt;p&gt;The last page in the book.  Well, the first page of the index.  Done.  Wow!&lt;/p&gt;

&lt;p&gt;But before I start the celebrations, there's one last step.  As I said
&lt;a href="/2025/11/llm-from-scratch-27-whats-left-and-whats-next"&gt;last November&lt;/a&gt;, I wanted to:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;[Build] my own LLM from scratch in a different framework,
  without using the book. That is, I think, essential, and perhaps would be the
  crowning post of this series. It would be a nice way to end it, wouldn't it?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I think I was right, so that's what's next.  I
&lt;a href="https://x.com/gpjt/status/1985434030880293004"&gt;asked people on Twitter&lt;/a&gt; which
framework I should use, and the winner was JAX -- and so that's what's coming next.&lt;/p&gt;

&lt;p&gt;Watch this space!&lt;/p&gt;
</description><guid isPermaLink="false">/2026/04/llm-from-scratch-33-what-i-learned-from-the-appendices</guid><pubDate>Wed, 22 Apr 2026 17:30:00 +0000</pubDate></item><item><title>10Gb/s Ethernet: what I had to (re)learn</title><link>https://www.gilesthomas.com/2026/04/10g-ethernet-what-i-relearned</link><description>&lt;p&gt;My ISP recently started offering a 10Gb option, and my "shiny new thing!" Pavlovian
response immediately kicked in.  So of course, I
had to upgrade the wired networking in my home -- which meant I had to learn a few things to get
it all working, and relearn a bunch of stuff I'd forgotten over the years.&lt;/p&gt;

&lt;p&gt;Wired networking for home and small offices hasn't really moved forward that much in the last 20-odd years.
Back in 2006, gigabit Ethernet was standard for businesses, and most home users moved
to it not long after.  Perhaps due to the rise of WiFi for most "last few metres" connections,
it's pretty much stagnated there, albeit with a bit of a push towards 2.5Gb/s more
recently.&lt;/p&gt;

&lt;p&gt;But with faster ISP connections arriving, I think things are starting to become
a bit more interesting.  Even the fastest WiFi 7 connections are only able to get up
to around 6Gb/s to a single device -- and that's in an ideal "super-fast machine sitting right next
to the AP in a shielded lab" setup.&lt;/p&gt;

&lt;p&gt;Here's what I had to drag up from my memory, and the new stuff I had to learn, in
order to get this all working.  I'll write about the background in this post, and
then tomorrow I'll post about what I actually put in place.&lt;/p&gt;
&lt;h3 id="a-bit-of-history"&gt;A bit of history&lt;/h3&gt;

&lt;p&gt;Let's start with a bit of the backstory.  Bear with me, it's not just self-indulgent
reminiscing!&lt;/p&gt;

&lt;p&gt;When I first started using networked computers, back in the early 90s, the most popular
standard was &lt;a href="https://en.wikipedia.org/wiki/10BASE2"&gt;10BASE2&lt;/a&gt;.  We had this in the first
office that I worked in, and in the university computer labs.  In the back of your
computer, you'd have a T-shaped connector like this:&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/10g-ethernet-what-i-relearned/BNC_Tee_connector,_with_Ethernet_cable_connected-92166.jpg" alt="BNC Tee connector, with Ethernet cable connected" title="BNC Tee connector, with Ethernet cable connected" /&gt;
&lt;small&gt;© Raimond Spekking / &lt;a href="https://creativecommons.org/licenses/by-sa/4.0/deed"&gt;CC BY-SA 4.0&lt;/a&gt; (via &lt;a href="https://commons.wikimedia.org/wiki/File:BNC_Tee_connector,_with_Ethernet_cable_connected-92166.jpg"&gt;Wikimedia Commons&lt;/a&gt;)&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;The end facing the camera in that photo was the bit that went into your computer.
Computers were daisy-chained together; you might have a server connected to workstation
one, workstation one to workstation two, and so on, until you reached the last workstation.
You'd have to cap the unused end of the T connectors at each end of the chain with a special terminator.&lt;/p&gt;

&lt;p&gt;Essentially it was a single coaxial cable, so every computer saw every
bit that was sent along the bus.  In turn, that meant that everyone was sharing the same bandwidth, a meagre
10Mb/s.  The cool thing about Ethernet (compared to older networking technologies) was
that the computers shared it without any need for coordination -- if two of them started
"speaking" at the same time, they'd notice, and stop.  They would then start again after
a random back-off, so one of them would randomly wait for less time than the other and
start first.  The other would notice that "the line was busy" and would wait again for another
chance.&lt;/p&gt;
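
&lt;p&gt;To make the back-off idea concrete, here's a toy Python sketch.  It's a simplification under my own assumptions rather than a faithful model of the standard: the station names and the &lt;code&gt;resolve_collision&lt;/code&gt; function are made up for illustration, time is counted in abstract slots, and every station simply retries after a tie.  The doubling of the wait window after each collision is the truncated binary exponential back-off that real Ethernet uses, which goes a little beyond the plain "random back-off" described above.&lt;/p&gt;

```python
import random


def resolve_collision(stations, rng, max_attempts=16):
    """Toy model: return (winner, attempts) once one station gets the line.

    `stations` is a list of hypothetical station names that have just
    collided.  Returns (None, max_attempts) if nobody wins in time.
    """
    for attempt in range(1, max_attempts + 1):
        # The window of possible waits doubles after each collision,
        # capped at 2**10 slots (truncated binary exponential back-off).
        window = 2 ** min(attempt, 10)
        waits = {name: rng.randrange(window) for name in stations}
        shortest = min(waits.values())
        first = [name for name, wait in waits.items() if wait == shortest]
        if len(first) == 1:
            # Exactly one station waited the shortest time: it starts
            # sending, and the others see that the line is busy.
            return first[0], attempt
        # Two or more picked the same shortest wait: another collision.
    return None, max_attempts


if __name__ == "__main__":
    rng = random.Random()
    winner, attempts = resolve_collision(["server", "ws1", "ws2"], rng)
    print(winner, "got the line after", attempts, "attempt(s)")
```

&lt;p&gt;With only a handful of stations this almost always resolves within an attempt or two; the more stations you add to the list, the more often they tie and have to go around again.&lt;/p&gt;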

&lt;p&gt;Of course, this limited the number of computers you could have on one network, as past
around 20 or so, they'd spend all of their time interrupting each other and never actually
be able to send anything -- and anyway, sharing 10Mb/s across a large number of computers
would be an issue.  On top of that, there was a hard cap of 30 machines per network segment.
You'd use more specialised networking equipment to link
different networks together -- bridges, switches and routers.  More about switches later.&lt;/p&gt;

&lt;p&gt;By the time we started setting up networking in a house that I shared with friends,
in around 1996 or so &lt;sup class="footnote-ref" id="fnref-1"&gt;&lt;a href="#fn-1"&gt;1&lt;/a&gt;&lt;/sup&gt;, the most popular option had changed: now people were using
10BASE-T.  Still 10Mb/s, but using the RJ45 connectors and twisted-pair cables that we've come to know and love.
All of the computers would have a single cable going to a hub, in a star topology.
You might link multiple hubs together to build larger networks.&lt;/p&gt;

&lt;p&gt;However, these
hubs were still little more than a convenient form factor to electrically link all
of the wires together into a single bus.  You still had the problem that every computer could see
every bit on the bus, and the same bandwidth-sharing and limits on the number of computers that you
could handle as a result.&lt;/p&gt;

&lt;p&gt;Over the years after that, things moved on.  Switches had been relatively expensive
things; they would be used to interlink hubs, or 10BASE2 networks.  They would learn (from seeing the
source MAC address on incoming packets) which machines were sending to each of their ports,
and use that to know where to send packets that came in on other ports.  If, say, a switch
learned that addresses A, B, and C were on port 1, then if a packet for one of those machines
came in on port 2, it would know it could just send it out on port 1 and not on the others.
That helped to address the bandwidth-sharing and the problems with collisions.&lt;/p&gt;
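
&lt;p&gt;Here's a minimal Python sketch of that learning behaviour.  The &lt;code&gt;LearningSwitch&lt;/code&gt; class and its method names are hypothetical, ports are just numbered from 1, and MAC addresses are stand-in strings.  One thing it adds that I didn't describe above is what happens when the destination hasn't been seen yet: the usual behaviour is to flood the packet out of every port except the one it arrived on.&lt;/p&gt;

```python
class LearningSwitch:
    """Toy model of a switch that learns which port each MAC is on."""

    def __init__(self, num_ports):
        self.ports = list(range(1, num_ports + 1))
        self.mac_table = {}  # source MAC address -> port it was seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Return the list of ports the packet should be sent out of."""
        # Learn (or refresh) where the sender lives.
        self.mac_table[src_mac] = in_port

        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to every other port.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            # Destination is on the same port the packet came from,
            # so there is nothing to forward.
            return []
        return [out_port]


if __name__ == "__main__":
    switch = LearningSwitch(num_ports=4)
    # A, B and C all send something via port 1, so the switch learns them.
    for mac in ("A", "B", "C"):
        switch.receive(1, mac, "unknown")
    # A packet for A now arrives on port 2: it only goes out of port 1.
    print(switch.receive(2, "D", "A"))  # prints [1]
```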

&lt;p&gt;Prices for switches got lower and lower, and eventually -- I think sometime between 2005 and
2010 -- they became so cheap that
there was little point in bothering with hubs -- you'd just connect every computer
directly to a switch.  That meant that any two computers on the same switch
could talk to each other at the full network speed, as packets would just be
switched from port to port &lt;sup class="footnote-ref" id="fnref-2"&gt;&lt;a href="#fn-2"&gt;2&lt;/a&gt;&lt;/sup&gt;.  The connections between switches were still a bottleneck, of course,
but that was much less of a problem.&lt;/p&gt;

&lt;p&gt;At the same time, speeds increased, from 10Mb/s
to 100Mb/s and then finally to 1Gb/s, which was standard for business machines by 2005 or so --
I remember that when we bought our first computers for Resolver Systems back then, that's what they
came with by default.&lt;/p&gt;

&lt;p&gt;Home computers weren't far behind -- and that's where we've been ever since. &lt;sup class="footnote-ref" id="fnref-3"&gt;&lt;a href="#fn-3"&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;

&lt;h3 id="isps-and-larger-networks-the-move-to-sfp"&gt;ISPs and larger networks: the move to SFP&lt;/h3&gt;

&lt;p&gt;Back to that bottleneck between the switches.  Even back in the days of 10Mb/s
networks, if you were managing a larger network, you would want a faster link between the switches -- so, for example, if
two computers on the same switch both wanted to access some external resource, they
wouldn't be competing for the same 10Mb/s uplink.  Once you went past small office-sized
networks, that kind of thing started becoming important.  ISPs and datacenters, of course,
had the same problem in spades.&lt;/p&gt;

&lt;p&gt;What you would need was an uplink on the switch that could run at a faster data rate.
So even when 1Gb/s Ethernet was too expensive for the connections to the computers themselves,
you might have a switch with a 1Gb/s uplink to connect it to the larger network,
and a bunch of 100Mb/s ports for the local stuff.&lt;/p&gt;

&lt;p&gt;Additionally, for larger networks you would have another problem -- physical
distance.  All of these RJ45-based networking technologies had a maximum cable length
of 100m.  You could extend that by putting a repeater (or even just a switch) every 100m or so as a "signal booster" --
but if, for example, you wanted to link two buildings, that could be tricky.  You'd
need to run both the data cable and power, and you'd need to have some way of getting
access to the repeaters if they went wrong.&lt;/p&gt;

&lt;p&gt;Ethernet over fibre optic connections had been a standard thing for years, though, and
it had much better range -- for single-mode, many kilometres.  So while it was too fiddly for LANs,
it made great sense as a backbone technology.&lt;/p&gt;

&lt;p&gt;What that meant, though, was that in order to set up some particular network topology,
you might wind up having to get a whole bunch of different switches.  For short connections
between two of them, you might use an RJ45 uplink connection, while for longer ones you
might want fibre.  More complex topologies might need some entirely different mix of
ports.&lt;/p&gt;

&lt;p&gt;To make this worse, there were a bunch of different fibre optic standards -- multi-mode
and single-mode fibres, different connectors, and so on.&lt;/p&gt;

&lt;p&gt;Rather than manufacturing a large range of different kinds of switches with all of the
combinations that people needed, manufacturers
separated out the physical layer of the transport from the switching hardware.  A switch,
instead of having specific RJ45 or fibre connectors for its ports, would have Small
Form-factor Pluggable (SFP) "cages", essentially a new kind of socket.  These allowed people to mix and match different kinds of
transceiver modules, which would slot into the cage to provide an actual usable interface -- one for RJ45 for gigabit Ethernet, or one for the particular
kind of fibre connection they were using -- whatever configuration worked best for them.&lt;/p&gt;

&lt;p&gt;A typical switch for a larger network might have one or two of those for backbone
connections, and then RJ45s for local connections.  Over time, gigabit backbones
were no longer enough, and SFP was followed by SFP+, which could handle 10Gb/s.
Since then, there have been extensions for even faster speeds, way up to hundreds of
Gb/s.&lt;/p&gt;

&lt;p&gt;Back in the day, this stuff was only important to network admins for medium-sized networks and larger,
of course.  But with 10Gb Ethernet, we've now hit the point where it matters
even for home users, and that's because of thermals.&lt;/p&gt;

&lt;h3 id="10gbase-t-is-so-hot-right-now"&gt;10GBASE-T is so hot (right now)&lt;/h3&gt;

&lt;p&gt;Here's the problem.  Somewhat loosely speaking, the faster a network connection on a particular
kind of wiring, the hotter it runs.  Over an RJ45/twisted pair connection, 10Mb/s Ethernet basically shed no heat, 100Mb/s
a little more, and even gigabit Ethernet just left your switches somewhat warm.
The jump up to 10Gb over RJ45, called 10GBASE-T, makes things decidedly toasty --
you'll see just how toasty in tomorrow's post.&lt;/p&gt;

&lt;p&gt;There's also the issue of cabling.  Because network speeds have been stable
for some time -- Gigabit Ethernet being the standard for ~20 years --
most buildings with structured cabling (the kind of thing where there are RJ45 sockets
in the walls wired together) will have the standard for that -- CAT-5E.  Unfortunately 10Gb/s Ethernet
won't officially work over it -- you might be lucky, especially with short cables,
but in general it won't work, or if it does it won't be reliable.&lt;/p&gt;

&lt;p&gt;CAT-6 cabling helps -- it can handle 10Gb/s over runs up to about 55 metres.  And
the ideal is CAT-6A, which can handle 10Gb/s over the same 100 metre cable lengths
that you'd expect for the older, slower setups.&lt;/p&gt;
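
&lt;p&gt;Those limits are simple enough to write down as a lookup table.  This is just a sketch of the numbers above, with a made-up function name; it only captures what's officially supported, so it will say "no" to a short CAT-5E run even though you might get lucky with one in practice.&lt;/p&gt;

```python
# Maximum officially supported 10GBASE-T run per cable category, in metres.
# None means "not officially supported at any length".
MAX_10G_RUN_METRES = {
    "CAT-5E": None,
    "CAT-6": 55,
    "CAT-6A": 100,
}


def officially_supports_10g(category, run_metres):
    """Return True if 10Gb/s is officially OK on this cable and length."""
    max_run = MAX_10G_RUN_METRES[category]
    return max_run is not None and max_run >= run_metres


if __name__ == "__main__":
    for category, run in [("CAT-5E", 10), ("CAT-6", 30), ("CAT-6", 80), ("CAT-6A", 80)]:
        print(category, run, officially_supports_10g(category, run))
```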

&lt;p&gt;What this meant was that an interim standard was created.  10GBASE-T is hot and needs
cables that people don't necessarily have, especially when you're talking about what's
installed in the walls of their building.  But if you run it a bit slower, you can
do so over older cables and without melting them.&lt;/p&gt;

&lt;p&gt;That's why I skipped over 2.5Gb/s Ethernet earlier (or indeed the rarer 5Gb/s).
They were introduced as slowed-down versions of 10Gb/s to get it to work on existing
infrastructure without major upgrades.  And that's great, right up until the point
your ISP emails you to say that they're offering 10Gb/s to your home now...&lt;/p&gt;

&lt;p&gt;So, what can you do to run 10Gb/s without melting things?&lt;/p&gt;

&lt;h3 id="sfp-dac-fibre"&gt;SFP+, DAC, fibre&lt;/h3&gt;

&lt;p&gt;Let's think about what an SFP or SFP+ module actually is.  It slots into a cage on a
switch.  On one side, there's an electrical connection to the switch hardware, which
is carrying the signal -- incoming and outgoing -- using a particular protocol &lt;sup class="footnote-ref" id="fnref-4"&gt;&lt;a href="#fn-4"&gt;4&lt;/a&gt;&lt;/sup&gt;.
The module does its magic, and on the other side we have -- say -- 10GBASE-T to an RJ45
socket, or a blinking laser with an appropriate interface for optical fibre.&lt;/p&gt;

&lt;p&gt;What would happen if you just had a dumb electrical cable to connect an SFP+ cage on
one switch to a cage on another switch?  That actually works pretty well!  It's called a
passive Direct Attach Copper (DAC) cable.  The interfacing is a little more complicated
than just a completely dumb wire -- the switch
will want to query the module in the cage to find out some details about it, so you need
a tiny bit of electronics -- but it's still really simple.&lt;/p&gt;

&lt;p&gt;On top of that, if you add a bit of amplification to the DAC, then you get an active
DAC, which can double the length that a passive one can manage (though these are relatively rare).&lt;/p&gt;

&lt;p&gt;The neat thing about DACs is that they run &lt;em&gt;much&lt;/em&gt; cooler than 10GBASE-T, using about
a third of the power.  Of course,
they lose out in terms of range.  But for simple stuff within one room, and especially
between switches in a rack, they work really well.&lt;/p&gt;

&lt;p&gt;The next step on top of DACs is that you can convert the underlying SFP(+) protocol
directly to light, and send it down an optical fibre -- normally called an Active Optical
Cable, or an AOC for short
(though I've seen the rather confusing terminology "optical DAC" in various places).
With that, you can normally get up to 100m.  These are cheap and easy to use (because
they're all-in-one units, so you don't have any fiddly alignment of the fibre to do),
so they're the best option once you pass passive-DAC distances.&lt;/p&gt;

&lt;p&gt;After that, though, you really need to switch to the official standards, and go to
more traditional fibre-optic setups.  I've done much less research into those, so
won't try to explain them.  Either way, for the home, anything above this level is
probably overkill right now...&lt;/p&gt;

&lt;h3 id="wrapping-up"&gt;Wrapping up&lt;/h3&gt;

&lt;p&gt;So: moving from the 2.5Gb/s networks that work smoothly with the same infrastructure
we've been using for the last 20 years or so to 10Gb/s is a tricky step change.  Suddenly,
things that didn't matter -- thermal management, cable lengths, and so on -- become
important.  And there are solutions, but you need to start actually understanding things
again rather than just plugging stuff in and assuming it will work.&lt;/p&gt;

&lt;p&gt;Fun!  Time to put it into practice :-)  In my
next post, I'll show exactly the changes I had to make to get my existing 2.5Gb/s
network ported over to 10Gb/s -- the hardware I wound up buying, how well it works,
and (importantly) how hot it all runs.&lt;/p&gt;

&lt;p&gt;&lt;a href="/2026/04/10g-ethernet-what-i-did"&gt;Here's a link to that post&lt;/a&gt;.&lt;/p&gt;

&lt;div class="footnotes"&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id="fn-1"&gt;
&lt;p&gt;To share our blazingly fast bonded dual ISDN Internet connection -- 128Kb/s.&amp;#160;&lt;a href="#fnref-1" class="footnoteBackLink" title="Jump back to footnote 1 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-2"&gt;
&lt;p&gt;I remember feeling a little sad when that happened, because it meant that what
I felt was coolest about Ethernet -- the back-off-and-retry thing -- was no longer
all that important.  And when the connections went full duplex (a single
switch port could both send and receive at the same time over the same cable) it was finished.&amp;#160;&lt;a href="#fnref-2" class="footnoteBackLink" title="Jump back to footnote 2 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-3"&gt;
&lt;p&gt;If you're thinking "what about 2.5Gb/s?", I'll come back to that -- it's an
interesting case.&amp;#160;&lt;a href="#fnref-3" class="footnoteBackLink" title="Jump back to footnote 3 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-4"&gt;
&lt;p&gt;SFF-8472 for SFP, then there's SFF-8431 and SFF-8432 for SFP+.&amp;#160;&lt;a href="#fnref-4" class="footnoteBackLink" title="Jump back to footnote 4 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description><guid isPermaLink="false">/2026/04/10g-ethernet-what-i-relearned</guid><pubDate>Tue, 28 Apr 2026 18:45:00 +0000</pubDate></item><item><title>10Gb/s Ethernet: what I actually did to get it working in my home</title><link>https://www.gilesthomas.com/2026/04/10g-ethernet-what-i-did</link><description>&lt;p&gt;Having &lt;a href="/2026/04/10g-ethernet-what-i-relearned"&gt;learned enough&lt;/a&gt; about 10Gb/s
Ethernet to be comfortable about setting it up in my
house, it was time to bite the bullet: order it from the ISP, buy some kit, and
get started.&lt;/p&gt;

&lt;p&gt;I already had 2.5Gb/s working.  The apartment has structured cabling --
each room has one or more RJ45 sockets in the wall,
and there's a patch panel downstairs by our front door that has a matching patch socket
for each wall socket.  So when we moved in, I simply set things up so that there was a 2.5Gb/s switch
down by the patch panel, and wired everything together there.  Most of our stuff works
over WiFi, of course, but I needed a wired backbone to connect the excessive number
of computers in my study both to each other, and to the outside world.&lt;/p&gt;

&lt;p&gt;What did I need to do?&lt;/p&gt;
&lt;p&gt;Simplifying a bit, I had this 2.5Gb/s setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ISP connection came into the apartment in the living room.&lt;/li&gt;
&lt;li&gt;It went through a router/firewall machine I'd set up myself (more on that later),
then via a 2.5Gb/s switch to the main WiFi AP and also to a wall socket.&lt;/li&gt;
&lt;li&gt;Down at the patch panel, I had a 2.5Gb/s switch, which was connected to the patch
socket corresponding to the router's wall socket.&lt;/li&gt;
&lt;li&gt;Another connection from that switch went to the patch socket corresponding to the
wall socket in my study.&lt;/li&gt;
&lt;li&gt;In the study, I had another 2.5Gb/s switch that handled internal networking.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a few other things dotted around, of course -- extra APs and what-have-you --
but that's the core, and I'll focus on that to keep things simple.&lt;/p&gt;

&lt;p&gt;Would I be able to get it all upgraded to work with 10Gb/s?  The most important
question was the structured cabling in the walls; was it CAT-5E or CAT-6, or even CAT-6A?
Remember from the last post, 10GBASE-T might work over short runs of -5E (even though
officially it's not meant to be able to).  It probably would run over -6, because that's
generally OK up to 55 metres or so, and I don't think any of the runs in the house are longer than that.
And it would be fine over -6A, which is good for 100-metre runs.&lt;/p&gt;

&lt;p&gt;I was unable to find out exactly which type I had (the
parts of the cables that are visible to me don't have any kind of marking to say),
so I decided to do a staged rollout.&lt;/p&gt;

&lt;p&gt;The first step was to set up the wired network within my study as 10Gb/s.  There
were two important things to wire up; my primary desktop, &lt;code&gt;perry&lt;/code&gt;, and a Proxmox
cluster I have running in an 11" rack.  The setup I had was just one 2.5Gb/s switch
sitting on top of the rack, linked to the wall, to the cluster machines, and to &lt;code&gt;perry&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now, getting the Proxmox cluster up to high-speed internal networking was a non-starter.
The machines there are all old ones -- it's essentially a retirement home for mini-PCs
I used to use for other things &lt;sup class="footnote-ref" id="fnref-1"&gt;&lt;a href="#fn-1"&gt;1&lt;/a&gt;&lt;/sup&gt;.  They're mostly gigabit Ethernet, with one
2.5Gb/s one.&lt;/p&gt;

&lt;p&gt;But getting &lt;code&gt;perry&lt;/code&gt; up to 10Gb/s was an important goal, as that's where I do most
of my work.  I also wanted to have space for a second machine that I'm planning to
set up to do training/inference without tying up &lt;code&gt;perry&lt;/code&gt;'s GPU, and that would also
need fast networking.&lt;/p&gt;

&lt;p&gt;I wanted to have things running reasonably cool (after all, the PC itself and its GPU pump out
quite enough heat already when doing a &lt;a href="/2026/04/llm-from-scratch-32k-interventions-training-our-best-model-locally-gradient-accumulation"&gt;training run&lt;/a&gt;),
so DAC felt like the right way to go.
I bought a reasonably cheap managed 10Gb/s switch &lt;sup class="footnote-ref" id="fnref-2"&gt;&lt;a href="#fn-2"&gt;2&lt;/a&gt;&lt;/sup&gt;, a
&lt;a href="https://mikrotik.com/product/crs305_1g_4s_in"&gt;MikroTik CRS305-1G-4S+IN&lt;/a&gt;, with a single
10GBASE-T adapter to allow me to connect it to the wall socket.  I tend to name
anything on my network with its own IP, so this became &lt;code&gt;nigel&lt;/code&gt;.
Next, a 10Gb/s SFP+ PCIe card -- an
&lt;a href="https://www.asus.com/networking-iot-servers/wired-networking/all-series/xg-c100f/"&gt;Asus XG-C100F&lt;/a&gt;
-- for &lt;code&gt;perry&lt;/code&gt; and a DAC cable to connect the two.&lt;/p&gt;

&lt;p&gt;For the Proxmox cluster, I decided to stick with the old 2.5Gb/s unmanaged switch, a
&lt;a href="https://www.trendnet.com/products/2.5g-switches-sfpplus-port/6-port-2.5g-unmanaged-switch-10g-sfpplus-slot-TEG-S5061"&gt;TRENDnet TEG-S5061&lt;/a&gt;.
I'd originally bought that one because it was the cheapest 2.5Gb/s switch on Amazon with
decent reviews, and had completely forgotten that it had one major feature --
an SFP+ 10Gb/s port for the uplink!  So another short DAC to connect that to the MikroTik,
and the study network "backbone" was 10Gb/s.&lt;/p&gt;

&lt;p&gt;Of course, no two computers in there could actually
communicate at that speed, as only &lt;code&gt;perry&lt;/code&gt; was 10Gb/s-capable -- but I could have all of the Proxmox machines talking to &lt;code&gt;perry&lt;/code&gt;
at the same time at full speed.  I did some tests with &lt;code&gt;iperf3&lt;/code&gt; to make sure that it was
all working as expected; I couldn't test very thoroughly, but I was able to get about 4Gb/s total throughput,
which was reassuring: two machines at 1Gb/s plus one at 2.5Gb/s should
be a touch less than 4.5Gb/s.&lt;/p&gt;
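
&lt;p&gt;The back-of-the-envelope sum behind that number looks like this in Python.  The function name is just something I've made up for illustration, and it deliberately ignores protocol overhead, which is why real measurements come in a touch lower than the figure it gives.&lt;/p&gt;

```python
def expected_aggregate_gbps(sender_rates_gbps, receiver_rate_gbps):
    """Best-case total throughput into one receiver from several senders.

    Each sender is limited by its own link speed, and the total can never
    exceed the speed of the receiver's link.
    """
    return min(sum(sender_rates_gbps), receiver_rate_gbps)


# Two gigabit machines plus one 2.5Gb/s machine, all sending to a
# 10Gb/s-capable receiver at once.
proxmox_links = [1.0, 1.0, 2.5]
print(expected_aggregate_gbps(proxmox_links, 10.0))  # prints 4.5
```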

&lt;p&gt;The next step was to check the possibilities for the connection down to the patch panel.
I bought a
&lt;a href="https://eu.store.ui.com/eu/en/category/accessories-cables-dacs/collections/pro-store-ethernet-adapter/products/uacc-adapter-rj45-usbc-10ge"&gt;Ubiquiti 10G Ethernet dongle&lt;/a&gt;,
and took my laptop, &lt;code&gt;laura&lt;/code&gt; &lt;sup class="footnote-ref" id="fnref-3"&gt;&lt;a href="#fn-3"&gt;3&lt;/a&gt;&lt;/sup&gt;, down there.&lt;/p&gt;

&lt;p&gt;The news was good!  Running an &lt;code&gt;iperf3&lt;/code&gt; test between &lt;code&gt;perry&lt;/code&gt; and &lt;code&gt;laura&lt;/code&gt; down the
structured cabling, I was able to get just less than 10Gb/s from &lt;code&gt;laura&lt;/code&gt; to &lt;code&gt;perry&lt;/code&gt;,
and about 7Gb/s from &lt;code&gt;perry&lt;/code&gt; to &lt;code&gt;laura&lt;/code&gt;.  The slower receive speed at the &lt;code&gt;laura&lt;/code&gt; end
worried me, but when I checked &lt;code&gt;ps&lt;/code&gt; it became obvious what was going on.  I could see
the &lt;code&gt;ksoftirqd&lt;/code&gt; kernel process running at 100%, so some single-core thing was maxing out.&lt;/p&gt;

&lt;p&gt;The Ethernet dongle
was connected over USB, of course, and that meant it needed to do much more work on the CPU for each
incoming "data has arrived" interrupt than a PCIe card like the one on &lt;code&gt;perry&lt;/code&gt;.
That meant that &lt;code&gt;laura&lt;/code&gt; could only receive
data at a rate that one core could handle, which happened to be about 7Gb/s.
&lt;code&gt;laura&lt;/code&gt; is a ThinkPad optimised for lightness and
long battery life, not CPU power, so single-core performance is not great, and it
hit a wall.&lt;/p&gt;

&lt;p&gt;But the 10Gb/s speed in the other direction was enough to make me comfortable
that the structured cabling could
handle that speed, which was excellent news -- probably I had either short runs of CAT-6,
or CAT-6A in there, though conceivably I was just getting very lucky with CAT-5E.&lt;/p&gt;

&lt;p&gt;The downside was the heat.  The USB dongle got too hot to comfortably hold while it was
running, and while I wasn't able to check the SFP+ module in the MikroTik during the test,
when I came back upstairs again I touched it and it was even hotter.
I decided that that was something to keep an eye on for later (and as you'll see, it
did become a recurring theme).&lt;/p&gt;

&lt;p&gt;For now, it was time to do the rest of the upgrade.&lt;/p&gt;

&lt;p&gt;Downstairs at the patch panel, it was a simple choice.  All of the connections were
RJ45, of course, and I only needed four.  So the
&lt;a href="https://mikrotik.com/product/crs304_4xg_in"&gt;MikroTik CRS304-4XG-IN&lt;/a&gt; was the obvious choice.&lt;/p&gt;

&lt;p&gt;The final place where I needed to do some upgrades was at the ISP end.  The box that our provider
gave us had just one 10Gb/s port -- a 10GBASE-T RJ45 one.  Now, I don't generally
trust ISP routers that much, so I've always had my own router sitting between them
and the home network -- a dual-port mini-PC running a locked-down Arch installation &lt;sup class="footnote-ref" id="fnref-4"&gt;&lt;a href="#fn-4"&gt;4&lt;/a&gt;&lt;/sup&gt;.
My old one was dual-2.5Gb/s, so that needed an upgrade.&lt;/p&gt;

&lt;p&gt;I settled on a &lt;a href="https://eu.protectli.com/product/vp2440/"&gt;Protectli VP2440&lt;/a&gt;, which has
two SFP+ 10Gb/s cages, plus two normal 2.5Gb/s RJ45s.  I didn't need the latter,
but it was the cheapest option with 10Gb/s in their range, and I've always been very
happy with their hardware and customer service.&lt;/p&gt;

&lt;p&gt;However, I was a little concerned about thermals.  As I mentioned, the SFP+ module
in the MikroTik in the study got very hot when I did my test.  I'd need dual SFP+ modules
for the Protectli -- one for the WAN port connected to the ISP box, and the other for
the wall socket to go down to the patch panel.  Might it overheat?&lt;/p&gt;

&lt;p&gt;The good thing about Protectli is that you can just ask them.  I dropped them a line, and
got a reply the next day from a customer support rep saying that he believed it would
be fine, but he just wanted to double-check with one of their techs.  The following day,
he followed up to say that the tech had confirmed that it would be OK.&lt;/p&gt;

&lt;p&gt;Promising!  And because of that, plus their 30-day money-back guarantee, I decided to
go for it.&lt;/p&gt;

&lt;p&gt;A few days later, the new router arrived.  I named it &lt;code&gt;reggie&lt;/code&gt;, set it up with
my normal router Arch installation, plugged it into the ISP box and the wall...
and it worked just fine!&lt;/p&gt;

&lt;p&gt;So the setup at this point was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ISP box to WAN on &lt;code&gt;reggie&lt;/code&gt; the router.&lt;/li&gt;
&lt;li&gt;LAN on &lt;code&gt;reggie&lt;/code&gt; to wall socket.&lt;/li&gt;
&lt;li&gt;Patch panel socket corresponding to that wall socket to port 0 on the downstairs
RJ45-only switch, &lt;code&gt;nelly&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nelly&lt;/code&gt; port 1 to the patch panel socket corresponding to my study's wall socket.  (Other
ports to other things I'm disregarding for simplicity.)&lt;/li&gt;
&lt;li&gt;Wall socket in the study to the RJ45 SFP+ module in port 0 on &lt;code&gt;nigel&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nigel&lt;/code&gt; port 1: DAC to an SFP+ network card on &lt;code&gt;perry&lt;/code&gt;, my workstation.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nigel&lt;/code&gt; port 2: DAC to the SFP+ 10Gb/s uplink on the old TRENDnet 2.5Gb/s switch to
handle the Proxmox cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the same time I decided to move the main WiFi AP (&lt;code&gt;winona&lt;/code&gt;, a
&lt;a href="https://eu.store.ui.com/eu/en/products/u6-enterprise"&gt;Ubiquiti U6 Enterprise&lt;/a&gt;) that
was previously next to the router over
to my study -- so that was hanging off
the TRENDnet switch.&lt;/p&gt;

&lt;p&gt;After a bit of bedding in, I decided I wanted to move &lt;code&gt;winona&lt;/code&gt; back to the same
place as the router -- it's more central so it provides better WiFi coverage from there.  So I
got another CRS304-4XG-IN -- the 10GBASE-T MikroTik
switch, like the one by the patch panel -- so that the first part of the above topology became:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ISP box to WAN on &lt;code&gt;reggie&lt;/code&gt; the router.&lt;/li&gt;
&lt;li&gt;LAN on &lt;code&gt;reggie&lt;/code&gt; to the new switch (&lt;code&gt;norman&lt;/code&gt;) port 0.&lt;/li&gt;
&lt;li&gt;Port 1 on &lt;code&gt;norman&lt;/code&gt; to the wall socket (thence down to the patch panel).&lt;/li&gt;
&lt;li&gt;Port 2 on &lt;code&gt;norman&lt;/code&gt; to the WiFi AP via a PoE injector.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this is sitting in a sideboard next to the dining table with no ventilation.
That's probably close to a pathological case for hot-running network infrastructure
like this, so... how about those thermals?&lt;/p&gt;

&lt;h3 id="in-which-i-consider-replacing-my-airfryer-with-an-sfp-module"&gt;In which I consider replacing my airfryer with an SFP+ module&lt;/h3&gt;

&lt;p&gt;I like to keep track of what is going on with my zoo of computers, so I run
&lt;a href="https://www.influxdata.com/time-series-platform/telegraf/"&gt;Telegraf&lt;/a&gt; on all of them.  This
collects stats like the CPU temperature, system load, disk space, CPU and network use,
and so on.
Each machine sends these to an InfluxDB instance on a Proxmox VM (&lt;code&gt;varro&lt;/code&gt;, if you're keeping track).&lt;/p&gt;

&lt;p&gt;When I set all of this up, I also wanted to monitor the switches.  MikroTik switches
expose their stats over SNMP, so with a bit of help from various LLMs I was able to augment
the Telegraf config on &lt;code&gt;reggie&lt;/code&gt; to also scrape that data and send it to &lt;code&gt;varro&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I use Grafana to get all of this stuff into various dashboards, and one of them shows
the temperatures of the networking hardware.  Firstly, &lt;code&gt;reggie&lt;/code&gt; -- the Protectli
router with two SFP+ cages, each of which has a 10GBASE-T module.  I receive
separate temperatures for the CPU and for each SFP+ module:&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/10g-ethernet-what-i-did/reggie-temps.png" alt="Reggie temperatures" title="Reggie temperatures" /&gt;&lt;/p&gt;

&lt;p&gt;That's not exactly running cool, but TBH it's not too bad!  I believe that the SFP+ cages
are thermally coupled to the case (which is essentially one giant heatsink).  So they're
running a bit hotter than the machine as a whole, but it's not baking.  Let's see how
that does as the weather warms -- you can see that it's been going up over the last week or
so as we had a bit of a heatwave here in Lisbon.&lt;/p&gt;

&lt;p&gt;How about &lt;code&gt;norman&lt;/code&gt;, the MikroTik CRS304-4XG-IN switch -- all native 10GBASE-T, in the
same sideboard as &lt;code&gt;reggie&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/10g-ethernet-what-i-did/norman-temps.png" alt="Norman temperatures" title="Norman temperatures" /&gt;&lt;/p&gt;

&lt;p&gt;A bit hotter than I'd like -- above the 70C upper limit of its tested ambient temperature range,
though of course this is an internal reading rather than an external one.  &lt;code&gt;reggie&lt;/code&gt;, which is right
next to &lt;code&gt;norman&lt;/code&gt;, has an internal temperature lower than 70C, and its internal temperature
can't be lower than ambient -- so the air in the sideboard must be below 70C too, which suggests
that we're probably still OK.&lt;/p&gt;

&lt;p&gt;I think that both of those could be improved, though.  The sideboard they're in is
unventilated, and it has &lt;code&gt;winona&lt;/code&gt; the Ubiquiti U6 Enterprise WiFi AP in there too --
that runs pretty hot.
So a sensible first step is probably to move the AP elsewhere, and if that's not enough,
perhaps to add a USB fan to bring cooler air in through the back of the sideboard.&lt;/p&gt;

&lt;p&gt;Now, how about &lt;code&gt;nelly&lt;/code&gt;, the switch downstairs by the patch panel?  It's also in a
cupboard with no airflow, and while it's not sharing it with a router, there is a
PoE injector and another WiFi AP, &lt;code&gt;wilbur&lt;/code&gt;, in there (albeit a cooler-running one, a
&lt;a href="https://eu.store.ui.com/eu/en/products/u7-lite"&gt;Ubiquiti U7 Lite&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/10g-ethernet-what-i-did/nelly-temps.png" alt="Nelly temperatures" title="Nelly temperatures" /&gt;&lt;/p&gt;

&lt;p&gt;Not too bad at all!  Plenty of headroom there.&lt;/p&gt;

&lt;p&gt;Finally, let's go back upstairs to my study.  If you remember, I have &lt;code&gt;nigel&lt;/code&gt; there,
a MikroTik CRS305-1G-4S+IN -- a four-port SFP+ switch.  I get just data for the switch
itself and for the 10GBASE-T module -- the DACs don't report numbers.  Check
this out -- the right hand chart especially:&lt;/p&gt;

&lt;p&gt;&lt;img src="/post-assets/10g-ethernet-what-i-did/nigel-temps.png" alt="Nigel temperatures" title="Nigel temperatures" /&gt;&lt;/p&gt;

&lt;p&gt;Yikes!  The switch itself is OK at a comfortable 48C, but that SFP+ module is hovering
around 93C.  That's internal rather than the "touch" temperature, but assuming they're close,
it's definitely getting towards blistering temperatures if you touch it.
I'm getting a stick-on mini-heatsink -- the type you can get for Raspberry Pis -- to see if
that might help.  It's also sitting on an 11" rack, so I might see if I can find a way to
thermally couple it to that.&lt;/p&gt;

&lt;p&gt;But despite those somewhat concerning numbers, it's all working fine!
I have a periodic network test running on &lt;code&gt;perry&lt;/code&gt;, checking end-to-end out to Google's
8.8.8.8 nameservers, and I haven't seen a glitch.  &lt;code&gt;iperf3&lt;/code&gt; tests from &lt;code&gt;perry&lt;/code&gt; to
&lt;code&gt;reggie&lt;/code&gt; show negligible numbers of errors.&lt;/p&gt;

&lt;p&gt;It's a working system, so naturally I want to change things.  What?&lt;/p&gt;

&lt;h3 id="what-next"&gt;What next?&lt;/h3&gt;

&lt;p&gt;TBH, I think I'll be able to limit my desire to tinker in the short term to just
sorting those worrying thermal numbers.  For &lt;code&gt;reggie&lt;/code&gt; and &lt;code&gt;norman&lt;/code&gt; in the sideboard,
I think that moving &lt;code&gt;winona&lt;/code&gt; the WiFi AP out again will help.  It's power-over-Ethernet,
so I can just run one wire up the wall and hide the AP itself behind some art.&lt;/p&gt;

&lt;p&gt;For the almost-boiling-point SFP+ module on &lt;code&gt;nigel&lt;/code&gt;, the study switch, a stick-on Raspberry
Pi heatsink is, as I said, probably a good starting point.  If that isn't enough,
perhaps one with a cooling fan.  The actual amount of power being used there isn't much,
just 3W or so -- it's only reaching such a high temperature because it's in such a small
space.&lt;/p&gt;

&lt;p&gt;The more interesting question is, what will I do if and when it's time to take
the next step up, to 40Gb/s or higher?  As I said in
&lt;a href="/2026/04/10g-ethernet-what-i-relearned"&gt;my last post&lt;/a&gt;, 10GBASE-T is essentially
the end of the RJ45, twisted pair world we've been in for the last 20+ years.  CAT-8
cabling can, apparently, run up to 40Gb/s, but it comes with its own problems --
it's super-stiff, and hard to run around tight corners or to get into the limited
space in the boxes behind wall sockets.&lt;/p&gt;

&lt;p&gt;I think that the right thing to do would probably be to switch to optical fibre.
I did some initial research around this while I was still unsure if the existing cabling
would work, and it seems like replacing each cable drop (that is, the run from a wall
socket to the patch panel) with at least a dual-fibre cable, one fibre to send and one to
receive, would work fine, potentially even up to 800Gb/s with the right setup.
The wall sockets could
be LC duplex, which are designed to be easy to connect (by fibre standards).&lt;/p&gt;

&lt;p&gt;If I wanted to really future-proof things, it might even make sense to run four-fibre or
even eight-fibre cables, and leave all but two of each "dark".  That would potentially
leave even more space for improvement, and would actually cost very little extra --
the installation cost would be way higher than the cost of the cable.&lt;/p&gt;

&lt;p&gt;Still, at hundreds of Euros per cable drop, plus project overheads, I'm glad I
don't have to do that now.  A good decision to be able to punt down the line; who
knows what will change between now and whenever my ISP starts offering even faster
speeds?&lt;/p&gt;

&lt;p&gt;So let's wrap this up with the moment you've undoubtedly been waiting for...&lt;/p&gt;

&lt;h3 id="the-money-shot"&gt;The money shot&lt;/h3&gt;

&lt;p&gt;&lt;img src="/post-assets/10g-ethernet-what-i-did/network-speed.png" alt="Ookla speedtest: 8Gb/s" title="Ookla speedtest: 8Gb/s" /&gt;&lt;/p&gt;

&lt;p&gt;Not bad! Not quite the 10Gb/s advertised, but it's close -- and I've seen it get
up to 9Gb/s from time to time (but unfortunately not screenshotted it).  And to be clear, that was from &lt;code&gt;perry&lt;/code&gt; -- so the
speed was through all three of the switches, &lt;code&gt;nigel&lt;/code&gt;, &lt;code&gt;nelly&lt;/code&gt; and &lt;code&gt;norman&lt;/code&gt;, and
through &lt;code&gt;reggie&lt;/code&gt; the router.  Direct tests on &lt;code&gt;reggie&lt;/code&gt; using the CLI version of
the Ookla &lt;code&gt;speedtest&lt;/code&gt; app &lt;sup class="footnote-ref" id="fnref-5"&gt;&lt;a href="#fn-5"&gt;5&lt;/a&gt;&lt;/sup&gt; get similar results -- in fact, oddly, they tend to
be about 5% slower than the ones from &lt;code&gt;perry&lt;/code&gt;.  Not sure what to make of that.
I'll have to investigate further, but if anyone has any ideas about what might cause
it, I'd love to hear them.&lt;/p&gt;

&lt;p&gt;So now, when I'm uploading models to Hugging Face and downloading others,
syncing large &lt;code&gt;uv&lt;/code&gt; environments, downloading the latest Arch ISO, and streaming music,
while at the same time Sara is watching Netflix
and my Dropbox is Dropboxing, everything can run smoothly.&lt;/p&gt;

&lt;p&gt;Nice!  Mission accomplished.  I hope this was an interesting read, and perhaps
helpful for other people who are considering a similar upgrade.&lt;/p&gt;

&lt;p&gt;Now, time for me to go back to your regularly-scheduled all-AI, all-the-time content ;-)&lt;/p&gt;

&lt;div class="footnotes"&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id="fn-1"&gt;
&lt;p&gt;My OpenClaw instance, which runs there, has dubbed it "the Island of Misfit Computers".&amp;#160;&lt;a href="#fnref-1" class="footnoteBackLink" title="Jump back to footnote 1 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-2"&gt;
&lt;p&gt;I moved from a simple network to a multi-VLAN one at the same time as this upgrade, so managed
switches have become useful -- if you're just doing an upgrade to 10Gb, you can
do it all with unmanaged ones.&amp;#160;&lt;a href="#fnref-2" class="footnoteBackLink" title="Jump back to footnote 2 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-3"&gt;
&lt;p&gt;In case you're wondering about the naming strategy for machines on the network:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PCs, desktops, etc: name starts with &lt;strong&gt;P&lt;/strong&gt;, for example &lt;code&gt;perry&lt;/code&gt; or &lt;code&gt;poppy&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Laptops: name starts with &lt;strong&gt;L&lt;/strong&gt;.  Basically just &lt;code&gt;laura&lt;/code&gt;.  Sara named her own work laptop,
unrestricted by my convention, so it's called &lt;code&gt;dellbert&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Routers: name starts with &lt;strong&gt;R&lt;/strong&gt;: &lt;code&gt;reggie&lt;/code&gt;, &lt;code&gt;ronaldo&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Network infrastructure: name starts with &lt;strong&gt;N&lt;/strong&gt;: &lt;code&gt;nigel&lt;/code&gt;, &lt;code&gt;nelly&lt;/code&gt; and &lt;code&gt;norman&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;WiFi APs: name starts with &lt;strong&gt;W&lt;/strong&gt;, e.g. &lt;code&gt;winona&lt;/code&gt; and &lt;code&gt;wilbur&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;VMs on Proxmox: name starts with &lt;strong&gt;V&lt;/strong&gt;: &lt;code&gt;virgil&lt;/code&gt;, &lt;code&gt;vinny&lt;/code&gt;, &lt;code&gt;varro&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;I also have a bare metal server on Hetzner, which I've named &lt;code&gt;hannibal&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What can I say.  It passes the time.&amp;#160;&lt;a href="#fnref-3" class="footnoteBackLink" title="Jump back to footnote 3 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-4"&gt;
&lt;p&gt;It's largely old routers that populate the Proxmox cluster.&amp;#160;&lt;a href="#fnref-4" class="footnoteBackLink" title="Jump back to footnote 4 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn-5"&gt;
&lt;p&gt;&lt;a href="https://www.speedtest.net/apps/cli"&gt;Their own one&lt;/a&gt;, not the more commonly-used
&lt;a href="https://github.com/sivel/speedtest-cli"&gt;OSS Python one&lt;/a&gt;, which isn't fast enough
to handle speeds over about 5Gb/s.&amp;#160;&lt;a href="#fnref-5" class="footnoteBackLink" title="Jump back to footnote 5 in the text."&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description><guid isPermaLink="false">/2026/04/10g-ethernet-what-i-did</guid><pubDate>Wed, 29 Apr 2026 14:15:00 +0000</pubDate></item></channel></rss>