Writing an LLM from scratch, part 32d -- Interventions: adding attention bias

Posted on 6 February 2026 in AI, LLM from scratch, TIL deep dives, Python |

I'm still seeing what I can do to improve the test loss for a from-scratch GPT-2 small base model, trained using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". This is the third intervention I'm trying: adding bias to the attention weight matrices.

In the code from the book, we have this:

class MultiHeadAttention(nn.Module):

    def __init__(
        self,
        d_in, d_out,
        context_length,
        dropout,
        num_heads,
        qkv_bias=False
    ):
        ...

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

        ...

    def forward(self, x):
        ...

        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

So: we initialise the projection matrices Wq, Wk and Wv as nn.Linear layers rather than plain matrices of weights, and have a parameter qkv_bias to say whether or not those layers should include a bias term. In all of our training runs so far we've set that to False.
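As a quick illustration of what that parameter actually toggles (this is just a sketch of the standard PyTorch behaviour, not code from the book): with bias=True, each nn.Linear carries one extra learnable offset per output dimension, which gets added after the matrix multiply.

```python
import torch
import torch.nn as nn

d_in, d_out = 768, 768  # GPT-2 small's embedding dimension

proj_no_bias = nn.Linear(d_in, d_out, bias=False)   # qkv_bias=False: y = x @ W.T
proj_with_bias = nn.Linear(d_in, d_out, bias=True)  # qkv_bias=True:  y = x @ W.T + b

print(proj_no_bias.bias)          # None -- no extra parameters
print(proj_with_bias.bias.shape)  # torch.Size([768])

# The parameter-count difference is exactly d_out per projection:
n_no = sum(p.numel() for p in proj_no_bias.parameters())
n_with = sum(p.numel() for p in proj_with_bias.parameters())
print(n_with - n_no)  # 768
```

So flipping qkv_bias to True adds 3 × d_out parameters per attention block, a tiny fraction of the model's total.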

Why do we have this parameter, and where did it come from?

[ Read more ]


Writing an LLM from scratch, part 32c -- Interventions: removing dropout

Posted on 5 February 2026 in AI, LLM from scratch, TIL deep dives |

This is the second in my series of attempts to improve the loss on my test dataset -- interventions, as I'm calling them -- for a from-scratch GPT-2 small base model, trained using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

Last time around I saw what gradient clipping can do -- it improved the test loss over the baseline by 0.014, bringing it down from 3.692 to 3.678. Not much, but it's something!

This time, I wanted to see what happened if we trained without dropout. Would removing it make the test loss worse, or better?
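One simple way to train "without dropout" (a sketch of the standard PyTorch mechanics, not necessarily how the post does it) is to set the dropout probability to 0.0: nn.Dropout then becomes the identity even in training mode, with no unit masking and no 1/(1-p) rescaling.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 4)

drop = nn.Dropout(p=0.5)     # the usual regulariser
no_drop = nn.Dropout(p=0.0)  # "dropout removed"

# Put both modules in training mode, where dropout is normally active
drop.train()
no_drop.train()

print(torch.equal(no_drop(x), x))  # True: p=0.0 passes inputs through unchanged
print(torch.equal(drop(x), x))     # almost surely False: units zeroed, survivors scaled by 2
```

The nice thing about doing it this way is that the model architecture and code stay identical; only the hyperparameter changes.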

[ Read more ]


Writing an LLM from scratch, part 32b -- Interventions: gradient clipping

Posted on 5 February 2026 in AI, LLM from scratch, TIL deep dives, Python |

I'm still working on training the best GPT-2 small-sized base model that I can with a number of FLOPs roughly equal to two days on my own machine -- my "extra credit" exercise after having worked through Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

In the last post I trained a baseline model -- one with the same architecture and almost the same training code as in the minimal training run in the book, just modified to run using DDP on a cloud machine with 8x A100 GPUs (40 GiB each). There are a bunch of "interventions" I want to try to see if they'll make it better, as measured by the loss they get on a test set. I'll do a post for each intervention, and this is the first: gradient clipping.
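For reference, here's a minimal sketch of what the intervention involves, assuming PyTorch's standard torch.nn.utils.clip_grad_norm_ (the post's actual training loop may differ): after backward() and before optimizer.step(), the gradients are rescaled so that their global L2 norm is at most max_norm.

```python
import torch
import torch.nn as nn

# Toy model and optimiser, standing in for the real training loop
model = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()

# Clip the global gradient norm to 1.0; returns the pre-clipping norm
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()

# After clipping, the global norm of the gradients is at most max_norm
post = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
print(float(post) <= 1.0 + 1e-4)  # True
```

Clipping leaves small gradients untouched and only rescales when the global norm exceeds the threshold, which is why it mainly helps with occasional loss spikes rather than changing every step.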

[ Read more ]


Writing an LLM from scratch, part 32a -- Interventions: training a baseline model

Posted on 4 February 2026 in AI, LLM from scratch, TIL deep dives, Python |

I'm rounding out my series of posts on Sebastian Raschka's book "Build a Large Language Model (from Scratch)" by seeing how good a base model I can train from scratch on my own hardware. I started by training one in two days on my RTX 3090, and found that while it was a decent little model, it wasn't as good as the original GPT-2 small, either in terms of the loss it got on my test dataset, or in terms of how well it followed instruction prompts after fine-tuning on them. I decided that I wanted to see what levers I could pull -- dropout, attention weight biases, and so on -- to make it better.

For that, I didn't want to have my PC tied up for days at a time with multiple long training runs, so I learned how to train faster in the cloud. That led to some refinements in the prompt-following test I was using, and I also spent a bit of time on a side quest getting the various models I'd trained onto Hugging Face Hub.

Now it's time to try the various "interventions", as I'll call them -- the levers to pull to see if I can make the model better. This post is to recap what they are, and to describe what I did to establish a baseline model to compare to.

[ Read more ]