Writing an LLM from scratch, part 32g -- Interventions: weight tying

Posted on 24 March 2026 in AI, LLM from scratch, TIL deep dives, Python

In his book "Build a Large Language Model (from Scratch)", Sebastian Raschka writes that weight tying, while it reduces a model's parameter count, in his experience makes the model worse -- and apparently, as a result, people don't use it in modern LLMs. Intuitively, that makes sense; I'll explain why in this post.
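For anyone who hasn't come across it, weight tying means using a single matrix for both the token embedding and the output projection. Here's a minimal sketch of what that looks like in a GPT-2-style model, using the tok_emb and out_head names from Raschka's code (the sizes are GPT-2 small's):

    import torch.nn as nn

    vocab_size, emb_dim = 50257, 768

    tok_emb = nn.Embedding(vocab_size, emb_dim)
    out_head = nn.Linear(emb_dim, vocab_size, bias=False)

    # Both weight tensors have shape (vocab_size, emb_dim), so the
    # output head can simply share the embedding's matrix -- saving
    # vocab_size * emb_dim, or about 38.6M, parameters.
    out_head.weight = tok_emb.weight

That saving, incidentally, is the difference between this model's 163M parameters and the 124M that GPT-2 small is usually quoted as having.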

But I'm trying various interventions to see if I can get my model -- based on Raschka's code, but trained for a fraction of the time that the original GPT-2 was -- to match the original's loss on a test set. So I thought it was worth checking whether weight tying really is a negative for this particular tiny model of 163M parameters.

After all, the original GPT-2 weights use weight tying, and I did find that QKV bias -- another old-school technique the GPT-2 authors used, one that has since dropped out of fashion -- appeared to help. Might this one help too?

Worth a try! Let's give it a go.

[ Read more ]


Writing an LLM from scratch, part 32f -- Interventions: weight decay

Posted on 23 March 2026 in AI, LLM from scratch, TIL deep dives, Python

I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

In my training code, I create the optimiser like this:

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=0.0004, weight_decay=0.1
    )

In my last post I looked into the learning rate, the lr parameter in that code. I found a better value for it, plus some extra code to schedule it -- that is, to vary it over the course of training -- which together gave better training results.

This time I want to go into the weight decay. What is it, what is it for, and is 0.1 really the best value?
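As a taster: the "W" in AdamW stands for decoupled weight decay. Rather than being folded into the loss or the gradients, the decay directly shrinks every parameter a little towards zero on each step, alongside the normal Adam update. A simplified sketch of just that decay step (the stand-in model here is purely illustrative):

    import torch

    lr, weight_decay = 0.0004, 0.1
    model = torch.nn.Linear(4, 4)  # stand-in for the real model

    # The decoupled decay inside each AdamW step: every parameter is
    # scaled slightly towards zero, independently of its gradient...
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(1 - lr * weight_decay)
    # ...and then the usual gradient-based Adam update is applied.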

[ Read more ]


Writing an LLM from scratch, part 32e -- Interventions: the learning rate

Posted on 10 March 2026 in AI, LLM from scratch, TIL deep dives, Python

I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

In my training code, I create the optimiser like this:

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=0.0004, weight_decay=0.1
    )

The values in there -- 0.0004 for the learning rate, and 0.1 for the weight decay -- were just copied from the tiny training run that we do in section 5.2 of the book.

What do those values actually mean, and are they really the right ones?

I felt I had a good handle on the learning rate, at least -- it's one of the first things you learn when you start looking at machine learning of any kind -- but how would you go about working out what the correct value for it was? On top of that, when I was reading the Chinchilla paper a while back, I noticed they repeatedly referred to a "cosine cycle" for the learning rate, which didn't fit into anything I'd learned about before.
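The short version, as a preview: the learning rate is swept down from its maximum along half a cosine wave over the course of training. A minimal sketch -- max_lr matches my training code, but min_lr and the step count are purely illustrative:

    import math

    def cosine_lr(step, total_steps, max_lr=0.0004, min_lr=0.00004):
        # Half a cosine wave: cos goes from 1 at step 0 to -1 at the
        # final step, so the result falls from max_lr to min_lr.
        progress = step / total_steps
        return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

So at step zero you train at the full 0.0004, and by the final step the rate has eased down to 0.00004, with the steepest drop in the middle of the run.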

The weight decay was pretty much an unknown for me -- I knew it was a parameter controlling the optimiser's behaviour, but not how it did that.

In this post I want to look into the learning rate, and these mysterious cosines; I'll write a follow-up about the weight decay later.

[ Read more ]