Writing an LLM from scratch, part 16 -- layer normalisation

Posted on 8 July 2025 in AI, LLM from scratch, TIL deep dives

I'm now working through chapter 4 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". It's the chapter where we put together the pieces that we've covered so far and wind up with a fully-trainable LLM, so there's a lot in there -- nothing quite as daunting as fully trainable self-attention was, but still stuff that needs thought.

Last time around I covered something that seemed to me absurdly simple for what it does -- the step that converts the context vectors produced by our attention layers into output logits. Grasping how a single matrix multiplication could possibly do that took a bit of thought, but it was all clear in the end.
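
To make that concrete, here's a minimal sketch of the idea -- the names and sizes (`out_head`, GPT-2-ish 768-dimensional embeddings and a 50,257-token vocabulary) are illustrative, not the book's exact code:

```python
import torch
import torch.nn as nn

emb_dim, vocab_size = 768, 50257   # GPT-2-ish sizes, assumed for illustration

# One linear layer -- a single matrix multiplication -- maps each
# context vector to a score (logit) for every token in the vocabulary.
out_head = nn.Linear(emb_dim, vocab_size, bias=False)

context_vectors = torch.randn(1, 4, emb_dim)   # (batch, tokens, emb_dim)
logits = out_head(context_vectors)             # (batch, tokens, vocab_size)
print(logits.shape)                            # torch.Size([1, 4, 50257])
```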

This time, layer normalisation.
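
As a taster, here's a minimal sketch of what the operation does, assuming the standard formulation (in practice a LayerNorm also has learnable scale and shift parameters, which this leaves out):

```python
import torch

torch.manual_seed(123)
x = torch.randn(2, 5)   # toy batch: 2 rows of 5 activations each

# Normalise each row to mean 0 and variance 1 across its features.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
x_norm = (x - mean) / torch.sqrt(var + 1e-5)   # small eps for stability

print(x_norm.mean(dim=-1))                     # ~0 for each row
print(x_norm.var(dim=-1, unbiased=False))      # ~1 for each row
```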
