LLM from scratch

I'm working through Sebastian Raschka's book "Build a Large Language Model (from Scratch)", writing up the things I find interesting or surprising, and filling in some of the gaps. The book is a brilliant explanation of how to build an LLM, but it sometimes glosses over why we do particular things -- for example, why certain sequences of matrix multiplications result in a working, trainable attention mechanism.

I've finished the main body of the book, but the series is ongoing -- there are still a few things I want to do before I can finally draw a line under this epic :-)

Here are the posts in this series: