Writing an LLM from scratch, part 32m -- Interventions: conclusion
Last November, when I finished the main body of "Build a Large Language Model (from Scratch)", I set myself a number of follow-on goals. One was "training the full GPT-2 base model myself".
I've reached the end of that journey, with a model that is almost -- if not quite -- as good as GPT-2 small, trained in 44 hours on my own machine, so I thought it would be worth summarising how it went.
In December, I trained my first model, taking two days, but was disappointed to see that it was worse in terms of loss, and in terms of how well it could be fine-tuned to follow instructions, than the original GPT-2 model.
I expected that a chunk of that difference was likely to be due to the original model having been trained for longer, but also noticed that there were a number of changes -- interventions -- that I could make to the model and the training run, and I thought they might help.
In January, I got a DDP training system together that would allow me to iterate on those interventions without having to wait for two days for each result.
In February, I got started by training a baseline model in the cloud, and I've since ground through all of the interventions, and come up with a set that lowered the loss nicely, both in the cloud, and locally.
Along the way, I've learned about, or refined my knowledge of, a bunch of ML concepts. In increasing order of how they helped with the loss (with the first two actually making it slightly worse):
- Weight tying, which I found made the loss worse, but it was interesting how simple it was to implement.
- PyTorch's Automated Mixed Precision, which also harmed the loss a tiny bit, but had the benefit of making training twice as fast, and 66% cheaper in the cloud -- well worth the loss penalty.
- Gradient clipping -- a cheap, but (somewhat to my surprise) not particularly effective intervention for this model.
- QKV bias -- that is, adding bias to the attention weight matrices -- which also helped a tiny bit, though I later felt that this might have been in the noise.
- Weight decay -- more effective, and easy enough to understand in the context of plain gradient descent. I still need to learn more about how it interacts with optimisers, though -- particularly AdamW.
- Dropout, which seems to be less than useful for single-epoch training: removing it helped the model quite a lot.
- The learning rate, which I built up quite a lot of new knowledge about, and by both increasing it and scheduling it, I got the biggest bang for the buck.
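As a rough illustration of three of the ideas in that list (weight decay in plain gradient descent, gradient clipping, and learning rate scheduling), here is a minimal sketch in pure Python. The function names, the choice of a linear warmup followed by cosine decay, and all of the numbers are assumptions made for the example; this is not the PyTorch code or the settings used in the training runs described in this series.

```python
import math


def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    """One plain gradient-descent step with L2-style weight decay.

    The decay term nudges every weight towards zero in proportion to its
    current size, on top of the usual gradient update.
    """
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


def clip_by_global_norm(grads, max_norm):
    """Scale gradients down so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]


def lr_at_step(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr.

    This particular schedule shape is an assumption for illustration.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical toy values, just to show the three pieces working together.
    weights = [1.0, -2.0]
    grads = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
    lr = lr_at_step(step=0, total_steps=100, peak_lr=0.01, warmup_steps=10)
    print(sgd_step_with_weight_decay(weights, grads, lr, weight_decay=0.1))
```

In a real training loop the same three steps happen once per batch: clip the gradients, look up the learning rate for the current step, then apply the update. With an adaptive optimiser like AdamW the weight decay is applied differently from this plain gradient-descent version, so treat the first function as the simplest case only.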
I've also learned how to upload my custom models to Hugging Face, found out some interesting things about how random noise affects training, and come up with improvements in the setup I have for using an LLM as a judge for instruction fine-tuned models.
There was a bit of a mystery when I tried out the instruction fine-tuning tests, though. Two of my models were very close to GPT-2 small in terms of loss, but only one of them had an instruction fine-tuning result that was likewise close to GPT-2 small; the other was much worse! Something to dig into later, I think.
But it was still very satisfying that my best model -- trained locally in 44 hours -- was almost as good as GPT-2 small, even if it did fall somewhat short. So on that positive note, I'm going to wrap up this "Interventions" series-within-a-series, and move on to the two other things I wanted to do before wrapping up the "LLM from scratch" series as a whole:
- Going through the appendices in the book to see if there's anything I want to highlight there.
- The final test as to whether I've really understood everything: building my own LLM from scratch without reference to the book. I want to do that in a different framework, not PyTorch, to minimise the risk of just regurgitating code -- I asked people on X/Twitter which one I should use, and the winner was JAX -- so it should be interesting to see how that goes!
The appendices first, I think -- I'll post about them shortly. But I think the big one will be the JAX implementation -- really looking forward to that.
Here's a link to the next post in this series.