Writing an LLM from scratch, part 28 -- training a base model from scratch on an RTX 3090
Having worked through the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", I wanted to try an experiment: is it possible to train a base model of my own, on my own hardware?
The book shows you how to train your LLM, walks you through a basic training run on a small dataset, and then has you switch to downloading the "pre-cooked" weights from OpenAI. That makes sense given that not every reader will have access to enough hardware to really train from scratch. And right back at the start of this series, I did some naive scaling of the numbers I'd got when fine-tuning LLMs, and came to the conclusion that training from scratch would be impossible in any reasonable time.
But the speed I got with my RTX 3090 on the book's small training run made me think that perhaps -- just perhaps! -- it might actually be possible to train a model of this size -- about 163M parameters -- on my own hardware. Not, perhaps, on a small laptop, but at least on a reasonably high-end "gaming" PC.
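As a sanity check on that 163M number, here's a quick back-of-envelope calculation of my own (a sketch, not code from the book) using the standard GPT-2 small hyperparameters. The extra ~39M over the 124M usually quoted for GPT-2 small comes from keeping the output head as a separate matrix rather than tying it to the token embedding, which is how the book's code does it:

```python
# Back-of-envelope parameter count for a GPT-2-small-shaped model
# with an untied output head (the book's configuration).
vocab, ctx, emb, n_layers = 50257, 1024, 768, 12
ffn = 4 * emb  # GPT-2's feed-forward hidden size

tok_emb = vocab * emb  # token embedding
pos_emb = ctx * emb    # learned positional embedding
per_block = (
    2 * (2 * emb)                            # two LayerNorms (scale + shift each)
    + 4 * (emb * emb + emb)                  # Q, K, V and output projections, with biases
    + (emb * ffn + ffn) + (ffn * emb + emb)  # feed-forward up- and down-projections
)
head = emb * vocab  # untied output head, no bias

total = tok_emb + pos_emb + n_layers * per_block + 2 * emb + head
print(f"{total:,}")  # 163,037,184 -- about 163M parameters
```

Drop the `head` term (that is, tie the output head to the token embedding) and you get back to the familiar 124M.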
Additionally, Andrej Karpathy recently announced nanochat, "the best ChatGPT that $100 can buy". He mentions on the main page that he's trained a model called d32, with 32 Transformer layers and 1.9B parameters, for about $800. His smaller 20-layer d20 model, with 561M parameters, should (he says) be trainable in about four hours on an 8x H100 GPU node, which costs about $24/hour -- hence the $100 total price.
What's even more interesting about nanochat is that it's built with PyTorch; initially I'd got the impression that it was based on his pure C/CUDA llm.c, which I would imagine would give a huge speedup. But no -- he's using the same stack as I have been in this series!
Karpathy's models are both larger than my 163M parameters, so it definitely sounded like this might be doable. Obviously, I'm nowhere near as experienced an AI developer as he is, and he's using a larger machine (8 GPUs, each with more than 3x the VRAM of mine), but he's also including the time to train a tokeniser and to do instruction fine-tuning in that four hours -- and even his smaller model is more than three times the size of mine. So all of that should help.
This post is a little less structured than the others in my LLM from scratch series, as it's essentially a tidied version of the notes I kept as I worked through the project.
But so as not to bury the lede: using the Hugging Face FineWeb-series datasets, I was able to train a GPT-2-small-sized base model to a level where it was almost as good as the original in just over 48 hours on my own hardware! Base models: not just for the big AI labs.
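If you want to poke at the same data, the FineWeb datasets can be streamed from the Hugging Face hub rather than downloaded in full -- they run to terabytes. Here's a minimal sketch using the datasets library; the exact dataset ID and subset name are my assumptions, so check the hub for the variant you want:

```python
from datasets import load_dataset

# Stream a small FineWeb sample instead of downloading the full dataset.
# "HuggingFaceFW/fineweb" and the "sample-10BT" subset are illustrative --
# check the Hugging Face hub for the exact variant (e.g. fineweb-edu).
ds = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

for example in ds.take(2):  # IterableDataset.take yields the first n records
    print(example["text"][:200])
```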
Here's the full story.