An addendum to 'the maths you need to start understanding LLMs'
My last post, about the maths you need to start understanding LLMs, took off on Hacker News over the weekend.
It's always nice to see lots of people reading and -- I hope! -- enjoying something that you've written. But there's another benefit. If enough people read something, some of them will spot errors or confusing bits -- "given enough eyeballs, all bugs are shallow".
Commenter bad_ash made the excellent point that in the phrasing I originally had, a naive reader might think that activation functions are optional in neural networks in general, which of course isn't the case. What I was trying to say was that we can use one without an activation function for other purposes (and we do in LLMs). I've fixed the wording to (hopefully) make that a bit clearer.
ThankYouGodBless made a thoughtful comment about vector normalisation and cosine similarity, which was a great point in itself, but it also made something clear: although the post linked to an article I wrote back in February that covered the dot product of vectors, it really needed its own section on that. Without understanding what the dot product is, and how it relates to similarity, it's hard to get your head around how attention mechanisms work. I've added a section to the post, but for the convenience of anyone following along over RSS, here's what I said:
The dot product
The dot product is an operation that works on two vectors of the same length. It simply means that you multiply the corresponding elements, then add up the results of those multiplications:

a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ

Or, more concretely:

(1, 2, 3) · (4, 5, 6) = 1×4 + 2×5 + 3×6 = 32
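In code, the definition is just as short. Here's a minimal sketch in plain Python (the function name `dot` is my own, not from any library):

```python
def dot(a, b):
    """Dot product: multiply corresponding elements, then sum the results."""
    assert len(a) == len(b), "vectors must be the same length"
    return sum(x * y for x, y in zip(a, b))

# A concrete example with two three-element vectors:
print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```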
This is useful for a number of things, but the most interesting is that the dot product of two vectors of roughly the same length is quite a good measure of how close they are to pointing in the same direction -- that is, it's a measure of similarity. If you want a perfect comparison, you can scale them both so that they have a length of one, and then the dot product is exactly equal to the cosine of the angle between them (which is logically enough called cosine similarity).
But even without that kind of precise normalisation (which requires calculating squares and roots, so it's kind of expensive), so long as the vectors are close in length, it gives us meaningful numbers -- so, for example, it can give us a quick-and-dirty way to see how similar two embeddings are.
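To make the normalisation step concrete, here's a sketch of cosine similarity in plain Python (the helper names `normalise` and `cosine_similarity` are my own choices for illustration):

```python
import math

def dot(a, b):
    """Dot product: multiply corresponding elements, then sum."""
    return sum(x * y for x, y in zip(a, b))

def normalise(v):
    """Scale a vector so that its length (Euclidean norm) is exactly 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def cosine_similarity(a, b):
    """Dot product of the unit-length versions: the cosine of the angle."""
    return dot(normalise(a), normalise(b))

# Vectors pointing the same way score 1.0; perpendicular ones score 0.0.
print(cosine_similarity([1, 0], [2, 0]))
print(cosine_similarity([1, 0], [0, 3]))
```

You can see the cost the post mentions: each call to `normalise` squares every element and takes a square root, which is why the raw dot product is the cheaper quick-and-dirty option when the vectors are already close in length.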
Unfortunately the proof of why the dot product is a measure of similarity is a bit tricky, but this thread by Tivadar Danka is reasonably accessible if you want to get into the details.
See you next time!
As promised, up next: how do we put all of that together, along with the high-level stuff I described about LLMs in my last post, to understand how an LLM works?