An addendum to 'the maths you need to start understanding LLMs'

Posted on 8 September 2025 in AI

My last post, about the maths you need to start understanding LLMs, took off on Hacker News over the weekend.

It's always nice to see lots of people reading and -- I hope! -- enjoying something that you've written. But there's another benefit. If enough people read something, some of them will spot errors or confusing bits -- "given enough eyeballs, all bugs are shallow".

Commenter bad_ash made the excellent point that in the phrasing I originally had, a naive reader might think that activation functions are optional in neural networks in general, which of course isn't the case. What I was trying to say was that we can use a layer without an activation function for other purposes (and we do in LLMs). I've fixed the wording to (hopefully) make that a bit clearer.

ThankYouGodBless made a thoughtful comment about vector normalisation and cosine similarity, which was a great point in itself, but it also made something clear: although the post linked to an article I wrote back in February that covered the dot product of vectors, it really needed its own section on that. Without understanding what the dot product is, and how it relates to similarity, it's hard to get your head around how attention mechanisms work. I've added a section to the post, but for the convenience of anyone following along over RSS, here's what I said:

The dot product

The dot product is an operation that works on two vectors of the same length. It simply means that you multiply the corresponding elements, then add up the results of those multiplications:

(a, b, c) · (d, e, f) = a·d + b·e + c·f

Or, more concretely:

(2, 3, 4) · (1, 0, 2) = 2×1 + 3×0 + 4×2 = 2 + 0 + 8 = 10
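In code, this is just a multiply-and-sum over paired elements. Here's a minimal sketch in Python (the function name `dot` is my own choice, not anything standard):

```python
def dot(u, v):
    # Multiply corresponding elements, then add up the results.
    return sum(a * b for a, b in zip(u, v))

print(dot([2, 3, 4], [1, 0, 2]))  # prints 10
```

Real systems would use something like NumPy's vectorised operations instead, but the arithmetic is exactly this.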

This is useful for a number of things, but the most interesting is that the dot product of two vectors of roughly the same length is quite a good measure of how close they are to pointing in the same direction -- that is, it's a measure of similarity. If you want a perfect comparison, you can scale them both so that they have a length of one, and then the dot product is exactly equal to the cosine of the angle between them (which is logically enough called cosine similarity).

But even without that kind of precise normalisation (which requires calculating squares and roots, so it's kind of expensive), so long as the vectors are close in length, it gives us meaningful numbers -- so, for example, it can give us a quick-and-dirty way to see how similar two embeddings are.
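To make the normalisation step concrete, here's a sketch of cosine similarity built on the plain dot product (again, the function names are just illustrative):

```python
import math

def dot(u, v):
    # Multiply corresponding elements, then add up the results.
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    # Dividing by each vector's length is equivalent to scaling both
    # to length one first -- this is the "expensive" part, as it needs
    # squares and a square root for each vector.
    length_u = math.sqrt(dot(u, u))
    length_v = math.sqrt(dot(v, v))
    return dot(u, v) / (length_u * length_v)

# Vectors pointing the same way score (approximately) 1.0...
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
# ...while perpendicular vectors score 0.0.
print(cosine_similarity([1, 0], [0, 1]))
```

The raw dot product skips the two `sqrt` calls and the division, which is why it's the quick-and-dirty option when the vectors are already roughly the same length.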

Unfortunately the proof of why the dot product is a measure of similarity is a bit tricky, but this thread by Tivadar Danka is reasonably accessible if you want to get into the details.

See you next time!

As promised, up next: how do we put all of that together, along with the high-level stuff I described about LLMs in my last post, to understand how an LLM works?