Writing an LLM from scratch, part 15 -- from context vectors to logits; or, can it really be that simple?!

Posted on 31 May 2025 in AI, Python, LLM from scratch, TIL deep dives

Having worked through chapter 3 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and spent some time digesting the concepts it introduced (most recently in my post on the complexity of self-attention at scale), it's time for chapter 4.

I've read it through in its entirety, and rather than working through it section-by-section in order, like I did with the last one, I think I'm going to jump around a bit, covering each new concept and how I wrapped my head around it separately. This chapter is a lot easier conceptually than the last, but there were still some "yes, but why do we do that?" moments.

The first of those is the answer to a question I'd been wondering about since at least part 6 in this series, and probably before. The attention mechanism is working through the (tokenised, embedded) input sequence and generating these rich context vectors, each of which expresses the "meaning" of its respective token in the context of the words that came before it. How do we go from there to predicting the next word in the sequence?

The answer, at least in the form of code showing how it happens, leaped out at me the first time I looked at the first listing in this chapter, for the initial DummyGPTModel that will be filled in as we go through it.

In its __init__, we create our token and position embedding mappings, and an object to handle dropout, then the multiple layers of attention heads (which are a bit more complex than the heads we've been working with so far, but more on that later), then some kind of normalisation layer, then:

self.out_head = nn.Linear(
    cfg["emb_dim"], cfg["vocab_size"], bias=False
)

...and then in the forward method, we run our tokens through all of that and then:

logits = self.out_head(x)
return logits

The x in that second bit of code is our context vectors from all of that hard work the attention layers did -- folded, spindled and mutilated a little by things like layer normalisation and being run through feed-forward networks with GELU (about both of which I'll go into in future posts) -- but ultimately just the context vectors.

And all we do to convert it into these logits, the output of the LLM, is run it through a single neural network layer. There's not even a bias, or an activation function -- it's basically just a single matrix multiplication!
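Just to make those shapes concrete, here's a minimal sketch of that final step on its own -- the sizes are placeholder GPT-2-ish values, and the random tensor stands in for real context vectors:

import torch
import torch.nn as nn

emb_dim, vocab_size = 768, 50257      # placeholder GPT-2-ish sizes, not the book's cfg dict

out_head = nn.Linear(emb_dim, vocab_size, bias=False)

context_vectors = torch.randn(6, emb_dim)   # pretend output of the attention stack, one row per token
logits = out_head(context_vectors)          # shape (6, vocab_size)

# With no bias and no activation, this is literally one matrix multiplication:
assert torch.allclose(logits, context_vectors @ out_head.weight.T)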

My initial response was, essentially, WTF. Possibly WTFF. Gradient descent over neural networks is amazingly capable at learning things, but this seemed quite a heavy lift. Why would something so simple work? (And also, what are "logits"?)

Unpicking that took a bit of thought, and that's what I'll cover in this post.

Embeddings

The first thing that I saw that pointed me in the right direction on understanding this was almost a throwaway line. Near the end of the chapter, when Raschka is explaining why the model he's shown us how to build has more parameters than the GPT-2 model it's meant to mirror, he mentions that GPT-2 used a trick called "weight tying". What that meant was that the matrix it used for the out_head layer at the end "reuses the weights from the token embedding layer".

So, we're reusing weights from an embedding layer -- which maps from tokens to embeddings that represent the meaning of those tokens -- in order to map from the context vectors to something related to getting the next token. You might see where this is heading, but it's worth working through step by step.

Firstly, what does "the weights for the embedding layer" actually mean? So far, I'd been thinking of an embedding layer as being something like a dictionary. You feed in a token ID and get an embedding vector out.

And possibly that is how the embedding layers in PyTorch work under the hood in simple cases when you're doing your forward pass. But if these are trainable, there must be some way to get a gradient for the data in that embedding "dictionary" against the error function so that it can be adjusted during the backward pass. I have no idea how you might go about differentiating what is essentially a hashtable :-)

So, how do these embedding layers really work? Back in part 3, I said I had speculated in the past that perhaps tokens were fed into LLMs as one-hot vectors (don't worry, I'll explain those shortly), and was pleasantly surprised when Raschka pointed out that if you take one-hot vectors, and pass them through a single fully-connected layer of neurons, you are performing the same calculations as you would to generate the embeddings.

Let's work through that to see exactly what it means.

Imagine we have a vocabulary of four tokens:

ID | token
0  | to
1  | be
2  | or
3  | not

Our input sequence (unsurprisingly, given the token list) is "to be or not to be". A one-hot vector is a way of representing a number as a vector of zeros, with as many elements as there are possible options, and a one in the position corresponding to the number we want to represent. So, our input maps to one-hot vectors like this:

token | token ID | one-hot vector representation
to    | 0        | [1, 0, 0, 0]
be    | 1        | [0, 1, 0, 0]
or    | 2        | [0, 0, 1, 0]
not   | 3        | [0, 0, 0, 1]
to    | 0        | [1, 0, 0, 0]
be    | 1        | [0, 1, 0, 0]

Hopefully that's crystal-clear. We have four possible tokens (our "vocabulary", or "vocab", as I'll call it below), so each element in our input sequence is mapped to a number between 0 and 3, and then the one-hot vector for it is a four-element list that is all zeros except for the element corresponding to its ID.
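If you want to generate those in code, PyTorch's torch.nn.functional.one_hot does exactly this mapping -- here's a quick sketch with our toy four-token vocab:

import torch
import torch.nn.functional as F

# "to be or not to be" as token IDs in our four-token vocab
token_ids = torch.tensor([0, 1, 2, 3, 0, 1])

one_hot = F.one_hot(token_ids, num_classes=4)
print(one_hot)
# tensor([[1, 0, 0, 0],
#         [0, 1, 0, 0],
#         [0, 0, 1, 0],
#         [0, 0, 0, 1],
#         [1, 0, 0, 0],
#         [0, 1, 0, 0]])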

Now, let's say that we've somehow come up with these two-dimensional embeddings for our tokens:

ID | token | embedding
0  | to    | [1, 2]
1  | be    | [3, 4]
2  | or    | [5, 6]
3  | not   | [7, 8]

How do we generate a list of embeddings corresponding to "to be or not to be" with the single fully-connected layer of neurons that Raschka mentions?

This is machine learning, so I'm sure no-one is going to be surprised that it turns out to be a simple matrix multiplication.

Let's put our one-hot vector representation of the input sequence into a matrix, with one token per row:

$$
X = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0
\end{bmatrix}
$$

Now that's n×v, where n (as usual) is our sequence length and v is our vocab size. We can multiply it by any matrix that has v rows, so let's represent our embeddings as a weights matrix $W_{\text{emb}}$ like this:

$$
W_{\text{emb}} = \begin{bmatrix}
1 & 2 \\
3 & 4 \\
5 & 6 \\
7 & 8
\end{bmatrix}
$$

That's v×d, where d is the dimensionality of our embeddings. What do we get if we multiply the two to get $XW_{\text{emb}}$? It'll be an n×d matrix, which looks like the right shape for a matrix containing the embeddings for each token, one per row. Is that what it contains?

Well, let's think about the element it will have at row 0, column 0. This will be the dot product of the first row in $X$ and the first column in $W_{\text{emb}}$:

$$
\begin{pmatrix} 1 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} 1 \\ 3 \\ 5 \\ 7 \end{pmatrix}
$$

The vector dot product is an element-wise multiplication then a sum of the results, so we get $1 \cdot 1 + 0 \cdot 3 + 0 \cdot 5 + 0 \cdot 7 = 1$.

Likewise, the element at row 0, column 1 will be $1 \cdot 2 + 0 \cdot 4 + 0 \cdot 6 + 0 \cdot 8 = 2$. Our one-hot vector is essentially acting as a selector -- the position of the 1 chooses a row in the second matrix to output as a result.

So, we can use a matrix multiplication to map a matrix of one-hot vectors for the input tokens to their embeddings, using a matrix of the embeddings for each token ID stacked on top of each other. As I said earlier, I don't think that PyTorch is doing this internally, at least when doing inference -- all of those multiplications by zero seem to be a lot of extra work for what is a simple lookup! But I can imagine it's easier to work out how to calculate and apply gradients at training time, so perhaps it's used then (or perhaps there's a simpler way to do that too).
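Here's a quick sketch checking that equivalence with PyTorch's nn.Embedding, copying the toy numbers from above into its weights:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Load our made-up embedding matrix into an nn.Embedding layer.
emb = nn.Embedding(4, 2)
with torch.no_grad():
    emb.weight.copy_(torch.tensor([[1., 2.], [3., 4.], [5., 6.], [7., 8.]]))

token_ids = torch.tensor([0, 1, 2, 3, 0, 1])            # "to be or not to be"

lookup = emb(token_ids)                                  # the normal "dictionary"-style lookup
one_hot = F.one_hot(token_ids, num_classes=4).float()    # (6, 4)
matmul = one_hot @ emb.weight                            # (6, 4) @ (4, 2) -> (6, 2)

print(torch.equal(lookup, matmul))                       # True -- same numbers either way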

But before we move on from this, there's one extra aspect of this kind of matrix multiplication I'd like to touch on.

Imagine that you had a sequence of tokens, but you weren't sure what one of them was. Let's say you had "to be SOMETHING not to be", perhaps from an OCR scan or something similar. You think that there's a 70% chance that the third token is "or" and a 30% chance that it's "to".

You could represent that by, instead of using a one-hot vector for the uncertain token, using one that represented the probabilities:

$$
\begin{pmatrix} 0.3 & 0 & 0.7 & 0 \end{pmatrix}
$$

Now, it should be pretty obvious that if you feed that into the matrix multiplication above, the output you'll get for that token will be 70% of the "or" embedding and 30% of the "to" embedding. Which, for a rich-enough embedding space, will be meaningful!
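A quick numerical check of that, using the toy embedding matrix from above and my made-up 30/70 split:

import torch

W_emb = torch.tensor([[1., 2.], [3., 4.], [5., 6.], [7., 8.]])

# 30% "to" (ID 0), 70% "or" (ID 2), instead of a one-hot vector
uncertain = torch.tensor([0.3, 0.0, 0.7, 0.0])

print(uncertain @ W_emb)                   # tensor([3.8000, 4.8000])
print(0.3 * W_emb[0] + 0.7 * W_emb[2])     # the same thing: a weighted blend of the two embeddings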

So: the matrix representation of the embeddings allows us to handle uncertainty too. My example above was maybe a little artificial, but this view will be useful later.

Another useful way of looking at this (that we'll also come back to) is our old friend, the concept of matrix multiplications as projections. The standard rotation matrix

$$
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
$$

...projects a 2D space into another that is rotated anti-clockwise by an angle $\theta$, the frustum matrix used in 3D graphics projects the 3D space into a 2D one that can be displayed on a screen, and in attention we use $W_q$ and $W_k$ to project input vectors from one embedding space into another where they can be matched up by the dot product.
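For example, here's the rotation case in code -- a trivial sketch rotating a point on the x-axis anti-clockwise by 90°:

import math
import torch

theta = math.pi / 2   # 90 degrees, anti-clockwise

R = torch.tensor([
    [math.cos(theta), -math.sin(theta)],
    [math.sin(theta),  math.cos(theta)],
])

v = torch.tensor([1.0, 0.0])   # a point on the x-axis
print(R @ v)                   # ~tensor([0., 1.]): rotated onto the y-axis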

And likewise, an interesting way to see the embedding matrix is that it's used to project from a vocab space -- the generalised idea of the one-hot vectors, where each element reflects the probability of this row being a particular token -- into the embedding space.

Reversing matrix multiplication (but not quite)

Now, I'd read somewhere else that with weight-tying, the way you'd use the embedding matrix in the output layer was that you'd multiply your matrix of context vectors by its transpose: $CW_{\text{emb}}^T$.

Intuitively, then, if multiplying by the embedding matrix projected from vocab space to embedding space, multiplying by the transpose would be doing the inverse projection, projecting those embeddings back to vocab space. That felt like it was going in the right direction!

But does multiplying by a transpose reverse a matrix multiplication, like division reverses multiplication? Now, matrix division isn't a thing, though I've always felt it should be. The fact that we call this thing we do with matrices "multiplication" seems pretty messy; it's not commutative (that is, $AB \neq BA$ in general, and indeed $BA$ might not even be meaningful when $AB$ is), and there's no equivalent of division to "undo" the operation.

In fact, if you think about it, in the general case, you literally would not be able to reverse a matrix multiplication at all. Imagine a one-row matrix times a one-column matrix -- that is, 1×n times n×1. You'll get a single-element result, just one number -- it's a lossy operation. So there would be no way to recover that one-row matrix from the result. Another (maybe more intuitive) example is the frustum matrix mapping from 3D space to 2D space -- it's going to make a large thing that is far away and a small thing that is close look the same, due to perspective, so you can't reverse it.[1]

But, if we think of matrix multiplications as projections, there is a way to do something that's a bit like a reversal of a matrix multiplication. Remember that we can see $XW_{\text{emb}}$ as being an operation that projects the vectors that make up X's rows from a v-dimensional vocab space into a d-dimensional embedding space.

If we take the transpose of $W_{\text{emb}}$ -- swapping its rows and columns about -- then we get another projection, one that projects from a d-dimensional space into a v-dimensional one.

But does it really project from embedding space to vocab space? Let's try it in code with the numbers from above, round-tripping those one-hot vectors:

In [1]: import torch

In [2]: X = torch.tensor([
   ...:     [1, 0, 0, 0],
   ...:     [0, 1, 0, 0],
   ...:     [0, 0, 1, 0],
   ...:     [0, 0, 0, 1],
   ...:     [1, 0, 0, 0],
   ...:     [0, 1, 0, 0]
   ...: ], dtype=torch.float)

In [3]: W_emb = torch.tensor([
   ...:     [1, 2],
   ...:     [3, 4],
   ...:     [5, 6],
   ...:     [7, 8]
   ...: ], dtype=torch.float)

In [4]: E = X @ W_emb

In [5]: E
Out[5]:
tensor([[1., 2.],
        [3., 4.],
        [5., 6.],
        [7., 8.],
        [1., 2.],
        [3., 4.]])

In [6]: E @ W_emb.T
Out[6]:
tensor([[  5.,  11.,  17.,  23.],
        [ 11.,  25.,  39.,  53.],
        [ 17.,  39.,  61.,  83.],
        [ 23.,  53.,  83., 113.],
        [  5.,  11.,  17.,  23.],
        [ 11.,  25.,  39.,  53.]])

Not great. I mean, we've got the same outputs for the same inputs -- the first and fifth, and second and sixth rows are the same. But they don't look much like one-hot vectors.

But this is because my example embeddings are, frankly, pretty rubbish. They're all pointing in pretty much the same direction! Real embeddings will have different words pointing in different directions, and will have many more dimensions.

Let's see what happens if we use some real ones. I worked with Claude to get some code to try it out.

Firstly, we load up the bert-base-uncased model from Hugging Face (Claude's suggested model):

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

...we extract its embedding matrix:

# Get the embedding matrix
embedding_matrix = model.embeddings.word_embeddings.weight
vocab_size, embedding_dim = embedding_matrix.shape
print(f"Vocabulary size: {vocab_size}, Embedding dimension: {embedding_dim}")

...then tokenise the text "to be or not to be", and print out what it looks like in terms of token IDs:

# Tokenize the input text
text = "to be or not to be"
inputs = tokenizer(text, return_tensors="pt")
input_ids = inputs.input_ids[0]  # Remove batch dimension for simplicity

# Print the tokens and their IDs
tokens = tokenizer.convert_ids_to_tokens(input_ids)
print(f"Tokens: {tokens}")
print(f"Token IDs: {input_ids}")

Now we convert those token IDs to a matrix with one row per token, each row being a one-hot vector, and we print out a summary of what that looks like (the vocab size is in the tens of thousands so we couldn't usefully look at the one-hot vectors themselves):

# Convert to one-hot vectors
one_hot = F.one_hot(input_ids, num_classes=vocab_size).float()
print(f"One-hot matrix shape: {one_hot.shape}")

# Print a meaningful summary of the one-hot vectors
print("\nOne-hot vector summary:")
for i, (token, token_id, vec) in enumerate(zip(tokens, input_ids, one_hot)):
    nonzero_indices = vec.nonzero().item()
    print(f"Token {i}: '{token}' (ID: {token_id}) → Position {nonzero_indices} is 1")

Now we convert them into embeddings with a matrix multiplication like we did with the fake ones above, and print out its shape:

# Project into embedding space with matrix multiplication
token_embeddings_via_matmul = torch.matmul(one_hot, embedding_matrix)
print(f"\nEmbedding matrix shape: {embedding_matrix.shape}")
print(f"Embeddings via matrix multiplication shape: {token_embeddings_via_matmul.shape}")

We sanity-check that the resulting matrix is the same as the one we'd get from the standard direct lookup -- just indexing into the embedding matrix by token ID -- rather than doing this roundabout thing with one-hot vectors:

# Get embeddings via the standard lookup for comparison
standard_embeddings = embedding_matrix[input_ids]
print(f"Standard embedding lookup shape: {standard_embeddings.shape}")

# Verify both methods give the same result
is_equal = torch.allclose(token_embeddings_via_matmul, standard_embeddings)
print(f"Matrix multiplication and direct lookup match: {is_equal}")

Now we do that multiplication by the transpose of the embedding matrix.

# Project back to vocabulary space
logits = torch.matmul(token_embeddings_via_matmul, embedding_matrix.t())
print(f"\nLogits shape after projection back: {logits.shape}")

(Note that that mysterious word "logits" is coming in here again, thanks to Claude having worked out exactly what I'm doing. We'll come to what it means later.)

So once again we have a matrix with one row per token, and as many columns as there are items in the vocab. How do we extract anything useful from that?

A quick and easy way of seeing how close the results are to the original one-hot vectors is to get the highest three numbers in each row -- and their indices -- using torch.topk, and then to see which tokens they represent, and how large the numbers are. So that's what we do next:

# Find top 3 most similar tokens for each input token
top_k = 3
top_values, top_indices = torch.topk(logits, k=top_k, dim=1)

print("\nTop 3 tokens after projecting back:")
for i, (token, token_id) in enumerate(zip(tokens, input_ids)):
    print(f"\nOriginal token: '{token}' (ID: {token_id})")

    for j, (value, idx) in enumerate(zip(top_values[i], top_indices[i])):
        similar_token = tokenizer.convert_ids_to_tokens([idx])[0]
        print(f"  {j+1}. '{similar_token}' (ID: {idx}) with similarity: {value.item():.2f}")

    # Check if original token is in top predictions
    if token_id in top_indices[i]:
        rank = (top_indices[i] == token_id).nonzero().item() + 1
        print(f"  Original token is rank {rank} in predictions")
    else:
        print(f"  Original token not in top {top_k}")

I ran that, and here's what it printed out:

Vocabulary size: 30522, Embedding dimension: 768
Tokens: ['[CLS]', 'to', 'be', 'or', 'not', 'to', 'be', '[SEP]']
Token IDs: tensor([ 101, 2000, 2022, 2030, 2025, 2000, 2022,  102])
One-hot matrix shape: torch.Size([8, 30522])

One-hot vector summary:
Token 0: '[CLS]' (ID: 101) → Position 101 is 1
Token 1: 'to' (ID: 2000) → Position 2000 is 1
Token 2: 'be' (ID: 2022) → Position 2022 is 1
Token 3: 'or' (ID: 2030) → Position 2030 is 1
Token 4: 'not' (ID: 2025) → Position 2025 is 1
Token 5: 'to' (ID: 2000) → Position 2000 is 1
Token 6: 'be' (ID: 2022) → Position 2022 is 1
Token 7: '[SEP]' (ID: 102) → Position 102 is 1

Embedding matrix shape: torch.Size([30522, 768])
Embeddings via matrix multiplication shape: torch.Size([8, 768])
Standard embedding lookup shape: torch.Size([8, 768])
Matrix multiplication and direct lookup match: True

Logits shape after projection back: torch.Size([8, 30522])

Top 3 tokens after projecting back:

Original token: '[CLS]' (ID: 101)
  1. '[CLS]' (ID: 101) with similarity: 4.12
  2. '[MASK]' (ID: 103) with similarity: 2.06
  3. '##⋅' (ID: 30141) with similarity: 0.91
  Original token is rank 1 in predictions

Original token: 'to' (ID: 2000)
  1. 'to' (ID: 2000) with similarity: 0.82
  2. '297' (ID: 27502) with similarity: 0.50
  3. '313' (ID: 22997) with similarity: 0.50
  Original token is rank 1 in predictions

Original token: 'be' (ID: 2022)
  1. 'be' (ID: 2022) with similarity: 0.96
  2. '243' (ID: 22884) with similarity: 0.78
  3. '690' (ID: 28066) with similarity: 0.77
  Original token is rank 1 in predictions

Original token: 'or' (ID: 2030)
  1. 'or' (ID: 2030) with similarity: 0.89
  2. '840' (ID: 28122) with similarity: 0.60
  3. '385' (ID: 24429) with similarity: 0.58
  Original token is rank 1 in predictions

Original token: 'not' (ID: 2025)
  1. 'not' (ID: 2025) with similarity: 0.96
  2. '670' (ID: 25535) with similarity: 0.71
  3. '840' (ID: 28122) with similarity: 0.70
  Original token is rank 1 in predictions

Original token: 'to' (ID: 2000)
  1. 'to' (ID: 2000) with similarity: 0.82
  2. '297' (ID: 27502) with similarity: 0.50
  3. '313' (ID: 22997) with similarity: 0.50
  Original token is rank 1 in predictions

Original token: 'be' (ID: 2022)
  1. 'be' (ID: 2022) with similarity: 0.96
  2. '243' (ID: 22884) with similarity: 0.78
  3. '690' (ID: 28066) with similarity: 0.77
  Original token is rank 1 in predictions

Original token: '[SEP]' (ID: 102)
  1. '[CLS]' (ID: 101) with similarity: 0.64
  2. '[SEP]' (ID: 102) with similarity: 0.59
  3. '[MASK]' (ID: 103) with similarity: 0.41
  Original token is rank 2 in predictions

That's pretty damn close! The only row where the original token wasn't the #1 pick was the last one, and even then it was the second choice.

So: while it's not possible in general to reverse a matrix multiplication, if you use one to project something into a rich-enough space -- like an embedding projection -- you can get a kind of fuzzy reversal.

And that is what happens with this weight-tying trick. We're taking our context vectors and multiplying them by the transpose of the embedding matrix. And that -- somewhat fuzzily, kind of approximately -- does the reverse of the original conversion from one-hot vectors to embeddings. We're mapping from embeddings to vectors the same length as our vocabulary, where the size of each number is an indication of how probable it is that the context vector in question maps to the token whose ID is that number's index.

Or, equivalently: at the start of the LLM we're using the embedding matrix to project from vocab space to embedding space, and at the end we can use its transpose to project the context vectors back to vocab space.
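In PyTorch terms, the tying trick looks something like this -- a minimal sketch of my own, not the book's code (which, as we'll see below, deliberately keeps the output head separate):

import torch
import torch.nn as nn

vocab_size, emb_dim = 50257, 768    # GPT-2-ish sizes, just for illustration

tok_emb = nn.Embedding(vocab_size, emb_dim)
out_head = nn.Linear(emb_dim, vocab_size, bias=False)

# Weight tying: nn.Linear stores its weight as (out_features, in_features),
# i.e. (vocab_size, emb_dim) -- the same shape as the embedding matrix -- and
# its forward pass computes x @ weight.T, so sharing the parameter gives us
# exactly logits = context_vectors @ W_emb.T.
out_head.weight = tok_emb.weight

context_vectors = torch.randn(6, emb_dim)    # pretend output of the attention stack
logits = out_head(context_vectors)           # shape (6, vocab_size)
assert torch.allclose(logits, context_vectors @ tok_emb.weight.T)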

Logits

Now we're in a good position to look into why these values are called logits. We know that we're taking the context vectors and projecting them from embedding space into vocab space. We know that each one of these vectors is a list of numbers, one per possible token, where the larger the number, the more likely the LLM thinks it is to be the next token. Why does the book call them logits -- and why did Claude, when it was writing the code above for me, use the same word?

As is usually the case when looking at mathematical terminology, the Wikipedia page is scary -- and unfortunately doesn't clarify matters much. But Stack Overflow has a great answer, and from that, and from reading around a bit, "logits" seems to be used in neural network circles to mean a set of raw, unnormalised scores provided by an NN -- the kind of thing that, if you ran them through softmax to balance them out a bit and make them all add up to one, would be an actual believable probability distribution.
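For example, with some made-up logits for our toy four-token vocab:

import torch

logits = torch.tensor([2.0, 0.5, -1.0, 0.0])   # made-up logits for a four-token vocab

probs = torch.softmax(logits, dim=-1)
print(probs)          # roughly [0.71, 0.16, 0.04, 0.10] -- biggest logit, biggest probability
print(probs.sum())    # tensor(1.)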

This really has the feel of one of those cases where the mathematicians have spent ages coming up with a neat, clean and precise definition of something, and then applied scientists have come along, borrowed the word, and started using it loosely for something adjacent but not really the same.

Well, at least we're not physicists.

One token per token

So now I think there's only one thing left unexplained: there are n context vectors, one for each of our input tokens. Each one represents the "meaning" of the token in question, in the context of the input sequence as a whole (or more precisely, given causal attention, in the context of the part of the input sequence to the left of it). So that means that we're going to get n predictions.

What does that mean? Well, as each context vector is about the token in question, that means that each of these outputs is about that token. What the LLM has learned to do through the multiple levels of attention is not just to make the context vector for a token represent the token itself given the context -- it's actually making it represent the embedding for the most likely token to come after it. Or in other words, our projected vocab-space vectors are predictions of the next word for their respective tokens.

Let's make that a bit more concrete. When we feed in "the fat cat sat on the", then the first token's logits will be its predictions of what the word after "the" should be, based on the (zero) tokens to its left. So it will probably be pretty broad -- after all, during training the LLM will have encountered lots of sequences starting with "the", so it will probably roughly equally weight "cat", "quick", and all kinds of other tokens.

Likewise, the logits for the second token will have been based on the context vector that means "'fat', but there's a 'the' to the left". So it will probably have quite a high probability for "cat" as the next token, but might also have "controller", or any other word that it's frequently seen after "the fat".

By the time we get to the third token, it's likely that the LLM will have "locked in" -- that is, it'll be pretty sure that "'cat' coming after 'the fat'" should be followed by "sat". And so on.

So, the core insight here for me was that all of these attention layers working together to build up the context vectors are ultimately trying to create things that represent the token that comes next.

This, by the way, answers a question I had back in part 3: why is it that the training data for an LLM isn't just the next token that we're trying to predict? At training time, if we're training on "the fat cat sat on the mat", the input we're putting in is "the fat cat sat on the", and the target output we're basing our error function on is "fat cat sat on the mat" -- not just "mat".

That's now clear; the LLM returns next-token predictions for all of the input tokens. By training on those kinds of pairs, we're training it that:

- "the" should be followed by "fat"
- "the fat" should be followed by "cat"
- "the fat cat" should be followed by "sat"

...and so on.

Now, at inference time, when we're actually using the LLM to generate next tokens based on an input, we throw away all of the other predictions and just use the last one. But at training time, all of those predictions are useful -- measuring them is an important way to make sure that the model is learning.
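Here's a minimal sketch of what those shifted input/target pairs, the per-position loss, and the "just use the last prediction" step look like -- toy sizes and random stand-in logits, not the book's training code:

import torch
import torch.nn.functional as F

# Toy numbers: a seven-token vocab and our example sentence as made-up IDs.
vocab_size = 7
tokens = torch.tensor([0, 1, 2, 3, 4, 0, 5])   # "the fat cat sat on the mat", hypothetically

inputs = tokens[:-1]     # "the fat cat sat on the"
targets = tokens[1:]     # "fat cat sat on the mat" -- shifted along by one

# Stand-in for the model's output: one row of logits per input token.
logits = torch.randn(len(inputs), vocab_size)

# At training time, the loss is measured across *all* positions at once...
loss = F.cross_entropy(logits, targets)

# ...but at inference time we only care about the last row: the prediction
# for the token that comes after the whole input.
next_token_id = torch.argmax(logits[-1])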

I initially thought that there might be wasted effort in outputting those extra predictions -- perhaps we could simplify things to only predict the last one. But the work required to generate the context vector for (say) "fat" is necessary, at least in the early layers, because later layers will pick this up and add it on to the context for the tokens to its right. I kind of see this as information "seeping rightwards" over the attention process. Conceivably in the very last layer we could not bother with anything apart from the last token, but this doesn't sound like an optimisation that would gain much.

So: each token's set of logits in the output is the LLM's prediction of the token that should come after, based entirely on the token itself and the other tokens to its left.

Why we're avoiding weight-tying in practice

You may have noticed that throughout this I've been explaining things in terms of weight-tying, where the linear layer that projects the context vectors to logits is the embedding matrix, just transposed. But the code doesn't do that -- it uses a separate trainable linear layer -- and I think it's worth looking a little into why. Raschka explains that this is because he finds that results are better that way, but doesn't give any details.

From the understanding I've built above, my own intuition is that while the projection needed to take tokens into embedding space is likely to be similar to the one that should project them back (or strictly speaking, its transpose), there's actually no reason to assume it would be identical. The context vectors are inherently different to the embeddings we started with. For a start, the original token embeddings are only part of the input embeddings that are fed into the start of the LLM -- the position embeddings are added in, and presumably their influence will still be there in the final context vectors after all of the attention layers. But there's also all of the information from other tokens that has been mixed in there too.

Ultimately, the original embedding layer maps tokens to a nice crisp set of embeddings for tokens, where each one has a specific representation in embedding space. After those have had their positions mixed in and been mashed up with each other by attention, the mapping back is going to be different to at least some degree.

Of course, they will have some kind of similarity, and by having two completely separate trainable things, one for each, you have more parameters to train. But it makes sense that it could well be worth it in terms of better results.
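Just to put a rough number on that extra cost, assuming GPT-2-small sizes (a vocab of 50,257 and 768-dimensional embeddings -- my own back-of-the-envelope arithmetic, not the book's exact figures):

# Back-of-the-envelope cost of an untied output head, assuming GPT-2-small
# sizes (vocab_size=50257, emb_dim=768) -- illustrative numbers only.
vocab_size, emb_dim = 50257, 768
extra_params = vocab_size * emb_dim
print(f"{extra_params:,} extra trainable parameters")   # 38,597,376 -- roughly 39M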

So, yes, it was that simple

I think that kind of wraps this one up. It seemed crazy to me when I first saw it that a single linear layer could take these complex things, these context vectors our attention heads had built up over multiple layers of multi-head attention, and somehow spit out something that means "the most likely next token is X".

But when you think of it in terms of trainable projections from vocab space to embedding space and back again -- and especially when you see that a real-world set of embeddings can actually do that back-and-forth translation -- and when you consider that each context vector, after all of the attention layers, is essentially pointing in the direction of "this is the next token after my one" -- then it comes together.

So that's it for this time around! Tune in next time for -- I think -- layer normalisation. Hopefully that'll be an easier one.

Free bonus section! A note about perplexity

Just as I was finishing this off, I found myself thinking that logits were interesting because you could take some measure of how certain the LLM was about the next token from them. For example, if all of the logits were the same number, it would mean that the LLM has absolutely no idea what token might come back -- it's giving an equal chance to all of them. If all of them were zero apart from one, which was a positive number, then it would be 100% sure about what the next one was going to be. If you could represent that in a single number -- let's say, 0 means that it has only one candidate and 1 means that it hasn't even the slightest idea what is most likely -- then it would be an interesting measure of how certain the LLM was about its choice.

Turns out (unsurprisingly) that I'd re-invented something that's been around for a long time. That number is called perplexity, and I imagine that's why the largest AI-enabled web search engine borrowed that name.
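For the record, perplexity as usually defined isn't scaled from 0 to 1 like my back-of-the-envelope version: for a single prediction you can compute it as the exponentiated entropy of the distribution, which runs from 1 (completely certain) up to the vocab size (no idea at all); over a whole dataset it's normally computed from the cross-entropy against the true tokens instead. A quick sketch of the single-distribution version:

import torch

def perplexity(logits: torch.Tensor) -> torch.Tensor:
    """Exponentiated entropy of the distribution implied by one row of logits."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.log()).sum()
    return entropy.exp()

vocab_size = 4
print(perplexity(torch.zeros(vocab_size)))               # uniform: perplexity == vocab size, i.e. 4
print(perplexity(torch.tensor([10.0, 0.0, 0.0, 0.0])))   # near-certain: perplexity just above 1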


[1] People familiar with linear algebra will know that there are specific cases -- with square, non-singular matrices -- where there are matrix inverses. But that doesn't apply here.