What AI chatbots are actually doing under the hood

Posted on 29 August 2025 in AI

This article is the first of three "state of play" posts that explain how Large Language Models work, aimed at readers with the level of understanding I had in mid-2022: techies with no deep AI knowledge. It grows out of part 19 in my series working through Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

As a techie, I'm sometimes asked by less-geeky friends how tools like ChatGPT, Claude, Grok, or DeepSeek actually work under the hood. Over time I've refined my answer, partly because I found better ways to explain it, and partly because my own understanding improved as I learned how to build one myself from scratch.

This post is an attempt to describe in plain English what is going on. It's made up of a series of descriptions, each one building on the one before but refining it a bit, until we reach something that's strictly accurate, but which might have been a bit overwhelming if presented in one go.

If you're reading this as a techie who wants to learn about AI, then I definitely recommend that you read until the end. But if you're reading for general interest, you can safely stop reading at any time -- hopefully you'll wind up with a more solid understanding of what's going on, even if it doesn't include all of the tiny details.

Before we kick off, though: AI bots aren't just chatbots any more; many are multimodal -- for example, ChatGPT can analyse images you provide or generate its own. They also have "thinking" modes, which allow them to ponder their answers before replying. I won't cover those aspects -- just the text-only systems that we had 12 months or so ago, in 2024. And I'll only be talking about inference -- that is, using an existing AI, rather than the training process used to create them.

So, with all that said, let's get started! We can begin by looking at next-word prediction.

[ Read more ]


Writing an LLM from scratch, part 19 -- wrapping up Chapter 4

Posted on 29 August 2025 in AI, LLM from scratch, TIL deep dives

I've now finished chapter 4 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", having worked through shortcut connections in my last post. The remainder of the chapter doesn't introduce any new concepts -- instead, it shows how to assemble all of the code we've worked through so far into a full GPT-style LLM. You can see my code here, in the file gpt.py -- though if you're also working through the book, I strongly recommend that you type it in yourself: I found that even the mechanical process of typing really helped me to solidify the concepts.

So instead of writing a post about the rather boring process of typing in code, I decided to put together something in the spirit of the post I wished I'd found when I started reading the book. I would summarise everything I've learned, with links back to the other posts in this series. As I wrote it, I realised that the best way to describe things was to try to explain them to myself as I was before ChatGPT came out, say in mid-2022 -- a techie, yes, but with minimal understanding of how modern AI works.

Some 6,000 words in, I started thinking that perhaps I was trying to pack a little bit too much into it. So, coming up next, three "state of play" posts, targeting people with 2022-Giles' level of knowledge.

Now it's time to move on to the next chapter: training. Hopefully all the time I spent fine-tuning LLMs last year will turn out to be useful there!

If you want to jump straight ahead to that, here's the first post on training.


Writing an LLM from scratch, part 18 -- residuals, shortcut connections, and the Talmud

Posted on 18 August 2025 in AI, LLM from scratch, TIL deep dives

I'm getting towards the end of chapter 4 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". When I first read this chapter, it seemed to be about tricks to make LLMs trainable, but having gone through it more closely, only the first part -- on layer normalisation -- seems to fit into that category. The second part, about the feed-forward network, definitely does not -- that's the part of the LLM that does a huge chunk of the thinking needed for next-token prediction. And this post is about another one like that: the shortcut connections.

The reason I want to highlight this is that the presentation in the book really is all about making a network trainable -- about helping with the vanishing gradients that deep neural networks are prone to.

But the more I looked into it, the more I realised that what we're doing with these shortcuts is a fundamental change to the architecture of the LLM as it's been expressed so far. Gradients do indeed vanish less than they would without them, but that's more of a side-effect than the reason for adding them.
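To make that concrete, here's a minimal NumPy sketch of what a shortcut connection changes. A tanh layer stands in for a real attention or feed-forward sublayer (that's an assumption for brevity, not the book's actual code), but it's enough to show the architectural difference and the signal-shrinking side-effect:

```python
import numpy as np

rng = np.random.default_rng(0)

def sublayer(x, W):
    # Stand-in for an attention block or feed-forward network:
    # any shape-preserving transformation of the input.
    return np.tanh(x @ W)

def block_plain(x, W):
    # Without a shortcut: the output is only the transformed input.
    return sublayer(x, W)

def block_residual(x, W):
    # With a shortcut: the input is added back, so the original signal
    # (and, during training, the gradient) has a direct path around
    # the transformation.
    return x + sublayer(x, W)

x = rng.normal(size=(4,))           # a single 4-dimensional "context vector"
W = rng.normal(size=(4, 4)) * 0.1   # small weights, as at the start of training

plain, shortcut = x, x
for _ in range(8):                  # stack eight blocks, as in a deep network
    plain = block_plain(plain, W)
    shortcut = block_residual(shortcut, W)

# With small initial weights, the plain stack squashes the signal towards
# zero, while the residual stack keeps it at a healthy magnitude.
print(np.linalg.norm(plain), np.linalg.norm(shortcut))
```

With the shortcut, each block computes "input plus a correction" rather than replacing the input wholesale -- which is the architectural shift the post goes on to discuss.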

Here's why.

[ Read more ]


The fixed-length bottleneck and the feed-forward network

Posted on 14 August 2025 in AI, Python, Musings

This post is a kind of note-to-self of a hitch I'm having in my understanding of the mechanics of LLMs at this point in my journey. Please treat it as the musings of a learner, and if you have suggestions on ways around this minor roadblock, comments below would be very welcome!

Having read about the role of the feed-forward network in a GPT-style LLM, and come to the seeds of a working understanding of it, I've hit something that I'm still puzzling over. It's likely due to a bug in at least one of the mental models I've constructed so far, so what I'd like to do in this post is express the issue as clearly as I can. Hopefully, having done that, I'll be able to work through it in the future, and post about the solution.

The core of the issue is that the feed-forward network operates on a per-context-vector basis -- that is, the context vectors for each and every token are processed by the same one-hidden-layer neural network in parallel, with no crosstalk between them -- the inter-token communication is all happening in the attention mechanism.

But this means that the amount of data that the FFN is handling is fixed -- it's a vector of numbers, with a dimensionality determined by the LLM's architecture -- 768 for the 124M parameter GPT-2 model I'm studying.

Here's the issue: in my mental model of the LLM, the attention mechanism is working out what to think about, but the FFN is what's doing the thinking (for hand-wavy values of "thinking"). So, given that it's thinking about one context vector at a time, there's a limit to how much it can think about -- just whatever can be represented in those 768 dimensions for this size of GPT-2.

This reminds me very much of the fixed-length bottleneck that plagued early encoder-decoder translation systems. There's a limit to how much data you can jam into a single vector.
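Here's a minimal NumPy sketch of the set-up being described: a position-wise FFN (with ReLU standing in for GPT-2's GELU, purely for brevity) applied to a sequence of context vectors. It shows that each token really is processed independently through the same weights, so each one gets a fixed 768-dimensional "budget" regardless of sequence length:

```python
import numpy as np

rng = np.random.default_rng(42)

emb_dim = 768                 # GPT-2 (124M) embedding dimension
hidden_dim = 4 * emb_dim      # GPT-2 expands by 4x inside the FFN

# One set of weights, shared across every token position.
W1, b1 = rng.normal(size=(emb_dim, hidden_dim)) * 0.02, np.zeros(hidden_dim)
W2, b2 = rng.normal(size=(hidden_dim, emb_dim)) * 0.02, np.zeros(emb_dim)

def ffn(x):
    # x: (num_tokens, emb_dim).  Each row is processed independently --
    # there is no mixing between rows here; that happens in attention.
    h = np.maximum(0, x @ W1 + b1)   # ReLU standing in for GELU
    return h @ W2 + b2

seq = rng.normal(size=(10, emb_dim))   # context vectors for 10 tokens

# Processing the whole sequence at once...
batched = ffn(seq)
# ...matches processing each token entirely on its own: no crosstalk.
per_token = np.stack([ffn(t[None, :])[0] for t in seq])
print(np.allclose(batched, per_token))
```

However long the context is, the FFN only ever sees one 768-dimensional vector at a time -- which is exactly the "fixed amount of data" the post is worrying about.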

Now, this is an error of some kind on my side -- I'm far from knowledgeable enough about LLMs or AI in general to have spotted a genuine problem that the experts have missed. And I'm pretty sure that the answer lies in one of my mental models being erroneous.

It seems likely that it's related to the interplay between the attention mechanism and the FFNs; that's certainly what's come through in my discussions with various AIs about it. But none of the explanations I've read has quite gelled for me, so in this post I'll set out the issue as clearly as I can, so that later on I can explain the error of my ways :-)

[ Read more ]


Writing an LLM from scratch, part 17 -- the feed-forward network

Posted on 12 August 2025 in AI, LLM from scratch, TIL deep dives

I'm still working through chapter 4 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". This chapter not only puts together the pieces that the previous ones covered, but adds on a few extra steps. I'd previously been thinking of these steps as just useful engineering techniques ("folding, spindling and mutilating" the context vectors) to take a model that would work in theory, but not in practice, and make it something trainable and usable -- but in this post I'll explain why that was wrong.

Last time I covered layer normalisation, which I managed to get a satisfactory (but not great) handle on -- it is a trick to constrain the outputs of a layer in the LLM so that one token doesn't "drown out" the signals from the others and cause problems like exploding or vanishing gradients during training (and, I would imagine, to some degree during inference). That definitely is one of those engineering techniques to ensure trainability.
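For reference, here's a minimal NumPy sketch of layer normalisation as described above: each context vector is normalised to zero mean and unit variance across its features, then scaled and shifted by learned parameters (shown here with identity values):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalise each context vector (each row) to zero mean and unit
    # variance across its features, then apply a learned scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(2, 8))  # wildly-scaled activations
gamma, beta = np.ones(8), np.zeros(8)            # identity scale and shift

out = layer_norm(x, gamma, beta)
print(out.mean(axis=-1))  # ~0 for each row
print(out.var(axis=-1))   # ~1 for each row
```

Whatever scale the activations come in at, they leave at a standard one -- which is what keeps any single signal from drowning out the rest.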

This time I want to go through the feed-forward layer, which is a different kind of beast. It is covered in just four and a half pages in the book, and the implementation is really simple -- indeed, two of the pages are an in-depth look at GELU, the activation function used by the GPT-2-style model we're building, and the rest just makes concrete how to write a simple neural network with one hidden layer that uses it.

So, the "how" was simple enough. It was the "why" that surprised me. The more I thought about it, the more I realised that this part of the LLM is just as important as the attention mechanism itself. In my current working model, at least, attention tells the LLM what to think about -- gathering the "meaning" of an input vector in the context of those to its left -- but it's the linear layers that actually do that thinking and allow the LLM as a whole to make next-token predictions. Indeed, there are more parameters in a normal LLM for these networks than there are for the attention mechanism itself! They're clearly super-important.

Let's dig in.

[ Read more ]