The fixed-length bottleneck and the feed-forward network
This post is a kind of note-to-self about a hitch in my understanding of the mechanics of LLMs at this point in my journey. Please treat it as the musings of a learner, and if you have suggestions on ways around this minor roadblock, comments below would be very welcome!
Having read about and come to the seeds of a working understanding of the role of the feed-forward network in a GPT-style LLM, something has come to mind that I'm still working my way through. It's likely due to a bug in at least one of the mental models I've constructed so far, so what I'd like to do in this post is express the issue as clearly as I can. Hopefully having done that I'll be able to work my way through it in the future, and will be able to post about the solution.
The core of the issue is that the feed-forward network operates on a per-context-vector basis -- that is, the context vectors for each and every token are processed by the same one-hidden-layer neural network in parallel, with no crosstalk between them -- the inter-token communication is all happening in the attention mechanism.
But this means that the amount of data that the FFN is handling is fixed -- it's a vector of numbers, with a dimensionality determined by the LLM's architecture -- 768 for the 124M parameter GPT-2 model I'm studying.
Here's the issue: in my mental model of the LLM, the attention mechanism is working out what to think about, but the FFN is what's doing the thinking (for hand-wavy values of "thinking"). So, given that it's thinking about one context vector at a time, there's a limit to how much it can think about -- just whatever can be represented in those 768 dimensions for this size of GPT-2.
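To convince myself of that no-crosstalk point, here's a minimal NumPy sketch (random stand-in weights, the tanh approximation of GELU that GPT-2 uses, and the 124M model's 768 → 3072 → 768 FFN dimensions): running the whole sequence through the FFN at once gives exactly the same result as running each context vector through on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

emb_dim, hidden_dim, seq_len = 768, 4 * 768, 5  # GPT-2 124M: FFN expands 768 -> 3072
W1 = rng.normal(size=(emb_dim, hidden_dim)) * 0.02
W2 = rng.normal(size=(hidden_dim, emb_dim)) * 0.02

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x):
    return gelu(x @ W1) @ W2

context_vectors = rng.normal(size=(seq_len, emb_dim))

# Run the whole sequence through at once...
all_at_once = ffn(context_vectors)
# ...and each token's context vector through entirely on its own.
one_by_one = np.stack([ffn(v) for v in context_vectors])

# Identical results: the FFN never mixes information between positions.
assert np.allclose(all_at_once, one_by_one)
```

(Biases are omitted here for brevity; the real GPT-2 FFN has them, but they don't change the per-token point.)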
This reminds me very much of the fixed-length bottleneck that plagued early encoder-decoder translation systems. There's a limit to how much data you can jam into a single vector.
Now, this is an error of some kind on my side -- I'm nowhere near knowledgeable enough about LLMs or AI in general to have genuinely spotted a flaw like this in the architecture. And I'm pretty sure that the answer lies in one of my mental models being erroneous.
It seems likely that it's related to the interplay between the attention mechanism and the FFNs; that's certainly what's come through in my discussions with various AIs about it. But none of the explanations I've read has quite gelled for me, so in this post I'll lay out the issue as well as I can, so that later on I can explain the error in my ways :-)
Moving from Fabric3 to Fabric
I decided to see how long I could go without coding anything after starting my sabbatical -- I was thinking I could manage a month or so, but it turns out that the answer was one week. Ah well. This post is a little less deep-tech than normal -- just the story of some cruft removal I've wanted to do for some time. I'm posting it because I'm pretty sure I know some people who are planning to do the same upgrade -- hopefully it will be useful for them :-)
I have a new laptop on the way, and wanted to polish up the script I have to install the OS. I like all of my machines to have a pretty consistent Arch install, with a few per-machine tweaks for different screen sizes and whatnot. The process is:
- Boot the new machine from the Arch install media, and get it connected to the LAN.
- Run the script on another machine on the same network -- provide it with the root password (having temporarily enabled root login via SSH on the new machine) and the new machine's IP address.
- Let it do its thing, providing it with extra information every now and then -- for example, after reboots it will ask whether the IP address is still the same.
This works pretty well, and I've been using it since 2017. The way it interacts with the machine is by using Fabric. I've never taken to declarative machine setup systems like Ansible -- I always find you wind up re-inventing procedural logic in them eventually, and it winds up being a mess -- so a tool like Fabric is ideal. You can just run commands over the network, upload and download files, and so on.
The problem was that I was using Fabric3. When I started writing these scripts in 2017, it was a bit of a weird time for Fabric. It didn't support Python 3 yet, so the only way to work in a modern Python was to use Fabric3, a fork that just added that.
I think the reason behind the delay in Python 3 support for the main project was that the team behind it were in the process of redesigning it with a new API, and wanted to batch the changes together; when Fabric 2.0.0 came out in 2018, with a completely different usage model, it was Python 3 compatible. (It does look like they backported the Python 3 stuff to the 1.x series later -- at least on PyPI there is a release of 1.15.0 in 2022 that added it.)
So, I was locked into an old dependency, Fabric3, which hadn't been updated since 2018. This felt like something I should fix, just to keep things reasonably tidy. But that meant completely changing the model of how my scripts ran -- this blog post is a summary of what I had to do. The good news is: it was actually really simple, and the new API is definitely an improvement.
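To give a flavour of the change, here's a hedged sketch of the new connection-object model (the commands and file paths here are made up for illustration -- this isn't my actual install script): where Fabric 1.x/Fabric3 used global state in fabric.api.env plus module-level run()/put() functions, Fabric 2.x+ makes the connection an explicit object.

```python
def provision(host: str, root_password: str) -> None:
    """Sketch of the Fabric 2.x+ style; hypothetical commands/paths."""
    from fabric import Connection  # Fabric 2.x+: explicit connection objects

    # The old Fabric 1.x / Fabric3 equivalent was global state plus
    # module-level functions:
    #   from fabric.api import env, run, put
    #   env.host_string, env.password = host, root_password
    #   run("pacman -Syu --noconfirm")
    with Connection(host, user="root",
                    connect_kwargs={"password": root_password}) as c:
        c.run("pacman -Syu --noconfirm")                  # run a remote command
        c.put("configs/pacman.conf", "/etc/pacman.conf")  # upload a file
```

The nice side-effect of the explicit object is that talking to several machines from one script stops being an exercise in juggling globals.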
Writing an LLM from scratch, part 15 -- from context vectors to logits; or, can it really be that simple?!
Having worked through chapter 3 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and spent some time digesting the concepts it introduced (most recently in my post on the complexity of self-attention at scale), it's time for chapter 4.
I've read it through in its entirety, and rather than working through it section-by-section in order, like I did with the last one, I think I'm going to jump around a bit, covering each new concept and how I wrapped my head around it separately. This chapter is a lot easier conceptually than the last, but there were still some "yes, but why do we do that?" moments.
The first of those is the answer to a question I'd been wondering about since at least part 6 in this series, and probably before. The attention mechanism is working through the (tokenised, embedded) input sequence and generating these rich context vectors, each of which expresses the "meaning" of its respective token in the context of the words that came before it. How do we go from there to predicting the next word in the sequence?
The answer, at least in the form of code showing how it happens, leaped out at me the first time I looked at the first listing in this chapter, for the initial DummyGPTModel that will be filled in as we go through it. In its __init__, we create our token and position embedding mappings, and an object to handle dropout, then the multiple layers of attention heads (which are a bit more complex than the heads we've been working with so far, but more on that later), then some kind of normalisation layer, then:
self.out_head = nn.Linear(
    cfg["emb_dim"], cfg["vocab_size"], bias=False
)
...and then in the forward method, we run our tokens through all of that and then:
logits = self.out_head(x)
return logits
The x in that second bit of code is our context vectors from all of that hard work the attention layers did -- folded, spindled and mutilated a little by things like layer normalisation and being run through feed-forward networks with GELU (about both of which I'll go into in future posts) -- but ultimately just the context vectors.
And all we do to convert it into these logits, the output of the LLM, is run it through a single neural network layer. There's not even a bias, or an activation function -- it's basically just a single matrix multiplication!
My initial response was, essentially, WTF. Possibly WTFF. Gradient descent over neural networks is amazingly capable at learning things, but this seemed quite a heavy lift. Why would something so simple work? (And also, what are "logits"?)
Unpicking that took a bit of thought, and that's what I'll cover in this post.
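To make the shapes concrete before digging in, here's a NumPy stand-in for that bias-free nn.Linear with GPT-2 124M's dimensions (random weights, of course -- in the real model W_out is learned): one matrix multiplication takes us from one 768-dimensional context vector per token to one score per vocabulary entry per token.

```python
import numpy as np

rng = np.random.default_rng(0)

emb_dim, vocab_size, seq_len = 768, 50257, 4  # GPT-2 124M dimensions
W_out = rng.normal(size=(emb_dim, vocab_size)) * 0.02  # stand-in for out_head's weights

x = rng.normal(size=(seq_len, emb_dim))  # one context vector per input token

# No bias, no activation function: the whole projection really is just
# a single matrix multiplication.
logits = x @ W_out

# One raw score ("logit") per vocabulary entry, for every position.
assert logits.shape == (seq_len, vocab_size)
```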
Writing an LLM from scratch, part 12 -- multi-head attention
In this post, I'm wrapping up chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered batches, which -- somewhat to my disappointment -- didn't involve completely new (to me) high-order tensor multiplication, but instead relied on batched and broadcast matrix multiplication. That was still interesting on its own, however, and at least was easy enough to grasp that I didn't disappear down a mathematical rabbit hole.
The last section of chapter 3 is about multi-head attention, and while it wasn't too hard to understand, there were a couple of oddities that I want to write down -- as always, primarily to get it all straight in my own head, but also just in case it's useful for anyone else.
So, the first question is, what is multi-head attention?
Writing an LLM from scratch, part 11 -- batches
I'm still working through chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered dropout, which was nice and easy.
This time I'm moving on to batches. Batches allow you to run a bunch of different input sequences through an LLM at the same time, generating outputs for each in parallel, which can make training and inference more efficient -- if you've read my series on fine-tuning LLMs you'll probably remember I spent a lot of time trying to find exactly the right batch sizes for speed and for the memory I had available.
This was something I was originally planning to go into in some depth, because there's some fundamental maths there that I really wanted to understand better. But the more time I spent reading into it, the more of a rabbit hole it became -- and I had decided on a strict "no side quests" rule when working through this book.
So in this post I'll just present the basic stuff, the stuff that was necessary for me to feel comfortable with the code and the operations described in the book. A full treatment of linear algebra and higher-order tensor operations will, sadly, have to wait for another day...
Let's start off with the fundamental problem of why batches are a bit tricky in an LLM.
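As a taste of the mechanics, here's the batched/broadcast matrix multiplication mentioned above, sketched in NumPy with tiny made-up dimensions: the @ operator treats the leading axis as a batch dimension and multiplies the trailing two axes pairwise, so one call produces a whole batch of attention-score matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

batch, seq_len, emb_dim = 8, 6, 16  # tiny made-up dimensions for illustration
queries = rng.normal(size=(batch, seq_len, emb_dim))
keys = rng.normal(size=(batch, seq_len, emb_dim))

# Batched matrix multiplication: @ broadcasts over the leading (batch) axis
# and matrix-multiplies the last two axes of each pair of stacked matrices.
scores = queries @ keys.transpose(0, 2, 1)

assert scores.shape == (batch, seq_len, seq_len)  # one score matrix per sequence
```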
Writing an LLM from scratch, part 9 -- causal attention
My trek through Sebastian Raschka's "Build a Large Language Model (from Scratch)" continues... Self-attention was a hard nut to crack, but now things feel a bit smoother -- fingers crossed that lasts! So, for today: a quick dive into causal attention, the next part of chapter 3.
Causal attention sounds complicated, but it just means that when we're looking at a word, we don't pay any attention to the words that come later. That's pretty natural -- after all, when we're reading something, we don't need to look at words later on to understand what the word we're reading now means (unless the text is really badly written).
It's "causal" in the sense of causality -- something can't have an effect on something that came before it, just as causes can't come after their effects in reality. One big plus of getting that is that I finally understand why I was using a class called AutoModelForCausalLM for my earlier experiments in fine-tuning LLMs.
Let's take a look at how it's done.
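As a preview, here's the masking idea in a NumPy sketch (random scores, tiny sequence -- the book does this in PyTorch, but the principle is identical): set the attention scores for future tokens to minus infinity, so that softmax assigns them exactly zero weight.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len = 5
scores = rng.normal(size=(seq_len, seq_len))  # raw attention scores, one row per query token

# Mask everything above the diagonal: token i may only attend to tokens 0..i.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax turns -inf into zero weight, so future tokens contribute nothing.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

assert np.allclose(np.triu(weights, k=1), 0)   # no attention to the future
assert np.allclose(weights.sum(axis=-1), 1)    # each row is still a distribution
```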
Michael Foord: RIP
Michael Foord, a colleague and friend, passed away this weekend. His passing leaves a huge gap in the Python community.
I first heard from him in early 2006. Some friends and I had just started a new company and there were two of us on the team, both experienced software developers. We'd just hired our third dev, another career coder, but as an XP shop that paired on all production code, we needed a fourth. We posted on the Python.org jobs list to see who we could find, and we got a bunch of applications, among them one from the cryptically-named Fuzzyman, a sales manager at a building supplies merchant who was planning a career change to programming.
He'd been coding as a hobby (I think because a game he enjoyed supported Python scripting), and while he was a bit of an unusual candidate, he wowed us when he came in. But even then, we almost didn't hire him -- there was another person who was also really good, and a bit more conventional, so initially we made an offer to them. To our great fortune, the other person turned the offer down and we asked Michael to join the team. I wrote to my co-founders "it was an extremely close thing and - now that the dust is settling - I think [Michael] may have been the better choice anyway."
That was certainly right! Michael's outgoing and friendly nature changed the company's culture from an inward-facing group of geeks to active members of the UK Python community. He got us sponsoring and attending PyCon UK, and then PyCon US, and (not entirely to our surprise) when we arrived at the conferences, we found that he already appeared to be best friends with everyone. It's entirely possible that he'd never actually met anyone there before -- with Michael, you could never be sure.
Michael's warm-hearted, outgoing personality, and his rapidly developing technical skills, made him an ever-more visible character in the Python community, and he became almost the company's front man. I'm sure a bunch of people only joined our team later because they'd met him first.
I remember him asking one day whether we would consider open-sourcing the rather rudimentary mocking framework we'd built for our internal unit-testing. I was uncertain, and suggested that perhaps he would be better off using it for inspiration while writing his own, better one. He certainly managed to do that.
Sadly things didn't work out with that business, and Michael decided to go his own way in 2009, but we stayed in touch. One of the great things about him was that when you met him after months, or even years, you could pick up again just where you left off. At conferences, if you found yourself without anyone you knew, you could just follow the sound of his booming laugh to find where the fun crowd were hanging out. We kept in touch over Facebook, and I always looked forward to the latest loony posts from Michael Foord -- or Michael Fnord, as he posted during his fairly frequent bans...
This weekend's news came as a terrible shock, and I really feel that we've lost a little bit of the soul of the Python community. Rest in peace, Michael -- the world is a sadder and less wonderfully crazy place without you.
[Update: I was reading through some old emails and spotted that he was telling me I should start blogging in late 2006. So this very blog's existence is probably a direct result of Michael's advice. Please don't hold it against his memory ;-)]
[Update: there's a wonderful thread on discuss.python.org where people are posting their memories. I highly recommend reading it, and posting to it if you knew Michael.]
An AI chatroom (a few steps further)
Still playing hooky from "Build a Large Language Model (from Scratch)" -- I was on our support rota today and felt a little drained afterwards, so decided to finish off my AI chatroom. The codebase is now in a state where I'm reasonably happy with it -- it's not production-grade code by any stretch of the imagination, but the structure is acceptable, and it has the basic functionality I wanted:
- A configurable set of AIs
- Compatibility with the OpenAI API (for OpenAI itself, Grok and DeepSeek) and with Anthropic's (for Claude).
- Persistent history so that you can start a chat and have it survive a restart of the bot.
- Pretty reasonable behaviour of the AIs, with each building on what the others say.
An AI chatroom (beginnings)
So, I know that I decided I would follow a "no side quests" rule while reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", but rules are made to be broken.
I've started building a simple Telegram bot that can be used to chat with multiple AI models at the same time, the goal being to allow them to have limited interaction with each other. I'm not sure if it's going to work well, and it's very much a work-in-progress -- but here's the repo.
More info below the fold.
Writing an LLM from scratch, part 3
I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and posting about what I found interesting every day that I read some of it.
Here's a link to the previous post in this series.
Today I was working through the second half of Chapter 2, "Working with text data", which I'd started just before Christmas. Only two days off, so it was reasonably fresh in my mind :-)