Writing an LLM from scratch, part 11 -- batches

Posted on 19 April 2025 in AI, Python, LLM from scratch, TIL deep dives

I'm still working through chapter 3 of Sebastian Raschka's "Build a Large Language Model (from Scratch)". Last time I covered dropout, which was nice and easy.

This time I'm moving on to batches. Batches allow you to run a bunch of different input sequences through an LLM at the same time, generating outputs for each in parallel, which can make training and inference more efficient -- if you've read my series on fine-tuning LLMs you'll probably remember I spent a lot of time trying to find exactly the right batch sizes for speed and for the memory I had available.

This was something I was originally planning to go into in some depth, because there's some fundamental maths there that I really wanted to understand better. But the more time I spent reading into it, the more of a rabbit hole it became -- and I had decided on a strict "no side quests" rule when working through this book.

So in this post I'll just present the basic stuff, the stuff that was necessary for me to feel comfortable with the code and the operations described in the book. A full treatment of linear algebra and higher-order tensor operations will, sadly, have to wait for another day...
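To make the idea of a batch concrete, here's a minimal pure-Python sketch. It isn't code from the book (which works with PyTorch tensors); the `embed`, `embed_batch` and `shape` helpers and the tiny embedding table are made up purely for illustration. The point is just that a batch adds one extra leading dimension: a single sequence gives a (sequence length, embedding size) result, and a batch of them gives (batch size, sequence length, embedding size).

```python
# Hypothetical toy embedding table: token ID -> 2-dimensional vector.
EMBEDDINGS = {
    0: [0.1, 0.2],
    1: [0.3, 0.4],
    2: [0.5, 0.6],
}

def embed(token_ids):
    """Embed one sequence: result has shape (seq_len, emb_dim)."""
    return [EMBEDDINGS[t] for t in token_ids]

def embed_batch(batch):
    """Embed several equal-length sequences:
    result has shape (batch_size, seq_len, emb_dim)."""
    return [embed(seq) for seq in batch]

def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

batch = [[0, 1, 2], [2, 2, 0]]   # two sequences of three tokens each
out = embed_batch(batch)
print(shape(embed(batch[0])))    # (3, 2)
print(shape(out))                # (2, 3, 2)
```

In a real framework the Python loop over the batch is replaced by a single tensor operation, which is where the efficiency gain comes from.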

Let's start off with the fundamental problem of why batches are a bit tricky in an LLM.

[ Read more ]


Writing an LLM from scratch, part 9 -- causal attention

Posted on 9 March 2025 in AI, Python, LLM from scratch, TIL deep dives

My trek through Sebastian Raschka's "Build a Large Language Model (from Scratch)" continues... Self-attention was a hard nut to crack, but now things feel a bit smoother -- fingers crossed that lasts! So, for today: a quick dive into causal attention, the next part of chapter 3.

Causal attention sounds complicated, but it just means that when we're looking at a word, we don't pay any attention to the words that come later. That's pretty natural -- after all, when we're reading something, we don't need to look at words later on to understand what the word we're reading now means (unless the text is really badly written).
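As a rough illustration of that "no looking ahead" rule, here's a small pure-Python sketch. It's an assumption-laden toy rather than the book's PyTorch code: the `causal_softmax` name and the attention scores are invented for the example. Each row of scores is normalised with a softmax, but only over the positions up to and including the current token; everything later gets a weight of zero.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a square matrix of attention scores,
    with every position after the current token masked out."""
    weights = []
    for i, row in enumerate(scores):
        # Token i may only attend to tokens 0..i.
        visible = row[: i + 1]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Masked (future) positions get a weight of exactly zero.
        future = [0.0] * (len(row) - i - 1)
        weights.append([e / total for e in exps] + future)
    return weights

# Made-up attention scores for a three-token sequence.
scores = [
    [0.9, 0.1, 0.4],
    [0.2, 0.8, 0.3],
    [0.5, 0.6, 0.7],
]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row is all weight on position 0; every row still sums to one, and the upper-right triangle is all zeros.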

It's "causal" in the sense of causality -- something can't have an effect on something that came before it, just like causes can't come after effects in reality. One big plus about getting that is that I finally now understand why I was using a class called AutoModelForCausalLM for my earlier experiments in fine-tuning LLMs.

Let's take a look at how it's done.

[ Read more ]


Michael Foord: RIP

Posted on 26 January 2025 in Personal, Python

Michael Foord, a colleague and friend, passed away this weekend. His passing leaves a huge gap in the Python community.

I first heard from him in early 2006. Some friends and I had just started a new company and there were two of us on the team, both experienced software developers. We'd just hired our third dev, another career coder, but as an XP shop that paired on all production code, we needed a fourth. We posted on the Python.org jobs list to see who we could find, and we got a bunch of applications, among them one from the cryptically-named Fuzzyman, a sales manager at a building supplies merchant who was planning a career change to programming.

He'd been coding as a hobby (I think because a game he enjoyed supported Python scripting), and while he was a bit of an unusual candidate, he wowed us when he came in. But even then, we almost didn't hire him -- there was another person who was also really good, and a bit more conventional, so initially we made an offer to them. To our great fortune, the other person turned the offer down and we asked Michael to join the team. I wrote to my co-founders "it was an extremely close thing and - now that the dust is settling - I think [Michael] may have been the better choice anyway."

That was certainly right! Michael's outgoing and friendly nature changed the company's culture from an inward-facing group of geeks to active members of the UK Python community. He got us sponsoring and attending PyCon UK, and then PyCon US, and (not entirely to our surprise) when we arrived at the conferences, we found that he already appeared to be best friends with everyone. It's entirely possible that he'd never actually met anyone there before -- with Michael, you could never be sure.

Michael's warm-hearted outgoing personality, and his rapidly developing technical skills, made him become an ever-more visible character in the Python community, and he became almost the company's front man. I'm sure a bunch of people only joined our team later because they'd met him first.

I remember him asking one day whether we would consider open-sourcing the rather rudimentary mocking framework we'd built for our internal unit-testing. I was uncertain, and suggested that perhaps he would be better off using it for inspiration while writing his own, better one. He certainly managed to do that.

Sadly things didn't work out with that business, and Michael decided to go his own way in 2009, but we stayed in touch. One of the great things about him was that when you met him after multiple months, or even years, you could pick up again just where you left off. At conferences, if you found yourself without anyone you knew, you could just follow the sound of his booming laugh to know where the fun crowd were hanging out. We kept in touch over Facebook, and I always looked forward to the latest loony posts from Michael Foord, or Michael Fnord as he posted as during his fairly-frequent bans...

This weekend's news came as a terrible shock, and I really feel that we've lost a little bit of the soul of the Python community. Rest in peace, Michael -- the world is a sadder and less wonderfully crazy place without you.

[Update: I was reading through some old emails and spotted that he was telling me I should start blogging in late 2006. So this very blog's existence is probably a direct result of Michael's advice. Please don't hold it against his memory ;-)]

[Update: there's a wonderful thread on discuss.python.org where people are posting their memories. I highly recommend reading it, and posting to it if you knew Michael.]


An AI chatroom (a few steps further)

Posted on 30 December 2024 in Python, AI

Still playing hooky from "Build a Large Language Model (from Scratch)" -- I was on our support rota today and felt a little drained afterwards, so decided to finish off my AI chatroom. The codebase is now in a state where I'm reasonably happy with it -- it's not production-grade code by any stretch of the imagination, but the structure is acceptable, and it has the basic functionality I wanted:

[ Read more ]


An AI chatroom (beginnings)

Posted on 29 December 2024 in Python, AI

So, I know that I decided I would follow a "no side quests" rule while reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", but rules are made to be broken.

I've started building a simple Telegram bot that can be used to chat with multiple AI models at the same time, the goal being to allow them to have limited interaction with each other. I'm not sure if it's going to work well, and it's very much a work-in-progress -- but here's the repo.

More info below the fold.

[ Read more ]


Writing an LLM from scratch, part 3

Posted on 26 December 2024 in AI, Python, LLM from scratch, TIL deep dives

I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and posting about what I found interesting every day that I read some of it.

Here's a link to the previous post in this series.

Today I was working through the second half of Chapter 2, "Working with text data", which I'd started just before Christmas. Only two days off, so it was reasonably fresh in my mind :-)

[ Read more ]


Writing an LLM from scratch, part 2

Posted on 23 December 2024 in AI, Python, LLM from scratch, TIL deep dives

I'm reading Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and planning to post every day (or at least, every day I read some of it -- Christmas day I suspect I'll not be posting) with notes on what I found interesting.

Here's a link to the previous post in this series.

I had been planning to do a chapter a day, but that is looking optimistic for such a dense book! So today, I've read the first half or so of Chapter 2, "Working with text data". This gives an overview of the pre-processing that happens to text before it hits the LLM, goes on to describe a simple tokenization system (complete with source code), and then briefly covers the byte pair encoding method that we'll actually be using for the LLM.

[ Read more ]


Messing around with fine-tuning LLMs, part 10 -- finally training the model!

Posted on 22 December 2024 in AI, Python, Fine-tuning LLMs, TIL deep dives, Hugging Face

For many months now, I've intermittently been working on building code to fine-tune an 8B model -- specifically, the Llama 3 8B base model -- on the openassistant-guanaco dataset, without using tricks like quantization or LoRA. I've been taking my time and letting myself be diverted by anything that looked interesting along the way, because the goal was to learn as much as possible about how this stuff works rather than to achieve a simple goal.

But all good things must come to an end. In this post I'll document the final steps of fine-tuning the model and getting it posted on Hugging Face.

Just to summarise what's happened so far, I have:

There are still a bunch of open questions left from those earlier posts, and I'm planning to do a summary post listing what those are (and, indeed, what I've learned). But I figured it really was time to wrap up the final step. Check it out:

$ python chat_with_model.py gpjt/Meta-Llama-3-8B-openassistant-guanaco-llama2-format
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|============================================================================================================| 4/4 [00:03<00:00,  1.31it/s]
You: Please tell me about Leonardo da Vinci.
Bot:  Sure, Leonardo da Vinci was an Italian polymath and artist who lived in the
late 15th and early 16th centuries. He is considered one of the greatest minds of
the Renaissance era, and is remembered as one of the most influential artists of
all time. Leonardo was a master of many fields, including painting, sculpting,
architecture, engineering, and science. He is best known for his paintings, such
as the Mona Lisa and The Last Supper, and for his inventions, such as the flying
machine and the helicopter. Leonardo was also a skilled mathematician, and made
important contributions to fields such as cryptography and astronomy. He died at
the age of 67, without ever fully completing any of his many projects.
251 tokens in 6.11s: 41.10 tokens/s
You: Interesting!  Could you expand on what the Mona Lisa is?
Bot:   The Mona Lisa is a painting by Leonardo da Vinci that depicts a portrait
of a woman from the Renaissance era. It is one of the most famous paintings in
the world, and is considered one of the most recognizable pieces of art ever
created. The painting has been the subject of numerous studies and
interpretations, and is often used as an example of the aesthetic and
intellectual values of the Renaissance period.
79 tokens in 2.01s: 39.40 tokens/s
You: Where is it?  I would really like to see it.
Bot:   The Mona Lisa is located in the Louvre Museum in Paris, France. It is one
of the most visited museums in the world, and is a must-see destination for
anyone interested in art and culture. The Louvre Museum is open to the public,
and admission is free on the first Sunday of every month. Tickets can also be
purchased online or at the museum entrance. There are also many other museums and
attractions in Paris, such as the Eiffel Tower and the Notre Dame Cathedral, that
are worth visiting while you are in the city.
112 tokens in 2.81s: 39.90 tokens/s
You: Sounds cool!  What food should I try while I'm there?
Bot:   While you are in Paris, you should definitely try some of the delicious
French food! Some must-see dining spots are:

- Le Relais de l'Entrecôte: a classic French bistro where you can enjoy a delicious steak with a side of French fries.
- L'Ami Jean: a hip and trendy restaurant that serves up delicious French food with a modern twist.
- Le Bistrot Paul Bert: a charming bistro that serves up classic French dishes with a focus on fresh, local ingredients.
- L'Assiette: a dining room that serves up delicious French food in a beautiful dining room.
- L'Ami Louis: a classic French restaurant that has been serving up delicious food for over 100 years.

I hope you enjoy your trip to Paris, and that you have a delicious dining experience
while you are there!
174 tokens in 4.38s: 39.70 tokens/s

And so on. The script is here, and you can run it against any model that accepts the Llama 2 prompt format -- see the command I used to start it above. There's an optional --debug flag so that you can see everything sent to and from the model.

So, about this one: it can't count the "r"s in strawberry, but it's a working assistant bot! Success :-)

Let's dig in to how it was fine-tuned.

[ Read more ]


Messing around with fine-tuning LLMs, part 9 -- gradient checkpointing

Posted on 3 September 2024 in AI, Python, Fine-tuning LLMs, TIL deep dives, Hugging Face

This is the 9th installment in my ongoing investigations into fine-tuning LLM models. My goal is to fine-tune an 8B model -- specifically, the Llama 3 8B base model -- on the openassistant-guanaco dataset, without using tricks like quantization or LoRA. I'm doing this as a way to try to understand how to do full-on multi-GPU training of a model that cannot be trained on just one GPU -- so I'm taking it super-slowly and stopping and measuring everything along the way, which means that I'm learning a ton of new stuff pretty effectively.

So far, I've:

I'll look into Liger in more depth in a future post, because it looks incredibly useful, but first I wanted to investigate something that I learned about as a result of my original post about it. I posted my results on X, and Byron Hsu (who's building Liger at LinkedIn) thought it was weird that I was only able to squeeze a batch size of two (without Liger) or three (with it) into an 8x A100 80 GiB machine. In the Liger GitHub repo, they have an example of the kind of memory improvements the new kernels can provide; it shows without-Liger memory usages of (roughly) 55 GiB at a batch size of 32, 67 GiB at 48, and an OOM with 64. Now, they're using a sequence length of 512 rather than the 2048 I've been using, and that would have an effect, but not enough to allow batches that were sixteen times larger -- especially because their benchmarks were being run on a machine with four A100 80 GiB cards, not eight.
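As a back-of-the-envelope check on those Liger numbers, here's a tiny sketch that assumes (and it is only an assumption) that memory grows linearly with batch size. The two successful data points then give a per-sample cost and a fixed overhead, which can be extrapolated to the batch size that ran out of memory. The `linear_fit_two_points` helper is just a name made up for this example, and the inputs are the rough figures quoted above.

```python
def linear_fit_two_points(p1, p2):
    """Return (slope, intercept) of the straight line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept

# (batch size, approximate memory in GiB) without Liger, as quoted above.
slope, intercept = linear_fit_two_points((32, 55.0), (48, 67.0))

# Extrapolate to the batch size that hit an out-of-memory error.
predicted_at_64 = intercept + slope * 64

print(f"{slope:.2f} GiB per sample, {intercept:.1f} GiB fixed")
print(f"predicted at batch size 64: {predicted_at_64:.1f} GiB")
```

Under that linear assumption it comes out at 0.75 GiB per extra sample on top of about 31 GiB of fixed cost, and roughly 79 GiB at a batch size of 64. That's right at the limit of an 80 GiB card, so an OOM there looks plausible once other overheads are counted.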

Byron and I had a quick chat just to see if there was anything obviously dumb going on in my configuration, and one thing that stood out to him was that I wasn't using gradient checkpointing (which the Liger example is doing). That was something I'd very briefly looked into back in my earliest experiments into tuning the 8B model; I was following a Hugging Face guide to what to do if you hit memory problems. In their guide to DeepSpeed, they write:

A general process you can use is (start with batch size of 1):

  1. enable gradient checkpointing
  2. try ZeRO-2
  3. try ZeRO-2 and offload the optimizer
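For reference, steps 2 and 3 of that list map onto the `zero_optimization` section of a DeepSpeed configuration roughly as sketched below. These are illustrative fragments only, not the configuration files used in these experiments, and a real config needs further settings (batch sizes, precision and so on). Gradient checkpointing (step 1) isn't part of this section; with the Hugging Face Trainer it's normally switched on separately, for example with `gradient_checkpointing=True` in `TrainingArguments`.

```python
import json

# Step 2: ZeRO stage 2 (optimizer state and gradients sharded across GPUs).
zero2 = {
    "zero_optimization": {
        "stage": 2,
    },
}

# Step 3: ZeRO stage 2, plus offloading the optimizer state to CPU memory.
zero2_offload = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
}

# DeepSpeed configs are usually stored as JSON files; print one to see the shape.
print(json.dumps(zero2_offload, indent=2))
```

The offload trades GPU memory for speed, because the optimizer state has to move between CPU and GPU memory.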

I had tried running my tune with both gradient checkpointing and ZeRO-2 enabled, but it blew up at iteration 24 (my first indication that there was something that kicked in at that point that increased memory usage), so I'd moved straight on to the optimizer offload.

At that point I was using instances with 8x A100 40 GiB. Since then, I'd switched to using 80 GiB per GPU machines, and done various tests comparing performance:

But what would happen if I just tried ZeRO-3 with no optimizer offload, with and without gradient checkpointing? That really sounded worth a look.

So, while I finished off my last post by saying

I think I'm finally in a place where I can do what should be the last experiment in this series: a full fine-tune of the Llama 3 8B base model -- on the openassistant-guanaco dataset, without using tricks like quantization or LoRA.

...it now looks like there's something well worth investigating first: gradient checkpointing.

[ Read more ]


Messing around with fine-tuning LLMs, part 8 -- detailed memory usage across batch sizes

Posted on 25 August 2024 in AI, Python, Fine-tuning LLMs, TIL deep dives, Hugging Face

This is the 8th installment in a mammoth project that I've been plugging away at since April. My goal is to fine-tune an 8B model -- specifically, the Llama 3 8B base model -- on the openassistant-guanaco dataset, without using tricks like quantization or LoRA. I'm doing this as a way to try to understand how to do full-on multi-GPU training of a model that cannot be trained on just one GPU -- so I'm taking it super-slowly and stopping and measuring everything along the way.

So far, I've:

I'd reached the conclusion that the only safe way to find out how much memory a fine-tune of either of the models I was working with would use was just to try it. The memory usage was important for three reasons -- firstly, whether the model could be trained at all on hardware I had easy access to, secondly, if it could be trained, whether I'd need to offload the optimizer (which had a serious performance impact), and thirdly what the batch size would be -- larger batches mean much better training speed.

This time around I wanted to work out how much of an impact the batch size would have -- how does it affect memory usage and speed? I had the feeling that it was essentially linear, but I wanted to see if that really was the case.

Here's what I found.

[ Read more ]