Messing around with fine-tuning LLMs, part 6 -- measuring memory usage more systematically

Posted on 10 July 2024 in Programming, Python, AI

My goal is to fine-tune an 8B model -- specifically, the Llama 3 8B base model -- on the openassistant-guanaco dataset, without using tricks like quantization or LoRA. I'm doing this as a way to try to understand how to do full-on multi-GPU training of a model that cannot be trained on just one GPU.

I've been building up to this goal gradually; so far, I've:

The experiments I did last time around were to find out why, when the DeepSpeed estimate_zero3_model_states_mem_needs_all_live function said that I would need just less than 18 GiB of VRAM per GPU to train the 8B model without offloading anything, in reality I needed 40 GiB and still had to offload the optimizer.

At the end of the experiments, I'd found:

  • At least part of the problem with the estimation function was that it did not take account of the sequence length being used for the training. In my very first post about fine-tuning, I'd found that the longer the sequence length, the more VRAM needed to tune (which makes perfect sense). My guess is that this is because the function is not designed for LLMs, but is rather for fixed-input models where the memory usage is more stable.
  • The memory usage for PyTorch is classified two ways: the "allocated" memory, which is actually in use for tensors, and the "reserved" memory, which is the allocated memory plus -- at least, from what I could tell from the docs at the time -- whatever is used for caches.
  • With a very short sequence length -- I had tested with it set to 10 -- the allocated memory in the train was closer to the results from the estimation function: in the case of the 0.5B model I was testing with locally, the function returned 8 GiB and the allocated VRAM was about 10 GiB.
  • Some extra memory above the allocated amount was needed for training; my take on that was that caches were (understandably) important.
  • However, it was possible to reduce the amount of reserved memory beyond the allocated (and to tell PyTorch to keep going even if it didn't have as much cache space as it wanted) if you set an environment variable:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

This time around I wanted to take a more systematic look at the effects of the sequence length and of that environment variable on memory usage and training speed. I'd previously been assuming that VRAM usage would vary linearly with sequence length, but I had no evidence for that. And while it looked like training speed decreased with increasing sequence length, I didn't have any hard numbers. Time to fix that hole in my knowledge!

The first step: do some careful measurements of those numbers on the 0.5B model locally. That's what this post is about -- the next one will be for the 8B model running on Lambda Labs.

[ Read more ]

Messing around with fine-tuning LLMs, part 5 -- memory optimization

Posted on 5 July 2024 in Programming, Python, AI

My goal is to fine-tune an 8B model -- specifically, the Llama 3 8B base model -- on the openassistant-guanaco dataset, without using tricks like quantization or LoRA. I'm doing this as a way to try to understand how to do full-on multi-GPU training of a model that cannot be trained on just one GPU.

I've been building up to this goal gradually; so far, I've:

This time around, I wanted to find out why I had to offload the optimizer, because it didn't seem like it should be necessary. Hugging Face helpfully document a DeepSpeed function that you can call to estimate the VRAM requirements for training a model with ZeRO, and when I ran it against the 8B model, I got this:

(fine-tune) ubuntu@130-61-28-84:~/fine-tune-2024-04$ python -c 'from transformers import AutoModel; \
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
model = AutoModel.from_pretrained("meta-llama/Meta-Llama-3-8B"); \
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)'
[2024-05-17 23:19:31,667] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
Loading checkpoint shards: 100%|============================================================================================================| 4/4 [00:02<00:00,  1.61it/s]
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 8 GPUs per node.
SW: Model with 7504M total params, 525M largest layer params.
  per CPU  |  per GPU |   Options
  188.72GB |   1.96GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  335.50GB |   1.96GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  167.75GB |   3.70GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  335.50GB |   3.70GB | offload_param=none, offload_optimizer=cpu , zero_init=0
   23.48GB |  17.68GB | offload_param=none, offload_optimizer=none, zero_init=1
  335.50GB |  17.68GB | offload_param=none, offload_optimizer=none, zero_init=0

It was saying that I only needed 17.68 GiB VRAM per GPU with no optimizer offload -- but I had needed to offload it even though I had 40 GiB per GPU. Why was that? What was I doing wrong? The documents that mention that function also say:

these are just the memory requirements for the parameters, optimizer states and gradients, and you'll need a bit more for the CUDA kernels and activations

...but more than 22 GiB extra is more than "a bit more".

Digging into this took an embarrassing amount of time -- I started work on it shortly after publishing my last post in this series, so that's been more than a month! And it's embarrassing that I took so long because the reason why I should not trust the number reported by that script was staring me in the face from the start, and involved something I'd discovered in my first explorations into this stuff.

Still, I learned a lot over the course of these investigations, so I think it's worth showing at least some of the journey. The post below is a distilled version of my lab notes and is a little rambling, but you might find it interesting if you're also digging into memory usage during LLM training as a beginner. If not, and you're looking for more carefully planned experiments and results, hopefully the next post in this series will have more of those :-)

Let's get going.

[ Read more ]

Messing around with fine-tuning LLMs, part 4 -- training cross-GPU.

Posted on 21 May 2024 in Programming, Python, AI

My goal is to fine-tune an 8B model -- specifically, the Llama 3 8B base model -- on the openassistant-guanaco dataset. I'm doing this as a way to try to understand how to do full-on multi-GPU training of a model that literally cannot be trained on just one GPU. I've been building up to this goal gradually; so far, I've:

In that last step, I'd found a very useful page in the Hugging Face documentation. It split multi-GPU situations into three categories:

  1. Your model fits onto on a GPU.
  2. Your model doesn't fit onto a GPU (but the layers taken individually do).
  3. The largest layer in your model is so big that it doesn't fit onto a GPU.

I'd interpreted that first point as "you can load the model onto just one GPU" -- that is, you can run inference on it because all of the parameters fit there (with some overhead for the data, activations, etc). However, my experiences showed that it meant "you can train the model on one GPU", which takes up significantly more VRAM than inference does. The suggested approaches they had for that category were all about having the model loaded and training on each GPU, which is good for speeding up training by training on multiple batches simultaneously, but doesn't help if you want multiple GPUs simply because you can't train the model on one GPU alone.

So my goal this time was to change my training strategy to use a technique that allowed the training of the entire model to be split across GPUs. Here's what I did.

[ Read more ]

Messing around with fine-tuning LLMs, part 3 -- moar GPUs

Posted on 15 May 2024 in Programming, Python, AI

Having fine-tuned a 0.5B model on my own machine, I wanted to try the same kind of tuning, but with an 8B model. I would need to do that on a cloud provider, because my own machine with its RTX 3090 would definitely not be up to the job, and I'd tried out Lambda Labs and found that it worked pretty well.

Importantly, I was going to need to train with multiple GPUs in order to do this. The Lambda Labs options open to me last time around were:

  • 1x H100 (80GiB PCIe) at $2.49/hour
  • 1x A10 (24GiB PCIe) at $0.75/hour
  • 1x A100 (40GiB SXM4) at $1.29/hour
  • 8x A100 (40GiB SXM4) at $10.32/hour

The last of those looked like it should be ample for my needs, which I'd estimated at being 160GiB VRAM (vs 320GiB on that monster machine). But before trying to use a multi-GPU setup to fine-tune a model I'd never worked with before, I figured it would be a good idea to try with the model that I've been using to date. The plan was:

  • Spin up a 1x A100 and run the same fine-tune as I've been doing on it.
  • If all was well, try again with the 8x A100 instance and start to learn about multi-GPU training.

Then once that was all done, I'd hopefully be in a good position to try the fine-tune for the bigger model.

Here's how it went.

[ Read more ]

Messing around with fine-tuning LLMs, part 2 -- to the cloud!

Posted on 28 April 2024 in Programming, Python, AI

Having fine-tuned a 0.5B model on my own machine, I wanted to try the same kind of tuning, but with an 8B model. My experiments suggested to me that the VRAM required for the tuning was roughly linear with two meta-parameters -- the length of the samples and the batch size -- and I'd found resources online that suggested that it was also linear with the number of parameters.

The 16x scale going from 0.5B parameters to 8B would suggest that I would need 16x24GiB to run this fine-tune, which would be 384GiB. However, the chart I'd seen before suggested I could do it with a bit more than 160GiB -- that being the number they gave for a 7B parameter model.

What I clearly needed to do was find a decent-looking cloud GPU platform where I could start with a smaller machine and easily switch over to a larger one if it wasn't sufficient. Here are my first steps, running one of my existing fine-tune notebooks on a cloud provider.

[ Read more ]

Messing around with fine-tuning LLMs

Posted on 27 April 2024 in Programming, Python, AI

Fine-tuning an LLM is how you take a base model and turn it into something that can actually do something useful. Base models are LLMs that have been trained to learn to predict the next word on vast amounts of text, and they're really interesting to play with, but you can't really have a conversation with one. When you ask them to complete some text, they don't know whether you want to complete it as part of a novel, a technical article, or an unhinged tweetstorm. (The obvious joke about which type of people the same applies to is left as an exercise for the reader.)

Chat-like AIs like ChatGPT become possible when a base model has been fine-tuned on lots of texts representing transcriptions (real or fake) of conversations, so that they specialise in looking at texts like this:

Human: Hello!

Bot: Hello, I'm a helpful bot.  What can I do for you today?

Human: What's the capital city of France?

Bot:

...and can work out that the next word should be something like "The", and then "capital", and so on to complete the sentence: "of France is Paris. Is there anything else I can help you with?"

Getting a solid intuition for how this all works felt like an interesting thing to do, and here are my lab notes on the first steps.

[ Read more ]

LLM Quantisation Weirdness

Posted on 27 February 2024 in Programming, Python, AI

I bought myself an Nvidia RTX 3090 for Christmas to play around with local AI models. Serious work needs larger, more powerful cards, and it's easy (and not that expensive) to rent such cards by the minute from the likes of Paperspace. But the way I see it, I'm not going to be doing any serious work -- and what I really want to do is be able to run little experiments quickly and easily without worrying about spinning up a machine, getting stuff onto it, and so on.

One experiment that I tried the other day was to try to get a mental model of how model size and quantisation affect the quality of responses from LLMs. Quantisation is the process of running a model that has, say, 16 bits for each of its parameters with the parameters clipped to eight bits, four bits, or even less -- people have found that it often has a surprisingly small effect on output quality, and I wanted to play with that. Nothing serious or in-depth -- just trying stuff out with different model sizes and quantisations, and running a few prompts through them to see how the outputs differed.

I was comparing three sizes of the Code Llama HF model, with different quantisations:

Code Llama is a model from Meta, and the HF version ("Human Feedback") is designed to receive questions about programming (with specific formatting), and to reply with code. I chose those particular quantisations because the 13b model wouldn't fit in the 3090's 24GiB RAM without quantisation to a least 8-bit, and the 34b model would only fit if it was 4-bit quantised.

The quality of the response to my test question was not too bad with any of these, apart from codellama/CodeLlama-34b-Instruct-hf in 4-bit, which was often (but not always) heavily glitched with missing tokens -- that is, it was worse than codellama/CodeLlama-7b-Instruct-hf in 4-bit. That surprised me!

I was expecting quantisation to worsen the results, but not to make a larger model worse than a smaller one at the same level of quantisation. I've put a repo up on GitHub to see if anyone can repro these results, and to find out if anyone has any idea why it's happening.

[ Read more ]

Giving up on the AI chatbot tutorial (for now)

Posted on 27 February 2024 in Programming, Python, AI

I'm a big fan of learning in public, and early last year I started trying to do that by writing an AI chatbot tutorial as I learned the technology myself. But somehow it just wasn't working -- perhaps because my understanding was evolving so quickly that each time I sat down to write, I spotted dozens of errors in the previous posts, and felt I should fix those first. So I've decided to give up on that one, at least for now.

So, back to something a bit more achievable! Some lab notes will be coming on things I've been working on, including -- later on this evening -- a post about an oddity I found the other day.

In the meantime, here's a blog post I did for PythonAnywhere late last year: Five steps to create your own PythonAnywhere AI guru, on PythonAnywhere.

Building an AI chatbot for beginners: part 2

Posted on 4 April 2023 in Programming, Python, AI

[Note that this series kind of dried up; when I started the series, I knew that I knew very little about the subject, but I was hoping to learn better by learning in public. However, as time went by it turned out that this wasn't working. There are a lot of better tutorials out there!]

Welcome to the second part of my tutorial on how to build a chatbot using OpenAI's interface to their Large Language Models (LLMs)! You can read the introduction here, and the first part here. As a reminder, I'm writing this not because I'm an expert, but because I'm learning how to do it myself, and writing about it helps me learn faster. Caveat lector :-)

In this post, we'll give the bot some memory of the conversation so far.

At the end of the first part, we had a program that would accept input from a user, combine it with some static text to make a prompt that an LLM would complete in the character of a chatbot (stopping at the point that the chatbot should stop, and not trying to carry on the conversation), then send it to OpenAI's API specifying an LLM model, and print out the result.

[ Read more ]

Building an AI chatbot for beginners: part 1

Posted on 19 March 2023 in Programming, Python, AI

[Note that this series kind of dried up; when I started the series, I knew that I knew very little about the subject, but I was hoping to learn better by learning in public. However, as time went by it turned out that this wasn't working. There are a lot of better tutorials out there!]

Welcome to the first part of my tutorial on how to build a chatbot using OpenAI's interface to their Large Language Models (LLMs)! You can read the introduction here.

If you're reading this and want to get the best out of it, I strongly recommend that you run the code on your own machine as you go along: trust me, it will stick in your mind much better if you do that.

The goal in this post is to write a basic bot script that accepts user input, and just bounces it off an OpenAI LLM to generate a response.

[ Read more ]