Automating starting Lambda Labs instances

Posted on 2 April 2026 in Microprojects, AI

I've been trying to get an 8x A100 instance on Lambda Labs to do a training run for my LLM from scratch series, but they're really busy at the moment, and it's rare to see one available.

Thanks to the wonders of agentic coding, I spent an hour today getting something up and running to help, which I've called lambda-manager. It has three commands:

  1. list-instance-types, which prints which kinds of instances are available.
  2. list-instance-type-descriptions, which prints out all of the possible instance types (available or not) with both their "friendly" names -- what you'd see on the website -- and the instance type names that the API uses.
  3. launch-when-available, which polls the API until it sees a specified type of instance, at which point it starts one and sends a Telegram message.
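
The polling loop behind a command like launch-when-available can be sketched as below. This is a minimal illustration, not lambda-manager's actual code: the function name and the three callables (get_available_types, launch, notify) are hypothetical stand-ins for the Lambda Labs API calls and the Telegram message, none of which are shown here.

```python
import time
from typing import Callable, Iterable


def launch_when_available(
    wanted_type: str,
    get_available_types: Callable[[], Iterable[str]],  # hypothetical: wraps the API query
    launch: Callable[[str], str],                      # hypothetical: starts an instance, returns its ID
    notify: Callable[[str], None],                     # hypothetical: e.g. sends a Telegram message
    poll_seconds: float = 60.0,                        # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until wanted_type shows up, then launch it and send a notification."""
    while True:
        if wanted_type in set(get_available_types()):
            instance_id = launch(wanted_type)
            notify(f"Launched {wanted_type}: {instance_id}")
            return instance_id
        # Nothing available yet: wait before asking again.
        sleep(poll_seconds)
```

Passing the sleep function in as a parameter is just a convenience so that the loop can be exercised with fakes, without waiting in real time.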

Let's see if that helps -- though it's been running for six hours now, with no luck...


Writing an LLM from scratch, part 32g -- Interventions: weight tying

Posted on 24 March 2026 in AI, LLM from scratch, TIL deep dives, Python

In Sebastian Raschka's book "Build a Large Language Model (from Scratch)", he writes that weight tying reduces a model's parameter count but, in his experience, makes the model worse. Apparently that is why people don't use it in modern LLMs. Intuitively, that makes sense -- I'll explain why in this post.

But as I'm trying various interventions to see if I can get my model -- based on Raschka's code, but trained for a fraction of the time that the original GPT-2 model was -- to perform as well as the original in terms of the loss it gets on a test set, I thought it would be worth seeing if it really is a negative for this particular tiny model of 163M parameters.

After all, the original weights use weight tying, and I did find that QKV bias appeared to help -- and that's another old-school technique that they used, which has since dropped out of fashion. Might this one help too?

Worth a try! Let's give it a go.
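
For reference, weight tying means that the token embedding and the output head share one weight matrix instead of each having its own. Here is a minimal sketch of the mechanism in PyTorch. TinyLM and its tiny dimensions are made up for illustration; this is not the book's GPT model, just the two layers that get tied.

```python
import torch
import torch.nn as nn


class TinyLM(nn.Module):
    """Toy model: only an embedding and an output head, to show the tying itself."""

    def __init__(self, vocab_size: int, emb_dim: int, tie_weights: bool):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)
        if tie_weights:
            # Both layers store a (vocab_size, emb_dim) matrix, so they can
            # share a single Parameter object.
            self.out_head.weight = self.tok_emb.weight

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.out_head(self.tok_emb(token_ids))


def count_params(model: nn.Module) -> int:
    # parameters() yields a shared Parameter only once
    return sum(p.numel() for p in model.parameters())


untied = TinyLM(vocab_size=100, emb_dim=16, tie_weights=False)
tied = TinyLM(vocab_size=100, emb_dim=16, tie_weights=True)
print(count_params(untied), count_params(tied))  # 3200 1600
```

The saving is one vocab_size x emb_dim matrix, which is why it matters most for small models with large vocabularies.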

[ Read more ]


Writing an LLM from scratch, part 32f -- Interventions: weight decay

Posted on 23 March 2026 in AI, LLM from scratch, TIL deep dives, Python

I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

My training code creates the optimiser like this:

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=0.0004, weight_decay=0.1
    )

In my last post I looked into the learning rate, the lr parameter in that code, and found a value for that, plus some extra code to schedule it -- that is, to vary it over time -- which gave better training results.

This time I want to go into the weight decay. What is it, what is it for, and is 0.1 really the best value?
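
As a quick way to see the parameter in isolation: in PyTorch's AdamW the decay is applied directly to the weights, multiplying them by (1 - lr * weight_decay) on every step, separately from the gradient-based update. The sketch below uses a made-up two-element parameter and an artificial all-zero gradient so that only the decay has any effect; it is a toy setup, not part of the real training loop.

```python
import torch

lr, weight_decay = 0.0004, 0.1  # the values from the optimiser code above
w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))  # toy parameter, for illustration only
optimizer = torch.optim.AdamW([w], lr=lr, weight_decay=weight_decay)

# Pretend the loss gradient is exactly zero, so that only the decay acts.
w.grad = torch.zeros(2)
optimizer.step()

# Each weight has shrunk towards zero by a factor of (1 - lr * weight_decay).
expected = torch.tensor([1.0, -2.0]) * (1 - lr * weight_decay)
print(w.data, expected)
```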

[ Read more ]


Writing an LLM from scratch, part 32e -- Interventions: the learning rate

Posted on 10 March 2026 in AI, LLM from scratch, TIL deep dives, Python

I'm still working on improving the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

My training code creates the optimiser like this:

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=0.0004, weight_decay=0.1
    )

The values in there -- 0.0004 for the learning rate, and 0.1 for the weight decay -- were just copied from the tiny training run that we do in section 5.2 of the book.

What do those values actually mean, and are those really the right values for them?

I felt I had a good handle on the learning rate, at least -- it's one of the first things you learn when you start looking at machine learning of any kind -- but how would you go about working out what the correct value for it was? On top of that, when I was reading the Chinchilla paper a while back, I noticed they repeatedly referred to a "cosine cycle" for the learning rate, which didn't fit into anything I'd learned about before.

The weight decay was pretty much an unknown for me -- I knew it was a parameter controlling the behaviour of the optimiser, but I didn't know how it did that.

In this post I want to look into the learning rate, and these mysterious cosines; I'll write a follow-up about the weight decay later.
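
To give a flavour of what a cosine schedule looks like, here is a minimal sketch in plain Python. The function name, the optional linear warmup, and the choice of a minimum rate one tenth of the peak are all illustrative assumptions, not necessarily what the Chinchilla paper or my own training code uses.

```python
import math


def cosine_lr(step, total_steps, peak_lr=0.0004, min_lr=0.00004, warmup_steps=0):
    """Linear warmup to peak_lr, then half a cosine wave down to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup run completed so far, capped at 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # cos goes from 1 to -1 as progress goes from 0 to 1, so the rate
    # slides smoothly from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 25, 50, 75, 100):
    print(step, cosine_lr(step, total_steps=100))
```

The rate starts at the peak, falls slowly at first, fastest around the middle of the run, and then flattens out as it approaches the minimum.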

[ Read more ]


Writing an LLM from scratch, part 32d -- Interventions: adding attention bias

Posted on 6 February 2026 in AI, LLM from scratch, TIL deep dives, Python

I'm still seeing what I can do to improve the test loss for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". This is the third intervention I'm trying: adding bias to the attention weight matrices.

In the code from the book, we have this:

    class MultiHeadAttention(nn.Module):

        def __init__(
            self,
            d_in, d_out,
            context_length,
            dropout,
            num_heads,
            qkv_bias=False
        ):
            ...

            self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
            self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
            self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

            ...

        def forward(self, x):
            ...

            keys = self.W_key(x)
            queries = self.W_query(x)
            values = self.W_value(x)

So: we initialise the weights Wq, Wk and Wv as linear layers rather than simple matrices of weights, and have a parameter qkv_bias to say whether or not we should add bias to those. In all of our training runs so far we've set that to False.

Why do we have this parameter, and where did it come from?
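
To make concrete what the flag changes: with bias=False, nn.Linear is a pure matrix multiplication, and with bias=True it also adds a learned vector with one entry per output feature. The sketch below counts the parameters in each case. The width of 768 is GPT-2 small's embedding dimension, used here only for illustration, and the count helper is a made-up convenience function.

```python
import torch
import torch.nn as nn

d_in, d_out = 768, 768  # GPT-2 small's embedding width, for illustration

without_bias = nn.Linear(d_in, d_out, bias=False)  # computes x @ W.T
with_bias = nn.Linear(d_in, d_out, bias=True)      # computes x @ W.T + b


def count(layer: nn.Module) -> int:
    return sum(p.numel() for p in layer.parameters())


print(count(without_bias), count(with_bias))  # 589824 590592
```

So the bias adds only d_out extra parameters per projection, a tiny fraction of the weight matrix itself.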

[ Read more ]


Writing an LLM from scratch, part 32c -- Interventions: removing dropout

Posted on 5 February 2026 in AI, LLM from scratch, TIL deep dives

This is the second in my series of attempts to improve the loss on my test dataset -- interventions, as I'm calling them -- for a from-scratch GPT-2 small base model, trained on code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

Last time around I saw what gradient clipping can do -- it improved loss over the baseline by 0.014, bringing it down from 3.692 to 3.678. Not much, but it's something!

This time, I wanted to see what happened if we trained without dropout. Would removing it make the test loss worse, or better?
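
As a reminder of what dropout does mechanically, here is a minimal sketch using PyTorch's nn.Dropout on a vector of ones. The drop rate of 0.1 and the input are illustrative assumptions, not values taken from my training runs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.1  # illustrative drop rate
dropout = nn.Dropout(p)
x = torch.ones(1000)

dropout.train()
y_train = dropout(x)  # roughly 10% of entries zeroed, the rest scaled up to 1 / (1 - p)

dropout.eval()
y_eval = dropout(x)   # identity: dropout is switched off in eval mode

print((y_train == 0).float().mean().item(), y_train.max().item())
```

Removing dropout from a model amounts to setting the rate to zero, so that training-time activations pass through unchanged, just as they do in eval mode.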

[ Read more ]


Writing an LLM from scratch, part 32b -- Interventions: gradient clipping

Posted on 5 February 2026 in AI, LLM from scratch, TIL deep dives, Python

I'm still working on training the best GPT-2 small sized base model that I can with a number of FLOPs roughly equal to two days on my own machine -- my "extra credit" exercise after having worked through Sebastian Raschka's book "Build a Large Language Model (from Scratch)".

In the last post I trained a baseline model -- one with the same architecture and almost the same training code as in the minimal training run in the book, just modified to run using DDP on an 8x A100 40 GiB/GPU machine in the cloud. There are a bunch of "interventions" I want to try to see if they'll make it better, as measured by the loss they get on a test set. I'll do a post for each intervention, and this is the first: gradient clipping.
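
For anyone who hasn't met it, gradient clipping by norm rescales the gradients whenever their combined L2 norm exceeds a threshold, keeping their direction but capping their size. A minimal sketch with PyTorch's clip_grad_norm_ follows; the toy parameter, the hand-set gradient and the max_norm of 1.0 are assumptions chosen to make the arithmetic easy to follow.

```python
import torch

w = torch.nn.Parameter(torch.zeros(2))  # toy parameter, for illustration only
w.grad = torch.tensor([3.0, 4.0])       # gradient with L2 norm 5

# Rescales the gradients in place so that their combined norm is at most
# max_norm, and returns the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)

print(norm_before.item(), w.grad.norm().item())
```

In a real training loop the call would sit between loss.backward() and optimizer.step().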

[ Read more ]


Writing an LLM from scratch, part 32a -- Interventions: training a baseline model

Posted on 4 February 2026 in AI, LLM from scratch, TIL deep dives, Python

I'm rounding out my series of posts on Sebastian Raschka's book "Build a Large Language Model (from Scratch)" by seeing how I could train the best base model I can from scratch on my own hardware. I started by training one in two days on my RTX 3090, and found that while it was a decent little model, it wasn't as good as the original GPT-2 small, either in terms of the loss it got on my test dataset, or in terms of how good it was at following instruction prompts after fine-tuning on them. I decided that I wanted to see what levers I could pull -- dropout, attention weight biases, and so on -- to make it better.

For that, I didn't want to have my PC tied up for days at a time with multiple long training runs, so I learned how to train faster in the cloud. That led to some refinements in the prompt-following test I was using, and I also spent a bit of time on a side quest getting the various models I'd trained onto Hugging Face Hub.

Now it's time to try the various "interventions", as I'll call them -- the levers to pull to see if I can make the model better. This post is to recap what they are, and to describe what I did to establish a baseline model to compare to.

[ Read more ]


Getting a custom PyTorch LLM onto the Hugging Face Hub (Transformers: AutoModel, pipeline, and Trainer)

Posted on 28 January 2026 in AI, Hugging Face, TIL deep dives, Python

I spent some time recently getting some models uploaded onto the Hugging Face Hub. I'd trained a bunch of GPT-2 small sized base models from scratch as part of my LLM from scratch series, and wanted to share them with anyone that was interested. I managed to get it done, but it was kind of tricky to get right.

The Hugging Face documentation is great if you're using the built-in models, but the coverage of custom architectures is... not quite as comprehensive. There are scattered examples, but they're all a bit vague and there's nothing really bringing them all together. But with what I could find, plus a lot of running things repeatedly, seeing how they failed, tweaking changes, banging my head against obscure stacktraces, and talking to various LLMs, I got there in the end.

This post is the tutorial I wish I'd found before I started, and I hope it's useful for people in a similar position. The one warning I'd give is that I did not dig into tokenisers in any depth. My own models use the standard GPT-2 one, and so I could just use the version that is built into Transformers. The setup you need to do with custom tokenisers doesn't look all that different to what you need to do for custom models, but as I haven't spent lots of time looking into it, I won't try to write a tutorial for something I've not done :-)

Firstly, why would you want to upload a model you've trained to Hugging Face? Well, let's say you've written and trained your own LLM -- you're learning how they work, or you've got a brilliant idea about how to tweak transformers to get that one step closer to AGI using the old gaming PC in your basement. You have some PyTorch code and a bunch of weights. How do you share it?

You could, of course, just dump the code on GitHub and share the weights somewhere. If people want to play with your model, they just need to download everything, install the dependencies, and then write code to load the weights and talk to your LLM -- run inference, fine-tune it, and so on.

That's quite a big "just", though. Not everyone who is going to want to look at your model will have the relatively deep knowledge required to do all of that. Speaking for myself, I spent quite some time fine-tuning and running inference on models long before I knew how the internals worked. I was able to do this because of the easy-to-use abstraction layer in Hugging Face's Transformers library, using models that had been uploaded to their hub.

What would be nice is to share the model within the Hugging Face ecosystem in a way that works smoothly. Let people run inference on it like this:

    from transformers import pipeline
    pipe = pipeline(task="text-generation", model="some-hf-user/some-model-name", trust_remote_code=True)
    out = pipe(
        "Every effort moves you",
        max_new_tokens=20,
        do_sample=True,
        temperature=1.4,
        top_k=25,
    )
    print(out[0]["generated_text"])

...rather than something daunting like this code, which needs 24 lines just to sample a few tokens from the model. Or let them train it using code like what you see in this notebook -- a bit of config then trainer.train -- rather than like this, with its train function of more than 100 lines.

Here's what I had to do to get it working.

[ Read more ]


Writing an LLM from scratch, part 31 -- the models are now on Hugging Face

Posted on 17 January 2026 in AI, LLM from scratch, TIL deep dives, Python, Hugging Face

As part of my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", I've trained seven base models completely from scratch based on the book's GPT-2 code -- three locally, and four in the cloud. I plan to train more as I work on ways to improve the quality of the trained models, in the hope that I can get to something closer to the original OpenAI weights' loss on my own hardware, or at least on something I can rent without breaking the bank.

It makes sense to share these models somewhere, both so that other people can take a look if they like, and also to build the knowledge of how to do it so that if I produce something more interesting in the future, I'll know how to share that too.

Raschka's code is all released under the Apache v2 open source license, so I can share my stuff under the same license without worrying about triggering any legal issues. So: I've put all of the models I've trained so far on Hugging Face under that license, and made them reasonably HF-native (I'll explain what I mean by that later).

From the post where I trained the models locally, we have:

Then, from the post where I trained on a bunch of different kinds of machines on Lambda Labs, four models (with two checkpoints from one of them):

You can see how they compare on my evals at the bottom of this post.

I wanted to make them all usable within the Hugging Face ecosystem -- that is, I didn't want to just dump a bunch of weights and code into repos there, but rather to have something that someone coming to them without much context could make sense of. Let's dig into that.

[ Read more ]