Getting a custom PyTorch LLM onto the Hugging Face Hub (Transformers: AutoModel, pipeline, and Trainer)

Posted on 28 January 2026 in AI, Hugging Face, TIL deep dives, Python, PyTorch

I spent some time recently getting some models uploaded onto the Hugging Face Hub. I'd trained a bunch of GPT-2-small-sized base models from scratch as part of my LLM from scratch series, and wanted to share them with anyone who was interested. I managed to get it done, but it was kind of tricky to get right.

The Hugging Face documentation is great if you're using the built-in models, but the coverage of custom architectures is... not quite as comprehensive. There are scattered examples, but they're all a bit vague and there's nothing really bringing them all together. But with what I could find, plus a lot of running things repeatedly, seeing how they failed, tweaking changes, banging my head against obscure stacktraces, and talking to various LLMs, I got there in the end.

This post is the tutorial I wish I'd found before I started, and I hope it's useful for people in a similar position. The one warning I'd give is that I did not dig into tokenisers in any depth. My own models use the standard GPT-2 one, and so I could just use the version that is built into Transformers. The setup you need to do with custom tokenisers doesn't look all that different to what you need to do for custom models, but as I haven't spent lots of time looking into it, I won't try to write a tutorial for something I've not done :-)

Firstly, why would you want to upload a model you've trained to Hugging Face? Well, let's say you've written and trained your own LLM -- you're learning how they work, or you've got a brilliant idea about how to tweak transformers to get that one step closer to AGI using the old gaming PC in your basement. You have some PyTorch code and a bunch of weights. How do you share it?

You could, of course, just dump the code on GitHub and share the weights somewhere. If people want to play with your model, they just need to download everything, install the dependencies, and then write code to load the weights and talk to your LLM -- run inference, fine-tune it, and so on.

That's quite a big "just", though. Not everyone who is going to want to look at your model will have the relatively deep knowledge required to do all of that. Speaking for myself, I spent quite some time fine-tuning and running inference on models long before I knew how the internals worked. I was able to do this because of the easy-to-use abstraction layer in Hugging Face's Transformers library, using models that had been uploaded to their hub.

What would be nice is to share the model within the Hugging Face ecosystem in a way that works smoothly, so that people can run inference on it like this:

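from transformers import pipeline
# trust_remote_code=True lets Transformers run the custom model code stored in the Hub repo
pipe = pipeline(task="text-generation", model="some-hf-user/some-model-name", trust_remote_code=True)
# Sample up to 20 new tokens that continue the prompt
out = pipe(
    "Every effort moves you",
    max_new_tokens=20,
    do_sample=True,
    temperature=1.4,
    top_k=25,
)
# The pipeline returns a list with one dict per generated sequence
print(out[0]["generated_text"])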

...rather than something daunting like this code with its 24 lines just to sample a few tokens from the model. Or to train it using code like what you see in this notebook -- a bit of config then trainer.train -- rather than like this, with its >100-line train function.

Here's what I had to do to get it working.

[ Read more ]


Writing an LLM from scratch, part 31 -- the models are now on Hugging Face

Posted on 17 January 2026 in AI, LLM from scratch, TIL deep dives, Python, Hugging Face, PyTorch

As part of my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", I've trained seven base models completely from scratch based on the book's GPT-2 code -- three locally, and four in the cloud. I plan to train more as I work on ways to improve the quality of the trained models, in the hope that I can get to something closer to the original OpenAI weights' loss on my own hardware, or at least on something I can rent without breaking the bank.

It makes sense to share these models somewhere, both so that other people can take a look if they like, and also to build the knowledge of how to do it so that if I produce something more interesting in the future, I'll know how to share that too.

Raschka's code is all released under the Apache v2 open source license, so I can share my stuff under the same license without worrying about triggering any legal issues. So: I've put all of the models I've trained so far on Hugging Face under that license, and made them reasonably HF-native (I'll explain what I mean by that later).

From the post where I trained the models locally, we have:

Then, from the post where I trained on a bunch of different kinds of machines on Lambda Labs, four models (with two checkpoints from one of them):

You can see how they compare on my evals at the bottom of this post.

I wanted to make them all usable within the Hugging Face ecosystem -- that is, I didn't want to just dump a bunch of weights and code into repos there, but rather to have something that someone coming to them without much context could make sense of. Let's dig into that.

[ Read more ]


Writing an LLM from scratch, part 30 -- digging into the LLM-as-a-judge results

Posted on 9 January 2026 in AI, LLM from scratch, TIL deep dives

I'm still working on my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Last time around, I trained four base models, using the GPT-2 architecture from the book, on Lambda Labs machines. I was using two ways to compare them with each other, with three models that I'd trained locally, and with the original GPT-2 weights from OpenAI:

  1. A simple cross entropy loss over a fixed test set.
  2. The results for an instruction fine-tune test that's covered in the book.
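For readers who have not met the first metric before, here is a minimal pure-Python sketch of per-token cross entropy loss. It is an illustration only, not the evaluation code used for these models; the function names and the toy logits are made up for the example.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target index under softmax(logits)."""
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def mean_loss(examples):
    """Average per-token loss over (logits, target) pairs."""
    return sum(cross_entropy(logits, target) for logits, target in examples) / len(examples)

# A model that spreads its belief evenly over 4 tokens scores log(4),
# whichever token turns out to be correct.
uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)

# exp(loss) is the perplexity: a test loss of 3.500 is like choosing
# uniformly among roughly 33 tokens at every position.
perplexity_at_3_5 = math.exp(3.500)

print(f"uniform loss: {uniform:.4f}, perplexity at 3.500: {perplexity_at_3_5:.1f}")
```

Lower is better: a lower loss means the model put more probability on the tokens that actually appeared in the test set.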

Here were the results I got, sorted by the loss:

Model                              Test loss   IFT score
OpenAI weights: medium             3.231       38.53
OpenAI weights: small              3.500       22.98
Cloud FineWeb, 8x A100 40 GiB      3.674       17.09
Cloud FineWeb, 8x H100 80 GiB      3.725       11.98
Cloud FineWeb, 8x A100 80 GiB      3.730       11.71
Cloud FineWeb, 8x B200 160 GiB     3.771       13.89
Local FineWeb train                3.944       16.01
Local FineWeb-Edu extended train   4.135       14.55
Local FineWeb-Edu train            4.167       16.86

Now, you'd expect there to be at least a loose correlation; the lower the loss, the higher the IFT score. But, while we can see a difference between the OpenAI weights and our own, within our own there doesn't seem to be a logical pattern.
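One way to put a number on that impression is a Spearman rank correlation between test loss and IFT score. The sketch below is an illustrative check on the figures from the table, not the analysis used in the post; the choice of Spearman and the helper names are assumptions made for the example.

```python
# (test loss, IFT score) pairs copied from the results table.
openai = [(3.231, 38.53), (3.500, 22.98)]
own = [
    (3.674, 17.09), (3.725, 11.98), (3.730, 11.71), (3.771, 13.89),
    (3.944, 16.01), (4.135, 14.55), (4.167, 16.86),
]

def ranks(values):
    """Rank values from 1 (smallest) upwards; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(pairs):
    """Spearman rank correlation, using the shortcut formula for tie-free data."""
    n = len(pairs)
    loss_ranks = ranks([loss for loss, _ in pairs])
    score_ranks = ranks([score for _, score in pairs])
    d_squared = sum((a - b) ** 2 for a, b in zip(loss_ranks, score_ranks))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

rho_all = spearman(openai + own)
rho_own = spearman(own)
print(f"all nine models: {rho_all:+.2f}")  # -0.45
print(f"our seven only:  {rho_own:+.2f}")  # +0.18
```

With the OpenAI weights included, the correlation is negative, which is the expected direction (lower loss, higher score). Restricted to the seven from-scratch models it comes out slightly positive, which is the wrong direction, and with only seven points it is far too weak to call a trend either way.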

I think that the problem is that the results from the GPT-5.1 LLM-as-a-judge are not consistent between models. That's not a complaint about the code or its original design, of course -- it was originally written as part of the LLM book as a way of doing a quick test on an instruction fine-tuned model that we'd spent the previous 238 pages writing -- just something that was a bit more efficient than reading hundreds of input/output pairs ourselves. It was never meant to be a tool to compare models in the way I'm using it now.

In this post I'll dig into why it doesn't work for this kind of thing, and see if that's something we can change.

[ Read more ]


Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud

Posted on 7 January 2026 in AI, LLM from scratch, TIL deep dives, Python, PyTorch

I'm carrying on with my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Having proven that I could train a GPT-2 small scale base model from scratch on my RTX 3090 in 48 hours, I wanted to try training it on a multi-GPU machine on Lambda Labs. There are two benefits I see in doing that:

  1. I can learn what you need to change in a simple single-GPU training loop to make it multi-GPU.
  2. If I can get the training time for a full base model down from 48 hours to something more manageable (and hopefully not too expensive), then I can try a few experiments to see how I can improve the quality of the trained model. I have a bunch of ideas about why my own base model wasn't as good as the original OpenAI one, and it would be good to know which (if any) of them are right.
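On the first point, the core idea behind data-parallel training is that each GPU computes gradients on its own slice of the batch, and those gradients are averaged across GPUs before the optimiser step. The toy below shows that idea in plain Python with a one-weight model. It is a conceptual sketch only, not PyTorch's DistributedDataParallel and not the training code from the post; every name in it is hypothetical.

```python
# Toy model: y = w * x, trained with mean squared error.

def grad_mse(w, batch):
    """d/dw of mean((w * x - y) ** 2) over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def shard(batch, world_size):
    """Give rank r every world_size-th example, starting at index r."""
    return [batch[rank::world_size] for rank in range(world_size)]

def data_parallel_grad(w, batch, world_size):
    """Each 'rank' computes a local gradient on its shard; the results are
    then averaged, which is the job an all-reduce does in a real multi-GPU run."""
    local_grads = [grad_mse(w, local) for local in shard(batch, world_size)]
    return sum(local_grads) / world_size

batch = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 examples, true w = 3
w = 0.5

single_gpu = grad_mse(w, batch)
four_gpus = data_parallel_grad(w, batch, world_size=4)
print(single_gpu, four_gpus)
```

Because every shard has the same number of examples, the averaged gradient matches the single-process gradient over the whole batch, so the optimiser step is the same; the work of computing it is just split four ways.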

In addition, I wanted to see if anything unexpected dropped out of it; after all, there were four different sizes of machines that I wanted to try, so I'd be doing four from-scratch trains on the same dataset. Does the machine size affect the quality of the model in some way?

Here's what happened. As with the last post, this is a set of tidied-up lab notes, so you can see the full journey. There's a lot to it! I was considering splitting it into multiple posts -- "writing the code", "building the datasets", "running the trains" -- but they're interleaved. Each train taught me something about how to structure the code to make it easier to use, so the code kept changing.

So I think it's worth documenting the process as it really was. If at some point I want to write a how-to document on porting single-GPU code to multi-GPU, I'll be able to mine this for resources, and in the meantime, hopefully this will be of use to readers -- even if it's just at the level of "I got this error message, how do I fix it?"

Anyway, once again I don't want to bury the lede, so: after spending US$215.16 on various trains on various servers, I was able to find that a reasonably cheap instance on Lambda Labs, with 8x A100 GPUs, each of which has 40 GiB of VRAM, is the sweet spot for this particular 163M-parameter, ~Chinchilla-optimal single-epoch run. Such an instance can train the model in less than four hours, its GPUs happen to be the right size for batches that minimise loss (more on that later), and it can do that train for about US$35, excluding validation.
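To get a feel for the scale of that run, here is a back-of-envelope sketch. It assumes the common rule of thumb of about 20 training tokens per parameter for "Chinchilla-optimal"; the exact token count used in the post may differ, and four hours is treated as an upper bound on the training time.

```python
PARAMS = 163e6          # from the post: 163M-parameter model
TOKENS_PER_PARAM = 20   # assumption: the usual Chinchilla rule of thumb
TRAIN_HOURS = 4         # upper bound from the post: "less than four hours"
NUM_GPUS = 8            # from the post: 8x A100

def token_budget(params, tokens_per_param=TOKENS_PER_PARAM):
    """Approximate number of training tokens for a compute-optimal run."""
    return params * tokens_per_param

def min_throughput(tokens, hours):
    """Tokens per second needed to get through `tokens` within `hours`."""
    return tokens / (hours * 3600)

tokens = token_budget(PARAMS)
total_tps = min_throughput(tokens, TRAIN_HOURS)
per_gpu_tps = total_tps / NUM_GPUS

print(f"{tokens / 1e9:.2f}B tokens")
print(f"{total_tps:,.0f} tokens/s across the machine")
print(f"{per_gpu_tps:,.0f} tokens/s per GPU")
```

Under those assumptions the budget is about 3.26 billion tokens, so finishing inside four hours needs at least roughly 226,000 tokens per second across the machine, or about 28,000 per GPU.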

If you'd like to read the gory details of what I did, then read on -- but if you prefer, you can jump straight to the results.

[ Read more ]