Writing an LLM from scratch, part 24 -- the transcript hack

Posted on 28 October 2025 in AI, LLM from scratch, TIL deep dives

Chapter 7 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)" explains how we fine-tune our LLM to follow instructions -- essentially turning a model that can do next-token completion for text generation into something we can use for a chatbot.

Back when I first started looking into LLMs, I used a setup that didn't require that, and got surprisingly good results, at least with later OpenAI models.

The trick was to present the text as something that made sense in the context of next-token prediction. Instead of just throwing something like this at the LLM:

User: Provide a synonym for 'bright'

Bot:

...you would instead preface it with an introductory paragraph, like this:

This is a transcript of a conversation between a helpful bot, 'Bot', and a human,
'User'.  The bot is very intelligent and always answers the human's questions
with a useful reply.

User: Provide a synonym for 'bright'

Bot:

Earlier OpenAI models couldn't do this when I accessed them through the API, but later ones could.

How does our GPT-2 model stack up with this kind of thing -- and for comparison, how about a newer, more sophisticated base (as in, not instruction fine-tuned) model?

Comparing the GPT-2 models

I wrote a simple script that just allowed me to test the transcript above against the different GPT-2 models, using the infrastructure that I'd already written for the code in the book. It uses the generate_text_simple function -- that is, the basic one that uses greedy decoding (always pick the most likely next token) -- and just generates the next 23 tokens.
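The core of the script looked something like this -- a rough sketch, assuming the chapter 5 pieces (GPTModel, generate_text_simple, the weight-loading code and the text/token-ID helpers) are importable from local modules; the module names below are placeholders for however you've organised the book's code:

import tiktoken

# Placeholder module names -- adjust to wherever you keep the book's code.
from gpt_download import download_and_load_gpt2
from previous_chapters import (
    GPTModel,
    generate_text_simple,
    load_weights_into_gpt,
    text_to_token_ids,
    token_ids_to_text,
)

PROMPT = (
    "This is a transcript of a conversation between a helpful bot, 'Bot', and a human,\n"
    "'User'.  The bot is very intelligent and always answers the human's questions\n"
    "with a useful reply.\n"
    "\n"
    "User: Provide a synonym for 'bright'\n"
    "\n"
    "Bot:"
)

# Standard GPT-2 architecture settings, as used in chapter 5.
BASE_CONFIG = {
    "vocab_size": 50257,
    "context_length": 1024,
    "drop_rate": 0.0,
    "qkv_bias": True,
}
MODEL_CONFIGS = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

CHOOSE_MODEL = "gpt2-small (124M)"
config = BASE_CONFIG | MODEL_CONFIGS[CHOOSE_MODEL]

# Download the OpenAI weights and load them into our model.
model_size = CHOOSE_MODEL.split(" ")[-1].strip("()")
settings, params = download_and_load_gpt2(model_size=model_size, models_dir="gpt2")
model = GPTModel(config)
load_weights_into_gpt(model, params)
model.eval()

# Greedily decode the next 23 tokens after the transcript prompt.
tokenizer = tiktoken.get_encoding("gpt2")
token_ids = generate_text_simple(
    model=model,
    idx=text_to_token_ids(PROMPT, tokenizer),
    max_new_tokens=23,
    context_size=config["context_length"],
)
print(token_ids_to_text(token_ids, tokenizer))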

Here's what I got with the different GPT-2 models:

gpt2-small (124M)

This is a transcript of a conversation between a helpful bot, 'Bot', and a human,
'User'.  The bot is very intelligent and always answers the human's questions
with a useful reply.

User: Provide a synonym for 'bright'

Bot:

User:

Bot:

User:

Bot:

User

It looks like it has the concept of a transcript, at least -- but it's not very useful.

gpt2-medium (355M)

This is a transcript of a conversation between a helpful bot, 'Bot', and a human,
'User'.  The bot is very intelligent and always answers the human's questions
with a useful reply.

User: Provide a synonym for 'bright'

Bot:  Bright is a synonym for bright.

User:  Bright is a synonym for bright.

Better in a way -- it's got some actual text in there.

gpt2-large (774M)

This is a transcript of a conversation between a helpful bot, 'Bot', and a human,
'User'.  The bot is very intelligent and always answers the human's questions
with a useful reply.

User: Provide a synonym for 'bright'

Bot: ಠ_ಠ

User: Provide a synonym for 'bright'

Bot:

Hmm. Still not looking good, and that dodgy Unicode isn't great (unless the bot was trying to say ^_^ ;-)

gpt2-xl (1558M)

This is a transcript of a conversation between a helpful bot, 'Bot', and a human,
'User'.  The bot is very intelligent and always answers the human's questions
with a useful reply.

User: Provide a synonym for 'bright'

Bot: ????

User: ????

Bot: ????

User: ????

Bot:

OK, still not getting there.

So it looks like the GPT-2 models can't do this. That actually makes a lot of sense -- at the time I wrote my post using this transcript hack, I found that the ada, babbage and curie models couldn't do it either, nor could earlier versions of davinci. Those were all versions of GPT-3, so one generation later than the model we're working with here. The first version that "got it" was text-davinci-003, which was GPT-3.5.

But wait!

Looking back at my blog post way back when, I spotted something about the text-davinci-003 model. At the time, the docs on the OpenAI site said that it can:

do any language task with better quality, longer output, and consistent instruction-following than the curie, babbage, or ada models

Did it perhaps have instruction-following built in -- that is, had it been instruction fine-tuned already?

That doesn't quite make sense, though, because instruction fine-tuning is normally done using a particular format. Imagine that a model was fine-tuned on chats that looked like this Guanaco-format dataset:

### Human: Hello there
### Assistant: Hello, how are you?
### Human: I'm fine, how about you?

...and then it was fed something more like this (which is from a Reddit post about the Llama 2 prompt format):

<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as
possible, while being safe.  Your answers should not include any harmful, unethical,
racist, sexist, toxic, dangerous, or illegal content. Please ensure that your
responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why
instead of answering something not correct. If you don't know the answer to a
question, please don't share false information.
<</SYS>>

Hello there [/INST]
Hello, how are you?
[INST] I'm fine, how about you? [/INST]

That is, at least intuitively, going to cause problems.

What happens if we take a look at a more modern model, but make sure it's a base one?

Qwen3-0.6B-Base

I've been playing with Qwen/Qwen3-0.6B-Base, so I wrote a script to run the same prompt through it. It was roughly along these lines -- a sketch using the Hugging Face transformers API, with generation settings that are plausible guesses rather than exactly what I used:
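import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen3-0.6B-Base"

# Same transcript prompt as before.
PROMPT = (
    "This is a transcript of a conversation between a helpful bot, 'Bot', and a human,\n"
    "'User'.  The bot is very intelligent and always answers the human's questions\n"
    "with a useful reply.\n"
    "\n"
    "User: Provide a synonym for 'bright'\n"
    "\n"
    "Bot:"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

inputs = tokenizer(PROMPT, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=80,   # enough for a few conversational turns (a guess)
        do_sample=False,     # greedy decoding, to match the GPT-2 runs above
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Here's what I got: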

This is a transcript of a conversation between a helpful bot, 'Bot', and a human,
'User'.  The bot is very intelligent and always answers the human's questions
with a useful reply.

User: Provide a synonym for 'bright'

Bot:  Brightness is a synonym for illumination, which is the process of shining
light into a space, such as a room or a room.

User: What's a synonym for 'bright'?

Bot: Brightness is a synonym for illumination, which is the process of shining
light into a space, such as a room or a room.

User: What's a synonym for 'bright'?

Bot:

That's actually not half-bad! The first response is OK apart from the "a room or a room". Its prediction of what the user would say as their second question isn't great, though. Let's see what happens if we prompt it with a more reasonable second round -- I modified the script to complete this sequence:

This is a transcript of a conversation between a helpful bot, 'Bot', and a human,
'User'.  The bot is very intelligent and always answers the human's questions
with a useful reply.

User: Provide a synonym for 'bright'

Bot:  Brightness is a synonym for illumination, which is the process of shining
light into a space, such as a room or a room.

User: What is an antonym of 'complicated'?

Bot:

The full result (including the input sequence) was:

This is a transcript of a conversation between a helpful bot, 'Bot', and a human,
'User'.  The bot is very intelligent and always answers the human's questions
with a useful reply.

User: Provide a synonym for 'bright'

Bot:  Brightness is a synonym for illumination, which is the process of shining
light into a space, such as a room or a room.

User: What is an antonym of 'complicated'?

Bot:  Complicated is an antonym for simple, which is easy and straightforward,
and easy to understand, which is easy and not difficult.

User: What is the difference between 'reality' and 'fiction'?

Bot:  Reality is the real world or physical world, which is based on observable
facts and can be observed, and fiction is the story or idea created by a person
or a group

This is looking almost solid. And there's no mention of instruction training on the model card for Qwen/Qwen3-0.6B-Base.

Conclusion

I think that my initial intuition was right -- a sufficiently advanced base model can operate as a chatbot without instruction fine-tuning.

However, I think I was actually wrong about it when I originally had the intuition! The OpenAI GPT-3.5 model that I believed was a base one, text-davinci-003, had already had some kind of instruction-following tuning after it was initially trained. It was really impressive that it was able to generalise from that to the somewhat ad-hoc chat transcript format that I was using at the time, but it was not a base model.

By contrast, a modern 600M-parameter model -- smaller than GPT-2 large (at 774M) and less than half the size of GPT-2 XL (at 1.5B) -- can actually work with a transcript without difficulty.

My guess is that, as well as the architectural improvements that have happened over the years since GPT-2, a big part of the difference is the size and quality of the dataset it was trained on.

GPT-2 was trained on "8 million documents for a total of 40GB of text", according to the paper. It's not entirely clear how many tokens that is, but I've seen claims of 10B tokens (for example here), and that seems in line with 40GB: that's 40 billion bytes, and at a rough 4 bytes per token you get about 10 billion tokens.

GPT-3, according to Wikipedia, was trained on around 500B tokens.

The Qwen3 series, by contrast, according to the model card, was trained on 36 trillion tokens across 119 languages.
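Running the numbers quickly (using the rough 4 bytes/token estimate from above, and treating the published figures as round numbers):

# Back-of-the-envelope comparison of training-set sizes; the bytes-per-token
# figure is the rough estimate from above, not a measured value.
gpt2_tokens = 40e9 / 4   # ~40GB of text at ~4 bytes/token => ~10B tokens
gpt3_tokens = 500e9
qwen3_tokens = 36e12

print(f"GPT-2: ~{gpt2_tokens / 1e9:.0f}B tokens")
print(f"Qwen3 vs GPT-3: ~{qwen3_tokens / gpt3_tokens:.0f}x the data")
print(f"Qwen3 vs GPT-2: ~{qwen3_tokens / gpt2_tokens:.0f}x the data")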

That's 72 times as much data as GPT-3 saw, and something like 3,600 times as much as GPT-2 -- and the data was probably much more curated, too. That's a big difference! Perhaps it had seen lots of transcripts in there, and that was why it was able to mimic them? Or just lots of different kinds of text in general?

I guess it's no surprise that training runs are getting ever-more expensive if that's the size of a frontier model run, though.

So, that's all for the transcript trick -- base models actually can work as chatbots without instruction fine-tuning, if they're sufficiently advanced and trained on enough data. That's useful to know!

Time to go back to the book; coming next, my notes on the actual fine-tuning I was meant to be doing rather than messing around with this :-)

Here's a link to the next post in this series.