Writing an LLM from scratch, part 18 -- residuals, shortcut connections, and the Talmud

Posted on 18 August 2025 in AI, LLM from scratch, TIL deep dives

I'm getting towards the end of chapter 4 of Sebastian Raschka's book "Build a Large Language Model (from Scratch)". When I first read this chapter, it seemed to be about tricks for making LLMs trainable, but having gone through it more closely, only the first part -- on layer normalisation -- seems to fit into that category. The second, about the feed-forward network, is definitely not -- that's the part of the LLM that does a huge chunk of the thinking needed for next-token prediction. And this post is about another part like that: the shortcut connections.

The reason I want to highlight this is that the presentation in the book really is all about making a network trainable -- about helping with the vanishing gradients that deep neural networks are prone to.

But the more I looked into it, the more I realised that what we're doing with these shortcuts is a fundamental change to the architecture of the LLM from the way it's been expressed so far. Gradients do indeed vanish less than they would without them, but that's more of a side-effect than the reason for adding them.

Here's why.

Shortcuts to "fix" vanishing gradients

The book introduces the idea of shortcuts by saying:

Originally, shortcut connections were proposed for deep networks... to mitigate the challenge of vanishing gradients.

Now, I'm sure that's correct, but the remainder of the section explaining them -- taken naively -- makes it sound like they're a simple fix to the problem. "One Neat Trick To Stop Your Gradients From Vanishing!"

Raschka gives an example of a five-layer neural network, and shows (with code) that the mean of the absolute size of the gradients at each layer in a backward pass generally drops as we move from the last layer to the first -- from 0.0050 in the last to 0.0013 to 0.0007 to 0.0001 and then a slight jump to 0.0002 for the first layer. A nice illustration of vanishing gradients.

He then adds shortcut connections, so that each layer's input is added on to its output, and shows that the gradients no longer decay; in this example, there's no shortcut connection for the last layer, so the resulting gradient averages are 1.32, 0.26, 0.32, 0.20, 0.22 -- clearly an improvement. (I assume that the higher gradients, even for the last layer and the penultimate one "upstream" of it, are due to different input data or weights.)
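A minimal sketch of that kind of experiment, using plain NumPy rather than the book's PyTorch code -- the layer sizes, the weight scale, the sigmoid activation, and the sum-of-outputs loss are all my own assumptions, chosen just to make the vanishing visible:

```python
import numpy as np

def gradient_means(n_layers=5, dim=8, use_shortcut=False, seed=0):
    """Mean absolute weight gradient for each layer of a small sigmoid MLP,
    listed from the last layer back to the first."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(scale=0.5, size=(dim, dim)) for _ in range(n_layers)]
    x = rng.normal(size=dim)

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # Forward pass, remembering each layer's input and pre-activation.
    cache = []
    h = x
    for W in weights:
        z = W @ h
        out = sigmoid(z)
        if use_shortcut:
            out = out + h                    # add the layer's input back in
        cache.append((h, z))
        h = out

    # Backward pass for loss = sum(final output), so dL/dh starts as ones.
    grad_h = np.ones(dim)
    means = []
    for W, (h_in, z) in zip(reversed(weights), reversed(cache)):
        s = sigmoid(z)
        grad_z = grad_h * s * (1.0 - s)      # gradient through the sigmoid
        means.append(np.abs(np.outer(grad_z, h_in)).mean())   # dL/dW
        new_grad_h = W.T @ grad_z            # route through the layer...
        if use_shortcut:
            new_grad_h += grad_h             # ...plus the route through the skip
        grad_h = new_grad_h
    return means

plain = gradient_means(use_shortcut=False)
with_shortcut = gradient_means(use_shortcut=True)
```

Without the shortcuts, the first layer's gradients come out far smaller than the last layer's; with them, they don't decay, because `grad_h` keeps getting the undamped skip-path term added back in at every step.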

So, shortcut connections did appear to stop the gradients from dropping off quite well. But at what cost?

Let's think about a really simple example, a two-layer network:

A two-layer NN with gradients decreasing

This is basically just layers 3 and 4 of the example in the book. We can see that the gradients are smaller in the first layer, as expected.

What do those gradients at layer one actually mean? Well, the output layer's gradients tell us what adjustments to that layer's parameters would change the error in the output. And layer one's gradients mean the same -- but importantly, layer one only affects the output through the values it feeds forward to the output layer. So, the gradients at layer one show how we would change its parameters to affect the output through its effects on the output layer.

Now, let's look at the same network but with a shortcut connection across the (admittedly now somewhat-misnamed) output layer:

A two-layer NN with a shortcut, gradients not decreasing

Our gradients at layer one are larger -- which is a win! But the meaning has changed. Now those gradients are how we would change the parameters to affect the output both directly via the shortcut and indirectly via the output layer.

There's no particular reason to think that the portion of the gradients attributable to the indirect effect on the output -- that is, layer one's output to the output layer -- will be any larger as a result of the shortcut connection. Looking at it just from the viewpoint of the vanishing gradient problem -- yay, number went up! -- ignores the fact that we've fundamentally changed what the network is doing. Maybe the higher gradients could all be attributed to that shortcut path, and we're not doing anything to improve the path from layer one to the output layer.
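A scalar toy makes this concrete (the layer `f(x) = w * tanh(x)` and the numbers here are my own invention, not from the book). With a shortcut the output is `y = x + f(x)`, so `dy/dx = 1 + f'(x)`: the skip path contributes exactly the `1`, while the indirect part, `f'(x)`, is untouched:

```python
import math

def output_gradient(x, w, use_shortcut):
    """Toy layer f(x) = w * tanh(x); with a shortcut the output is x + f(x).
    Returns (gradient through the layer itself, total gradient dy/dx)."""
    through_layer = w * (1.0 - math.tanh(x) ** 2)            # f'(x), the indirect route
    total = through_layer + (1.0 if use_shortcut else 0.0)   # the skip adds exactly 1
    return through_layer, total

ind_plain, tot_plain = output_gradient(x=2.0, w=0.1, use_shortcut=False)
ind_skip, tot_skip = output_gradient(x=2.0, w=0.1, use_shortcut=True)
```

The total gradient jumps from about 0.007 to about 1.007 -- number went up! -- but every bit of the increase is attributable to the skip path; the gradient through the layer itself is identical in both cases.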

A metaphor that comes to mind (as an avid if incompetent guitarist) is what would happen if you added shortcut connections around pedals.

I have a distortion pedal; the wire from my guitar goes to that pedal and then from there to the amplifier. If the pedal is switched off, then I have a "clean" guitar sound; if I switch it on I can rock out with a distorted sound. However, one problem with distortion pedals is that they can add in unwanted noise -- even the background electromagnetic "hum" from the household electric mains can come through.

If I rigged up some cables and other components to have a "shortcut" connection that routed around the distortion pedal, I might be able to reduce that hum on average -- but I'd have a very different sound, a mixture of distorted and clean, and the distorted portion of that would still have just as much hum as it did before.[1]

So: shortcut connections don't just prevent vanishing gradients -- they change the network quite deeply. And intuitively, there's no obvious reason why they would stop the gradients from vanishing along the path through the network that doesn't follow them -- they just provide an extra route that sends back gradients while ignoring the effects of the bypassed layers entirely.

So, why would we use them at all? Because they change the architecture -- and to me, at least, that change in what the numbers flowing through the network "mean" is much more interesting than their effect on gradients.

How shortcuts change the architecture

In the design that Raschka introduces in the next section, each transformer block -- that is, the combined normalisation, attention, and feed-forward processes that make up a layer in our LLM -- does this:

  1. Stash away a copy of the input values, normalise them, run them through multi-head attention, apply dropout, then add the stashed copy back in.
  2. Stash away a copy of that result, normalise it, run it through the feed-forward network, apply dropout, then add that stashed copy back in.

This is shown nicely in figure 4.13 in the book.

My mental model for what is actually happening here is that the norm-MHA-dropout and the norm-FFN-dropout sections are not taking in input vectors, processing them, and producing new context vectors for the next layer, as I was imagining them doing in the past.

Instead, they're adding new information to the input vectors. Each MHA layer and each following FFN layer is just adding notes to the existing data that came in rather than replacing it with its own, updated version.

This is quite a big change!

What it reminds me of most is the Talmud.

The Talmud, labelled Schoeneh, CC BY-SA 4.0, via Wikimedia Commons

I'm not Jewish, so -- as always! -- corrections welcome in the comments. But as I understand it, the Talmud differs from most other religious texts because it's not just the core text (the Mishna) -- or alternatively, just someone's commentary on the core text. Instead, it's the Mishna in the centre (the small purple box labelled 15 in the picture), surrounded by comments, and comments on comments, and comments on comments on comments from different scholars over the centuries.

For me that's a really nice metaphor for what the successive layers of transformers are doing.

The input vectors come into the first layer; let's treat them as the core text, the Mishna, in the metaphor. We take a copy, then normalise the vectors, run them through MHA, and (during training) apply dropout. Then those stashed-away values, often called residuals, are added back in. What we have now is something like the Mishna with one level of commentary.

Now we stash that version away as new residuals, and do the normalise-feed-forward-dropout phase of our first layer. We add the residuals we took before that, and we have our "Mishna" with two levels of commentary.

Then we run the result through our next layer, and our next, and so on -- each one adding on two levels of "commentary", the one from the normalise-attention portion, and then the one from the normalise-feed-forward network.
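In code, that residual-stream pattern looks something like this sketch. The `attention` and `feed_forward` arguments are stand-ins for the real sub-layers, the layer norm is simplified (no learned scale and shift), and dropout is omitted for brevity:

```python
import numpy as np

def layer_norm(x):
    # Simplified: normalise each vector to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def transformer_block(x, attention, feed_forward):
    """Each phase *adds* its output to the running values instead of
    replacing them -- the "commentary" pattern described above."""
    residual = x                                 # stash the incoming values
    x = residual + attention(layer_norm(x))      # first level of "commentary"
    residual = x                                 # stash again
    x = residual + feed_forward(layer_norm(x))   # second level of "commentary"
    return x

# If the sub-layers contribute nothing, the input passes through untouched --
# which is simply not true of a block without the residual additions.
tokens = np.random.default_rng(0).normal(size=(3, 4))
silent = lambda v: np.zeros_like(v)
out = transformer_block(tokens, silent, silent)
```

That last property is the architectural point: the block can only annotate the stream, never wholesale discard what came before it.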

Wrapping up

So, another part of chapter 4 where, in a few pages, we completely change the way the LLM works :-) I suspect this is all on me, not on the book -- Raschka almost certainly didn't intend it as a framework for understanding how LLMs work, but rather as a practical guide to writing them for people who'd learned the underlying concepts elsewhere. But to be honest, I'm finding it a super-useful way to structure my own "why" curriculum -- the repeated "wait, what?" followed by "aha!" moments are a really satisfying way to learn.

The one thing I'm taking away from this section is a personal preference in terminology, at least for this stage of my understanding. "Shortcut connections" is a very descriptive term for what is happening here, but thinking in terms of residuals -- you add "residual" copies of the input values back into the output before sending it on -- clicks better for me.

I think that pretty much wraps up the new stuff for chapter 4, anyway! In my next post, I'll run through the architecture of the LLM that we've been going through at a high level, linking back to my earlier posts, to see if I can come up with a reasonable summary.

And then it will be time to move on to the next big thing -- training this monster...

Here's a link to the next post in this series.


  [1] Kind of tempted to try this; it might sound interesting.