A fun bug

28 March 2014

While I’m plugging the memory leaks in my epoll-based C reverse proxy, I thought I might share an interesting bug we found today on PythonAnywhere. The following is the bug report I posted to our forums.

So, here’s what was happening.

Each web app someone has on PythonAnywhere runs on a backend server. We have a cluster of these backends, and the cluster is behind a loadbalancer. Every backend server in the cluster is capable of running any web app; the loadbalancer’s job is to spread the web apps out between them so that, at any given time, each backend is only running an appropriately-sized subset. It has a list of backends, which we can update in realtime as we add or remove backends to scale up or down, and it looks at each incoming request and uses the domain name to work out which backend to route it to.
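To make the routing step concrete, here is a minimal sketch of domain-based backend selection. It is not the actual PythonAnywhere code: the post doesn't say how a domain is mapped to a backend, so the hash-and-modulo scheme, the `pick_backend` function and the backend names are all assumptions made for illustration.

```python
import hashlib


def pick_backend(hostname, backends):
    """Map a hostname to one backend from the current list (assumed scheme)."""
    # A stable hash means every request for the same hostname lands on the
    # same backend for as long as the backend list stays the same.
    digest = hashlib.md5(hostname.encode("utf-8")).hexdigest()
    return backends[int(digest, 16) % len(backends)]


# Made-up backend names; the real list is updated as servers come and go.
backends = ["backend-1", "backend-2", "backend-3"]
chosen = pick_backend("www.foo.com", backends)
```

Note that with a scheme like this the hostname string is fed to the hash exactly as it arrives, so any difference in the string, however small, can select a different backend.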

That’s all pretty simple. The twist comes when we add the code that reloads web apps to the mix.

Reloading a PythonAnywhere web app is simply a case of making an authenticated request to a specific URL. For example, right now (and this might change, it’s not an official API, so don’t do anything that relies on it) to reload www.foo.com owned by user fred, you’d hit the URL `http://www.pythonanywhere.com/user/fred/webapps/www.foo.com/reload`

Now, the PythonAnywhere website itself is just another web app running on one of the backends (a bit recursive, I know). So most requests to it are routed based on the normal loadbalancing algorithm. But calls specifically to that “reload” URL need to be routed differently: they need to go to the specific backend that is running the site that needs to be reloaded. So, for that URL, and that URL only, the loadbalancer uses the domain name in the second-to-last component of the URL’s path to choose which backend to route the request to, instead of using the hostname at the start of the URL.
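The special case for reload calls could be sketched roughly like this. The path pattern is taken from the example reload URL in this post (which is not an official API), and `RELOAD_PATH` and `routing_key` are hypothetical names rather than anything from the real loadbalancer.

```python
import re

# Path shape copied from the example reload URL; an assumption, not a spec.
RELOAD_PATH = re.compile(r"^/user/[^/]+/webapps/([^/]+)/reload/?$")


def routing_key(host_header, path):
    """Return the domain name the loadbalancer should route on."""
    match = RELOAD_PATH.match(path)
    if match:
        # Reload call: route on the domain embedded in the path, so the
        # request reaches the backend that runs that web app.
        return match.group(1)
    # Everything else: route on the hostname of the request itself.
    return host_header
```

So a request for `/user/fred/webapps/www.foo.com/reload` on `www.pythonanywhere.com` would be routed as if it were a request for `www.foo.com`, while any other path on that host would be routed on `www.pythonanywhere.com`.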

So, what happened here? Well, the clue was in the usernames of the people who were affected by the problem — IronHand and JoeButy. Both of you have mixed-case usernames. And your web apps are ironhand.pythonanywhere.com and joebuty.pythonanywhere.com.

But the code on the “Web” tab that builds the URL for reloading the selected domain uses your mixed-case usernames; that is, it sends the reload calls to the URL for IronHand.pythonanywhere.com or JoeButy.pythonanywhere.com.

And you can probably guess what the problem was: the backend selection code was case-sensitive. So requests to your web apps were going to one backend, but reload messages were going to a different one. The fix I just pushed made the backend selection code case-insensitive, as it should have been.
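Here is a rough illustration of the fix, assuming (purely for the sake of the example) that backend selection hashes the domain name and takes it modulo the number of backends. The `pick_backend` function, its `case_insensitive` flag and the backend names are hypothetical; the only thing taken from the real fix is the idea of ignoring case before choosing.

```python
import hashlib


def pick_backend(domain, backends, case_insensitive=True):
    """Choose a backend for a domain, optionally ignoring case (assumed scheme)."""
    # Domain names are case-insensitive, so normalise before hashing.
    # With case_insensitive=False this reproduces the buggy behaviour,
    # where "IronHand..." and "ironhand..." are hashed as different keys.
    key = domain.lower() if case_insensitive else domain
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return backends[int(digest, 16) % len(backends)]


backends = ["backend-%d" % i for i in range(1, 8)]  # made-up names
web_request = pick_backend("ironhand.pythonanywhere.com", backends)
reload_call = pick_backend("IronHand.pythonanywhere.com", backends)
```

With normalisation on, `web_request` and `reload_call` are always the same backend. With it off, they only match when the two hashes happen to land on the same slot.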

The remaining question — why did this suddenly crop up today? My best guess is that it’s been there for a while, but it was significantly less likely to happen, and so it was written off as a glitch when it happened in the past.

The reason it’s become more common is that we actually more than doubled the number of backends yesterday. Because of the way the backend selection code works, when there’s a relatively small number of backends it’s actually quite likely that the lower-case version of your domain will, by chance, route to the same backend as the mixed-case one. But the doubling of the number of servers changed that, and suddenly the probability that they’d route differently went up drastically.
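A back-of-the-envelope way to see this: if we assume the selection code spreads names uniformly and independently over the backends (an assumption, since the real algorithm isn't described here), then two different strings land on the same backend with probability 1/N, where N is the number of backends. The backend counts below are illustrative only, not our real numbers.

```python
def mismatch_probability(num_backends):
    """Chance that two different keys pick different backends.

    Assumes each key is assigned uniformly and independently,
    which is a simplification of whatever the real code does.
    """
    return 1 - 1 / num_backends


for n in (2, 4, 8):
    print(n, mismatch_probability(n))
# prints:
# 2 0.5
# 4 0.75
# 8 0.875
```

Under that assumption, every doubling of the backend count halves the chance that the mixed-case and lower-case names agree by accident.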

Why did we double the number of servers? Previously, backends were m1.xlarge AWS instances. We decided that it would be better to have a larger number of smaller backends, so that problems on one server impacted a smaller number of people. So we changed our system to use m1.large instances instead, spun up slightly more than twice as many backend servers, and switched the loadbalancer across.

So, there you have it. I hope it was as interesting to read about as it was to figure out :-)