Writing a reverse proxy/loadbalancer from the ground up in C, part 4: Dealing with slow writes to the network

Posted on 10 October 2013 in Linux, Programming

This is the fourth step along my road to building a simple C-based reverse proxy/loadbalancer, rsp, so that I can understand how nginx/OpenResty works -- more background here. Here are links to the first part, where I showed the basic networking code required to write a proxy that could handle one incoming connection at a time and connect it with a single backend, to the second part, where I added the code to handle multiple connections by using epoll, and to the third part, where I started using Lua to configure the proxy.

This post was unplanned; it shows how I fixed a bug that I discovered when I first tried to use rsp to act as a reverse proxy in front of this blog. The bug is fixed, and you're now reading this via rsp. The problem was that when the connection from a browser to the proxy was slower than the connection from the proxy to the backend (that is, most of the time), then when new data was received from the backend and we tried to send it to the client, we sometimes got an error telling us that the client was not ready. This error was being ignored, so a block of data would be skipped, and the pages you got back would be missing chunks. There's more about the bug here.

Just like before, the code that I'll be describing is hosted on GitHub as a project called rsp, for "Really Simple Proxy". It's MIT licensed, and the version of it I'll be walking through in this blog post is as of commit f0521d5dd8. I'll copy and paste the code that I'm describing into this post anyway, so if you're following along there's no need to do any kind of complicated checkout.

So, the problem was that when we wrote to the file descriptor that represented the connection to the client, we ignored the return value:

        write(closure->client_handler->fd, buffer, bytes_read);

If the connection to the client is backed up, this call can:

- write fewer bytes than we asked it to, or
- write nothing at all, returning -1 with errno set to EAGAIN or EWOULDBLOCK to tell us that the write would have blocked.

Under those circumstances, all we can really do is stash away the data that we've just received from the backend in a buffer somewhere, and wait until the client is able to receive data again. So, how do we find out when the client is ready for us? We use epoll.

Until now, all of our epoll event handlers have been listening for events related to reading stuff: EPOLLIN, for when there's data to read, and EPOLLRDHUP, for when the connection has been closed. But there's also an EPOLLOUT, which is emitted when the connection is ready for writing.

So in an ideal world, we'd just change the code to buffer data whenever it can't be written, and when we receive an EPOLLOUT event we'd send down whatever's in the buffer and carry on. But it's not that simple.

Remember, until now we've been using epoll in level-triggered mode. For our reading-related events, that means that if there's something there for us to read, we get an event. If we don't read all of it, that's fine -- because whether or not we get an event is based on the "level" of stuff that's waiting, we'll get another event shortly to tell us that there's still stuff to read.

But if you ask for output-related events in level-triggered mode, you will get an event every time you call epoll_wait if it's possible to write to the file descriptor. And for a file descriptor, being writable is pretty much the default state -- so almost every run around the loop triggers an event. That happens tens of thousands of times a second, which is really wasteful; your code is being told "hey, you can write to this socket", and it needs to check if there's anything that needs writing, discover that there isn't, and return. CPU usage goes to 100% for no good reason.

So, to do pretty much anything sensible with EPOLLOUT, we have to switch away from level-triggered epoll, to edge-triggered. Edge-triggering means that you only get told when something changes. If there's data to read, you're told once -- and it's up to you to read it all, because if you don't you won't be told again. But on the other hand, if a socket that was non-writable unblocks and you can write to it again, you'll be told about it once, which is what we want.

Now, it would be great if we could register two event handlers for our connection to the client -- a level-triggered one for reading, and an edge-triggered one for writing. But unfortunately that's not possible; both reading and writing happen to the same file descriptor. And if you try to add the same file descriptor to an epoll instance twice, you get an error. So either everything has to be level-triggered, or everything has to be edge-triggered.
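
(This is easy to verify with a few lines of standalone code -- nothing to do with rsp itself, just a quick demonstration of the epoll_ctl behaviour: the second EPOLL_CTL_ADD for the same file descriptor fails with EEXIST, and you'd have to use EPOLL_CTL_MOD to change its event mask instead.)

#include <errno.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    int epoll_fd = epoll_create1(0);
    struct epoll_event event;

    event.events = EPOLLIN;
    event.data.fd = STDIN_FILENO;

    /* The first registration of stdin succeeds... */
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, STDIN_FILENO, &event);

    /* ...but adding the same file descriptor again fails with EEXIST;
       EPOLL_CTL_MOD is the way to change an existing registration */
    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, STDIN_FILENO, &event) == -1 && errno == EEXIST) {
        printf("Can't add the same fd twice\n");
    }

    close(epoll_fd);
    return 0;
}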

To summarise:

- we need to know when a backed-up client connection becomes writable again, which means listening for EPOLLOUT events;
- level-triggered EPOLLOUT would fire on almost every trip around the event loop, so the client connection has to be handled in edge-triggered mode;
- we can't register the same file descriptor twice with different modes, so reads on that connection become edge-triggered too -- which means we have to be careful to read everything that's available each time we're told there's data.

So let's take a look at the changes for switching to edge-triggered mode first. They're actually pretty simple. Firstly, we need to add EPOLLET, the flag that says "this descriptor should be dealt with in edge-triggered mode", to the flags we pass down to epoll_ctl via our add_epoll_handler wrapper. So, previously, we did this to add the client connection:

    add_epoll_handler(epoll_fd, client_socket_event_handler, EPOLLIN | EPOLLRDHUP);

And now we do this instead:

    add_epoll_handler(epoll_fd, client_socket_event_handler, EPOLLIN | EPOLLRDHUP | EPOLLET);
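
(For context: add_epoll_handler is the little wrapper around epoll_ctl that was introduced back in part two. I won't reproduce the real version here, but it's roughly along these lines -- a sketch, assuming the epoll_event_handler struct carries the file descriptor it manages, which is how it's used elsewhere in this post:)

void add_epoll_handler(int epoll_fd, struct epoll_event_handler* handler, uint32_t event_mask)
{
    struct epoll_event event;

    /* Stash the handler itself in the event data, so that when epoll_wait
       returns we know which handler to dispatch to */
    event.data.ptr = handler;
    event.events = event_mask;

    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, handler->fd, &event) == -1) {
        perror("Couldn't register handler with epoll");
        exit(-1);
    }
}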

The other change is to make sure that we always read everything when we're told that there's stuff to read. Previously, we handled EPOLLIN events like this:

        bytes_read = read(self->fd, buffer, BUFFER_SIZE);
        if (bytes_read == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return;
        }

        if (bytes_read == 0 || bytes_read == -1) {
            close_client_socket(closure->client_handler);
            close_backend_socket(self);
            return;
        }

        write(closure->client_handler->fd, buffer, bytes_read);

So, we just read up to BUFFER_SIZE bytes, do some error handling, then write the results to the client. Now that we're in edge-triggered mode, we need to make sure that we read everything, so we update the code to use a simple loop:

        while ((bytes_read = read(self->fd, buffer, BUFFER_SIZE)) != -1 && bytes_read != 0) {
            if (bytes_read == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
                return;
            }

            if (bytes_read == 0 || bytes_read == -1) {
                close_client_socket(closure->client_handler);
                close_backend_socket(self);
                return;
            }

            write(closure->client_handler->fd, buffer, bytes_read);
        }

It's basically the same code, with a while loop wrapped around the outside. Simple!

Now, the code still has the bug we're trying to fix -- it still ignores the results of the call to write -- so let's sort that out. How do we do the buffering?

The first thing is that we want each of our connection handler objects -- the structures that represent the connection to the backend and the connection to the client -- to own its own buffer. Previously, when we wrote to the client connection in our event handler for the backend connection, we used its file descriptor directly:

            write(closure->client_handler->fd, buffer, bytes_read);

Now, the client handler needs to take a bit more control over the process, so it makes sense to move this into a function that lives with it:

            write_to_client(closure->client_handler, buffer, bytes_read);

So what does that look like? Here's how it starts:

void write_to_client(struct epoll_event_handler* self, char* data, int len)
{
    struct client_socket_event_data* closure = (struct client_socket_event_data* ) self->closure;

    int written = 0;
    if (closure->write_buffer == NULL) {
        written = write(self->fd, data, len);
        if (written == len) {
            return;
        }
    }

So, we get hold of the data about our client socket, and then we check whether it has a write buffer -- that is, whether there's anything queued up waiting to be written. If there isn't, we try writing our new data straight to the connection -- basically, what we used to do. If that works, and we're told that everything we wanted to write was written, then we're done. That's the simple case, covering the situation where the client is accepting data as fast as we're sending it. Now things get a little more complicated:

    if (written == -1) {
        if (errno != EAGAIN && errno != EWOULDBLOCK) {
            perror("Error writing to client");
            exit(-1);
        }
        written = 0;
    }

If the call to write returns -1, either it's because the client connection is backed up, or it's because something's gone fundamentally wrong. If it's the latter, we just exit. If it's the former, we're fine -- we just note that the amount of data that was written was zero.

Next:

    int unwritten = len - written;
    struct data_buffer_entry* new_entry = malloc(sizeof(struct data_buffer_entry));
    new_entry->is_close_message = 0;
    new_entry->data = malloc(unwritten);
    memcpy(new_entry->data, data + written, unwritten);
    new_entry->current_offset = 0;
    new_entry->len = unwritten;
    new_entry->next = NULL;

Here, we work out how much of our data was unwritten (remember, the call to write will return the number of bytes that were actually written, so if things are clogged up then we could potentially get -1 and an appropriate error, which is the case we've already handled, or just a number less than the number of bytes we told it to write). We then create a new struct data_buffer_entry, and set a few fields on it. The obvious ones are data, which is the data that we were unable to write, and len, which is its length. is_close_message and current_offset we'll come to in a little bit; next, as you might guess, is because these structures form a linked list.
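
(For reference, here's roughly what the buffer entry structure looks like, reconstructed from the fields used above -- the exact types and layout in the rsp source may differ. The write_buffer pointer that write_to_client checks is the head of this list, and lives in the client handler's closure as a struct data_buffer_entry*, NULL when nothing is queued.)

struct data_buffer_entry {
    int is_close_message;              /* 1 means "close the socket", nothing to write */
    char* data;                        /* the bytes we couldn't write yet */
    int current_offset;                /* how much of data has already been sent */
    int len;                           /* total number of bytes in data */
    struct data_buffer_entry* next;    /* the entries form a singly-linked list */
};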

And the next thing to do is to add our new structure to the list:

    add_write_buffer_entry(closure, new_entry);
}

...and that finishes the write_to_client function.
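
Putting those fragments back together (this is just the snippets above joined up, so the real file should look essentially the same), the whole function reads like this:

void write_to_client(struct epoll_event_handler* self, char* data, int len)
{
    struct client_socket_event_data* closure = (struct client_socket_event_data* ) self->closure;

    int written = 0;
    if (closure->write_buffer == NULL) {
        written = write(self->fd, data, len);
        if (written == len) {
            return;
        }
    }

    if (written == -1) {
        if (errno != EAGAIN && errno != EWOULDBLOCK) {
            perror("Error writing to client");
            exit(-1);
        }
        written = 0;
    }

    int unwritten = len - written;
    struct data_buffer_entry* new_entry = malloc(sizeof(struct data_buffer_entry));
    new_entry->is_close_message = 0;
    new_entry->data = malloc(unwritten);
    memcpy(new_entry->data, data + written, unwritten);
    new_entry->current_offset = 0;
    new_entry->len = unwritten;
    new_entry->next = NULL;

    add_write_buffer_entry(closure, new_entry);
}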

It's probably worth having a quick look at add_write_buffer_entry at this point; it's fairly boilerplate code to add something to a linked list -- the first item in the list being stored in the client_socket_event_data as write_buffer, which you'll remember we checked at the start of write_to_client as a way of seeing whether anything was currently buffered.

void add_write_buffer_entry(struct client_socket_event_data* closure, struct data_buffer_entry* new_entry)
{
    struct data_buffer_entry* last_buffer_entry;
    if (closure->write_buffer == NULL) {
        closure->write_buffer = new_entry;
    } else {
        for (last_buffer_entry=closure->write_buffer; last_buffer_entry->next != NULL; last_buffer_entry=last_buffer_entry->next)
            ;
        last_buffer_entry->next = new_entry;
    }
}

Fairly simple code; I won't go through it.

Right, what about that is_close_message field in the buffer entries? Well, previously, when a backend connection was closed, we just closed the client connection right away. But now we can't -- there might still be stuff buffered up that needs to be written to the client. So we simply regard the request to close the socket as another thing we should put into the buffer, using that flag. The old code to close the client connection (as called by the backend connection) looked like this:

void close_client_socket(struct epoll_event_handler* self)
{
    close(self->fd);
    free(self->closure);
    free(self);
}

Now, instead, we do this:

void close_client_socket(struct epoll_event_handler* self)
{
    struct client_socket_event_data* closure = (struct client_socket_event_data* ) self->closure;
    if (closure->write_buffer == NULL) {
        really_close_client_socket(self);

So, if there's nothing in the buffer we call this new really_close_client_socket function. really_close_client_socket just has the code from the original close_client_socket. If there is something in the buffer, we do this:

    } else {
        struct data_buffer_entry* new_entry = malloc(sizeof(struct data_buffer_entry));
        new_entry->is_close_message = 1;
        new_entry->next = NULL;

        add_write_buffer_entry(closure, new_entry);
    }
}

We just create a new buffer entry to represent the socket close, and put it on the buffer.
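
(As an aside, really_close_client_socket itself isn't shown in this post; as described above, it's just the body of the old close_client_socket moved into its own function:)

void really_close_client_socket(struct epoll_event_handler* self)
{
    close(self->fd);
    free(self->closure);
    free(self);
}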

So that's how we buffer stuff. The next thing to look at is the code that, when the connection becomes available for writing, drains that buffer so that we can get back on track. Obviously, one thing we need to do is start listening for EPOLLOUT events, so

    add_epoll_handler(epoll_fd, client_socket_event_handler, EPOLLIN | EPOLLRDHUP | EPOLLET);

becomes

    add_epoll_handler(epoll_fd, client_socket_event_handler, EPOLLIN | EPOLLRDHUP | EPOLLET | EPOLLOUT);

But equally obviously, the interesting stuff is in handle_client_socket_event. We get a new bit of code to handle our new event, in cases where we have something in the write buffer.

    if ((events & EPOLLOUT) && (closure->write_buffer != NULL)) {

First, a bit of setup:

        int written;
        int to_write;
        struct data_buffer_entry* temp;

And now we try to iterate through the complete buffer until it's empty:

        while (closure->write_buffer != NULL) {

Now, if we encounter a message saying "close the connection", then we do so (using the same really_close_client_socket function as we did in the other function), tidy up a bit, and exit the event handler:

            if (closure->write_buffer->is_close_message) {
                really_close_client_socket(self);
                free(closure->write_buffer);
                closure->write_buffer = NULL;
                return;
            }

There could, in theory, be other events that we've been asked to handle at this point -- but we've just shut down the socket we're managing, so there's not much we could do with them anyway. We could also, in theory, have other items on the write_buffer list after the close, the memory for which would be leaked by this code; I don't think that's possible, though, so I'm not going to worry about it for now...
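
(If it did turn out to be possible, the fix would be small -- something like this hypothetical version of the close-handling block, which walks the rest of the list and frees it before returning; to be clear, this isn't in rsp as it stands:)

            if (closure->write_buffer->is_close_message) {
                really_close_client_socket(self);

                /* Hypothetical extra cleanup, not in the current code: free any
                   entries that somehow ended up queued behind the close message */
                struct data_buffer_entry* leftover = closure->write_buffer->next;
                while (leftover != NULL) {
                    struct data_buffer_entry* next = leftover->next;
                    if (!leftover->is_close_message) {
                        free(leftover->data);
                    }
                    free(leftover);
                    leftover = next;
                }

                free(closure->write_buffer);
                closure->write_buffer = NULL;
                return;
            }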

So, we've handled socket closes. That means that the buffer entry we're dealing with contains something to write to the client socket. Now, it's possible that we tried to write this buffer item before, and were only able to write part of it. That's what the current_offset field in the buffer entries is for: we initialised it to zero, but if a write only manages to get part of the data out, we use it to keep track of how far through the entry we've got. So the number of bytes we need to write is the total number of bytes in this buffer entry minus the number of bytes from the entry that have already been written:

            to_write = closure->write_buffer->len - closure->write_buffer->current_offset;

So let's try and write them:

            written = write(self->fd, closure->write_buffer->data + closure->write_buffer->current_offset, to_write);

Now, our normal error handling to cover cases where we weren't able to write everything:

            if (written != to_write) {

If we got an error back, we either bomb out if it's not something we're expecting, or we set the number of bytes written to zero if it's just another "this would block" message:

                if (written == -1) {
                    if (errno != EAGAIN && errno != EWOULDBLOCK) {
                        perror("Error writing to client");
                        exit(-1);
                    }
                    written = 0;
                }

So at this point we know that we weren't able to write everything, but the socket is still OK. So we update the current offset in the buffer entry, then break out of our loop over all of the buffer entries -- we're going to have to wait for another EPOLLOUT event to write more:

                closure->write_buffer->current_offset += written;
                break;

So now we're in the code to handle the case where all of the buffer entry was successfully written. If that's the case, we just need to free up this buffer entry, and move on to the next one so that we can run through this loop again:

            } else {
                temp = closure->write_buffer;
                closure->write_buffer = closure->write_buffer->next;
                free(temp->data);
                free(temp);
            }
        }
    }
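
So here's the new EPOLLOUT branch of handle_client_socket_event in one piece -- just the snippets above joined back up, so the real file should look essentially the same:

    if ((events & EPOLLOUT) && (closure->write_buffer != NULL)) {
        int written;
        int to_write;
        struct data_buffer_entry* temp;

        while (closure->write_buffer != NULL) {
            if (closure->write_buffer->is_close_message) {
                really_close_client_socket(self);
                free(closure->write_buffer);
                closure->write_buffer = NULL;
                return;
            }

            to_write = closure->write_buffer->len - closure->write_buffer->current_offset;
            written = write(self->fd, closure->write_buffer->data + closure->write_buffer->current_offset, to_write);

            if (written != to_write) {
                if (written == -1) {
                    if (errno != EAGAIN && errno != EWOULDBLOCK) {
                        perror("Error writing to client");
                        exit(-1);
                    }
                    written = 0;
                }
                closure->write_buffer->current_offset += written;
                break;
            } else {
                temp = closure->write_buffer;
                closure->write_buffer = closure->write_buffer->next;
                free(temp->data);
                free(temp);
            }
        }
    }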

And that's it! That is the total set of changes to the codebase required to handle buffering of data when the client connection backs up and we keep receiving stuff from the backend.

There are a couple of problems with the code as it stands now: firstly, it's getting a bit complicated -- the handle_client_socket_event function in particular is getting rather long. Secondly, we're not handling the case where the connection to the backend backs up while we still have stuff from the client to send -- significantly less likely to happen, but still possible.

I think the next thing we need to do is a thorough refactoring; there's far too much code duplication between the backend and the client socket handlers. Making those two more like each other so that they can share this buffering code is what I'll post about next.