LLM Quantisation Weirdness

Posted on 27 February 2024 in Programming, Python, AI

I bought myself an Nvidia RTX 3090 for Christmas to play around with local AI models. Serious work needs larger, more powerful cards, and it's easy (and not that expensive) to rent such cards by the minute from the likes of Paperspace. But the way I see it, I'm not going to be doing any serious work -- and what I really want to do is be able to run little experiments quickly and easily without worrying about spinning up a machine, getting stuff onto it, and so on.

One experiment I tried the other day was to build a mental model of how model size and quantisation affect the quality of responses from LLMs. Quantisation is the process of taking a model that has, say, 16 bits for each of its parameters and running it with those parameters rounded to eight bits, four bits, or even fewer -- people have found that it often has a surprisingly small effect on output quality, and I wanted to play with that. Nothing serious or in-depth -- just trying stuff out with different model sizes and quantisations, and running a few prompts through them to see how the outputs differed.
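To make the idea of quantisation a bit more concrete, here is a minimal sketch of rounding a handful of weights to a small number of evenly spaced levels. It is a toy "absmax" scheme chosen for illustration only; the function names and the sample weights are made up, and real quantisation libraries are considerably more sophisticated than this.

```python
def quantise(weights, bits):
    """Round each weight to the nearest of a small set of evenly spaced levels.

    Toy absmax scheme: assumes a non-empty list with at least one non-zero weight.
    """
    levels = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]


def worst_error(weights, bits):
    """Largest absolute difference between a weight and its quantised value."""
    approx = quantise(weights, bits)
    return max(abs(a - w) for a, w in zip(approx, weights))


# Hypothetical weights, purely for illustration
weights = [0.12, -0.53, 0.98, -0.05, 0.33]
for bits in (8, 4):
    print(f"{bits}-bit worst-case error: {worst_error(weights, bits):.4f}")
```

Fewer bits means fewer levels to round to, so the worst-case error grows as the bit count shrinks.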

I was comparing three sizes of the Code Llama HF model, with different quantisations:

Code Llama is a model from Meta, and the Instruct variant is designed to receive questions about programming (with specific formatting), and to reply with code; the "-hf" suffix on the model names marks the Hugging Face Transformers format of the weights. I chose those particular quantisations because the 13b model wouldn't fit in the 3090's 24GiB RAM without quantisation to at least 8-bit, and the 34b model would only fit if it was 4-bit quantised.
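The arithmetic behind those limits can be sketched roughly as follows. This assumes the unquantised models use 16 bits per parameter, takes the nominal parameter counts (7, 13 and 34 billion) at face value, and counts only the weights, ignoring activations and any other overhead, so treat it as a back-of-the-envelope estimate rather than a precise measurement.

```python
GIB = 2 ** 30
CARD_BYTES = 24 * GIB   # memory on the RTX 3090


def weight_bytes(params_billions, bits):
    """Approximate memory needed for the weights alone."""
    return params_billions * 1e9 * bits / 8


def fits(params_billions, bits):
    """True if the weights alone would fit on the card."""
    return weight_bytes(params_billions, bits) < CARD_BYTES


for size in (7, 13, 34):
    for bits in (16, 8, 4):
        gib = weight_bytes(size, bits) / GIB
        verdict = "fits" if fits(size, bits) else "too big"
        print(f"{size}b at {bits}-bit: {gib:5.1f} GiB ({verdict})")
```

On these assumptions the 13b model at 16 bits needs slightly more than 24GiB, and the 34b model only squeezes in at 4 bits.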

The quality of the response to my test question was not too bad with any of these, apart from codellama/CodeLlama-34b-Instruct-hf in 4-bit, which was often (but not always) heavily glitched with missing tokens -- that is, it was worse than codellama/CodeLlama-7b-Instruct-hf in 4-bit. That surprised me!

I was expecting quantisation to worsen the results, but not to make a larger model worse than a smaller one at the same level of quantisation. I've put a repo up on GitHub to see if anyone can repro these results, and to find out if anyone has any idea why it's happening.

[ Read more ]

Giving up on the AI chatbot tutorial (for now)

Posted on 27 February 2024 in Programming, Python, AI

I'm a big fan of learning in public, and early last year I started trying to do that by writing an AI chatbot tutorial as I learned the technology myself. But somehow it just wasn't working -- perhaps because my understanding was evolving so quickly that each time I sat down to write, I spotted dozens of errors in the previous posts, and felt I should fix those first. So I've decided to give up on that one, at least for now.

So, back to something a bit more achievable! Some lab notes will be coming on things I've been working on, including -- later on this evening -- a post about an oddity I found the other day.

In the meantime, here's a blog post I did for PythonAnywhere late last year: Five steps to create your own PythonAnywhere AI guru, on PythonAnywhere.

Building an AI chatbot for beginners: part 2

Posted on 4 April 2023 in Programming, Python, AI

[Note that this series kind of dried up; when I started the series, I knew that I knew very little about the subject, but I was hoping to learn better by learning in public. However, as time went by it turned out that this wasn't working. There are a lot of better tutorials out there!]

Welcome to the second part of my tutorial on how to build a chatbot using OpenAI's interface to their Large Language Models (LLMs)! You can read the introduction here, and the first part here. As a reminder, I'm writing this not because I'm an expert, but because I'm learning how to do it myself, and writing about it helps me learn faster. Caveat lector :-)

In this post, we'll give the bot some memory of the conversation so far.

At the end of the first part, we had a program that would accept input from a user, combine it with some static text to make a prompt that an LLM would complete in the character of a chatbot (stopping at the point that the chatbot should stop, and not trying to carry on the conversation), then send it to OpenAI's API specifying an LLM model, and print out the result.

[ Read more ]

Building an AI chatbot for beginners: part 1

Posted on 19 March 2023 in Programming, Python, AI

[Note that this series kind of dried up; when I started the series, I knew that I knew very little about the subject, but I was hoping to learn better by learning in public. However, as time went by it turned out that this wasn't working. There are a lot of better tutorials out there!]

Welcome to the first part of my tutorial on how to build a chatbot using OpenAI's interface to their Large Language Models (LLMs)! You can read the introduction here.

If you're reading this and want to get the best out of it, I strongly recommend that you run the code on your own machine as you go along: trust me, it will stick in your mind much better if you do that.

The goal in this post is to write a basic bot script that accepts user input, and just bounces it off an OpenAI LLM to generate a response.

[ Read more ]

Building an AI chatbot for beginners: part 0

Posted on 19 March 2023 in Programming, Python, AI

[Note that this series kind of dried up; when I started the series, I knew that I knew very little about the subject, but I was hoping to learn better by learning in public. However, as time went by it turned out that this wasn't working. There are a lot of better tutorials out there!]

Like a lot of people, I've been blown away by the capabilities of Large Language Model (LLM) based systems over the last few months. I'm using ChatGPT regularly for all kinds of things, from generating basic code to debugging errors to writing emails.

I wanted to understand more about how these tools worked, and feel strongly that there's no better way to learn something than by doing it. Building an LLM is, at least right now, super-expensive -- in the millions of dollars (although maybe that will be coming down fast?). It also requires a lot of deep knowledge to get to something interesting. Perhaps something to try in the future, but not right now.

However, using LLMs to create something interesting -- that's much easier, especially because OpenAI have a powerful API, which provides ways to do all kinds of stuff. Most relevantly, they provide access to a Completion API. That, as I understand it, is the lowest-level way of interacting with an LLM, so building something out of it is probably the best bang for the buck for learning.

Over the last few weeks I've put together a bunch of things I found interesting, and have learned a lot. But it occurred to me that an even better way to learn stuff than by building it is to build it, and then explain it to someone else, even if that person is an abstract persona for "someone out there on the Internet". So: time for an LLM chatbot tutorial!

[ Read more ]

Fun with network namespaces, part 1

Posted on 13 March 2021 in Linux, Programming

Linux has some amazing kernel features to enable containerization. Tools like Docker are built on top of them, and at PythonAnywhere we have built our own virtualization system using them.

One part of these systems that I've not spent much time poking into is network namespaces. Namespaces are a general abstraction that allows you to separate out system resources; for example, if a process is in a mount namespace, then it has its own set of mounted disks that is separate from those seen by the other processes on a machine -- or if it's in a process namespace, then it has its own cordoned-off set of processes visible to it (so, say, ps auxwf will just show the ones in its namespace).

As you might expect from that, if you put a process into a network namespace, it will have its own restricted view of what the networking environment looks like -- it won't see the machine's main network interface.

This provides certain advantages when it comes to security, but one that I thought was interesting is that because two processes inside different namespaces would have different networking environments, they could both bind to the same port -- and then could be accessed from outside via port forwarding.

To put that in more concrete terms: my goal was to be able to start up two Flask servers on the same machine, both bound to port 8080 inside their own namespace. I wanted to be able to access one of them from outside by hitting port 6000 on the machine, and the other by hitting port 6001.
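For contrast, here is a small sketch of what happens without namespaces: two sockets in the same networking environment cannot both bind to the same address and port. It uses only the standard library; the function name is made up for illustration, and it lets the kernel pick a free port rather than using 8080 so that it does not clash with anything already running.

```python
import socket


def second_bind_fails():
    """Bind one socket to a port, then try to bind another to the same port."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))      # port 0: let the kernel choose
        first.listen()
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            print(f"second bind to port {port} failed: {exc}")
            return True
        return False
    finally:
        second.close()
        first.close()


print(second_bind_fails())
```

Inside separate network namespaces each process has its own set of ports, so that clash does not arise.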

Here is a run through how I got that working; it's a lightly-edited set of my "lab notes".

[ Read more ]

Creating a time series from existing data in pandas

Posted on 9 May 2017 in Programming, Python

pandas is a high-performance library for data analysis in Python. It's generally excellent, but if you're a beginner or you use it rarely, it can be tricky to find out how to do quite simple things -- the code to do what you want is likely to be very clear once you work it out, but working it out can be relatively hard.

A case in point, which I'm posting here largely so that I can find it again next time I need to do the same thing... I had a list start_times of dictionaries, each of which had (amongst other properties) a timestamp and a value. I wanted to create a pandas time series object to represent those values.

The code to do that is this:

import pandas as pd
series = pd.Series(
    [cs["value"] for cs in start_times],
    index=pd.DatetimeIndex([cs["timestamp"] for cs in start_times])
)

Perfectly clear once you see it, but it did take upwards of 40 Google searches and help from two colleagues with a reasonable amount of pandas experience to work out what it should be.

Parsing website SSL certificates in Python

Posted on 9 December 2016 in Programming, Python, PythonAnywhere

A kindly PythonAnywhere user dropped us a line today to point out that StartCom and WoSign's SSL certificates are no longer going to be supported in Chrome, Firefox and Safari. I wanted to email all of our customers who were using certificates provided by those organisations.

We have all of the domains we host stored in a database, and it was surprisingly hard to find out how I could take a PEM-formatted certificate (the normal base-64 encoded stuff surrounded by "BEGIN CERTIFICATE" and "END CERTIFICATE") in a string and find out who issued it.

After much googling, I finally found the right search terms to get to this Stack Overflow post by mhawke, so here's my adaptation of the code:

from OpenSSL import crypto

for domain in domains:
    cert = crypto.load_certificate(crypto.FILETYPE_PEM, domain.cert)
    issuer = cert.get_issuer().CN
    if issuer is None:
        # This happened with a Cloudflare-issued cert
        continue
    if "startcom" in issuer.lower() or "wosign" in issuer.lower():
        # send the user an email
        pass

pam-unshare: a PAM module that switches into a PID namespace

Posted on 15 April 2016 in Linux, Programming, PythonAnywhere

Today in my 10% time at PythonAnywhere (we're a bit less lax than Google) I wrote a PAM module that lets you configure a Linux system so that when someone sus, sudos, or sshes in, they are put into a private PID namespace. This means that they can't see anyone else's processes, either via ps or via /proc. It's definitely not production-ready, but any feedback on it would be very welcome.

In this blog post I explain why I wrote it, and how it all works, including some of the pitfalls of using PID namespaces like this and how I worked around them.

[ Read more ]

SHA-1 sunset in Chromium, and libnss3

Posted on 6 August 2015 in Linux, Programming

This post is a combination of a description of a Chrome bug (fixed in May), a mea culpa, and an explanation of the way HTTPS certificates work. So there's something for everyone! :-)

Here's the situation -- don't worry if you don't understand all of this initially, a lot of it is explained later. Last year, the Chromium team decided that they should encourage site owners to stop using HTTPS certificates signed using the SHA-1 algorithm, which has security holes. The way they are doing this is by making the "padlock" icon in the URL bar show that a site is not secure if it has a certificate that expires after the end of 2015 if either the certificate itself is signed with SHA-1, or if any of the certificates in its chain are. I encountered some weird behaviour related to this when we recently got a new certificate for PythonAnywhere. Hopefully by posting about it here (with a bit of background covering the basics of how certificates work, including some stuff I learned along the way) I can help others who encounter the same problem.

tl;dr for people who understand certificates in some depth -- if any certificate in your chain, including your own cert, is signed with multiple hashing algorithms, then if you're using Chrome and have a version of libnss < 3.17.4 installed, Chrome's check to warn about SHA-1 signatures, instead of looking at the most-secure signature for each cert, will look at the least-secure one. So your certificate will look like it's insecure even if it's not. Solution for Ubuntu (at least for 14.04 LTS): sudo apt-get install libnss3. Thank you so much to Vincent G on Server Fault for working out the fix.

Here's the background. It's simplified a bit, but I think is largely accurate -- any corrections from people who know more about this stuff than I do would be much appreciated!

Public/private keys

Most readers here probably have a decent working knowledge of asymmetrical cryptography, so I'm going to skip a lot of detail here; there are excellent primers on public key encryption all over the Internet and if you need to know more, google for one that suits you.

But if you just want to get through this post, here's the stuff you need to know: public and private keys are large numbers, generated in public/private pairs. Each of them can be used on its own to encrypt data. Stuff that is encrypted with the private key can be decrypted with the public key, and vice versa.

It is almost impossibly unlikely that a particular bit of data encrypted with one private key would be the same as the same data encrypted with a different private key. So, if I want to prove that I sent you something, and you already have a copy of my public key (let's ignore how you got hold of it for now), I can send you the data encrypted with my private key, and if you can decrypt it then it's pretty much guaranteed that it was me who sent it -- or at least it was someone who has a copy of my private key.

Furthermore, we can extend this to a concept of digital signatures. If I want to send you some data and to prove that it came from me, I can use a hash function to reduce that data down to a short(ish) number, then I can encrypt that hash with my private key. I then send you the data, with the encrypted hash tacked onto the end as a signature. To verify that it really came from me, you hash the data using the same algorithm, then decrypt the signature using my public key. If they match, then you know I really did sign it.
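The hash-then-encrypt idea described above can be sketched with a toy RSA key pair. This is for illustration only: the key is built from two well-known Mersenne primes and is far too small to be secure, the hash is reduced modulo N so that it fits, and all of the names here are made up. Real signature schemes add padding and use much larger keys.

```python
import hashlib

# Toy RSA key pair. Both factors are known Mersenne primes; do not use for anything real.
P = 2 ** 61 - 1
Q = 2 ** 89 - 1
N = P * Q                                   # part of both keys
E = 65537                                   # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))           # private exponent (needs Python 3.8+)


def digest(data: bytes) -> int:
    """Hash the data down to a number smaller than N."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def sign(data: bytes) -> int:
    """'Encrypt' the hash with the private key."""
    return pow(digest(data), D, N)


def verify(data: bytes, signature: int) -> bool:
    """'Decrypt' the signature with the public key and compare with our own hash."""
    return pow(signature, E, N) == digest(data)


message = b"I owe you a coffee"
signature = sign(message)
print(verify(message, signature))
print(verify(b"I owe you a car", signature))
```

The data itself travels in the clear; only the short hash goes through the private-key operation.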

This has a couple of advantages over simply encrypting the whole thing using my private key -- the most relevant for this post being that the data itself is in the clear: we're only using the encryption to prove who provided it.

Certificates

When you want to run an HTTPS site, you need to get a certificate. This is basically some data that states the domain that's covered by the certificate and who owns it, and a public key. It basically claims "the owner of the private key associated with this public key is the person that this data relates to". It also has some data saying "this is who says that this claim is true" -- the issuer. So, for the certificate that you bought from, say, Joe's SSL Certificates, the issuer will be some identifier for that company.

So, the question is, how do we stop people from just issuing themselves certificates saying that Joe has said that the private/public key pair they've just generated is the correct one for google.com? The certificate is digitally signed using Joe's SSL Certificates' private key. So, assuming that the browser has Joe's SSL Certificates' public key, and it trusts Joe to only sign certificates that he really knows are OK, it just uses Joe's public key to validate the signature.

Browsers come with a bunch of public keys installed as ones they should trust (actually, some rely on a list provided by the operating system). They're called "root CAs", and in this case, we're saying that Joe's SSL Certificates is one of them. In the real world, maybe not every browser in the world does trust the specific issuer who signed your certificate.

Certificate chains

What happens under these circumstances is that Joe gets his own certificate. This includes his public key, and is signed by someone else. So, the browser receives your certificate and Joe's when it visits your website. It checks your certificate against Joe's, using the public key in Joe's certificate, and then it checks Joe's against the certificate it has stored for whoever signed Joe's one.

And obviously this can go on and on; perhaps Joe's certificate was signed by Amy, who your browser doesn't trust... so your web server has to send your certificate, and Joe's certificate, and Amy's certificate, and Amy's certificate is signed by someone the browser does trust and we're done.

This could in theory go on pretty much indefinitely, with hundreds of certificates being sent, giving a chain of trust from your certificate up to someone the browser really does trust -- that is, back to an issuer and their associated public key that came packaged with Chrome or Firefox or whatever browser is being used. But in practice, chains are normally between two and five certificates long.

So, if you've ever set up SSL for a domain on PythonAnywhere or on your own server, now you know what all that stuff with the "certificate chain" or "bundle" was. You were putting together a set of certificates that started with your own one, then went from there across a sequence of signatures and public keys to one that all web browsers trust by default.
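As a rough sketch of the chain-walking logic described in this section, the following checks that each certificate names the next one in the chain as its issuer, and that the last one was issued by someone in the browser's trusted list. The Cert class and function name are made up for illustration, and the sketch deliberately leaves out the signature checks that a real browser performs at every step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    subject: str   # who the certificate is about
    issuer: str    # who vouches for it


def chain_is_trusted(chain, trusted_roots):
    """chain is ordered from the site's own certificate upwards."""
    if not chain:
        return False
    # Each certificate must have been issued by the next one in the chain...
    for cert, parent in zip(chain, chain[1:]):
        if cert.issuer != parent.subject:
            return False
    # ...and the last one must have been issued by someone we already trust.
    return chain[-1].issuer in trusted_roots


chain = [
    Cert(subject="example.com", issuer="Joe"),
    Cert(subject="Joe", issuer="Amy"),
    Cert(subject="Amy", issuer="Some Root CA"),
]
print(chain_is_trusted(chain, trusted_roots={"Some Root CA"}))
```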

Signature hashing algorithms

One thing we've glossed over a bit until now is the hashing algorithm. It's really important that different certificates hash to different numbers. Let's imagine we had a somewhat naive algorithm:

def hash(certificate):
    return 4

I could get a certificate issued for my own domain, which would include as its signature the number 4 encrypted by my issuer's private key. Because I know that every other certificate in the world also hashes to the number 4, I can just change my certificate so that it says that it relates to google.com, keeping the same public key, and it will still pass the signature validation browsers do. This means that if I can somehow trick your computer (through some kind of clever network hackery) into thinking that my website is google.com, then I can present my hacked certificate when you visit, and your browser will accept it and show you a green padlock in the URL bar. This is obviously not a Good Thing.
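To see the difference a proper hash makes, here is a small comparison between the naive function above and SHA-256 from Python's standard library. The "certificates" are just made-up byte strings standing in for real certificate data.

```python
import hashlib


def naive_hash(certificate: bytes) -> int:
    """The deliberately useless hash from above: everything maps to 4."""
    return 4


def sha256_hash(certificate: bytes) -> str:
    return hashlib.sha256(certificate).hexdigest()


# Hypothetical certificate contents, for illustration only
mine = b"domain=example.com;public_key=1234"
forged = b"domain=google.com;public_key=1234"

# With the naive hash, a signature over one is also valid for the other
print(naive_hash(mine) == naive_hash(forged))      # True

# With a real hash, changing the domain changes the value that was signed
print(sha256_hash(mine) == sha256_hash(forged))    # False
```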

This is where we can actually start talking about SHA-1. It's a hashing algorithm designed by the NSA back in 1995. It's no longer regarded as being very good -- initial problems started surfacing in 2005. This is for cryptographic values of "not very good" -- it's still pretty much impossible to produce a valid certificate that would hash to the same value as one that you're trying to impersonate. But it's possible in theory, and has been for quite a while, so it's time to move on to SHA-2. SHA-2 is also an NSA creation -- it's worth noting that the Snowden revelations don't seem to have done the hashing algorithm's reputation any harm, and it's still regarded as the way to go.

(Just to confuse matters a bit, SHA-2 is actually a family of hash functions, called SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, and SHA-512/256. So if someone is talking about something hashed with SHA-256, they could equally, if less precisely, say that it was hashed with SHA-2. I'll try to use SHA-2 in the remainder of this blog post to stop my own brain from melting when I reread it...)

Sign it once, sign it twice

So there are different hash algorithms that can be used to sign things. But that obviously could cause problems -- given a particular signature, how does the person checking the signature know which one was used? The simple fix is to add an extra field to the signature saying which algorithm was used.

And what if the checker doesn't know how to perform the particular hash that the signature requires? After all, this is the Internet, and there are computers running software of all different ages knocking around. The answer is that each certificate can have multiple signatures, all using different hashing algorithms but encrypted with the same private key (and so, decryptable with the same public key). It's simple to do -- hash the data once using SHA-1, encrypt that with your private key, and store that as the SHA-1 signature. Then hash the original data again using SHA-2, encrypt the result with the same private key, and then store that as the SHA-2 one.

Once that's done, a browser trying to validate a certificate can go through the signatures, and find the most secure signature it understands the hashing algorithm for. Then it can check that one and ignore the rest. Old software that doesn't understand the latest and greatest hashing algorithms has lower security than new software, but it'll still have something. And new software can stay ahead and always choose the newest algorithm.
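The selection step described above might look something like this sketch. The strength ranking, the dictionary layout and the function name are all assumptions made for illustration; they are not taken from any real browser or library.

```python
# Hypothetical ranking: a higher number means a stronger hashing algorithm
STRENGTH = {"sha1": 1, "sha2": 2}


def pick_signature(signatures, understood):
    """Return the name of the strongest algorithm the checker can handle.

    signatures maps algorithm name -> signature value;
    understood is the set of algorithm names this checker supports.
    Returns None if there is no overlap at all.
    """
    candidates = [alg for alg in signatures if alg in understood]
    if not candidates:
        return None
    return max(candidates, key=lambda alg: STRENGTH.get(alg, 0))


signatures = {"sha1": "sig-a", "sha2": "sig-b"}
print(pick_signature(signatures, understood={"sha1", "sha2"}))   # sha2
print(pick_signature(signatures, understood={"sha1"}))           # sha1
```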

The sunset

So, SHA-1 bad, SHA-2 good. Google quite understandably wanted to reduce the number of sites using certificates signed with the old, broken hashing algorithm. So they decided that when your browser loads a website, it will look along the chain and see if there are any certificates there that are SHA-1 only. If the site's certificate is expiring soon (as in, before the end of 2015), they'll accept it anyway. That means that no-one with an old SHA-1 certificate was forced to upgrade quickly. But if it expires later than that, then they show a "broken security" warning. If you click on that, it tells you that one of the certificates in the chain uses SHA-1 so it's not secure.
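Expressed as code, the rule as described here comes out roughly as below. This is a sketch of the policy as this post explains it, not Chrome's actual implementation; the function name and the way the chain is represented (one set of hash algorithm names per certificate) are assumptions.

```python
from datetime import date

END_OF_2015 = date(2015, 12, 31)


def shows_warning(site_cert_expiry, chain_algorithms):
    """Decide whether to show the "broken security" warning.

    site_cert_expiry: expiry date of the site's own certificate.
    chain_algorithms: one set of hash algorithm names per certificate in the chain.
    """
    any_sha1_only = any(algs == {"sha1"} for algs in chain_algorithms)
    expires_late = site_cert_expiry > END_OF_2015
    return any_sha1_only and expires_late


# A SHA-1-only certificate in the chain, expiring after the cut-off: warning
print(shows_warning(date(2016, 8, 1), [{"sha1", "sha2"}, {"sha1"}]))
# The same chain, but expiring before the end of 2015: accepted anyway
print(shows_warning(date(2015, 8, 1), [{"sha1", "sha2"}, {"sha1"}]))
```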

The problem

Phew! Finally I can tell you what triggered this whole infodump :-)

We recently got a new certificate for PythonAnywhere. Our current one expires in August 2015, so we needed a new one for another year. I was checking it out prior to pushing it to our live service, and got the "broken security" warning. "Tsk", I thought, "what a useless certificate issuer we have!" A bit of googling around led me to pages that seemed to be claiming that our issuer had a habit of issuing SHA-1 certs and you had to kick them to get an SHA-2 one. So I sent them a grumpy email asking for an SHA-2 version of our cert, or whichever one in the chain it was. (At this point I didn't realise that a certificate could have multiple signatures, and -- I think -- the version of Chrome I was using to explore the certificate chain on the installed certificate on the test servers only showed me the SHA-1 signatures when I checked.)

Then a client provided us with a new cert for their own domain. We installed it, and bang -- they got the same warning. I let them know about what the problem appeared to me to be. But after a couple of back-and-forwards with their certificate issuers, mediated by them (and I have to thank them for their patience in this) I started doubting myself. My computer was showing the SHA-1 errors in Chrome. A Crunchbang virtual machine I was using, likewise. But the client themselves couldn't see it. They checked on Firefox, but Mozilla aren't doing this sunsetting thing, so that was expected. But then they checked in Chrome on Windows, and they didn't get it. The same Chrome version as I was running, but no warning.

The clincher for me was when my colleague Glenn checked it in his browser, running on Ubuntu, just like me. Same Chrome version... no error. There was obviously something wrong with my own machine! A whole bunch of Googling later and I found this answer to a Server Fault post, by Vincent G.

There's an Ubuntu library, libnss3, described as "Network Security Service libraries". In version 3.17.4, there was a fix to a bug described as "NSS incorrectly preferring a longer, weaker chain over a shorter, stronger chain". It looks like this was pushed live in May.

I use i3 as my window manager, which means that the normal Ubuntu "you have updates you need to install" stuff doesn't happen, so I need to update my OS manually. It looks like it was a while since I did that... (Cue spearphishing attacks.)

I updated, and suddenly both our certificate and the client's looked OK. Sincere apologies emailed to both clients and to our respective CAs...

So, just to reiterate what happened... both we and our clients were issued with new certificates. These expired after the end of 2015. Each certificate in the chain up to a root cert was signed with SHA-2 hashing, and (for backward compatibility) was also signed with SHA-1. When loaded into a Chrome with no buggy libraries, the browser would look along the chain and recognise that every certificate had an SHA-2 signature, so it would decide it was fine. But in my version with the buggy libnss3, it would look along the chain and spot the SHA-1 signatures. It would erroneously decide to ignore the SHA-2 ones, and would report the certificate as broken.

The moral of the story? Keep updating your system. And if something looks broken, check it on as many OS/browser combinations as possible... On the other hand, when select really is broken, it's a real pain to debug.