It’s election week here in the UK; on Thursday, we’ll be going to the polls to choose our next government. At Resolver Systems, thanks to the energy and inventiveness of our PR guys over at Chameleon, we’ve been doing a bunch of things related to this, including some analysis for the New Statesman that required us to index vast quantities of tweets and newspaper articles.
Last week I was looking at the results of this indexing, and was reminded of the fun I had playing with NLTK back in February. NLTK is the Python Natural Language Toolkit; as you’d expect, it has a lot of clever stuff for parsing and interpreting text. More unexpectedly (at least for me), it has the ability to take some input text, analyse it, and then generate more text in the same style. Here’s something based on the Book of Genesis:
In the selfsame day entered Noah , and asses , flocks , and Maachah . And Joseph said unto him , Abrah and he asses , and told all these things are against me . And Jacob told Rachel that he hearkened not unto you . And Sarah said , I had seen the face of the air ; for he hath broken my covenant between God and every thing that creepeth upon the man : And Eber lived after he begat Salah four hundred and thirty years , and took of every sort shalt thou be come thither .
It was the work of a moment to knock together some code that would read in all of the newspaper articles that we’d tagged as being about a particular subject, run them through a Beautiful Soup-based parser to pull out the article text, and feed that into NLTK, then to dump the results into a WordPress blog (after a little manual polishing for readability).
They’re interested in local government, free TV licences, pension credits and child trust fund, Carrousel Capital, run by local Liberal Democrats. TV Exclusive Trouser Clegg Nick Clegg, but clashed on how the vexing issue of honesty, principles and policies of electric shock. It is easy to do. "Louis Vuitton advertising used to pay back your debts", he declared that he has delivered his strongest warning yet on the party first place and still obsessed with outdated class structures. Inspired by Barack Obama’s repertoire, they advise you to send a message to voters at home. "You haven’t want to try to counter the threat of it yet," he says.
So, what does the code look like? It’s actually trivially simple. Let’s say that we’ve downloaded all of the contents of the newspaper articles (I shan’t waste your time with HTML-munging code here) and put them into objects with content fields. Here’s what REABot does:
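    import nltk

    # Split text into runs of word characters or runs of punctuation
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|[^\w\s]+')

    # Join every article into one big string and tokenise it
    content_text = ' '.join(article.content for article in articles)
    tokenized_content = tokenizer.tokenize(content_text)

    # Build a trigram model from the tokens
    content_model = nltk.NgramModel(3, tokenized_content)

    # Pick two starting words, then generate the article itself
    # (words_to_generate is the desired length in tokens, set elsewhere)
    starting_words = content_model.generate(100)[-2:]
    content = content_model.generate(words_to_generate, starting_words)
    print(' '.join(content))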
It’s a bit of a hack (I’m sure an NLTK expert could write something much more elegant) but it works :-) What this does is generate a single string, formed of the text of all of our relevant articles, and run it through a tokeniser, which splits it up into words and punctuation symbols, so that (for example) the string
"I spent some time this afternoon playing with NLTK, the Python Natural Language Toolkit; the book is highly recommended."
would be turned into the list
['I', 'spent', 'some', 'time', 'this', 'afternoon', 'playing', 'with', 'NLTK', ',', 'the', 'Python', 'Natural', 'Language', 'Toolkit', ';', 'the', 'book', 'is', 'highly', 'recommended', '.']
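If you want to try the tokenising step without installing NLTK, the same regular expression can be fed to the standard library's re.findall. This is only a sketch, on the assumption that for this pattern re.findall returns the same tokens as RegexpTokenizer; the tokenize function here is my own stand-in, not part of NLTK.

```python
import re

# The pattern from the RegexpTokenizer call above: either a run of word
# characters, or a run of characters that are neither word nor whitespace.
TOKEN_PATTERN = r'\w+|[^\w\s]+'

def tokenize(text):
    """Stand-in for tokenizer.tokenize(text), using only the standard library."""
    return re.findall(TOKEN_PATTERN, text)

sentence = ("I spent some time this afternoon playing with NLTK, the Python "
            "Natural Language Toolkit; the book is highly recommended.")
tokens = tokenize(sentence)
print(tokens)
```

Note that punctuation comes out as separate tokens, which is why the generated text has spaces before its commas and full stops.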
This is then fed into an NgramModel. This is nothing to do with Scientology; Ngram is a word created by extension from “bigram” and “trigram” to refer to collections of n tokens. What we’re doing with the expression nltk.NgramModel(3, tokenized_content) is creating an NLTK object that, in effect, knows about every three-token sequence (trigram) that occurs in the tokenised text (['I', 'spent', 'some'], ['spent', 'some', 'time'], ['some', 'time', 'this'], and so on), and knows how frequently each one occurs.
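To make "knows about every trigram and how often it occurs" a bit more concrete, here is a rough sketch of the counting step using collections.Counter. It is not how NgramModel is implemented internally (I haven't checked), and the toy token list and the count_trigrams name are made up for illustration.

```python
from collections import Counter

def count_trigrams(tokens):
    """Count every run of three consecutive tokens in the list."""
    return Counter(zip(tokens, tokens[1:], tokens[2:]))

# Made-up toy corpus, chosen so that one trigram appears twice.
toy_tokens = ['the', 'cat', 'sat', 'on', 'the', 'cat', 'sat', 'down']
trigram_counts = count_trigrams(toy_tokens)

for trigram, count in trigram_counts.most_common():
    print(trigram, count)
```

Here ('the', 'cat', 'sat') is counted twice and the other trigrams once each; those counts are what get turned into probabilities when generating text.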
Once we’ve got the set of all possible trigrams and their respective frequencies, it’s pretty easy to see how we can generate some text given two starting words and a simple Markov-chain algorithm:
- Let’s say that we start off with ['The', 'tabloid'].
- Our analysis might tell us that there are three trigrams starting with those two tokens: ['The', 'tabloid', 'headlines'] 50% of the time, ['The', 'tabloid', 'newspapers'] 10% of the time, and ['The', 'tabloid', 'titles'] 40% of the time.
- We generate a random number between 0 and 1. If it’s less than 0.5, we emit “headlines”; if it’s between 0.5 and 0.6, we emit “newspapers”; and if it’s between 0.6 and 1.0, we emit “titles”. Let’s say it was 0.7, so we now have ['The', 'tabloid', 'titles'].
- The next step is to look at the trigrams starting ['tabloid', 'titles']; we work out the probabilities, roll the dice again, and get (say) ['tabloid', 'titles', 'have'].
- Repeat a bunch of times, and we can generate an arbitrarily long text.
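The dice-rolling step in that list can be sketched in a few lines. The probabilities are the invented ones from the example, and pick_next is a hypothetical helper of mine rather than anything NLTK provides.

```python
import random

# Invented probabilities for the tokens that follow ['The', 'tabloid'].
next_word_probs = [('headlines', 0.5), ('newspapers', 0.1), ('titles', 0.4)]

def pick_next(probs, roll):
    """Return the word whose cumulative-probability band contains roll (0 <= roll < 1)."""
    cumulative = 0.0
    for word, probability in probs:
        cumulative += probability
        if roll < cumulative:
            return word
    return probs[-1][0]  # guard against floating-point rounding at the top end

text = ['The', 'tabloid']
text.append(pick_next(next_word_probs, 0.7))  # the roll from the example
print(text)

# A real run would roll the dice instead of hard-coding 0.7:
print(pick_next(next_word_probs, random.random()))
```

With a roll of 0.7 the running total passes 0.5 and 0.6 before reaching 1.0, so “titles” is emitted, matching the walk-through.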
This is pretty much what the generate method does. Of course, the question is, how do we get two words to start with? By default, the method will always use the first two tokens in the input text, which means that every article we generate based on the same corpus starts with the same words. (Those who know the Bible will now know why the bit from Genesis started with the words “In the”.)
I worked around this by telling it to first generate a 100-token stream of text and pick out the last two:
starting_words = content_model.generate(100)[-2:]
…and then to generate the real output using those two as the starting point:
content = content_model.generate(words_to_generate, starting_words)
It’s kind of (but not really ;-) like seeding your random number generator.
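For the curious, here is a standard-library sketch of the whole loop, including the warm-up trick for choosing the starting words. It is a simplified stand-in for NgramModel, not its real implementation: build_model and generate are hypothetical names, the corpus is a toy, and I wrap the corpus around to its own beginning so that the chain can never hit a dead end (an assumption of mine to keep the example short).

```python
import random
from collections import Counter, defaultdict

def build_model(tokens):
    """Map each two-token context to a Counter of the tokens seen after it."""
    wrapped = tokens + tokens[:2]  # simplification: loop back to the start
    model = defaultdict(Counter)
    for first, second, third in zip(wrapped, wrapped[1:], wrapped[2:]):
        model[(first, second)][third] += 1
    return model

def generate(model, length, start, rng):
    """Generate up to `length` tokens, beginning with the two tokens in `start`."""
    output = list(start)
    while len(output) < length:
        followers = model.get(tuple(output[-2:]))
        if not followers:  # context never seen in the corpus: stop early
            break
        words = list(followers)
        weights = list(followers.values())
        output.append(rng.choices(words, weights=weights)[0])
    return output

# Toy corpus, already tokenised.
corpus = ("the tabloid titles have the story . the tabloid headlines have "
          "the story too . the broadsheet titles have the other story .").split()
model = build_model(corpus)
rng = random.Random()

# Warm up from the default start, then keep only the last two tokens.
starting_words = generate(model, 100, corpus[:2], rng)[-2:]
article = generate(model, 20, starting_words, rng)
print(' '.join(article))
```

Because the warm-up run is itself random, the two starting words (and so the article) differ from run to run, even though every run begins the warm-up at the same place.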
And that’s it! Once the text has been generated, I just copy and paste it into a WordPress blog, do a bit of prettification (for example, remove the spaces from before punctuation and — perhaps this is cheating a little — balance brackets and quotation marks), add appropriate tags, and hit publish. It takes about 5 minutes to generate an article, and to be honest I think the end result is better than a lot of the political blogs out there…
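I do the prettification by hand, but the simplest part of it, removing the space in front of punctuation, could be automated with something like the sketch below. The tidy_punctuation function is hypothetical and only handles a handful of punctuation marks; balancing brackets and quotation marks would still be a manual job.

```python
import re

def tidy_punctuation(text):
    """Remove the whitespace that token-joining leaves before punctuation."""
    return re.sub(r'\s+([,.;:!?])', r'\1', text)

raw = "And Sarah said , I had seen the face of the air ; for he hath broken my covenant ."
tidied = tidy_punctuation(raw)
print(tidied)
```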
[An aside to UK readers: does anyone know if the business news in The Day Today was generated by something like this?]