Generating political news using NLTK

Posted on 4 May 2010 in Funny, Politics, Programming, Python, Resolver Systems

It's election week here in the UK; on Thursday, we'll be going to the polls to choose our next government. At Resolver Systems, thanks to the energy and inventiveness of our PR guys over at Chameleon, we've been doing a bunch of things related to this, including some analysis for the New Statesman that required us to index vast quantities of tweets and newspaper articles.

Last week I was looking at the results of this indexing, and was reminded of the fun I had playing with NLTK back in February. NLTK is the Python Natural Language Toolkit; as you'd expect, it has a lot of clever stuff for parsing and interpreting text. More unexpectedly (at least for me), it has the ability to take some input text, analyse it, and then generate more text in the same style. Here's something based on the Book of Genesis:

In the selfsame day entered Noah , and asses , flocks , and Maachah . And Joseph said unto him , Abrah and he asses , and told all these things are against me . And Jacob told Rachel that he hearkened not unto you . And Sarah said , I had seen the face of the air ; for he hath broken my covenant between God and every thing that creepeth upon the man : And Eber lived after he begat Salah four hundred and thirty years , and took of every sort shalt thou be come thither .

It was the work of a moment to knock together some code that would read in all of the newspaper articles that we'd tagged as being about a particular subject, run them through a Beautiful Soup-based parser to pull out the article text, and feed that into NLTK, then to dump the results into a Wordpress blog (after a little manual polishing for readability).
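The text-extraction step can be sketched roughly as follows. This is not the code we used: it swaps Beautiful Soup for the standard library's html.parser so that it runs without extra installs, and it assumes (purely for illustration) that the article body lives in plain p tags. The names ParagraphExtractor and article_text, and the sample HTML, are made up for the example.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> tags and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False  # True while we are between <p> and </p>
        self.paragraphs = []       # finished paragraphs, in document order
        self._buffer = []          # text fragments of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = []

    def handle_data(self, data):
        if self.in_paragraph:
            self._buffer.append(data)


def article_text(html):
    """Return the paragraph text of an HTML page, one paragraph per line."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)


# Hypothetical article markup: navigation chrome plus two paragraphs.
sample = (
    "<html><body><div id='nav'>Home | News</div>"
    "<p>The leaders met   <b>today</b>.</p>"
    "<p>Polls open on Thursday.</p></body></html>"
)
corpus = article_text(sample)
print(corpus)
```

The resulting string is the sort of thing you would then tokenise and hand to a text generator; real newspaper pages need site-specific rules to skip captions, comments and the like.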


Regular expressions and Resolver One column-level formulae

Posted on 26 April 2010 in Programming, Python, Resolver One

Recently at Resolver we've been doing a bit of analysis of the way people, parties and topics are mentioned on Twitter and in the traditional media in the run-up to the UK's next national election, on behalf of the New Statesman.

We've been collecting data, including millions of tweets and indexes to newspaper articles, in a MySQL database, using Django as an ORM -- sometime in the future I'll describe the system in a little more depth. However, from our perspective the most interesting thing about it is how we're doing the analysis -- in, of course, Resolver One.

Here's one little trick I've picked up: using regular expressions in column-level formulae as a way of parsing the output of MySQL queries.

Let's take a simple example. Imagine you have queried the database for the number of tweets per day about the Digital Economy Bill (or Act). It might look like this:

+------------+----------+
| Date       | count(*) |
+------------+----------+
| 2010-03-30 |       99 |
| 2010-03-31 |       30 |
| 2010-04-01 |       19 |
| 2010-04-02 |       12 |
| 2010-04-03 |        2 |
| 2010-04-04 |       13 |
| 2010-04-05 |       30 |
| 2010-04-06 |      958 |
| 2010-04-07 |     1629 |
| 2010-04-08 |     1961 |
| 2010-04-09 |     4038 |
| 2010-04-10 |     2584 |
| 2010-04-11 |     1940 |
| 2010-04-12 |     3333 |
| 2010-04-13 |     2421 |
| 2010-04-14 |     1319 |
| 2010-04-15 |     1387 |
| 2010-04-16 |     3194 |
| 2010-04-17 |      860 |
| 2010-04-18 |      551 |
| 2010-04-19 |      859 |
| 2010-04-20 |      685 |
| 2010-04-21 |      528 |
| 2010-04-22 |      631 |
| 2010-04-23 |      591 |
| 2010-04-24 |      320 |
| 2010-04-25 |      363 |
| 2010-04-26 |      232 |
+------------+----------+

Now, imagine you want to get these numbers into Resolver One, and because it's a one-off job, you don't want to go to all the hassle of getting an ODBC connection working all the way to the DB server. So, first step: copy from your PuTTY window, and second step, paste it into Resolver One:

Shot 1

Right. Now, the top three rows are obviously useless, so let's get rid of them:

Shot 2

Now we need to pick apart things like | 2010-03-30 | 99 | and turn them into separate columns. The first step is to import the Python regular expression library:

Shot 3

...and the next, to use it in a column-level formula in column B:

Shot 4

Now that we've parsed the data, we can use it in further column-level formulae to get the dates:

Shot 5

...and the numbers:

Shot 6
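For readers following along without the screenshots, the parsing step can be approximated in ordinary Python. This is a sketch rather than the exact formulae from the spreadsheet: the pattern and the helper name parse_row are my own choices for illustration, and in Resolver One the equivalent logic is spread across the column-level formulae for columns B onwards.

```python
import re
from datetime import date

# One pasted row looks like "| 2010-03-30 |       99 |".
# Capture year, month, day and the count, allowing any amount of padding.
ROW_PATTERN = re.compile(r"\|\s*(\d{4})-(\d{2})-(\d{2})\s*\|\s*(\d+)\s*\|")


def parse_row(line):
    """Return (date, count) for a data row, or None for borders and headers."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        return None
    year, month, day, count = (int(group) for group in match.groups())
    return date(year, month, day), count


print(parse_row("| 2010-03-30 |       99 |"))
print(parse_row("+------------+----------+"))
```

Returning None for lines that don't match means a stray border or header row pasted into column A doesn't break the rest of the column.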

Finally, let's pick out the top 5 dates for tweets on this subject; we create a list...

Shot 7

...sort it by the number of tweets in each day...

Shot 8

...reverse it to get the ones with the largest numbers of tweets...

Shot 9

...and then use the "Unpack" command (control-shift-enter) to put the first five elements into separate cells.

Shot 10
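The list-building steps above look roughly like this in plain Python. It's a sketch using a slice of the figures from the table earlier in the post; the variable names are made up for the example, and the final slice stands in for Resolver One's "Unpack" command.

```python
# (date, tweet count) pairs, taken from the query output above.
pairs = [
    ("2010-04-06", 958),
    ("2010-04-07", 1629),
    ("2010-04-08", 1961),
    ("2010-04-09", 4038),
    ("2010-04-10", 2584),
    ("2010-04-11", 1940),
    ("2010-04-12", 3333),
    ("2010-04-13", 2421),
    ("2010-04-14", 1319),
    ("2010-04-15", 1387),
    ("2010-04-16", 3194),
]

# Sort by the number of tweets on each day (smallest first)...
by_count = sorted(pairs, key=lambda pair: pair[1])

# ...reverse so that the busiest days come first...
by_count.reverse()

# ...and keep the first five elements.
top_five = by_count[:5]

for day, count in top_five:
    print(day, count)
```

In Python you could collapse the sort and reverse into sorted(pairs, key=..., reverse=True); they're kept separate here to mirror the steps in the spreadsheet.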

Now that we've done this once, it's easy to reuse for other data; for example, we might want to find the five days when Nick Clegg was mentioned most on Twitter. We just copy the same kind of numbers from MySQL, paste them into column A, and the list will automatically update:

Shot 11

So, a nice simple technique to create a reusable spreadsheet that parses tabular data.

OpenCL: .NET, C# and Resolver One integration -- the very beginnings

Posted on 18 March 2010 in GPU Computing, Programming, Python, Resolver One

Today I wrote the code required to call part of the OpenCL API from Resolver One; just one function so far, and all it does is get some information about your hardware setup, but it was great to get it working. There are already .NET bindings for OpenCL, but I felt that it was worthwhile reinventing the wheel -- largely as a way of making sure I understood every spoke, but also because I wanted the simplest possible API, with no extra code to make it more .NETty. It should also work as an example of how you can integrate a C library into a .NET/IronPython application like Resolver One.

I'll be documenting the whole thing when it's a bit more finished, but if you want to try out the work in progress, and are willing to build the interop code, here's how:

  • Make sure you have OpenCL installed -- here's the NVIDIA OpenCL download page, and here's the OpenCL page for ATI. I've only tested this with NVIDIA so far, so I'm keen to hear of any incompatibilities.
  • Clone the dot-net-opencl project from Resolver Systems' GitHub account.
  • Load up the DotNetOpenCL.sln project file in the root of the project using Visual C# 2008 (here's the free "Express" version if you don't have it already).
  • Build the project
  • To try it out from IronPython, run ipy test_clGetPlatformIDs.py
  • To try it in Resolver One, load test_clGetPlatformIDs.rsl

That should be it! If you want to look at the code, the only important bit is in DotNetOpenCL.cs -- and it's simply an external method definition... the tricky bit was in working out which OpenCL function to write an external definition for, and what that definition should look like.
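To give a flavour of what an external definition for a C function involves, here is the same idea expressed with CPython's ctypes module. To be clear, this is an analogue and not the project's code: the real thing is a C# external method definition in DotNetOpenCL.cs, and this will not run under IronPython in Resolver One. The function name count_opencl_platforms is made up for the example; the C signature in the comment is the one from the OpenCL specification.

```python
import ctypes
import ctypes.util


def count_opencl_platforms():
    """Return the number of OpenCL platforms, or None if OpenCL is unavailable."""
    path = ctypes.util.find_library("OpenCL")
    if path is None:
        return None

    # On Windows the OpenCL API uses the stdcall convention, so prefer WinDLL
    # where it exists; elsewhere fall back to the ordinary CDLL loader.
    loader = getattr(ctypes, "WinDLL", ctypes.CDLL)
    try:
        opencl = loader(path)
        get_platform_ids = opencl.clGetPlatformIDs
    except (OSError, AttributeError):
        return None

    # cl_int clGetPlatformIDs(cl_uint num_entries,
    #                         cl_platform_id *platforms,
    #                         cl_uint *num_platforms);
    get_platform_ids.restype = ctypes.c_int32
    get_platform_ids.argtypes = [
        ctypes.c_uint32,
        ctypes.POINTER(ctypes.c_void_p),
        ctypes.POINTER(ctypes.c_uint32),
    ]

    # Passing no output array just asks how many platforms there are.
    count = ctypes.c_uint32(0)
    status = get_platform_ids(0, None, ctypes.byref(count))
    if status != 0:  # CL_SUCCESS is 0; anything else is an error code
        return None
    return count.value


platform_count = count_opencl_platforms()
print(platform_count)
```

As with the C# version, almost all of the work is in getting the argument and return types to match the C declaration; the call itself is a single line.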

I've put a slightly tidied version of the notes I kept as I implemented this below, for posterity's sake; if you're interested in finding out how the implementation went, read on...


Playing with NLTK

Posted on 18 February 2010 in Programming, Python

I spent some time this afternoon playing with NLTK, the Python Natural Language Toolkit; the book is highly recommended. I couldn't resist sharing the results of the first time I tried its generate function, which takes a particular text and tries to automatically generate more text in the same style.

First, from the Book of Genesis:

In the selfsame day entered Noah , and asses , flocks , and Maachah . And Joseph said unto him , Abrah and he asses , and told all these things are against me . And Jacob told Rachel that he hearkened not unto you . And Sarah said , I had seen the face of the air ; for he hath broken my covenant between God and every thing that creepeth upon the man : And Eber lived after he begat Salah four hundred and thirty years , and took of every sort shalt thou be come thither .

Next, Jane Austen:

[ Sense and Sensibility by Jane Austen 1811 ] CHAPTER 1 The family of his relations , I know Marianne ' s indisposition , beyond giving time for the world to seeing YOU .-- Very far from being weary of her feeling how more than commonly joyful . He was four years . Every thing was disregarded . I had called in Berkeley Street , January . " Mrs . Jennings , sagaciously smiling , " For my part , not being Willoughby , where much of his fate with Miss Morton , with two pointers playing round him ,

And finally, from a corpus of IRC conversations:

now im lost i agree i saw the whole thing JOIN PART google 's start page .. lol yeah , he said this morning he is hoping they win tonight im gay ....... im happy as hell ...... jus aint #### lol U42 , how are you NICK : U23 what a combo hehehe JOIN . ACTION pictures the blues brothers behind that chicken wire screen . (((((((((( U34 ))))))))))))) Hi U7 ......... how are ya ll gotta watch my manners or she wo n't you play another somebody done somebody wrong song ? JOIN . ACTION wonders if U16

Scarily accurate :-)
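For a rough idea of how this kind of generation can work, here's a tiny word-level Markov chain in plain Python. It's a stand-in to show the general n-gram idea, not NLTK's implementation: build_model and generate are names invented for this sketch, and the sample sentence is made up rather than taken from a real corpus.

```python
import random
from collections import defaultdict


def build_model(words, order=2):
    """Map each run of `order` words to the list of words seen after it."""
    model = defaultdict(list)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        model[context].append(words[i + order])
    return model


def generate(model, seed, length=30, rng=None):
    """Extend `seed` one word at a time by sampling from the model."""
    rng = rng or random.Random()
    output = list(seed)
    order = len(seed)
    while len(output) < length:
        followers = model.get(tuple(output[-order:]))
        if not followers:
            break  # this context never appeared mid-text, so stop here
        output.append(rng.choice(followers))
    return " ".join(output)


text = (
    "and the evening and the morning were the first day "
    "and the evening and the morning were the second day"
)
words = text.split()
model = build_model(words)

# A seeded random generator makes the run repeatable.
print(generate(model, seed=words[:2], length=15, rng=random.Random(0)))
```

Every three-word window in the output occurs somewhere in the input, which is why the results sound locally plausible while wandering off globally.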

London Financial Python Users' Group

Posted on 16 February 2010 in Finance, Python, Talks

I clearly need to post more stuff here so that it doesn't just turn into a blog announcing the LFPUG's meetings :-)

However, in the meantime, here are the details of the next one: it'll be on 11 March 2010, and is hosted this time by Man Investments Ltd at Sugar Quay, Lower Thames Street, London EC3R 6DU. As before, all are welcome, but for security reasons you need to register in advance; just drop an email to Didrik Pinte. (Update: old mailto link removed.)

Guest of honour this time around is Travis Oliphant, the creator of SciPy and the architect of NumPy. He'll be talking about NumPy memory maps and structured data-types, and Didrik will also give a talk about integrating C/C++ libraries using Cython. More suggestions for talks (or even better, offers to give talks!) are very welcome -- once again, just email Didrik, or post something in the LinkedIn group.

Next London Financial Python Users Group meeting

Posted on 28 January 2010 in Finance, Python

The next meeting of the London Financial Python Users Group will be on Feb 3, 2010 at 7pm, and is being kindly hosted by KBC Financial Products at their offices: 111 Old Broad Street, EC2N 1FP (just opposite Tower 42).

All are welcome, but for security reasons you need to register in advance; just drop an email to Didrik Pinte. (Update: old mailto link removed)

The topics planned for this meeting are:

  • Improving NumPy performance with the Intel MKL - Didrik Pinte, Enthought
  • Python to Excel bridges:
    • "PyXLL, a user friendly Python-Excel bridge" - Tony Roberts
    • Discussion on connecting Python and Excel (xlrd/xlwt, pyinex, win32com, pyxll, ...)
  • Speeding up Python code using Cython - Didrik Pinte, Enthought

A website for LFPUG

Posted on 7 December 2009 in Finance, Python, Resolver One

Didrik Pinte has put together a web page on the Python.org Wiki for the London Financial Python Users Group. Only a little content so far, but it will grow... if you're doing financial work in Python in London, do come along to the next meeting -- it will be 7pm next Monday (14 December) at MWB Regent Street, Liberty House, 222 Regent Street, London W1B 5TR. You may have to put up with me talking for a while about a spreadsheet you already know everything about, but there will be interesting bits too ;-)

New York Financial Users Group

Posted on 13 November 2009 in Finance, Python

A quick follow-up to my last post; the guys at Enthought are also starting a Financial Python Users Group for New York. If you're interested, the LinkedIn group is here.

London Financial Python Users Group

Posted on 11 November 2009 in Finance, Python

Last night, I went to the inaugural meeting of the London Financial Python Users Group. It was a small gathering (as you'd expect for a new group), just four of us, but a very interesting one. Didrik Pinte gave a presentation of an interesting new library called pandas, a useful layer on top of existing stats packages that provides some very neat data-alignment capabilities, and we also discussed what future meetings should involve (lightning talks are definitely on the cards).

As with all these things, the most interesting discussions were left for the pub afterwards. I was particularly interested in what Didrik had to say about Cython, a tool which lets you write CPython C extensions in Python (!). I'll have to play with it and see if it can easily integrate with Ironclad...

Anyway, the next meeting is very tentatively planned for 14 December; the best way to track the group is probably currently the LinkedIn group, though I will definitely also post the details when they firm up.

R in Resolver One (and perhaps IronPython generally)

Posted on 2 March 2009 in Programming, Python, Resolver One

We've just announced the winner of this month's round of our competition at Resolver Systems, and it's a great one; Marjan Ghahremani, a student at UC Davis, managed to work out how you can call R (a powerful statistical analysis language) from our spreadsheet product Resolver One. You can download a ZIP file with a detailed PDF describing how it works and a bunch of examples.

If you're not interested in Resolver One, but want to use R from your own IronPython scripts, you may be able to do that too, using her instructions as guidelines -- I've not tried it myself, but there are no obvious blockers. If you do try it out, I'd love to hear how it goes.