Now that I’m on the last chapter of my PhD (ok, so the other three still have some things I need to add, but hey-ho), it’s time to face up to the challenge of topic modelling.
The good thing about topic modelling is that it’s a clustering problem: it assumes you know nothing about your data, and wants to find out what people have been writing about. Strictly, what topic modelling is (here, at least) is the application of the Latent Dirichlet Allocation algorithm, which assumes that if you know something about the words in the corpus, you can work out what the topics are. It also assumes that every document belongs to every topic, although for many document–topic pairs the probability will be 0, or very, very close to 0. The bad thing about topic modelling is that it assumes you know nothing about your data, and therefore will lie to you unless and until you treat it properly: feed it clean data, restrict its calorie intake by removing anything unnecessarily fatty or sugary, and tell it how many topics you actually want it to find. But hang on, you don’t know anything about the data, right? So how can you possibly know what to look for? And the topic modelling algorithm just sits quietly with a smug grin on its face.
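To make the ‘every document belongs to every topic’ bit concrete, here’s a minimal sketch (assuming scikit-learn, and toy documents – this is not my actual pipeline): LDA gives each document a probability for each topic, every probability is above zero, and each document’s probabilities sum to 1.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "teachers teaching in schools",
    "football scores and league tables",
    "teaching football to school teams",
]

# Word-count matrix: rows = documents, columns = unique tokens
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row per document, one column per topic

for row in doc_topics:
    print([round(p, 3) for p in row], "sums to", round(row.sum(), 3))
```

Most of the probability mass lands on one topic per document, but nothing is ever exactly 0 – the Dirichlet prior smooths everything.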
In the space of ten years or so, the computer science community, or at least the bit of it that’s tried to wipe the smug grin off its face, has gone from ‘look! It’s a miracle!’ to ‘this thing just LIES and LIES and LIES….’. Some attempts have been made to address the problem, but they are computationally expensive, or the code hasn’t been incorporated into the popular libraries yet, and so non-computer-sciency people like me are left trying to work out the best method of making the bloody thing work.
And that’s the real issue: there’s a line between ‘making it work’ and ‘fiddling the figures until you get what you’re looking for’ that must not be crossed. Not if you want an examiner to believe your research, anyway. And it turns out there are a thousand ways to tweak the data / the parameters of the algorithm / both together to ‘make it work’.
Ultimately, the goal is to uncover “….better topics that are more human interpretable”, but even if you know the domain from which your corpus has been drawn, this can still be challenging.
So, the general consensus in all the research papers I’ve read seems to be that the data needs to be cleaned up (URLs removed, contractions expanded, punctuation deleted, etc.); stopwords removed (including any additional words that are particular to the corpus); and the corpus tokenised (each word is now a token, not a functional part of a sentence). Some papers advocate stemming – reducing words to their root form – others don’t mention it at all. Stemming reduces the overall number of tokens, as words like ‘teacher’, ‘teaching’ and ‘taught’ all become ‘teach’. The next step is to produce a matrix of word-count vectors – the number of times every word in the entire corpus is used in each document. That’s potentially a lot of zeros, which is fine. Having done this, it’s possible to go one step further and add a weighting (TF-IDF) to each word, so that each count is now in inverse proportion to the number of documents it’s used in. In short, the least frequently used words get higher ‘scores’ and become more important as a way of signalling a particular topic. The 232 documents in my trial set of data (blog posts, in case you haven’t read anything else I’ve written) contain 7,047 unique tokens. That’s a matrix (table, if you like) of 232 x 7,047 word counts or TF-IDF scores.
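The steps above can be sketched roughly like this (assuming scikit-learn; the cleaning regexes and toy documents are illustrative stand-ins for the real corpus-specific choices, and stemming is omitted – in practice a stemmer such as NLTK’s Porter stemmer would be applied to each token):

```python
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def clean(text):
    text = re.sub(r"https?://\S+", " ", text)      # remove URLs
    text = text.replace("don't", "do not")          # expand contractions (one sample rule)
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # delete punctuation and digits
    return text

docs = [
    "Teaching notes: see https://example.com for the full post, don't skip it!",
    "Taught a class on teaching teachers today.",
]
cleaned = [clean(d) for d in docs]

# Count matrix: rows = documents, columns = unique tokens (stopwords removed)
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(cleaned)

# TF-IDF matrix: same shape, but words used in fewer documents score higher
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(cleaned)

print(counts.shape)  # (number of documents, number of unique tokens)
```

Both matrices are stored sparsely, which is how all those zeros stay manageable.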
If I use a simple count vector and ask the algorithm to find 8 topics (a lot of trial and error suggested that this might be the optimal number – don’t ask; let’s just say it involved a lot of running and re-running code), plus stemming and tokenising, the spread of topics looks like this:
What you’re looking at here is each of the 8 topics ‘clustered’ in two-dimensional space (principal components 1 and 2). The top-30 most-used terms for each topic are shown on the right. The same parameters, but this time using TF-IDF, are shown here:
It looks a bit different. One topic dominates all the others. What makes this all more than a bit annoying is that every time the algorithm is run, the results are slightly different, so anyone wedded to 100% accuracy and replicability is going to become apoplectic very quickly. Nevertheless, if you can take a deep breath and get beyond this, it’s entirely possible that what is being shown is a reasonable representation of the topics in the corpus, given the decisions made on the way to the final convergence of the algorithmic process. It’s not ‘true’, but it’s ‘true for a given value of true’.
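The run-to-run wobble comes from the algorithm’s random initialisation, so one way to keep the apoplexy at bay is to pin the seed. A sketch, assuming scikit-learn (and toy documents): two fits with the same `random_state` produce identical results.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["blogging about teaching", "teaching about blogging", "football results today"]
counts = CountVectorizer().fit_transform(docs)

def fit(seed):
    # Same seed -> same random initialisation -> same final topic spread
    lda = LatentDirichletAllocation(n_components=2, random_state=seed)
    return lda.fit_transform(counts)

print(np.allclose(fit(42), fit(42)))  # → True
```

This makes one run replicable; it doesn’t make the topics any more ‘true’, it just makes the lie consistent.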
If anyone wants to have a look at the interactive files, they’re here in my GitHub. You’ll need to download the files, and then open them using a browser.
As well as trying this out on data that’s been stemmed and tokenised, I also tried using just a count vectoriser. You’ll find the files by clicking on the link above.
The verdict? Well, the goal is to be able to add a meaningful label to each cluster. I haven’t had a really close look at them yet, but first glance suggests that applying a simple count vectoriser, and only tokenising the data, seems to produce the clearest results. In the end, the method I choose to arrive at the results has to be consistent, and once I’ve decided what it’s going to be, I have to accept the results as they are, because I have to repeat this on 14 sets of blogs. It’s also entirely likely that, for some years, there will be more than 8 topics (or fewer), so there will still be some faffing around in terms of topic numbers, but that will be it. Everything else will stay the same.
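For that remaining faffing around with topic numbers, one way to make the trial and error slightly more systematic (a sketch assuming scikit-learn; the documents here are toy stand-ins) is to fit a range of topic counts and compare each model’s perplexity on the corpus – lower is usually better, though human judgement about whether the labels make sense still gets the final say.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "teaching in schools and classrooms",
    "teachers teaching students every day",
    "football league results this week",
    "football teams and their scores",
    "blogging about education policy",
    "education blogs and policy posts",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# Fit one model per candidate topic count and record its perplexity
perplexities = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
    perplexities[k] = lda.perplexity(counts)
    print(k, "topics -> perplexity", round(perplexities[k], 1))
```

On a real corpus this would use held-out documents rather than the training set, but the shape of the loop is the same.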