Having had a chance to think about, and articulate some ideas on, how to deal with my data set, I started dividing it up into blog posts by year. I like using Pandas for Python, although it can be difficult to find help with it that is pitched at the right level. Anyway, I separated out all the years from 2004 to 2017 and saved them in individual .csv files.
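For anyone wanting to try the same split, here is a minimal sketch of one way to do it in Pandas. The column names (`title`, `date`, `text`), the sample rows and the `posts_<year>.csv` file name pattern are all assumptions made for illustration, not the layout of my actual data set.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real data set; column names are assumptions.
raw = io.StringIO(
    "title,date,text\n"
    "First post,2004-03-01,hello world\n"
    "Another post,2017-06-15,clustering blog posts\n"
    "Undated post,,no date here\n"
)
posts = pd.read_csv(raw)

# Parse the dates; anything missing or unparseable becomes NaT.
posts["date"] = pd.to_datetime(posts["date"], errors="coerce")
posts["year"] = posts["date"].dt.year

# groupby drops rows whose year is missing, so each frame holds one year only.
by_year = {int(year): frame for year, frame in posts.groupby("year")}

# Keep the undated posts to one side rather than losing them silently.
undated = posts[posts["year"].isna()]


def save_by_year(frames, folder="."):
    """Write one CSV per year (the file name pattern is an assumption)."""
    for year, frame in frames.items():
        frame.to_csv(f"{folder}/posts_{year}.csv", index=False)
```

Calling `save_by_year(by_year)` would write one file per year into the current folder; `undated` holds whatever could not be assigned to a year.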
Then I had a go at clustering posts from 2017. With ‘only’ 230 blog posts, this was relatively easy in terms of processing using the hardware available on my laptop. I stuck with 10 clusters as I’d used this arbitrary number when I clustered the whole set. I’ll talk in more detail about the results in the next post, but some issues remain to be addressed:
- What to do with the entries that don’t include the year they were posted.
- The stop words obviously need sorting out, as I’m getting rubbish like ‘facebooktwittergoogleprintmoreemaillinkedinreddit’ as one of the top terms in a cluster. Two clusters, in fact.
- As mentioned in the previous post, some of the titles include ‘posted on’ followed by the date of posting, and/or the category; and sometimes the blog post itself rather than the title. I should probably try and remove the ‘posted on’ from the beginning, and I can probably get rid of the category as well. Following that, the first sentence would probably do as the title.
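The stop word and title issues in the list above could be tackled with something like the sketch below. The date formats in the regular expression, the length threshold of 25 characters and the helper names `clean_title` and `strip_debris` are all assumptions; I haven't checked them against every title in the data set, and the category is left alone because its format isn't pinned down.

```python
import re

# Assumed formats: "Posted on 12 March 2017" or "Posted on March 12, 2017".
POSTED_ON = re.compile(
    r"^\s*posted on\s+(?:\d{1,2}\s+\w+|\w+\s+\d{1,2},?)\s+\d{4}\s*",
    re.IGNORECASE,
)

# Share-button text that was run together during scraping.
EXTRA_STOP_WORDS = {"facebooktwittergoogleprintmoreemaillinkedinreddit"}


def clean_title(raw):
    """Drop a leading 'posted on <date>' and keep only the first sentence."""
    text = POSTED_ON.sub("", raw).strip()
    return re.split(r"(?<=[.!?])\s+", text, maxsplit=1)[0]


def strip_debris(text, max_len=25):
    """Remove known junk tokens and any implausibly long token."""
    kept = [
        token
        for token in text.split()
        if token.lower() not in EXTRA_STOP_WORDS and len(token) <= max_len
    ]
    return " ".join(kept)
```

For example, `clean_title("Posted on 12 March 2017 Clustering my posts. More text here.")` returns `"Clustering my posts."`.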
The big question, though, is this: should I use the data from the entire set as training data for these subsequent subsets? That would probably mean experimenting with different numbers of clusters until I got what looked like a coherent set of topics (which will obviously be down to my own professional judgement and inevitable researcher bias) and then labelling them. Or should I subject each subset to the principles of unsupervised learning and see what happens?
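To make the two options concrete, here is a small sketch, again assuming TF-IDF and k-means from scikit-learn and using made-up texts. Option A fits the vocabulary and clusters on everything and then assigns one year's posts to those fixed clusters; option B fits a fresh model on that year alone.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus; the last four texts stand in for a single year.
all_posts = [
    "pandas dataframe csv python",
    "python pandas groupby csv",
    "teaching classroom students lesson",
    "students lesson homework classroom",
    "python csv clustering pandas",
    "classroom teaching students homework",
]
posts_2017 = all_posts[2:]

# Option A: fit on the whole set, then assign the subset to those clusters.
vectorizer = TfidfVectorizer()
whole = vectorizer.fit_transform(all_posts)
shared_model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(whole)
labels_shared = shared_model.predict(vectorizer.transform(posts_2017))

# Option B: fit a fresh vocabulary and model on the subset alone.
own_vectorizer = TfidfVectorizer()
own_model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels_own = own_model.fit_predict(own_vectorizer.fit_transform(posts_2017))
```

With option A, cluster 3 means the same thing in every year, so the years can be compared directly. With option B, the cluster numbers are arbitrary for each year and would need matching up by hand.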
Then there’s presenting my data. I would like something like this, explained here by the late, great Hans Rosling.
I’m imagining my timeline along the horizontal axis, probably starting around 2004 and finishing with the present. This will probably be broken down into quarters. The vertical axis will be the topics discussed, summed up in one or two words if possible. How cool would that be?
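The table behind a chart like that could be built along these lines. This is only a sketch: the `topic` labels, dates and column names are invented, and it stops at the counts rather than drawing anything.

```python
import pandas as pd

# Invented example rows; real topic labels would come from the clustering.
posts = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2004-02-10", "2004-03-05", "2004-08-21", "2017-11-02"]
        ),
        "topic": ["teaching", "teaching", "pandas", "pandas"],
    }
)

# Label each post with its quarter, e.g. "2004Q1".
posts["quarter"] = posts["date"].dt.to_period("Q").astype(str)

# Rows are topics, columns are quarters, cells are post counts.
counts = posts.groupby(["topic", "quarter"]).size().unstack(fill_value=0)
```

Only quarters that contain at least one post appear as columns, so empty quarters would need adding before plotting a continuous timeline.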