Clustering Blog Posts: Part 1

Today, while my laptop was employed scraping Edu-blog posts from 2011, I decided to play around with Orange.  This is one of the suite of tools offered by Anaconda which I use for all my programming needs.

The lovely thing about Orange is that it allows you to build a visual workflow,  while all the actual work in the form of the lines and lines of code is done behind the scenes.  You have to select some parameters, which can be tricky, but all the heavy lifting is done for you.

This was my workflow today, although this was about halfway through so by the time I’d finished, there were a few more things there.  Still, it’s enough to show you how it works.


My corpus is a sample of 9,262 blog posts gathered last year.  Originally, there were over 11,000 posts but they’ve been whittled down by virtue of having no content, having content that had been broken up across several rows in the spreadsheet, or being duplicates.  I also deleted a few that simply weren’t appropriate, usually because they were written by educational consultants as a means to sell something tangible such as books or software, or were political in some way, such as blogs written for one of the teaching unions.  What I’ve tried to do is identify blog URLs that contain posts by individuals, preferably but not exclusively teachers, with a professional interest in education and writing from an individual point of view.  This hasn’t been easy, and I’m certain that when I have the full set of data (which will contain many tens of thousands of blog posts) some less than ideal ones will have crept in, but that’s one of the many drawbacks of dealing with BIG DATA:  it’s simply too big to audit.

You may recall that the point of all this is to classify as much of the Edu-blogosphere as I possibly can –  to see what Edu-professionals talk about, and to see if the topics they discuss change over time.  Is there any correlation between, for example, Michael Gove being appointed Secretary of State for Education and posts discussing his influence?  We’ll see.  First of all, I have to try and cluster the posts into groups according to content.  I’ve been doing this already, and developed a methodology.  However, while I’m still gathering data, and labelling a set of ‘training data’ (of which more in a future blog post) I’ve been experimenting with a different set of tools.

So, here’s my first step using Orange.  Open the corpus, shuffle (randomise) the rows, and take a sample of 25% for further analysis, which equates to 2316 documents or rows.
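For anyone who prefers to see the shuffle-and-sample step outside Orange, here is a minimal sketch in plain Python. It is not what Orange runs behind the scenes; the function name, the seed and the stand-in corpus are all assumptions made for illustration.

```python
import random

def shuffle_and_sample(rows, fraction=0.25, seed=42):
    """Shuffle a copy of `rows` and return the first `fraction` of them."""
    shuffled = list(rows)                    # copy, so the original order is untouched
    random.Random(seed).shuffle(shuffled)    # a fixed seed makes the shuffle repeatable
    sample_size = int(len(shuffled) * fraction + 0.5)   # round half up
    return shuffled[:sample_size]

# Stand-in for the real corpus: 9,262 numbered "posts" (hypothetical data).
corpus = [f"post {i}" for i in range(9262)]
sample = shuffle_and_sample(corpus)
print(len(sample))   # 2316
```

Fixing the seed is a design choice: it means the same 25% is drawn every time, which makes later results easier to compare.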


Open the document, select the data you need, e.g. ‘Content’.  The other features can still be accessed.




The ‘corpus viewer’ icons along the top of the workflow shown above mean that I can see the corpus at each stage of the process.  This is a glimpse of the second one, after shuffling.  Double-clicking any of the icons brings up a view of the output, as well as the various options available for selection.


The next few steps will give me an insight into the data, and already there are a series of decisions to make.  I’m only interested in pre-processing the content of each blog post.  The text has to be broken up into separate words, or ‘tokens’, and punctuation removed, plus any URLs that are embedded in the text.  In addition, other characters such as /?&* or whatever are also stripped out.  So far, so straightforward as this screen grab shows:



Words can also be stemmed or lemmatised (see above).

“Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.  Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun.”  The Stanford Natural Language Processing Group
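Before any stemming or lemmatising can happen, the text has to be tokenised and cleaned as described earlier. The sketch below shows one rough way of doing that with Python's built-in `re` module. It is not Orange's implementation: the two patterns are assumptions, and this version also throws away numbers, which may or may not be what you want.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")   # letters, with an optional apostrophe part

def tokenise(text):
    """Lower-case the text, drop URLs, and return a list of word tokens."""
    text = text.lower().replace("’", "'")   # treat curly apostrophes like straight ones
    text = URL_PATTERN.sub(" ", text)       # remove embedded URLs
    return TOKEN_PATTERN.findall(text)      # punctuation and digits never match

tokens = tokenise("Read this: https://example.com/post?id=1 – it’s about Year 8 *maths*!")
print(tokens)   # ['read', 'this', "it's", 'about', 'year', 'maths']
```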

If I lemmatise the text as is, I get this lovely word cloud. BTW, it’s just a coincidence that ‘example’ is in red.

As you would expect from blogs written by teachers, words like ‘student’ and ‘teacher’ feature heavily.  If I use the snowball stemmer (which is basically an efficient algorithm for stemming, explained here) then the word cloud looks like this:


Both of these word clouds are also generated from the corpus after a series of words known as ‘stop words’ are removed.  These words are the ones we use most frequently, but which add little value to a text; words such as ‘the’, ‘is’, ‘at’ or ‘which’.  There is no agreed standard, although most algorithms use the list provided by the Natural Language Toolkit (NLTK).  I’ve chosen to use the list provided by Scikit-Learn, a handy module providing lots of useful algorithms.  Their list is slightly longer.  The use of stop words is well researched and recommended to reduce the number of unique tokens (words) in the data, otherwise referred to by computer scientists as dimensionality reduction.   I also added some other nonsense that I noticed when I was preprocessing this data earlier in the year (phrases like ‘twitterfacebooklike’), so in the end I created my own list combining the ‘standard’ words and the *rap, and copied them into a text file.  This is referred to as ‘NY17stopwordsOrange.txt’ in the screenshot below.


My next big question, though, is what happens if I add to the list some of the words that are most frequently used in my data set – words like ‘teach’, ‘student’, ‘pupil’, ‘year’?  So I added these words to the list: student, school, teacher, work, year, use, pupil, time, teach, learn.  This is the result, using the stemmer:


There is research to suggest that creating a bespoke list of stop words that is domain-specific is worth doing as a step before going on to try and classify a set of documents.  It’s the least-used words in the corpus that are arguably the most interesting and valuable.  I’ll explore this some more in the next post, along with the following steps in the workflow.
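The idea of combining a standard list, the scraping junk and a handful of domain-specific words can be sketched in a few lines of Python. The three short lists here are illustrative stand-ins (the real lists are far longer), and the function names are made up for this example.

```python
def build_stop_words(*word_lists):
    """Merge any number of word lists into one lower-cased set."""
    merged = set()
    for words in word_lists:
        merged.update(word.strip().lower() for word in words if word.strip())
    return merged

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]

# Illustrative lists only.
standard = ["the", "is", "at", "which", "a", "of"]
scraping_junk = ["twitterfacebooklike"]
domain_specific = ["student", "school", "teacher", "pupil", "teach", "learn"]

stop_words = build_stop_words(standard, scraping_junk, domain_specific)
tokens = ["the", "teacher", "marked", "a", "pile", "of", "homework", "twitterfacebooklike"]
filtered = remove_stop_words(tokens, stop_words)
print(filtered)   # ['marked', 'pile', 'homework']
```

Keeping the three sources separate until the last moment makes it easy to switch the domain-specific words on and off and compare the results.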


New Allotment

So, a few weeks ago I was finally allocated an allotment and I went to have a look.  As is normal with allotments (it seems), by the time the tenant finally gives them up, they’ve been left for at least a year, often two.  This one was no different.

From top to bottom…

I admit I thought about it for a couple of days, but in the end decided I probably wasn’t going to get anything any better any time soon, so accepted and went to the local council offices to sign the paperwork.

The sheds at the bottom were both falling down. The first task was making one good one out of the remains.  Oh, and clearing the brambles.

Let me tell you, some of those brambles were massive.  And have you ever tried buying a sickle or a scythe? Basically, it’s impossible to pick either up from a shop, even a large garden centre, but it turns out you can buy any number of sharp agricultural implements online.  Not that we needed to.  We found both in the smaller shed!

Under construction…

...a bit more...


Yesterday, I put a lock on the shed and took down all the tools and other useful things I’d brought from home.  This isn’t my first allotment.

Why do I need extra work? Well, aside from the fresh air, exercise, and the fact that I’m a vegetarian who loves all things that come from the soil  (except sprouts and turnips), as some of you know I live on a boat and use a waterless toilet.  This generates stuff that goes into a ‘hot bin’, an insulated compost bin that heats the contents to over 90°C.  This is essential to kill off any bacteria that may be lurking, although given my vegetable diet there probably aren’t many that are harmful. Certainly emptying solids into the bin isn’t an unpleasant chore.  Meat eaters produce horrid stuff.

Once the bin has been doing its thing at over 60°C for three months, it’s ready to be spread on a garden, and this is why I need an allotment.  I have a small patch of garden to use where my boat is moored, but it’s not enough.  Every three months I get a wheelbarrow-load of lovely compost, and it needs to go somewhere.  At least now I can take it to the allotment and will be able to use it there in the spring.

The next stage is to kill back all the grass and weeds so that I can dig over a strip for late autumn / spring planting.   While I try and be as environmentally friendly as I can, I’ll use chemicals for this.  There really isn’t anything more effective, and these days it’s relatively safe.  Plus, I’m not treating it all, just about a third.

The thing is, what looks like a huge undertaking is really only a series of small jobs.  And this is what I keep telling myself…

The Problem With Guessing K-Means

I’ve been grappling with the problem of how to find out what a group of professionals blog about. That seems simple enough on the face of it, but when there are over 9,000 blog posts in a sample set of data, it’s not so easy. I can’t read every one, and even if I could, can you imagine how long it might take me to group them into topics?

Enter computer science in the form of algorithms.

I’ll gloss over the hours…. days…. weeks of researching how the various alternatives work, and why algorithm A is better than algorithm B for my type of data. Turns out k-means is the one I need.

Put very simply, each blog post (document) is made up of words. Each word is used a certain number of times, both in the document and in the entire collection of documents (corpus). An adjustment must be made for the overall length of the document (a word used ten times in a document of 100 words doesn’t have the same significance as the same word used ten times in a document of 1000 words), but once this has been done it’s possible to give each document an overall ‘score’, which is converted to a position (or vector) within the corpus.
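To make the ‘score’ a little more concrete, here is a sketch of the textbook TF-IDF weighting in plain Python. This is an assumption about the kind of scoring involved, not the exact formula any particular library uses (scikit-learn, for instance, applies extra smoothing and normalisation). The three tiny documents are invented.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dictionary per tokenised document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for doc in documents for word in set(doc))
    vectors = []
    for doc in documents:
        vector = {}
        for word, count in Counter(doc).items():
            tf = count / len(doc)                      # adjust for document length
            idf = math.log(n_docs / doc_freq[word])    # rarer across the corpus = higher
            vector[word] = tf * idf
        vectors.append(vector)
    return vectors

docs = [
    ["ofsted", "inspection", "school", "school"],
    ["marking", "homework", "school"],
    ["ofsted", "report", "school"],
]
vectors = tf_idf(docs)
for vector in vectors:
    print({word: round(weight, 3) for word, weight in vector.items()})
```

Notice that ‘school’, which appears in every document, ends up with a weight of zero: a word that is everywhere tells you nothing about what makes one post different from another.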

It helps to think of the position as a ‘vector’ in a space with a huge number of dimensions (one for every distinct word in the corpus), even if you can’t visualise it, which I can’t. But, having done this, it’s then possible for k-means to randomly pick a number of starting points (the number, k, being chosen in advance) and assign every document to whichever starting point is closest to it. Each point is then moved to the centre of the documents assigned to it, and the documents are reassigned. The algorithm does this over and again until the groups stop changing (or until it reaches a maximum number of tries, or iterations) and then it tells you how many documents it’s put in each cluster.
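A bare-bones version of k-means can be written in plain Python, which may help show what the algorithm is doing. This is a teaching sketch under simplifying assumptions (two-dimensional points, random starting points drawn from the data, a fixed seed), not the scikit-learn implementation, and real document vectors would have thousands of dimensions.

```python
import random

def squared_distance(a, b):
    """Squared straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """The centre (coordinate-wise mean) of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_iterations=100, seed=0):
    """Cluster `points` into k groups; return (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # k randomly chosen starting points
    labels = None
    for _ in range(max_iterations):
        # Step 1: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:           # nothing moved, so we have converged
            break
        labels = new_labels
        # Step 2: move each centroid to the centre of its assigned points.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:                    # leave an empty cluster's centroid alone
                centroids[c] = mean_point(members)
    return centroids, labels

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = k_means(points, k=2)
print(labels)
```

With two tidy, well-separated groups like these, the result is the same whatever the seed. On messy real data a different seed can give different cluster sizes, because the random starting points matter.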

Ideally, the algorithm should settle on the same clusters every time you run it, although because the starting points are picked at random that doesn’t always happen, as I found with my data. The other thing is, without grouping the set manually, there’s no way of telling what the actual value of k should be, which rather defeats the point of the algorithm…. except when you’re dealing with large data sets, you’ve got no choice.

Of course, you CAN just keep clustering, adding 1 to your chosen number for k until you think you’ve got results you’re happy with. I started doing that, beginning with 10 and working up to 15, by which time I was totally bored and considering the possibility that my actual optimum number of clusters might be over 100…. Every time I ran the algorithm, the number of posts in each cluster changed, although two were stable. That seemed to be telling me that I was a long way from finding the optimum number.

Enter another load of algorithms that can help you estimate the optimum number for k. They aren’t a magic bullet – they can only help with an estimation, and each one goes about the process in a different way. I chose the one I did because a) I found the code in a very recent book written by a data scientist, and b) he gave an example of how to write the code AND IT WORKED.

Guess how many clusters it estimated I had? Go on, guess….. seven hundred and sixty. Of course I now have to go back and evaluate the results, but still. Seven hundred and sixty.

Good job I stopped at 15.


Having successfully divided my data set up into separate years yesterday, I thought I’d go back to basics and have a look at stopwords.

In language processing, it’s apparent that there are quite a few words that add absolutely no value to a text.  These are words like ‘a’, ‘all’, ‘with’ etc.  NLTK (Natural Language Tool Kit – a module that can be used to process text in various ways.  You can have a play with it here) has a list of 127 words that could be considered the most basic ones.  Scikit-learn (which I’m using for some of the more complicated text processing algorithms) uses a list of 318 words taken from research carried out by the University of Glasgow.  A research paper published by them makes it clear that a fixed list is of limited use, and in fact a bespoke list should be produced if the corpus is drawn from a specific domain, as I’m doing with my blogs written by teachers and other Edu-professionals.

Basically, the more frequently a word is used in a corpus, the less useful it is.  For example, if you were presented with a database of blogs written by teachers, and you wanted to find the blogs written about ‘progress 8’, that’s what your search term would be, possibly with some extra filtering-words like ‘secondary’ and ‘England’.  You would know not to bother with ‘student’, ‘children’ or ‘education’ because they’re words you’d expect to find in pretty much everything.  Those words are often referred to as ‘noise’.

The problem is that if the word ‘student’ was taken out of the corpus altogether, and treated as a stopword, that might have an adverse effect on the subsequent analysis of the data.  In other words, just because the word is used frequently doesn’t make it ‘noise’.   The bigger problem, then, is how to decide which of the most frequently used terms in a corpus can safely be removed.  And of course there’s the issue of whether the words on the default list should be used as well.
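One simple way to get a shortlist of candidate words to look at is to count how many documents each word appears in. The sketch below does only that; it is not the method from the Glasgow paper, the 0.75 threshold is arbitrary, and the documents are made up. Whether a candidate should actually be removed is still a judgement call.

```python
from collections import Counter

def document_frequency(documents):
    """Map each word to the fraction of documents it appears in."""
    counts = Counter(word for doc in documents for word in set(doc))
    return {word: n / len(documents) for word, n in counts.items()}

def stop_word_candidates(documents, threshold=0.8):
    """Words found in at least `threshold` of the documents, most common first."""
    frequencies = document_frequency(documents)
    candidates = [(w, f) for w, f in frequencies.items() if f >= threshold]
    return sorted(candidates, key=lambda pair: (-pair[1], pair[0]))

docs = [
    ["student", "progress", "school"],
    ["student", "homework", "school"],
    ["student", "ofsted", "school"],
    ["student", "progress", "data"],
]
candidates = stop_word_candidates(docs, threshold=0.75)
print(candidates)   # [('student', 1.0), ('school', 0.75)]
```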

The paper I referred to above addresses this very problem, with some success.  I’m still trying to understand exactly how it works, but it seems to be based on the idea that a frequently-used word may in fact be an important search term.  And the reason I’ve spent so much time on this is because the old adage ‘rubbish in, rubbish out’ is indeed true, and before I go any further with the data I have, I at least need to understand the factors that may impact the results.

Thinking it through… Part 2

Having had a chance to think about, and articulate, some ideas as to how to deal with my data set, I started dividing it up into blog posts by year.  I like using Pandas for Python, although it can be difficult to find help with it that is pitched at the right level.  Anyway, I separated out all the years from 2004 to 2017 and saved them in individual .csv files.
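The split by year was done in Pandas, but the same idea can be sketched with nothing more than Python's standard library. Everything here is an assumption for illustration: the column name `date`, the `posts_<year>.csv` file names, and the rough rule of taking the first four-digit number starting 19 or 20 as the year.

```python
import csv
import re
from collections import defaultdict
from pathlib import Path

# Rough rule: the first run of four digits starting 19 or 20 is the year.
# A fused string such as "Feb202017" would be misread, so treat this as a first pass.
YEAR_PATTERN = re.compile(r"(?:19|20)\d{2}")

def split_by_year(rows, date_field="date"):
    """Group row dictionaries by the year found in their date field."""
    groups = defaultdict(list)
    for row in rows:
        match = YEAR_PATTERN.search(row[date_field])
        groups[match.group() if match else "unknown"].append(row)
    return dict(groups)

def write_groups(groups, folder, fieldnames):
    """Write each group to its own file, e.g. posts_2017.csv."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    for year, rows in groups.items():
        path = folder / f"posts_{year}.csv"
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)

rows = [
    {"date": "September 11, 2012", "title": "Reading"},
    {"date": "Posted on December 5, 2016", "title": "Carnival of Mathematics 140"},
    {"date": "21st December", "title": "Phase diagrams"},
]
groups = split_by_year(rows)
print({year: len(posts) for year, posts in groups.items()})
# {'2012': 1, '2016': 1, 'unknown': 1}
```

Rows with no recognisable year land in an ‘unknown’ group rather than being dropped, so they can be dealt with later. Calling `write_groups(groups, "by_year", ["date", "title"])` would then save one file per group.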

Then I had a go at clustering posts from 2017.  With ‘only’ 230 blog posts, this was relatively easy in terms of processing using the hardware available on my laptop.  I stuck with 10 clusters as I’d used this arbitrary number when I clustered the whole set.  I’ll talk in more detail about the results in the next post, but some issues remain to be addressed:

  • What to do with the entries that don’t include the year they were posted.
  • The stop words obviously need sorting out, as I’m getting rubbish like ‘facebooktwittergoogleprintmoreemaillinkedinreddit’ as one of the top terms in a cluster.  Two clusters, in fact.
  • As mentioned in the previous post, some of the titles include ‘posted on’ followed by the date of posting, and/or the category; and sometimes the blog post itself rather than the title.  I should probably try and remove the ‘posted on’ from the beginning, and I can probably get rid of the category as well.  Following that, the first sentence would probably do as the title.

The big question, though, is this: should I use the data from the entire set as training data for these subsequent sub-sets?  That would probably mean experimenting with different numbers of clusters until I got what looked like a coherent set of topics (which will obviously be down to my own professional judgement and inevitable researcher bias) and then labelling them.  Or should I subject each subset to the principles of unsupervised learning and see what happens?

Then there’s presenting my data.  I would like something like this, explained here by the late, great Hans Rosling.

I’m imagining my timeline along the horizontal axis, probably starting around 2004 and finishing with the present.  This will probably be broken down into quarters.  The vertical axis will be the topics discussed, summed up in one or two words if possible.  How cool would that be?

Thinking It Through…

This blog is intended to be a record of the things I’ve been thinking about as I’ve looked over a sample of my data.  You might find it a bit boring…. that’s allowed.  You don’t have to read it.

Dealing with Data: Dates

I’m working on a sample of blog post data that I scraped for my PhD upgrade report (and a paper for the Web Science conference that wasn’t accepted, sadly).  The data contains ‘just’ 11,197 rows of text data: The contents of each blog post, the date it was posted, and the title of the post.  Well, that’s what I wanted when I wrote the code that went through a list of URLs and scraped the data.

A spreadsheet with 12,000 rows is just about manageable, by which I mean you wouldn’t want to print the data out, but you can scroll through and have a look at what you’ve got using Excel.  A sample like this is useful because you can observe the data you’ve gathered, and anticipate some of the problems with it.

The first thing I noticed is that rows 1486 to 2971 appear to be duplicates of the previous batch of rows.  Obviously this has happened because the source URLs have become duplicated.  Now, when I got my first list of URLs together, not all of them could be scraped.  There are several reasons for this:

  • wrong address;
  • URL no longer available;
  • password protected blog;
  • the code simply won’t work on the given URL.

My code stops running when it encounters a URL it can’t access.  Up to now, I’ve been manually cutting out the offending URL, and copying it in to a separate document that I can look at later.  This is the first place an error could be made, by me of course.

Task 1: amend code so that a URL that can’t be processed is automatically written to a separate file, and the code continues to iterate through the rest of the list.
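A possible shape for Task 1 is sketched below. The real scraping function isn't shown in this post, so the sketch takes it as a parameter; `fake_scrape`, the example URLs and the `failed_urls.txt` file name are all hypothetical stand-ins.

```python
import os
import tempfile

def scrape_all(urls, scrape, failed_path="failed_urls.txt"):
    """Scrape every URL, logging failures to a file instead of stopping."""
    results = []
    with open(failed_path, "w", encoding="utf-8") as failed:
        for url in urls:
            try:
                results.append(scrape(url))
            except Exception as error:      # broad on purpose: any failure gets logged
                failed.write(f"{url}\t{error}\n")
    return results

# A stand-in scraper for demonstration; the real one would fetch and parse the page.
def fake_scrape(url):
    if "broken" in url:
        raise ValueError("could not parse page")
    return {"url": url, "title": "Example"}

urls = ["https://example.com/a", "https://example.com/broken", "https://example.com/b"]
failed_file = os.path.join(tempfile.gettempdir(), "failed_urls.txt")
posts = scrape_all(urls, fake_scrape, failed_path=failed_file)
print(len(posts), "posts scraped")   # 2 posts scraped
```

Writing the error message next to each failed URL means the four failure reasons listed above can be told apart later without re-running anything.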

When you’re dealing with around 1000 URLs, as I hope to do, the less intervention by me the better.

Then, there’s the data that’s gathered.  First, Excel is a very poor tool for viewing data scraped from the web.  I used Pandas (a Python module) to clean it up a bit (removing the whitespace at the beginning and end of the text) before opening it up in Excel.  Then, it’s possible to see what’s in each cell, and align it top/left if necessary.  As I was only interested in reviewing the ‘date’ and ‘title’ columns at this stage, I saved the file with a slightly different name and deleted the ‘content’ column.  The reduction in file size makes it a bit easier to manage.

All looks good.  This is a typical entry:

65 September 11, 2012 Reading

65 is the index number given to the entry when the data was scraped, so it’s the 66th blog post from this URL (entries start at zero).

Then there’s this:

Problem 1

0 Posted on December 5, 2016 Carnival of Mathematics 140

The way the date is represented is crucial to my project.

Task 2: Remove ‘posted on ’ from the string.

Easy enough to do you’d think, but actually not.  It is possible to strip the first n characters from the beginning of a string, but the code will iterate through every row and do the same, which is not what I want.  The other option is to split the string and copy the ‘posted on ’ (the space after ‘on’ is deliberate) bit to another column.  So, the pseudo-code would look like this:

if row in ‘Date column’ contains the string ‘posted on ’;

split string after ‘posted on ’;

write to row in ‘Posted On’ column.
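The pseudo-code above might translate into Python along these lines. This is a sketch, not the final code: it uses a regular expression anchored to the start of the cell, so rows that don't begin with ‘posted on ’ are left exactly as they were, and it ignores case because the scraped text actually says ‘Posted on’.

```python
import re

# Matches 'posted on ' only at the very start of the cell, in any mix of upper/lower case.
POSTED_ON = re.compile(r"^\s*posted on\s+", flags=re.IGNORECASE)

def strip_posted_on(value):
    """Return (had_prefix, remainder) for one date cell."""
    remainder, n_removed = POSTED_ON.subn("", value)
    return bool(n_removed), remainder.strip()

results = [strip_posted_on(cell) for cell in
           ["Posted on December 5, 2016", "September 11, 2012"]]
print(results)
# [(True, 'December 5, 2016'), (False, 'September 11, 2012')]
```

The True/False flag plays the part of the separate ‘Posted On’ column: it records which rows had the prefix without keeping the words themselves.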

Problem 2

1 Posted on January 29, 2017January 29, 2017 Education

So much is such a waste of time

Posted on January 29, 2017January 29, 2017

There are a couple of problems here.  If I split the date string as I did previously, it’s not going to help me.  I’d be left with ‘January 29, 2017January 29, 2017’.  Now what?

Secondly, the title cell looks to me as if it contains a category for the post, the title, and the date the post was made (again).  At this point, I’m thinking of finding this particular blog post via a google search, and looking at the HTML structure of the page to see why I’m getting these extra bits of unnecessary information.  It may not look like much, but:

  • when my spreadsheet has one hundred and eleven thousand rows, or more, that’s a lot of extra data;
  • I eventually want to use the titles when I present my data visually to an audience;
  • The title itself may be useful to add some substance to my analysis, so I don’t want it ‘dirtied’ with useless characters.

Problem 3

This row has a similar issue, although there is no category.  I’ve added the stars to protect the identity of the blogger.

0 Posted on November 9, 2015 What did I learn?

Posted on November 9, 2015 by C******* M*****

I’m not sure what to do about the date here, so let’s move on.  I can do this though:

Task 3: examine the HTML structure of this blog URL with a view to modifying the code used to scrape the data.

Problem 4

Here’s something else interesting:

183 Posted on March 1, 2010September 9, 2010 Software and websites I couldn’t do without

Two dates.  I suspect that the first date is the one on which the blog entry was posted, and the second is the date it was amended /updated.  Again, how am I going to deal with this?  I think I’m going to have to go back to the HTML again and see if I can make another modification to my code.  I’m only on row 925… let’s move on.
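Pending a proper look at the HTML, one way to at least pull apart the fused dates in Problems 2 and 4 is to search for every ‘Month D, YYYY’ pattern in the cell. The sketch assumes English month names written out in full; which of two different dates is the ‘real’ posting date is still a guess that the code cannot settle.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
LONG_DATE = re.compile(rf"(?:{MONTHS}) \d{{1,2}}, \d{{4}}")

def extract_dates(cell):
    """Return every 'Month D, YYYY' date in the cell, in order, without repeats."""
    dates = []
    for date in LONG_DATE.findall(cell):
        if date not in dates:
            dates.append(date)
    return dates

print(extract_dates("Posted on January 29, 2017January 29, 2017"))
# ['January 29, 2017']
print(extract_dates("Posted on March 1, 2010September 9, 2010"))
# ['March 1, 2010', 'September 9, 2010']
```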

Problem 5

Here’s my next oddity:

0 2016-09-12 by k***** National Drama CPD Training for secondary teachers

I can split the string here:

if row in ‘Date column’ contains the string ‘ by’;

split string before ‘ by’;

write to row in ‘By’ column.

The space is in a different place now.  This matters, because while you and I see a space in Excel, there is in fact a character there, and it counts.  It quite literally ‘counts’ too, because it has a place.  It’s number 10 in the string (remember, counting begins at zero).  So, if I were to split the string at the space before ‘by’, it might actually split at a different place in a different cell (remember my code will iterate through every row of the column, so I need to be sure that it will only impact the cells I want it to).

Task 4: split string at ‘ by’.
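A minimal sketch of Task 4 uses Python's built-in `str.partition`, which splits on the text itself rather than on a character position, so it doesn't matter where in the cell the marker turns up. One assumption to flag: the marker here is ‘ by ’ with a space on both sides, to avoid catching a word that merely starts with ‘by’.

```python
def split_author(cell, marker=" by "):
    """Split a date cell into (date, author); author is None if the marker is absent."""
    date, found, author = cell.partition(marker)
    if not found:
        return cell.strip(), None       # no ' by ' in this cell: leave it alone
    return date.strip(), author.strip()

print(split_author("2016-09-12 by k*****"))   # ('2016-09-12', 'k*****')
print(split_author("September 11, 2012"))     # ('September 11, 2012', None)
```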

The date that’s left in the cell will be in a different format from previous dates i.e. it’s 2016-09-12 rather than September 12, 2016.  Will this make a difference?  I don’t know yet.

Problem 6

0 2017-02-05 00:00:00 314. Maths is a foreign language

This date has the time as well.  Again, I don’t know what difference this will make.

Problem 7

1 21st December Phase diagrams

Now here’s a problem – no year.  A crucial piece of information is missing, and it’s missing for 696 rows (from row 4603).  Previously, I used Pandas to do a quick audit (locating rows containing 2017, 2016, 2015 etc.) and had established that 786 rows were unaccounted for.  It looks as if I’ve found some of them.

p.s. rows 9522 to 9552 are similar, so there’s another 30.  Only 60 unaccounted for.

30 Posted by  b********1 Hello world!

Problem 8

This cell indicates there are no spaces between ‘Feb’, ‘17’ and ‘2017’ although when I pasted the row into this Word document, each element was on a different line.

0 Feb172017 Learning & Teaching GCSE Mathematics

This will probably be ok because when I come to analyse my data, the important pieces are the month and the year, both of which are clear.
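Several of the date formats seen so far can be handled by trying a list of known patterns one after another with `datetime.strptime` from the standard library. The sketch below is one way of doing it, not a finished solution: the list of formats is an assumption based only on the examples in this post, and it relies on month names being in English.

```python
from datetime import datetime

# Formats seen in the sample so far; extend the list as new ones turn up.
KNOWN_FORMATS = [
    "%B %d, %Y",            # September 11, 2012
    "%Y-%m-%d",             # 2016-09-12
    "%Y-%m-%d %H:%M:%S",    # 2017-02-05 00:00:00
    "%d %b %Y",             # 04 Apr 2016
    "%b%d%Y",               # Feb172017
]

def parse_date(text):
    """Try each known format in turn; return a datetime, or None if none fit."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue                # wrong format, try the next one
    return None

for raw in ["September 11, 2012", "2016-09-12", "2017-02-05 00:00:00",
            "04 Apr 2016", "Feb172017", "21st December"]:
    parsed = parse_date(raw)
    print(raw, "->", parsed.strftime("%Y-%m") if parsed else "no match")
```

Dates with no year, like ‘21st December’, deliberately come back as None rather than being given a made-up year, so they can be counted and dealt with separately.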

Problem 9

And what about this?

36 8. März 201330. März 2016 Build your own low-cost slate! | Baue dein eigenes low-cost Slate!

I know from looking at this blog before that not all of it is in a foreign language (I’m assuming it’s a foreign language teacher), so do I leave this entire blog out of my master list?

Problem 10

I could split these strings, although the figure given for the number of comments varies.

0 04 Apr 2016 Leave a comment The World is Upside Down
6 22 Apr 2015 3 Comments Revision – what works best?

if row in ‘Date column’ contains the string ‘ leave a comment’;

split string before ‘ leave’;

write to row in ‘Leave’ column.

if row in ‘Date column’ contains the string ‘ (number) comments’;

split string before ‘ (number)’;

write to row in ‘Leave’ column.

It’s possible to write code that will take any numerical value for ‘(number)’.
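Both branches of the pseudo-code can be covered by one regular expression, with `\d+` standing in for ‘(number)’. As before this is a sketch: it assumes the comment text sits at the end of the date cell, and it ignores upper and lower case.

```python
import re

# 'Leave a comment' or '<number> Comment(s)' at the end of the cell, in any case.
COMMENTS = re.compile(r"\s+(leave a comment|\d+ comments?)\s*$", flags=re.IGNORECASE)

def split_comments(cell):
    """Return (date, comment_text); comment_text is None when nothing is split off."""
    match = COMMENTS.search(cell)
    if match is None:
        return cell.strip(), None
    return cell[:match.start()].strip(), match.group(1)

print(split_comments("04 Apr 2016 Leave a comment"))   # ('04 Apr 2016', 'Leave a comment')
print(split_comments("22 Apr 2015 3 Comments"))        # ('22 Apr 2015', '3 Comments')
```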

Problem 11

Then there’s this – no title at all.

333 March 3, 2012  

I really need something here, but what?  I could amend my code so that, if it fails to find a blog post title, the phrase ‘No Title’ is written into the row instead.  Alternatives include:

  • use the first sentence from the blog post itself (which can be extracted from the ‘Contents’ cell);
  • use the three most common terms from the post (obtained from the TF-IDF analysis I’m doing on the whole data set);
  • deploy some other text analysis technique to summarise the post in one sentence, which, when you think about it, is exactly what we try and do when we come up with a title for our own blogs.

This affects quite a few rows, so it needs addressing.
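The first of those alternatives, falling back on the first sentence of the post, could look something like this. The function name, the 80-character limit and the very simple idea of what counts as a sentence are all assumptions.

```python
import re

def fallback_title(title, content, max_length=80):
    """Use the scraped title if there is one; otherwise the first sentence of the post."""
    if title and title.strip():
        return title.strip()
    text = " ".join(content.split())            # collapse whitespace and newlines
    if not text:
        return "No Title"
    # Split once at the first gap that follows a full stop, '!' or '?'.
    first_sentence = re.split(r"(?<=[.!?])\s+", text, maxsplit=1)[0]
    if len(first_sentence) > max_length:
        first_sentence = first_sentence[:max_length].rstrip() + "..."
    return first_sentence

print(fallback_title("  ", "Hi Guys. This page will contain all the BSGP skill sheets."))
# Hi Guys.
```

As the example shows, a first sentence is not always a good title, which is an argument for trying the other alternatives too.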

Problem 12



Posted by


Posted on

March 14, 2015

Posted under



Leave a comment

Peer Observation – Priceless CPD, for free!


I’ve copied and pasted this ‘as is’, although in the spreadsheet the data in the date cell appears on one line.  This highlights one of the issues when viewing data – it will appear differently when looked at through different windows, and yet each window has its advantages.  Excel is good for scrolling through data, and for basic numerical functions.  For everything else, I use Pandas for Python, usually via the Jupyter notebook that’s part of the Anaconda suite.

Problem 13




Dec 2016

Hi Guys. This page will contain all the BSGP (bronze, silver, gold, platinum) skill sheets for your perusal.

All resources are free to a good home and are intended to be used for what they are… banks of questions rising in difficulty to help complement your teaching, not replace it!

As I create new resources I’ll add them here so check back often. At some point i’ll probably give the project a formal name and organise it a little better than I am at the minute.

All answer sheets can be found in a password protected blog post (called ‘answer sheets’ of all things!).

Hit me up on twitter  ( @mrlyonsmaths ) for the password














Here’s a row where the contents are appearing where the title should be.  I’m willing to bet that this is because of the HTML structure of the page, so I need to revisit my master code.  It’s not the only URL whose blog posts have this problem, either.

Task 5: revisit master code for extracting ‘Title’ from this blog URL.

And all these problems are, of course, the ones I’ve uncovered in my sample.  The ones I know about.  My final data set will be huge, and I’ll have little chance of spotting anomalies unless I accidentally stumble upon them.

Welcome to my world of big data.



Back to the Classroom.

A few weeks ago, I was asked if I’d be interested in running a workshop for year 12 students as part of the ESRC* Festival of Social Science.  This was organised as part of the University of Southampton’s Learn with US (Outreach) programme which I’d quite like to do more work with in the future.  The theme of the workshops was looking at how technology, and mobile phones and devices in particular, are being used in social science research.  As part of my research, I’m looking at networks and network (or graph) theory, so I thought I could have a go at teaching that.  I find networks fascinating, AND I knew I had some excellent resources that could be adapted for use with students, so why not?

I’m also really keen on promoting the idea that a) computer science is for women too, b) web science is an excellent way of combining the social sciences with computer science, and c) age is no barrier.  A teacher who was accompanying a group of students also told me that, as well as being a role model for girls, I was also showing students why being able to write code was so important as it could have a real practical benefit.

I really miss being in the classroom.  Why will be the subject of another blog post, but suffice to say that, for me, there’s something exhilarating about putting things together (in this case, a PowerPoint and some handouts to guide the students through some actual hands-on work) so that I can deliver knowledge in a way that I hope is interesting.  I like being in charge, in my own space, directing my own personal show.  It’s also a really good chance for me to consolidate my own learning, which is one of the benefits of teaching.

The students were, of course, excellent.  They were made up of groups from several schools – one or two local, others from further afield.  It was really interesting to observe how different the groups were from one another, which I assume reflects both the socio-economic background they were drawn from (and is almost certainly directly related to the catchment area of the school) and the ethos of the school itself.  The interactions between them, between them and their teachers, and with me were markedly different from session to session.  Having only taught in one school before (and not really being detached enough to just observe), it was a fascinating experience for me.  It was, though, overwhelmingly positive and I thoroughly enjoyed it!

I’m sure they left with a positive view of the University of Southampton, and I hope they were inspired by my workshop, and the others they attended.

By the way, the resources I used were borrowed and adapted from the ‘Power of Social Networks’ MOOC** that has just finished on Futurelearn.  It’ll be repeated though, if you fancy a dabble into the world of social networks.

*Economic & Social Research Council

**Massive Open Online Course