Tag Archives: clustering

Houston, We May Have A Problem….

I’ve been writing up my PhD.  This has been a very slow process, mainly because I’ve had to spend quite a bit of time going back through all my references, and re-planning the whole thing.  I bought post-it notes, and a tabletop flip chart (which is also like one massive post-it), and I’ve re-appraised everything.  As I write, I’m constantly adding more post-its as prompts of things I need to look up / do / add to the ‘discussion’ section at the end.

One of the things I decided I’d do was go back through my original data to make sure that I’d gathered everything I needed to, and to see if I could improve the cleaning-up process.  In computer science circles, this is often referred to as ‘text wrangling’.  Your typical blog post contains URLs, other advertising rubbish that’s added by the platform, junky unicode, characters representing carriage returns, new lines…. I could go on.  This all has to be removed.  A text data file, when it’s being prepared for analysis, can get very big very quickly – and there’s a limit to the size of file that even my pretty-well-spec’d laptop can handle.  Having discovered this excellent site, I can now copy and paste a section of a blog post with some rubbish in it, and generate the code snippet that will remove it.  Regex can be tricky – the broader the parameters, i.e. the greater freedom you give it to remove the stuff you don’t want, the more chance there is that it’ll remove other stuff you really would have preferred to keep.  It’s difficult to check, though, so in the end you probably have to just take the risk.
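To give a flavour of the sort of wrangling involved, here’s a minimal sketch in Python.  The patterns are purely illustrative – they’re not the ones I actually use – but they show the general shape: strip URLs, zap junk unicode, and flatten carriage returns and newlines into ordinary spaces.

```python
import re

# Illustrative patterns only -- a real cleaning pass needs many more of these.
URL_RE = re.compile(r"https?://\S+")      # URLs added by platforms, links etc.
WS_RE = re.compile(r"[\r\n\t]+")          # carriage returns, new lines, tabs
JUNK_RE = re.compile(r"[\u200b\ufeff]")   # zero-width junk unicode

def clean_post(text: str) -> str:
    text = URL_RE.sub(" ", text)          # remove URLs
    text = JUNK_RE.sub("", text)          # remove invisible junk characters
    text = WS_RE.sub(" ", text)           # flatten line breaks to spaces
    return re.sub(r" {2,}", " ", text).strip()  # collapse runs of spaces

print(clean_post("Great lesson!\r\nSee https://example.com/x \u200bfor more."))
```

The risk mentioned above is visible even here: `URL_RE` is greedy about what counts as a URL, so a stray ‘https’ mid-sentence would take the rest of the word with it.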

The other thing I wanted to do was expand the contractions in the posts so that ‘isn’t’ becomes ‘is not’ etc.  I think it’s important to leave behind a data set that may be useful to future researchers, some of whom might be interested in sentiment analysis.  Expanding contractions helps to keep the meaning of the writing intact.
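Contraction expansion is basically a lookup table.  A hedged sketch – a real mapping would need to be much fuller, and some cases are genuinely ambiguous (‘it’s’ can be ‘it is’ or ‘it has’), which is exactly why it’s worth doing before the apostrophes get mangled by later processing:

```python
import re

# A deliberately tiny mapping for illustration; a real one has hundreds of entries.
CONTRACTIONS = {
    "isn't": "is not",
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "i'm": "i am",
}

def expand_contractions(text: str) -> str:
    # Build one alternation pattern from the dictionary keys, case-insensitively.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("It isn't easy, and I don't mind saying so."))
```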

Then, I decided I’d go back and look again at how I’d processed my data.  As you may recall, my aim is to classify as many edu-blogs as possible according to a pre-defined list of categories drawn from the research that’s already been done on what teachers blog about.   I chose this approach because the potential number of topics is completely unknown, and potentially huge.  It’s possible to run an algorithm that will cluster blogs without any prior information, but the trouble is that a) you still have to give it some idea how many clusters you might be expecting, and b) the results will vary slightly each time it’s run.  It’s not a model; there’s no consistency.

One of the alternatives is to label a small set of blog posts with numbers representing categories, and then use an algorithm that will take this information and classify the unlabelled posts.  This is how it works: imagine having a double handful of brown smarties and a clear perspex box, say 1m x 1m.  You throw the smarties into the box, but by magic they remain stationary, though scattered, in space.  Now you take a small number of coloured smarties, in several of the remaining colours, and you chuck them in as well.  They also hang in space.  The label spreading algorithm assumes that the coloured smarties are the labels, and it sets about relabelling all the brown smarties according to how close they are to each different colour.  You can allow it to change the colours of the non-brown smarties if you want, and you can give it some freedom as to how far it can spread, say, the red colour.  The algorithm spreads and re-spreads each colour (some of the different coloured smarties will be quite close to each other…. where should the boundary be drawn?) until it reaches convergence.

The picture here (and above) is a great example.  Not only does it look like a load of smarties (which I’m now craving btw) but it also perfectly illustrates one of the fundamental problems with this approach – if your data, when plotted into a 3D space, is an odd shape, spreading labels across it can be a bit of a problem.  The algorithm draws a network (there are lines connecting the smarties if you look closely) and uses the links between the smarties – officially called ‘nodes’, links are ‘edges’ – to determine how many ‘hops’ (edges) it takes to get from your labelled node to your closest unlabelled one.

Each of these nodes could represent a blog post.  It has co-ordinates in this space.  The co-ordinates are generated from the words contained in the post.  The words have to be represented as numbers because none of the algorithms can deal with anything else – this is maths territory we’re in, after all.
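Generating co-ordinates from words is standard vectorisation.  A minimal sketch with scikit-learn’s TfidfVectorizer and some invented one-line ‘posts’ – each row of the resulting matrix is one post’s position in the space:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented miniature 'posts' for illustration.
posts = [
    "marking and feedback took all weekend",
    "a free resource for teaching fractions",
    "feedback on my fractions resource",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(posts)   # one row of co-ordinates per post
print(X.shape)                 # (number of posts, size of vocabulary)
```

The space has one dimension per word in the whole vocabulary, which is why real corpora produce matrices far too wide to picture – the 3D smarties box is a very generous simplification.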

I’ve done this label spreading thing before with a sample set of data.  It seemed to work ok.  A quick audit of the results was promising.  I had another run through the code with a different set of data, including the training set I’d developed earlier, and realised that things weren’t quite the same.  The algorithm has had a bit of an upgrade since I last deployed it.  There were some issues, and the developers from scikit-learn made some improvements.  That got me re-thinking what I’d done, and I realised two things: I’d made a fundamental error, and the new results I was getting needed a bit of an audit.

The book on the right has been invaluable!

The fundamental error really shows up how hard it is to do data / computer science when you aren’t a data / computer scientist.  I was feeding the algorithm the wrong set of data.  I should have been feeding it an array of data based on distance, but I wasn’t.  I was still getting results though, so I didn’t notice.  The thing is, nowhere is there anything that says ‘if you want to do this, you must first do this because this’.  It’s just assumed by every writer of computer science books and blogs and tutorials that you know.  I went back and re-read a few things, and could see that the crucial bit of information was only ever implied.  I can spot it now I’ve gained a lot more knowledge.  So, fault corrected, move on, nothing to see here.

The audit of results isn’t very encouraging, though.  There were many mis-categorisations, and some that were just a bit…. well… odd but understandable.  One of my categories is ‘soapboxing’ – you know, having a bit of a rant about something.  Another is ‘other’ to try and catch the posts that don’t fit anywhere else.  Turns out if you have a rant in a blog post about something that isn’t about education, it still gets classed as ‘soapboxing’, which makes perfect sense when you think about it.  An algorithm can’t distinguish between a post about education and a post that isn’t, because I’m thinking about concepts / ideas / more abstract topics for blog posts, and it’s just doing maths.  Post x is closer to topic a than topic b, and so that’s where it belongs.

There are other approaches to this.  I could use topic modelling to discover topics, but that has problems too.  ‘People’ might be a valid topic, but is that useful when trying to understand what teachers have been blogging about?
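For completeness, the topic modelling route would look something like this – a toy Latent Dirichlet Allocation sketch in scikit-learn, with invented posts and an arbitrary choice of two topics.  The output is exactly the kind of thing I mean: distributions over words, which may or may not map onto anything a researcher would recognise as a useful topic.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented miniature posts; real topic modelling needs thousands of documents.
posts = [
    "behaviour policy detentions rules behaviour",
    "marking feedback assessment marking workload",
    "detentions rules classroom behaviour",
    "assessment feedback data marking",
]

counts = CountVectorizer().fit_transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each row of components_ is a 'topic': a weighting over the whole vocabulary.
print(lda.components_.shape)
```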

My label spreading approach has been based on individual words in a blog post, but I could expand this to include commonly-occurring pairs or trios of words.  Would this make a significant difference?  It might.  It would also put some strain on my laptop, and while this shouldn’t necessarily be a reason not to do something, it’s a legitimate consideration.  And I have tried tweaking the parameters of the algorithm.  It makes little difference.  Overall, the results aren’t different from one another, which is actually a good thing.  I can make a decision about what settings I think are best, and leave it at that.  The problem, the real problem, is that I’m working with text data – with language – and that’s a problem still not solved by AI.
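The ‘pairs or trios of words’ idea is just the `ngram_range` option on the vectoriser, and a tiny example shows where the strain on the laptop comes from – even two short invented posts roughly double the feature space when you add bigrams and trigrams:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented post titles, purely for illustration.
posts = ["growth mindset in the classroom", "knowledge organiser for revision"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(posts)  # single words only
trigrams = CountVectorizer(ngram_range=(1, 3)).fit(posts)  # words, pairs, trios

# The vocabulary -- and so the width of the data matrix -- grows sharply.
print(len(unigrams.vocabulary_), len(trigrams.vocabulary_))
```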

What I cannot do is make the data fit my outcome.  Worst case scenario, I have a lot to add to the ‘discussion’ part of my PhD.  If I can come up with a better analytical framework, I will.  The hard work – harvesting and wrangling the data – has already been done.  If I have to find some more papers to add to the literature review, that’s no hardship.  In the meantime, I’ve slowed down again, but I’m learning so much more.


So, What DO teachers talk about?

So, having put the final piece of the coding jigsaw in place, here are the first set of results.  The diagram below represents a set of 7,786 blog posts gathered from blog URLs.  The earliest is 2009, the latest 2016.  They’re currently all lumped in together, although in the end the data set will be a) much, much larger, and b) broken down by year (and normalised so that a proper comparison can be made).

There are lots of things going on here – how I’ve defined the categories; how I initially categorised some posts to form a training set; how the algorithms work and were applied to the data.  And, in spite of what some people will tell you, data science has all the appearance of giving nice, clear-cut answers when in fact the opposite – especially when dealing with text – is often true.

The journey to get here has been long and challenging.  Still, I’m happy.


Developing Categories, Part 2

So, while I deploy my bespoke Python code to scrape the contents of umpteen WordPress and Blogger blogs, I’ve continued trying to classify blogs from my sample according to the categories I outlined in my previous post.

I say ‘trying’ because it’s not as straightforward as it seems.  Some blogs clearly don’t fit into any of the categories, e.g. where a blogger has simply written about their holiday, or for one blogger written a series of posts explaining various aspects of science or physics.  I reckon that this is a science teacher writing for the benefit of his or her students, but as the posts are sharing ‘knowledge’ rather than ‘resources’, I can’t classify them.  Fortunately the label propagation algorithm I will eventually be using will allow for new categories to be instigated (or the ‘boundaries’ for existing categories to be softened) so it shouldn’t be a problem.

‘Soapboxing’, ‘professional concern’ and ‘positioning’ have also caused me to think carefully about my definitions.  ‘Soapboxing’ I’m counting as all posts that express an opinion in a strident, one-sided way, with a strong feeling that the writer is venting frustration, and perhaps with a call to action.  These tend to be short posts, probably written because the blogger simply needs to get something off their chest and (possibly, presumably) get some support from others via the comments.  ‘Professional concern’, then, is also a post expressing a view or concern, but the language will be more measured.  Perhaps evidence from research or other bloggers will be cited, and the post will generally be longer.  The blogger may identify themselves as a teacher of some experience, or perhaps a head of department or other school leader.  As with ‘soapboxing’, a point of view will be expressed, but the call to action will be absent.

‘Positioning’ is a blog post that expresses a belief or method that the blogger holds to be valid above others, and expresses this as a series of statements.  Evidence to support the statements will be present, generally in the form of books or published research by educational theorists or other leading experts in the field of education.

Of course, having made some decisions regarding which blogs fit into these categories, I need to go back through some specific examples and try to identify some specific words or phrases that exemplify my decision.  And I fully expect other people to disagree with me, and be able to articulate excellent reasons why blog A is an example of ‘positioning’ rather than ‘professional concern’, but all I can say in response is that, while it’s possible to get a group of humans to agree around 75% of the time, it’s impossible to get them to agree 100%, and that’s both the joy and the curse of this kind of research.

Given more time, I’d choose some edu-people from Twitter and ask them to categorise a sample of blogs to verify (or otherwise) my decision, but as I don’t have that luxury the best I can do is make my definitions as clear as possible, and provide a range of examples as justification.

The other categories that aren’t proving straightforward are ‘feedback, assessment and marking’ (‘FAM’) and ‘behaviour’.  I knew this might be the case, though, so I’m keeping an open mind about these.  I have seen examples of blogs discussing ‘behaviour’ that I’ve put into one of the three categories I’ve mentioned above, but that’s because the blogs don’t discuss ‘behaviour’ exclusively.

Anyway, I’ve categorised 284 (out of a total of 7,788) posts so far so I thought I’d have a bit of a look at the data.


I used Orange again to get a bit more insight into my data.  Just looking at the top flow, after opening the corpus I selected the rows that had something entered in the ‘group’ column I created.


Selecting rows.


Creating classes.

I then created a class for each group name.  This additional information can be saved, and I’ve dragged the ‘save data’ icon onto the workspace, but I’ve chosen not to save it automatically for now.  If you do, and you give it a file name, every time you open Orange the file will be overwritten, which you may not want.  Then, I pre-processed the 284 blogs using the snowball stemmer, and decided I’d have a look at how just the sample might be clustered using k-means.

“Since it effectively provides a ‘suffix STRIPPER GRAMmar’, I had toyed with the idea of calling it ‘strippergram’, but good sense has prevailed, and so it is ‘Snowball’ named as a tribute to SNOBOL, the excellent string handling language of Messrs Farber, Griswold, Poage and Polonsky from the 1960s.”

Martin Porter

I’m not sure if I’ve explained k-means before, but here’s a nice link that explains it well.

“Clustering is a technique for finding similarity groups in a data, called clusters. It attempts to group individuals in a population together by similarity, but not driven by a specific purpose.”

The data points are generated from the words in the blogs.  These have been reduced to tokens by the stemmer, then a count is made of the number of times each word is used in a post.  The count is subsequently adjusted to take account of the length of the document so that a word used three times in a document of 50 words is not given undue weight compared with the same word used three times in a document of 500.  So, each document generates a score for each word used, with zero for a word not used that appears in another document or documents.  Mathematical things happen and the algorithm converts each document into a data point in a graph like the ones in the link.  K-means then clusters the documents according to how similar they are.
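The whole pipeline just described can be sketched in a few lines of scikit-learn – TF-IDF scoring (the length-adjusted counts) followed by k-means.  The posts here are invented one-liners and I’ve skipped the stemming step, so this is the shape of the process rather than my actual workflow:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Invented posts: two about resources, two about behaviour.
posts = [
    "free worksheet resource for algebra",
    "sharing a revision resource worksheet",
    "my thoughts on behaviour policy",
    "behaviour and detention policy rant",
]

X = TfidfVectorizer().fit_transform(posts)  # length-adjusted word scores
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # which cluster each post landed in
```

With luck, the two resource posts land in one cluster and the two behaviour posts in the other – but as noted above, k-means is doing maths on numbers, not reading.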

I already know I have 8 classes, so that’s the number of clusters I’m looking for.  If I deploy the algorithm, I can see the result on a silhouette plot (the matching icon, top far right of the flow diagram above).  The closer to a score of ‘0’, the more likely it is that a blog post is on the border between two clusters.  When I select that the silhouette plot groups each post by cluster, it’s clear that ‘resources’ has a few blogs that are borderline.

‘FAM’ and ‘behaviour’ are more clearly demarcated.  If I let the algorithm choose the optimal number of clusters (Orange allows between 2 and 30), the result is 6, although 8 has a score of 0.708 which is reasonable (as you can see, the closer to 1 the score is, the higher the probability that the number suggested is the ‘best fit’ for the total number of clusters within the data set).
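The silhouette check that Orange is doing can also be reproduced directly in scikit-learn.  This is a toy version on made-up 2-D blobs, not my blog data: `silhouette_score` gives the overall figure (the closer to 1, the better separated the clusters), and `silhouette_samples` gives the per-point scores, where values near 0 flag the borderline posts:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.RandomState(0)
# Two tight, well-separated blobs of toy data points.
X = np.vstack([rng.normal(0, 0.3, (25, 2)),
               rng.normal(3, 0.3, (25, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(round(silhouette_score(X, labels), 2))    # overall score, near 1 here
print(silhouette_samples(X, labels).min() > 0)  # no point sits on a boundary
```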

As you can see from the screenshot below, cluster 4 is made up of posts from nearly all the groups.  Remember, though, that this algorithm is taking absolutely no notice of my categories, or the actual words as words that convey meaning.  It’s just doing what it does based on numbers, and providing me with a bit of an insight into my data.


Developing Categories


An initial estimate of the possible number of categories in the 25% sample of my nine-thousand-odd list of blog posts, provided by the Affinity Propagation (AP) algorithm, suggested over 100 categories.   Based on the words used in the posts it chose to put into a cluster, this was actually reasonable although way more than I can process.  It was also obvious that some of the categories could have been combined: maths- and science-based topics often appeared together, for example.
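One reason AP can suggest so many categories is that, unlike k-means, it decides the number of clusters itself, by passing messages between points until some emerge as ‘exemplars’.  A toy illustration with scikit-learn (made-up blobs, default settings – not my data):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.RandomState(0)
# Three made-up blobs of points, 15 points each.
X = np.vstack([rng.normal(i * 4, 0.4, (15, 2)) for i in range(3)])

ap = AffinityPropagation(random_state=0).fit(X)
# AP picks its own exemplars; you don't tell it how many clusters to find.
print(len(ap.cluster_centers_indices_))
```

The `preference` parameter (left at its default here) nudges how many exemplars emerge – on messy, high-dimensional text data the default can happily produce a hundred-plus.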

A different method provided by an algorithm in Orange (k-means, allowing the algorithm to find a ‘fit’ of between 2 and 30 clusters) suggested three or four clusters.  How is it possible for algorithms, using the same data, to come up with such widely differing suggestions for clusters?  Well, it’s maths.  No doubt a mathematician could explain to me (and to you) in detail how the results were obtained, but for me all the explanation I need is that when you start converting words to numbers and use the results to decide which sets of numbers have greater similarity, you get a result that, while useful, completely disregards the nuances of language.

I made an initial attempt to review the content of the categories suggested by AP, but I had to give up after a few hours’ work.  I identified a good number of potential categories, including the ones suggested by the literature (see below), but I soon realised that it was going to be difficult to attribute some posts to a specific category.  A well-labelled training set is really important, even if it’s a small training set.  So, back to the research that has already been published, describing the reasons why teachers and other edu-professionals blog, and a chat with my supervisor, who made the observation that I needed to think about ‘process’ as opposed to ‘product’.

Bit of a lightbulb moment, then.  I’m not trying to develop a searchable database of every topic covered – I’m trying to provide a summary of the most important aspects of teaching discussed in blogs over a period of time.   The categories arising from the literature are clearly grounded in process, and so these are the ones I’ll use.  If you click on this link, you’ll be able to see the full version of the Analytical Framework, a snippet of which is pictured above.

As well as the main categories (the ones in the blue boxes), I decided to add two more: ‘behaviour’ and ‘assessment / feedback / marking’ simply because these, in my judgement, are important enough topics to warrant categories of their own.  However, I’m aware that they overlap with all the others, and so I may revise my decision in the light of results.  What I’ll have to do is provide clear definitions of each category, linked with the terms associated with the relevant posts.

What will be interesting is exploring each category.  The ‘concordance‘ widget in Orange allows for some of the key terms to be entered, and to see how they’re used in posts.  This will add depth to the analysis, and may even lead to an additional category or two: if it appears, for example, that ‘Ofsted’ dominated blogs within the ‘professional concern’ category for a considerable period of time, an additional category would be justified.  My intention is to divide my data into sets by year (starting at 2004), although it may be prudent to sub-divide later years as the total number of blog posts increases year on year.

Clustering Blog Posts: Part 3

No interesting visuals this time.  I’ve been spending my Saturday going back and hand-labelling what will become a training set of blog posts.

I should have done this before now, but I’ve been putting it off, mainly because it’s so tedious.  I have my sample of 2,316 blogs grouped into 136 clusters, and I’m going through them, entering the appropriate labels in the spreadsheet.  Some background reading has made it clear that a set of well-labelled data, even a small set, is extremely beneficial to a clustering algorithm.  The algorithm can choose to add new documents to the already established set, start a new set, or modify the parameters of the labelled set slightly to include the new document.  Whatever it decides, it ‘learns’ from the examples given, and the programmer can test the training set on another sample to refine the set before launching it on the entire corpus.

There has been some research into the kind of things teachers blog about.  The literature suggests the following categories:

  1. sharing resources;
  2. building a sense of connection;
  3. soapboxing;
  4. giving and receiving support;
  5. expressing professional concern;
  6. positioning.

Some of these are clearly useful – there are plenty of resource-themed blogs, although I instinctively want to label resource-sharing blogs with a reference to the subject.  ‘Soapboxing’ and ‘expressing professional concern’ appear relatively straightforward.  ‘Positioning’ refers to the blogger ‘positioning themselves in the community i.e. as an expert practitioner or possessor of extensive subject knowledge’.  That may be more problematic, although I haven’t come across a post that looked as if it might fit into that category yet.  The ones that are left – ‘support’ and ‘connection’ – are very difficult, grounded as they are in the writers’ sense of feeling and emotion.  I’m not sure they’re appropriate as categories.

The other category that emerges from current research is ‘reflective practice’.   I’ve already come across several blog posts discussing SOLO taxonomy which could be categorised as just that – SOLO taxonomy – or ‘reflective practice’ or ‘positioning’ or ‘professional concern’.   My experience as a teacher (and here’s researcher bias again) makes me want to label (and I already have labelled) these posts as SOLO, because it fits better with my research questions, in the same way that I’m going to label some posts ‘mindset’ or ‘knowledge organiser’.  What I may do – because it’s easy at this stage – is to create two labels where there is some overlap with the existing framework suggested by the literature, which may be useful later.

It’s also worth mentioning that I’m basing my groups on the content of the blog posts.  An algorithm counts the number of times all the words in the corpus are used in each post (so many will be zero) and then adjusts the number according to the length of the document in which it appears.  Thus, each word becomes a ‘score’ and it’s these that are used to decide which documents are most similar to one another.   Sometimes, it’s clear why the clustering algorithm has made the decision it has; other times it’s not, and this is why I’m having to go through the laborious process of hand-labelling.  Often, the blog post title makes the subject of the content clear, but not always.

Teachers and other edu-professionals, Gods-damn them, like to be creative and cryptic when it comes to titling their blogs, and they often draw on metaphors to explain the points they’re trying to make, all of which expose algorithms that reduce language to numbers as the miserable, soulless and devoid-of-any-real-intelligence things they are.  How very dare they.