Developing Categories


An initial estimate of the possible number of categories in the 25% sample of my nine-thousand-odd blog posts, provided by the Affinity Propagation (AP) algorithm, suggested over 100 categories.  Based on the words used in the posts it chose to put into each cluster, this was actually reasonable, although far more than I can process.  It was also obvious that some of the categories could have been combined: maths- and science-based topics often appeared together, for example.

A different method provided by an algorithm in Orange (k-means, allowing the algorithm to find a ‘fit’ of between 2 and 30 clusters) suggested three or four clusters.  How is it possible for algorithms, using the same data, to come up with such widely differing suggestions for clusters?  Well, it’s maths.  No doubt a mathematician could explain to me (and to you) in detail how the results were obtained, but for me all the explanation I need is that when you start converting words to numbers and use the results to decide which sets of numbers have greater similarity, you get a result that, while useful, completely disregards the nuances of language.
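To see how two algorithms can disagree so completely on the same data, here’s a minimal sketch in Python using scikit-learn (which offers both algorithms; Orange wraps similar implementations, so this is illustrative rather than a reproduction of my workflow). The toy blobs stand in for the vectorised blog posts:

```python
# Sketch: two clustering algorithms, identical data, different answers.
from sklearn.datasets import make_blobs
from sklearn.cluster import AffinityPropagation, KMeans

# 300 points in 4 loose groups stand in for the TF-IDF vectors.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=2.0, random_state=42)

# Affinity Propagation chooses its own number of clusters...
ap = AffinityPropagation(random_state=42).fit(X)
n_ap = len(ap.cluster_centers_indices_)

# ...whereas k-means returns exactly the number you ask for.
km = KMeans(n_clusters=10, n_init=10, random_state=42).fit(X)
n_km = len(set(km.labels_))

print(f"Affinity Propagation found {n_ap} clusters; k-means was told to find {n_km}.")
```

Affinity Propagation picks its cluster count from the data’s internal similarities (and its ‘preference’ parameter), while k-means dutifully returns however many you ask for – which is part of why the two can give such different suggestions.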

I made an initial attempt to review the content of the categories suggested by AP, but had to give up after a few hours’ work.  I identified a good number of potential categories, including the ones suggested by the literature (see below), but I soon realised that it was going to be difficult to attribute some posts to a specific category.  A well-labelled training set is really important, even if it’s a small one.  So, back to the research that has already been published describing the reasons why teachers and other edu-professionals blog, and a chat with my supervisor, who made the observation that I needed to think about ‘process’ as opposed to ‘product’.

Bit of a lightbulb moment, then.  I’m not trying to develop a searchable database of every topic covered – I’m trying to provide a summary of the most important aspects of teaching discussed in blogs over a period of time.   The categories arising from the literature are clearly grounded in process, and so these are the ones I’ll use.  If you click on this link, you’ll be able to see the full version of the Analytical Framework, a snippet of which is pictured above.

As well as the main categories (the ones in the blue boxes), I decided to add two more: ‘behaviour’ and ‘assessment /  feedback / marking’ simply because these, in my judgement, are important enough topics to warrant categories of their own.  However, I’m aware that they overlap with all the others, and so I may revise my decision in the light of results.  What I’ll have to do is provide clear definitions of each category, linked with the terms associated with the relevant posts.

What will be interesting is exploring each category.  The ‘concordance’ widget in Orange allows me to enter some of the key terms and see how they’re used in posts.  This will add depth to the analysis, and may even lead to an additional category or two: if it appears, for example, that ‘Ofsted’ dominated blogs within the ‘professional concern’ category for a considerable period of time, an additional category would be justified.  My intention is to divide my data into sets by year (starting at 2004), although it may be prudent to sub-divide later years as the total number of blog posts increases year on year.


Clustering Blog Posts: Part 3

No interesting visuals this time.  I’ve been spending my Saturday going back and hand-labelling what will become a training set of blog posts.

I should have done this before now, but I’ve been putting it off, mainly because it’s so tedious.  I have my sample of 2,316 blogs grouped into 136 clusters, and I’m going through them, entering the appropriate labels in the spreadsheet.  Some background reading has made it clear that a set of well-labelled data, even a small set, is extremely beneficial to a clustering algorithm.  The algorithm can choose to add new documents to the already established set, start a new set, or modify the parameters of the labelled set slightly to include the new document.  Whatever it decides, it ‘learns’ from the examples given, and the programmer can test the training set on another sample to refine the set before launching it on the entire corpus.

There has been some research into the kind of things teachers blog about.  The literature suggests the following categories:

  1. sharing resources;
  2. building a sense of connection;
  3. soapboxing;
  4. giving and receiving support;
  5. expressing professional concern;
  6. positioning.

Some of these are clearly useful – there are plenty of resource-themed blogs, although I instinctively want to label resource-sharing blogs with a reference to the subject.  ‘Soapboxing’ and ‘expressing professional concern’ appear relatively straightforward.  ‘Positioning’ refers to the blogger ‘positioning themselves in the community i.e. as an expert practitioner or possessor of extensive subject knowledge’.  That may be more problematic, although I haven’t yet come across a post that looked as if it might fit into that category.  The ones that are left – ‘support’ and ‘connection’ – are very difficult, grounded as they are in the writers’ sense of feeling and emotion.  I’m not sure they’re appropriate as categories.

The other category that emerges from current research is ‘reflective practice’.  I’ve already come across several blog posts discussing SOLO taxonomy which could be categorised as just that – SOLO taxonomy – or ‘reflective practice’ or ‘positioning’ or ‘professional concern’.  My experience as a teacher (and here’s researcher bias again) makes me want to label these posts as SOLO – indeed, I already have – because it fits better with my research questions, in the same way that I’m going to label some posts ‘mindset’ or ‘knowledge organiser’.  What I may do – because it’s easy at this stage – is create two labels where there is some overlap with the existing framework suggested by the literature, which may be useful later.

It’s also worth mentioning that I’m basing my groups on the content of the blog posts.  An algorithm counts the number of times all the words in the corpus are used in each post (so many counts will be zero) and then adjusts the number according to the length of the document in which it appears.  Thus, each word becomes a ‘score’, and it’s these scores that are used to decide which documents are most similar to one another.  Sometimes it’s clear why the clustering algorithm has made the decision it has; other times it’s not, and this is why I’m having to go through the laborious process of hand-labelling.  Often the blog post title makes the subject of the content clear, but not always.
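The counting-and-adjusting step can be sketched in a few lines of plain Python.  This toy version uses simple length-normalised counts rather than the full TF-IDF weighting a real pipeline would apply, and the two ‘posts’ are invented:

```python
# Toy illustration of the 'score' idea: count each word per document,
# then divide by document length so long posts don't dominate.
from collections import Counter

docs = [
    "marking marking feedback takes time",
    "behaviour policy and behaviour management in the classroom",
]

# Every word in the corpus gets a slot; most scores will be zero.
vocab = sorted({w for d in docs for w in d.split()})

def scores(doc):
    counts = Counter(doc.split())
    length = len(doc.split())
    return [counts[w] / length for w in vocab]

for d in docs:
    print(scores(d))
```

Each list of numbers is the document’s ‘position’ in the corpus; similarity between documents is then just distance between these lists.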

Teachers and other Edu-professionals, Gods-damn them, like to be creative and cryptic when it comes to titling their blogs, and they often draw on metaphors to explain the points they’re trying to make, all of which expose algorithms that reduce language to numbers as the miserable, soulless and devoid-of-any-real-intelligence things they are.  How very dare they.


Clustering Blog Posts: Part 2 (Word Frequency)

One of the most important things to do when working with a lot of data is to reduce the dimensionality of that data as far as possible.  When the data you are working with is text, this is done by reducing the number of words used in the corpus without compromising the meaning of the text.

One of the most fascinating things about language was discovered by G K Zipf in 1935¹: the most frequently used words in (the English) language are actually few in number, and obey a ‘power law’.  The most frequently used word occurs twice as often as the next most frequent word, three times as often as the third, and so on.  Zipf’s law forms a curve like this:
The distribution seems to apply to languages other than English, and it’s been tested many times, including on the text of novels.  It seems we humans are very happy to come up with a rich and varied lexicon, but then rely on just a few words to communicate with each other.  This makes perfect sense as far as I can see: saying I live on a boat gets the essentials across (a thing that floats, a bit of an alternative lifestyle, how cool am I? etc.), whereas were I to say I live on a lifeboat, I’d then have to explain that it’s like one of the fully-enclosed ones you see hanging from the side of cruise ships, not the open Titanic-style ones most people would imagine.
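The rank-frequency pattern is easy to check on any text with a few lines of Python: count the words, sort by frequency, and look at frequency × rank, which under Zipf’s law stays roughly constant.  (A toy sentence like the one below only shows the mechanics; you need a real corpus to see the law properly.)

```python
# Rank-frequency check: under Zipf's law, freq * rank is roughly constant.
from collections import Counter

text = ("the cat sat on the mat and the dog sat by the door "
        "while the cat watched the dog and the mat")
counts = Counter(text.split())
ranked = counts.most_common()          # [(word, freq), ...], highest first

for rank, (word, freq) in enumerate(ranked[:5], start=1):
    print(rank, word, freq, freq * rank)
```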

“For language in particular, any such account of the Zipf’s law provides a psychological theory about what must be occurring in the minds of language users. Is there a multiplicative stochastic process at play? Communicative optimization? Preferential reuse of certain forms?” (Piantadosi, 2014)

A recent paper by Piantadosi² reviewed some of the research on word frequency distributions, and concluded that, although Zipf’s law holds broadly true, there are other models that provide a more reliable picture of word frequency which depend on the corpus selected.  Referring to a paper by another researcher, he writes “Baayen finds, with a quantitative model comparison, that which model is best depends on which corpus is examined. For instance, the log-normal model is best for the text The Hound of the Baskervilles, but the Yule–Simon model is best for Alice in Wonderland.”

I’m not a mathematician, but that broadly translates as ‘there are different ways of calculating word frequency: you pays your money, you takes your choice’.  Piantadosi then goes on to explain the problem with Zipf’s law: it doesn’t take account of the fact that some words may occur more frequently than others purely by chance, giving the illusion of an underlying structure where none may exist.  He suggests a way to overcome this problem: use two independent corpora, or split a corpus in half and then test the word frequency distribution in each.  He then tests a range of models, finds that the “…distribution in language is only near-Zipfian”, and concludes “Therefore, comparisons between simple models will inevitably be between alternatives that are both “wrong.” “.
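The split-half idea is straightforward to sketch: shuffle the tokens, split them into two halves, and count frequencies in each half independently.  Words that rank highly in both halves are unlikely to be frequent merely by chance.  (Toy sentence again; the principle is Piantadosi’s, the code is just an illustration.)

```python
# Sketch of the split-half test: estimate word frequencies in two
# independent halves of the corpus and compare.
from collections import Counter
import random

tokens = ("the teacher set the class a task and the class finished the task "
          "before the bell so the teacher praised the class").split()

random.seed(0)
random.shuffle(tokens)
half = len(tokens) // 2
a, b = Counter(tokens[:half]), Counter(tokens[half:])

# Words that are frequent in both halves are unlikely to be frequent
# merely by chance.
for word in ("the", "class"):
    print(word, a[word], b[word])
```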

Semantics also has a strong influence on word frequency.  Piantadosi cites a study³ that compared 17 languages across six language families and concluded that simple words are used with greater frequency in all of them, resulting in a near-Zipfian model.  More importantly for my project, he notes that other studies indicate that word frequencies are domain-dependent.  Piantadosi’s paper is long and presents a very thorough review of research relating to Zipf’s law, but the main point is that the pattern does exist, even though why it should is still unclear.  The next question is: should the most frequently used words from a particular domain also be removed?

As I mentioned before, research has already established that it’s worth removing (at least as far as English is concerned) a selection of words.  Once that’s done, which are the most frequently used words in my data?  I used Orange to split my data in half, generate three word clouds based on the same parameters, and observe the result.  Of course I’m not measuring the distribution of words – I’m just doing a basic word count and then displaying the results – but it’s a start.  First, here’s my workflow:


I’ve shuffled (randomised) my corpus, taken a training sample of 25%, and then split this again into two equal samples.  Each of these has been pre-processed using the following parameters:


Pre-processing parameters. I used the lemmatiser this time.

The stop word set is the extended set of ‘standard’ stop words used by Scikit that I referred to in my previous post, plus a few extra things to try and get rid of some of the rubbish that appears.

The word clouds for the full set, and each separate sample, look like this:


Complete data set (25% sample, 2316 rows)


50% of sample (1188 rows)


Remaining 50% of sample

The graph below plots the frequency with which the top 500 words occur.


So, I can conclude that, based on word counts, my two samples are similar to each other and to the total (sampled) corpus.  This is good.

So, should I remove the most frequently used words, and if so, how many?  Taking the most frequently used words across each set, and calculating the average for each word, gives me a list as follows:


And if I take them out, the word cloud (based on the entire 25% set) looks like this:


Which leads me to think I should take ‘learning’ and ‘teaching’ out as well.  It’s also interesting that the word ‘pupil’ has cropped up here – I wonder how many teachers still talk about pupils rather than students?  Of course, this data set contains blogs that may be a few years old, and/or be written by bloggers who prefer the term.  Who knows?  In fact, Orange can tell me.  The ‘concordance’ widget, when connected to the bag of words, tells me that ‘pupil’ is used in 64 rows (blogs) and will show me a snippet of the sentence.
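For anyone without Orange to hand, the concordance widget’s behaviour – keyword-in-context, or KWIC – can be approximated in a few lines of plain Python (the example post is invented):

```python
# Minimal keyword-in-context (KWIC) sketch of what a concordance
# widget does: show each occurrence of a term with surrounding words.
def concordance(text, term, window=3):
    words = text.split()
    hits = []
    for i, w in enumerate(words):
        if w.lower() == term:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append(f"{left} [{w}] {right}")
    return hits

post = ("Every pupil should know the success criteria and each pupil "
        "needs feedback on their progress")
hits = concordance(post, "pupil")
for line in hits:
    print(line)
```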


It’s actually used a total of 121 times, and looking at the context I’m not convinced it adds value in terms of helping me with my ultimate goal, which is clustering blog posts by topic.  It’s probably worth mentioning here that the words used least often are going to be the most numerically relevant when it comes to grouping blogs by topic.


Could I take out some more?  This is a big question.  I don’t want to remove so many words that the data becomes difficult to cluster.  Think of this as searching the blog posts using key words, much as you would when you search Google.  Where, as a teacher, you might want to search ‘curriculum’, you might be more interested in results that discuss ‘teaching (the) curriculum’ rather than those that cover ‘designing (the) curriculum’.  If ‘teaching’ has already been removed, how will you find what you’re looking for?  Alternatively, does it matter so long as the search returns everything that contains the word ‘curriculum’?  You may be more interested in searching for ‘curriculum’ differentiated by key stage.  For my purposes, I think I’d be happy with a cluster labelled ‘curriculum’ that covered all aspects of the topic.  I’ll be able to judge when I see some actual clusters emerge, and have the chance to examine them more closely.  ‘Curriculum’, incidentally, is used in 93 blogs according to the concordance widget, and appears 147 times.  That’s more than ‘pupil’, but because of my specialised domain knowledge I judge it to be more important to the corpus.

Which is also a good example of researcher bias.

  1. Zipf, G. K.; The Psycho-Biology of Language: An Introduction to Dynamic Philology; 1966; The M.I.T. Press.
  2. Piantadosi, S.; Zipf’s word frequency law in natural language: A critical review and future directions; Psychonomic Bulletin & Review; 2014; volume 21, pages 1112–1130.
  3. Calude, A., Pagel, M.; How do we use language? Shared patterns in the frequency of word use across 17 world languages; 2011; Philosophical Transactions of the Royal Society B: Biological Sciences; volume 366, issue 1567, pages 1101–1107.

Clustering Blog Posts: Part 1

Today, while my laptop was employed scraping Edu-blog posts from 2011, I decided to play around with Orange.  This is one of the suite of tools offered by Anaconda which I use for all my programming needs.

The lovely thing about Orange is that it allows you to build a visual workflow,  while all the actual work in the form of the lines and lines of code is done behind the scenes.  You have to select some parameters, which can be tricky, but all the heavy lifting is done for you.

This was my workflow today, although this was about halfway through so by the time I’d finished, there were a few more things there.  Still, it’s enough to show you how it works.


My corpus is a sample of 9,262 blog posts gathered last year.  Originally, there were over 11,000 posts but they’ve been whittled down by virtue of having no content, having content that had been broken up across several rows in the spreadsheet, or being duplicates.  I also deleted a few that simply weren’t appropriate, usually because they were written by educational consultants as a means to sell something tangible, such as books or software, or were political in some way, such as blogs written for one of the teaching unions.  What I’ve tried to do is identify blog URLs that contain posts by individuals, preferably but not exclusively teachers, with a professional interest in education and writing from an individual point of view.  This hasn’t been easy, and I’m certain that when I have the full set of data (which will contain many tens of thousands of blog posts) some less than ideal ones will have crept in, but that’s one of the many drawbacks of dealing with BIG DATA: it’s simply too big to audit.

You may recall that the point of all this is to classify as much of the Edu-blogosphere as I possibly can – to see what Edu-professionals talk about, and to see if the topics they discuss change over time.  Is there any correlation between, for example, Michael Gove being appointed Secretary of State for Education and posts discussing his influence?  We’ll see.  First of all, I have to try and cluster the posts into groups according to content.  I’ve been doing this already, and developed a methodology.  However, while I’m still gathering data, and labelling a set of ‘training data’ (of which more in a future blog post) I’ve been experimenting with a different set of tools.

So, here’s my first step using Orange.  Open the corpus, shuffle (randomise) the rows, and take a sample of 25% for further analysis, which equates to 2316 documents or rows.


Open the document, select the data you need, e.g. ‘Content’.  The other features can still be accessed.




The ‘corpus viewer’ icons along the top of the workflow shown above mean that I can see the corpus at each stage of the process.  This is a glimpse of the second one, after shuffling.  Double-clicking any of the icons brings up a view of the output, as well as the various options available for selection.


The next few steps will give me an insight into the data, and already there are a series of decisions to make.  I’m only interested in pre-processing the content of each blog post.  The text has to be broken up into separate words, or ‘tokens’, with punctuation removed, plus any URLs that are embedded in the text.  In addition, other characters such as /?&* are also stripped out.  So far, so straightforward, as this screen grab shows:



Words can also be stemmed or lemmatised (see above).

“Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.  Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.  If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun.” – The Stanford Natural Language Processing Group
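To see why stemming counts as a ‘crude heuristic’, here’s a deliberately minimal suffix-stripping stemmer in Python – nothing like the real Porter or Snowball algorithms, just an illustration of how chopping endings sometimes works and sometimes mangles:

```python
# A deliberately crude suffix-stripping stemmer: it often works
# ('teaching' -> 'teach') and sometimes mangles words
# ('studies' -> 'stud', 'was' -> 'wa').
def crude_stem(word):
    for suffix in ("ing", "ies", "ed", "es", "s"):
        # Only strip if enough of the word would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

for w in ("teaching", "learning", "studies", "classes", "was", "saw"):
    print(w, "->", crude_stem(w))
```

Note that ‘saw’ survives untouched – which is precisely the case where only a lemmatiser, with a real vocabulary behind it, could decide between ‘see’ and ‘saw’.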

If I lemmatise the text as is, I get this lovely word cloud.  BTW, it’s just a coincidence that ‘example’ is in red.

As you would expect from blogs written by teachers, words like ‘student’ and ‘teacher’ feature heavily.  If I use the snowball stemmer (which is basically an efficient algorithm for stemming, explained here) then the word cloud looks like this:


Both of these word clouds are also generated from the corpus after a series of words known as ‘stop words’ are removed.  These are the words we use most frequently but which add little value to a text: words such as ‘the’, ‘is’, ‘at’ or ‘which’.  There is no agreed standard, although many algorithms use the list provided by the Natural Language Toolkit (NLTK).  I’ve chosen to use the list provided by Scikit-Learn, a handy module providing lots of useful algorithms; their list is slightly longer.  The use of stop words is well researched and recommended to reduce the number of unique tokens (words) in the data, otherwise referred to by computer scientists as dimensionality reduction.  I also added some other nonsense that I noticed when I was preprocessing this data earlier in the year – phrases like ‘twitterfacebooklike’ – so in the end I created my own list combining the ‘standard’ words and the *rap, and copied them into a text file.  This is referred to as ‘NY17stopwordsOrange.txt’ in the screenshot below.
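In code, combining Scikit-Learn’s built-in list with custom additions is a one-liner (this is a sketch of the idea, not the contents of my actual stop-word file):

```python
# Combine scikit-learn's built-in stop-word list with custom junk
# terms, then filter a post's tokens through the combined set.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

custom = ENGLISH_STOP_WORDS.union({"twitterfacebooklike"})

post = "the lesson which is at the heart of the classroom twitterfacebooklike"
tokens = [w for w in post.split() if w not in custom]
print(tokens)
```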


My next big question, though, is what happens if I add to the list some of the words that are most frequently used in my data set – words like ‘teach’, ‘student’, ‘pupil’, ‘year’?  So I added these words to the list: student, school, teacher, work, year, use, pupil, time, teach, learn.  This is the result, using the stemmer:


There is research to suggest that creating a bespoke list of stop words that is domain-specific is worth doing as a step before going on to try and classify a set of documents.  It’s the least-used words in the corpus that are arguably the most interesting and valuable.  I’ll explore this some more in the next post, along with the following steps in the workflow.

New Allotment

So, a few weeks ago I was finally allocated an allotment and I went to have a look.  As is normal with allotments (it seems), by the time the tenant finally gives them up, they’ve been left for at least a year, often two.  This one was no different.

From top to bottom…

I admit I thought about it for a couple of days, but in the end decided I probably wasn’t going to get anything any better any time soon, so accepted and went to the local council offices to sign the paperwork.

The sheds at the bottom were both falling down. The first task was making one good one out of the remains.  Oh, and clearing the brambles.

Let me tell you, some of those brambles were massive.  And have you ever tried buying a sickle or a scythe? Basically, it’s impossible to pick either up from a shop, even a large garden centre, but it turns out you can buy any number of sharp agricultural implements online.  Not that we needed to.  We found both in the smaller shed!

Under construction…

...a bit more...


Yesterday, I put a lock on the shed and took down all the tools and other useful things I’d brought from home.  This isn’t my first allotment.

Why do I need extra work? Well, aside from the fresh air, exercise, and the fact that I’m a vegetarian who loves all things that come from the soil (except sprouts and turnips), as some of you know I live on a boat and use a waterless toilet.  This generates stuff that goes into a ‘hot bin’, an insulated compost bin that heats the contents to over 90°C.  This is essential to kill off any bacteria that may be lurking, although given my vegetable diet there probably aren’t many that are harmful. Certainly emptying solids into the bin isn’t an unpleasant chore.  Meat eaters produce horrid stuff.

Once the bin has been doing its thing at over 60°C for three months, the contents are ready to be spread on a garden, and this is why I need an allotment.  I have a small patch of garden to use where my boat is moored, but it’s not enough.  Every three months I get a wheelbarrow-load of lovely compost, and it needs to go somewhere.  At least now I can take it to the allotment and will be able to use it there in the spring.

The next stage is to kill back all the grass and weeds so that I can dig over a strip for late autumn / spring planting.   While I try and be as environmentally friendly as I can, I’ll use chemicals for this.  There really isn’t anything more effective, and these days it’s relatively safe.  Plus, I’m not treating it all, just about a third.

The thing is, what looks like a huge undertaking is really only a series of small jobs.  And this is what I keep telling myself…

The Problem With Guessing K-Means

I’ve been grappling with the problem of how to find out what a group of professionals blog about. That seems simple enough on the face of it, but when there are over 9,000 blogs in a sample set of data, it’s not so easy. I can’t read every one, and even if I could, can you imagine how long it might take me to group them into topics?

Enter computer science in the form of algorithms.

I’ll gloss over the hours…. days…. weeks of researching how the various alternatives work, and why algorithm A is better than algorithm B for my type of data. Turns out k-means is the one I need.

Put very simply, each blog post (document) is made up of words. Each word is used a certain number of times, both in the document and in the entire collection of documents (corpus). An adjustment must be made for the overall length of the document (a word used ten times in a document of 100 words doesn’t have the same significance as the same word used ten times in a document of 1,000 words), but once this has been done it’s possible to give each document an overall ‘score’, which is converted to a position (or vector) within the corpus.

It helps to think of the position as a ‘vector’ in a space with a huge number of dimensions, even if you can’t visualise it, which I can’t. Having done this, it’s then possible for k-means to randomly pick a number of starting vectors (the number, k, being chosen in advance). It assigns each document to its closest starting vector, then moves each vector to the centre of the documents assigned to it, and repeats the assignment and re-centring steps over and again until the clusters stop changing (or it’s told to stop after a maximum number of tries, or iterations), and then it tells you how many documents it’s put in each cluster.
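The whole pipeline – vectorise, then cluster – can be sketched in a few lines with scikit-learn (toy posts, not my data; Orange does equivalent work behind its widgets):

```python
# The pipeline in miniature: word counts adjusted for document length
# and rarity (TF-IDF), then k-means on the resulting vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

posts = [
    "marking and feedback strategies for essays",
    "effective feedback and marking workload",
    "behaviour management in the classroom",
    "classroom behaviour and sanctions policy",
]

X = TfidfVectorizer().fit_transform(posts)   # each post becomes a vector
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # which cluster each post landed in
```

With shared vocabulary like ‘marking’/‘feedback’ in the first pair and ‘behaviour’/‘classroom’ in the second, the two-cluster split falls out naturally; real blog posts are, of course, far messier.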

In theory, the algorithm should group the documents the same way every time you run it, although that doesn’t always happen, as I found with my data. The other thing is, without grouping the set manually, there’s no way of telling what the number k should actually be, which rather defeats the point of the algorithm… except that when you’re dealing with large data sets, you’ve got no choice.

Of course, you CAN just keep clustering, adding 1 to your chosen number for k until you think you’ve got results you’re happy with. I started doing that, beginning with 10 and working up to 15, by which time I was totally bored and considering the possibility that my actual optimum number of clusters might be over 100… Every time I ran the algorithm, the number of posts in each cluster changed, although two clusters were stable. That seemed to be telling me that I was a long way from finding the optimum number.

Enter another load of algorithms that can help you estimate the optimum number for k. They aren’t a magic bullet – they can only help with an estimation, and each one goes about the process in a different way. I chose the one I did because a) I found the code in a very recent book written by a data scientist, and b) he gave an example of how to write the code AND IT WORKED.

Guess how many clusters it estimated I had? Go on, guess….. seven hundred and sixty. Of course I now have to go back and evaluate the results, but still. Seven hundred and sixty.
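I won’t reproduce the book’s code here, but one common way of estimating k – not necessarily the method I used – is to run k-means over a range of values and score each result, for example with the silhouette score (higher means tighter, better-separated clusters). A sketch with made-up data:

```python
# Estimating k by scoring k-means results over a range of candidate
# values with the silhouette score (one of several k-estimators).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=1)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On well-separated synthetic blobs the score tends to peak at the true number of groups; on messy text data the peak is far less clear-cut, which is why these estimators help rather than decide.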

Good job I stopped at 15.


Having successfully divided my data set up into separate years yesterday, I thought I’d go back to basics and have a look at stopwords.

In language processing, it’s apparent that there are quite a few words that add absolutely no value to a text.  These are words like ‘a’, ‘all’, ‘with’ etc.  NLTK (Natural Language Tool Kit – a module that can be used to process text in various ways; you can have a play with it here) has a list of 127 words that could be considered the most basic ones.  Scikit-learn (which I’m using for some of the more complicated text-processing algorithms) uses a list of 318 words taken from research carried out by the University of Glasgow.  A research paper published by them makes it clear that a fixed list is of limited use, and in fact a bespoke list should be produced if the corpus is drawn from a specific domain, as mine is, with blogs written by teachers and other Edu-professionals.

Basically, the more frequently a word is used in a corpus, the less useful it is.  For example, if you were presented with a database of blogs written by teachers and you wanted to find the blogs written about ‘progress 8’, that’s what your search term would be, possibly with some extra filtering words like ‘secondary’ and ‘England’.  You would know not to bother with ‘student’, ‘children’ or ‘education’ because they’re words you’d expect to find in pretty much everything.  Those words are often referred to as ‘noise’.

The problem is that if the word ‘student’ was taken out of the corpus altogether, and treated as a stopword, that might have an adverse effect on the subsequent analysis of the data.  In other words, just because the word is used frequently doesn’t make it ‘noise’.   The bigger problem, then, is how to decide which of the most frequently used terms in a corpus can safely be removed.  And of course there’s the issue of whether the words on the default list should be used as well.
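One practical compromise, for what it’s worth, is to let the vectoriser itself discard the corpus-specific high-frequency words: scikit-learn’s TfidfVectorizer has a max_df parameter that drops any term appearing in more than a given fraction of documents. A sketch with invented posts:

```python
# Corpus-specific stop words without a hand-made list: max_df discards
# any term whose document frequency exceeds the given fraction.
from sklearn.feature_extraction.text import TfidfVectorizer

posts = [
    "student progress and student feedback",
    "student behaviour in the playground",
    "student workload and marking",
    "curriculum design for history",
]

# 'student' appears in 3 of 4 posts (0.75 > 0.7), so it gets dropped.
vec = TfidfVectorizer(max_df=0.7)
vec.fit(posts)
print(sorted(vec.vocabulary_))
```

This sidesteps the ‘which frequent words are safe to remove’ question by making the cut-off a tunable parameter rather than a fixed list – though the underlying judgement about where to set the threshold remains.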

The paper I referred to above addresses this very problem, with some success.  I’m still trying to understand exactly how it works, but it seems to be based on the idea that a frequently-used word may in fact be an important search term.  And the reason I’ve spent so much time on this is because the old adage ‘rubbish in, rubbish out’ is indeed true, and before I go any further with the data I have, I at least need to understand the factors that may impact the results.