Tag Archives: data science

Houston, We May Have A Problem….

I’ve been writing up my PhD.  This has been a very slow process, mainly because I’ve had to spend quite a bit of time going back through all my references, and re-planning the whole thing.  I bought post-it notes, and a tabletop flip chart (which is also like one massive post-it), and I’ve re-appraised everything.  As I write, I’m constantly adding more post-its as prompts of things I need to look up / do / add to the ‘discussion’ section at the end.

One of the things I decided I’d do was go back through my original data to make sure that I’d gathered everything I needed to, and to see if I could improve the cleaning-up process.  In computer science circles, this is often referred to as ‘text wrangling’.  Your typical blog post contains URLs, other advertising rubbish that’s added by the platform, junky unicode, characters representing carriage returns, new lines…. I could go on.  This all has to be removed.  A text data file, when its being prepared for analysis, can get very big very quickly – and there’s a limit to the data file that even my pretty-well-spec’d laptop can handle.  Having discovered this excellent site, I can not copy and paste a section of a blog post with some rubbish in it, and generate the code snippet that will remove it.  Regex can be tricky – the broader the parameters i.e. the greater freedom you give it to remove the stuff you don’t want, the more chance there is that it’ll remove other stuff you really would have preferred to keep.  It’s difficult to check, though, so in the end you probably have to just take the risk.

The other thing I wanted  to do was expand the contractions in the posts so that ‘isn’t’ becomes ‘is not’ etc.  I think it’s important to leave behind a data set that may be useful to future researchers, some of whom might be interested in sentiment analysis.  Expanding contractions helps to keep the meaning of the writing intact.

Then, I decided I’d go back and look again at how I’d processed my data.  As you may recall, my aim is to classify as many edu-blogs as possible according to a pre-defined list of categories drawn from the research that’s already been done on what teachers blog about.   I chose this approach because the potential number of topics is completely unknown, and  potentially huge.  It’s possible to run an algorithm that will cluster blogs without any prior information, but the trouble is that a) you still have to give it some idea how many clusters you might be expecting, and b) the results will vary slightly each time it’s run.  Its not a model; there’s no consistency.

One of the alternatives is to label a small set of blog posts with numbers representing categories, and then use an algorithm that will take this information and classify the unlabelled posts.  This is how it works: imagine having a double handful of brown smarties and a clear perspex box, say 1m x1m.  You throw the smarties into the box, but by magic they remain stationery, but scattered, in space. Now you have a small number coloured  smarties, several of the remaining colours, and you chuck them in as well.  They also hang in space.  The label spreading algorithm assumes that the coloured smarties are the labels, and it sets about relabelling all the brown smarties according to how close they are to each different colour.  You can allow it to change the colours of the non-brown smarties if you want, and you can give it some freedom as to how far it can spread, say, the red colour.  The algorithm spreads and re-spreads each colour (some of the different coloured smarties will be quite close to each other…. where should the boundary be drawn?) until it reaches convergence.

The picture here (and above) is a great example.  Not only does it look like a load of smarties (which I’m now craving btw) but it also perfectly illustrates one of the fundamental problems with this approach – if your data, when plotted into a 3D space, is an odd shape, spreading labels across it can be a bit of a problem.  The algorithm draws a network (there are lines connecting the smarties if you look closely) and uses the links between the smarties – officially called ‘nodes’, links are ‘edges’ – to determine how many ‘hops’ (edges) it takes to get from your labelled node to your closest unlabelled one.

Each of these nodes could represent a blog post.  It has co-ordinates in this space.  The co-ordinates are generated from the words contained in the post.  The words have to be represented as numbers because none of the algorithms can deal with anything else – this is maths territory we’re in, after all.

I’ve one this label spreading thing before with a sample set of data.  It seemed to work ok.  A quick audit of the results was promising.  I had another run through the code with a different set of data, including the training set I’d developed earlier, and realised that things weren’t quite the same.  The algorithm has had a bit of an upgrade since I last deployed it.  There were some issues, and the developers from scikit-Learn made some improvements.  That got me re-thinking what I’d done, and I realised two things: I’d made a fundamental error, and the new results I was getting needed a bit of an audit.

The book on the right has been invaluable!

The fundamental error really shows up how hard it is to do data / computer science when you aren’t a data / computer scientist.  I was feeding the algorithm the wrong set of data.  I should have been feeding it an array of data based on distance, but I wasn’t.  I was still getting results though, so I didn’t notice.  The thing is, nowhere is there anything that says ‘if you want to do this, you must first do this because this’.  It’s just assumed by every writer of computer science books and blogs and tutorials that you know.  I went back and re-read a few things, and could see that the crucial bit of information was inferred.  I can spot it now I’ve gained a lot more knowledge.  So, fault corrected, move on, nothing to see here.

The audit of results isn’t very encouraging, though.  There were many mis-categorisations, and some that were just a bit…. well… odd but understandable.  One of my categories is ‘soapboxing’ – you know, having a bot of a rant about something.  Another is ‘other’ to try and catch the posts that don’t fit anywhere else.  Turns out of you have a rant in a blog post about something that isn’t about education, it still gets classed as ‘soapboxing’, which makes perfect sense when you think about it.  An algorithm can’t distinguish between a post about education and a post that isn’t, because I’m thinking about concepts / ideas / more abstract topics for blog posts, and it’s just doing maths.  Post x is closer to topic a than topic b, and so that’s where it belongs.

There are other approaches to this.  I could use topic modelling to discover topics, but that has problems too.  ‘People’ might be a valid topic, but is that useful when trying to understand what teachers have been blogging about?

My label spreading approach has been based on individual words in a blog post, but I could expand this to include commonly-occurring pairs or trios of words.  Would this make a significant difference?  It  might.  It would also put some strain on my laptop, and while this shouldn’t necessarily be a reason not to do something, it’s a legitimate consideration.  And I have tried tweaking the parameters of the algorithm.  It makes little difference.  Overall, the results aren’t different from one another, which is actually a good thing.  I can make a decision about what settings I think are best, and leave it at that.  The problem, the real problem, is that I’m working with text data – with language – and that’s a problem still not solved by AI.

What I cannot do is make the data fit my outcome.  Worst case scenario, I have a lot to add to the ‘discussion’ part of my PhD.  If I can come up with a better analytical framework, I will.  The hard work – harvesting and wrangling the data – has already been done.  If I have to find some more papers to add to the literature review, that’s no hardship.  In the meantime, I’ve slowed down again, but I’m learning so much more.


University of Shanghai Datathon (March 2018)

It’s taken me longer than usual to write this blog post.  This is partly because I’ve been very poorly since returning from China (the Doctor thinks I may have picked up a viral infection on the plane on the way home), and partly because the trip was very different from previous ones.  The purpose of this trip was to participate in a Datathon, with one day scheduled to have a look around Shanghai.  As it turned out, spending four days exploring data with other students was an absolute joy, and I was able to use many of my skills and the data processing tools I’d gathered which made the whole experience a really positive one.  The group of students I was working with – Hello if you’re reading this! – were absolutely lovely, and they looked after me really well.

My lovely group: Richard, Christie, and Eric. Our presentation is in the background!

The hotel where we were staying was within the University campus (I think I’m right in saying it’s actually owned by the University of Shanghai), which itself is a 50-minute journey by subway to the city centre.  I wish I’d taken more pictures of the campus, which was large, open and airy, with lots of green space and gardens (all my photos are here).  The computer science building where we were based was a few minutes’ walk away.

The data we were given to work with included some from the NYC Taxi and Limousine commission.  This is a huge set of data that people have already done some amazing – and silly – things with like this which shows that you can make it from Upper West Side to Wall Street in 30 minutes like Bruce Willis and Samuel L Jackson.  The theme of the exercise was ‘Smart Transport, Our Environment, and More’, which is a very hot topic at the moment, especially driver-less vehicles.   The University of Shanghai is conducting a lot of research on autonomous vehicles, including transport by sea.  We were given one year of data to work with, but even when this year was broken down into months, the size of the files made it impossible to work with on a laptop.  While my group worked on the main project, I drew a 1% sample of January 2013 to work with, the largest sample I could extract and still be able to process the data.  I’ve included a few images here, which were generated using Orange (part of the Anaconda suite) which I’ve  blogged about previously.

Passenger count – mainly – Surprise! – single person journeys.

All three groups in the Datathon converged around the idea of predicting where taxis would be in highest demand, and at what times.  There’s a link to our presentation, data and code here, and the work of the other groups can be found here.  I particularly liked the work on ‘volcanoes and black holes’, which is basically the same problem, but visualised differently.


James with a couple of his group. That’s part of the ‘volcanoes and black holes’ presentation from Jon’s group behind them.

The other two PhD students – Jon and James – were both really good coders, which was just as well as the students they were working with were less experienced in this area.  In my group is was the opposite – they were able to crack right on with writing the code, while I did some of the ‘big picture’ stuff and helped with the presentation.

The nice thing about working with geo-tagged data is that is can be used to generate some lovely graphics.  These can tell you so much, and prompt other questions, like for example why don’t more people share a cab, and what would it take to persuade them to do so?  Even so, and although I haven’t been to New York, I do know that you have to know more about the location than a map and data will tell you.  You also have to know about people, and the way they behave.  Nevertheless, this is a fascinating open data set, which is being added to every year.  Similar data would be, I believe, easily available in Shanghai and other cities in China, and no doubt will be used in similar research.

Here you can see all the journeys from my sample plotted on a map.

2-6 passenger journeys. The larger the circle, the further the trip.

We all presented our work on Monday, 26th March in front of Professor Dame Wendy Hall, Professor Yi-Ke Guo, and Dr. Tuo Leng.  I know they were impressed with what had been achieved, and I think all the students (us included) gained a lot from the experience.  This is my second trip to China, and I have to say it made a huge difference being able to do something with the data.  In my (limited) experience, unless you’re a naturally gregarious person, it can be difficult to get fully engaged when some of the people you’re working with don’t speak English very well, and/or are reluctant to speak.  Fortunately for me, my group were both good English speakers, and happy to chat while working.  For Jon and James, I think the students with them were less chatty, but the fact that the guys could write code helped to break down those barriers.  the fact that I could code, and had some useful data analysis tools I could draw on, made all the difference.  I felt more confident, knowing that I could make some useful contributions.  Of course, Shanghai is a more cosmopolitan city than Shenzen, which probably makes a difference.

To sum up, then, this was a proper working trip which turned out to be both interesting and informative.  I met some lovely, lovely people and had a brilliant time.  I even managed to find plenty of vegetarian food to eat, and proper coffee.  I’m glad I’m not a vegan, though.

So, What DO teachers talk about?

So, having put the final piece of the coding jigsaw in place, here are the first set of results.  The diagram below represents a set of 7,786 blog posts gathered from blog URLs.  The earliest is 2009, the latest 2016.  They’re currently a  lumped in together, although in the end the data set will be a) much, much larger, and b) broken down by year (and normalised so that a proper comparison can be made).

There are lots of things going on here – how I’ve defined the categories; how I initially categorised some posts to form a training set; how the algorithms work and were applied to the data; in spite of what some people will tell you, data science has all the appearances of giving nice, clear cut answers when in fact the opposite – especially when dealing with text – is often true.

The journey to get here has been long and challenging.  Still, I’m happy.


Label Spreading

This week, I finally managed to get the last lines of code I needed written.  I wanted to apply the label spreading algorithm provided by scikit learn but the documentation provided is next to useless, even bearing in mind how much I’ve learned so far.  There are other ways of grouping data, but my approach from the start has always been to go with the most straightforward, tried and tested methods.  After all, my contribution isn’t about optimising document classification, but the results of document classification, which will reveal what pretty much everyone from one community who writes a blog has been writing about.

The label spreading algorithm works by representing a document as a point in space, and then finding all the other points that are closest to it than, say, another document somewhere else.  I gave the algorithm a set of documents that I’d already decided should be close to each other in the form of a training set of blog posts allocated to one of six categories.  The algorithm can then work out how the rest of the unlabelled blog posts should be labelled based on how close (or distant) they are from the training group.

It’s also possible to give the algorithm a degree of freedom (referred to as clamping) so that it can relax the boundaries and reassign some unlabelled data to an adjacent category that is more appropriate.  I don’t know yet exactly how this works, but it will have something to do with the probability that document  would be a better fit with category a than category b.

I ran the algorithm twice with different clamping parameters, and you can see the results below.

alpha = 0.2, gamma = 20 alpha = 0.1, gamma = 20
Category No. of Posts Category No. of Posts Category No. of Posts
6 21 6 475 6 506
5 98 5 1915 5 1920
4 34 4 1013 4 1044
3 27 3 505 3 516
2 34 2 746 2 712
1 78 1 3132 1 3088
-1 7494 -1 0 -1 0

The first couple of columns are the set of posts with just my labelled training set. -1 represents the unlabelled data.  Thereafter you can see two sets of results, one with a clamping setting of 0.2 (alpha), the other slightly less flexible at 0.1.

alpha : float

Clamping factor. A value in [0, 1] that specifies the relative amount that an instance should adopt the information from its neighbors as opposed to its initial label. alpha=0 means keeping the initial label information; alpha=1 means replacing all initial information (scikit learn).

I’m still trying to find out exactly what the gamma parameter does.  I just went with the value given by all the scikit documentation I could find.

I then went through 50 randomly selected posts that had originally been unlabelled to see what category they had been allocated.   I changed 26 of them, although 10 of these were labelled with a new category which I’m just calling ‘other’ at the moment.  So, in summary, I changed 32% of the sample and added 10% of the sample to a new category.

I always knew from previous explorations of the text data that there would be posts that went into the ‘wrong’ category, but the degree of ‘wrong’ is only according to my personal assessment.  I could be ‘wrong’, and I have absolutely no doubt that others would disagree with how I’ve defined my categories and identified blog posts that ‘fit’, but that’s the joy / frustration of data science.  Context and interpretation are everything.

Coding Resources, or: Things I Wish I’d Known When I Started


As some of you know, I’m in the final year of my PhD in Web Science.  For whatever reason, I decided I’d learn whole load of new stuff from the ground up.  In my 50s.  With zero knowledge to start with except some very basic maths.  I needed to learn to write code, and although my MSc year included a module on writing code in Python, it did nothing more than get me familiar with what code actually looks like on the page.

I cried every Sunday night, prior to the workshop on Monday, because I just couldn’t see how to make things work.

Today, over two years on, I get it.  I can write it (although I still have to refer to a book or previous code I’ve written as a reminder) and my ability to think logically and has improved considerably.  During that time, I’ve amassed a range of books and URLs that have been, and still are, incredibly useful.  It’s time to share and provide myself with a post of curated resources at the same time.

First of all, you absolutely need a pencil (preferably with a rubber on the end), some coloured pens if you’re a bit creative, and plenty of A3 paper.  Initially, this is just for taking notes but I found then incredibly useful further along when I wanted to write the task that I needed my code to carry out, step by step.

Post-it notes – as many colours and sizes as you fancy.  Great for scribbling notes as you go, acting as bookmarks, and if you combine them with the coloured pens and A3 paper, you can make a flow chart.

Code Academy is a good place to start.  It takes you through the basics step by step, and helps you to both see what code looks like on screen, and how it should be written.  There are words that act as commands e.g. print, while, for etc.  that appear in different colours so you can see you’ve written something that’s going to do something, and you can see straight away that indents are important as they signal the order in which tasks are carried out (indents act like brackets in maths).

Just about every book that covers writing code includes a basic tutorial, but one that I bought and still keep referring back to is Automate The Boring Stuff With Python.  By the time you get here, you’ll be wanting to start writing your own code.  For that, I recommend you install Anaconda which will give you a suite of excellent tools.  Oh, and I use Python 3.6.
Resources1Once you’ve opened Anaconda, Spyder is the basic code editor.  I also use the Jupyter Notebook a lot.  I like it because it’s much easier to try out code bit by bot, so for example when I’m cleaning up some text data  and want to remove white space, or ‘new line’ commands, I can clear things one step at a time and see the results at the end of each one.  You can do the same using Spyder, but it isn’t as easy.

I’m going to list some books next, but before I do I should mention Futurelearn.  I have done several of the coding courses – current ones include ‘Data MiningWith WEKA’, ‘Advanced Data Mining With WEKA’ and ‘Learning To Code For Data Analysis’.  While these may not cover exactly what you have in mind to do (more on that in a minute), they will all familiarise you with gathering data, doing things with the data by writing code, and visualising the results.  They also help to get you thinking about the whole process.

I had a series of tasks I needed code to do for me.  In fact, I think the easiest way to learn how to write code is to have something in mind that you want it to do.  I needed to be able to gather text from blog posts and store it in a way that would make it easily accessible.  In fact, I needed to store the content of a blog post, the title of the post and the date it was published.  I later added the URL, as I discovered that for various reasons sometimes the title or the date (or both) were missing and that information is usually in the URL.  I then identified various other things I needed to do with the data, which led to identifying more things I needed to do with the data….. and so on.  This is where I find books so useful, so here’s a list:

  • Mining The Social Web, 2nd Edition.  The code examples given in this book are a little dated, and in fact rather than write the code line-by-line to do some things, you’d be better off employing what I’ll call for the sake of simplicity an app to do it for you.  It was the book that got me started, though, and I found the simple explanations for some of the things I needed to achieve very useful.
  • Data Science From Scratch.  I probably should have bought this book earlier, but it’s been invaluable for general information.
  • Python For Data Analysis, 2nd Edition.  Again, good for general stuff, especially how to use Pandas.  Imagine all the things you can do with an Excel spreadsheet, but once your sheet gets large, it becomes very difficult to navigate, and calculations can take forever.  Pandas can handle spreadsheet-style stuff with consummate ease and will only display what you want to see.  I love it.
  • Programming Collective Intelligence.  This book answered pretty much all the other questions I had, but also added a load more.  It takes you through all sorts of interesting algorithms and introduces things like building classifiers, but the main problem for me is that the examples draw on data that has already been supplied for you.  That’s great, but like so many other examples in all sorts of other books (and on the web, see below) that’s all fine until you want to use your own data.
  • This book began to answer the questions about how to gather your own data, and how to apply the models from the books cited above: Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from your Data.  This book has real-world examples which were relatively easy for me to adapt, as well as straightforward explanations as to how the code works.

Finally, some useful web sites.  The first represented a real break-through for me.  Not only did it present a real-world project from the ground up, but the man behind it, Brandon Rose (who also contributed to the last book in my list) is on Twitter and he answered a couple of questions from me when I couldn’t get his code to work with my data.  In fact, he re-wrote bots of my code for me, with explanations, which was incredibly helpful and got me started.  http://brandonrose.org/ is amazing.

This is the one and only video tutorial I’ve found useful.  Very useful, actually.  I find video tutorials impossible to learn anything from on the whole – you can’t beat a book for being able to go back, re-read, bookmark, write notes etc. – but this one was just what I needed to help me write my code to scrape blog posts, which are just web pages https://www.youtube.com/watch?v=BCJ4afDX4L4&t=34s.

https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/ and other blog posts.

https://www.naftaliharris.com/blog/visualizing-k-means-clustering/ does what it says, and more.

http://www.ritchieng.com/machine-learning-multinomial-naive-bayes-vectorization/ useful walk-through.


The URLs listed above are quite specific to the project I’ve been working on.  I’d also like to add Scikit-Learn which provided all the apps I’ve been using.  The explanations and documentation that is included on the site was less than helpful as it assumed a level of knowledge that was, and to a certain extent still is way above my head.  However, what it gave me was the language to use when I was searching for how to write a piece of code.  Stack Overflow is the best resource there is for this, and most of my bookmarks are links to various questions and  responses.  However, it did take me a while to a) learn what form of words would elicit an answer to my problem, and b) to understand the answers.  I even tried asking a question myself.  Never again.  Unless you’re a fully-fledged computer science geek (and if you were, you wouldn’t be here) it’s hostile territory.

Finally, an excellent site that has been useful again and again: DataVizTools.

Going back to Anaconda for a minute, when you’re feeling a bit more confident, have a look at the Orange application.  I’ve blogged about it several times, and blog on the site is an excellent source of information and example projects.  The help pages are excellent for all the basic apps, although some of the  newer ones don’t have anything yet.

And to finish, a site that I found, courtesy of Facebook, this very morning.  This site lets you see how your code works with a visualiser, something I found myself doing with pencil and paper when my code wasn’t doing what it should and I didn’t know why.

Developing Categories


An initial estimate of the possible number of categories in the 25% sample my nine-thousand-odd list of blog posts, provided by the Affinity Propagation (AP) algorithm, suggested over 100 categories.   Based on the words used in the posts it chose to put into a cluster, this was actually reasonable although way more than I can process.  It was also obvious that some of the categories could have been combined:  maths- and science-based topics often appeared together, for example.

A different method provided by an algorithm in Orange (k-means, allowing the algorithm to find a ‘fit’ of between 2 and 30 clusters) suggested three or four clusters.  How is it possible for algorithms, using the same data, to come up with such widely differing suggestions for clusters?  Well, it’s maths.  No doubt a mathematician could explain to me (and to you) in detail how the results were obtained, but for me all the explanation I need is that when you start converting words to numbers and use the results to decide which sets of numbers have greater similarity, you get a result that, while useful, completely disregards the nuances of language.

An initial attempt by me to review the content of the categories suggested by AP, but I had to give up after a few hours’ work.  I identified a good number of potential categories, including the ones suggested by the literature (see below), but I soon realised that it was going to be difficult to attribute some posts to a specific category.  A well-labelled training set is really important, even if it’s a small training set.  So, back to the research that has already been published, describing he reasons why teachers and other edu-professionals blog, and a chat with my supervisor, who made the observation that I needed to think about ‘process’ as opposed to ‘product’.

Bit of a lightbulb moment, then.  I’m not trying to develop a searchable database of every topic covered – I’m trying to provide a summary of the most important aspects of teaching discussed in blogs over a period of time.   The categories arising from the literature are clearly grounded in process, and so these are the ones I’ll use.  If you click on this link, you’ll be able to see the full version of the Analytical Framework, a snippet of which is pictured above.

As well as the main categories (the ones in the blue boxes), I decided to add two more: ‘behaviour’ and ‘assessment /  feedback / marking’ simply because these, in my judgement, are important enough topics to warrant categories of their own.  However, I’m aware that they overlap with all the others, and so I may revise my decision in the light of results.  What I’ll have to do is provide clear definitions of each category, linked with the terms associated with the relevant posts.

What will be interesting is exploring each category.  The ‘concordance‘ widget in Orange allows for some of the key terms to be entered, and to see how  they’re used in posts.  This will add depth to the analysis, and may even lead to an additional category or two if it appears, for example, that ‘Ofsted’ dominated blogs within the ‘professional concern’ category for a considerable period of time, an additional category would be justified.  My intention is to divide my data into sets by year (starting at 2004), although it may be prudent to sub-divide later years as the total number of blog posts increases year on year.

Clustering Blog Posts: Part 2 (Word Frequency)

One of the most important things to do when working with a lot of data is to reduce the dimensionality of that data as far as possible.  When the data you are working with is text, this is done by reducing the number of words used in the corpus without compromising the meaning of the text.

One of the most fascinating things about language was discovered by G K Zipf in 1935¹: that the most frequently used words in (the English) language are actually few in number, and obey a ‘power law’.   The most frequently used word occurs twice as often as the next most frequently word, three times as often as the third, and so on.  Zipf’s law forms a curve like this:
Zipf-CurveThe distribution seems to apply to languages other than English, and it’s been tested many times, including using the text of, for example, novels.  It seems we humans are very happy to come up with a rich and varied lexography, but then rely on just a few to communicate with each other.  This makes perfect sense as far as I can see: I say I live on a boat, gets the essentials across (a thing that floats, a bit of an alternative lifestyle, how cool am I? etc.) because were I to say I live on a lifeboat, I then have to explain that it’s like one of the fully-enclosed ones you see hanging from the side of cruise ships, not the open Titanic-style ones most people would imagine.

“For language in particular, any such account of the Zipf’s law provides a psychological theory about what must be occurring in the minds of language users. Is there a multiplicative stochastic process at play? Communicative optimization? Preferential reuse of certain forms?” (Piantadosi, 2014)

A recent paper by Piantadosi² reviewed some of the research on word frequency distributions, and concluded that, although Zipf’s law holds broadly true, there are other models that provide a more reliable picture of word frequency which depend on the corpus selected.  Referring to a paper by another researcher, he writes “Baayen finds, with a quantitative model comparison, that which model is best depends on which corpus is examined. For instance, the log-normal model is best for the text The Hound of the Baskervilles, but the Yule–Simon model is best for Alice in Wonderland.”

I’m not a mathematician, but that broadly translates as ‘there are different ways of calculating word frequency, you pays your money you takes your choice”.  Piantadosi then goes on to explain the problem with Zipf’s law: it doesn’t take account of the fact that some words may occur more frequently that others purely by chance, giving the illusion of an underlying structure where none may exist.  He then goes on to suggests a way to overcome this problem, which is to use two independent corpora, or split a corpora in half and then test word frequency distribution in each. He then tests a range of models, and concludes that the “…distribution in language is only near-Zipfian.” and concludes “Therefore, comparisons between simple models will inevitably be between alternatives that are both “wrong.” “.

Semantics also has a strong influence on word frequency.  Piantadosi cites a study³ that compared 17 languages across six language families and concluded that simple words are used with greater frequency in all of them, and result in a near-Zipfian model.  More importantly for my project, he notes that other studies indicate that word frequencies are domain-dependent.   Piantadosi’s paper is long and presents a very thorough review of research relating to Zipf’s law, but the main point is that it does exist, even though why it should be so is still unclear.  The next question is should the most frequently used words from a particular domain also be removed?

As I mentioned before, research has already established that it’s worth removing (at least as far as English is concerned) a selection of words.  Once that’s done, which are the most frequently used words in my data?  I used Orange is to split my data in half and generate three word clouds based on the same parameters, and observe the result.  Of course I’m not measuring the distribution of words, I’m just doing a basic word count and then displaying the results, but it’s a start.  First, here’s my workflow:


I’ve shuffled (randomised) my corpus, taken a training sample of 25%, and then split this again into two equal samples.  Each of these has been pre-processed using the following parameters:


Pre-processing parameters. I used the lemmatiser this time.

The stop word set is the extended set of ‘standard’ stop words used by Scikit that I referred to in my previous post, plus a few extra things to try and get rid of some of the rubbish that appears.

The word clouds for the full set, and each separate sample, look like this:


Complete data set (25% sample, 2316 rows)


50% of sample (1188 rows)


Remaining 50% of sample

The graph below plots the frequency with which the top 500 words occur.


So, I can conclude that based on word counts, each of my samples is similar to each other, and to the total (sampled) corpus.  This is good.

So, should I remove the most frequently used words, and if so, how many?  Taking the most frequently used words across each set, and calculating the average for each word, gives me a list as follows:


And if I take them out, the word cloud (based on the entire 25% set) looks like this:


Which leads me to think I should take ‘learning’ and ‘teaching’ out as well.  It’s also interesting that the word ‘pupil’ has cropped up here – I wonder how many teachers still talk about pupils rather than students?  Of course, this data set contains blogs that may be a few years old, and/or be written by bloggers who prefer the term.  Who knows?  In fact, Orange can tell me.  The ‘concordance’ widget, when connected to the bag of words, tells me that ‘pupil’ is used in 64 rows (blogs) and will show me a snippet of the sentence.


It’s actually used a total of 121 times, and looking at the context I’m not convinced it adds value in terms of helping me with my ultimate goal, which is clustering blog posts by topic, so it’s probably worth mentioning here that the words used the least often are going to be the most numerically relevant when it comes to grouping blogs by topic.


Could I take out some more?  This is a big question.  I don’t want to remove so many words that the data becomes difficult to cluster.  Think of this as searching the blog posts using key words, much as you would when you search Google.  Where, as a teacher, you might want to search ‘curriculum’, you might be more interesting in results that discuss ‘teaching (the) curriculum’  rather than those that cover ‘designing (the) curriculum’.  If ‘teaching’ has already been removed, how will you find what you’re looking for?  Alternatively, does it matter so long as search returns everything that contains the word ‘curriculum’?  You may be more interested in searching for ‘curriculum’ differentiating by key stage.  For my purposes, I think I’d be happy with a cluster labelled ‘curriculum’ that covered all aspects of the topic.  I’ll be able to judge when I see some actual clusters emerge, and have the chance to examine them more closely.  Which, incidentally, the concordance widget tells me is used in 93 blogs, and appears 147 times.  That’s more than ‘pupil’, but because of my specialised domain knowledge I judge to be more important to the corpus.

Which is also a good example of researcher bias.

  1. Zipf, G. K.; The Psychology of Language; 1966; The M.I.T. Press.
  2. Piantadosi, S.; Zipf’s word frequency law in natural language: A critical review and future Directions; Journal of National Institutes of Health; 2014; volume 21, October issue, pages 1112-1130.
  3. Calude, A., Pagel, M.;  How do we use language? Shared patterns in the frequency of word use across 17 world languages; 2011; Journal of Philosophical Transactions of the Royal Society of London. Series B, Biological sciences; volume 366, issue 1567, pages 1101-7.