Category Archives: My Education

So, What DO teachers talk about?

So, having put the final piece of the coding jigsaw in place, here are the first results.  The diagram below represents a set of 7,786 blog posts gathered from blog URLs.  The earliest is from 2009, the latest from 2016.  They’re currently all lumped in together, although in the end the data set will be a) much, much larger, and b) broken down by year (and normalised so that a proper comparison can be made).

There are lots of things going on here: how I’ve defined the categories; how I initially categorised some posts to form a training set; and how the algorithms work and were applied to the data.  In spite of what some people will tell you, data science has all the appearances of giving nice, clear-cut answers when in fact the opposite – especially when dealing with text – is often true.

The journey to get here has been long and challenging.  Still, I’m happy.



A Research Trip to Singapore, Part 2


So, Singapore then.  First of all, it’s a really small country.  Have a look on Google maps.  It’s basically a city-state rather than a country.  It’s also clean, calm, and very green.  The pavements are pretty much spotless, everywhere is neat and tidy – even the H&M store in one of the big shopping malls was neat with all the clothes hanging on racks.  There’s a general air of sophistication – I didn’t see anyone dressed in baggy tracksuits, or trashy leggings.  Shoes, even trainers, were clean and new-looking.   Of course there were exceptions, but almost no-one looked down-at-heel.

Singapore is rich.  With Malaysia next door as a source of cheap labour, many people can afford cleaners and nannies.  Clearly, shopping is the number one pastime.  The malls are huge and stretch the length of the main road down through the city centre.


There’s a reason it’s green…

Singapore has worked hard to become what it is.  It went from poverty-stricken to one of the richest countries in a single generation, after gaining independence in 1965.  It’s rated very highly for education and healthcare, and is a major player in finance, foreign exchange, oil refining and trading, and is one  of the world’s busiest container ports.

It’s also reclaimed a lot of land.  The fabulous Gardens by the Bay are entirely built on reclaimed land – land that was drained and then left for ten years to dry out.

Singapore isn’t without its drawbacks, though.  The death penalty is still a punishment for some crimes.  Homosexuality is illegal.  You can’t chew gum in public, or smoke unless you’re within 3 metres of a designated smoking area.  Jaywalking is illegal.  And it’s the kind of state where the police will arrest you first and ask questions later.  On the flip side, the streets are clean and crime is extremely rare.  There are a couple of casinos, but Singaporeans have to sign in to use them, and if any member of their family is concerned about their gambling, they are denied entry.  People generally drive considerately, and I didn’t see a single dented or scratched vehicle.


The beautiful Gardens by the Bay.

Alcohol, especially wine, is expensive.  I’m told people pop across to Malaysia and buy it in bulk.  Food is also pricey, although you can buy anything from Spanish tapas to Singapore noodles.  We didn’t manage to visit the places where street food is sold (known as hawker centres), but we did find a Food Republic, which is basically a canteen-style arrangement of several independent food outlets where you can buy a variety of Asian dishes.  Kimchi with fried rice, an egg on top, a bowl of clear soup and two little dishes of something unidentifiable was the equivalent of £2.  The basements of the shopping malls offer a similar arrangement, although a little more expensive at around £5.  Everything is spotlessly clean.

FYI, if you’re a vegetarian like me, it’s harder than you might think to find things you can eat.  If tofu is on the menu, it’s worth asking if it can be substituted for the usual chicken or pork.  The trouble is, I’m not sure Singapore would recognise a green vegetable if it was jumping up and down holding a sign saying ‘I’m a vegetable. Eat me!’.  Salad is practically non-existent.  Indian food offers dhals, Korean food kimchi, but from what I could see practically every other dish includes meat or seafood, and is fried.  No wonder Singapore has a problem with diabetes.

It’s incredibly hot in January, with tropical downpours accompanied by full-on thunder and lightning more often than not.  I love a good storm.  I recommend taking an umbrella everywhere with you, and wearing sensible, waterproof footwear.  It’s far too hot for coats.   Public transport is cheap and plentiful, and as everything is in English it’s easy to find your way around.

The hotel we stayed in provided a free smart phone for guests to use, which included free and unlimited access to mobile data as well as free phone calls (including international calls).  This was so useful, although I learned on the very last day that you can register your own mobile phone for free public wifi across the city.  How cool is that?

My biggest disappointment was finding out that the Raffles Hotel was closed for renovation.  I was really looking forward to a Singapore Sling there.  I did have one somewhere else, but it just wasn’t the same.  Still, I managed to bring back a bottle of Bombay Star gin from the duty free shop (£24 for a litre!) so that went some way to making up for it.

I’m not going to post all my photographs here, but I’ve included a link so you can see the entire album.  So, to sum up: lovely country, lovely polite people, a bit expensive but there are ways of mitigating this.  Be mindful of the law, and you’ll have a great time.  Oh, and Levi’s are incredibly cheap.




A Research Trip to Singapore, Part 1

Some of you know that I spent last week in Singapore on a research trip, sponsored by the University of Southampton.  In this first post, I’m just going to focus on the work side, and save the experience of Singapore (together with lots of lovely photos) for the next one.

The Web Science CDT (Centre for Doctoral Training) usually runs a couple of research trips to other universities over the course of a year.  In 2015, I went to Tsinghua University in China as part of a group of PhD students.  Our picture is the main one I use on my web site, and it accurately conveys the general mood of the whole experience.  This time, I was only with two other students – Clare, who is doing her PhD with one foot in Education like me, and Jon, who is a bit of a maths genius and whose PhD is firmly rooted in AI.

The theme behind the invitation-only conference was ‘Wellness’.  Some students from the National University of Singapore (NUS), led by Dr Zhao-Yan Ming (Zoe), have developed a mobile phone app – DietLens – which invites you to photograph your plate of food, and will then tell you its nutritional content.  This doesn’t sound all that important, until you know that Singapore, in common with other Asian countries, has a serious problem with type 2 diabetes.  This isn’t visibly weight-related, but a genetic predisposition coupled with a diet high in fried food and sugar.  The government is facing a significant rise in the cost of treating people with diabetes, and something needs to be done to encourage people to change their eating habits.

We spent some time with Zoe and two of her students, going through the app and what it can do.  Not only can it ‘read’ the nutritional content of a plate of food with around 80% accuracy, but it can also estimate the portion size.  So far, there is a database of several hundred local foods, most of which I seem to recall come from restaurant fare, especially the food served at the hawker centres.  The database is being expanded with home-cooked food as well.

The app is a good example of what’s termed ‘deep learning’ in AI.  Every time food is photographed, the app prompts the user to identify it from a series of options, including an option to enter the recipe if the food has been made at home.  With each confirmed label, the algorithm behind the app learns more about food identification, improving its accuracy.
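To illustrate the shape of that label-and-learn loop, here’s a toy sketch of my own (not DietLens code – the real app uses a deep neural network, whereas this just counts feature/label co-occurrences): the app predicts, the user confirms or corrects, and the model is updated with the new label.

```python
from collections import Counter, defaultdict

# Toy 'model': counts how often each image feature co-occurs with each dish.
# A real system would update neural-network weights instead of counts.
feature_counts = defaultdict(Counter)

def predict(features):
    """Return the dish whose features best match, or None if unknown."""
    scores = Counter()
    for f in features:
        scores.update(feature_counts[f])
    best = scores.most_common(1)
    return best[0][0] if best else None

def learn(features, confirmed_label):
    """Update the model with a user-confirmed label."""
    for f in features:
        feature_counts[f][confirmed_label] += 1

# Simulated feedback loop: once one photo has been labelled,
# a second photo of similar food is recognised.
learn(["rice", "egg", "red-sauce"], "kimchi fried rice")
print(predict(["rice", "egg"]))  # -> kimchi fried rice
```

The point is simply that every confirmed label feeds back into the model, so accuracy improves with use.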

Of course, the best outcome would be for users to choose healthier food once they know how potentially unhealthy their existing choices are.  However, we know from extensive research in Behavioural Science that persuading people to change their habits is extremely difficult.  Just being shown evidence that their food is high in fat and sugar, and low in complex carbohydrates, isn’t enough.  Most people simply carry on doing what they’ve always done, even when the issue is health-related.  Think about how many times someone we know acknowledges that they really MUST give up smoking, but carries on regardless of all the warnings.  We also know from research that people are more likely to change if they have the support of a network of family and/or friends, or even a group of people they don’t know, e.g. Weight Watchers, with their weigh-ins and meetings.

So, having spent a morning looking at the app, we went away to discuss how we might add value to the app, or consider some wider issues.  We knew we’d been allocated 30 minutes to present our ideas on Thursday (we saw the app in action on Tuesday) and had to come up with some ideas fast.

We were able to present three research questions covering four projects:

  • To what extent does the perception of information in the DietLens app affect behavioural change?
  • Can small online social networks improve communication between members of real-life food sharing networks in order to encourage behavioural change in dietary habits?
  • How can we use [the] data to encourage users to make better food choices, and continue to do so?

The first project suggested ways of improving the user’s interaction with the app with the intention of retaining the user, although it would be deemed a success if at some point the user no longer needed the app because they had improved their eating habits.  The interface should be intuitive, easy to use, and take up minimal time.  Displaying nutritional values using a simplified ‘traffic light’ system was presented.

The second project proposed using a small social network to encourage users to change their eating behaviour.   Food consumption could be shared, and an ‘encourager’ could be identified.   Reward schemes could further encourage the user.  These enhancements also produce data, which can be used to evaluate the success (or otherwise) of the app.

The third and fourth projects focus on the use of this data.  ‘Nudge theory’ underpins the suggestions for encouraging long-term change in dietary habits.  Nudge theory isn’t new, but it’s gained popularity in the wake of a book published in 2008.


Even the UK Government has a Behavioural Insights Team, otherwise known as the ‘nudge unit’.   It’s been responsible for things like writing to people who have not paid their council tax, informing them that most of their neighbours have already done so, thereby exerting subtle pressure to conform with the perceived behaviour of the group.

The app could make use of this by generating messages of encouragement from within the app, and allowing others who have access to generate messages or ‘thumbs up’ signs.  If a group of users chose to use the app together, the app could tell the members of the group when one of them made a healthy choice or cooked a healthy meal.

Building on this, by inviting others to ‘share’ your food and see what you’re eating, a small social network is created.  Everyone could be part of a group trying to make better food choices, or one person could invite others to join them for encouragement and support.  A study carried out among a sample of Mexican and Hispanic people in the USA (all trying to lose weight) asked the simple question: who gave you the most encouragement?  It revealed that it was their children.  The DietLens app could ask the same question, perhaps at the end of each week, to establish whether the same holds true for Singaporeans.


Local nudges on the Singaporean underground.

Of course there are wider issues to consider, such as data privacy and ethics.  Furthermore, just because the app has been built and meets all the requirements doesn’t mean that people will use it, or even download it.   There must also be an accompanying advertising campaign, promotion in schools, and other marketing techniques that have been used successfully to promote campaigns like anti-smoking here in the UK, alongside a continuous analysis of the data.

I’m pleased to say that it looks as if the code for the app is going to be sent to us in Southampton so that we can train it on British food, especially the food we cook at home.  I’m especially looking forward to trying it out with my vegetarian recipes.  Given that I barely saw a vegetable in Singapore (other than kimchi, which was heavily spiced and fried with rice), I’m not sure how it will cope with anything green.  I should imagine broccoli will cause it to have a bit of a moment.

I hope to be able to add some photos of our presentation to this post, as my primary supervisor was in the audience, together with Dame Wendy Hall, to whom we owe our thanks for setting up the trip and inviting us along.  Watch this space!

Update: here’s a link to a video of our presentation.  It starts at 33:51.


Label Spreading

This week, I finally managed to get the last lines of code I needed written.  I wanted to apply the label spreading algorithm provided by scikit-learn, but the documentation provided is next to useless, even bearing in mind how much I’ve learned so far.  There are other ways of grouping data, but my approach from the start has always been to go with the most straightforward, tried-and-tested methods.  After all, my contribution isn’t about optimising document classification, but about the results of document classification, which will reveal what pretty much everyone in one community who writes a blog has been writing about.

The label spreading algorithm works by representing each document as a point in space, and then finding which other points lie closer to it than to, say, some other document elsewhere.  I gave the algorithm a set of documents that I’d already decided should be close to each other, in the form of a training set of blog posts allocated to one of six categories.  The algorithm can then work out how the rest of the unlabelled blog posts should be labelled, based on how close they are to (or how distant from) the training group.

It’s also possible to give the algorithm a degree of freedom (referred to as clamping) so that it can relax the boundaries and reassign some unlabelled data to an adjacent category that is more appropriate.  I don’t know yet exactly how this works, but it will have something to do with the probability that a document would be a better fit with category a than with category b.

I ran the algorithm twice with different clamping parameters, and you can see the results below.

Category   Initial labels   alpha = 0.2, gamma = 20   alpha = 0.1, gamma = 20
   6              21                  475                       506
   5              98                 1915                      1920
   4              34                 1013                      1044
   3              27                  505                       516
   2              34                  746                       712
   1              78                 3132                      3088
  -1            7494                    0                         0

The first pair of columns shows the set of posts with just my labelled training set; -1 represents the unlabelled data.  Thereafter you can see two sets of results, one with a clamping setting (alpha) of 0.2, the other slightly less flexible at 0.1.

alpha : float

Clamping factor. A value in [0, 1] that specifies the relative amount that an instance should adopt the information from its neighbors as opposed to its initial label. alpha=0 means keeping the initial label information; alpha=1 means replacing all initial information (scikit-learn documentation).

I’m still trying to find out exactly what the gamma parameter does.  I just went with the value given in all the scikit-learn documentation I could find.
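For what it’s worth, here’s a minimal runnable sketch of the same setup on toy data of my own (not the blog corpus): scikit-learn’s LabelSpreading uses -1 for unlabelled points, alpha as the clamping factor, and gamma as the width of the RBF kernel it uses to build the similarity graph between points.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Two tight clusters of points; only one point in each is labelled.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]])
y = np.array([0, -1, -1, 1, -1, -1])  # -1 = unlabelled

# alpha is the clamping factor; gamma controls the RBF similarity kernel.
model = LabelSpreading(kernel="rbf", gamma=20, alpha=0.2)
model.fit(X, y)

print(model.transduction_)  # labels spread to the unlabelled points
```

With documents, each row of X would be a vector representation of a post rather than a 2-D point, but the mechanics are the same.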

I then went through 50 randomly selected posts that had originally been unlabelled to see what category they had been allocated.  I changed 26 of them, although 10 of these were given a new category which I’m just calling ‘other’ at the moment.  So, in summary, I changed 32% of the sample and added a further 20% of the sample to a new category.

I always knew from previous explorations of the text data that there would be posts that went into the ‘wrong’ category, but the degree of ‘wrong’ is only according to my personal assessment.  I could be ‘wrong’, and I have absolutely no doubt that others would disagree with how I’ve defined my categories and identified blog posts that ‘fit’, but that’s the joy / frustration of data science.  Context and interpretation are everything.

Writing Retreat: Cumberland Lodge


This isn’t even the front.

I spent all of last weekend, from Friday afternoon through to Monday lunch time, at the magnificent Cumberland Lodge.  It really is the most beautiful building, once occupied by the Ranger of Windsor Great Park, a grace-and-favour appointment that’s been held by some well-known names from English aristocracy.  And just to remind you who the boss is, there’s a clear view from a spot in the grounds all the way to Windsor Castle, nine miles away.  In 1947, the lodge was given to an education foundation established by Amy Buller.  Click on the link and read about the book she authored, Darkness Over Germany.  Strangely, the only available Wikipedia link is to the German site.

When we arrived on Friday afternoon, the fires were lit, and it was like something straight out of Gosford Park.  Someone, who shall remain nameless, described the accommodation as ‘nursing home chic’.  Here’s a picture of my room. You decide.  It did get bloody cold at night, though (single glazed windows, high ceilings, what can you expect?) which of course wouldn’t be tolerated in most nursing homes.


My room.

There were eight of us students, accompanied by Professor Susan Halford and Dr Mark Weal.  The focus, of course, was writing.  For most of us, that was writing a chapter or section of our PhDs, although a couple of people were writing a paper for publication.  We were free to write anywhere, but most people chose to stay sitting around the huge table that was provided for us in a library, although on Sunday afternoon and Monday I chose to go down to one of the drawing rooms and stretch out on a sofa.

We also had two mentors with us for part of the stay – ex-students who had completed their PhDs – to answer any of our questions and talk to us about the last stages of qualifying, including the dreaded viva.


The view from the library.

The final attraction was a visit from Jen McCall, representing Emerald Publishing, to talk about converting our PhDs into books or monographs, which isn’t as easy as it sounds!  Nevertheless, I think many PhDs could be re-written successfully for a broader audience, and in spite of the work (think another year at least of re-writing and adding new material) I really feel as if I could do this with mine.  Time will tell….


The drawing room.

The real benefit of all this, though, was the opportunity to write in a quiet, relaxed atmosphere, away from all the usual distractions (dogs, children, washing up, take your pick).   I forgot to make a note of how many words my chapter started out with, and so couldn’t tell you how much I wrote.  It was a lot, though.  And I made several pages worth of notes as I went along; notes to check things, find things, do extra things…. I find that whenever I start writing I think of lots of other things as I’m getting the words down on the page.  I’m hoping this means I’ve done enough preparation that my mind is free to wander and doesn’t have to pay too much attention to writing any more.

It was also a benefit beyond words to have Susan and Mark with us.  As any PhD student knows, getting face time with your supervisor when you most need it is practically impossible.  They’re busy people.  Also, by the time you do get to see them, the problem has either been resolved, or there is now a whole set of problems that need addressing, few of which you’ll have time to cover in your meeting.  The fact that they were there, happy to talk through whatever was an issue, was priceless.  Thank you.

As a result of this weekend, I’ve booked myself in to as many other writing retreats organised by the Digital Economy Network as I can.  They’re an organisation funded by Research Councils UK to support postgrad and research students; all the retreats run over two days, and are free.  As well as the chance to write in a purposeful atmosphere, they’re held in different locations across the UK, so there’s the added bonus of seeing somewhere new.

I’m going to miss this in September.






Coding Resources, or: Things I Wish I’d Known When I Started


As some of you know, I’m in the final year of my PhD in Web Science.  For whatever reason, I decided I’d learn a whole load of new stuff from the ground up.  In my 50s.  With zero knowledge to start with except some very basic maths.  I needed to learn to write code, and although my MSc year included a module on writing code in Python, it did nothing more than get me familiar with what code actually looks like on the page.

I cried every Sunday night, prior to the workshop on Monday, because I just couldn’t see how to make things work.

Today, over two years on, I get it.  I can write it (although I still have to refer to a book or previous code I’ve written as a reminder), and my ability to think logically has improved considerably.  During that time, I’ve amassed a range of books and URLs that have been, and still are, incredibly useful.  It’s time to share, and to provide myself with a post of curated resources at the same time.

First of all, you absolutely need a pencil (preferably with a rubber on the end), some coloured pens if you’re a bit creative, and plenty of A3 paper.  Initially, this is just for taking notes, but I found them incredibly useful further along when I wanted to write out the task I needed my code to carry out, step by step.

Post-it notes – as many colours and sizes as you fancy.  Great for scribbling notes as you go, acting as bookmarks, and if you combine them with the coloured pens and A3 paper, you can make a flow chart.

Codecademy is a good place to start.  It takes you through the basics step by step, and helps you both to see what code looks like on screen and to learn how it should be written.  There are words that act as commands, e.g. print, while, for etc., which appear in different colours so you can see you’ve written something that’s going to do something, and you can see straight away that indents are important as they show which statements belong together (indents act a bit like brackets in maths).
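Here’s a tiny example of what those indents do in Python: the indented lines ‘belong to’ the for statement, and the doubly indented line belongs to the if, just as bracketed terms belong together in maths.

```python
# The indented block runs once per item in range(10);
# the doubly indented line runs only when the if condition holds.
squares_of_evens = []
for n in range(10):
    if n % 2 == 0:
        squares_of_evens.append(n * n)

print(squares_of_evens)  # -> [0, 4, 16, 36, 64]
```

Move the append one indent to the left and it would run for every number, even or odd – which is why Python is so fussy about indentation.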

Just about every book that covers writing code includes a basic tutorial, but one that I bought and still keep referring back to is Automate The Boring Stuff With Python.  By the time you get here, you’ll be wanting to start writing your own code.  For that, I recommend you install Anaconda which will give you a suite of excellent tools.  Oh, and I use Python 3.6.
Once you’ve opened Anaconda, Spyder is the basic code editor.  I also use the Jupyter Notebook a lot.  I like it because it’s much easier to try out code bit by bit, so for example when I’m cleaning up some text data and want to remove white space or ‘new line’ characters, I can clear things one step at a time and see the results at the end of each step.  You can do the same using Spyder, but it isn’t as easy.
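As an example of that kind of step-by-step clean-up (a generic sketch, not my actual pipeline), each step below is the sort of thing I’d run in its own notebook cell, inspecting the result before moving on:

```python
raw = "  This is   a blog post.\n\nIt has\tstray   whitespace. "

# Step 1: replace newlines and tabs with ordinary spaces
step1 = raw.replace("\n", " ").replace("\t", " ")

# Step 2: collapse runs of spaces and trim the ends
step2 = " ".join(step1.split())

print(step2)  # -> This is a blog post. It has stray whitespace.
```

In a notebook, each step sits in its own cell, so if step 2 mangles something you can see exactly where it went wrong.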

I’m going to list some books next, but before I do I should mention FutureLearn.  I have done several of the coding courses – current ones include ‘Data Mining With WEKA’, ‘Advanced Data Mining With WEKA’ and ‘Learning To Code For Data Analysis’.  While these may not cover exactly what you have in mind to do (more on that in a minute), they will all familiarise you with gathering data, doing things with the data by writing code, and visualising the results.  They also help to get you thinking about the whole process.

I had a series of tasks I needed code to do for me.  In fact, I think the easiest way to learn how to write code is to have something in mind that you want it to do.  I needed to be able to gather text from blog posts and store it in a way that would make it easily accessible.  Specifically, I needed to store the content of a blog post, the title of the post and the date it was published.  I later added the URL, as I discovered that for various reasons sometimes the title or the date (or both) were missing, and that information is usually in the URL.  I then identified various other things I needed to do with the data, which led to identifying more things I needed to do with the data… and so on.  This is where I find books so useful, so here’s a list:

  • Mining The Social Web, 2nd Edition.  The code examples given in this book are a little dated, and in fact rather than write the code line-by-line to do some things, you’d be better off employing what I’ll call for the sake of simplicity an app to do it for you.  It was the book that got me started, though, and I found the simple explanations for some of the things I needed to achieve very useful.
  • Data Science From Scratch.  I probably should have bought this book earlier, but it’s been invaluable for general information.
  • Python For Data Analysis, 2nd Edition.  Again, good for general stuff, especially how to use Pandas.  Imagine all the things you can do with an Excel spreadsheet, but once your sheet gets large, it becomes very difficult to navigate, and calculations can take forever.  Pandas can handle spreadsheet-style stuff with consummate ease and will only display what you want to see.  I love it.
  • Programming Collective Intelligence.  This book answered pretty much all the other questions I had, but also added a load more.  It takes you through all sorts of interesting algorithms and introduces things like building classifiers, but the main problem for me is that the examples draw on data that has already been supplied for you.  That’s great, but like so many other examples in all sorts of other books (and on the web, see below) that’s all fine until you want to use your own data.
  • This book began to answer the questions about how to gather your own data, and how to apply the models from the books cited above: Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from your Data.  This book has real-world examples which were relatively easy for me to adapt, as well as straightforward explanations as to how the code works.
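Going back to the gathering task (content, title, date, URL), here’s a minimal sketch using only the standard library.  The ‘entry-content’ class name is an assumption – it’s common in WordPress themes, but a real scraper has to cope with whatever markup each blog actually uses – and the hard-coded page and URL are stand-ins for fetched ones.

```python
from html.parser import HTMLParser

class PostParser(HTMLParser):
    """Pulls the title and the body text out of a blog post's HTML."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.in_body = False
        self.title = ""
        self.body = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "div" and ("class", "entry-content") in attrs:
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag == "div":
            self.in_body = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif self.in_body:
            self.body.append(data.strip())

# A hard-coded page stands in for one fetched from a blog URL.
page = ('<html><head><title>My Post</title></head><body>'
        '<div class="entry-content"><p>Hello, blog readers.</p></div>'
        '</body></html>')

parser = PostParser()
parser.feed(page)
record = {
    "title": parser.title,
    "content": " ".join(t for t in parser.body if t),
    "url": "https://example.com/2016/05/my-post",  # the date often lives here
}
print(record["title"])    # -> My Post
print(record["content"])  # -> Hello, blog readers.
```

Storing one dictionary like this per post gives you exactly the fields I mentioned above, ready to drop into a Pandas DataFrame.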

Finally, some useful web sites.  The first represented a real breakthrough for me.  Not only did it present a real-world project from the ground up, but the man behind it, Brandon Rose (who also contributed to the last book in my list), is on Twitter, and he answered a couple of questions from me when I couldn’t get his code to work with my data.  In fact, he re-wrote bits of my code for me, with explanations, which was incredibly helpful and got me started.

This is the one and only video tutorial I’ve found useful.  Very useful, actually.  I find video tutorials impossible to learn anything from on the whole – you can’t beat a book for being able to go back, re-read, bookmark, write notes etc. – but this one was just what I needed to help me write my code to scrape blog posts, which are, after all, just web pages.

The URLs listed above are quite specific to the project I’ve been working on.  I’d also like to add scikit-learn, which provided all the apps I’ve been using.  The explanations and documentation included on the site were less than helpful, as they assumed a level of knowledge that was, and to a certain extent still is, way above my head.  However, what it gave me was the language to use when I was searching for how to write a piece of code.  Stack Overflow is the best resource there is for this, and most of my bookmarks are links to various questions and responses.  However, it did take me a while to a) learn what form of words would elicit an answer to my problem, and b) understand the answers.  I even tried asking a question myself.  Never again.  Unless you’re a fully-fledged computer science geek (and if you were, you wouldn’t be here) it’s hostile territory.

Finally, an excellent site that has been useful again and again: DataVizTools.

Going back to Anaconda for a minute: when you’re feeling a bit more confident, have a look at the Orange application.  I’ve blogged about it several times, and the blog on the Orange site is an excellent source of information and example projects.  The help pages are excellent for all the basic apps, although some of the newer ones don’t have anything yet.

And to finish, a site that I found, courtesy of Facebook, this very morning.  This site lets you see how your code works with a visualiser, something I found myself doing with pencil and paper when my code wasn’t doing what it should and I didn’t know why.

Developing Categories, Part 2

So, while I deploy my bespoke Python code to scrape the contents of umpteen WordPress and Blogger blogs, I’ve continued trying to classify blogs from my sample according to the categories I outlined in my previous post.

I say ‘trying’ because it’s not as straightforward as it seems.  Some blogs clearly don’t fit into any of the categories – e.g. where a blogger has simply written about their holiday, or, in one case, written a series of posts explaining various aspects of science or physics.  I reckon this is a science teacher writing for the benefit of his or her students, but as the posts are sharing ‘knowledge’ rather than ‘resources’, I can’t classify them.  Fortunately the label propagation algorithm I will eventually be using allows for new categories to be instigated (or the ‘boundaries’ of existing categories to be softened), so it shouldn’t be a problem.

‘Soapboxing’, ‘professional concern’ and ‘positioning’ have also caused me to think carefully about my definitions.  ‘Soapboxing’ I’m counting as all posts that express an opinion in a strident, one-sided way, with a strong sense that the writer is venting frustration, and perhaps with a call to action.  These tend to be short posts, probably written because the blogger simply needs to get something off their chest and (possibly, presumably) get some support from others via the comments.  ‘Professional concern’, then, is also a post expressing a view or concern, but the language will be more measured.  Perhaps evidence from research or other bloggers will be cited, and the post will generally be longer.  The blogger may identify themselves as a teacher of some experience, or perhaps a head of department or other school leader.  As with ‘soapboxing’, a point of view will be expressed, but the call to action will be absent.

‘Positioning’ is a blog post that expresses a belief or method that the blogger holds to be valid above others, and expresses this as a series of statements.  Evidence to support the statements will be present, generally in the form of books or published research by educational theorists or other leading experts in the field of education.

Of course, having made some decisions regarding which blogs fit into these categories, I need to go back through some specific examples and try to identify the specific words or phrases that exemplify my decision.  And I fully expect other people to disagree with me, and to be able to articulate excellent reasons why blog A is an example of ‘positioning’ rather than ‘professional concern’.  All I can say in response is that, while it’s possible to get a group of humans to agree around 75% of the time, it’s impossible to get them to agree 100% of the time, and that’s the joy and the curse of this kind of research.

Given more time, I’d choose some edu-people from Twitter and ask them to categorise a sample of blogs to verify (or otherwise) my decisions, but as I don’t have that luxury, the best I can do is make my definitions as clear as possible and provide a range of examples as justification.

The other categories that aren’t proving straightforward are ‘feedback, assessment and marking’ (‘FAM’) and ‘behaviour’.  I knew this might be the case, though, so I’m keeping an open mind about these.  I have seen examples of blogs discussing ‘behaviour’ that I’ve put into one of the three categories mentioned above, but that’s because those posts don’t discuss ‘behaviour’ exclusively.

Anyway, I’ve categorised 284 (out of a total of 7,788) posts so far so I thought I’d have a bit of a look at the data.


I used Orange again to get a bit more insight into my data.  Just looking at the top flow, after opening the corpus I selected the rows that had something entered in the ‘group’ column I created.


Selecting rows.


Creating classes.

I then created a class for each group name.  This additional information can be saved, and I’ve dragged the ‘save data’ icon onto the workspace, but I’ve chosen not to save it automatically for now.  If you do, and you give it a file name, every time you open Orange the file will be overwritten, which you may not want.  Then, I pre-processed the 284 blogs using the snowball stemmer, and decided I’d have a look at how just the sample might be clustered using k-means.

“Since it effectively provides a ‘suffix STRIPPER GRAMmar’, I had toyed with the idea of calling it ‘strippergram’, but good sense has prevailed, and so it is ‘Snowball’ named as a tribute to SNOBOL, the excellent string handling language of Messrs Farber, Griswold, Poage and Polonsky from the 1960s.”

Martin Porter
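To give a feel for what a stemmer actually does, here’s a toy suffix-stripper in Python.  To be clear, this is not the real Snowball algorithm (which has many more rules and special cases, and which Orange calls for you); `toy_stem` and its suffix list are invented purely to illustrate the idea of collapsing inflected forms onto a shared token.

```python
# A toy suffix-stripper, only to illustrate what stemming does.
# The real Snowball (Porter2) stemmer has far more rules; in practice
# you'd use a library implementation rather than anything like this.

SUFFIXES = ["ings", "ing", "edly", "ed", "es", "s"]  # longest first

def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip if a reasonable stem (3+ letters) would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["marking", "marked", "marks", "behaviour"]
print([toy_stem(w) for w in words])  # ['mark', 'mark', 'mark', 'behaviour']
```

The point is that ‘marking’, ‘marked’ and ‘marks’ all become the single token ‘mark’, so the word-count step that follows treats them as the same word.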

I’m not sure if I’ve explained k-means before, but here’s a nice link that explains it well.

“Clustering is a technique for finding similarity groups in data, called clusters. It attempts to group individuals in a population together by similarity, but not driven by a specific purpose.”

The data points are generated from the words in the blogs.  These have been reduced to tokens by the stemmer, then a count is made of the number of times each word is used in a post.  The count is subsequently adjusted to take account of the length of the document, so that a word used three times in a document of 50 words is not given undue weight compared with the same word used three times in a document of 500.  So, each document generates a score for each word used, with a zero for any word that appears in other documents but not in this one.  Mathematical things happen and the algorithm converts each document into a data point in a graph like the ones in the link.  K-means then clusters the documents according to how similar they are.
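The counting-and-clustering steps can be sketched in a few lines of plain Python.  Everything here is invented for illustration: the three ‘documents’ are made-up stemmed snippets, and this bare-bones k-means stands in for the far more robust implementation Orange actually uses.

```python
import math
import random
from collections import Counter

# Three hypothetical, already-stemmed blog snippets (invented examples).
docs = [
    "mark book feedback mark polici",
    "behaviour detent behaviour rule",
    "mark feedback assess mark",
]

# Shared vocabulary across all documents.
vocab = sorted({w for d in docs for w in d.split()})

def tf_vector(doc):
    """Term frequency normalised by document length, so a long post
    doesn't outweigh a short one just by repeating words more often."""
    words = doc.split()
    counts = Counter(words)
    return [counts[w] / len(words) for w in vocab]

vectors = [tf_vector(d) for d in docs]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to its nearest centroid,
    recompute centroids as cluster means, repeat."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return [min(range(k), key=lambda i: distance(p, centroids[i])) for p in points]

labels = kmeans(vectors, k=2)
```

With these toy inputs, the two ‘marking’ posts (documents 0 and 2) end up in one cluster and the ‘behaviour’ post in the other, purely on the strength of the word counts.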

I already know I have 8 classes, so that’s the number of clusters I’m looking for.  If I deploy the algorithm, I can see the result on a silhouette plot (the matching icon, top far right of the flow diagram above).  The closer a post’s score is to 0, the more likely it is to sit on the border between two clusters.  When I set the silhouette plot to group the posts by cluster, it’s clear that ‘resources’ has a few blogs that are borderline.
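The silhouette score behind that plot is simple to compute by hand: for each point, take a, the mean distance to the other members of its own cluster, and b, the mean distance to the nearest other cluster, and the score is (b − a) / max(a, b).  Here’s a sketch with invented one-dimensional ‘documents’ (the real scores are computed over the word-count vectors, of course):

```python
def silhouette(point, own_cluster, other_clusters, dist):
    """Silhouette coefficient for one point.
    a = mean distance to the other members of its own cluster.
    b = mean distance to the nearest neighbouring cluster.
    Near 1 -> well placed; near 0 -> borderline; negative -> probably
    in the wrong cluster."""
    others = [p for p in own_cluster if p != point]
    a = sum(dist(point, p) for p in others) / len(others)
    b = min(sum(dist(point, p) for p in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

dist = lambda x, y: abs(x - y)

# Invented 1-D points: one tight cluster, plus a straggler at 2.0
# that really sits between the two clusters.
cluster_a = [1.0, 1.2, 1.1]
cluster_b = [5.0, 5.2, 2.0]

print(round(silhouette(1.0, cluster_a, [cluster_b], dist), 3))  # 0.951
print(round(silhouette(2.0, cluster_b, [cluster_a], dist), 3))  # -0.71
```

The tightly clustered point scores close to 1, while the straggler scores below 0 because it’s actually nearer the other cluster, which is exactly the kind of borderline post the plot flags up.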

‘FAM’ and ‘behaviour’ are more clearly demarcated.  If I let the algorithm choose the optimal number of clusters (Orange allows between 2 and 30), the result is 6, although 8 scores 0.708, which is reasonable (the closer the score is to 1, the higher the probability that the suggested number is the ‘best fit’ for the total number of clusters within the data set).
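Choosing the number of clusters this way just means computing an average silhouette score for each candidate k and taking the best one.  A sketch of the idea, with the caveat that every score below except the 0.708 for k = 8 mentioned above is invented for illustration:

```python
# Hypothetical average silhouette scores per candidate k.
# Only the k = 8 value (0.708) comes from the run described above;
# the rest are made up to illustrate the selection step.
avg_silhouette = {2: 0.41, 3: 0.47, 4: 0.55, 5: 0.63,
                  6: 0.78, 7: 0.72, 8: 0.708, 9: 0.61}

best_k = max(avg_silhouette, key=avg_silhouette.get)
print(best_k)  # 6
```

In effect Orange re-runs k-means for each k in its allowed range and reports the k with the highest average silhouette, which is how it arrived at 6 here.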

As you can see from the screenshot below, cluster 4 is made up of posts from nearly all the groups.  Remember, though, that this algorithm takes absolutely no notice of my categories, or of the actual words as words that convey meaning.  It’s just doing what it does based on the numbers, and providing me with a bit of an insight into my data.