
Guest Blog Post

Here’s a little thing I wrote for the Web Science Institute over at the University of Southampton….. http://blog.soton.ac.uk/wsi/ben-williamson-seminar-educational-research-web-science/

You’re welcome.


Clustering Blog Posts: Part 2 (Word Frequency)

One of the most important things to do when working with a lot of data is to reduce the dimensionality of that data as far as possible.  When the data you are working with is text, this is done by reducing the number of words used in the corpus without compromising the meaning of the text.

One of the most fascinating things about language was discovered by G. K. Zipf in 1935¹: the most frequently used words in (the English) language are actually few in number, and obey a ‘power law’. The most frequently used word occurs twice as often as the next most frequently used word, three times as often as the third, and so on. Zipf’s law forms a curve like this:
[Image: Zipf curve]

The distribution seems to apply to languages other than English, and it’s been tested many times, including on the text of novels. It seems we humans are very happy to come up with a rich and varied lexicon, but then rely on just a few words to communicate with each other. This makes perfect sense as far as I can see: saying I live on a boat gets the essentials across (a thing that floats, a bit of an alternative lifestyle, how cool am I? etc.), because were I to say I live on a lifeboat, I would then have to explain that it’s like one of the fully-enclosed ones you see hanging from the side of cruise ships, not the open Titanic-style ones most people would imagine.
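Just to make the shape of the law concrete, here’s a minimal sketch (my own, not part of the project code) that counts word frequencies in any plain-text file and compares them with what Zipf’s law predicts; the file name is just a placeholder.

```python
# Minimal sketch: compare observed word frequencies with the frequency
# Zipf's law predicts (most frequent word's count divided by rank).
from collections import Counter

def rank_frequency(text):
    counts = Counter(text.lower().split())
    # frequencies sorted from most to least common, paired with their rank
    return list(enumerate((c for _, c in counts.most_common()), start=1))

if __name__ == "__main__":
    with open("corpus.txt", encoding="utf-8") as f:   # placeholder file name
        pairs = rank_frequency(f.read())
    top = pairs[0][1]
    for rank, freq in pairs[:10]:
        print(f"rank {rank:2d}  observed {freq:6d}  Zipf prediction {top / rank:8.1f}")
```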

“For language in particular, any such account of the Zipf’s law provides a psychological theory about what must be occurring in the minds of language users. Is there a multiplicative stochastic process at play? Communicative optimization? Preferential reuse of certain forms?” (Piantadosi, 2014)

A recent paper by Piantadosi² reviewed some of the research on word frequency distributions, and concluded that, although Zipf’s law holds broadly true, there are other models that provide a more reliable picture of word frequency which depend on the corpus selected.  Referring to a paper by another researcher, he writes “Baayen finds, with a quantitative model comparison, that which model is best depends on which corpus is examined. For instance, the log-normal model is best for the text The Hound of the Baskervilles, but the Yule–Simon model is best for Alice in Wonderland.”

I’m not a mathematician, but that broadly translates as ‘there are different ways of calculating word frequency; you pays your money, you takes your choice’. Piantadosi then explains the problem with Zipf’s law: it doesn’t take account of the fact that some words may occur more frequently than others purely by chance, giving the illusion of an underlying structure where none may exist. He suggests a way to overcome this problem, which is to use two independent corpora, or to split one corpus in half and then test the word frequency distribution in each. He then tests a range of models, finds that the “…distribution in language is only near-Zipfian”, and concludes: “Therefore, comparisons between simple models will inevitably be between alternatives that are both ‘wrong.’”
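Here’s a rough sketch of that split-half idea (my own illustration of the approach as I understand it, not code from the paper): shuffle the documents, split them into two independent halves, count words in each, and check how well the two sets of frequencies agree.

```python
# Split the corpus in half and compare the word frequencies estimated
# from each half (Spearman rank correlation over the shared vocabulary).
import random
from collections import Counter
from scipy.stats import spearmanr

def split_half_agreement(documents, seed=42):
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    mid = len(docs) // 2
    count = lambda ds: Counter(w for d in ds for w in d.lower().split())
    a, b = count(docs[:mid]), count(docs[mid:])
    shared = sorted(set(a) & set(b))        # words that appear in both halves
    return spearmanr([a[w] for w in shared], [b[w] for w in shared]).correlation

# posts = [...]  # list of blog-post strings (placeholder)
# print(split_half_agreement(posts))
```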

Semantics also has a strong influence on word frequency. Piantadosi cites a study³ that compared 17 languages across six language families and concluded that simple words are used with greater frequency in all of them, resulting in a near-Zipfian model. More importantly for my project, he notes that other studies indicate that word frequencies are domain-dependent. Piantadosi’s paper is long and presents a very thorough review of research relating to Zipf’s law, but the main point is that the effect does exist, even though why it should is still unclear. The next question is: should the most frequently used words from a particular domain also be removed?

As I mentioned before, research has already established that it’s worth removing (at least as far as English is concerned) a selection of words. Once that’s done, which are the most frequently used words in my data? I used Orange to split my data in half, generate three word clouds based on the same parameters, and observe the result. Of course I’m not measuring the distribution of words, I’m just doing a basic word count and then displaying the results, but it’s a start. First, here’s my workflow:

[Image: Orange workflow for comparing word frequencies]

I’ve shuffled (randomised) my corpus, taken a training sample of 25%, and then split this again into two equal samples.  Each of these has been pre-processed using the following parameters:

[Screenshot: pre-processing parameters. I used the lemmatiser this time.]
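For anyone who prefers to see it as code, the sampling step is roughly equivalent to this pandas sketch (my approximation of what Orange is doing; the file and column layout are placeholders):

```python
# Shuffle, take a 25% sample, then split that sample into two halves.
import pandas as pd

df = pd.read_csv("blog_posts.csv")                  # placeholder file name
sample = df.sample(frac=0.25, random_state=1)       # shuffle + 25% sample
midpoint = len(sample) // 2
half_a, half_b = sample.iloc[:midpoint], sample.iloc[midpoint:]
print(len(sample), len(half_a), len(half_b))
```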

The stop word set is the extended set of ‘standard’ stop words used by Scikit that I referred to in my previous post, plus a few extra things to try and get rid of some of the rubbish that appears.

The word clouds for the full set, and each separate sample, look like this:

[Word cloud: complete data set (25% sample, 2316 rows)]

[Word cloud: 50% of sample (1188 rows)]

[Word cloud: remaining 50% of sample]

The graph below plots the frequency with which the top 500 words occur.

[Graph: frequency of the top 500 words in each set]

So, I can conclude that, based on word counts, my samples are similar to each other and to the total (sampled) corpus. This is good.

So, should I remove the most frequently used words, and if so, how many?  Taking the most frequently used words across each set, and calculating the average for each word, gives me a list as follows:

[Table: most frequently used words, averaged across the three sets]

And if I take them out, the word cloud (based on the entire 25% set) looks like this:

[Word cloud: 25% set with the most frequent words removed]

Which leads me to think I should take ‘learning’ and ‘teaching’ out as well.  It’s also interesting that the word ‘pupil’ has cropped up here – I wonder how many teachers still talk about pupils rather than students?  Of course, this data set contains blogs that may be a few years old, and/or be written by bloggers who prefer the term.  Who knows?  In fact, Orange can tell me.  The ‘concordance’ widget, when connected to the bag of words, tells me that ‘pupil’ is used in 64 rows (blogs) and will show me a snippet of the sentence.

[Screenshot: concordance widget results for ‘pupil’]
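Outside Orange, a quick-and-dirty version of the same check might look like this (a sketch of my own, not the widget’s code): count how many posts contain a word, how many times it appears in total, and grab a little context around the first occurrence.

```python
# Simple concordance-style lookup: document count, total occurrences,
# and a short snippet of context for a given word.
import re

def concordance(posts, word, window=40):
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    doc_hits, total, snippets = 0, 0, []
    for post in posts:
        matches = list(pattern.finditer(post))
        if matches:
            doc_hits += 1
            total += len(matches)
            first = matches[0]
            snippets.append(post[max(0, first.start() - window):first.end() + window])
    return doc_hits, total, snippets

# posts = [...]  # list of blog-post strings (placeholder)
# blogs, occurrences, examples = concordance(posts, "pupil")
```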

It’s actually used a total of 121 times and, looking at the context, I’m not convinced it adds value in terms of helping me with my ultimate goal, which is clustering blog posts by topic. It’s probably worth mentioning here that the words used the least often are going to be the most numerically relevant when it comes to grouping blogs by topic.
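One common way of making that concrete is inverse document frequency weighting, where a word that appears in fewer posts gets a bigger weight. This is just an illustration of the principle with made-up mini-documents (the workflow above uses a plain bag of words):

```python
# Rarer words get higher idf weights: 'behaviour' appears in one
# document and outweighs 'teaching', which appears in two.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "teaching the new curriculum to year seven",
    "more marking and more teaching",
    "a post about behaviour policy and detentions",
]
vec = TfidfVectorizer()
vec.fit(docs)
for word, idf in sorted(zip(vec.get_feature_names_out(), vec.idf_), key=lambda p: p[1]):
    print(f"{word:12s} idf = {idf:.2f}")
```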

[Word cloud: after removing further frequent words]

Could I take out some more? This is a big question. I don’t want to remove so many words that the data becomes difficult to cluster. Think of this as searching the blog posts using key words, much as you would when you search Google. Where, as a teacher, you might want to search ‘curriculum’, you might be more interested in results that discuss ‘teaching (the) curriculum’ rather than those that cover ‘designing (the) curriculum’. If ‘teaching’ has already been removed, how will you find what you’re looking for? Alternatively, does it matter so long as the search returns everything that contains the word ‘curriculum’? You may be more interested in searching for ‘curriculum’ differentiated by key stage. For my purposes, I think I’d be happy with a cluster labelled ‘curriculum’ that covered all aspects of the topic. I’ll be able to judge when I see some actual clusters emerge, and have the chance to examine them more closely. ‘Curriculum’, incidentally, is used in 93 blogs according to the concordance widget, and appears 147 times. That’s more than ‘pupil’, but because of my specialised domain knowledge I judge it to be more important to the corpus.

Which is also a good example of researcher bias.

  1. Zipf, G. K. (1966). The Psychology of Language. The M.I.T. Press. (Original work published 1935.)
  2. Piantadosi, S. T. (2014). Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5), 1112–1130.
  3. Calude, A. S., & Pagel, M. (2011). How do we use language? Shared patterns in the frequency of word use across 17 world languages. Philosophical Transactions of the Royal Society B: Biological Sciences, 366(1567), 1101–1107.

Clustering Blog Posts: Part 1

Today, while my laptop was employed scraping Edu-blog posts from 2011, I decided to play around with Orange.  This is one of the suite of tools offered by Anaconda which I use for all my programming needs.

The lovely thing about Orange is that it allows you to build a visual workflow,  while all the actual work in the form of the lines and lines of code is done behind the scenes.  You have to select some parameters, which can be tricky, but all the heavy lifting is done for you.

This was my workflow today, captured about halfway through, so by the time I’d finished there were a few more things there. Still, it’s enough to show you how it works.

[Image: Orange workflow]

My corpus is a sample of 9,262 blog posts gathered last year. Originally, there were over 11,000 posts, but they’ve been whittled down by virtue of having no content, having content that had been broken up across several rows in the spreadsheet, or being duplicates. I also deleted a few that simply weren’t appropriate, usually because they were written by educational consultants as a means to sell something tangible such as books or software, or because they were political in some way, such as blogs written for one of the teaching unions. What I’ve tried to do is identify blog URLs that contain posts by individuals, preferably but not exclusively teachers, with a professional interest in education and writing from an individual point of view. This hasn’t been easy, and I’m certain that when I have the full set of data (which will contain many tens of thousands of blog posts) some less than ideal ones will have crept in, but that’s one of the many drawbacks of dealing with BIG DATA: it’s simply too big to audit.

You may recall that the point of all this is to classify as much of the Edu-blogosphere as I possibly can – to see what Edu-professionals talk about, and to see if the topics they discuss change over time. Is there any correlation between, for example, Michael Gove being appointed Secretary of State for Education and posts discussing his influence? We’ll see. First of all, I have to try and cluster the posts into groups according to content. I’ve been doing this already, and developed a methodology. However, while I’m still gathering data, and labelling a set of ‘training data’ (of which more in a future blog post), I’ve been experimenting with a different set of tools.

So, here’s my first step using Orange.  Open the corpus, shuffle (randomise) the rows, and take a sample of 25% for further analysis, which equates to 2316 documents or rows.

[Screenshot: corpus widget]

Open the document, select the data you need, e.g. ‘Content’.  The other features can still be accessed.


[Screenshot: selecting the ‘Content’ feature]

The ‘corpus viewer’ icons along the top of the workflow shown above mean that I can see the corpus at each stage of the process.  This is a glimpse of the second one, after shuffling.  Double-clicking any of the icons brings up a view of the output, as well as the various options available for selection.

[Screenshot: corpus viewer after shuffling]

The next few steps will give me an insight into the data, and already there are a series of decisions to make. I’m only interested in pre-processing the content of each blog post. The text has to be broken up into separate words, or ‘tokens’, with punctuation removed, along with any URLs embedded in the text. In addition, other characters such as /?&* are also stripped out. So far, so straightforward, as this screen grab shows:

 

[Screenshot: pre-processing options]

Words can also be stemmed or lemmatised (see above).

“Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun.” – The Stanford Natural Language Processing Group
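To see the difference in practice, here’s a small sketch using NLTK’s snowball stemmer and WordNet lemmatiser (my own example, not part of the Orange workflow; it assumes the NLTK ‘wordnet’ data has been downloaded):

```python
# Stemming chops endings off crudely; lemmatisation returns a dictionary
# form, and can differ depending on the part of speech you ask for.
from nltk.stem import SnowballStemmer, WordNetLemmatizer

stemmer = SnowballStemmer("english")
lemmatiser = WordNetLemmatizer()

for word in ["studies", "teaching", "pupils", "saw"]:
    print(f"{word:10s} stem: {stemmer.stem(word):8s}"
          f" lemma (noun): {lemmatiser.lemmatize(word):10s}"
          f" lemma (verb): {lemmatiser.lemmatize(word, pos='v')}")
```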

If I lemmatise the text as is, I get this lovely word cloud. BTW, it’s just a coincidence that ‘example’ is in red.

[Word cloud: lemmatised text]

As you would expect from blogs written by teachers, words like ‘student’ and ‘teacher’ feature heavily.  If I use the snowball stemmer (which is basically an efficient algorithm for stemming, explained here) then the word cloud looks like this:

[Word cloud: stemmed text]

Both of these word clouds are also generated from the corpus after a set of words known as ‘stop words’ has been removed. These are words we use very frequently but which add little value to a text: words such as ‘the’, ‘is’, ‘at’ or ‘which’. There is no agreed standard list, although most algorithms use the one provided by the Natural Language Toolkit (NLTK). I’ve chosen to use the list provided by Scikit-Learn, a handy module providing lots of useful algorithms; their list is slightly longer. The use of stop words is well researched and recommended to reduce the number of unique tokens (words) in the data, otherwise referred to by computer scientists as dimensionality reduction. I also added some other nonsense that I noticed when I was preprocessing this data earlier in the year – phrases like ‘twitterfacebooklike’ – so in the end I created my own list combining the ‘standard’ words and the *rap, and copied them into a text file. This is referred to as ‘NY17stopwordsOrange.txt’ in the screenshot below.

[Screenshot: stop word file selection, showing ‘NY17stopwordsOrange.txt’]
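For anyone curious, a list like that can be put together in a couple of lines. The extra ‘junk’ words below are just examples of the kind of thing I mean, not the full contents of my file:

```python
# Combine scikit-learn's built-in English stop words with some extra
# junk tokens spotted in the data, and save them to a plain text file.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

extra_junk = {"twitterfacebooklike", "reblogged", "advertisements"}   # examples only
stop_words = sorted(ENGLISH_STOP_WORDS | extra_junk)

with open("NY17stopwordsOrange.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(stop_words))

print(len(stop_words), "stop words written")
```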

My next big question, though, is what happens if I add to the list some of the words that are most frequently used in my data set – words like ‘teach’, ‘student’, ‘pupil’, ‘year’? So I added these words to the list: student, school, teacher, work, year, use, pupil, time, teach, learn. This is the result, using the stemmer:

[Word cloud: stemmed text with the extended stop word list]

There is research to suggest that creating a bespoke list of stop words that is domain-specific is worth doing as a step before going on to try and classify a set of documents.  It’s the least-used words in the corpus that are arguably the most interesting and valuable.  I’ll explore this some more in the next post, along with the following steps in the workflow.

Back to the Classroom

A few weeks ago, I was asked if I’d be interested in running a workshop for year 12 students as part of the ESRC* Festival of Social Science. This was organised as part of the University of Southampton’s Learn with US (Outreach) programme, which I’d quite like to do more work with in the future. The theme of the workshops was how technology, and mobile phones and devices in particular, is being used in social science research. As part of my research, I’m looking at networks and network (or graph) theory, so I thought I could have a go at teaching that. I find networks fascinating, AND I knew I had some excellent resources that could be adapted for use with students, so why not?

I’m also really keen on promoting the idea that a) computer science is for women too, b) web science is an excellent way of combining the social sciences with computer science, and c) age is no barrier.  A teacher who was accompanying a group of students also told me that, as well as being a role model for girls, I was also showing students why being able to write code was so important as it could have a real practical benefit.

I really miss being in the classroom. Why will be the subject of another blog post, but suffice to say that, for me, there’s something exhilarating about putting things together (in this case, a PowerPoint and some handouts to guide students through some actual hands-on work) so that I can deliver knowledge in a way that I hope is interesting. I like being in charge, in my own space, directing my own personal show. It’s also a really good chance for me to consolidate my own learning, which is one of the benefits of teaching.

The students were, of course, excellent. They were made up of groups from several schools – one or two local, others from further afield. It was really interesting to observe how different the groups were from one another, which I assume reflects both the socio-economic background they were drawn from (almost certainly directly related to the catchment area of each school) and the ethos of the school itself. The interactions among the students, between them and their teachers, and with me were markedly different from session to session. Having only taught in one school before (and not really being detached enough to just observe), it was a fascinating experience for me. It was, though, overwhelmingly positive and I thoroughly enjoyed it!

I’m sure they left with a positive view of the University of Southampton, and I hope they were inspired by my workshop, and the others they attended.

By the way, the resources I used were borrowed and adapted from the ‘Power of Social Networks’ MOOC** that has just finished on Futurelearn.  It’ll be repeated though, if you fancy a dabble into the world of social networks.

*Economic & Social Research Council

**Massive Open Online Course

Rome

This time last week, I was probably on an aircraft waiting to fly back to Gatwick, following five days in Rome.  It seems like a long time ago now.

[Photo: He’s got wood.]

You really can’t take more than three steps in Rome without stumbling over some ancient ruins.  They’re everywhere.  Often, they’re just some pillars, supported by metal bands and standing among weeds and rubble, usually up against more contemporary buildings.  I don’t doubt that just a few feet beneath the pavements, so much more remains undiscovered.  The point is, you can see so much without paying a penny, like the Trevi Fountain, which is pretty much what we did.

I didn’t bother to do any specific research before I went. I watched Mary Beard’s series on BBC4 when it was broadcast. I wanted to just look, and take it all in. And it really is spectacular. My general photos are here. The river you can see is the Tiber – the Tiber! I don’t know why this excited me so much, but it did. I wish I’d kept up with Latin, though. Just the street names can tell you so much, but I might have been able to read some of the inscriptions and graffiti.

Anyway, our hotel was this one, which was central for everything and very comfortable. Mind you, it was a bit inconsistent. My room was right at the top, with my very own balcony, and very spacious for a single room. Wifi was pretty near impossible to get, though. My travelling companions had single rooms each, both of which were smaller than mine (although one had a queen(?) sized bed and a door that was incredibly difficult to open). The other was more like a cupboard and had a leaking bidet. And none of our rooms were in the hotel we originally booked, which was this one. For some reason, they’d made a mistake and had to move us. The Helvazia was more central, but further away from the Colosseum and the conference venue, which is the reason one of us was there in the first place.

[Photo: Even had a lime tree…]

In fact, the mistake with the hotel booking came at the end of a day that had started with a rail strike, meaning I had to get a taxi to the station because my train was cancelled. Then the aforementioned taxi hit a cyclist who had come tearing out from a park straight across the road (while wearing his earphones…). And then the train to Gatwick was cancelled as well, so we ended up getting another taxi (fifty quid each) to the airport because we didn’t want to take any more chances with public transport. Sigh.

I didn’t find Rome as expensive as I thought it would be, which probably says more about how prices have risen generally than anything else. We ate really well (apart from the last evening, which was OK but not up to the standard of previous choices). We ate here (my favourite, and by far the cheapest, especially with wine at 7 euros a litre) on the first evening; lunched here on Thursday (lovely, freshly cooked food but very uncomfortable seating if you have anything other than a small bottom); a Sicilian restaurant, Melo, on Thursday evening; and here on Friday evening. The Constanza was something special. Not only were the food and wine excellent, but the restaurant itself is partially in the remains of a Roman theatre. Saturday was a bit of a disappointment. We wanted to go to a place that made pizzas fresh and right in front of you, but unfortunately it was full, complete with a queue of people waiting for a table.

[Photo: The Tiber!]

Two spectacular places we did pay to visit were the Colosseum and Trajan’s Market. My Colosseum pictures are here. Trajan’s Market (photos here) was originally directly linked with the Colosseum. The magnificent horse sculptures you can see are modern, and part of a touring public exhibition, the Lapidarium. Given that I heard a tour guide say that more ‘exotic’ animals were dispatched for public entertainment (and probably by ‘accident’ in the chariot races) in the Colosseum than at any time in history, I thought they were a poignant reminder of how cruel human beings can be.

[Photo]

I would definitely go back to Rome again. I didn’t see the Sistine Chapel, or visit any of the art galleries. I’d do some more research as well, and visit something with a bit more knowledge under my belt. Oh, and I’d take several pairs of comfortable walking shoes and loads of pairs of socks. Walking around Rome is incredibly hard on your feet, paved as it is with small granite blocks if you’re lucky, crumbling concrete and tarmac if you aren’t. It’s the best way to get around, though, as nothing is especially far away and public transport looked packed and tricky to negotiate unless you speak some Italian. I wouldn’t like to be there in the summer. It was very warm even in October, and busy. I can only imagine how hot and crowded it must be in July and August.

What Big Data Can’t Tell You

I’ve spent what seems like months writing Python code that will let me download the content of blog posts.  You can  do this using what’s known as an RSS (Rich Site Summary) feed, but that only yields a summary of the most recent posts, when I need the whole post, and every post the blogger has written.  In some cases, this goes back years.  It’s been a painful process, and will be the subject of my next blog post, but just for a bit of ‘fun’ I thought I’d look at the comments feed instead.

A while ago, Tom Starkey (@tstarkey1212) asked on Twitter if there was any way of finding out which Edu-blogs might be the most popular.  One way of finding out might be to look at the number of comments made on posts, so I thought I’d use the RSS feed this time to download the latest ones and have a look.  I wrote some Python code, and bingo! there they all were in a nice tidy spreadsheet.  There are some issues, though.  Quite a few, actually.
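The core of it is surprisingly short. This is a hedged reconstruction of the general approach rather than my actual script (the feed URL is an example; WordPress blogs expose comments at /comments/feed/):

```python
# Pull the most recent comments from a WordPress comments feed and
# write the fields of interest to a CSV file.
import csv
import feedparser

feed = feedparser.parse("https://example.wordpress.com/comments/feed/")

with open("comments.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "author", "published", "link"])
    for entry in feed.entries:
        writer.writerow([
            entry.get("title", ""),      # usually "Comment on <post title> by <name>"
            entry.get("author", ""),
            entry.get("published", ""),
            entry.get("link", ""),
        ])
```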

  1. I have no idea if I have the http address of every Edu-blogger out there. My source was the list in a spreadsheet provided by Andrew Old. How complete it is depends on a) whether you’ve heard of Andrew, or b) whether you’ve heard of him but don’t want to add your blog to ‘his’ spreadsheet. Still, there are over 800 blogs on there, so it’s a big enough sample to be getting on with.
  2. The information I needed was the blog post title, the name of the commenter, the date the comment was made, the comment itself, and the http link to the comment. The link is important because it contains the title of the blog site. RSS feeds yield particular information as they’re kind of standardised. However, the blog title contained in the link – the subdomain, sarahhewittsblog in https://sarahhewittsblog.wordpress.com/, for example – isn’t always the actual title. Nor is it always in the same position after the //, so any attempt to automatically extract the title based on its position in the http address was difficult (see the sketch after this list). When you’ve got over 8000 rows in your spreadsheet, you so want to automate the process if you can. I chose not to, because….
  3. The name of the commenter might also be the title of the blog.  In fact, this was the case for quite a few posts, something that only becomes obvious when you slowly scroll through each of those 8000-plus rows.
  4. The name of the commenter should be the very last item in the field yielded by the RSS feed.  In theory, it should be easy to extract because it would come after a comma or possibly even a | symbol.  So, I could write some code that would iterate over every one of those 8000 rows and just extract the commenter’s name and put it in a separate column, right?  Wrong.  Some fields were truncated because they were too long.  Relying on commas to demarcate the right characters risked getting the wrong information.  Sometimes there was nothing more than a space.  In the end, I did it manually, copying and pasting.  That also helped me to identify names that related to the blog title and the name of the commenter, so I could match them up.
  5. Finally, the most obvious thing. Not everyone who reads a blog leaves a comment. In fact, I’m willing to bet most people don’t. And if they do respond, I bet they do it by either posting a link to the blog with a recommendation, or simply retweeting the link that brought the blog to their attention in the first place. The only way of knowing who reads a blog is in the hands of the blogger themselves via their stats pages, or possibly Google with their page link algorithm. Still, I think the real proof (in spite of what some bloggers have claimed) lies in those stats. And given I’ve been accessing some sites repeatedly in an effort to see if my code works, there may be some glitches there as well.
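Here’s the sketch I mentioned above: pulling the subdomain out of a link is trivial, which is exactly why it’s tempting to automate, but the result often isn’t the blog’s displayed title, which is why I matched them up by hand instead.

```python
# Extract the subdomain from a blog link; it may or may not match the
# blog's actual title, so treat it with caution.
from urllib.parse import urlparse

def subdomain(url):
    host = urlparse(url).netloc            # e.g. 'sarahhewittsblog.wordpress.com'
    return host.split(".")[0] if host else ""

print(subdomain("https://sarahhewittsblog.wordpress.com/2016/01/some-post/"))
# -> 'sarahhewittsblog'
```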

In spite of all this, I gathered my data and used NodeXL to produce a graph. Three, in fact. The basic one is here and is best viewed using a laptop or PC. I’ve made some notes based on the graph metrics (graph-notes) and there are two other versions here and here. Again, it’s best you view them using a laptop or a PC.
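If you’d rather build something similar in code than in NodeXL, the same kind of graph can be sketched with networkx (this is an illustration of the idea, not the graph I published): one edge from each commenter to the blog they commented on, weighted by how many comments they left.

```python
# Build a directed commenter -> blog graph from (commenter, blog) pairs.
import networkx as nx

rows = [("alice", "blog_a"), ("bob", "blog_a"), ("alice", "blog_b")]  # toy data

G = nx.DiGraph()
for commenter, blog in rows:
    if G.has_edge(commenter, blog):
        G[commenter][blog]["weight"] += 1
    else:
        G.add_edge(commenter, blog, weight=1)

# A blog's in-degree = number of distinct people who commented on it.
print(sorted(G.in_degree(), key=lambda pair: pair[1], reverse=True)[:5])
```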

Finally, if your blog isn’t on Andrew’s spreadsheet, and you want to see how it compares with everyone else’s (or you’d like me to include it in the data I’ll be using for my PhD), you can either add it yourself or let me know the address and I’ll add it to my own records. I intend to anonymise all my data before I publish it because I know how sensitive it is, even though it’s public (I’m an ex-teacher myself). Or you can send me your viewing stats because, after all, they paint the clearer picture.

The thing is, though, that while it’s easy to think your blog might be the one that’s influencing everyone and getting them ‘on your side’, knowing it and proving it are different things altogether.

Baby’s First Conference*

Nearly a month ago now, I attended my very first conference – WebSci16 – in Hanover, Germany.  I submitted a short paper, which was accepted as an extended abstract.  I was also invited to submit a poster for the poster session.

Now, I admit I was a little puzzled about posters when I first started my MSc, but I know now they’re just one of the things academics use to communicate their work. As such, they’re not necessarily works of art with lots of illustrations (although some of the best ones I’ve seen have been), but they need to be concise, clear, and not overly wordy. If you want to see mine, it’s here: Extended Abstract Hanover 16 Poster. And here’s my paper, if you’re really bored: Extended Abstract Final.

[Photo: Me and my poster.]

So anyway, a few thoughts.  Web Science is, basically, an inter-disciplinary thing.  It usually combines an ‘ology’ with computer science, but then there are other approaches like my colleague Nikko who is researching online identities from the perspective of Law; or another colleague who is researching event detection on social media.  Basically, if some aspect involves people and the internet, Web Science can fit in there somehow.  A lot of computer science people have also moved across to Web Science, which is great.  They work hard to develop and refine all manner of things including data analysis, machine learning (AI) and language processing.  The problem is that they can sometimes be so focused on the technical side that they forget the human aspect, and I think it’s that which really defines Web Science.

Many times during the course of listening to someone present their research, I wanted to ask why they thought their work was important, and what impact they thought it would make. Many of the questions focused on technical aspects of the paper, which told me there were a lot of computer scientists in the audience. These things are important, of course they are, but a little more thought about where humans fit in would have been better. In short, there were times when I was disengaged and a bit bored, but I was with some fab colleagues, the venue was great, and the food lovely. And I met my data science hero, Pete Burnap. Oh, and the organisers very kindly averted disaster when I realised that the posters I’d been carrying in my poster tube had fallen out of the bottom when we had to make a mad dash across Schiphol airport to catch the connecting flight to Hanover. I wouldn’t have been too upset about mine, but I was also carrying one for a colleague who had also been accepted for a paper and poster, but couldn’t attend. Both posters were reprinted without any fuss, so a massive thank you to them!

Here’s a link to my photo album if you want to browse. The posh-looking building in the park is the town hall, where we went for a formal dinner on the last evening.

Oh, and then Nikko and I went to Berlin for a couple of days…..

*phrase courtesy of Nick Bennett.