
Topic Modelling. It’s Hell.

Now I’m on the last chapter of my PhD (ok, so the other three still have some things I have to add, but hey-ho), it’s time to face up to the challenge of topic modelling.

The good thing about topic modelling is that it's a clustering problem: it assumes you know nothing about your data, and wants to find out what people have been writing about.  More precisely, topic modelling here means applying the Latent Dirichlet Allocation (LDA) algorithm, which assumes that if you know something about the words in the corpus, you can work out what the topics are.  It also assumes that every document belongs to every topic, although for some documents the probability of belonging to a given topic will be 0, or very, very close to 0.  The bad thing about topic modelling is that it assumes you know nothing about your data, and therefore will lie to you unless and until you treat it properly: feed it clean data, restrict its calorie intake by removing anything unnecessarily fatty or sugary, and tell it how many topics you actually want it to find.  But hang on, you don't know anything about the data, right?  So how can you possibly know what to look for?  And the topic modelling algorithm just sits quietly with a smug grin on its face.

In the space of ten years or so, the computer science community, or at least the bit of it that's tried to wipe the smug grin off its face, has gone from ‘look! It’s a miracle!’ to ‘this thing just LIES and LIES and LIES….’.   Some attempts have been made to address the problem, but they are computationally expensive, or the code hasn’t been incorporated into the popular implementations yet, and so the non-computer-sciency person like me is left trying to work out the best method of making the bloody thing work.

And that’s the real issue: there’s a line between ‘making it work’ and ‘fiddling the figures until you get what you’re looking for’ that must not be crossed.  Not if you want an examiner to believe your research, anyway.  And it turns out there are a thousand ways to tweak the data / the parameters of the algorithm / both together to ‘make it work’.

Ultimately, the goal is to uncover “….better topics that are more human-interpretable”, but even if you know the domain from which your corpus has been drawn, this can still be challenging.

So, the general consensus in all the research papers I’ve read seems to be that the data needs to be cleaned up (URLs removed, contractions expanded, punctuation deleted etc.); stopwords removed (including any additional words that are particular to the corpus); and the corpus tokenised (each word is now a token, not a functional part of a sentence).  Some papers advocate stemming – reducing words to their root form – others don’t mention this at all.  Stemming reduces the overall number of tokens as words like ‘teacher’, ‘teaching’, ‘taught’ become ‘teach’.  The next step is then to produce a matrix of word count vectors – the number of times every word in the entire corpus is used in each document.  That’s potentially a lot of zeros, which is fine.  Having done this, it’s possible to go one step further and add a weighting to each word so that the count is now in inverse proportion to the number of documents it’s used in.  In short, the least frequently used words get higher ‘scores’ and become more important as a way of signalling a particular topic.  The 232 documents in my trial set of data (blog posts, in case you haven’t read anything else I’ve written) contain 7,047 unique tokens.  That’s a matrix (table, if you like) of 232 x 7,047 word counts or TF-IDF scores.
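The count-vector and TF-IDF steps can be sketched with scikit-learn. These toy documents are made up for illustration, not my actual corpus; the matrix shape is always documents x unique tokens.

```python
# A minimal sketch of the vectorising step described above, using
# scikit-learn. Toy documents, not the real 232-post trial set.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "teaching in the classroom is hard work",
    "marking and feedback take up teaching time",
    "the classroom is where marking happens",
]

# Raw counts: one row per document, one column per unique token.
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(docs)
print(counts.shape)  # (number of documents, number of unique tokens)

# TF-IDF: the same matrix, but each count is down-weighted by how many
# documents the token appears in, so rarer words score higher.
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(docs)
print(tfidf.shape)
```

Both matrices have the same shape; only the values inside differ.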

If I use a simple count vector and ask the algorithm to find 8 topics (a lot of trial and error suggested that this might be the optimal number.  Don’t ask. Let’s just say it was a lot of running and re-running code), plus stemming and tokenising, the spread of topics looks like this:

CVV3

What you’re looking at here is each of 8 topics ‘clustered’ in two-dimensional space (principal components 1 and 2).  The top-30 most-used words in each topic are shown on the right.  The same parameters, but this time using TFIDF, are shown here:

TFIDFV3

It looks a bit different.  One topic dominates all the others.  What makes this all more than a bit annoying is that every time the algorithm is run, the results are slightly different, so anyone wedded to 100% accuracy and replicability is going to be apoplectic very quickly.  Nevertheless, if you can take a deep breath and get beyond this, it’s entirely possible to accept what is being shown as a reasonable representation of the topics in the corpus, given the decisions made on the way to the final convergence of the algorithmic process.  It’s not ‘true’, but it’s ‘true for a given value of true’.
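The run-to-run variation comes from LDA's random initialisation, so fixing the random seed makes a single run repeatable (without making it any more 'true'). A sketch with scikit-learn on a toy count matrix, not my real data:

```python
# LDA results vary between runs because the algorithm starts from a random
# initialisation. Fixing the seed makes a run repeatable.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "teaching the curriculum in the classroom",
    "marking feedback assessment workload",
    "curriculum design and classroom teaching",
    "assessment marking takes too much time",
]
counts = CountVectorizer().fit_transform(docs)

# Same data, same parameters, same seed -> identical topic-word weights.
lda_a = LatentDirichletAllocation(n_components=2, random_state=42).fit(counts)
lda_b = LatentDirichletAllocation(n_components=2, random_state=42).fit(counts)
print((lda_a.components_ == lda_b.components_).all())  # True

# Every document gets a probability for every topic; each row of the
# document-topic matrix sums to 1, with most entries close to zero.
doc_topics = lda_a.transform(counts)
print(doc_topics[0].sum())
```

Whether fixing the seed counts as 'making it work' or 'fiddling the figures' is, of course, exactly the line discussed above.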

 

If anyone wants to have a look at the interactive files, they’re here in my GitHub.  You’ll need to download the files, and then open them using a browser.

As well as trying this out on data that’s been stemmed and tokenised, I also tried using just a count vectoriser.  You’ll find the files by clicking on the link above.

The verdict?  Well, the goal is to be able to add a meaningful label to each cluster.  I haven’t had a really close look at them yet, but first glance suggests applying a simple count vectoriser, and only tokenising the data, seems to produce the clearest results.  In the end, the method I choose to arrive at the results has to be consistent, and once I’ve decided what it is going to be, I have to accept them as they are, because I have to repeat this on 14 sets of blogs.  It’s also entirely likely that, for some years, there will be more than 8 topics (or fewer) so there will still be some faffing around in terms of topic numbers, but that will be it.  Everything else will stay the same.


Kicking Off The Actual Writing

For the last two days, I’ve been up NORTH at a writing retreat, organised by the DEN, and held here.  I’ll add a link to my photos, but I’ll put this one here because it sums up the place perfectly!

20180612_105025

I got loads of work done, as you can see.  I love technology, but sometimes you have to get the stationery out and do it the old-fashioned way.  Besides, who doesn’t love stationery, amiright?

20180612_162151

The narrative…

20180612_162216

The overall structure, minus the Introduction (which is next)

20180612_162237

The Introduction

20180612_162255

Starting the Literature Review, and some extra thoughts….

20180612_162128

….and the Literature Review specifically focusing on the blogosphere!

Developing Categories, Part 4

I thought I’d have a quick look at the difference using a lemmatiser instead of a snowball stemmer makes to clustering using k-means and just my group of labelled blogs.  Here’s the silhouette plot based on groups:

SilPlotLemm

Remember, the closer the score is to 0, the more statistically likely it is that the blog could be in a different category.

Here’s the same data, this time with the number of categories set to 6, but grouped according to the category the algorithm has calculated as being the most appropriate.

SilPlotLemmV2

There appear to be blogs that, at least according to k-means, are in a category with a variety of different labels.  The algorithm isn’t learning anything, though; it’s just making decisions based on the scores of tokens in the blog, nothing else.  I simply wanted to see if lemmatising the blogs instead of stemming them made much of a difference.
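The scores behind silhouette plots like these can be sketched directly: k-means clusters the TF-IDF vectors, and each document gets a silhouette score between -1 and 1, where values near 0 flag documents that could plausibly sit in a different cluster. Toy documents again, not my labelled blogs:

```python
# Sketch of per-document silhouette scores from a k-means clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

docs = [
    "marking and feedback for assessment",
    "feedback on marking workload",
    "behaviour management in the classroom",
    "classroom behaviour and discipline",
]
X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scores = silhouette_samples(X, labels)  # one score per document, in [-1, 1]
for doc, label, score in zip(docs, labels, scores):
    print(f"cluster {label}  score {score:.2f}  {doc[:35]}")
```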

Here are the same parameters as above, but using the snowball stemmer as before:

SilPlotSnow

And side-by-side (Snowballing / Lemmatising):

SilPlotSnow                                       SilPlotLemmV2

 

The answer is: overall, not that I can see.

Developing Categories, Part 3

stuff8

I’ve already said that I wasn’t sure if ‘behaviour’ and ‘feedback, assessment & marking’ (FAM) should be separate categories, and some further analysis has convinced me that I need to drop them both.

One of the many useful features of Orange is the ‘concordance’ app, shown on the left in my workflow.  It allows for a sub-set of documents to be extracted based on a key word.  I chose to have a closer look at ‘marking’.  As you can see from the screenshot below, the app will show you your chosen word as it appears with a selected number of words either side.  The default is 5, which I stuck with.

stuff9

The white and blue bands represent individual documents, which can then be selected and viewed using the ‘corpus viewer’ app.  I browsed through several, deciding that they should best be classed as ‘professional concern’, ‘positioning’, ‘soapboxing’ or ‘reflective practice’.  I selected ‘assessment’ and ‘feedback’ as alternatives to ‘marking’, but a closer look at a few of them suggested the same.  I went back to the posts I’d originally classified as ‘FAM’ and reviewed them, and decided I could easily re-categorise them too.
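The concordance idea, a key word shown with a window of words either side (sometimes called 'key word in context'), is simple enough to sketch in plain Python. Orange's widget does the same thing interactively; the text here is a hypothetical snippet.

```python
# A minimal concordance: return each occurrence of a keyword with
# `window` words of context either side.
def concordance(text, keyword, window=5):
    words = text.lower().split()
    hits = []
    for i, w in enumerate(words):
        if w == keyword:
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            hits.append(" ".join(left + [w.upper()] + right))
    return hits

text = ("We set our own marking criteria and then peer marking "
        "invoked a lot of table based discussion")
for line in concordance(text, "marking"):
    print(line)
```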

Here’s an example of a post containing the key word ‘marking’:

Lesson 3 (previous post) had seen my Head of Department sit in with a year 7 group to look at ideas he could apply. His key observation- the need for a grounding in the terminology and symbols see first lesson which has been shared as a flipchart with the department. We move on apace to lesson 4 where pupils start to be involved in setting their own marking criteria linked to SOLO. Still no hexagons, a key aspect in the sequence of lessons now being blogged about by Paul Berry (see previous post). Linking activities between lessons has become very overt in this sequence of lessons. Our starter was a return to annotate the pics from last lesson. Most recall was at a uni-structural stage and some discussion ensued see Year 7 example below. The focus today was to be on marking information onto maps accurately. We have decided as a department to return to more traditional mapping skills as many of our pupils have a lack of sense of place. So we returned to the textbook (Foundations) and a copy of the main map was shared with the class. It limits the amount of information, and hopefully this will develop a stronger use of maps in future work. Before starting though we needed to determine a SOLO based marking criteria which allowed peer marking. The pupils in year 7 in particular had clear ideas already about this. We identified how they would mark and initial and day the marking as Sirdoes so it was clear who the peer marker was. The map task was time limited. I use a variety of flash based timers which I found online- the novelty value of how the timer will end can be a distraction at the end of a task but does promote pupil interest. I circulated the room giving prompts on how seas could include other terms e.g. Channel and ocean. The work rate was very encouraging. The peer marking was successful and invoked quite a lot of table based discussions. We started to identify the idea of feed forward feedback to allow improvement of future pieces of work. 
Lesson 4 with years 8 and 9 included a return to the SOLO symbols image sheet and sharing recall. Also a key facts based table quiz was used to promote teamwork and remind how we already know a range of facts. These quizzes provided a good opportunity to use the interactive nature of the board to match answers to locations. Writing to compare features in different locations became the focus for Years 8 and 9. We recapped the use of directions in Relational answers. Headings were provided and I circulated to support and/ or prompt as required. Now I need to identify opportunities to use HOT maps as recommended by others including Lucie Golton, John Sayers et al. from Twitters growing #SOLO community. Also the mighty hexagons and linking facts need to enter the arena. Please if commenting, which image size works better as lesson 3 or lesson 4?

This is clearly ‘reflective practice’: the practitioner is commenting on the successes of using the SOLO taxonomy model with a variety of year groups.

If I have time, it may well be more appropriate to interrogate a particular category to visualise what sub-categories may emerge, e.g. I would expect ‘professional concern’ to encompass workload, marking, growth mindset, flipped learning etc., areas of concern that are ‘product’ as opposed to ‘process’.

Clustering Blog Posts: Part 2 (Word Frequency)

One of the most important things to do when working with a lot of data is to reduce the dimensionality of that data as far as possible.  When the data you are working with is text, this is done by reducing the number of words used in the corpus without compromising the meaning of the text.

One of the most fascinating things about language was discovered by G. K. Zipf in 1935¹: the most frequently used words in (the English) language are actually few in number, and obey a ‘power law’.   The most frequently used word occurs twice as often as the next most frequent word, three times as often as the third, and so on.  Zipf’s law forms a curve like this:
Zipf-Curve

The distribution seems to apply to languages other than English, and it’s been tested many times, including on the text of, for example, novels.  It seems we humans are very happy to come up with a rich and varied lexicon, but then rely on just a few words to communicate with each other.  This makes perfect sense as far as I can see: saying I live on a boat gets the essentials across (a thing that floats, a bit of an alternative lifestyle, how cool am I? etc.), whereas were I to say I live on a lifeboat, I’d then have to explain that it’s like one of the fully-enclosed ones you see hanging from the side of cruise ships, not the open Titanic-style ones most people would imagine.
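The rank-frequency relationship is easy to play with: count the words, rank them, and compare each count against the top count divided by the rank. The word list below is artificial (real prose shows the curve far more convincingly):

```python
# Sketch of a Zipf rank-frequency check on a toy word list.
from collections import Counter

words = (["the"] * 60 + ["of"] * 30 + ["and"] * 20 + ["to"] * 15 +
         ["boat"] * 3 + ["lifeboat"] * 1)
ranked = Counter(words).most_common()
top_count = ranked[0][1]
for rank, (word, count) in enumerate(ranked, start=1):
    # Zipf predicts count ~ top_count / rank
    print(f"rank {rank}: {word!r} x{count}  (Zipf predicts ~{top_count / rank:.0f})")
```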

“For language in particular, any such account of the Zipf’s law provides a psychological theory about what must be occurring in the minds of language users. Is there a multiplicative stochastic process at play? Communicative optimization? Preferential reuse of certain forms?” (Piantadosi, 2014)

A recent paper by Piantadosi² reviewed some of the research on word frequency distributions, and concluded that, although Zipf’s law holds broadly true, there are other models that provide a more reliable picture of word frequency which depend on the corpus selected.  Referring to a paper by another researcher, he writes “Baayen finds, with a quantitative model comparison, that which model is best depends on which corpus is examined. For instance, the log-normal model is best for the text The Hound of the Baskervilles, but the Yule–Simon model is best for Alice in Wonderland.”

I’m not a mathematician, but that broadly translates as ‘there are different ways of calculating word frequency; you pays your money, you takes your choice’.  Piantadosi then goes on to explain the problem with Zipf’s law: it doesn’t take account of the fact that some words may occur more frequently than others purely by chance, giving the illusion of an underlying structure where none may exist.  He suggests a way to overcome this problem, which is to use two independent corpora, or split a corpus in half and then test the word frequency distribution in each.  He then tests a range of models, finds that the “…distribution in language is only near-Zipfian”, and concludes “Therefore, comparisons between simple models will inevitably be between alternatives that are both “wrong.” “.
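Piantadosi's split-half idea can be sketched very simply: shuffle the corpus, divide it in two, and compare word frequencies across the halves, so that chance fluctuations in one half aren't mistaken for structure. Toy documents stand in for the real corpus here:

```python
# Split-half word-frequency comparison, sketched on toy documents.
import random
from collections import Counter

docs = [
    "the pupils enjoyed the marking task",
    "feedback on the marking was positive",
    "the curriculum shaped the lesson",
    "pupils discussed the curriculum today",
]
random.seed(0)
random.shuffle(docs)
half_a, half_b = docs[:2], docs[2:]

def word_counts(documents):
    return Counter(w for d in documents for w in d.split())

counts_a, counts_b = word_counts(half_a), word_counts(half_b)
# Words frequent in BOTH halves are reliably frequent; words frequent in
# only one half may just be chance.
print(counts_a.most_common(3))
print(counts_b.most_common(3))
```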

Semantics also has a strong influence on word frequency.  Piantadosi cites a study³ that compared 17 languages across six language families and concluded that simple words are used with greater frequency in all of them, and result in a near-Zipfian model.  More importantly for my project, he notes that other studies indicate that word frequencies are domain-dependent.   Piantadosi’s paper is long and presents a very thorough review of research relating to Zipf’s law, but the main point is that it does exist, even though why it should be so is still unclear.  The next question is should the most frequently used words from a particular domain also be removed?

As I mentioned before, research has already established that it’s worth removing (at least as far as English is concerned) a selection of words.  Once that’s done, which are the most frequently used words in my data?  I used Orange to split my data in half, generate three word clouds based on the same parameters, and observe the result.  Of course I’m not measuring the distribution of words, I’m just doing a basic word count and then displaying the results, but it’s a start.  First, here’s my workflow:

WFD1

I’ve shuffled (randomised) my corpus, taken a training sample of 25%, and then split this again into two equal samples.  Each of these has been pre-processed using the following parameters:

WFD2

Pre-processing parameters. I used the lemmatiser this time.

The stop word set is the extended set of ‘standard’ stop words used by Scikit that I referred to in my previous post, plus a few extra things to try and get rid of some of the rubbish that appears.

The word clouds for the full set, and each separate sample, look like this:

WC1

Complete data set (25% sample, 2316 rows)

WC2

50% of sample (1188 rows)

WC3

Remaining 50% of sample

The graph below plots the frequency with which the top 500 words occur.

WFDGraph

So, I can conclude that based on word counts, each of my samples is similar to each other, and to the total (sampled) corpus.  This is good.

So, should I remove the most frequently used words, and if so, how many?  Taking the most frequently used words across each set, and calculating the average for each word, gives me a list as follows:

table1

And if I take them out, the word cloud (based on the entire 25% set) looks like this:

WCouldLemSWset3

Which leads me to think I should take ‘learning’ and ‘teaching’ out as well.  It’s also interesting that the word ‘pupil’ has cropped up here – I wonder how many teachers still talk about pupils rather than students?  Of course, this data set contains blogs that may be a few years old, and/or be written by bloggers who prefer the term.  Who knows?  In fact, Orange can tell me.  The ‘concordance’ widget, when connected to the bag of words, tells me that ‘pupil’ is used in 64 rows (blogs) and will show me a snippet of the sentence.

concordance1

It’s actually used a total of 121 times, and looking at the context I’m not convinced it adds value in terms of helping me with my ultimate goal, which is clustering blog posts by topic.  It’s probably worth mentioning here that the words used the least often are going to be the most numerically relevant when it comes to grouping blogs by topic.

WCouldLemSWset4

Could I take out some more?  This is a big question.  I don’t want to remove so many words that the data becomes difficult to cluster.  Think of this as searching the blog posts using key words, much as you would when you search Google.  Where, as a teacher, you might want to search ‘curriculum’, you might be more interested in results that discuss ‘teaching (the) curriculum’  rather than those that cover ‘designing (the) curriculum’.  If ‘teaching’ has already been removed, how will you find what you’re looking for?  Alternatively, does it matter so long as the search returns everything that contains the word ‘curriculum’?  You may be more interested in searching for ‘curriculum’ differentiated by key stage.  For my purposes, I think I’d be happy with a cluster labelled ‘curriculum’ that covered all aspects of the topic.  I’ll be able to judge when I see some actual clusters emerge, and have the chance to examine them more closely.  ‘Curriculum’, incidentally, the concordance widget tells me is used in 93 blogs, and appears 147 times.  That’s more than ‘pupil’, but because of my specialised domain knowledge I judge it to be more important to the corpus.

Which is also a good example of researcher bias.

  1. Zipf, G. K.; The Psychology of Language; 1966; The M.I.T. Press.
  2. Piantadosi, S.; Zipf’s word frequency law in natural language: A critical review and future directions; Psychonomic Bulletin & Review; 2014; volume 21, issue 5, pages 1112–1130.
  3. Calude, A., Pagel, M.; How do we use language? Shared patterns in the frequency of word use across 17 world languages; 2011; Philosophical Transactions of the Royal Society B: Biological Sciences; volume 366, issue 1567, pages 1101–1107.

Clustering Blog Posts: Part 1

Today, while my laptop was employed scraping Edu-blog posts from 2011, I decided to play around with Orange.  This is one of the suite of tools offered by Anaconda which I use for all my programming needs.

The lovely thing about Orange is that it allows you to build a visual workflow,  while all the actual work in the form of the lines and lines of code is done behind the scenes.  You have to select some parameters, which can be tricky, but all the heavy lifting is done for you.

This was my workflow today, although this was about halfway through so by the time I’d finished, there were a few more things there.  Still, it’s enough to show you how it works.

workflow

My corpus is a sample of 9,262 blog posts gathered last year.  Originally, there were over 11,000 posts but they’ve been whittled down by virtue of having no content, having content that had been broken up across several rows in the spreadsheet, or being duplicates.  I also deleted a few that simply weren’t appropriate, usually because they were written by educational consultants as a means to sell something tangible such as books or software, or were political in some way, such as blogs written for one of the teaching unions.  What I’ve tried to do is identify blog URLs that contain posts by individuals, preferably but not exclusively teachers, with a professional interest in education and writing from an individual point of view.  This hasn’t been easy, and I’m certain that when I have the full set of data (which will contain many tens of thousands of blog posts) some less than ideal ones will have crept in, but that’s one of the many drawbacks of dealing with BIG DATA:  it’s simply too big to audit.

You may recall that the point of all this is to classify as much of the Edu-blogosphere as I possibly can –  to see what Edu-professionals talk about, and to see if the topics they discuss change over time.  Is there any correlation between, for example, Michael Gove being appointed Secretary of State for Education and posts discussing his influence?  We’ll see.  First of all, I have to try and cluster the posts into groups according to content.  I’ve been doing this already, and developed a methodology.  However, while I’m still gathering data, and labelling a set of ‘training data’ (of which more in a future blog post) I’ve been experimenting with a different set of tools.

So, here’s my first step using Orange.  Open the corpus, shuffle (randomise) the rows, and take a sample of 25% for further analysis, which equates to 2316 documents or rows.

corpus

Open the document, select the data you need, e.g. ‘Content’.  The other features can still be accessed.

 

 

Step1

The ‘corpus viewer’ icons along the top of the workflow shown above mean that I can see the corpus at each stage of the process.  This is a glimpse of the second one, after shuffling.  Double-clicking any of the icons brings up a view of the output, as well as the various options available for selection.

viewer2

The next few steps will give me an insight into the data, and already there are a series of decisions to make.  I’m only interested in pre-processing the content of each blog post.  The text has to be broken up into separate words, or ‘tokens’, with punctuation removed, plus any URLs that are embedded in the text.  In addition, other characters such as /?&* are also stripped out.  So far, so straightforward, as this screen grab shows:

 

preprocess1

Words can also be stemmed or lemmatised (see above).

“Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes.  Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.  If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun.” – The Stanford Natural Language Processing Group
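That crudeness is easy to see in practice. The Snowball stemmer chops suffixes by rule, with no dictionary behind it, so 'teaching' loses its ending but 'taught' survives unchanged (a lemmatiser, which needs a vocabulary such as WordNet, could map 'taught' back to 'teach'). A quick sketch with NLTK:

```python
# Snowball stemming: rule-based suffix stripping, no vocabulary involved.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["teacher", "teaching", "taught", "marking", "categories"]:
    print(f"{word} -> {stemmer.stem(word)}")
```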

If I lemmatise the text as is, I get this lovely word cloud. BTW, it’s just a coincidence that ‘example’ is in red.

WCloudLemmatiser

As you would expect from blogs written by teachers, words like ‘student’ and ‘teacher’ feature heavily.  If I use the snowball stemmer (which is basically an efficient algorithm for stemming, explained here) then the word cloud looks like this:

WCloudBespokeStopWords

Both of these word clouds are also generated from the corpus after a series of words known as ‘stop words’ is removed.  These words are the ones we often use most frequently, but add little value to a text; words such as ‘the’, ‘is’, ‘at’ or ‘which’.  There is no agreed standard, although most algorithms use the list provided by the Natural Language Toolkit (NLTK).  I’ve chosen to use the list provided by Scikit-Learn, a handy module providing lots of useful algorithms.  Their list is slightly longer.  The use of stop words is well researched and recommended to reduce the number of unique tokens (words) in the data, otherwise referred to by computer scientists as dimensionality reduction.   I also added some other nonsense that I noticed when I was pre-processing this data earlier in the year – phrases like ‘twitterfacebooklike’ – so in the end I created my own list combining the ‘standard’ words and the *rap, and copied them into a text file.  This is referred to as ‘NY17stopwordsOrange.txt’ in the screenshot below.

stopwords

My next big question, though, is what happens if I add to the list some of the words that are most frequently used in my data set – words like ‘teach’, ‘student’, ‘pupil’, ‘year’?  So I added these words to the list: student, school, teacher, work, year, use, pupil, time, teach, learn.  This is the result, using the stemmer:

WCloudStopWordsSet2

There is research to suggest that creating a bespoke list of stop words that is domain-specific is worth doing as a step before going on to try and classify a set of documents.  It’s the least-used words in the corpus that are arguably the most interesting and valuable.  I’ll explore this some more in the next post, along with the following steps in the workflow.