Today, while my laptop was busy scraping Edu-blog posts from 2011, I decided to play around with Orange. This is one of the suite of tools offered by Anaconda, which I use for all my programming needs.
The lovely thing about Orange is that it allows you to build a visual workflow, while all the actual work in the form of the lines and lines of code is done behind the scenes. You have to select some parameters, which can be tricky, but all the heavy lifting is done for you.
This was my workflow today, although it was captured about halfway through, so by the time I’d finished there were a few more things there. Still, it’s enough to show you how it works.
My corpus is a sample of 9,262 blog posts gathered last year. Originally, there were over 11,000 posts, but they’ve been whittled down by removing those with no content, those whose content had been broken up across several rows in the spreadsheet, and duplicates. I also deleted a few that simply weren’t appropriate, usually because they were written by educational consultants as a means to sell something tangible such as books or software, or because they were political in some way, such as blogs written for one of the teaching unions. What I’ve tried to do is identify blog URLs that contain posts by individuals, preferably but not exclusively teachers, with a professional interest in education and writing from an individual point of view. This hasn’t been easy, and I’m certain that when I have the full set of data (which will contain many tens of thousands of blog posts) some less than ideal ones will have crept in, but that’s one of the many drawbacks of dealing with BIG DATA: it’s simply too big to audit.
You may recall that the point of all this is to classify as much of the Edu-blogosphere as I possibly can – to see what Edu-professionals talk about, and to see if the topics they discuss change over time. Is there any correlation between, for example, Michael Gove being appointed Secretary of State for Education and posts discussing his influence? We’ll see. First of all, I have to try and cluster the posts into groups according to content. I’ve been doing this already, and have developed a methodology. However, while I’m still gathering data and labelling a set of ‘training data’ (of which more in a future blog post), I’ve been experimenting with a different set of tools.
So, here’s my first step using Orange. Open the corpus, shuffle (randomise) the rows, and take a sample of 25% for further analysis, which equates to 2316 documents or rows.
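Behind the scenes, that step amounts to little more than this – a minimal sketch in Python, assuming the corpus sits in a CSV with one post per row (the file and column layout here are my own, not Orange’s):

```python
import pandas as pd

# Load the corpus: one blog post per row (hypothetical file name).
corpus = pd.read_csv("edublog_corpus.csv")

# Shuffle the rows and keep a 25% sample – roughly 2,316 of the 9,262 posts.
sample = corpus.sample(frac=0.25, random_state=42).reset_index(drop=True)
print(len(sample))
```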
The ‘corpus viewer’ icons along the top of the workflow shown above mean that I can see the corpus at each stage of the process. This is a glimpse of the second one, after shuffling. Double-clicking any of the icons brings up a view of the output, as well as the various options available for selection.
The next few steps will give me an insight into the data, and already there are a series of decisions to make. I’m only interested in pre-processing the content of each blog post. The text has to be broken up into separate words, or ‘tokens’, with punctuation removed, along with any URLs embedded in the text. In addition, other characters such as /?&* or whatever are also stripped out. So far, so straightforward, as this screen grab shows:
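In plain Python, that clean-up step looks roughly like this – a sketch of the idea rather than Orange’s actual code, with a tokenise helper of my own:

```python
import re

def tokenise(text):
    """Lowercase, strip URLs and stray characters, then split into tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove embedded URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # strip punctuation, /?&* and the rest
    return text.split()

print(tokenise("Check /?&* this out: https://example.com – great post!"))
# ['check', 'this', 'out', 'great', 'post']
```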
Words can also be stemmed or lemmatised (see above).
“Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun.” – The Stanford Natural Language Processing Group
If I lemmatise the text as is, I get this lovely word cloud. BTW, it’s just a coincidence that ‘example’ is in red.
As you would expect from blogs written by teachers, words like ‘student’ and ‘teacher’ feature heavily. If I use the snowball stemmer (which is basically an efficient algorithm for stemming, explained here) then the word cloud looks like this:
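To make the difference concrete, here’s a small comparison using NLTK’s Snowball stemmer and WordNet lemmatiser – similar components to the ones Orange wraps, though not necessarily its exact implementations:

```python
from nltk.stem import SnowballStemmer, WordNetLemmatizer
# import nltk; nltk.download('wordnet')  # needed once for the lemmatiser

stemmer = SnowballStemmer("english")
lemmatiser = WordNetLemmatizer()

print(stemmer.stem("teaching"), stemmer.stem("students"))  # crude suffix-chopping: 'teach', 'student'
print(lemmatiser.lemmatize("saw", pos="v"))                # 'see' – the verb reading
print(lemmatiser.lemmatize("saw", pos="n"))                # 'saw' – the noun reading
```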
Both of these word clouds are also generated from the corpus after a set of words known as ‘stop words’ has been removed. These are the words we use most frequently but which add little value to a text; words such as ‘the’, ‘is’, ‘at’ or ‘which’. There is no agreed standard list, although most algorithms use the one provided by the Natural Language Toolkit (NLTK). I’ve chosen to use the list provided by Scikit-Learn, a handy module providing lots of useful algorithms; its list is slightly longer. Removing stop words is well researched and recommended as a way of reducing the number of unique tokens (words) in the data, otherwise referred to by computer scientists as dimensionality reduction. I also added some other nonsense that I noticed when I was pre-processing this data earlier in the year – phrases like ‘twitterfacebooklike’ – so in the end I created my own list combining the ‘standard’ words and the *rap, and copied them into a text file. This is referred to as ‘NY17stopwordsOrange.txt’ in the screenshot below.
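Building that file is straightforward; something along these lines, assuming Scikit-Learn’s built-in list as the starting point (the junk tokens shown are just examples):

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Junk tokens spotted during earlier pre-processing (examples only).
extras = {"twitterfacebooklike"}

stop_words = sorted(ENGLISH_STOP_WORDS | extras)

# Write one word per line to a plain text file.
with open("NY17stopwordsOrange.txt", "w") as f:
    f.write("\n".join(stop_words))
```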
My next big question, though, is what happens if I add to the list some of the words that are most frequently used in my data set – words like ‘teach’, ‘student’, ‘pupil’, ‘year’? So I added these words to the list: student, school, teacher, work, year, use, pupil, time, teach, learn, use. This is the result, using the stemmer:
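Spotting which words to add is just a frequency count – something like this, assuming the tokenise() helper from the earlier sketch and a ‘content’ column holding the post text:

```python
from collections import Counter

counts = Counter()
for post in sample["content"]:          # 'content' is an assumed column name
    counts.update(tokenise(str(post)))

# The most frequent tokens are the candidates for a domain-specific stop-word list.
print(counts.most_common(20))
```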
There is research to suggest that creating a bespoke, domain-specific list of stop words is a worthwhile step before going on to try and classify a set of documents. It’s the least-used words in the corpus that are arguably the most interesting and valuable. I’ll explore this some more in the next post, along with the following steps in the workflow.