Coding Resources, or: Things I Wish I’d Known When I Started


As some of you know, I’m in the final year of my PhD in Web Science.  For whatever reason, I decided I’d learn a whole load of new stuff from the ground up.  In my 50s.  With zero knowledge to start with except some very basic maths.  I needed to learn to write code, and although my MSc year included a module on writing code in Python, it did nothing more than get me familiar with what code actually looks like on the page.

I cried every Sunday night, prior to the workshop on Monday, because I just couldn’t see how to make things work.

Today, over two years on, I get it.  I can write it (although I still have to refer to a book or previous code I’ve written as a reminder) and my ability to think logically has improved considerably.  During that time, I’ve amassed a range of books and URLs that have been, and still are, incredibly useful.  It’s time to share and provide myself with a post of curated resources at the same time.

First of all, you absolutely need a pencil (preferably with a rubber on the end), some coloured pens if you’re a bit creative, and plenty of A3 paper.  Initially, this is just for taking notes, but I found them incredibly useful further along when I wanted to write out the task that I needed my code to carry out, step by step.

Post-it notes – as many colours and sizes as you fancy.  Great for scribbling notes as you go, acting as bookmarks, and if you combine them with the coloured pens and A3 paper, you can make a flow chart.

Codecademy is a good place to start.  It takes you through the basics step by step, and helps you to see both what code looks like on screen and how it should be written.  There are words that act as commands, e.g. print, while, for, that appear in different colours so you can see you’ve written something that’s going to do something, and you can see straight away that indents are important: they signal which instructions belong together (indents act a bit like brackets in maths).
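To make that concrete, here’s a tiny made-up example of how indentation groups instructions in Python:

```python
# The indented line 'belongs to' the for loop, so it runs once per item;
# the un-indented print runs only once, after the loop has finished.
doubled = []
for number in [1, 2, 3]:
    doubled.append(number * 2)
print(doubled)
```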

Just about every book that covers writing code includes a basic tutorial, but one that I bought and still keep referring back to is Automate The Boring Stuff With Python.  By the time you get here, you’ll be wanting to start writing your own code.  For that, I recommend you install Anaconda which will give you a suite of excellent tools.  Oh, and I use Python 3.6.
Once you’ve opened Anaconda, Spyder is the basic code editor.  I also use the Jupyter Notebook a lot.  I like it because it’s much easier to try out code bit by bit, so for example when I’m cleaning up some text data and want to remove white space, or ‘new line’ characters, I can clear things one step at a time and see the results at the end of each one.  You can do the same using Spyder, but it isn’t as easy.
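As an illustration (not my actual project code), the kind of step-at-a-time clean-up I mean looks something like this, with each step in its own Notebook cell so you can inspect the result before moving on:

```python
raw = "  Some scraped text\nwith a stray newline   and    extra spaces.  "

step1 = raw.replace("\n", " ")    # first cell: remove 'new line' characters
step2 = " ".join(step1.split())   # second cell: collapse runs of white space
print(step2)
```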

I’m going to list some books next, but before I do I should mention Futurelearn.  I have done several of the coding courses – current ones include ‘Data Mining With WEKA’, ‘Advanced Data Mining With WEKA’ and ‘Learning To Code For Data Analysis’.  While these may not cover exactly what you have in mind to do (more on that in a minute), they will all familiarise you with gathering data, doing things with the data by writing code, and visualising the results.  They also help to get you thinking about the whole process.

I had a series of tasks I needed code to do for me.  In fact, I think the easiest way to learn how to write code is to have something in mind that you want it to do.  I needed to be able to gather text from blog posts and store it in a way that would make it easily accessible.  In fact, I needed to store the content of a blog post, the title of the post and the date it was published.  I later added the URL, as I discovered that for various reasons sometimes the title or the date (or both) were missing and that information is usually in the URL.  I then identified various other things I needed to do with the data, which led to identifying more things I needed to do with the data… and so on.  This is where I find books so useful, so here’s a list:
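To give a flavour of that first task, here’s a hedged sketch (not my original code) of storing a post’s content, title, date and URL together.  It uses the BeautifulSoup library, and the HTML structure it assumes is hypothetical – real blogs vary, which is half the battle:

```python
from bs4 import BeautifulSoup

def parse_post(html, url):
    """Store a post's content, title, date and URL together."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1")     # assumes the title is in the first h1
    date = soup.find("time")    # assumes a <time datetime="..."> tag
    return {
        "url": url,  # kept as a fallback when title or date are missing
        "title": title.get_text(strip=True) if title else None,
        "date": date.get("datetime") if date else None,
        "content": " ".join(p.get_text(strip=True)
                            for p in soup.find_all("p")),
    }
```

Fetching the page first (e.g. with the requests library) and passing the HTML in keeps the parsing easy to test.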

  • Mining The Social Web, 2nd Edition.  The code examples given in this book are a little dated, and in fact, rather than write the code line-by-line to do some things, you’d be better off employing what I’ll call, for the sake of simplicity, an app to do it for you.  It was the book that got me started, though, and I found the simple explanations for some of the things I needed to achieve very useful.
  • Data Science From Scratch.  I probably should have bought this book earlier, but it’s been invaluable for general information.
  • Python For Data Analysis, 2nd Edition.  Again, good for general stuff, especially how to use Pandas.  Imagine all the things you can do with an Excel spreadsheet, but once your sheet gets large, it becomes very difficult to navigate, and calculations can take forever.  Pandas can handle spreadsheet-style stuff with consummate ease and will only display what you want to see.  I love it.
  • Programming Collective Intelligence.  This book answered pretty much all the other questions I had, but also added a load more.  It takes you through all sorts of interesting algorithms and introduces things like building classifiers, but the main problem for me is that the examples draw on data that has already been supplied for you.  That’s great, but like so many other examples in all sorts of other books (and on the web, see below) that’s all fine until you want to use your own data.
  • This book began to answer the questions about how to gather your own data, and how to apply the models from the books cited above: Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from your Data.  This book has real-world examples which were relatively easy for me to adapt, as well as straightforward explanations as to how the code works.
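To illustrate the Pandas point from the list above with some made-up data: filtering and sorting stays quick and readable however large the table gets.

```python
import pandas as pd

# Made-up data: one row per blog post, spreadsheet-style.
posts = pd.DataFrame({
    "title": ["SOLO in Year 7", "Marking workload", "Ofsted again"],
    "year":  [2013, 2014, 2014],
    "words": [850, 420, 610],
})

# Only display what you want to see: posts from 2014, longest first.
recent = posts[posts["year"] == 2014].sort_values("words", ascending=False)
print(recent["title"].tolist())
```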

Finally, some useful web sites.  The first represented a real breakthrough for me.  Not only did it present a real-world project from the ground up, but the man behind it, Brandon Rose (who also contributed to the last book in my list), is on Twitter and he answered a couple of questions from me when I couldn’t get his code to work with my data.  In fact, he rewrote bits of my code for me, with explanations, which was incredibly helpful and got me started.  http://brandonrose.org/ is amazing.

This is the one and only video tutorial I’ve found useful.  Very useful, actually.  I find video tutorials impossible to learn anything from on the whole – you can’t beat a book for being able to go back, re-read, bookmark, write notes etc. – but this one was just what I needed to help me write my code to scrape blog posts, which are just web pages: https://www.youtube.com/watch?v=BCJ4afDX4L4&t=34s.

https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/ and other blog posts.

https://www.naftaliharris.com/blog/visualizing-k-means-clustering/ does what it says, and more.

http://www.ritchieng.com/machine-learning-multinomial-naive-bayes-vectorization/ useful walk-through.

http://www.ultravioletanalytics.com/2016/11/18/tf-idf-basics-with-pandas-scikit-learn/ covers TF-IDF basics with Pandas and scikit-learn.

The URLs listed above are quite specific to the project I’ve been working on.  I’d also like to add Scikit-Learn, which provided all the apps I’ve been using.  The explanations and documentation included on the site were less than helpful, as they assumed a level of knowledge that was, and to a certain extent still is, way above my head.  However, what it gave me was the language to use when I was searching for how to write a piece of code.  Stack Overflow is the best resource there is for this, and most of my bookmarks are links to various questions and responses.  However, it did take me a while to a) learn what form of words would elicit an answer to my problem, and b) understand the answers.  I even tried asking a question myself.  Never again.  Unless you’re a fully-fledged computer science geek (and if you were, you wouldn’t be here) it’s hostile territory.
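What made Scikit-Learn usable for me in the end is that every ‘app’ follows the same fit/predict pattern.  A minimal sketch with made-up texts and labels (not my real data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["free worksheet for fractions", "lesson resources to download",
         "my marking workload is unbearable", "why do we mark so much"]
labels = ["resources", "resources", "soapboxing", "soapboxing"]

# words -> scores, then a classifier, chained into one object
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["download my free lesson worksheet"]))
```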

Finally, an excellent site that has been useful again and again: DataVizTools.

Going back to Anaconda for a minute, when you’re feeling a bit more confident, have a look at the Orange application.  I’ve blogged about it several times, and the blog on the site is an excellent source of information and example projects.  The help pages are excellent for all the basic apps, although some of the newer ones don’t have anything yet.

And to finish, a site that I found, courtesy of Facebook, this very morning.  This site lets you see how your code works with a visualiser, something I found myself doing with pencil and paper when my code wasn’t doing what it should and I didn’t know why.


Developing Categories, Part 4

I thought I’d have a quick look at the difference that using a lemmatiser instead of a Snowball stemmer makes to k-means clustering of just my group of labelled blogs.  Here’s the silhouette plot based on groups:
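As a quick reminder of why the choice might matter: a stemmer chops suffixes off by rule, so related forms collapse to the same (sometimes non-word) token, whereas a lemmatiser maps each form to a real dictionary headword.  A small sketch using NLTK’s Snowball stemmer:

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
words = ["marking", "marked", "marks", "studies"]
# All the 'mark' forms collapse to one token; 'studies' becomes the
# non-word 'studi', where a lemmatiser would give 'study'.
print([stemmer.stem(w) for w in words])
```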

[Silhouette plot (lemmatised), grouped by label]

Remember, the closer the score is to 0, the more statistically likely it is that the blog could be in a different category.

Here’s the same data, this time with the number of categories set to 6, but grouped according to the category the algorithm has calculated as being the most appropriate.

[Silhouette plot (lemmatised), grouped by computed cluster]

There appear to be blogs that, at least according to k-means, are in a category with a variety of different labels.  The algorithm isn’t learning anything, though; it’s just making decisions based on the scores of tokens in the blog, nothing else.  I simply wanted to see if lemmatising the blogs instead of stemming made much of a difference.

Here’s the same parameters as above, but using the snowball stemmer as before:

[Silhouette plot (Snowball stemmer), grouped by computed cluster]

And side-by-side (Snowballing / Lemmatising):

[Silhouette plots side by side: Snowball stemmer (left) / lemmatiser (right)]

The answer is: overall, not that I can see.

Developing Categories, Part 3

[Screenshot: my Orange workflow]

I’ve already said that I wasn’t sure if ‘behaviour’ and ‘feedback, assessment & marking’ (FAM) should be separate categories, and some further analysis has convinced me that I need to drop them both.

One of the many useful features of Orange is the ‘concordance’ app, shown on the left in my workflow.  It allows for a sub-set of documents to be extracted based on a key word.  I chose to have a closer look at ‘marking’.  As you can see from the screenshot below, the app will show you your chosen word as it appears with a selected number of words either side.  The default is 5, which I stuck with.
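A hand-rolled version of what the concordance app does can be sketched in a few lines – the five-word window is the default mentioned above, and the sample sentence is made up:

```python
def concordance(text, keyword, window=5):
    """List each occurrence of keyword with `window` words either side."""
    words = text.split()
    lines = []
    for i, w in enumerate(words):
        if w.lower().strip(".,!?") == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            lines.append(f"{left} [{w}] {right}")
    return lines

sample = "We agreed the peer marking criteria before starting the map task."
print(concordance(sample, "marking"))
```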

[Screenshot: concordance app output]

The white and blue bands represent individual documents, which can then be selected and viewed using the ‘corpus viewer’ app.  I browsed through several, deciding that they should best be classed as ‘professional concern’, ‘positioning’, ‘soapboxing’ or ‘reflective practice’.  I selected ‘assessment’ and ‘feedback’ as alternatives to ‘marking’, but a closer look at a few of them suggested the same.  I went back to the posts I’d originally classified as ‘FAM’ and reviewed them, and decided I could easily re-categorise them too.

Here’s an example of a post containing the key word ‘marking’:

Lesson 3 (previous post) had seen my Head of Department sit in with a year 7 group to look at ideas he could apply. His key observation- the need for a grounding in the terminology and symbols see first lesson which has been shared as a flipchart with the department. We move on apace to lesson 4 where pupils start to be involved in setting their own marking criteria linked to SOLO. Still no hexagons, a key aspect in the sequence of lessons now being blogged about by Paul Berry (see previous post). Linking activities between lessons has become very overt in this sequence of lessons. Our starter was a return to annotate the pics from last lesson. Most recall was at a uni-structural stage and some discussion ensued see Year 7 example below. The focus today was to be on marking information onto maps accurately. We have decided as a department to return to more traditional mapping skills as many of our pupils have a lack of sense of place. So we returned to the textbook (Foundations) and a copy of the main map was shared with the class. It limits the amount of information, and hopefully this will develop a stronger use of maps in future work. Before starting though we needed to determine a SOLO based marking criteria which allowed peer marking. The pupils in year 7 in particular had clear ideas already about this. We identified how they would mark and initial and day the marking as Sirdoes so it was clear who the peer marker was. The map task was time limited. I use a variety of flash based timers which I found online- the novelty value of how the timer will end can be a distraction at the end of a task but does promote pupil interest. I circulated the room giving prompts on how seas could include other terms e.g. Channel and ocean. The work rate was very encouraging. The peer marking was successful and invoked quite a lot of table based discussions. We started to identify the idea of feed forward feedback to allow improvement of future pieces of work. 
Lesson 4 with years 8 and 9 included a return to the SOLO symbols image sheet and sharing recall. Also a key facts based table quiz was used to promote teamwork and remind how we already know a range of facts. These quizzes provided a good opportunity to use the interactive nature of the board to match answers to locations. Writing to compare features in different locations became the focus for Years 8 and 9. We recapped the use of directions in Relational answers. Headings were provided and I circulated to support and/ or prompt as required. Now I need to identify opportunities to use HOT maps as recommended by others including Lucie Golton, John Sayers et al. from Twitters growing #SOLO community. Also the mighty hexagons and linking facts need to enter the arena. Please if commenting, which image size works better as lesson 3 or lesson 4?

This is clearly ‘reflective practice’: the practitioner is commenting on the successes of using the SOLO taxonomy model with a variety of year groups.

If I have time, it may well be more appropriate to interrogate a particular category to visualise what sub-categories may emerge, e.g. I would expect ‘professional concern’ to encompass workload, marking, growth mindset, flipped learning etc., areas of concern that are ‘product’ as opposed to ‘process’.

Developing Categories, Part 2

So, while I deploy my bespoke python code to scrape the contents of umpteen WordPress and Blogger blogs, I’ve continued trying to classify blogs from my sample according to the categories I outlined in my previous post.

I say ‘trying’ because it’s not as straightforward as it seems.  Some blogs clearly don’t fit into any of the categories, e.g. where a blogger has simply written about their holiday, or where one blogger has written a series of posts explaining various aspects of science or physics.  I reckon that this is a science teacher writing for the benefit of his or her students, but as the posts are sharing ‘knowledge’ rather than ‘resources’, I can’t classify them.  Fortunately the label propagation algorithm I will eventually be using will allow for new categories to be instigated (or the ‘boundaries’ for existing categories to be softened) so it shouldn’t be a problem.
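For anyone curious what that looks like in practice, scikit-learn ships one implementation of the idea (LabelSpreading, in its semi-supervised module): unlabelled points are marked -1 and the known labels spread to them.  The data here is made up purely to show the shape of the API – it’s a sketch, not my pipeline:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

X = np.array([[0.0, 0.1], [0.1, 0.0],       # close together: class 0
              [5.0, 5.1], [5.1, 5.0],       # close together: class 1
              [0.05, 0.05], [5.05, 5.05]])  # unlabelled points
y = np.array([0, 0, 1, 1, -1, -1])          # -1 means 'no label yet'

model = LabelSpreading().fit(X, y)
print(model.transduction_[-2:])  # labels inferred for the unlabelled points
```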

‘Soapboxing’, ‘professional concern’ and ‘positioning’ have also caused me to think carefully about my definitions.  ‘Soapboxing’ I’m counting as all posts that express an opinion in a strident, one-sided way, with a strong feeling that the writer is venting frustration, and perhaps with a call to action.  These tend to be short posts, probably written because the blogger simply needs to get something off their chest and (possibly, presumably) get some support from others via the comments.  ‘Professional concern’, then, is also a post expressing a view or concern, but the language will be more measured.  Perhaps evidence from research or other bloggers will be cited, and the post will generally be longer.  The blogger may identify themselves as a teacher of some experience, or perhaps a head of department or other school leader.  As with ‘soapboxing’, a point of view will be expressed, but the call to action will be absent.

‘Positioning’ is a blog post that expresses a belief or method that the blogger holds to be valid above others, and expresses this as a series of statements.  Evidence to support the statements will be present, generally in the form of books or published research by educational theorists or other leading experts in the field of education.

Of course, having made some decisions regarding which blogs fit into these categories, I need to go back through some specific examples and try to identify some specific words or phrases that exemplify my decision.  And I fully expect other people to disagree with me, and be able to articulate excellent reasons why blog A is an example of ‘positioning’ rather than ‘professional concern’, but all I can say in response is that, while it’s possible to get a group of humans to agree around 75% of the time, it’s impossible to get them to agree 100%, and that’s both the joy and the curse of this kind of research.

Given more time, I’d choose some edu-people from Twitter and ask them to categorise a sample of blogs to verify (or otherwise) my decision, but as I don’t have that luxury the best I can do is make my definitions as clear as possible, and provide a range of examples as justification.

The other categories that aren’t proving straightforward are ‘feedback, assessment and marking’ (‘FAM’) and ‘behaviour’.  I knew this might be the case, though, so I’m keeping an open mind about these.  I have seen examples of blogs discussing ‘behaviour’ that I’ve put into one of the three categories I’ve mentioned above, but that’s because the blogs don’t discuss ‘behaviour’ exclusively.

Anyway, I’ve categorised 284 (out of a total of 7,788) posts so far so I thought I’d have a bit of a look at the data.

[Screenshot: Orange workflow for the labelled sample]

I used Orange again to get a bit more insight into my data.  Just looking at the top flow, after opening the corpus I selected the rows that had something entered in the ‘group’ column I created.

[Screenshot: selecting rows]

[Screenshot: creating classes]

I then created a class for each group name.  This additional information can be saved, and I’ve dragged the ‘save data’ icon onto the workspace, but I’ve chosen not to save it automatically for now.  If you do, and you give it a file name, every time you open Orange the file will be overwritten, which you may not want.  Then, I pre-processed the 284 blogs using the snowball stemmer, and decided I’d have a look at how just the sample might be clustered using k-means.

“Since it effectively provides a ‘suffix STRIPPER GRAMmar’, I had toyed with the idea of calling it ‘strippergram’, but good sense has prevailed, and so it is ‘Snowball’ named as a tribute to SNOBOL, the excellent string handling language of Messrs Farber, Griswold, Poage and Polonsky from the 1960s.”

Martin Porter

I’m not sure if I’ve explained k-means before, but here’s a nice link that explains it well.

“Clustering is a technique for finding similarity groups in a data, called clusters. It attempts to group individuals in a population together by similarity, but not driven by a specific purpose.”

The data points are generated from the words in the blogs.  These have been reduced to tokens by the stemmer, then a count is made of the number of times each word is used in a post.  The count is subsequently adjusted to take account of the length of the document, so that a word used three times in a document of 50 words is not given undue weight compared with the same word used three times in a document of 500.  So, each document generates a score for each word used, with zero for a word not used that appears in another document or documents.  Mathematical things happen and the algorithm converts each document into a data point in a graph like the ones in the link.  K-means then clusters the documents according to how similar they are.
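In scikit-learn terms, that whole paragraph boils down to two lines: TF-IDF does the counting-and-adjusting, and k-means does the clustering.  A sketch with made-up texts, not my data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["marking feedback assessment marking",
        "feedback and marking workload",
        "behaviour in the classroom",
        "managing classroom behaviour"]

scores = TfidfVectorizer().fit_transform(docs)  # each doc -> vector of scores
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scores)
print(km.labels_)  # which cluster each document landed in
```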

I already know I have 8 classes, so that’s the number of clusters I’m looking for.  If I deploy the algorithm, I can see the result on a silhouette plot (the matching icon, top far right of the flow diagram above).  The closer to a score of ‘0’, the more likely it is that a blog post is on the border between two clusters.  When I select that the silhouette plot groups each post by cluster, it’s clear that ‘resources’ has a few blogs that are borderline.
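Behind such a plot is one silhouette score per post, and scikit-learn’s silhouette_samples gives exactly that.  A toy sketch with made-up points, where the last point sits between two clusters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [5, 5]])  # last point is borderline
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scores = silhouette_samples(X, labels)
print(scores.round(2))  # the [5, 5] point scores closest to 0
```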

[Screenshot: silhouette plot grouped by cluster]

‘FAM’ and ‘behaviour’ are more clearly demarcated.  If I let the algorithm choose the optimal number of clusters (Orange allows between 2 and 30), the result is 6, although 8 has a score of 0.708, which is reasonable (the closer the score is to 1, the higher the probability that the number suggested is the ‘best fit’ for the total number of clusters within the data set).
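Roughly what Orange is doing when it scores each candidate number of clusters can be sketched like this, using made-up data with three obvious blobs; the k whose silhouette score is closest to 1 wins:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three tight blobs centred at 0, 5 and 10.
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (0, 5, 10)])

scores = {k: silhouette_score(
              X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
          for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k)
```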

[Screenshot: k-means cluster scores]

As you can see from the screenshot below, cluster 4 is made up of posts from nearly all the groups.  Remember, though, that this algorithm is taking absolutely no notice of my categories, or of the actual words as words that convey meaning.  It’s just doing what it does based on numbers, and providing me with a bit of an insight into my data.

[Screenshot: cluster membership by group]

Developing Categories

[Snippet of the Analytical Framework]

An initial estimate of the possible number of categories in the 25% sample of my nine-thousand-odd list of blog posts, provided by the Affinity Propagation (AP) algorithm, suggested over 100 categories.  Based on the words used in the posts it chose to put into a cluster, this was actually reasonable, although way more than I can process.  It was also obvious that some of the categories could have been combined: maths- and science-based topics often appeared together, for example.

A different method provided by an algorithm in Orange (k-means, allowing the algorithm to find a ‘fit’ of between 2 and 30 clusters) suggested three or four clusters.  How is it possible for algorithms, using the same data, to come up with such widely differing suggestions for clusters?  Well, it’s maths.  No doubt a mathematician could explain to me (and to you) in detail how the results were obtained, but for me all the explanation I need is that when you start converting words to numbers and use the results to decide which sets of numbers have greater similarity, you get a result that, while useful, completely disregards the nuances of language.

I made an initial attempt to review the content of the categories suggested by AP, but I had to give up after a few hours’ work.  I identified a good number of potential categories, including the ones suggested by the literature (see below), but I soon realised that it was going to be difficult to attribute some posts to a specific category.  A well-labelled training set is really important, even if it’s a small training set.  So, back to the research that has already been published, describing the reasons why teachers and other edu-professionals blog, and a chat with my supervisor, who made the observation that I needed to think about ‘process’ as opposed to ‘product’.

Bit of a lightbulb moment, then.  I’m not trying to develop a searchable database of every topic covered – I’m trying to provide a summary of the most important aspects of teaching discussed in blogs over a period of time.   The categories arising from the literature are clearly grounded in process, and so these are the ones I’ll use.  If you click on this link, you’ll be able to see the full version of the Analytical Framework, a snippet of which is pictured above.

As well as the main categories (the ones in the blue boxes), I decided to add two more: ‘behaviour’ and ‘assessment / feedback / marking’, simply because these, in my judgement, are important enough topics to warrant categories of their own.  However, I’m aware that they overlap with all the others, and so I may revise my decision in the light of results.  What I’ll have to do is provide clear definitions of each category, linked with the terms associated with the relevant posts.

What will be interesting is exploring each category.  The ‘concordance’ widget in Orange allows for some of the key terms to be entered, and to see how they’re used in posts.  This will add depth to the analysis, and may even lead to an additional category or two: if it appears, for example, that ‘Ofsted’ dominated blogs within the ‘professional concern’ category for a considerable period of time, an additional category would be justified.  My intention is to divide my data into sets by year (starting at 2004), although it may be prudent to sub-divide later years as the total number of blog posts increases year on year.

Clustering Blog Posts: Part 3

No interesting visuals this time.  I’ve been spending my Saturday going back and hand-labelling what will become a training set of blog posts.

I should have done this before now, but I’ve been putting it off, mainly because it’s so tedious.  I have my sample of 2,316 blogs grouped into 136 clusters, and I’m going through them, entering the appropriate labels in the spreadsheet.  Some background reading has made it clear that a set of well-labelled data, even a small set, is extremely beneficial to a clustering algorithm.  The algorithm can choose to add new documents to the already established set, start a new set, or modify the parameters of the labelled set slightly to include the new document.  Whatever it decides, it ‘learns’ from the examples given, and the programmer can test the training set on another sample to refine the set before launching it on the entire corpus.

There has been some research into the kind of things teachers blog about.  The literature suggests the following categories:

  1. sharing resources;
  2. building a sense of connection;
  3. soapboxing;
  4. giving and receiving support;
  5. expressing professional concern;
  6. positioning.

Some of these are clearly useful – there are plenty of resource-themed blogs, although I instinctively want to label resource-sharing blogs with a reference to the subject.  ‘Soapboxing’ and ‘expressing professional concern’ appear relatively straightforward.  ‘Positioning’ refers to the blogger ‘positioning themselves in the community i.e. as an expert practitioner or possessor of extensive subject knowledge’.  That may be more problematic, although I haven’t come across a post that looked as if it might fit into that category yet.  The ones that are left – ‘support’ and ‘connection’ – are very difficult, grounded as they are in the writers’ sense of feeling and emotion.  I’m not sure they’re appropriate as categories.

The other category that emerges from current research is ‘reflective practice’.  I’ve already come across several blog posts discussing SOLO taxonomy which could be categorised as just that – SOLO taxonomy – or ‘reflective practice’ or ‘positioning’ or ‘professional concern’.  My experience as a teacher (and here’s researcher bias again) wants to label (and already has labelled) these posts as SOLO, because it fits better with my research questions, in the same way that I’m going to label some posts ‘mindset’ or ‘knowledge organiser’.  What I may do – because it’s easy at this stage – is create two labels where there is some overlap with the existing framework suggested by the literature, which may be useful later.

It’s also worth mentioning that I’m basing my groups on the content of the blog posts.  An algorithm counts the number of times all the words in the corpus are used in each post (so many will be zero) and then adjusts the number according to the length of the document in which it appears.  Thus, each word becomes a ‘score’ and it’s these that are used to decide which documents are most similar to one another.  Sometimes it’s clear why the clustering algorithm has made the decision it has; other times it’s not, and this is why I’m having to go through the laborious process of hand-labelling.  Often, the blog post title makes the subject of the content clear, but not always.

Teachers and other edu-professionals, Gods-damn them, like to be creative and cryptic when it comes to titling their blogs, and they often draw on metaphors to explain the points they’re trying to make, all of which exposes algorithms that reduce language to numbers as the miserable, soulless and devoid-of-any-real-intelligence things they are.  How very dare they.