New Allotment

So, a few weeks ago I was finally allocated an allotment and I went to have a look.  As is normal with allotments (it seems), by the time the tenant finally gives them up, they’ve been left for at least a year, often two.  This one was no different.

From top to bottom…

I admit I thought about it for a couple of days, but in the end decided I probably wasn’t going to get anything any better any time soon, so accepted and went to the local council offices to sign the paperwork.

The sheds at the bottom were both falling down. The first task was making one good one out of the remains.  Oh, and clearing the brambles.

Let me tell you, some of those brambles were massive.  And have you ever tried buying a sickle or a scythe? Basically, it’s impossible to pick either up from a shop, even a large garden centre, but it turns out you can buy any number of sharp agricultural implements online.  Not that we needed to.  We found both in the smaller shed!

Under construction…

...a bit more...

Finished

Yesterday, I put a lock on the shed and took down all the tools and other useful things I’d brought from home.  This isn’t my first allotment.

Why do I need extra work? Well, aside from the fresh air, exercise, and the fact that I’m a vegetarian who loves all things that come from the soil (except sprouts and turnips), as some of you know I live on a boat and use a waterless toilet.  This generates stuff that goes into a ‘hot bin’, an insulated compost bin that heats the contents to over 90°C.  This is essential to kill off any bacteria that may be lurking, although given my vegetable diet there probably aren’t many that are harmful.  Certainly, emptying the solids into the bin isn’t an unpleasant chore.  Meat eaters produce horrid stuff.

Once the bin has been doing its thing at over 60°C for three months, it’s ready to be spread on a garden, and this is why I need an allotment.  I have a small patch of garden to use where my boat is moored, but it’s not enough.  Every three months I get a wheelbarrow-load of lovely compost, and it needs to go somewhere.  At least now I can take it to the allotment and will be able to use it there in the spring.

The next stage is to kill back all the grass and weeds so that I can dig over a strip for late autumn / spring planting.   While I try and be as environmentally friendly as I can, I’ll use chemicals for this.  There really isn’t anything more effective, and these days it’s relatively safe.  Plus, I’m not treating it all, just about a third.

The thing is, what looks like a huge undertaking is really only a series of small jobs.  And this is what I keep telling myself…


The Problem With Guessing K-Means

I’ve been grappling with the problem of how to find out what a group of professionals blog about. That seems simple enough on the face of it, but when there are over 9,000 blogs in a sample set of data, it’s not so easy. I can’t read every one, and even if I could, can you imagine how long it might take me to group them into topics?

Enter computer science in the form of algorithms.

I’ll gloss over the hours…. days…. weeks of researching how the various alternatives work, and why algorithm A is better than algorithm B for my type of data. Turns out k-means is the one I need.

Put very simply, each blog post (document) is made up of words. Each word is used a certain number of times, both in the document and in the entire collection of documents (corpus). An adjustment must be made for the overall length of the document (a word used ten times in a document of 100 words doesn’t have the same significance as the same word used ten times in a document of 1,000 words), but once this has been done it’s possible to give each document an overall ‘score’, which is converted to a position (or vector) within the corpus.
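For anyone curious, that ‘scoring’ is essentially TF-IDF weighting, and scikit-learn will do it for you. A minimal sketch with three made-up posts (not my real data):

from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy 'posts' standing in for the real corpus
docs = [
    "teaching maths with new technology",
    "marking and feedback in the maths classroom",
    "new technology for marking and feedback",
]

# TF-IDF scores each word by how often it appears in a document,
# discounted by how common it is across the whole corpus, and
# normalised for document length
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(docs)

print(vectors.shape)   # (3 documents, number of distinct words)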

It helps to think of the position as a ‘vector’ in a space with a very large number of dimensions – one per distinct word in the corpus – even if you can’t visualise it, which I can’t. But, having done this, it’s then possible for k-means to randomly pick a number of starting points, or centroids (the number, k, being picked in advance). It assigns every document to its nearest centroid, then moves each centroid to the middle of the documents assigned to it, and repeats the process over and again until the assignments settle down (or it reaches a maximum number of tries, or iterations), and then it tells you how many documents it’s put in each cluster.
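In scikit-learn the whole clustering step is only a few lines. A sketch, carrying on from the snippet above:

import numpy as np
from sklearn.cluster import KMeans

# 'vectors' is the TF-IDF matrix from the previous sketch; k is picked in advance
k = 2
km = KMeans(n_clusters=k, n_init=10, max_iter=300, random_state=42)
labels = km.fit_predict(vectors)

print(np.bincount(labels))   # how many documents landed in each cluster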

In theory, the algorithm should produce the same clusters (with the same number of posts in each) every time you run it, although that doesn’t always happen, as I found with my data. The other thing is, without grouping the set manually, there’s no way of telling what the number k should actually be, which rather defeats the point of the algorithm…. except when you’re dealing with large data sets, you’ve got no choice.

Of course, you CAN just keep clustering, adding 1 to your chosen number for k until you think you’ve got results you’re happy with. I started doing that, beginning with 10 and working up to 15, by which time I was totally bored and considering the possibility that my actual optimum number of clusters might be over 100…. Every time I ran the algorithm, the number of posts in each cluster changed, although two were stable. That seemed to be telling me that I was a long way from finding the optimum number.

Enter another load of algorithms that can help you estimate the optimum number for k. They aren’t a magic bullet – they can only help with an estimation, and each one goes about the process in a different way. I chose the one I did because a) I found the code in a very recent book written by a data scientist, and b) he gave an example of how to write the code AND IT WORKED.
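I won’t reproduce the book’s code here, but to give a flavour, silhouette analysis is one common way of scoring candidate values of k (my own sketch, not necessarily the book’s method):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# 'vectors' is the TF-IDF matrix for the corpus; higher silhouette
# scores mean better-separated clusters
for k in range(2, 16):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(vectors)
    print(k, round(silhouette_score(vectors, km.labels_), 3))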

Guess how many clusters it estimated I had? Go on, guess….. seven hundred and sixty. Of course I now have to go back and evaluate the results, but still. Seven hundred and sixty.

Good job I stopped at 15.

Stopwords

Having successfully divided my data set up into separate years yesterday, I thought I’d go back to basics and have a look at stopwords.

In language processing, it’s apparent that there are quite a few words that add absolutely no value to a text.  These are words like ‘a’, ‘all’, ‘with’ etc.  NLTK (Natural Language Tool Kit – a module that can be used to process text in various ways; you can have a play with it here) has a list of 127 words that could be considered the most basic ones.  Scikit-learn, which I’m using for some of the more complicated text processing algorithms, uses a list of 318 words taken from research carried out by the University of Glasgow.  A research paper published by them makes it clear that a fixed list is of limited use, and in fact a bespoke list should be produced if the corpus is drawn from a specific domain, as mine is – blogs written by teachers and other Edu-professionals.
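Both lists are easy to get at in code (the counts below are the ones I mentioned above; newer releases of either library may differ):

import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

nltk.download('stopwords')               # one-off download of NLTK's word lists
print(len(stopwords.words('english')))   # 127 - the basic NLTK list
print(len(ENGLISH_STOP_WORDS))           # 318 - scikit-learn's Glasgow-derived list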

Basically, the more frequently a word is used in a corpus, the less useful it is for telling documents apart.  For example, if you were presented with a database of blogs written by teachers, and you wanted to find the blogs written about ‘progress 8’, that’s what your search term would be, possibly with some extra filtering words like ‘secondary’ and ‘England’.  You would know not to bother with ‘student’, ‘children’ or ‘education’ because they’re words you’d expect to find in pretty much everything.  Those words are often referred to as ‘noise’.

The problem is that if the word ‘student’ was taken out of the corpus altogether, and treated as a stopword, that might have an adverse effect on the subsequent analysis of the data.  In other words, just because the word is used frequently doesn’t make it ‘noise’.   The bigger problem, then, is how to decide which of the most frequently used terms in a corpus can safely be removed.  And of course there’s the issue of whether the words on the default list should be used as well.
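One pragmatic halfway house – my own sketch, not the paper’s method – is to let the vectorizer drop words by how many documents they appear in, rather than relying on a fixed list alone:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words='english',   # the fixed 318-word list
    max_df=0.5,             # also ignore words appearing in over half of all posts
    min_df=5,               # and words appearing in fewer than 5 posts
)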

The paper I referred to above addresses this very problem, with some success.  I’m still trying to understand exactly how it works, but it seems to be based on the idea that a frequently-used word may in fact be an important search term.  And the reason I’ve spent so much time on this is because the old adage ‘rubbish in, rubbish out’ is indeed true, and before I go any further with the data I have, I at least need to understand the factors that may impact the results.

Thinking it through… Part 2

Having had a chance to think about, and articulate, some ideas as to how to deal with my data set, I started dividing it up into blog posts by year.  I like using Pandas for Python, although it can be difficult to find help with it that is pitched at the right level.  Anyway, I separated out all the years from 2004 to 2017 and saved them in individual .csv files.
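Roughly, the splitting looked like this (a sketch – the file and column names here are stand-ins for my real ones):

import pandas as pd

df = pd.read_csv('blog_posts.csv')   # hypothetical file name
df['year'] = pd.to_datetime(df['date'], errors='coerce').dt.year

for year in range(2004, 2018):
    df[df['year'] == year].to_csv(f'posts_{year}.csv', index=False)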

Then I had a go at clustering posts from 2017.  With ‘only’ 230 blog posts, this was relatively easy in terms of processing using the hardware available on my laptop.  I stuck with 10 clusters as I’d used this arbitrary number when I clustered the whole set.  I’ll talk in more detail about the results in the next post, but some issues remain to be addressed:

  • What to do with the entries that don’t include the year they were posted.
  • The stop words obviously need sorting out, as I’m getting rubbish like ‘facebooktwittergoogleprintmoreemaillinkedinreddit’ as one of the top terms in a cluster.  Two clusters, in fact.  (See the sketch after this list.)
  • As mentioned in the previous post, some of the titles include ‘posted on’ followed by the date of posting, and/or the category; and sometimes the blog post itself rather than the title.  I should probably try and remove the ‘posted on’ from the beginning, and I can probably get rid of the category as well.  After that, the first sentence would probably do as the title.
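For the second bullet, sorting out the stop words might look something like this – the share-widget junk added to scikit-learn’s default list:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Extend the default list with the share-button residue
custom_stops = list(ENGLISH_STOP_WORDS) + [
    'facebooktwittergoogleprintmoreemaillinkedinreddit',
]
vectorizer = TfidfVectorizer(stop_words=custom_stops)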

The big question, though, is: should I use the data from the entire set as training data for these subsequent sub-sets?  That would probably mean experimenting with different numbers of clusters until I got what looked like a coherent set of topics (which will obviously be down to my own professional judgement and inevitable researcher bias) and labelling them.  Or should I subject each subset to the principles of unsupervised learning and see what happens?

Then there’s presenting my data.  I would like something like this, explained here by the late, great Hans Rosling.

I’m imagining my timeline along the horizontal axis, probably starting around 2004 and finishing with the present.  This will probably be broken down into quarters.  The vertical axis will be the topics discussed, summed up in one or two words if possible.  How cool would that be?

Thinking It Through…

This blog is intended to be a record of the things I’ve been thinking about as I’ve looked over a sample of my data.  You might find it a bit boring…. that’s allowed.  You don’t have to read it.

Dealing with Data: Dates

I’m working on a sample of blog post data that I scraped for my PhD upgrade report (and a paper for the Web Science conference that wasn’t accepted, sadly).  The data contains ‘just’ 11,197 rows of text data: the contents of each blog post, the date it was posted, and the title of the post.  Well, that’s what I wanted when I wrote the code that went through a list of URLs and scraped the data.

A spreadsheet with 12,000 rows is just about manageable, by which I mean you wouldn’t want to print the data out, but you can scroll through and have a look at what you’ve got using Excel.  A sample like this is useful because you can observe the data you’ve gathered, and anticipate some of the problems with it.

The first thing I noticed is that rows 1486 to 2971 appear to be duplicates of the previous batch of rows.  Obviously this has happened because the source URLs have become duplicated.  Now, when I got my first list of URLs together, not all of them could be scraped.  There are several reasons for this:

  • wrong address;
  • URL no longer available;
  • password protected blog;
  • the code simply won’t work on the given URL.

My code stops running when it encounters a URL it can’t access.  Up to now, I’ve been manually cutting out the offending URL, and copying it in to a separate document that I can look at later.  This is the first place an error could be made, by me of course.

Task 1: amend code so that a URL that can’t be processed is automatically written to a separate file, and the code continues to iterate through the rest of the list.
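A sketch of what that amendment might look like – ‘urls’ and ‘scrape_blog’ are stand-ins for my real list and scraping code:

results, failed = [], []

for url in urls:
    try:
        results.append(scrape_blog(url))
    except Exception as err:
        failed.append(f'{url}\t{err}')   # note the URL and the reason, move on

with open('failed_urls.txt', 'w') as handle:
    handle.write('\n'.join(failed))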

When you’re dealing with around 1000 URLs, as I hope to do, the less intervention by me the better.

Then, there’s the data that’s gathered.  First, Excel is a very poor tool for viewing data scraped from the web.  I used Pandas (a Python module) to clean it up a bit (removing the whitespace at the beginning and end of the text) first before opening it up in Excel.  Then, it’s possible to see what’s in each cell, and align it top/left if necessary.  As I was only interested in reviewing the ‘date’ and ‘title’ columns at this stage, I saved the file with a slightly different name and deleted the ‘content’ column.  The reduction in file size makes it a bit easier to manage.
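The tidying itself is only a couple of lines of Pandas (column names assumed):

# Trim stray whitespace, then save a slimmer file for eyeballing in Excel
for col in ['date', 'title']:
    df[col] = df[col].str.strip()

df[['date', 'title']].to_csv('sample_dates_titles.csv', index=False)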

All looks good.  This is a typical entry:

65 September 11, 2012 Reading

65 is the index number given to the entry when the data was scraped, so it’s the 66th blog post from this URL (entries start at zero).

Then there’s this:

Problem 1

0 Posted on December 5, 2016 Carnival of Mathematics 140

The way the date is represented is crucial to my project.

Task 2: Remove ‘posted on ’ from the string.

Easy enough to do, you’d think, but actually not.  It is possible to strip the first n characters from the beginning of a string, but the code will iterate through every row and do the same, which is not what I want.  The other option is to check each row for the prefix, copy the ‘Posted on ’ (the space after ‘on’ is deliberate) bit to another column, and keep the rest.  In Pandas, that looks something like this:

mask = df['Date'].str.contains('Posted on ', na=False)

# Copy the prefix to its own column, then strip it from the date
df.loc[mask, 'Posted On'] = 'Posted on '
df.loc[mask, 'Date'] = df.loc[mask, 'Date'].str.replace('Posted on ', '', regex=False)

Problem 2

1 Posted on January 29, 2017January 29, 2017 Education

So much is such a waste of time

Posted on January 29, 2017January 29, 2017

There are a couple of problems here.  If I split the date string as I did previously, it’s not going to help me.  I’d be left with ‘January 29, 2017January 29, 2017’.  Now what?
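One answer (a sketch): since the two dates are simply run together, a regular expression can grab just the first one.

import re

s = 'Posted on January 29, 2017January 29, 2017'
first_date = re.match(r'Posted on ([A-Z][a-z]+ \d{1,2}, \d{4})', s)
print(first_date.group(1))   # January 29, 2017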

Secondly, the title cell looks to me as if it contains a category for the post, the title, and the date the post was made (again).  At this point, I’m thinking of finding this particular blog post via a google search, and looking at the HTML structure of the page to see why I’m getting these extra bits of unnecessary information.  It may not look like much, but:

  • when my spreadsheet has one hundred and eleven thousand rows, or more, that’s a lot of extra data;
  • I eventually want to use the titles when I present my data visually to an audience;
  • The title itself may be useful to add some substance to my analysis, so I don’t want it ‘dirtied’ with useless characters.

Problem 3

This row has a similar issue, although there is no category.  I’ve added the stars to protect the identity of the blogger.

0 Posted on November 9, 2015 What did I learn?

Posted on November 9, 2015 by C******* M*****

I’m not sure what to do about the date here, so let’s move on.  I can do this though:

Task 3: examine the HTML structure of this blog URL with a view to modifying the code used to scrape the data.
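The examination itself is straightforward (a sketch, assuming the scrape uses requests and BeautifulSoup; ‘entry-date’ is just a guess at the class name a blog theme might use):

import requests
from bs4 import BeautifulSoup

html = requests.get(url).text   # 'url' being the problem blog
soup = BeautifulSoup(html, 'html.parser')
print(soup.find(class_='entry-date'))   # see how this theme marks up the date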

Problem 4

Here’s something else interesting:

183 Posted on March 1, 2010September 9, 2010 Software and websites I couldn’t do without

Two dates.  I suspect that the first date is the one on which the blog entry was posted, and the second is the date it was amended/updated.  Again, how am I going to deal with this?  I think I’m going to have to go back to the HTML again and see if I can make another modification to my code.  I’m only on row 925… let’s move on.
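One thought before moving on: a regular expression could at least pull the two dates apart (a sketch).

import re

s = 'Posted on March 1, 2010September 9, 2010'
dates = re.findall(r'[A-Z][a-z]+ \d{1,2}, \d{4}', s)
print(dates)   # ['March 1, 2010', 'September 9, 2010'] - the first is (probably) the posting date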

Problem 5

Here’s my next oddity:

0 2016-09-12 by k***** National Drama CPD Training for secondary teachers

I can split the string here:

mask = df['Date'].str.contains(' by', na=False)

# Everything from ' by' onwards goes to a 'By' column; the date keeps the rest
parts = df.loc[mask, 'Date'].str.split(' by', n=1)
df.loc[mask, 'By'] = 'by' + parts.str[1]
df.loc[mask, 'Date'] = parts.str[0]

The space is in a different place now.  This matters, because while you and I see a space in Excel, there is in fact a character there, and it counts.  It quite literally ‘counts’ too, because it has a place.  It’s number 10 in the string (remember, counting begins at zero).  So, if I were to split the string at the space before ‘by’, it might actually split at a different place in a different cell (remember my code will iterate through every row of the column, so I need to be sure that it will only impact the cells I want it to).

Task 4: split string at ‘ by’.

The date that’s left in the cell will be in a different format from previous dates, i.e. it’s 2016-09-12 rather than September 12, 2016.  Will this make a difference?  I don’t know yet.
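One reassuring check (a sketch): Pandas parses both formats to the same Timestamp.

import pandas as pd

print(pd.to_datetime('September 12, 2016'))   # 2016-09-12 00:00:00
print(pd.to_datetime('2016-09-12'))           # 2016-09-12 00:00:00 - the same Timestamp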

Problem 6

0 2017-02-05 00:00:00 314. Maths is a foreign language

This date has the time as well.  Again, I don’t know what difference this will make.

Problem 7

1 21st December Phase diagrams

Now here’s a problem – no year.  A crucial piece of information is missing, and it’s missing for 696 rows (from row 4603).  Previously, I used Pandas to do a quick audit (locating rows containing 2017, 2016, 2015 etc.) and had established that 786 rows were unaccounted for.  It looks as if I’ve found some of them.

P.S. rows 9522 to 9552 are similar, so there’s another 30 or so.  Only around 60 unaccounted for.

30 Posted by  b********1 Hello world!

Problem 8

This cell indicates there are no spaces between ‘Feb’, ‘17’ and ‘2017’, although when I pasted the row into this Word document, each element was on a different line.

0 Feb172017 Learning & Teaching GCSE Mathematics

This will probably be ok because when I come to analyse my data, the important pieces are the month and the year, both of which are clear.
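If I do ever need to pull the pieces apart, a regular expression can manage it (a sketch):

import re

m = re.match(r'([A-Z][a-z]{2})(\d{1,2})(\d{4})', 'Feb172017')
print(m.groups())   # ('Feb', '17', '2017')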

Problem 9

And what about this?

36 8. März 201330. März 2016 Build your own low-cost slate! | Baue dein eigenes low-cost Slate!

I know from looking at this blog before that not all of it is in a foreign language (I’m assuming it’s a foreign language teacher), so do I leave this entire blog out of my master list?

Problem 10

I could split these strings, although the figure given for the number of comments varies.

0 04 Apr 2016 Leave a comment The World is Upside Down
6 22 Apr 2015 3 Comments Revision – what works best?

import re

# Move 'Leave a comment' or 'N Comments' out of the date string into its own column
pattern = r'(Leave a comment|\d+ Comments?)'
df['Comments'] = df['Date'].str.extract(pattern, expand=False)
df['Date'] = df['Date'].str.replace(pattern, '', regex=True).str.strip()

The \d+ in the pattern takes care of matching any numerical value for the number of comments.

Problem 11

Then there’s this – no title at all.

333 March 3, 2012  

I really need something here, but what?  I could amend my code so that, if it fails to find a blog post title, the phrase ‘No Title’ is written into the row instead.  Alternatives include:

  • use the first sentence from the blog post itself (which can be extracted from the ‘Contents’ cell);
  • use the three most common terms from the post (obtained from the TF-IDF analysis I’m doing on the whole data set);
  • deploy some other text analysis technique to summarise the post in one sentence, which, when you think about it, is exactly what we try and do when we come up with a title for our own blogs.

This affects quite a few rows, so it needs addressing.
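The placeholder and the first-sentence options are each a couple of lines in Pandas (a sketch; column names assumed):

# Rows where the title is missing or blank
no_title = df['Title'].isna() | (df['Title'].str.strip() == '')

# Option 1: a placeholder
df.loc[no_title, 'Title'] = 'No Title'

# Option 2: the first sentence of the post itself
# df.loc[no_title, 'Title'] = df.loc[no_title, 'Contents'].str.split('.').str[0]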

Problem 12

0

Standard

Posted by

mrstuartcampbell

Posted on

March 14, 2015

Posted under

Uncategorized

Comments

Leave a comment

Peer Observation – Priceless CPD, for free!

 

I’ve copied and pasted this ‘as is’, although in the spreadsheet the data in the date cell appears on one line.  This highlights one of the issues when viewing data – it will appear differently when looked at through different windows, and yet each window has its advantages.  Excel is good for scrolling through data, and for basic numerical functions.  For everything else, I use Pandas for Python, usually via the Jupyter notebook that’s part of the Anaconda suite.

Problem 13

1

13

Tuesday

Dec 2016

Hi Guys. This page will contain all the BSGP (bronze, silver, gold, platinum) skill sheets for your perusal.

All resources are free to a good home and are intended to be used for what they are… banks of questions rising in difficulty to help complement your teaching, not replace it!

As I create new resources I’ll add them here so check back often. At some point i’ll probably give the project a formal name and organise it a little better than I am at the minute.

All answer sheets can be found in a password protected blog post (called ‘answer sheets’ of all things!).

Hit me up on twitter  ( @mrlyonsmaths ) for the password

Algebra

mlm-expanding-multiple-brackets

mlm-expanding-single-brackets

mlm-factorise-single-brackets

mlm-factorising-double-brackets

mlm-linear-silultaneous

mlm-quad-simult-equations

mlm-quadratic-solving-equation

mlm-simplifying-algebra

mlm-solve-quadratic-factorising

mlm-solving-linear-equations

Number

mlm-multiplication

Here’s a row where the contents are appearing where the title should be.  I’m willing to bet that this is because of the HTML structure of the page, so I need to revisit my master code.  It’s not the only URL whose posts behave this way, either.

Task 5: revisit master code for extracting ‘Title’ from this blog URL.

And all these problems are, of course, the ones I’ve uncovered in my sample.  The ones I know about.  My final data set will be huge, and I’ll have little chance of spotting anomalies unless I accidentally stumble upon them.

Welcome to my world of big data.


Back to the Classroom.

A few weeks ago, I was asked if I’d be interested in running a workshop for year 12 students as part of the ESRC* Festival of Social Science.  This was organised as part of the University of Southampton’s Learn with US (Outreach) programme, which I’d quite like to do more work with in the future.  The theme of the workshops was how technology, and mobile phones and devices in particular, is being used in social science research.  As part of my research, I’m looking at networks and network (or graph) theory, so I thought I could have a go at teaching that.  I find networks fascinating, AND I knew I had some excellent resources that could be adapted for use with students, so why not?

I’m also really keen on promoting the idea that a) computer science is for women too, b) web science is an excellent way of combining the social sciences with computer science, and c) age is no barrier.  A teacher who was accompanying a group of students also told me that, as well as being a role model for girls, I was also showing students why being able to write code was so important as it could have a real practical benefit.

I really miss being in the classroom.  The reasons why will be the subject of another blog post, but suffice it to say that, for me, there’s something exhilarating about putting things together (in this case, a PowerPoint and some handouts to guide students through some actual hands-on work) so that I can deliver knowledge in a way that I hope is interesting.  I like being in charge, in my own space, directing my own personal show.  It’s also a really good chance for me to consolidate my own learning, which is one of the benefits of teaching.

The students were, of course, excellent.  They were made up of groups from several schools – one or two local, others from further afield.  It was really interesting to observe how different the groups were from one another, which I assume reflects both the socio-economic background they were drawn from (almost certainly directly related to the catchment area of the school) and the ethos of the school itself.  The interactions between the students, between them and their teachers, and with me were markedly different from session to session.  Having only taught in one school before (and not really being detached enough to just observe), it was a fascinating experience for me.  It was, though, overwhelmingly positive and I thoroughly enjoyed it!

I’m sure they left with a positive view of the University of Southampton, and I hope they were inspired by my workshop, and the others they attended.

By the way, the resources I used were borrowed and adapted from the ‘Power of Social Networks’ MOOC** that has just finished on Futurelearn.  It’ll be repeated though, if you fancy a dabble into the world of social networks.

*Economic & Social Research Council

**Massive Open Online Course

Learning

I’m a slow learner.  By that I mean it can take me a while to put all the pieces together so I can see the whole picture.  If I was a detective, I’d be the plodding kind that takes ages to interrogate every witness, look at every piece of evidence, and use one of those huge pin boards to visually represent the case.  I wouldn’t have a Eureka! moment part way through when I could suddenly see whodunnit and spend just a few seconds demonstrating how everything that remained fitted together.

I’ve just spent the best part of three months teaching myself to write code so that I can copy blog posts from over 800 bloggers, together with the date the blog was posted and the title.  Actually, in the end I’ve written code that will do that for most of the blogs in my list, for reasons I’ll explain in the next post.

Writing computer code to do a variety of somethings, and do them in the right order, is hard.  I’ve been using Python, which is pretty straightforward and relatively easy to read if you’ve never seen code before.  While lots of things still happen ‘under the bonnet’ so to speak, the commands that make those things happen are pretty transparent.  It does exactly what you tell it to do, and executes your commands in a precise and logical order.  This is how it works:

(1 + 2) + (3 x 4) = ?

3 + 12 = 15

It will do the calculations in the brackets first before moving on to the second stage, where it adds the totals from the bracketed calculations together.
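The same sum typed straight into Python:

print((1 + 2) + (3 * 4))   # 15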

A similar instruction in Python would be:

if len(blogPostTitle) > len(blogPostDate):
    blogPostTitle.pop()

So, if the length of the list ‘blogPostTitle’ is greater than the length of the list ‘blogPostDate’, remove the last item from the blogPostTitle list.  The second line is indented so that Python knows it belongs to the ‘if’ statement and should only be executed when the condition is true.  My code goes through a sequence of instructions, not all of which have to be carried out if certain conditions aren’t met, and it must execute this code several times before it can move on to repeat the process – in my case, on every item in a list – before it ends.

Typing it out like that makes it sound extremely simple, but the form of words, and the sequential structure of those words, have kept me occupied for weeks.  I’ve no doubt someone with a better grasp of maths than I have would grasp the logical structure behind it, and learn the language, much faster than I.  In fact, a long piece of code that does a specific thing can be labelled as a ‘function’ and given a name, and called on to do its work using just the name, saving you from copying and pasting all the code again (and having to make numerous corrections if it needs amending).
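So the little snippet above might become something like this (a toy example):

def trim_titles(blogPostTitle, blogPostDate):
    # Keep removing surplus titles until the two lists line up
    while len(blogPostTitle) > len(blogPostDate):
        blogPostTitle.pop()
    return blogPostTitle

print(trim_titles(['a', 'b', 'c'], ['2016-09-12', '2016-10-01']))   # ['a', 'b']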

During the course of this project, I’ve written a bit of code.  Searched on Google for how to write the next bit of code.  Read bits from books on programming.  Searched again.  Written a bit.  Got one little thing working (like putting all the URLs in a list Python can read).  Written the next bit of code.  Or rather, tried, and repeated the process above several times over.  And believe me, reading coding solutions online, when you’re a coding novice, is less than helpful.  Just knowing what key words to put into your search is a major leap forward.  Finally, I ended up drawing diagrams of what I needed my code to do, printing it out and cutting it up with scissors so I could visualise the sequence of events and the result if I changed anything around, and then I went back to that last piece of working code I put together and I could see the final thing I needed to do to make it work.

I strongly suspected that getting to grips with code would improve my maths skills, and I was right.  It really made me think about the sequence of events as much as the language used to describe them, and of course if you want to be any kind of an engineer, you have to understand the rules of logic.  I feel as if I’ve really actually learned something properly, and that was one of my main goals in doing this PhD.  I’ve levelled up.


My personal blood, sweat and time, but no tears.