
The Problem With Guessing K-Means

I’ve been grappling with the problem of how to find out what a group of professionals blog about. That seems simple enough on the face of it, but when there are over 9,000 blogs in a sample set of data, it’s not so easy. I can’t read every one, and even if I could, can you imagine how long it might take me to group them into topics?

Enter computer science in the form of algorithms.

I’ll gloss over the hours…. days…. weeks of researching how the various alternatives work, and why algorithm A is better than algorithm B for my type of data. Turns out k-means is the one I need.

Put very simply, each blog post (document) is made up of words. Each word is used some number of times, both in the document and in the entire collection of documents (corpus). An adjustment must be made for the overall length of the document (a word used ten times in a document of 100 words doesn’t have the same significance as the same word used ten times in a document of 1,000 words), but once this has been done it’s possible to give each word in each document a ‘score’, and together those scores place the document at a position (or vector) within the corpus.
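For anyone curious, this weighting is the classic TF-IDF scheme, and scikit-learn will do it in a few lines. A minimal sketch (the variable names and toy documents are my own):

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["first blog post ...", "second blog post ..."]  # in reality, 9,000+ posts
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)  # one length-adjusted, weighted vector per document
print(X.shape)  # (number of documents, number of distinct words)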

It helps to think of the position as a ‘vector’ in a space with a huge number of dimensions (one for each distinct word in the corpus), even if you can’t visualise it, which I can’t. But, having done this, it’s then possible for k-means to randomly pick a number of starting points (the number, k, being picked in advance). It gathers the documents closest to each starting point, recalculates the centre of each group, and reassigns documents to their nearest centre, over and again, until the groups stop changing (or it’s told to stop after a maximum number of tries, or iterations), and then it tells you how many documents it’s put in each cluster.
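In scikit-learn, the clustering step itself is short. A sketch, assuming the matrix X from the previous snippet and a guessed k of 10:

import numpy as np
from sklearn.cluster import KMeans

k = 10  # the number of clusters, picked in advance
km = KMeans(n_clusters=k, max_iter=300)  # max_iter caps the number of tries
km.fit(X)
print(np.bincount(km.labels_))  # how many documents ended up in each cluster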

In theory, the algorithm should produce much the same clusters every time you run it, although that doesn’t always happen, as I found with my data. The other thing is, without grouping the set manually, there’s no way of telling what the value of k should actually be, which rather defeats the point of the algorithm…. except when you’re dealing with large data sets, you’ve got no choice.

Of course, you CAN just keep clustering, adding 1 to your chosen number for k until you think you’ve got results you’re happy with. I started doing that, beginning with 10 and working up to 15, by which time I was totally bored and considering the possibility that my actual optimum number of clusters might be over 100…. Every time I ran the algorithm, the number of posts in each cluster changed, although two clusters were stable. That seemed to be telling me that I was a long way from finding the optimum number.

Enter another load of algorithms that can help you estimate the optimum number for k. They aren’t a magic bullet – they can only help with an estimation, and each one goes about the process in a different way. I chose the one I did because a) I found the code in a very recent book written by a data scientist, and b) he gave an example of how to write the code AND IT WORKED.
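I won’t reproduce the book’s code here, but one common approach, for illustration, is the silhouette score: cluster at a range of k values and keep the k that scores best. A rough sketch, again assuming the TF-IDF matrix X from earlier:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 21):  # in my case the search would need to go far higher
    labels = KMeans(n_clusters=k).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # the k with the best-separated clusters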

Guess how many clusters it estimated I had? Go on, guess….. seven hundred and sixty. Of course I now have to go back and evaluate the results, but still. Seven hundred and sixty.

Good job I stopped at 15.

Stopwords

Having successfully divided my data set up into separate years yesterday, I thought I’d go back to basics and have a look at stopwords.

In language processing, it’s apparent that there are quite a few words that add absolutely no value to a text. These are words like ‘a’, ‘all’, ‘with’ etc. NLTK (Natural Language Tool Kit – a module that can be used to process text in various ways; you can have a play with it here) has a list of 127 words that could be considered the most basic ones. Scikit-learn (which I’m using for some of the more complicated text processing algorithms) uses a list of 318 words taken from research carried out by the University of Glasgow. A research paper published by them makes it clear that a fixed list is of limited use, and that in fact a bespoke list should be produced if the corpus is drawn from a specific domain, as mine is: blogs written by teachers and other Edu-professionals.
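Both lists are easy to inspect, by the way. A quick sketch (NLTK needs its stopwords corpus downloading first):

import nltk
nltk.download('stopwords')  # one-off download
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print(len(stopwords.words('english')))  # 127 in older NLTK releases
print(len(ENGLISH_STOP_WORDS))          # 318, from the Glasgow research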

Basically, the more frequently a word is used in a corpus, the less useful it is. For example, if you were presented with a database of blogs written by teachers, and you wanted to find the blogs written about ‘Progress 8’, that’s what your search term would be, possibly with some extra filtering words like ‘secondary’ and ‘England’. You would know not to bother with ‘student’, ‘children’ or ‘education’, because they’re words you’d expect to find in pretty much everything. Those words are often referred to as ‘noise’.

The problem is that if the word ‘student’ was taken out of the corpus altogether, and treated as a stopword, that might have an adverse effect on the subsequent analysis of the data.  In other words, just because the word is used frequently doesn’t make it ‘noise’.   The bigger problem, then, is how to decide which of the most frequently used terms in a corpus can safely be removed.  And of course there’s the issue of whether the words on the default list should be used as well.

The paper I referred to above addresses this very problem, with some success.  I’m still trying to understand exactly how it works, but it seems to be based on the idea that a frequently-used word may in fact be an important search term.  And the reason I’ve spent so much time on this is because the old adage ‘rubbish in, rubbish out’ is indeed true, and before I go any further with the data I have, I at least need to understand the factors that may impact the results.

Thinking it through… Part 2

Having had a chance to think about, and articulate, some ideas as to how to deal with my data set, I started dividing it up into blog posts by year. I like using Pandas for Python, although it can be difficult to find help with it that is pitched at the right level. Anyway, I separated out all the years from 2004 to 2017 and saved them in individual .csv files.
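The splitting itself was straightforward in Pandas. Something like this sketch (the column and file names are my own placeholders):

import pandas as pd

df = pd.read_csv('all_posts.csv', parse_dates=['date'])
for year, group in df.groupby(df['date'].dt.year):
    group.to_csv('posts_{}.csv'.format(year), index=False)  # e.g. posts_2017.csv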

Then I had a go at clustering posts from 2017.  With ‘only’ 230 blog posts, this was relatively easy in terms of processing using the hardware available on my laptop.  I stuck with 10 clusters, as I’d used this arbitrary number when I clustered the whole set.  I’ll talk in more detail about the results in the next post, but some issues remain to be addressed:

  • What to do with the entries that don’t include the year they were posted.
  • The stop words obviously need sorting out, as I’m getting rubbish like ‘facebooktwittergoogleprintmoreemaillinkedinreddit’ as one of the top terms in a cluster.  Two clusters, in fact.
  • As mentioned in the previous post, some of the titles include ‘posted on’ followed by the date of posting, and/or the category; and sometimes the blog post itself appears rather than the title.  I should probably try and remove the ‘posted on’ from the beginning, and I can probably get rid of the category as well.  Following that, the first sentence would probably do as the title.

The big question, though, is: should I use the data from the entire set as training data for these subsequent sub-sets?  That would probably mean experimenting with different numbers of clusters until I got what looked like a coherent set of topics (which will obviously be down to my own professional judgement and inevitable researcher bias) and then labelling them.  Or should I subject each subset to the principles of unsupervised learning and see what happens?

Then there’s presenting my data.  I would like something like this, explained here by the late, great Hans Rosling.

I’m imagining my timeline along the horizontal axis, probably starting around 2004 and finishing with the present.  This will probably be broken down into quarters.  The vertical axis will be the topics discussed, summed up in one or two words if possible.  How cool would that be?

Thinking It Through…

This blog is intended to be a record of the things I’ve been thinking about as I’ve looked over a sample of my data.  You might find it a bit boring…. that’s allowed.  You don’t have to read it.

Dealing with Data: Dates

I’m working on a sample of blog post data that I scraped for my PhD upgrade report (and a paper for the Web Science conference that wasn’t accepted, sadly).  The data contains ‘just’ 11,197 rows of text data: The contents of each blog post, the date it was posted, and the title of the post.  Well, that’s what I wanted when I wrote the code that went through a list of URLs and scraped the data.

A spreadsheet with around 12,000 rows is just about manageable, by which I mean you wouldn’t want to print the data out, but you can scroll through and have a look at what you’ve got using Excel.  A sample like this is useful because you can observe the data you’ve gathered, and anticipate some of the problems with it.

The first thing I noticed is that rows 1486 to 2971 appear to be duplicates of the previous batch of rows.  Obviously this has happened because the source URLs have become duplicated.  Now, when I got my first list of URLs together, not all of them could be scraped.  There are several reasons for this:

  • wrong address;
  • URL no longer available;
  • password protected blog;
  • the code simply won’t work on the given URL.

My code stops running when it encounters a URL it can’t access.  Up to now, I’ve been manually cutting out the offending URL, and copying it in to a separate document that I can look at later.  This is the first place an error could be made, by me of course.

Task 1: amend code so that a URL that can’t be processed is automatically written to a separate file, and the code continues to iterate through the rest of the list.

When you’re dealing with around 1000 URLs, as I hope to do, the less intervention by me the better.

Then, there’s the data that’s gathered.  First, Excel is a very poor tool for viewing data scraped from the web.  I used Pandas (a Python module) to clean it up a bit (removing the whitespace at the beginning and end of the text) first before opening it up in Excel.  Then, it’s possible to see what’s in each cell, and align it top/left if necessary.  As I was only interested in reviewing the ‘date’ and ‘title’ columns at this stage, I saved the file with a slightly different name and deleted the ‘content’ column.  The reduction in file size makes it a bit easier to manage.

All looks good.  This is a typical entry:

65 September 11, 2012 Reading

65 is the index number given to the entry when the data was scraped, so it’s the 66th blog post from this URL (entries start at zero).

Then there’s this:

Problem 1

0 Posted on December 5, 2016 Carnival of Mathematics 140

The way the date is represented is crucial to my project.

Task 2: Remove ‘posted on ’ from the string.

Easy enough to do, you’d think, but actually not.  It is possible to strip the first n characters from the beginning of a string, but the code would iterate through every row and do the same, which is not what I want.  The other option is to split the string and copy the ‘posted on ’ (the space after ‘on’ is deliberate) bit to another column.  So, the pseudo-code would look like this:

if row in ‘Date column’ contains the string ‘posted on ’;

split string after ‘posted on ’;

write to row in ‘Posted On’ column.
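A rough Pandas translation of that pseudo-code (the column names are mine):

mask = df['Date'].str.startswith('Posted on ', na=False)
df.loc[mask, 'Posted On'] = 'Posted on'
df.loc[mask, 'Date'] = df.loc[mask, 'Date'].str.slice(len('Posted on '))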

Problem 2

1 Posted on January 29, 2017January 29, 2017 Education

So much is such a waste of time

Posted on January 29, 2017January 29, 2017

There are a couple of problems here.  If I split the date string as I did previously, it’s not going to help me.  I’d be left with ‘January 29, 2017January 29, 2017’.  Now what?
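One way out of the first problem might be a regular expression that captures only the first date. A sketch, assuming the dates always follow the ‘Month day, year’ pattern shown:

import re

value = 'January 29, 2017January 29, 2017'
m = re.match(r'([A-Z][a-z]+ \d{1,2}, \d{4})', value)
first_date = m.group(1) if m else value  # 'January 29, 2017'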

Secondly, the title cell looks to me as if it contains a category for the post, the title, and the date the post was made (again).  At this point, I’m thinking of finding this particular blog post via a Google search, and looking at the HTML structure of the page to see why I’m getting these extra bits of unnecessary information.  It may not look like much, but:

  • when my spreadsheet has one hundred and eleven thousand rows, or more, that’s a lot of extra data;
  • I eventually want to use the titles when I present my data visually to an audience;
  • The title itself may be useful to add some substance to my analysis, so I don’t want it ‘dirtied’ with useless characters.

Problem 3

This row has a similar issue, although there is no category.  I’ve added the stars to protect the identity of the blogger.

0 Posted on November 9, 2015 What did I learn?

Posted on November 9, 2015 by C******* M*****

I’m not sure what to do about the date here, so let’s move on.  I can do this though:

Task 3: examine the HTML structure of this blog URL with a view to modifying the code used to scrape the data.

Problem 4

Here’s something else interesting:

183 Posted on March 1, 2010September 9, 2010 Software and websites I couldn’t do without

Two dates.  I suspect that the first date is the one on which the blog entry was posted, and the second is the date it was amended/updated.  Again, how am I going to deal with this?  I think I’m going to have to go back to the HTML again and see if I can make another modification to my code.  I’m only on row 925… let’s move on.

Problem 5

Here’s my next oddity:

0 2016-09-12 by k***** National Drama CPD Training for secondary teachers

I can split the string here:

if row in ‘Date column’ contains the string ‘ by’;

split string before ‘ by’;

write to row in ‘By’ column.
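In Pandas, that might look like this (column names my own again):

mask = df['Date'].str.contains(' by ', na=False)
parts = df.loc[mask, 'Date'].str.split(' by ', n=1)
df.loc[mask, 'Date'] = parts.str[0]       # '2016-09-12'
df.loc[mask, 'By'] = 'by ' + parts.str[1]  # 'by k*****...'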

The space is in a different place now.  This matters because, while you and I see a space in Excel, there is in fact a character there, and it counts.  It quite literally ‘counts’, too, because it has a place: it’s number 10 in the string (remember, counting begins at zero).  So, if I were to split the string at the space before ‘by’, it might actually split at a different place in a different cell (remember, my code will iterate through every row of the column, so I need to be sure that it will only affect the cells I want it to).

Task 4: split string at ‘ by’.

The date that’s left in the cell will be in a different format from previous dates i.e. it’s 2016-9-12 rather than September 12, 2016.  Will this make a difference?  I don’t know yet.

Problem 6

0 2017-02-05 00:00:00 314. Maths is a foreign language

This date has the time as well.  Again, I don’t know what difference this will make.

Problem 7

1 21st December Phase diagrams

Now here’s a problem – no year.  A crucial piece of information is missing, and it’s missing for 696 rows (from row 4603).  Previously, I used Pandas to do a quick audit (locating rows containing 2017, 2016, 2015 etc. and had established that 786 rows were unaccounted for.  It looks as if I’ve found some of them.

p.s. rows 9522 to 9552 are similar, so there’s another 30.  Only about 60 unaccounted for.

30 Posted by  b********1 Hello world!
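(For the record, the quick audit mentioned above was done with something along these lines – a sketch, with the column name assumed:)

for year in range(2004, 2018):
    count = df['Date'].astype(str).str.contains(str(year)).sum()
    print(year, count)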

Problem 8

This cell indicates there are no spaces between ‘Feb’, ‘17’ and ‘2017’, although when I pasted the row into this Word document, each element was on a different line.

0 Feb172017 Learning & Teaching GCSE Mathematics

This will probably be ok because when I come to analyse my data, the important pieces are the month and the year, both of which are clear.

Problem 9

And what about this?

36 8. März 201330. März 2016 Build your own low-cost slate! | Baue dein eigenes low-cost Slate!

I know from looking at this blog before that not all of it is in a foreign language (the dates here are German – 8 March 2013 and 30 March 2016 – and I’m assuming it’s a foreign language teacher), so do I leave this entire blog out of my master list?

Problem 10

I could split these strings, although the figure given for the number of comments varies.

0 04 Apr 2016 Leave a comment The World is Upside Down
6 22 Apr 2015 3 Comments Revision – what works best?

if row in ‘Date column’ contains the string ‘ leave a comment’;

split string before ‘ leave’;

write to row in ‘Leave’ column.

if row in ‘Date column’ contains the string ‘ (number) comments’;

split string before ‘ (number)’;

write to row in ‘Leave’ column.

It’s possible to write code that will take any numerical value for ‘(number)’.

Problem 11

Then there’s this – no title at all.

333 March 3, 2012  

I really need something here, but what?  I could amend my code so that, if it fails to find a blog post title, the phrase ‘No Title’ is written into the row instead.  Alternatives include:

  • use the first sentence from the blog post itself (which can be extracted from the ‘Contents’ cell);
  • use the three most common terms from the post (obtained from the TF-IDF analysis I’m doing on the whole data set);
  • deploy some other text analysis technique to summarise the post in one sentence, which, when you think about it, is exactly what we try and do when we come up with a title for our own blogs.

This affects quite a few rows, so it needs addressing.
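The first of those alternatives might look something like this sketch (column names assumed):

def first_sentence(text):
    # crude: everything up to and including the first full stop
    return text.split('.')[0].strip() + '.'

missing = df['Title'].isna() | (df['Title'].str.strip() == '')
df.loc[missing, 'Title'] = df.loc[missing, 'Contents'].apply(first_sentence)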

Problem 12

0

Standard

Posted by

mrstuartcampbell

Posted on

March 14, 2015

Posted under

Uncategorized

Comments

Leave a comment

Peer Observation – Priceless CPD, for free!

 

I’ve copied and pasted this ‘as is’, although in the spreadsheet the data in the date cell appears on one line.  This highlights one of the issues when viewing data – it will appear differently when looked at through different windows, and yet each window has its advantages.  Excel is good for scrolling through data, and for basic numerical functions.  For everything else, I use Pandas for Python, usually via the Jupyter notebook that’s part of the Anaconda suite.

Problem 13

1

13

Tuesday

Dec 2016

Hi Guys. This page will contain all the BSGP (bronze, silver, gold, platinum) skill sheets for your perusal.

All resources are free to a good home and are intended to be used for what they are… banks of questions rising in difficulty to help complement your teaching, not replace it!

As I create new resources I’ll add them here so check back often. At some point i’ll probably give the project a formal name and organise it a little better than I am at the minute.

All answer sheets can be found in a password protected blog post (called ‘answer sheets’ of all things!).

Hit me up on twitter  ( @mrlyonsmaths ) for the password

Algebra

mlm-expanding-multiple-brackets

mlm-expanding-single-brackets

mlm-factorise-single-brackets

mlm-factorising-double-brackets

mlm-linear-silultaneous

mlm-quad-simult-equations

mlm-quadratic-solving-equation

mlm-simplifying-algebra

mlm-solve-quadratic-factorising

mlm-solving-linear-equations

Number

mlm-multiplication

Here’s a row where the contents are appearing where the title should be.  I’m willing to bet that this is because of the HTML structure of the page, so I need to revisit my master code.  It’s not the only set of blog posts from a URL either.

Task 5: revisit master code for extracting ‘Title’ from this blog URL.

And all these problems are, of course, the ones I’ve uncovered in my sample.  The ones I know about.  My final data set will be huge, and I’ll have little chance of spotting anomalies unless I accidentally stumble upon them.

Welcome to my world of big data.


Back to the Classroom.

A few weeks ago, I was asked if I’d be interested in running a workshop for year 12 students as part of the ESRC* Festival of Social Science.  This was organised as part of the University of Southampton’s Learn with US (Outreach) programme, which I’d quite like to do more work with in the future.  The theme of the workshops was how technology, and mobile phones and devices in particular, is being used in social science research.  As part of my research, I’m looking at networks and network (or graph) theory, so I thought I could have a go at teaching that.  I find networks fascinating, AND I knew I had some excellent resources that could be adapted for use with students, so why not?

I’m also really keen on promoting the idea that a) computer science is for women too, b) web science is an excellent way of combining the social sciences with computer science, and c) age is no barrier.  A teacher who was accompanying a group of students also told me that, as well as being a role model for girls, I was also showing students why being able to write code was so important as it could have a real practical benefit.

I really miss being in the classroom.  Why will be the subject of another blog post, but suffice to say that, for me, there’s something exhilarating about putting things together (in this case, a PowerPoint and some handouts to guide the students through some actual hands-on work) so that I can deliver knowledge in a way that I hope is interesting.  I like being in charge, in my own space, directing my own personal show.  It’s also a really good chance for me to consolidate my own learning, which is one of the benefits of teaching.

The students were, of course, excellent.  They were made up of groups from several schools – one or two local, others from further afield.  It was really interesting to observe how different the groups were from one another, which I assume reflects both the socio-economic backgrounds they were drawn from (almost certainly directly related to each school’s catchment area) and the ethos of the school itself.  The interactions between the students, between them and their teachers, and with me were markedly different from session to session.  Having only taught in one school before (and not really being detached enough to just observe), it was a fascinating experience for me.  It was, though, overwhelmingly positive and I thoroughly enjoyed it!

I’m sure they left with a positive view of the University of Southampton, and I hope they were inspired by my workshop, and the others they attended.

By the way, the resources I used were borrowed and adapted from the ‘Power of Social Networks’ MOOC** that has just finished on Futurelearn.  It’ll be repeated though, if you fancy a dabble into the world of social networks.

*Economic & Social Research Council

**Massive Open Online Course

Learning

I’m a slow learner.  By that I mean it can take me a while to put all the pieces together so I can see the whole picture.  If I was a detective, I’d be the plodding kind that takes ages to interrogate every witness, look at every piece of evidence, and use one of those huge pin boards to visually represent the case.  I wouldn’t have a Eureka! moment part way through when I could suddenly see whodunnit and spend just a few seconds demonstrating how everything that remained fitted together.

I’ve just spent the best part of three months teaching myself to write code so that I can copy blog posts from over 800 bloggers, together with the date the blog was posted and the title.  Actually, in the end I’ve written code that will do that for most of the blogs in my list, for reasons I’ll explain in the next post.

Writing computer code to do a variety of somethings, and do them in the right order, is hard.  I’ve been using Python, which is pretty straightforward and relatively easy to read if you’ve never seen code before.  While lots of things still happen ‘under the bonnet’ so to speak, the commands that make those things happen are pretty transparent.  It does exactly what you tell it to do, and executes your commands in a precise and logical order.  This is how it works:

(1 + 2) + (3 x 4) = ?

3 + 12 = 15

It will do the calculations in the brackets first before moving on to the second stage, where it adds the totals from the bracketed calculations together.

A similar instruction in Python would be:

if len(blogPostTitle) > len(blogPostDate):
    blogPostTitle.pop()

So, if the length of the list ‘blogPostTitle’ is greater than the length of the list ‘blogPostDate’, remove the last item in blogPostTitle.  The second line is indented so that Python knows it belongs to the ‘if’ statement and should only be executed when the condition is true.  My code goes through a sequence of instructions, not all of which have to be carried out if certain conditions aren’t met, and it must execute this code several times before it can move on to repeat the process – in my case, on every item in a list – before it ends.

Typing it out like that makes it sound extremely simple, but the form of words, and the sequential structure of those words, kept me occupied for weeks.  I’ve no doubt someone with a better grasp of maths than I have would pick up the logical structure behind it, and learn the language, much faster than I did.  In fact, a long piece of code that does a specific thing can be labelled as a ‘function’ and given a name, and called on to do its work using just that name, saving you from copying and pasting all the code again (and having to make numerous corrections if it needs amending).
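For instance, a toy example of my own (not from my actual scraper):

def clean_title(title):
    # strip the 'Posted on ' prefix and any stray whitespace
    return title.replace('Posted on ', '').strip()

print(clean_title('Posted on December 5, 2016 Carnival of Mathematics 140'))
# prints 'December 5, 2016 Carnival of Mathematics 140'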

During the course of this project, I’ve written a bit of code.  Searched on Google for how to write the next bit of code.  Read bits from books on programming.  Searched again.  Written a bit.  Got one little thing working (like putting all the URLs in a list Python can read).  Written the next bit of code.  Or rather, tried, and repeated the process above several times over.  And believe me, reading coding solutions online, when you’re a coding novice, is less than helpful.  Just knowing what key words to put into your search is a major leap forward.  Finally, I ended up drawing diagrams of what I needed my code to do, printing it out and cutting it up with scissors so I could visualise the sequence of events and the result if I changed anything around, and then I went back to that last piece of working code I put together and I could see the final thing I needed to do to make it work.

I strongly suspected that getting to grips with code would improve my maths skills, and I was right.  It really made me think about the sequence of events as much as the language used to describe them, and of course if you want to be any kind of an engineer, you have to understand the rules of logic.  I feel as if I’ve really actually learned something properly, and that was one of my main goals in doing this PhD.  I’ve levelled up.

[Image: my code.  My personal blood, sweat and time, but no tears.]

Rome

This time last week, I was probably on an aircraft waiting to fly back to Gatwick, following five days in Rome.  It seems like a long time ago now.

[Photo: He’s got wood.]

You really can’t take more than three steps in Rome without stumbling over some ancient ruins.  They’re everywhere.  Often, they’re just some pillars, supported by metal bands and standing among weeds and rubble, usually up against more contemporary buildings.  I don’t doubt that just a few feet beneath the pavements, so much more remains undiscovered.  The point is, you can see so much without paying a penny, like the Trevi Fountain, which is pretty much what we did.

I didn’t bother to do any specific research before I went.  I watched Mary Beard’s series on BBC4 when it was broadcast.  I wanted to just look, and take it all in.  And it really is spectacular.  My general photos are here.  The river you can see is the river Tiber – the Tiber!  I don’t know why this excited me so much, but it did.  I wish I’d kept up with Latin, though.  Just the street names can tell you so much, but I might have been able to read some of the inscriptions and graffito.

Anyway, our hotel was this one, which was central for everything and very comfortable.  Mind you, it was a bit inconsistent.  My room was right at the top, with my very own balcony, and very spacious for a single room.  Wifi was pretty near impossible to get, though.  My travelling companions had single rooms each, both of which were smaller than mine (although one had a queen(?) sized bed and a door that was incredibly difficult to open; the other was more like a cupboard and had a leaking bidet).  And none of our rooms were in the hotel we originally booked, which was this one.  For some reason, they’d made a mistake and had to move us.  The Helvazia was more central, but further away from the Colosseum and the conference venue, which is the reason one of us was there in the first place.

[Photo: Even had a lime tree…]

In fact, the mistake with the hotel booking came at the end of a day that had started with a rail strike, meaning I had to get a taxi to the station because my train was cancelled; then the aforementioned taxi hit a cyclist who came tearing out of a park straight across the road (wearing his earphones….); and then the train to Gatwick was cancelled as well, so we ended up getting another taxi (fifty quid each) to the airport because we didn’t want to take any more chances with public transport.  Sigh.

I didn’t find Rome as expensive as I thought it would be, which probably says more about how prices have risen generally than anything else.  We are really well (apart from the last evening, which was ok but not up to the standard of previous choices).  We ate here (which was my favourite, and by far the cheapest, especially with wine at 7 euros a litre) on the first evening; lunched here on Thursday (lovely, freshly cooked food but very uncomfortable seating if you have anything other than a small bottom); a Sicilian restaurant Melo on Thursday evening; and here on Friday evening.  The Constanza was something special.  Not only was the food and wine excellent, but the restaurant itself is partially in the remains of a Roman theatre.  Saturday was a bit of a disappointment.  We wanted to go to a place that made pizzas fresh and right in front of you, but unfortunately it was full, complete with queue of people waiting for a table

[Photo: The Tiber!]

Two spectacular places we did pay to visit were the Colosseum and Trajan’s Market.  My Colosseum pictures are here.  Trajan’s Market (photos here) was originally directly linked with the Colosseum.  The magnificent horse sculptures you can see are modern, and part of a touring public exhibition, the Lapiderium.  Given that I heard a tour guide say that more ‘exotic’ animals were dispatched for public entertainment (and probably by ‘accident’ in the chariot races) in the Colosseum than at any time in history, I thought they were a poignant reminder of how cruel human beings can be.

I would definitely go back to Rome again.  I didn’t see the Sistine Chapel, or visit any of the art galleries.  I’d do some more research as well, and visit something with a bit more knowledge under my belt.  Oh, and I’d take several pairs of comfortable walking shoes and loads of pairs of socks.  Walking around Rome is incredibly hard on your feet, paved as it is with small granite blocks if you’re lucky, and crumbling concrete and tarmac if you aren’t.  It’s the best way to get around, though, as nothing is especially far away, and public transport looked packed and tricky to negotiate unless you speak some Italian.  I wouldn’t like to be there in the summer.  It was very warm even in October, and busy.  I can only imagine how hot and crowded it must be in July/August.