What Big Data Can’t Tell You

I’ve spent what seems like months writing Python code that will let me download the content of blog posts. You can do this using what’s known as an RSS (Rich Site Summary) feed, but that only yields a summary of the most recent posts, whereas I need the whole of every post the blogger has written. In some cases, this goes back years. It’s been a painful process, and will be the subject of my next blog post, but just for a bit of ‘fun’ I thought I’d look at the comments feed instead.

A while ago, Tom Starkey (@tstarkey1212) asked on Twitter if there was any way of finding out which Edu-blogs might be the most popular.  One way of finding out might be to look at the number of comments made on posts, so I thought I’d use the RSS feed this time to download the latest ones and have a look.  I wrote some Python code, and bingo! there they all were in a nice tidy spreadsheet.  There are some issues, though.  Quite a few, actually.
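To give a rough idea of what reading a comments feed involves, here is a minimal sketch using only Python’s standard library. It is not the code used for this post. The sample feed is made up, and the field names (`title`, `link`, `pubDate`, `description`, and `dc:creator` for the author) are an assumption about how a typical RSS 2.0 comments feed is laid out; real feeds vary.

```python
import xml.etree.ElementTree as ET

# Namespace often used for the author element in RSS 2.0 feeds (assumed here).
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# A tiny hand-written feed standing in for a real download (hypothetical data).
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Comments for Example Edu-blog</title>
    <item>
      <title>A Post About Marking, Jo Bloggs</title>
      <link>https://example.wordpress.com/2016/01/01/marking/#comment-1</link>
      <dc:creator>Jo Bloggs</dc:creator>
      <pubDate>Fri, 01 Jan 2016 10:00:00 +0000</pubDate>
      <description>Interesting post.</description>
    </item>
  </channel>
</rss>"""


def parse_comment_feed(xml_text):
    """Return one dict per <item> found in an RSS comments feed."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        rows.append({
            "title": item.findtext("title", default=""),
            "author": item.findtext("dc:creator", default="", namespaces=NS),
            "date": item.findtext("pubDate", default=""),
            "comment": item.findtext("description", default=""),
            "link": item.findtext("link", default=""),
        })
    return rows


rows = parse_comment_feed(SAMPLE_FEED)
```

Each dict in `rows` corresponds to one spreadsheet row, so the list could be written out with `csv.DictWriter`. A missing field comes back as an empty string rather than raising an error, which makes gaps easy to spot later.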

  1. I have no idea if I have the web address of every Edu-blogger out there. My source was the list in a spreadsheet provided by Andrew Old. How complete it is depends on a) whether you’ve heard of Andrew, or b) whether you’ve heard of him but don’t want to add your blog to ‘his’ spreadsheet. Still, there are over 800 blogs on there so it’s a big enough sample to be getting on with.
  2. The information I needed was the blog post title, the name of the commenter, the date the comment was made, the comment itself, and the link to the comment. The link is important because it contains the title of the blog site. RSS feeds yield the same fields each time as they’re kind of standardised. However, the title of the blog contained in the link (the sarahhewittsblog part of https://sarahhewittsblog.wordpress.com/, for example) isn’t always the actual title. Nor is it always straight after the //, so any attempt to automatically extract the title based on its position in the address was difficult. When you’ve got over 8000 rows in your spreadsheet, you really want to automate the process if you can. I chose not to, because…
  3. The name of the commenter might also be the title of the blog.  In fact, this was the case for quite a few posts, something that only becomes obvious when you slowly scroll through each of those 8000-plus rows.
  4. The name of the commenter should be the very last item in the field yielded by the RSS feed.  In theory, it should be easy to extract because it would come after a comma or possibly even a | symbol.  So, I could write some code that would iterate over every one of those 8000 rows and just extract the commenter’s name and put it in a separate column, right?  Wrong.  Some fields were truncated because they were too long.  Relying on commas to demarcate the right characters risked getting the wrong information.  Sometimes there was nothing more than a space.  In the end, I did it manually, copying and pasting.  That also helped me to identify names that related to the blog title and the name of the commenter, so I could match them up.
  5. Finally, the most obvious thing. Not everyone who reads a blog leaves a comment. In fact, I’m willing to bet most people don’t. And if they do respond, I bet it’s by posting a link to the blog with a recommendation, or simply retweeting the link that brought the blog to their attention in the first place. The only way of knowing who reads a blog is in the hands of the blogger themselves via their stats pages, or possibly Google with their page link algorithm. Still, I think the real proof (in spite of what some bloggers have claimed) lies in those stats. And given I’ve been accessing some sites repeatedly in an effort to see if my code works, there may be some glitches there as well.
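Points 2 and 4 can be illustrated with a short sketch. This is not the code used for this post, and the helper names `blog_host` and `guess_commenter` are made up. It assumes the commenter’s name follows the last comma or | in the field, and it deliberately gives up (returns `None`) on anything that looks truncated or has no separator, so those rows can be checked by hand.

```python
from urllib.parse import urlparse


def blog_host(link):
    """Return the host part of a comment link, wherever it sits in the address."""
    return urlparse(link).netloc.lower()


def guess_commenter(field, separators=(",", "|")):
    """Best-effort guess at the commenter's name at the end of a feed field.

    Returns None when the field looks truncated or contains no separator.
    """
    text = field.strip()
    if not text or text.endswith(("...", "\u2026")):
        return None  # truncated: the name may be cut off or missing entirely
    # Find whichever separator occurs furthest to the right.
    pos, sep = max(((text.rfind(s), s) for s in separators), key=lambda p: p[0])
    if pos == -1:
        return None  # no separator at all, e.g. only a space before the name
    name = text[pos + len(sep):].strip()
    return name or None


host = blog_host("https://sarahhewittsblog.wordpress.com/")
name = guess_commenter("A Post About Marking, Jo Bloggs")
```

The host tells you which site a comment belongs to, but as point 2 says, it is not necessarily the blog’s actual title. Likewise, a title that itself contains a comma will still produce a plausible-looking but wrong name, which is why a manual pass over the results remains necessary.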

In spite of all this, I gathered my data and used NodeXL to produce a graph. Three, in fact. The basic one is here, I’ve made some notes based on the graph metrics (graph-notes), and there are two other versions here and here. All of them are best viewed using a laptop or PC.
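For anyone wondering how rows of comments become a graph, here is a minimal sketch. It assumes each comment is treated as a link from the commenter to the blog, which is one plausible reading rather than a description of the actual NodeXL workbook. The sample rows and the `commenter` and `blog` keys are hypothetical.

```python
from collections import Counter

# Hypothetical cleaned rows: one per comment.
sample_rows = [
    {"commenter": "Jo Bloggs", "blog": "example.wordpress.com"},
    {"commenter": "Jo Bloggs", "blog": "example.wordpress.com"},
    {"commenter": "A Teacher", "blog": "example.wordpress.com"},
    {"commenter": "Jo Bloggs", "blog": "another.example.org"},
    {"commenter": "", "blog": "another.example.org"},  # name not recovered
]


def comment_edges(rows):
    """Count comments per (commenter, blog) pair; each pair is one weighted edge."""
    return Counter(
        (r["commenter"], r["blog"])
        for r in rows
        if r["commenter"] and r["blog"]
    )


def comments_per_blog(edges):
    """Total the edge weights arriving at each blog."""
    totals = Counter()
    for (_, blog), weight in edges.items():
        totals[blog] += weight
    return totals


edges = comment_edges(sample_rows)
totals = comments_per_blog(edges)
```

Rows with a missing name are skipped rather than guessed at. The totals only count comments that appeared in the feed, so they say nothing about readers who never comment.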

Finally, if your blog isn’t on Andrew’s spreadsheet, and you want to see how it compares with everyone else’s (or you’d like me to include it in the data I’ll be using for my PhD), you can either add it yourself or let me know the address and I’ll add it to my own records. I intend to anonymise all my data before I publish it because I know how sensitive it is even though it’s public (I’m an ex-teacher myself). Or you can send me your viewing stats because, after all, they paint the clearer picture.

The thing is, though, that while it’s easy to think your blog might be the one that’s influencing everyone and getting them ‘on your side’, knowing it and proving it are completely different things.
