This blog is intended to be a record of the things I’ve been thinking about as I’ve looked over a sample of my data. You might find it a bit boring…. that’s allowed. You don’t have to read it.
Dealing with Data: Dates
I’m working on a sample of blog post data that I scraped for my PhD upgrade report (and a paper for the Web Science conference that wasn’t accepted, sadly). The data contains ‘just’ 11,197 rows of text data: The contents of each blog post, the date it was posted, and the title of the post. Well, that’s what I wanted when I wrote the code that went through a list of URLs and scraped the data.
A spreadsheet with 12,000 rows is just about manageable, by which I mean you wouldn’t want to print the data out, but you can scroll through and have a look at what you’ve got using Excel. A sample like this is useful because you can observe the data you’ve gathered, and anticipate some of the problems with it.
The first thing I noticed is that rows 1486 to 2971 appear to be duplicates of the previous batch of rows. Obviously this has happened because the source URLs have become duplicated. Now, when I got my first list of URLs together, not all of them could be scraped. There are several reasons for this:
- wrong address;
- URL no longer available;
- password protected blog;
- the code simply won’t work on the given URL.
My code stops running when it encounters a URL it can’t access. Up to now, I’ve been manually cutting out the offending URL, and copying it in to a separate document that I can look at later. This is the first place an error could be made, by me of course.
Task 1: amend code so that a URL that can’t be processed is automatically written to a separate file, and the code continues to iterate through the rest of the list.
When you’re dealing with around 1000 URLs, as I hope to do, the less intervention by me the better.
Then, there’s the data that’s gathered. First, Excel is a very poor tool for viewing data scraped from the web. I used Pandas (a Python module) to clean it up a bit (removing the whitespace at the beginning and end of the text) first before opening it up in Excel. Then, it’s possible to see what’s in each cell, and align it top/left if necessary. As I was only interested in reviewing the ‘date’ and ‘title’ columns at this stage, I saved the file with a slightly different name and deleted the ‘content’ column. The reduction in file size makes it a bit easier to manage.
All looks good. This is a typical entry:
|65||September 11, 2012||Reading|
65 is the index number given to the entry when the data was scraped, so it’s the 66th blog post from this URL (entries start at zero).
Then there’s this:
|0||Posted on December 5, 2016||Carnival of Mathematics 140|
The way the date is represented is crucial to my project.
Task 2: Remove ‘posted on ’ from the string.
Easy enough to do you’d think, but actually not. It is possible to strip the first n-characters from the beginning of a string, but the code will iterate through every row and do the same, which is not what I want. The other option is to split the string and copy the ‘posted on ’ (the space after ‘on’ is deliberate) bit to another column. So, the pseudo-code would look like this:
if row in ‘Date column’ contains the string ‘posted on ’;
split string after ‘posted on ’;
write to row in ‘Posted On’ column.
|1||Posted on January 29, 2017January 29, 2017||Education
So much is such a waste of time
Posted on January 29, 2017January 29, 2017
There are a couple of problems here. If I split the date string as I did previously, it’s not going to help me. I’d be left with ‘January 29, 2017January 29, 2017’. Now what?
Secondly, the title cell looks to me as if it contains a category for the post, the title, and the date the post was made (again). At this point, I’m thinking of finding this particular blog post via a google search, and looking at the HTML structure of the page to see why I’m getting these extra bits of unnecessary information. It may not look like much, but:
- when my spreadsheet has one hundred and eleven thousand rows, or more, that’s a lot of extra data;
- I eventually want to use the titles when I present my data visually to an audience;
- The title itself may be useful to add some substance to my analysis, so I don’t want it ‘dirtied’ with useless characters.
This row has a similar issue, although there is no category. I’ve added the stars to protect the identity of the blogger.
|0||Posted on November 9, 2015||What did I learn?
Posted on November 9, 2015 by C******* M*****
I’m not sure what to do about the date here, so let’s move on. I can do this though:
Task 3: examine the HTML structure of this blog URL with a view to modifying the code used to scrape the data.
Here’s something else interesting:
|183||Posted on March 1, 2010September 9, 2010||Software and websites I couldn’t do without|
Two dates. I suspect that the first date is the one on which the blog entry was posted, and the second is the date it was amended /updated. Again, how am I going to deal with this? I think I’m going to have to go back to the HTML again and see if I can make another modification to my code. I’m only on row 925… let’s move on.
Here’s my next oddity:
|0||2016-09-12 by k*****||National Drama CPD Training for secondary teachers|
I can split the string here:
if row in ‘Date column’ contains the string ‘ by’;
split string before ‘ by’;
write to row in ‘By’ column.
The space is in a different place now. This matters, because while you and I see a space in Excel, there is in fact a character there, and it counts. It quite literally ‘counts’ too, because it has a place. It’s number 10 in the string (remember, counting begins at zero). So, if I were to split the string at the space before ‘by’, it might actually split at a different place in a different cell (remember my code will iterate through every row of the column, so I need to be sure that it will only impact the cells I want it to).
Task 4: split string at ‘ by’.
The date that’s left in the cell will be in a different format from previous dates i.e. it’s 2016-9-12 rather than September 12, 2016. Will this make a difference? I don’t know yet.
|0||2017-02-05 00:00:00||314. Maths is a foreign language|
This date has the time as well. Again, I don’t know what difference this will make.
|1||21st December||Phase diagrams|
Now here’s a problem – no year. A crucial piece of information is missing, and it’s missing for 696 rows (from row 4603). Previously, I used Pandas to do a quick audit (locating rows containing 2017, 2016, 2015 etc. and had established that 786 rows were unaccounted for. It looks as if I’ve found some of them.
p.s. rows 9522 to 9552 are similar, so there’s another 30. Only 40 unaccounted for.
|30||Posted by b********1||Hello world!|
This cell indicates there’s no spaces between ‘Feb’, ‘17’ and ‘2017’ although when I pasted the row into this word document, each element was on a different line.
|0||Feb172017||Learning & Teaching GCSE Mathematics|
This will probably be ok because when I come to analyse my data, the important pieces are the month and the year, both of which are clear.
And what about this?
|36||8. März 201330. März 2016||Build your own low-cost slate! | Baue dein eigenes low-cost Slate!|
I know from looking at this blog before that not all of it is in a foreign language (I’m assuming it’s a foreign language teacher), so do I leave this entire blog out of my master list?
I could split these strings, although the figure given for the number of comments varies.
|0||04 Apr 2016 Leave a comment||The World is Upside Down|
|6||22 Apr 2015 3 Comments||Revision – what works best?|
if row in ‘Date column’ contains the string ‘ leave a comment’;
split string before ‘ leave’;
write to row in ‘Leave’ column.
if row in ‘Date column’ contains the string ‘ (number) comments’;
split string before ‘ (number)’;
write to row in ‘Leave’ column.
It’s possible to write code that will take any numerical value for ‘(number)’.
Then there’s this – no title at all.
|333||March 3, 2012|
I really need something here, but what? I could amend my code so that, if it fails to find a blog post title, the phrase ‘No Title’ is written into the row instead. Alternatives include:
- use the first sentence from the blog post itself (which can be extracted from the ‘Contents’ cell);
- use the three most common terms from the post (obtained from the TF-IDF analysis I’m doing on the whole data set);
- deploy some other text analysis technique to summarise the post in one sentence, which, when you think about it, is exactly what we try and do when we come up with a title for our own blogs.
This affects quite a few rows, so it needs addressing.
March 14, 2015
Leave a comment
|Peer Observation – Priceless CPD, for free!|
I’ve copied and pasted this ‘as is’, although in the spreadsheet the data in the date cell appears on one line. This highlights one of the issues when viewing data – it will appear differently when looked at through different windows, and yet each window has its advantages. Excel is good for scrolling through data, and for basic numerical functions. For everything else, I use Pandas for Python, usually via the Jupyter notebook that’s part of the Anaconda suite.
|Hi Guys. This page will contain all the BSGP (bronze, silver, gold, platinum) skill sheets for your perusal.
All resources are free to a good home and are intended to be used for what they are… banks of questions rising in difficulty to help complement your teaching, not replace it!
As I create new resources I’ll add them here so check back often. At some point i’ll probably give the project a formal name and organise it a little better than I am at the minute.
All answer sheets can be found in a password protected blog post (called ‘answer sheets’ of all things!).
Hit me up on twitter ( @mrlyonsmaths ) for the password
Here’s a row where the contents are appearing where the title should be. I’m willing to bet that this is because of the HTML structure of the page, so I need to revisit my master code. It’s not the only set of blog posts from a URL either.
Task 5: revisit master code for extracting ‘Title’ from this blog URL.
And all these problems are, of course, the ones I’ve uncovered in my sample. The ones I know about. My final data set will be huge, and I’ll have little chance of spotting anomalies unless I accidentally stumble upon them.
Welcome to my world of big data.