Houston, We May Have A Problem….

I’ve been writing up my PhD.  This has been a very slow process, mainly because I’ve had to spend quite a bit of time going back through all my references, and re-planning the whole thing.  I bought post-it notes, and a tabletop flip chart (which is also like one massive post-it), and I’ve re-appraised everything.  As I write, I’m constantly adding more post-its as prompts of things I need to look up / do / add to the ‘discussion’ section at the end.

One of the things I decided I’d do was go back through my original data to make sure that I’d gathered everything I needed to, and to see if I could improve the cleaning-up process.  In computer science circles, this is often referred to as ‘text wrangling’.  Your typical blog post contains URLs, advertising rubbish added by the platform, junky unicode, characters representing carriage returns, new lines…. I could go on.  This all has to be removed.  A text data file, when it’s being prepared for analysis, can get very big very quickly – and there’s a limit to the size of file that even my pretty-well-spec’d laptop can handle.  Having discovered this excellent site, I can now copy and paste a section of a blog post with some rubbish in it, and generate the code snippet that will remove it.  Regex can be tricky – the broader the pattern, i.e. the more freedom you give it to remove the stuff you don’t want, the greater the chance it’ll also remove stuff you’d really have preferred to keep.  It’s difficult to check, though, so in the end you probably have to just take the risk.
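To give a flavour of the kind of clean-up I mean, here’s a minimal sketch in Python; the patterns are illustrative, not the exact ones I use.

```python
import re

def clean_post(text: str) -> str:
    """Strip the usual blog-post junk from a string (illustrative rules only)."""
    text = re.sub(r"https?://\S+", " ", text)      # URLs
    text = re.sub(r"&[a-z]+;|&#\d+;", " ", text)   # HTML entities like &amp; or &#8217;
    text = re.sub(r"[\r\n\t]+", " ", text)         # carriage returns, new lines, tabs
    text = re.sub(r"\s{2,}", " ", text)            # collapse runs of whitespace
    return text.strip()

print(clean_post("Read this!\r\nhttps://example.com/post &amp; then some…"))
```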

The other thing I wanted to do was expand the contractions in the posts, so that ‘isn’t’ becomes ‘is not’ and so on.  I think it’s important to leave behind a data set that may be useful to future researchers, some of whom might be interested in sentiment analysis.  Expanding contractions helps to keep the meaning of the writing intact.
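As a sketch of what I mean, using a small hand-made mapping (the `contractions` package on PyPI does this more thoroughly if you’d rather use a library):

```python
CONTRACTIONS = {
    "isn't": "is not",
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
}

def expand_contractions(text: str) -> str:
    # Naive but transparent: replace each contraction with its expansion.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text

print(expand_contractions("This isn't perfect, but it don't half help."))
```

Ambiguous cases (‘it’s’ as ‘it is’ versus ‘it has’) need a longer mapping or a proper library, which is one reason it’s worth doing carefully.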

Then, I decided I’d go back and look again at how I’d processed my data.  As you may recall, my aim is to classify as many edu-blogs as possible according to a pre-defined list of categories drawn from the research that’s already been done on what teachers blog about.  I chose this approach because the potential number of topics is completely unknown, and potentially huge.  It’s possible to run an algorithm that will cluster blogs without any prior information, but the trouble is that a) you still have to give it some idea of how many clusters you might be expecting, and b) the results will vary slightly each time it’s run.  It’s not a model; there’s no consistency.
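Here’s a quick sketch of the inconsistency I mean, using k-means on some made-up data – the clusters shift from run to run:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((100, 5))   # 100 made-up documents
for seed in (0, 1, 2):
    labels = KMeans(n_clusters=6, n_init=1, random_state=seed).fit_predict(X)
    print(labels[:10])   # cluster ids (and memberships) differ between runs
```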

One of the alternatives is to label a small set of blog posts with numbers representing categories, and then use an algorithm that will take this information and classify the unlabelled posts.  This is how it works: imagine having a double handful of brown smarties and a clear perspex box, say 1m x 1m.  You throw the smarties into the box, but by magic they remain stationary, though scattered, in space.  Now you take a small number of coloured smarties – several of each of the remaining colours – and chuck them in as well.  They also hang in space.  The label spreading algorithm treats the coloured smarties as the labels, and it sets about relabelling all the brown smarties according to how close they are to each different colour.  You can allow it to change the colours of the non-brown smarties if you want, and you can give it some freedom as to how far it can spread, say, the red colour.  The algorithm spreads and re-spreads each colour (some of the different coloured smarties will be quite close to each other…. where should the boundary be drawn?) until it reaches convergence.

The picture here (and above) is a great example.  Not only does it look like a load of smarties (which I’m now craving, btw), but it also perfectly illustrates one of the fundamental problems with this approach – if your data, when plotted in a 3D space, is an odd shape, spreading labels across it can be a bit of a problem.  The algorithm draws a network (there are lines connecting the smarties if you look closely) and uses the links between the smarties – the smarties are officially called ‘nodes’, the links ‘edges’ – to determine how many ‘hops’ (edges) it takes to get from a labelled node to the closest unlabelled one.
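Here’s a toy sketch of the smarties box using scikit-learn’s graph-based (‘knn’) kernel; the coordinates are invented, and -1 marks the brown (unlabelled) smarties:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(42)
X = rng.random((200, 3))        # 200 smarties scattered in the box
y = np.full(200, -1)            # everything starts brown (unlabelled)
y[:5], y[5:10] = 0, 1           # a handful of red and green ones

model = LabelSpreading(kernel='knn', n_neighbors=7)  # spread along the edges
model.fit(X, y)
print(model.transduction_[:20])  # the brown smarties, now coloured in
```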

Each of these nodes could represent a blog post.  It has co-ordinates in this space.  The co-ordinates are generated from the words contained in the post.  The words have to be represented as numbers because none of the algorithms can deal with anything else – this is maths territory we’re in, after all.
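One standard way of doing this is TF-IDF weighting, which is roughly what generates the co-ordinates I’m describing; a sketch, with invented posts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

posts = ["marking takes forever",
         "ofsted inspection stress",
         "marking and feedback strategies"]
X = TfidfVectorizer().fit_transform(posts)
print(X.shape)   # one row of co-ordinates per post, one column per distinct word
```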

I’ve done this label spreading thing before with a sample set of data.  It seemed to work OK.  A quick audit of the results was promising.  I had another run through the code with a different set of data, including the training set I’d developed earlier, and realised that things weren’t quite the same.  The algorithm has had a bit of an upgrade since I last deployed it: there were some issues, and the developers at scikit-learn made some improvements.  That got me re-thinking what I’d done, and I realised two things: I’d made a fundamental error, and the new results I was getting needed a bit of an audit.

The book on the right has been invaluable!

The fundamental error really shows up how hard it is to do data / computer science when you aren’t a data / computer scientist.  I was feeding the algorithm the wrong set of data.  I should have been feeding it an array of data based on distance, but I wasn’t.  I was still getting results, though, so I didn’t notice.  The thing is, nowhere is there anything that says ‘if you want to do this, you must first do this, because this’.  Every writer of computer science books and blogs and tutorials just assumes that you know.  I went back and re-read a few things, and could see that the crucial bit of information was implied rather than stated.  I can spot it now I’ve gained a lot more knowledge.  So, fault corrected, move on, nothing to see here.
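For anyone else in the same boat, this is the shape of the thing I mean – a sketch, not my exact pipeline: an ‘array based on distance’ is a square matrix of pairwise distances between posts, rather than the word-by-document matrix itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

posts = ["marking takes forever",
         "ofsted inspection stress",
         "marking and feedback strategies"]
X = TfidfVectorizer().fit_transform(posts)
D = cosine_distances(X)   # D[i, j] is the distance between posts i and j
print(D.shape)            # (n_posts, n_posts), not (n_posts, n_words)
```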

The audit of results isn’t very encouraging, though.  There were many mis-categorisations, and some that were just a bit… well… odd, but understandable.  One of my categories is ‘soapboxing’ – you know, having a bit of a rant about something.  Another is ‘other’, to try and catch the posts that don’t fit anywhere else.  It turns out that if you have a rant in a blog post about something that isn’t education, it still gets classed as ‘soapboxing’, which makes perfect sense when you think about it.  An algorithm can’t distinguish between a post about education and a post that isn’t, because while I’m thinking about concepts / ideas / more abstract topics for blog posts, it’s just doing maths.  Post x is closer to topic a than topic b, and so that’s where it belongs.

There are other approaches to this.  I could use topic modelling to discover topics, but that has problems too.  ‘People’ might be a valid topic, but is that useful when trying to understand what teachers have been blogging about?

My label spreading approach has been based on individual words in a blog post, but I could expand this to include commonly-occurring pairs or trios of words.  Would this make a significant difference?  It might.  It would also put some strain on my laptop, and while that shouldn’t necessarily be a reason not to do something, it’s a legitimate consideration.  And I have tried tweaking the parameters of the algorithm; it makes little difference.  Overall, the results aren’t much different from one another, which is actually a good thing: I can make a decision about what settings I think are best, and leave it at that.  The problem, the real problem, is that I’m working with text data – with language – and that’s a problem still not solved by AI.
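For the record, including pairs and trios is a one-line change to the vectoriser – a sketch, with the catch being that the feature space (and the memory it eats) grows very quickly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# unigrams, bigrams and trigrams in a single vocabulary
vectoriser = TfidfVectorizer(ngram_range=(1, 3))
```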

What I cannot do is make the data fit my outcome.  Worst case scenario, I have a lot to add to the ‘discussion’ part of my PhD.  If I can come up with a better analytical framework, I will.  The hard work – harvesting and wrangling the data – has already been done.  If I have to find some more papers to add to the literature review, that’s no hardship.  In the meantime, I’ve slowed down again, but I’m learning so much more.


Kicking Off The Actual Writing

For the last two days, I’ve been up NORTH at a writing retreat, organised by the DEN, and held here.  I’ll add a link to my photos, but I’ll put this one here because it sums up the place perfectly!

[Photo]

I got loads of work done, as you can see.  I love technology, but sometimes you have to get the stationery out and do it the old-fashioned way.  Besides, who doesn’t love stationery, amiright?

[Photo: The narrative…]

[Photo: The overall structure, minus the Introduction (which is next)]

[Photo: The Introduction]

[Photo: Starting the Literature Review, and some extra thoughts….]

[Photo: ….and the Literature Review, specifically focusing on the blogosphere!]

University of Shanghai Datathon (March 2018)

It’s taken me longer than usual to write this blog post.  This is partly because I’ve been very poorly since returning from China (the doctor thinks I may have picked up a viral infection on the plane on the way home), and partly because the trip was very different from previous ones.  The purpose of this trip was to participate in a Datathon, with one day scheduled to have a look around Shanghai.  As it turned out, spending four days exploring data with other students was an absolute joy, and I was able to use many of my skills and the data processing tools I’d gathered, which made the whole experience a really positive one.  The group of students I was working with – hello if you’re reading this! – were absolutely lovely, and they looked after me really well.

My lovely group: Richard, Christie, and Eric. Our presentation is in the background!

The hotel where we were staying was within the University campus (I think I’m right in saying it’s actually owned by the University of Shanghai), which itself is a 50-minute journey by subway to the city centre.  I wish I’d taken more pictures of the campus, which was large, open and airy, with lots of green space and gardens (all my photos are here).  The computer science building where we were based was a few minutes’ walk away.

The data we were given to work with included some from the NYC Taxi and Limousine Commission.  This is a huge set of data that people have already done some amazing – and silly – things with, like this, which shows that you can make it from the Upper West Side to Wall Street in 30 minutes like Bruce Willis and Samuel L Jackson.  The theme of the exercise was ‘Smart Transport, Our Environment, and More’, which is a very hot topic at the moment, especially driverless vehicles.  The University of Shanghai is conducting a lot of research on autonomous vehicles, including transport by sea.  We were given one year of data to work with, but even when this year was broken down into months, the size of the files made it impossible to work with on a laptop.  While my group worked on the main project, I drew a 1% sample of January 2013 – the largest sample I could extract and still be able to process the data.  I’ve included a few images here, which were generated using Orange (part of the Anaconda suite), which I’ve blogged about previously.
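For the curious, drawing a small sample from a file too big to load whole can be done in chunks; a sketch (the file name here is illustrative):

```python
import pandas as pd

# read the month in manageable chunks and keep a random 1% of each
chunks = pd.read_csv("yellow_tripdata_2013-01.csv", chunksize=500_000)
sample = pd.concat(chunk.sample(frac=0.01, random_state=42) for chunk in chunks)
sample.to_csv("tripdata_2013-01_1pct.csv", index=False)
```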

Passenger count – mainly (surprise!) single-person journeys.

All three groups in the Datathon converged around the idea of predicting where taxis would be in highest demand, and at what times.  There’s a link to our presentation, data and code here, and the work of the other groups can be found here.  I particularly liked the work on ‘volcanoes and black holes’, which is basically the same problem, but visualised differently.

James with a couple of his group. That’s part of the ‘volcanoes and black holes’ presentation from Jon’s group behind them.

The other two PhD students – Jon and James – were both really good coders, which was just as well, as the students they were working with were less experienced in this area.  In my group it was the opposite: they were able to crack right on with writing the code, while I did some of the ‘big picture’ stuff and helped with the presentation.

The nice thing about working with geo-tagged data is that it can be used to generate some lovely graphics.  These can tell you so much, and prompt other questions – for example, why don’t more people share a cab, and what would it take to persuade them to do so?  Even so, and although I haven’t been to New York, I do know that you have to know more about a location than a map and data will tell you.  You also have to know about people, and the way they behave.  Nevertheless, this is a fascinating open data set, which is being added to every year.  Similar data would be, I believe, easily available in Shanghai and other cities in China, and no doubt will be used in similar research.

Here you can see all the journeys from my sample plotted on a map.

2-6 passenger journeys. The larger the circle, the further the trip.

We all presented our work on Monday, 26th March in front of Professor Dame Wendy Hall, Professor Yi-Ke Guo, and Dr. Tuo Leng.  I know they were impressed with what had been achieved, and I think all the students (us included) gained a lot from the experience.  This is my second trip to China, and I have to say it made a huge difference being able to do something with the data.  In my (limited) experience, unless you’re a naturally gregarious person, it can be difficult to get fully engaged when some of the people you’re working with don’t speak English very well, and/or are reluctant to speak.  Fortunately for me, my group were both good English speakers and happy to chat while working.  For Jon and James, I think the students with them were less chatty, but the fact that the guys could write code helped to break down those barriers.  The fact that I could code, and had some useful data analysis tools I could draw on, made all the difference.  I felt more confident, knowing that I could make some useful contributions.  Of course, Shanghai is a more cosmopolitan city than Shenzhen, which probably makes a difference.

To sum up, then, this was a proper working trip which turned out to be both interesting and informative.  I met some lovely, lovely people and had a brilliant time.  I even managed to find plenty of vegetarian food to eat, and proper coffee.  I’m glad I’m not a vegan, though.

So, What DO teachers talk about?

So, having put the final piece of the coding jigsaw in place, here are the first set of results.  The diagram below represents a set of 7,786 blog posts gathered from blog URLs.  The earliest is from 2009, the latest 2016.  They’re currently all lumped in together, although in the end the data set will be a) much, much larger, and b) broken down by year (and normalised so that a proper comparison can be made).

There are lots of things going on here – how I’ve defined the categories; how I initially categorised some posts to form a training set; how the algorithms work and were applied to the data.  In spite of what some people will tell you, data science has all the appearance of giving nice, clear-cut answers when in fact the opposite – especially when dealing with text – is often true.

The journey to get here has been long and challenging.  Still, I’m happy.

[Diagram: the 7,786 blog posts, grouped by category]

A Research Trip to Singapore, Part 2

[Photo]

So, Singapore then.  First of all, it’s a really small country.  Have a look on Google maps.  It’s basically a city-state rather than a country.  It’s also clean, calm, and very green.  The pavements are pretty much spotless, everywhere is neat and tidy – even the H&M store in one of the big shopping malls was neat with all the clothes hanging on racks.  There’s a general air of sophistication – I didn’t see anyone dressed in baggy tracksuits, or trashy leggings.  Shoes, even trainers, were clean and new-looking.   Of course there were exceptions, but almost no-one looked down-at-heel.

Singapore is rich.  With Malaysia next door as a source of cheap labour, many people can afford cleaners and nannies.  Clearly, shopping is the number one pastime.  The malls are huge, and stretch the length of the main road down through the city centre.

[Photo: There’s a reason it’s green…]

Singapore has worked hard to become what it is.  It went from poverty-stricken to one of the richest countries in the world in a single generation, after gaining independence in 1965.  It’s rated very highly for education and healthcare, is a major player in finance, foreign exchange, and oil refining and trading, and is one of the world’s busiest container ports.

It’s also reclaimed a lot of land.  The fabulous Gardens by the Bay are built entirely on reclaimed land – land that was drained and then left for ten years to dry out.

Singapore isn’t without its drawbacks, though.  The death penalty is still a punishment for some crimes.  Homosexuality is illegal.  You can’t chew gum in public, or smoke unless you’re within 3 metres of a designated smoking area.  Jaywalking is illegal.  And it’s the kind of state where the police will arrest you first and ask questions later.  On the flip side, the streets are clean and crime is extremely rare.  There are a couple of casinos, but Singaporeans have to sign in to use them, and if any member of their family is concerned about their gambling, they are denied entry.  People generally drive considerately, and I didn’t see a single dented or scratched vehicle.

[Photo: The beautiful Gardens by the Bay.]

Alcohol, especially wine, is expensive.  I’m told people pop across to Malaysia and buy it in bulk.  Food is also pricey, although you can buy anything from Spanish tapas to Singapore noodles.  We didn’t manage to visit the places where street food is sold (known as hawker centres), but we did find a Food Republic, which is basically a canteen-style arrangement of several independent food outlets where you can buy a variety of Asian dishes.  Kimchi with fried rice, an egg on top, a bowl of clear soup and two little dishes of something unidentifiable was the equivalent of £2.  The basements of the shopping malls offer a similar arrangement, although a little more expensive at around £5.  Everything is spotlessly clean.

FYI, if you’re a vegetarian like me, it’s harder than you might think to find things you can eat.  If tofu is on the menu, it’s worth asking if it can be substituted for the usual chicken or pork.  The trouble is, I’m not sure Singapore would recognise a green vegetable if it was jumping up and down holding a sign saying ‘I’m a vegetable. Eat me!’.  Salad is practically non-existent.  Indian food offers dhals, Korean food kimchi, but from what I could see practically every other dish includes meat or seafood, and is fried.  No wonder Singapore has a problem with diabetes.

It’s incredibly hot in January, with tropical downpours accompanied by full-on thunder and lightning more often than not.  I love a good storm.  I recommend taking an umbrella everywhere with you, and wearing sensible, waterproof footwear.  It’s far too hot for coats.   Public transport is cheap and plentiful, and as everything is in English it’s easy to find your way around.

The hotel we stayed in provided a free smart phone for guests to use, which included free and unlimited access to mobile data as well as free phone calls (including international calls).  This was so useful, although I learned on the very last day that you can register your own mobile phone for free public wifi across the city.  How cool is that?

My biggest disappointment was finding out that the Raffles Hotel was closed for renovation.  I was really looking forward to a Singapore Sling there.  I did have one somewhere else, but it just wasn’t the same.  Still, I managed to bring back a bottle of Bombay Star gin from the duty free shop (£24 for a litre!) so that went some way to making up for it.

I’m not going to post all my photographs here, but I’ve included a link so you can see the entire album here.  So, to sum up, lovely country, lovely polite people, bit expensive but there are ways of mitigating this.  Be mindful of the law, and you’ll have a great time.  Oh, and Levis are incredibly cheap.


A Research Trip to Singapore, Part 1

Some of you know that I spent last week in Singapore on a research trip, sponsored by the University of Southampton.  In this first post, I’m just going to focus on the work side, and save the experience of Singapore (together with lots of lovely photos) for the next one.

The Web Science CDT (Centre for Doctoral Training) usually runs a couple of research trips to other universities over the course of a year.  In 2015, I went to Tsinghua University in China as part of a group of PhD students.  Our picture is the main one I use on my website, and it accurately conveys the general mood of the whole experience.  This time, I was only with two other students – Clare, who is doing her PhD with one foot in Education like me, and Jon, who is a bit of a maths genius and whose PhD is firmly rooted in AI.

[Photo]

The theme behind the invitation-only conference was ‘Wellness’.  Some students from the National University of Singapore (NUS), led by Dr Zhao-Yan Ming (Zoe), have developed a mobile phone app – DietLens – which invites you to photograph your plate of food, and will then tell you its nutritional content.  This doesn’t sound all that important, until you know that Singapore, in common with other Asian countries, has a serious problem with type 2 diabetes.  It isn’t always visibly weight-related: there’s a genetic predisposition, coupled with a diet high in fried food and sugar.  The government is facing a significant rise in the cost of treating people with diabetes, and something needs to be done to encourage people to change their eating habits.

We spent some time with Zoe and two of her students, going through the app and what it can do.  Not only can it ‘read’ the nutritional content of a plate of food with around 80% accuracy, but it can also estimate the portion size.  So far, there is a database of several hundred local foods, most of which I seem to recall come from restaurant fare, especially the food served at the hawker centres.  The database is being expanded with home-cooked food as well.

The app is a good example of what’s termed ‘deep learning’ in AI.  Every time food is photographed, the app prompts the user to identify it from a series of options, including an option to enter the recipe if the food has been made at home.  Each confirmation teaches the algorithm behind the app more about food identification, improving its accuracy.

Of course, the best outcome would be for users to choose healthier food once they know how potentially unhealthy their existing choices are.  However, we know from extensive research in Behavioural Science that persuading people to change their habits is extremely difficult.  Just being shown evidence that their food is high in fat and sugar, and low in complex carbohydrates, isn’t enough.  Most people simply carry on doing what they’ve always done, even when the issue is health-related.  Think about how many times someone we know acknowledges that they really MUST give up smoking, but carries on regardless of all the warnings.  We also know from research that people are more likely to change if they have the support of a network of family and/or friends, or even a group of people they don’t know, e.g. Weight Watchers, with their weigh-ins and meetings.

So, having spent a morning looking at the app, we went away to discuss how we might add value to the app, or consider some wider issues.  We knew we’d been allocated 30 minutes to present our ideas on Thursday (we saw the app in action on Tuesday) and had to come up with some ideas fast.

We were able to present three research questions covering four projects:

  • To what extent does the perception of information in the DietLens app affect behavioural change?
  • Can small online social networks improve communication between members of real-life food sharing networks in order to encourage behavioural change in dietary choices?
  • How can we use [the] data to encourage users to make better food choices, and continue to do so?

The first project suggested ways of improving user interaction with the app, with the intention of retaining the user – although it would be deemed a success if, at some point, the user no longer needed the app because they had improved their eating habits.  The interface should be intuitive, easy to use, and take up minimal time.  We also suggested displaying nutritional values using a simplified ‘traffic light’ system.

The second project proposed using a small social network to encourage users to change their eating behaviour.   Food consumption could be shared, and an ‘encourager’ could be identified.   Reward schemes could further encourage the user.  These enhancements also produce data, which can be used to evaluate the success (or otherwise) of the app.

The third and fourth projects focus on the use of this data.  ‘Nudge theory‘ underpins the suggestions for encouraging long-term change in dietary habits.  Nudge theory isn’t new, but it’s gained popularity in the wake of Thaler and Sunstein’s book Nudge, published in 2008.

[Image: the book’s cover]

Even the UK Government has a Behavioural Insights Team, otherwise known as the ‘nudge unit’.   It’s been responsible for things like writing to people who have not paid their council tax, informing them that most of their neighbours have already done so, thereby exerting subtle pressure to conform with the perceived behaviour of the group.

The app could make use of this by generating messages of encouragement from within the app, and allowing others who have access to generate messages or ‘thumbs up’ signs.  If a group of users chose to use the app together, the app could tell the members of the group when one of them made a healthy choice or cooked a healthy meal.

Building on this, by inviting others to ‘share’ your food and see what you’re eating, a small social network is created.  Everyone could be part of a group trying to make better food choices, or one person could invite others to join them for encouragement and support.  A study carried out among a sample of Mexican and Hispanic people in the USA (all trying to lose weight) asked the simple question ‘who gave you the most encouragement?’, and revealed that it was their children.  The DietLens app could ask the same question, perhaps at the end of each week, to establish whether the same holds true for Singaporeans.

[Photo: Local nudges on the Singaporean underground.]

Of course there are wider issues to consider, such as data privacy and ethics.  Furthermore, just because the app has been built and meets all the requirements doesn’t mean that people will use it, or even download it.   There must also be an accompanying advertising campaign, promotion in schools, and other marketing techniques that have been used successfully to promote campaigns like anti-smoking here in the UK, alongside a continuous analysis of the data.

I’m pleased to say that it looks as if the code for the app is going to be sent to us in Southampton so that we can train it on British food, especially the food we cook at home.  I’m especially looking forward to trying it out with my vegetarian recipes.  Given that I barely saw a vegetable in Singapore (other than kimchi, which was heavily spiced and fried with rice), I’m not sure how it will cope with anything green.  I should imagine broccoli will cause it to have a bit of a moment.

I hope to be able to add some photos of our presentation to this post, as my primary supervisor was in the audience, together with Dame Wendy Hall, to whom we owe our thanks for setting up the trip and inviting us along.  Watch this space!

Update: here’s a link to a video of our presentation.  It starts at 33:51.


Label Spreading

This week, I finally managed to get the last lines of code I needed written.  I wanted to apply the label spreading algorithm provided by scikit-learn, but the documentation provided is next to useless, even bearing in mind how much I’ve learned so far.  There are other ways of grouping data, but my approach from the start has always been to go with the most straightforward, tried and tested methods.  After all, my contribution isn’t about optimising document classification, but about the results of document classification, which will reveal what pretty much everyone from one community who writes a blog has been writing about.

The label spreading algorithm works by representing each document as a point in space, and then finding the other points that are closer to it than to, say, another document somewhere else.  I gave the algorithm a set of documents that I’d already decided should be close to each other, in the form of a training set of blog posts allocated to one of six categories.  The algorithm can then work out how the rest of the unlabelled blog posts should be labelled, based on how close to (or distant from) the training group they are.

It’s also possible to give the algorithm a degree of freedom (referred to as clamping) so that it can relax the boundaries and reassign some unlabelled data to an adjacent category that is more appropriate.  I don’t know yet exactly how this works, but it will have something to do with the probability that a document would be a better fit with category a than category b.

I ran the algorithm twice with different clamping parameters, and you can see the results below.

Category          Training set only   alpha = 0.2, gamma = 20   alpha = 0.1, gamma = 20
6                 21                  475                       506
5                 98                  1915                      1920
4                 34                  1013                      1044
3                 27                  505                       516
2                 34                  746                       712
1                 78                  3132                      3088
-1 (unlabelled)   7494                0                         0

(All figures are numbers of posts.)

The first column of counts is the set of posts with just my labelled training set; -1 represents the unlabelled data.  Thereafter you can see two sets of results, one with a clamping setting (alpha) of 0.2, the other slightly less flexible at 0.1.

alpha : float
    Clamping factor. A value in [0, 1] that specifies the relative amount that an instance should adopt the information from its neighbors as opposed to its initial label. alpha=0 means keeping the initial label information; alpha=1 means replacing all initial information. (scikit-learn documentation)

I’m still trying to find out exactly what the gamma parameter does.  I just went with the value given by all the scikit documentation I could find.
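For what it’s worth, my understanding is that with scikit-learn’s ‘rbf’ kernel, gamma controls how quickly the similarity between two posts falls off with distance – the weight between posts i and j is exp(-gamma * ||x_i - x_j||^2) – so a larger gamma means labels only spread between very close neighbours.  The two runs themselves look something like this sketch, where X and y are made-up stand-ins for my real document array and category labels:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Stand-ins for my real data: dense document vectors, plus labels
# with -1 marking the unlabelled posts (invented for illustration).
rng = np.random.default_rng(0)
X = rng.random((50, 10))
y = np.full(50, -1)
y[:6] = [1, 2, 3, 4, 5, 6]

for alpha in (0.2, 0.1):
    model = LabelSpreading(kernel='rbf', gamma=20, alpha=alpha)
    model.fit(X, y)
    print(alpha, np.bincount(model.transduction_)[1:])  # posts per category
```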

I then went through 50 randomly selected posts that had originally been unlabelled to see what category they had been allocated.  I changed 26 of them, although 10 of those went into a new category which I’m just calling ‘other’ for the moment.  So, in summary, I re-categorised 52% of the sample, and moved 20% of it into a new category.

I always knew from previous explorations of the text data that there would be posts that went into the ‘wrong’ category, but the degree of ‘wrong’ is only according to my personal assessment.  I could be ‘wrong’, and I have absolutely no doubt that others would disagree with how I’ve defined my categories and identified blog posts that ‘fit’, but that’s the joy / frustration of data science.  Context and interpretation are everything.