It’s taken me longer than usual to write this blog post. This is partly because I’ve been very poorly since returning from China (the doctor thinks I may have picked up a viral infection on the plane on the way home), and partly because the trip was very different from previous ones. The purpose of this trip was to participate in a Datathon, with one day scheduled to have a look around Shanghai. As it turned out, spending four days exploring data with other students was an absolute joy, and I was able to use many of my skills and the data processing tools I’d gathered, which made the whole experience a really positive one. The group of students I was working with – Hello if you’re reading this! – were absolutely lovely, and they looked after me really well.
The hotel where we were staying was within the University campus (I think I’m right in saying it’s actually owned by the University of Shanghai), which itself is a 50-minute journey by subway to the city centre. I wish I’d taken more pictures of the campus, which was large, open and airy, with lots of green space and gardens (all my photos are here). The computer science building where we were based was a few minutes’ walk away.
The data we were given to work with included some from the NYC Taxi and Limousine Commission. This is a huge set of data that people have already done some amazing – and silly – things with, like this, which shows that you can make it from Upper West Side to Wall Street in 30 minutes like Bruce Willis and Samuel L Jackson. The theme of the exercise was ‘Smart Transport, Our Environment, and More’, which is a very hot topic at the moment, especially driverless vehicles. The University of Shanghai is conducting a lot of research on autonomous vehicles, including transport by sea. We were given one year of data to work with, but even when this year was broken down into months, the files were too large to work with on a laptop. While my group worked on the main project, I drew a 1% sample of January 2013, which was the largest sample I could extract and still be able to process. I’ve included a few images here, which were generated using Orange (part of the Anaconda suite), which I’ve blogged about previously.
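For anyone who wants to draw a similar sample in code rather than in Orange, here is a minimal sketch in Python. It is not how the sample above was produced; it simply streams a CSV file one row at a time and keeps each row with a probability of 1%, so the full file never has to fit in memory. The function names and file paths are made up for illustration, and the sketch assumes the first row of the file is a header.

```python
import csv
import random


def sample_rows(rows, fraction, seed=0):
    """Yield the header row, then each data row with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    rows = iter(rows)
    header = next(rows, None)
    if header is None:  # empty input: nothing to sample
        return
    yield header
    for row in rows:
        if rng.random() < fraction:
            yield row


def sample_csv(src_path, dst_path, fraction=0.01, seed=0):
    """Stream src_path into dst_path, keeping roughly `fraction` of the rows.

    Both paths are hypothetical; returns the number of data rows kept.
    """
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        written = 0
        for row in sample_rows(csv.reader(src), fraction, seed):
            writer.writerow(row)
            written += 1
    return max(written - 1, 0)  # do not count the header
```

Because each row is kept or dropped independently, the sample size is only approximately 1% of the file, which is usually fine for exploratory work like this.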
All three groups in the Datathon converged around the idea of predicting where taxis would be in highest demand, and at what times. There’s a link to our presentation, data and code here, and the work of the other groups can be found here. I particularly liked the work on ‘volcanoes and black holes’, which is basically the same problem, but visualised differently.
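To give a flavour of what ‘where and when is demand highest’ means in practice, here is a rough sketch in Python. It is not the code any of the groups wrote; it only counts historical pickups per map grid cell and hour of day, which is the sort of baseline a real prediction would have to beat. The column names (`pickup_datetime`, `pickup_longitude`, `pickup_latitude`), the timestamp format and the 0.01-degree cell size are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime
import math


def demand_by_cell_and_hour(trips, cell_deg=0.01):
    """Count pickups per (grid row, grid column, hour of day).

    `trips` is any iterable of dicts, for example rows from csv.DictReader.
    The column names used here are assumed, not taken from the post.
    """
    counts = Counter()
    for trip in trips:
        when = datetime.strptime(trip["pickup_datetime"], "%Y-%m-%d %H:%M:%S")
        # Snap the coordinates to a coarse grid so nearby pickups share a cell.
        grid_row = math.floor(float(trip["pickup_latitude"]) / cell_deg)
        grid_col = math.floor(float(trip["pickup_longitude"]) / cell_deg)
        counts[(grid_row, grid_col, when.hour)] += 1
    return counts


# Three made-up trips: two close together in the morning, one elsewhere at night.
example_trips = [
    {"pickup_datetime": "2013-01-01 08:15:00",
     "pickup_longitude": "-73.985", "pickup_latitude": "40.758"},
    {"pickup_datetime": "2013-01-01 08:45:00",
     "pickup_longitude": "-73.984", "pickup_latitude": "40.757"},
    {"pickup_datetime": "2013-01-01 23:05:00",
     "pickup_longitude": "-74.006", "pickup_latitude": "40.713"},
]
busiest = demand_by_cell_and_hour(example_trips).most_common(1)
```

Calling `most_common` on the result lists the busiest cell and hour combinations first, which is a quick way to see the hot spots before plotting anything.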
The other two PhD students – Jon and James – were both really good coders, which was just as well, as the students they were working with were less experienced in this area. In my group it was the opposite – they were able to crack right on with writing the code, while I did some of the ‘big picture’ stuff and helped with the presentation.
The nice thing about working with geo-tagged data is that it can be used to generate some lovely graphics. These can tell you so much, and prompt other questions: for example, why don’t more people share a cab, and what would it take to persuade them to do so? Even so, and although I haven’t been to New York, I do know that you have to know more about the location than a map and data will tell you. You also have to know about people, and the way they behave. Nevertheless, this is a fascinating open data set, which is being added to every year. Similar data would be, I believe, easily available in Shanghai and other cities in China, and no doubt will be used in similar research.
We all presented our work on Monday, 26th March in front of Professor Dame Wendy Hall, Professor Yi-Ke Guo, and Dr. Tuo Leng. I know they were impressed with what had been achieved, and I think all the students (us included) gained a lot from the experience. This is my second trip to China, and I have to say it made a huge difference being able to do something with the data. In my (limited) experience, unless you’re a naturally gregarious person, it can be difficult to get fully engaged when some of the people you’re working with don’t speak English very well, and/or are reluctant to speak. Fortunately for me, my group were both good English speakers, and happy to chat while working. For Jon and James, I think the students with them were less chatty, but the fact that the guys could write code helped to break down those barriers. The fact that I could code, and had some useful data analysis tools I could draw on, made all the difference. I felt more confident, knowing that I could make some useful contributions. Of course, Shanghai is a more cosmopolitan city than Shenzhen, which probably makes a difference.
To sum up, then, this was a proper working trip which turned out to be both interesting and informative. I met some lovely, lovely people and had a brilliant time. I even managed to find plenty of vegetarian food to eat, and proper coffee. I’m glad I’m not a vegan, though.