In my previous post, I finished with a very brief criticism of using AI to replace traditional search via Google or similar tools. The unreliability of AI is a popular topic at the moment – I read ‘Weapons of Math Destruction’ by Cathy O’Neil not long after it was published, and there are now many more books and journal articles covering the subject. The best way for me to explain my reservations is in terms of my own research. I am trying to uncover the topics discussed by teacher bloggers, and to see whether the things they talk about have changed over time. To do this, I’ve harvested many thousands of posts and am using algorithms to help me categorise them.
There are different approaches to this, but one of the most straightforward is a word count: the words that appear most frequently in a text probably indicate the topic being discussed. The first step, however, is to strip ‘noise’ words – ‘and’, ‘but’, ‘so’ – from the entire corpus. I can then go on to remove other words that don’t appear to add any value; in the case of teacher blogs, these might be ‘thing’, ‘year’ or ‘week’. But every time I remove a word, there is a ripple effect across the entire corpus that I cannot see and evaluate, because the dataset is too large to trawl through manually. I am blind, and can only accept the results the algorithm presents me with. If we accept that we can leave AI to trawl through publications and deliver the ones we have asked for, knowing that there may be some it has missed, and are prepared to re-train the AI or run a ‘traditional’ search from time to time, then fine. But we cannot trust the AI entirely. Even if it provides two lists – one it is confident we want (i.e. the one with the best probability scores) and a second list of ‘maybes’ (with lower probability scores) – and we help it to ‘learn’ by picking some items from the second list, we still need to be mindful that the paper with the lowest probability score may be just what we were looking for. AI doesn’t yet understand language; it can only turn words into scores.
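To make this concrete, here is a minimal sketch of the word-count approach, using a tiny invented corpus in place of the real harvested posts (the stopword and low-value lists are deliberately toy-sized – a real run would use a full stopword list):

```python
from collections import Counter
import re

# A small illustrative corpus standing in for thousands of harvested blog posts
posts = [
    "This year the marking workload has been a big thing for teachers",
    "Behaviour management and marking dominate the week for so many teachers",
]

# 'Noise' words to strip out -- a tiny stand-in for a full stopword list
stopwords = {"this", "the", "has", "been", "a", "for", "and", "so", "many"}
# Low-value words specific to this corpus, removed after inspection
low_value = {"thing", "year", "week"}

def top_terms(texts, n=3):
    """Count word frequencies across all texts, minus noise and low-value words."""
    words = []
    for text in texts:
        words.extend(re.findall(r"[a-z]+", text.lower()))
    counts = Counter(w for w in words if w not in stopwords | low_value)
    return counts.most_common(n)

print(top_terms(posts))  # the most frequent surviving words suggest the topic
```

The ripple effect described above is visible even here: dropping ‘thing’, ‘year’ and ‘week’ changes which words rise to the top, and on a corpus of thousands of posts those knock-on effects cannot be inspected by hand.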
Secondly, I referred to using social networks to reveal communities of interest. I was hoping to do this with a good example from the Art world, but here my domain knowledge is lacking and it would take me more time to identify some key names. It’s worth mentioning here that ‘key names’ doesn’t necessarily mean ‘top names’. It’s my experience that the people at the top of their field don’t have time to dabble with social media or write blogs, but they DO publish papers and write journal articles, both of which are discoverable. Just to run ahead for a moment, the one young artist I did pick up on, Alannah Cooper, uses Instagram. The problem with Instagram is that it’s owned by Facebook and is a ‘closed shop’ for researchers. Nevertheless, the URL can be recorded.
Anyway, to return to social media: to construct an example of how straightforward it can be to extract a network, I chose Twitter and searched on a phrase that has been doing the rounds recently, ‘flattening the grass’. It refers to an Edu-Twitter ‘scandal’ about a school that gave students a very stern talking-to in assembly, possibly even calling out the names of individual students who needed to stop mucking about and actually do some work. Or words to that effect. It caused a bit of a storm, and was picked up by the TES. The graph generated from the results is shown here. I’ve also included a graphic below, which is a bit small I know, but if you want to download the original file and zoom in for a better look it’s also here.
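The network-extraction step itself really is straightforward. The sketch below uses a few invented tweets (the account names and text are hypothetical, not the real search results) and turns each @mention into a directed edge – essentially what a tool like NodeXL or Gephi does at scale:

```python
import re
from collections import defaultdict

# Hypothetical search results: (author, tweet text) pairs standing in
# for the real 'flattening the grass' data
tweets = [
    ("teacher_a", "Can't believe this 'flattening the grass' story @teacher_b @tes"),
    ("teacher_b", "Replying to @teacher_a - it was worse than reported"),
    ("head_c", "Assemblies like this are nothing new @tes"),
]

def mention_edges(tweets):
    """Turn each @mention into a directed edge: author -> mentioned account."""
    edges = []
    for author, text in tweets:
        for mention in re.findall(r"@(\w+)", text):
            edges.append((author, mention))
    return edges

edges = mention_edges(tweets)

# In-degree is a crude 'influence' measure: how often an account is mentioned
in_degree = defaultdict(int)
for _, target in edges:
    in_degree[target] += 1
print(sorted(in_degree.items(), key=lambda kv: -kv[1]))
```

From an edge list like this, a graphing tool can lay out the network and size the nodes by how often each account is mentioned or retweeted.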
The grouping algorithm behind this has made clusters of accounts based on their links with one another. The core cluster is Group 1, with smaller groups and unconnected accounts shown on the right. Profile information is also collected, and I could have removed everyone who didn’t identify as a teacher. However, not everyone discloses their profession, and as this is ‘my’ community I recognise one or two names that don’t immediately declare themselves to be educators. The top influencers are the TES and Paul Garvey (Groups 2 and 1 respectively) – the TES because they published the story (which broke originally on Twitter a week or so before), and Paul because he has been actively engaged in any debate that centres on discipline and behaviour management.
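To show the idea behind the grouping, here is a deliberately simplified version: it splits accounts into connected components, which is a much cruder technique than the modularity-based clustering a graphing tool actually applies, but it illustrates how groups fall out of the link structure alone (the account names are invented):

```python
from collections import defaultdict, deque

# Undirected links between accounts -- a toy stand-in for the harvested graph
links = [
    ("teacher_a", "teacher_b"),
    ("teacher_b", "tes"),        # core cluster
    ("artist_x", "artist_y"),    # a smaller, separate group
]

def clusters(links):
    """Group accounts into sets of mutually reachable accounts
    (connected components), largest group first."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    seen, groups = set(), []
    for node in graph:
        if node in seen:
            continue
        queue, group = deque([node]), set()
        while queue:
            current = queue.popleft()
            if current in group:
                continue
            group.add(current)
            queue.extend(graph[current] - group)
        seen |= group
        groups.append(group)
    return sorted(groups, key=len, reverse=True)

print(clusters(links))
```

A real grouping algorithm also subdivides the big connected blob by how densely accounts link to each other, which is how the core cluster and the smaller groups in the graphic emerge.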
Using my personal access to Twitter restricts the number of results I can get (and producing a graph draws heavily on the processing power of my laptop, even though it’s quite a high specification), but the University of Southampton has its own access to the Twitter stream, so bigger and better things could be produced. As well as the graph, I now have a collection of Twitter IDs, links to individuals’ websites and blog URLs, and other data such as any hashtags used. All of this can be used to expand the network and extract other members of the community. It is also entirely possible that, as well as discovering the public writings of artists and designers, we would uncover photographic evidence of their work, if they have provided it.
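The expansion step is equally mechanical: hashtags harvested from one search become the seeds for the next. A minimal sketch, again on invented data:

```python
import re

# Hypothetical harvested records: profile bio plus tweet text
records = [
    {"bio": "Teacher and blogger", "text": "Great #EduTwitter thread on #behaviour"},
    {"bio": "Artist", "text": "New piece up on the blog #ceramics"},
]

def extract_hashtags(records):
    """Pull every hashtag out of the collected tweet text; each tag
    becomes a new search seed for expanding the network."""
    tags = set()
    for record in records:
        tags.update(t.lower() for t in re.findall(r"#(\w+)", record["text"]))
    return tags

print(extract_hashtags(records))
```

Each tag found this way can be fed back into the search, pulling in accounts the original query missed – the snowballing that lets a small seed query grow into a map of a community.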
What we won’t get, of course, is people who don’t use social media in any form, but we may still pick up names that can be linked back to publications.
To finish, then, let me summarise these last two posts thus: the Arts deserve their own ‘Web of…’ to support their work, improve funding, and highlight the work of outstanding individuals in the field; wherever possible this should include examples of work (or work in progress) recorded digitally; writing (and images) posted on the Web by the artist or designer can represent a powerful archive for the community and for the work of future historians and curators, and should be documented in some way; AI has some potential to improve the way we search for artefacts; and tools from computer science can be usefully deployed in the search-and-gather process.