Having successfully divided my data set up into separate years yesterday, I thought I’d go back to basics and have a look at stopwords.
In language processing, it’s apparent that there are quite a few words that add absolutely no value to a text. These are words like ‘a’, ‘all’, ‘with’ etc. NLTK (Natural Language Tool Kit, a module that can be used to process text in various ways; you can have a play with it here) has a list of 127 words that could be considered the most basic ones. Scikit-learn (which I’m using for some of the more complicated text processing algorithms) uses a list of 318 words taken from research carried out by the University of Glasgow. A research paper published by them makes it clear that a fixed list is of limited use, and in fact a bespoke list should be produced if the corpus is drawn from a specific domain, as I’m doing with my blogs written by teachers and other Edu-professionals.
Basically, the more frequently a word is used in a corpus, the less useful it is. For example, if you were presented with a database of blogs written by teachers, and you wanted to find the blogs written about ‘progress 8’, that’s what your search term would be, possibly with some extra filtering words like ‘secondary’ and ‘England’. You would know not to bother with ‘student’, ‘children’ or ‘education’ because they’re words you’d expect to find in pretty much everything. Those words are often referred to as ‘noise’.
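To make the idea of ‘noise’ a bit more concrete, here is a minimal sketch that counts how many documents each word appears in. The four-line corpus and the helper names (`tokenize`, `document_frequency`) are invented for illustration and are not the real data set or any library's API; it only uses the Python standard library.

```python
from collections import Counter
import re

# Toy stand-in for a corpus of teacher blogs (invented for illustration).
blogs = [
    "Every student in my class made progress this term",
    "Progress 8 is changing how secondary schools in England report results",
    "A student asked me why education matters",
    "Education policy affects every student and every teacher",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def document_frequency(docs):
    """Count, for each word, how many documents contain it at least once."""
    counts = Counter()
    for doc in docs:
        # set() so a word repeated within one document is counted once
        counts.update(set(tokenize(doc)))
    return counts

df = document_frequency(blogs)

# Share of documents containing each word: the closer to 1.0, the less
# the word helps to tell one blog from another.
share = {word: n / len(blogs) for word, n in df.items()}

for word in ("student", "education", "england"):
    print(word, df[word], share[word])
```

In this toy corpus ‘student’ turns up in three of the four documents, while ‘england’ appears in only one, which is why the latter is the more useful search term.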
The problem is that if the word ‘student’ was taken out of the corpus altogether, and treated as a stopword, that might have an adverse effect on the subsequent analysis of the data. In other words, just because the word is used frequently doesn’t make it ‘noise’. The bigger problem, then, is how to decide which of the most frequently used terms in a corpus can safely be removed. And of course there’s the issue of whether the words on the default list should be used as well.
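One simple way to explore this trade-off, sketched below, is to flag words that appear in a large share of documents but let a hand-picked ‘keep’ list override the flag. This is only an illustration under assumptions: the function name `candidate_stopwords`, the 0.75 cut-off, the toy documents and the three-word `default_list` are all made up, and it is not the method from the Glasgow paper.

```python
from collections import Counter
import re

def candidate_stopwords(docs, min_share=0.75, keep=frozenset()):
    """Return words found in at least min_share of docs, excluding `keep`."""
    doc_counts = Counter()
    for doc in docs:
        # Count each word once per document.
        doc_counts.update(set(re.findall(r"[a-z0-9]+", doc.lower())))
    threshold = min_share * len(docs)
    return {
        word
        for word, n in doc_counts.items()
        if n >= threshold and word not in keep
    }

# Toy documents, invented for illustration.
docs = [
    "the student made progress",
    "the student enjoyed the lesson",
    "the teacher praised the student",
    "the school published results",
]

# Tiny stand-in for a generic default list such as NLTK's or scikit-learn's.
default_list = {"the", "a", "with"}

# Frequency alone flags 'student' as noise...
flagged = candidate_stopwords(docs, min_share=0.75)

# ...but a keep list can protect terms that matter for later analysis.
protected = candidate_stopwords(docs, min_share=0.75, keep={"student"})

# Whether to merge with the default list is a separate decision.
combined = default_list | protected

print(sorted(flagged))
print(sorted(protected))
print(sorted(combined))
```

The keep list is the manual part: frequency can suggest candidates, but someone still has to decide whether a word like ‘student’ carries meaning for the analysis that follows.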
The paper I referred to above addresses this very problem, with some success. I’m still trying to understand exactly how it works, but it seems to be based on the idea that a frequently-used word may in fact be an important search term. And the reason I’ve spent so much time on this is that the old adage ‘rubbish in, rubbish out’ is indeed true, and before I go any further with the data I have, I at least need to understand the factors that may impact the results.