I thought I’d take a quick look at what difference using a lemmatiser instead of a Snowball stemmer makes to k-means clustering, using just my group of labelled blogs. Here’s the silhouette plot, with the blogs grouped by their labels:
Remember, the closer a blog’s score is to 0, the closer it sits to the boundary between two categories, so it could plausibly belong in a different one. A negative score suggests it is probably in the wrong category.
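As a reminder of where those scores come from, here is a minimal sketch of the silhouette calculation in plain Python. It isn't the code behind the plots: the function name `silhouette_scores` and the toy 2-D points are made up for illustration, it assumes Euclidean distance, and it needs at least two clusters. For each point, `a` is the mean distance to the rest of its own cluster, `b` is the mean distance to the nearest other cluster, and the score is `(b - a) / max(a, b)`.

```python
from math import dist


def silhouette_scores(points, labels):
    """Return one silhouette coefficient per point (Euclidean distance)."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores


# Toy data (hypothetical): two tight pairs plus one point exactly midway
# between them, which has been assigned to cluster 0.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (2.5, 0.5)]
labels = [0, 0, 1, 1, 0]
scores = silhouette_scores(points, labels)
```

The midway point is equally far from both clusters, so its score comes out at 0, while the points in the tight pairs score well above 0.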
Here’s the same data, this time with the number of categories set to 6, but grouped according to the category the algorithm has calculated as being the most appropriate.
There appear to be blogs that, at least according to k-means, end up in a category containing a variety of different labels. The algorithm isn’t learning anything, though; it’s just making decisions based on the scores of the tokens in each blog, nothing else. I simply wanted to see if lemmatising the blogs instead of stemming made much of a difference.
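For anyone wanting to try something similar, here is a rough sketch of the kind of pipeline involved, assuming scikit-learn and TF-IDF token scores. It is not the exact code behind the plots: the helper names `cluster_blogs` and `labels_per_cluster`, the toy texts and the labels are all invented for illustration, and the texts are assumed to have been stemmed or lemmatised already. The toy corpus only has enough documents for two clusters; the plots here used 6.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_samples


def cluster_blogs(texts, n_clusters, seed=0):
    """Cluster texts on TF-IDF token scores; return clusters and silhouettes."""
    tfidf = TfidfVectorizer().fit_transform(texts)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    assigned = km.fit_predict(tfidf)
    # One silhouette score per text, based on the assigned clusters.
    return assigned, silhouette_samples(tfidf, assigned)


def labels_per_cluster(assigned, labels):
    """Count how many texts with each hand-applied label land in each cluster."""
    table = {}
    for cluster, label in zip(assigned, labels):
        table.setdefault(int(cluster), Counter())[label] += 1
    return table


# Hypothetical, already-normalised toy "blogs" and their labels.
texts = [
    "teaching maths lesson pupils",
    "maths lesson homework pupils",
    "pupils maths teaching classroom",
    "school budget funding policy",
    "funding policy government budget",
    "government policy school funding",
]
labels = ["teaching", "teaching", "teaching", "policy", "policy", "policy"]

assigned, sil = cluster_blogs(texts, n_clusters=2)
table = labels_per_cluster(assigned, labels)
```

Note that the labels are only used afterwards, to count what ended up where. k-means never sees them, which is why a cluster can contain a mix of labels.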
Here are the same parameters as above, but using the Snowball stemmer as before:
And side-by-side (Snowballing / Lemmatising):
The answer is: overall, not that I can see.