Developing Categories, Part 4

I thought I’d have a quick look at what difference it makes to k-means clustering if I use a lemmatiser instead of a snowball stemmer, using just my group of labelled blogs.  Here’s the silhouette plot, grouped by label:

SilPlotLemm

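Remember, the closer the score is to 0, the more likely it is that the blog sits on the boundary between two categories and could just as well be in a different one.  A negative score suggests the blog is actually closer to another category than to its own.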
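As a reminder of where those scores come from, here is a minimal sketch of the silhouette calculation in plain Python. It is not the code behind the plots; the points and labels are made up for illustration, and `silhouette_values` is a hypothetical helper name. For each point, `a` is the mean distance to the other members of its own group, `b` is the mean distance to the nearest other group, and the score is `(b - a) / max(a, b)`.

```python
from math import dist

def silhouette_values(points, labels):
    """Return the silhouette coefficient of every point."""
    # Collect the indices of the points that belong to each group.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for a one-member group
            continue
        # a: mean distance to the other members of the same group
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other group
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores

# Made-up 2-D data: two tight groups plus one point halfway between them.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (2.5, 3)]
labels = [0, 0, 1, 1, 0]
scores = silhouette_values(points, labels)

for point, label, score in zip(points, labels, scores):
    print(point, label, round(score, 3))
```

The last point is equally far from both groups, so its score comes out at 0, which is exactly the "could be in a different category" case.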

Here’s the same data, this time with the number of categories set to 6, but grouped according to the category the algorithm has calculated as being the most appropriate.

SilPlotLemmV2

There appear to be blogs that, at least according to k-means, are in a category with a variety of different labels.  The algorithm isn’t learning anything, though; it’s just making decisions based on the scores of tokens in the blog, nothing else.  I simply wanted to see if lemmatising the blogs instead of stemming made much of a difference.
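To make that concrete, here is a toy sketch of what k-means does with token scores, again in plain Python rather than the library code used for the plots. Everything in it is an assumption for illustration: the three token columns, the scores, the labels, the two clusters (instead of 6) and the starting centroids are all invented.

```python
from collections import Counter
from math import dist

def kmeans(vectors, centroids, iterations=10):
    """Plain Lloyd's algorithm: assign to the nearest centroid, then re-average."""
    assignment = []
    for _ in range(iterations):
        # Each blog goes to whichever centroid its token scores are closest to.
        assignment = [
            min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))
            for v in vectors
        ]
        # Each centroid moves to the mean of the blogs assigned to it.
        for c in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical scores for the tokens ("pupil", "lesson", "policy").
vectors = [
    [0.9, 0.8, 0.0],
    [0.8, 0.9, 0.1],
    [0.1, 0.0, 0.9],
    [0.0, 0.2, 0.8],
    [0.2, 0.1, 0.7],  # labelled "teaching", but mostly about policy
]
labels = ["teaching", "teaching", "politics", "politics", "teaching"]

# Fixed starting centroids so the result is repeatable.
assignment = kmeans(vectors, [list(vectors[0]), list(vectors[2])])

# Count which labels ended up in each cluster.
labels_per_cluster = {}
for cluster, label in zip(assignment, labels):
    labels_per_cluster.setdefault(cluster, Counter())[label] += 1
print(labels_per_cluster)
```

The labels never go into `kmeans` at all; they are only used afterwards for counting, which is why a cluster can end up holding a mix of labels.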

Here are the same parameters as above, but using the snowball stemmer as before:

SilPlotSnow

And side-by-side (Snowballing / Lemmatising):

SilPlotSnow                                       SilPlotLemmV2

 

The answer is: overall, not that I can see.
