One of the coolest parts of all this clustering is how you find hidden trends in the data. For example: I took a look at this dataset of the passengers and crew of the RMS Titanic, with their class, sex, age, and survival status (did they die or not). Found that here: hakank.org/weka/titanic.arff. Anyways, I was fooling around with it and came up with some fantastic results. With the Simple K Means clustering tool in Weka, I clustered the <expletive deleted> out of it. I started off with just two clusters on a training set. That gave me some USELESS information. There were two groups: adult male crewmen that died, and 3rd class adult females that lived. Not really useful. I started to arbitrarily increase the number of clusters, and that’s when things started to get interesting.
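For anyone wondering how k-means even works on attributes like "3rd" or "female": as far as I understand it, the distance between two rows is basically the number of attributes where they disagree, and a cluster's centroid is the most common value of each attribute. Here is a minimal Python sketch of that idea. It is not Weka's actual code, the function names are my own, and `toy_rows` is a handful of made-up rows for illustration, not the real Titanic data.

```python
from collections import Counter


def mismatch_distance(row, centroid):
    """Count the attributes where the row and the centroid disagree."""
    return sum(1 for a, b in zip(row, centroid) if a != b)


def assign(rows, centroids):
    """Nearest centroid index for each row (ties go to the lowest index)."""
    return [
        min(range(len(centroids)),
            key=lambda c: mismatch_distance(row, centroids[c]))
        for row in rows
    ]


def update(rows, labels, centroids):
    """New centroid = per-attribute mode of the rows in each cluster."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [row for row, label in zip(rows, labels) if label == c]
        if not members:
            new_centroids.append(old)  # empty cluster keeps its old centroid
            continue
        new_centroids.append(
            tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
        )
    return new_centroids


def k_modes(rows, centroids, max_iter=500):
    """Alternate assign/update until the centroids stop changing."""
    for _ in range(max_iter):
        labels = assign(rows, centroids)
        new_centroids = update(rows, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Made-up rows (class, age, sex, survived), NOT the real dataset.
toy_rows = [
    ("crew", "adult", "male", "no"),
    ("crew", "adult", "male", "no"),
    ("crew", "adult", "male", "yes"),
    ("3rd", "adult", "female", "yes"),
    ("3rd", "adult", "female", "yes"),
    ("3rd", "child", "female", "no"),
]

# Start from two of the rows as initial centroids (an arbitrary choice).
toy_centroids, toy_labels = k_modes(toy_rows, [toy_rows[0], toy_rows[3]])
for centroid in toy_centroids:
    print(" / ".join(centroid))
```

With only two clusters, each centroid has to stand in for a lot of very different people, which is why k=2 tells you so little.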
I got up to 16 clusters.
Here are the results:
=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 16 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation: relation
Instances: 2201
Attributes: 4
class
age
sex
survived
Test mode: split 75% train, remainder test
=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 2
Within cluster sum of squared errors: 67.0
Missing values globally replaced with mean/mode
Cluster centroids:
                     Cluster#
Attribute Full Data  0      1      2      3      4      5      6      7      8      9      10     11     12     13     14     15
          (2201)     (168)  (96)   (670)  (192)  (387)  (48)   (93)   (13)   (141)  (118)  (90)   (75)   (17)   (16)   (57)   (20)
====================================================================================================================================
class     crew       2nd    3rd    crew   crew   3rd    3rd    2nd    2nd    1st    1st    3rd    3rd    3rd    2nd    1st    crew
age       adult      adult  adult  adult  adult  adult  child  adult  adult  adult  adult  adult  adult  child  child  adult  adult
sex       male       male   female male   male   male   male   female female female male   female male   female male   male   female
survived  no         no     no     no     yes    no     no     yes    no     yes    no     yes    yes    no     yes    yes    yes
=== Model and evaluation on test split ===
kMeans
======
Number of iterations: 3
Within cluster sum of squared errors: 66.0
Missing values globally replaced with mean/mode
Cluster centroids:
                     Cluster#
Attribute Full Data  0      1      2      3      4      5      6      7      8      9      10     11     12     13     14     15
          (1650)     (127)  (86)   (507)  (163)  (68)   (282)  (44)   (67)   (107)  (72)   (5)    (10)   (16)   (66)   (27)   (3)
====================================================================================================================================
class     crew       2nd    1st    crew   crew   3rd    3rd    1st    3rd    1st    2nd    2nd    2nd    3rd    3rd    3rd    1st
age       adult      adult  adult  adult  adult  adult  adult  adult  adult  adult  adult  child  adult  child  adult  child  adult
sex       male       male   male   male   male   female male   male   male   female female male   male   female female male   female
survived  no         no     no     no     yes    yes    no     yes    yes    yes    yes    yes    yes    no     no     no     no
Clustered Instances
0 40 ( 7%)
1 32 ( 6%)
2 166 ( 30%)
3 49 ( 9%)
4 22 ( 4%)
5 105 ( 19%)
6 18 ( 3%)
7 21 ( 4%)
8 34 ( 6%)
9 21 ( 4%)
10 6 ( 1%)
11 4 ( 1%)
12 1 ( 0%)
13 23 ( 4%)
14 8 ( 1%)
15 1 ( 0%)
As you can see from the test split, the two biggest clusters by far are number 2 and number 5. Number 2 is adult male crew members that died. Number 5 is 3rd class adult males that died. The two largest groups. Dead. It makes me think of WOMEN AND CHILDREN FIRST! Then you look at cluster 13: 3rd class adult females that died. A bit sickening if you think about it too much. Anyways.
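If you don't feel like squinting at the columns, the same reading can be double-checked with a few lines of Python. This is just a sketch: the centroids and counts below are copied by hand from the test split section of the Weka output above, and the variable names are my own.

```python
# Test-split centroids (class, age, sex, survived), copied from the output above.
centroids = {
    0: ("2nd", "adult", "male", "no"),
    1: ("1st", "adult", "male", "no"),
    2: ("crew", "adult", "male", "no"),
    3: ("crew", "adult", "male", "yes"),
    4: ("3rd", "adult", "female", "yes"),
    5: ("3rd", "adult", "male", "no"),
    6: ("1st", "adult", "male", "yes"),
    7: ("3rd", "adult", "male", "yes"),
    8: ("1st", "adult", "female", "yes"),
    9: ("2nd", "adult", "female", "yes"),
    10: ("2nd", "child", "male", "yes"),
    11: ("2nd", "adult", "male", "yes"),
    12: ("3rd", "child", "female", "no"),
    13: ("3rd", "adult", "female", "no"),
    14: ("3rd", "child", "male", "no"),
    15: ("1st", "adult", "female", "no"),
}

# "Clustered Instances" counts for the held-out 25%, copied from the output above.
counts = {0: 40, 1: 32, 2: 166, 3: 49, 4: 22, 5: 105, 6: 18, 7: 21,
          8: 34, 9: 21, 10: 6, 11: 4, 12: 1, 13: 23, 14: 8, 15: 1}

total = sum(counts.values())
# Biggest clusters first; clusters with equal counts keep their numeric order.
ranked = sorted(counts, key=counts.get, reverse=True)

for cluster in ranked:
    share = 100 * counts[cluster] / total
    profile = " / ".join(centroids[cluster])
    print(f"cluster {cluster:2d}: {counts[cluster]:3d} ({share:4.1f}%)  {profile}")
```

The counts add up to 551, which is the 25% of the 2201 instances held out for testing.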
The most fun part was actually visualizing this data in Weka. I was able to set up a neat chart that grouped all the clusters in different ways. One of the most interesting views was the survival rate among the classes, taking sex into account. When you turn up the jitter, you see the data and understand what it really says. Your brain recognizes patterns, like the fact that only a few female crew members perished. Or that even though 1st class was one of the smaller groups, it had more survivors than either 2nd or 3rd class.
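In case jitter is new to you: with categorical data, every "crew / no" row lands on exactly the same spot, so a plot shows one dot whether there are 3 rows or 600. Jitter nudges each point by a small random amount so the pile becomes visible. Here is a rough Python sketch of the idea. It is not how Weka implements its jitter slider, and the numeric codes, the `amount` value, and the sample rows are all my own assumptions for illustration.

```python
import random

# Hypothetical numeric positions for the categories on each axis.
CLASS_CODES = {"1st": 0, "2nd": 1, "3rd": 2, "crew": 3}
SURVIVED_CODES = {"no": 0, "yes": 1}


def jitter(value, amount, rng):
    """Shift a coordinate by a random offset in [-amount, amount]."""
    return value + rng.uniform(-amount, amount)


def jittered_points(rows, amount=0.3, seed=10):
    """Turn (class, survived) rows into jittered (x, y) plot coordinates."""
    rng = random.Random(seed)  # fixed seed so the plot is reproducible
    return [
        (jitter(CLASS_CODES[cls], amount, rng),
         jitter(SURVIVED_CODES[survived], amount, rng))
        for cls, survived in rows
    ]


# Made-up sample rows, NOT the real dataset: three identical, then two identical.
sample_rows = [("crew", "no")] * 3 + [("1st", "yes")] * 2
points = jittered_points(sample_rows)

for (cls, survived), (x, y) in zip(sample_rows, points):
    print(f"{cls:>4} / {survived:<3} -> ({x:.2f}, {y:.2f})")
```

Keeping `amount` well under 0.5 means the clouds for neighbouring categories never overlap, so you can still tell which group a dot belongs to.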
Things like that make this kind of stuff really useful and interesting. Many-eyes.com seems like a pretty good data visualizer, though it is a little bit limited. Also, their dataset upload link seems to be having issues. I wonder if they will get around to fixing that. One of the niftiest visualizers is the word tree for text data sets. An example here: CLICK ME!!!!!!!!!11!!11!!