FOLIO 9!

24 April, 2009

The whole point of our project was comparing the effectiveness of hierarchical clustering with different datasets.

One of the datasets we clustered was based on the genres of 1000+ movies. This generated a couple of really big clusters. Nothing really exciting. It made a little sense, but it wasn’t that effective at showing the proximity of the titles.

The other dataset was the keywords for each movie. This was an intense dataset to acquire and process. But in the end it was able to produce an intelligible output that made a bit of sense.

The main reason the keyword clustering was better, though, is the sheer size of the dataset. The more data there is, and the more specific it is, the better job a clustering algorithm can do.
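To make the idea concrete, here is a minimal sketch of hierarchical (agglomerative) clustering over keyword sets. It is not our project code: the movie titles and keywords are made up, and the Jaccard distance and average linkage are assumptions picked for illustration.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - (shared keywords / all keywords) for two keyword sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def cluster_distance(c1, c2, items):
    """Average-linkage distance between two clusters of titles."""
    total = sum(jaccard_distance(items[x], items[y]) for x in c1 for y in c2)
    return total / (len(c1) * len(c2))

def hierarchical_cluster(items, target=1):
    """Repeatedly merge the two closest clusters until `target` remain."""
    clusters = [[name] for name in sorted(items)]
    while len(clusters) > target:
        # Find the pair of clusters with the smallest distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]], items),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical toy data: title -> set of keywords.
movies = {
    "Movie A": {"space", "alien", "ship"},
    "Movie B": {"space", "alien", "war"},
    "Movie C": {"wedding", "romance"},
    "Movie D": {"wedding", "romance", "comedy"},
}
clusters = hierarchical_cluster(movies, target=2)
```

With only a handful of broad genres per movie, lots of titles end up at distance 0 from each other, which is one way to see why the genre run collapsed into a couple of big clusters.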

Portfolio 8 – The Ocho

23 April, 2009

Wow, optimization. Some good stuff. So there’s a couple different ways to do some optimization: Random Guessing, Hill-Climbing, Annealing, and the Genetic Algorithm.

Each one has its uses.

Random Guessing is a pretty inefficient method of finding optimal solutions to a problem. You just keep guessing random stuff until you get a good solution. Useful for only getting a semi-ok solution. It’s very inefficient because if it finds a good one it doesn’t care and picks something else crazy.
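Here’s a rough sketch of random guessing, assuming a made-up cost function (toy_cost) and a domain of (low, high) integer ranges. None of these names come from the assignment. The only "memory" it has is the best guess so far; each new guess ignores everything before it.

```python
import random

def random_optimize(domain, cost, tries=1000, rng=None):
    """Guess `tries` random solutions and keep the cheapest one seen."""
    rng = rng or random.Random()
    best, best_cost = None, float("inf")
    for _ in range(tries):
        # Every guess is independent of the previous ones.
        guess = [rng.randint(lo, hi) for lo, hi in domain]
        c = cost(guess)
        if c < best_cost:
            best, best_cost = guess, c
    return best, best_cost

def toy_cost(sol):
    """Hypothetical cost: distance of every value from 3 (lower is better)."""
    return sum((x - 3) ** 2 for x in sol)

domain = [(0, 9)] * 4
best, best_cost = random_optimize(domain, toy_cost, tries=200, rng=random.Random(0))
```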

Hill Climbing is a step above Random Guessing. It still randomly picks a starting solution, but then it looks at the neighboring solutions, moves to whichever one is better, and keeps going that way until the solutions stop getting better.
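A minimal hill-climbing sketch, under the same kind of assumptions (a made-up toy_cost and integer ranges for the domain). "Neighbors" here means solutions that differ by one step in one position, which is just one reasonable choice.

```python
import random

def hill_climb(domain, cost, rng=None):
    """Start somewhere random, then keep moving to the best neighbor."""
    rng = rng or random.Random()
    sol = [rng.randint(lo, hi) for lo, hi in domain]
    while True:
        # Neighbors differ from the current solution by +/-1 in one slot.
        neighbors = []
        for i, (lo, hi) in enumerate(domain):
            if sol[i] > lo:
                neighbors.append(sol[:i] + [sol[i] - 1] + sol[i + 1:])
            if sol[i] < hi:
                neighbors.append(sol[:i] + [sol[i] + 1] + sol[i + 1:])
        best = min(neighbors, key=cost)
        if cost(best) >= cost(sol):
            return sol  # no neighbor is better: a (possibly local) minimum
        sol = best

def toy_cost(sol):
    """Hypothetical cost with a single valley at 3 in every position."""
    return sum((x - 3) ** 2 for x in sol)

result = hill_climb([(0, 9)] * 4, toy_cost, rng=random.Random(1))
```

This toy cost only has one valley, so hill climbing always lands on the best answer. On a bumpier cost function it would stop at whatever dip is nearest to the starting point.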

Annealing attempts to solve hill climbing’s problem of getting stuck in a local minimum and never finding the global minimum, which is the best possible solution. It accepts possibly worse solutions for a bit so it can get over some humps to find a better one. But once again, whether it finds the best possible solution or just an ok one is arbitrary.
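A sketch of annealing, assuming a made-up bumpy landscape and made-up temperature settings. Worse moves are accepted with probability exp(-difference / temperature), so they get rarer as things cool off. This version also remembers the best solution it has seen, which is a small addition on top of the basic idea.

```python
import math
import random

def anneal(domain, cost, temp=10000.0, cool=0.95, rng=None):
    """Random walk that sometimes accepts worse solutions while 'hot'."""
    rng = rng or random.Random()
    sol = [rng.randint(lo, hi) for lo, hi in domain]
    best = list(sol)
    while temp > 0.1:
        # Nudge one position by one step, staying inside the domain.
        i = rng.randrange(len(domain))
        lo, hi = domain[i]
        cand = list(sol)
        cand[i] = min(max(cand[i] + rng.choice([-1, 1]), lo), hi)
        diff = cost(cand) - cost(sol)
        # Always take improvements; take worse moves with shrinking odds.
        if diff < 0 or rng.random() < math.exp(-diff / temp):
            sol = cand
        if cost(sol) < cost(best):
            best = list(sol)
        temp *= cool
    return best

# Hypothetical bumpy landscape: several local dips, lowest point at index 8.
LANDSCAPE = [5, 3, 4, 6, 2, 7, 1, 8, 0, 9]

def bumpy_cost(sol):
    return LANDSCAPE[sol[0]]

best = anneal([(0, 9)], bumpy_cost, rng=random.Random(2))
```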

The genetic algorithm tries to solve the arbitrariness to an extent. It comes up with random solutions, ranks all of them, and then branches off from the best ones, trying to combine all the good things with mutations and crossovers and whatnot.
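And a sketch of a genetic algorithm in the same style. The population size, elite fraction, mutation odds, and toy_cost are all made-up values for illustration, and the crossover assumes the domain has at least two positions.

```python
import random

def genetic_optimize(domain, cost, pop_size=30, elite_frac=0.2,
                     mutate_prob=0.2, generations=50, rng=None):
    """Rank a population, keep the best, and breed the rest from them."""
    rng = rng or random.Random()

    def mutate(sol):
        # Nudge one position by one step, staying inside the domain.
        i = rng.randrange(len(domain))
        lo, hi = domain[i]
        child = list(sol)
        child[i] = min(max(child[i] + rng.choice([-1, 1]), lo), hi)
        return child

    def crossover(a, b):
        # Front of one parent, back of the other (needs len(domain) >= 2).
        i = rng.randrange(1, len(domain))
        return a[:i] + b[i:]

    pop = [[rng.randint(lo, hi) for lo, hi in domain] for _ in range(pop_size)]
    n_elite = max(2, int(elite_frac * pop_size))
    for _ in range(generations):
        pop.sort(key=cost)          # rank: cheapest solutions first
        elite = pop[:n_elite]       # the winners survive unchanged
        pop = list(elite)
        while len(pop) < pop_size:  # refill with mutants and crossovers
            if rng.random() < mutate_prob:
                pop.append(mutate(rng.choice(elite)))
            else:
                a, b = rng.sample(elite, 2)
                pop.append(crossover(a, b))
    return min(pop, key=cost)

def toy_cost(sol):
    """Hypothetical cost: distance of every value from 3 (lower is better)."""
    return sum((x - 3) ** 2 for x in sol)

best = genetic_optimize([(0, 9)] * 4, toy_cost, rng=random.Random(3))
```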

Bottom line, though, is that none of these algorithms are perfect at what they do. But that’s not the point of optimization. Perfect is great, but pretty good is great too.

Portfolio 5 – Let’s VISUALIZE!

7 March, 2009

The hardest single thing about this assignment was finding the data to use. I racked my brain trying to think of and then find interesting data to use. The first idea I had was a casualty list from the Iraq conflict. I actually found a little something, but it really couldn’t be visualized without some serious manipulation. There were no numbers, there wasn’t any organization, it was all jumbled up in a single text file. After spending an hour or two with Google just looking for anything, I decided to take a break.

I spoke to Ross Serino about this topic and we still had no luck finding anything. And then, in a moment of epiphany (or whatever you’d like to call it), a single word popped into my head: BASEBALL. What better thing to visualize than a sport that is ALL about numbers and stats? And the interesting thing about baseball now is the steroid controversy. I recently read an article about how steroids have been sneaking into the sport since the late 90s. I approached Ross with this idea and like magic we were able to find a source.

This was no ordinary source (refer to Ross Serino’s blogpost for links): this was not only a complete statistical history of baseball, but in SQL format! Oh what luck! And then with a little help from ole PHP and Mr. JavaScript we were able to make a little chart depicting a couple things:

(1) At bats and hits
(2) Hits and home runs
(3) At bats and home runs

For each year (from 1995) we can see each player as a dot on a graph, along with all the other players. Here’s where it gets interesting: when you look at either hits/home runs or at bats/home runs, you can identify certain players that really don’t fit in with the rest of the data, especially people with abnormal home run percentages. This really doesn’t prove that a player has used steroids, but when you look at the chart you can see the ones that MIGHT have. It makes you think, though, when the abnormalities are the likes of Barry Bonds, Mark McGwire, Sammy Sosa, and Jason Giambi.
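The "doesn’t fit in with the rest" idea can be roughed out in code too. This is only a sketch with invented numbers (no real players or stats): it computes home runs per at bat and flags anyone more than a chosen number of standard deviations above the average. The threshold of 2.0 is an arbitrary assumption, and our actual chart was just eyeballed.

```python
from statistics import mean, pstdev

def flag_outliers(players, threshold=2.0):
    """Return names whose HR-per-at-bat rate is unusually high."""
    # players maps name -> (at_bats, home_runs)
    rates = {name: hr / ab for name, (ab, hr) in players.items() if ab > 0}
    mu = mean(rates.values())
    sigma = pstdev(rates.values())
    if sigma == 0:
        return []  # everyone has the same rate, so nobody stands out
    return sorted(name for name, r in rates.items()
                  if (r - mu) / sigma > threshold)

# Invented season totals, purely for illustration.
season = {
    "Player 1": (500, 12), "Player 2": (500, 15), "Player 3": (500, 14),
    "Player 4": (500, 16), "Player 5": (500, 13), "Player 6": (500, 15),
    "Player 7": (500, 14), "Player 8": (500, 13), "Player 9": (500, 14),
    "Player 10": (480, 70),
}
suspicious = flag_outliers(season)
```

Same caveat as the chart: a flag like this only says a number is unusual, not why.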

Portfolio 4 – Clusters and Visualizations

21 February, 2009

One of the coolest/sweetest/most interesting parts of all this clustering is how you find secret trends in the data. For example: I took a look at this dataset concerning passengers of the RMS Titanic, their class, sex, age, and their survival status (did they die or not die). Found that here: hakank.org/weka/titanic.arff. Anyways, I was fooling around with it and came up with some fantastic results. With the Simple K Means Clustering tool in weka, I clustered the <expletive deleted> out of it. I started off with just two cluster groups with a training set. That came with some USELESS information. There were two groups: adult male crewmen that died, and 3rd class adult females that lived. Not really useful. I started to arbitrarily increase the number of clusters, and that’s when things started to get interesting.

I got up to 16 clusters.

Here’s the results:

 

=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 16 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     relation
Instances:    2201
Attributes:   4
              class
              age
              sex
              survived
Test mode:    split 75% train, remainder test

=== Clustering model (full training set) ===

kMeans
======

Number of iterations: 2
Within cluster sum of squared errors: 67.0
Missing values globally replaced with mean/mode

Cluster centroids:
                         Cluster#
Attribute    Full Data          0          1          2          3          4          5          6          7          8          9         10         11         12         13         14         15
                (2201)      (168)       (96)      (670)      (192)      (387)       (48)       (93)       (13)      (141)      (118)       (90)       (75)       (17)       (16)       (57)       (20)
======================================================================================================================================================================================================
class             crew        2nd        3rd       crew       crew        3rd        3rd        2nd        2nd        1st        1st        3rd        3rd        3rd        2nd        1st       crew
age              adult      adult      adult      adult      adult      adult      child      adult      adult      adult      adult      adult      adult      child      child      adult      adult
sex               male       male     female       male       male       male       male     female     female     female       male     female       male     female       male       male     female
survived            no         no         no         no        yes         no         no        yes         no        yes         no        yes        yes         no        yes        yes        yes

=== Model and evaluation on test split ===

kMeans
======

Number of iterations: 3
Within cluster sum of squared errors: 66.0
Missing values globally replaced with mean/mode

Cluster centroids:
                         Cluster#
Attribute    Full Data          0          1          2          3          4          5          6          7          8          9         10         11         12         13         14         15
                (1650)      (127)       (86)      (507)      (163)       (68)      (282)       (44)       (67)      (107)       (72)        (5)       (10)       (16)       (66)       (27)        (3)
======================================================================================================================================================================================================
class             crew        2nd        1st       crew       crew        3rd        3rd        1st        3rd        1st        2nd        2nd        2nd        3rd        3rd        3rd        1st
age              adult      adult      adult      adult      adult      adult      adult      adult      adult      adult      adult      child      adult      child      adult      child      adult
sex               male       male       male       male       male     female       male       male       male     female     female       male       male     female     female       male     female
survived            no         no         no         no        yes        yes         no        yes        yes        yes        yes        yes        yes         no         no         no         no

Clustered Instances

 0       40 (  7%)
 1       32 (  6%)
 2      166 ( 30%)
 3       49 (  9%)
 4       22 (  4%)
 5      105 ( 19%)
 6       18 (  3%)
 7       21 (  4%)
 8       34 (  6%)
 9       21 (  4%)
10        6 (  1%)
11        4 (  1%)
12        1 (  0%)
13       23 (  4%)
14        8 (  1%)
15        1 (  0%)

 

As you can see from this, the two most popular clusters by far are number 2 and number 5. Number 2 is adult male crew members that died. Number 5 is 3rd class adult males that died. The two largest groups. Dead. It makes me think of WOMEN AND CHILDREN FIRST! Then you look at cluster 13: 3rd class adult females that died. A bit sickening if you think about it too much. Anyways.

The most fun part was actually visualizing this data in weka. I was able to set up a neat chart that grouped all the clusters in different ways. One of the most interesting ways was the survival rate among the classes, taking sex into account. When you turn up the jitter, you see the data and understand what it really says. Your brain recognizes patterns, like the fact that only a few female crew members perished. Or that even though 1st class was one of the smallest groups, it had more survivors, male and female, than either of the other passenger classes.

Things like that make all this kind of stuff really useful/interesting. Many-eyes.com seems like a pretty good data visualizer, though it is a little bit limited. Also, their dataset uploading link seems to be having issues. I wonder if they will get around to fixing that. One of the niftiest visualizers is the word tree for lingual data sets. An example here: CLICK ME!!!!!!!!!11!!11!!

Portfolio THREE continued

12 February, 2009

I think pylast.py may have more bugs that haven’t been addressed. I don’t know if anyone else has had this problem, but if you’ve got a track object and you try to use the “get_album()” function, it HATES YOU.

Here’s a snippet of my code.

album = track.get_album()
print album

Here’s what it tells me!
File "/Users/mattthompson/Desktop/python/recommender.py", line 53, in <module>
main()
File "/Users/mattthompson/Desktop/python/recommender.py", line 44, in main
album = track.get_album()
File "/Users/mattthompson/Desktop/python/pylast.py", line 1672, in get_album
return Album(_extract(node, "artist"), _extract(node, "title"))
TypeError: __init__() takes exactly 6 arguments (3 given)

One huge issue I have with this API is the severe lack of helpful documentation. The last.fm site has what seems like a list of all the methods, but they really aren’t the methods, and the parameters are often incorrect. WHY?! The answer is: I don’t know.

To work around that little issue I had, I redesigned my program to display the top albums and find similar albums in a fashion like the songs. But turns out, there’s no get_similar() feature in the album class. The closest thing to that was “get_top_tags()” which was broken. :( I abandoned that feature.

Final Thought: pylast.py: a semi-enjoyable headache

Portfolio Three – Interesting…

12 February, 2009

Turns out that the last.fm stuff is pretty nifty and fun to play with. Team the FEAR got together and we worked out the kinks together. The toughest part was actually figuring out what the key and the api key was. Once we got all that figured out, and how to reference all the methods so they actually did something instead of spitting out errors, it was smooth sailing.

After we all got the groundwork laid, we pretty much went off on our own to try to make the best app. Well, sort of. “Best” isn’t the right term. How about “different”? That one is much better. Anyways. In my take on the project, I ask a user to input an artist/band, spit back 10 similar artists, get the top tracks, take the weight of the tracks and average it out, then print out only the tracks that are above the average weight. Now the user should be able to select a track and get similar albums based on that, but python likes to tell me that my tabs are spaces. EVEN WHEN THEY ARE NOT…
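The "only tracks above the average weight" step looks roughly like this. It’s a sketch on plain (name, weight) tuples with invented values, not the actual pylast objects or calls, so the data shape here is an assumption.

```python
def above_average(tracks):
    """Keep only the (name, weight) pairs whose weight beats the mean."""
    if not tracks:
        return []
    avg = sum(weight for _, weight in tracks) / len(tracks)
    return [(name, weight) for name, weight in tracks if weight > avg]

# Invented track names and weights, for illustration only.
top_tracks = [("Track A", 100), ("Track B", 60), ("Track C", 20), ("Track D", 20)]
keepers = above_average(top_tracks)
```

As for the tabs-versus-spaces complaint, that error usually shows up when a file mixes the two, so setting the editor to insert spaces only is the usual way out.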

answer to question 4, precomputing the 5 closest users…

1 February, 2009

For this problem I wrote a function, though a little ridiculous and really not very pythony. The source code….

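def similarUsers(prefs, user, n=5):
    # Score every other user against `user`. similarity() is assumed to be
    # defined elsewhere and to take (prefs, person1, person2).
    sim = [(similarity(prefs, user, other), other)
           for other in prefs if other != user]

    # Highest similarity first, then keep only the n closest users.
    sim.sort(reverse=True)
    return sim[:n]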

So you take that function’s output and plug it in as the prefs.
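For the "precomputing" part, the idea would be to run that once for every user and stash the results in a dictionary. Here’s a self-contained sketch. The sim_distance function is a stand-in (an inverse squared-distance score over shared items), and the user names and ratings are invented.

```python
def sim_distance(prefs, p1, p2):
    """Stand-in similarity: 1 / (1 + sum of squared rating differences)."""
    shared = [item for item in prefs[p1] if item in prefs[p2]]
    if not shared:
        return 0.0
    sum_sq = sum((prefs[p1][item] - prefs[p2][item]) ** 2 for item in shared)
    return 1.0 / (1.0 + sum_sq)

def precompute_closest(prefs, similarity=sim_distance, n=5):
    """Map every user to their n most similar users, best match first."""
    closest = {}
    for user in prefs:
        scores = [(similarity(prefs, user, other), other)
                  for other in prefs if other != user]
        scores.sort(reverse=True)
        closest[user] = scores[:n]
    return closest

# Invented users and ratings, for illustration only.
prefs = {
    "ann": {"x": 5.0, "y": 3.0},
    "bob": {"x": 5.0, "y": 3.0},
    "cat": {"x": 1.0, "y": 1.0},
}
closest = precompute_closest(prefs, n=2)
```

The lookup later is then just closest[user], with no similarity math at recommendation time.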

Update on Python.

1 February, 2009

MovieLens!
Alright so the link for the MovieLens dataset is broken. So I didn’t get to do that little snippet.

My computer hates me, python, or both. So I just googled the Tanimoto Similarity Thing, and I got a bunch of results about fingerprints and molecules. Then one wikipedia article that has some weird notation in it. Maybe magnitude? Dot something? Overall my experience with python is proving to be VERY frustrating.
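For the record, the weird notation is a dot product and magnitudes. The Tanimoto coefficient of two vectors A and B is A·B / (|A|² + |B|² − A·B), and for 0/1 vectors that works out to "items in both" divided by "items in either". A small sketch, with the function name and the zero-vector convention being my own assumptions:

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    # Two all-zero vectors share nothing; call that 0.0 by convention.
    return dot / denom if denom else 0.0

# 1 = has the item, 0 = doesn't. Two shared items out of three total.
score = tanimoto([1, 1, 0, 1], [1, 0, 0, 1])
```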

Update on the evil python

1 February, 2009

Alright! Python has reached new levels of evilness. I finally got the pydelicious module to actually work with my file. But then! OH NO!
>>>
>>> delusers=initializeUserDict('programming')

Traceback (most recent call last):
File "", line 1, in <module>
delusers=initializeUserDict('programming')
File "/Users/mattthompson/Desktop/delrecs.py", line 8, in initializeUserDict
for p2 in get_urlposts(p1['href']):
TypeError: string indices must be integers
>>>

What does that even mean? I’m running python 2.5. I don’t know if that’s what is causing the problem. I copied the code verbatim. Here’s a copy of the file.
from pydelicious import get_popular, get_userposts, get_urlposts
import time  # needed for time.sleep() in fillItems


def initializeUserDict(tag, count=5):
    user_dict = {}
    # get the top `count` popular posts
    for p1 in get_popular(tag=tag)[0:count]:
        # find all users who posted this
        for p2 in get_urlposts(p1['href']):
            user = p2['user']
            user_dict[user] = {}
    print(user_dict)  # debug output
    return user_dict


def fillItems(user_dict):
    all_items = {}
    # Find links posted by all users
    for user in user_dict:
        posts = []  # stays empty if all three attempts fail
        for i in range(3):
            try:
                posts = get_userposts(user)
                break
            except Exception:
                print("Failed user " + user + ", retrying")
                time.sleep(4)
        for post in posts:
            url = post['href']
            user_dict[user][url] = 1.0
            all_items[url] = 1
    # Fill in missing items with 0
    for ratings in user_dict.values():
        for item in all_items:
            if item not in ratings:
                ratings[item] = 0.0

That is EXACTLY what the book has. I’m pretty sure anyways. Like 99% sure. I don’t even know. Python is the devil. Updates on a book assignment later.
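For what it’s worth, that TypeError is what Python says whenever a string gets indexed with another string, like some_string['href']. So one guess (and it is only a guess) is that p1 was a plain string instead of a dictionary when the loop ran. The snippet below just reproduces the error message on a made-up value, with no pydelicious involved.

```python
# A made-up value standing in for p1: a plain string, not a dictionary.
post = "http://example.com/"

try:
    post['href']            # strings only accept integer positions or slices
    message = None
except TypeError as err:
    message = str(err)      # "string indices must be integers..."

# A dictionary with the same key works fine.
post_as_dict = {"href": "http://example.com/"}
href = post_as_dict['href']
```

Printing type(p1) inside the loop would be a quick way to check whether that guess holds.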

folio dos – weka and pydelicious.

1 February, 2009

Sweet. As of writing this, halftime of the Super Bowl, pydelicious does not want to install or play nice. Maybe I’ll figure it out. The good news is that weka is much nicer. It’s a pretty intuitive piece of software too. Everything just fell into place. Running every dataset through every different prediction method seems like it would be pretty fun.

On to J48 and heart disease.

=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: cleveland-14-heart-disease
Instances: 303
Attributes: 14
age
sex
cp
trestbps
chol
fbs
restecg
thalach
exang
oldpeak
slope
ca
thal
num
Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

thal = fixed_defect
| ca <= 0
| | exang = no: 50_1 (3.06/1.0)
| ca > 0: >50_1 (10.0)
thal = normal
| ca <= 0: 0
| | cp = typ_angina
| | | trestbps 50_1 (4.0/1.0)
| | | trestbps > 138: 50_1 (20.0/3.0)
| | cp = non_anginal: 50_1 (3.0)
| | | | exang = yes: <50 (2.0)
| | | restecg = normal: <50 (4.0)
| | | restecg = st_t_wave_abnormality: <50 (0.0)
thal = reversable_defect
| cp = typ_angina
| | chol <= 229: 229
| | | age 50_1 (2.0)
| | | age > 48: <50 (3.0/1.0)
| cp = asympt
| | oldpeak 50_1 (8.0/1.0)
| | | restecg = normal
| | | | trestbps <= 136
| | | | | ca <= 0: 0
| | | | | | thalach <= 151: 151: >50_1 (3.0)
| | | | trestbps > 136: >50_1 (4.0)
| | | restecg = st_t_wave_abnormality: >50_1 (0.0)
| | oldpeak > 0.6: >50_1 (57.39)
| cp = non_anginal
| | slope = up: <50 (7.39/1.0)
| | slope = flat
| | | ca <= 0
| | | | trestbps <= 122: 122: >50_1 (3.0)
| | | ca > 0: >50_1 (8.0/1.0)
| | slope = down: <50 (1.0)
| cp = atyp_angina
| | ca <= 0
| | | oldpeak <= 0.1: 0.1: >50_1 (2.75/0.75)
| | ca > 0: >50_1 (2.25/0.25)

Number of Leaves : 30

Size of the tree : 51

Time taken to build model: 0.03 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances 235 77.5578 %
Incorrectly Classified Instances 68 22.4422 %
Kappa statistic 0.5443
Mean absolute error 0.1044
Root mean squared error 0.2725
Relative absolute error 52.0476 %
Root relative squared error 86.5075 %
Total Number of Instances 303

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class
0.83 0.29 0.774 0.83 0.801 0.809 50_1
0 0 0 0 0 ? >50_2
0 0 0 0 0 ? >50_3
0 0 0 0 0 ? >50_4
Weighted Avg. 0.776 0.235 0.776 0.776 0.774 0.809

=== Confusion Matrix ===

a b c d e <-- classified as
137 28 0 0 0 | a = 50_1
0 0 0 0 0 | c = >50_2
0 0 0 0 0 | d = >50_3
0 0 0 0 0 | e = >50_4

There’s the output from running the dataset through the J48 decision tree classifier with num as the class. As you can read above, the method was correct roughly 4 out of 5 times (77.6%) in its prediction of num. That’s pretty good. I’m not really sure what num means in the dataset though. One of J48’s limitations is that it can only predict nominal attributes. The best prediction rate was like 85% right. That’s even better. It was for the fbs category though. I don’t know what that means. But weka can predict t and f’s like no one’s business.
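The percentages weka prints are simple ratios of the counts in the output above. A quick check, using only numbers that appear in that output (the variable names are mine):

```python
# Counts copied from the weka summary above.
correct, incorrect = 235, 68
total = correct + incorrect

# Overall accuracy: share of instances classified correctly.
accuracy = correct / total

# TP rate for the first class: 137 right out of the 137 + 28 in its row
# of the confusion matrix.
tp_rate = 137 / (137 + 28)

print("accuracy: %.4f %%" % (accuracy * 100))
print("TP rate:  %.2f" % tp_rate)
```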

Right now python is my enemy. I’ll post an update later.
