My Machine Learning Learning Experience (Part 3): Final Programming Assignment

April 7th 2016 (Thu) - April 8th 2016 (Fri)

Finished the final programming assignment from my school.

It's been a week since my last update. I spent some time writing notes on what I'd learned so far, and I will share them here once I'm done. In the meantime I also tried to catch up on my networking class. I've fallen way, way, way behind, but since the material is really boring, I ended up watching both seasons of Daredevil and also 11.22.63 (I specifically waited until Tuesday, when the finale came out, so I could watch the episodes back to back, because I'm a huge Stephen King fan. I'm really messed up...). Okay, back to the assignment. This may be the easiest assignment I've done so far, because this time it's all about k-means.

First Task

I had already talked about k-means in the previous post (link), so I'm just gonna jump right ahead to the coding part. My first task is simply to see how k-means performs its magic. From the starter code, I have 1000 data points generated with sklearn's make_blobs() function from five centers:


from sklearn.datasets import make_blobs

centers = [[3,5], [5,1], [8,2], [6,8], [9,7]]
X, y = make_blobs(n_samples=1000, centers=centers, cluster_std=0.5, random_state=3320)

My first job is to plot all these points and color them accordingly:

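Here's a minimal sketch of how such a plot can be produced, assuming matplotlib (the styling and file name are my own choices, not the assignment's):


import matplotlib.pyplot as plt

# color each point by its true cluster label y
plt.scatter(X[:, 0], X[:, 1], c=y, s=10)
plt.title('1000 points drawn from five centers')
plt.savefig('ex1_true_labels.png')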

Then of course, my next job is to run k-means and see if it can truly cluster these five groups of points:


from sklearn.cluster import KMeans

# n_clusters is varied from two to six below
clf = KMeans(n_clusters=n_clusters)
clf.fit(X)
pred = clf.predict(X)

I tried numbers of clusters from two to six, and I made the outputs into a nice gif so you can see the progression nicely:



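For reference, the frames came from a loop roughly like this (a sketch reusing the imports above; the file names are my own):


for n_clusters in range(2, 7):
    clf = KMeans(n_clusters=n_clusters)
    pred = clf.fit_predict(X)

    # color each point by its predicted cluster
    plt.figure()
    plt.scatter(X[:, 0], X[:, 1], c=pred, s=10)
    plt.title('k-means with ' + str(n_clusters) + ' clusters')
    plt.savefig('ex1_n_clusters_' + str(n_clusters) + '.png')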
Awesome. And then I used four evaluation metrics to confirm that five clusters really is the best we can get.


from sklearn import metrics

# the first three metrics compare the predicted labels against the true labels y;
# the silhouette coefficient only looks at the data and the predicted labels
ari = metrics.adjusted_rand_score(y, clf.labels_)
ami = metrics.adjusted_mutual_info_score(y, clf.labels_)
v_measure = metrics.v_measure_score(y, clf.labels_)
silhouette_coeff = metrics.silhouette_score(X, clf.labels_)

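To get scores for every cluster count, I wrapped the above in a loop along these lines (a sketch reusing the imports above; the variable names are mine):


ks = range(2, 7)
scores = {'ARI': [], 'AMI': [], 'V-measure': [], 'Silhouette': []}
for k in ks:
    clf = KMeans(n_clusters=k).fit(X)
    scores['ARI'].append(metrics.adjusted_rand_score(y, clf.labels_))
    scores['AMI'].append(metrics.adjusted_mutual_info_score(y, clf.labels_))
    scores['V-measure'].append(metrics.v_measure_score(y, clf.labels_))
    scores['Silhouette'].append(metrics.silhouette_score(X, clf.labels_))

# one curve per metric; the peaks should line up at the true cluster count
plt.figure()
for name, vals in scores.items():
    plt.plot(list(ks), vals, marker='o', label=name)
plt.xlabel('number of clusters')
plt.ylabel('score')
plt.legend()
plt.savefig('ex1_metric_scores.png')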

And here're the results:


See those peaks over there? Everything works out great!

Task 2

Task 2 is more or less the same. Actually, I started working on Task 2 before moving on to Task 1, and I pretty much moved the code I wrote here back into Task 1 to finish that task. Instead of playing with synthetic data, this time I'm working on LeCun's MNIST dataset, which contains images of handwritten digits.

My starter code has already imported the images for me into a nice matrix X. Each image is 28x28 pixels and 10000 of them are imported, so X is a 10000x784 matrix where each row represents one image; it has 784 columns because the 28x28 pixels of each image are serialized into a single row.

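So recovering any single image is just a reshape away. A quick illustration (my own, not part of the assignment code):


import matplotlib.pyplot as plt

# undo the serialization: one 784-long row back into a 28x28 image
plt.imshow(X[0].reshape(28, 28), cmap=plt.cm.gray)
plt.savefig('first_digit.png')
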
And now it's my turn to do some work. I started by performing PCA while maintaining 0.95 POV (proportion of variance):


from sklearn.decomposition import PCA

pca = PCA(n_components=0.95).fit(X)
X_pca = pca.transform(X)

(Only two lines of code, so nice, right? However, what I did at first was reuse the code I wrote for the last assignment to plot the POV and find exactly how many components I needed to maintain 0.95 POV. That was unnecessary.)
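If you're curious how many components that actually keeps, the fitted object will tell you (a quick check, assuming the pca object from above):


# how many components survived, and how much variance they explain
print(pca.n_components_)
print(pca.explained_variance_ratio_.sum())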

Then I used the reduced data to do k-means clustering and calculated the scores for the four evaluation metrics:


from sklearn.cluster import KMeans
clf = KMeans(n_clusters=n_clusters)
clf.fit(X_pca)    # cluster the PCA-reduced data, not the raw pixels

from sklearn import metrics
ari = metrics.adjusted_rand_score(y, clf.labels_)
ami = metrics.adjusted_mutual_info_score(y, clf.labels_)
v_measure = metrics.v_measure_score(y, clf.labels_)
silhouette_coeff = metrics.silhouette_score(X_pca, clf.labels_)

I ran that code using eight to twelve clusters, then plotted the centers of all the clusters as images. This part is a little tricky: since I've been working with the PCA-reduced data, I have to transform the cluster centers back to the original space to recover the original 28x28 images. Turns out, it was really easy:


import matplotlib.pyplot as plt

def show_images(n_clusters, kmean, pca):

    # kmean.cluster_centers_ is an (n_clusters, n_components) array
    # (84 components in my run), so we transform every center vector
    # back to the original 784-dimensional pixel space,
    # then reshape it into the original 28x28 matrix
    cluster_centers_org = pca.inverse_transform(kmean.cluster_centers_)

    # plot all cluster centers in a single row and save the figure
    n_row = 1
    n_col = n_clusters
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(cluster_centers_org[i].reshape(28, 28), cmap=plt.cm.gray)
        title_text = 'Center ' + str(i + 1)
        plt.title(title_text, size=12)
        plt.xticks(())
        plt.yticks(())

    plt.savefig('ex2_n_clusters_' + str(n_clusters) + '.png')
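
For the record, I drove it with a loop roughly like this (a sketch; it assumes the X_pca and pca objects from above):


for k in range(8, 13):
    clf = KMeans(n_clusters=k).fit(X_pca)
    show_images(k, clf, pca)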

And here's what I got:

number of clusters is 8

number of clusters is 9

number of clusters is 10

number of clusters is 11

number of clusters is 12

I actually thought I would see all the digits from 0 to 9 when I used 10 clusters, but it turned out that wasn't the case. What a bummer. I blame it on the similarity between some digits, like 1 and 7, or 4, 8 and 9. After plotting the scores of the metrics I was done, and I'm too lazy to find out why.


You can check out my source code here: 



Kev


