CS 330: Algorithms: Design & Practice

Homework 4: K-Means
Due: In class on Thursday, March 17, 2016

Presentation By: Hanna and Mary

A simple but effective clustering algorithm is known as k-means. Clustering is used in pattern recognition to automatically group data based on a similarity metric. In the case of K-means clustering, the similarity metric is a distance function. The algorithm is as follows:

There are 2 tricky parts of k-means. 1. Deciding the k value for k-means is difficult because favoring clusters with low variance tends to yield more clusters. For the purpose of this assignment we will try between 2 and 6 clusters, and try to guess the right number based on how the clusters look to the naked eye. (Plotting 2 dimensions at a time will help for this part.) 2. Deciding distance based on the features is difficult if the class is based on a different coordinate system, or if there are multiple features available. For the purposes of this assignment, we will use the Euclidean distance as our distance metric, and we will use all available features. (You may explore other subsets of features in addition to gain clarity.)

1) Implement K-means using the language of your choice.

Next, we will use this algorithm to test two datasets: (see cluster1.csv and cluster2.csv ). Here is what you have to do:

2) Use k-means with k of 2 to 6 on each of the cluster1.csv data set. Make sure you run k-means on the same k for the same data set multiple times and us the means that occur most often.

3) Repeat (2) above but this time use the cluster2.csv data set.

Note: The organization of the data sets is such that each class is grouped together and sets are of similar, if not identical, sizes, but the labels are not included.

Answer the following questions for each algorithm:

What to hand in

1. A short formal report describing your findings. Use the outputs of your programs on the questions above to frame your narrative and conclusions.

2. In the Appendix of the report include: A printout of the program that you wrote. Explain how the programs were used in the exercise.

Presentation:

Hanna and Mary will give presentations on this exercise in class on Tuesday, March 15, 2016. The presentation should include a description of the problem and the overall solution. It should also include all the other things you had to do in the implementation process, and a discussion about the classes found by k-means.


Back to CS330 Materials