So I'm looking for some help and guidance on this coding assignment. I'm new to programming, and although I understand the idea, I'm behind on the language.
Thanks
One of the most popular algorithms for performing clustering is the k-means method. The algorithm depends on the notion of distance between two points. For points with only one dimension (just single values), we can define the distance between two points x1 and x2 as |x1 - x2|. The k-means algorithm works by placing points into clusters and computing their centroids, where a centroid is defined as the average of the data points in the cluster. Specifically, the algorithm works as follows:
1. Pick k, the number of clusters.
2. Initialize clusters by picking one point (centroid) per cluster. For this assignment, you can pick the first k points as initial centroids for each corresponding cluster.
3. For each point, place it in the cluster whose current centroid it is nearest.
4. After all points are assigned, update the locations of centroids of the k clusters.
5. Reassign all points to their closest centroid. This sometimes moves points between clusters.
6. Repeat steps 4 and 5 until convergence. Convergence occurs when points don't move between clusters and centroids stabilize.
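The six steps above can be sketched as follows. This is a minimal illustration in Python 3 for one-dimensional points, not the assignment's reference solution. The function name `k_means_1d` and its return value (the cluster contents after each iteration) are assumptions made for the example, and it assumes 1 <= k <= number of points. It stops when no point changes cluster, and it leaves an empty cluster's centroid where it was.

```python
def k_means_1d(points, k):
    """Cluster 1-D points; return the cluster contents after each iteration."""
    # Step 2: the first k points are the initial centroids.
    centroids = list(points[:k])
    assignments = None
    history = []
    while True:
        # Steps 3 and 5: assign every point to its nearest centroid.
        new_assignments = []
        for p in points:
            distances = [abs(p - c) for c in centroids]
            new_assignments.append(distances.index(min(distances)))
        clusters = [[] for _ in range(k)]
        for p, label in zip(points, new_assignments):
            clusters[label].append(p)
        history.append(clusters)
        # Step 6: converged once no point moved between clusters.
        if new_assignments == assignments:
            return history
        assignments = new_assignments
        # Step 4: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = sum(members) / len(members)
```

For instance, `k_means_1d([1.0, 2.0, 10.0, 11.0], 2)` runs three iterations and ends with the clusters `[[1.0, 2.0], [10.0, 11.0]]`. The last iteration repeats the one before it, which is how the function knows nothing moved.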
Requirements
You are to create a program using Python that does the following:
1. Asks the user for a filename which contains the point data which is to be clustered (see Data File Format section for details).
2. Asks the user for the name of the output file.
3. Asks the user for the number of clusters. This is the parameter k that will be used for k-means.
4. Reads the input file and stores the points into a list.
5. Applies the k-means algorithm to find the cluster for each point.
6. Displays the points that each cluster contains after each iteration of the algorithm.
7. Writes the final cluster assignments to the output file.
YOU CANNOT USE ANY PYTHON PACKAGES FOR THIS PROGRAM (NUMPY, PANDAS, ...) - NO IMPORT STATEMENTS.
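One way to gather the three inputs from steps 1 to 3 is sketched below in Python 3, with no import statements. The function name `ask_settings` and its `ask` parameter are hypothetical, introduced only so the prompts can be exercised without a keyboard. The prompt strings are copied from the sample program output.

```python
def ask_settings(ask=input):
    """Prompt for the input file, output file, and number of clusters."""
    input_name = ask("Enter the name of the input file: ").strip()
    output_name = ask("Enter the name of the output file: ").strip()
    # int() raises ValueError if the reply is not a whole number.
    k = int(ask("Enter the number of clusters: "))
    return input_name, output_name, k
```

Calling `ask_settings()` with no argument uses the built-in `input`, so the program prompts on the console as usual.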
Additional Requirements
1. The name of your source code file should be kMeans.py. All your code should be within a single file.
2. Your code should follow good coding practices, including good use of whitespace and use of both inline and block comments.
3. You need to use meaningful identifier names that conform to standard naming conventions.
4. At the top of each file, you need to put in a block comment with the following information: your name, date, course name, semester, and assignment name.
5. The output of your program should exactly match the sample program output given at the end.
Data File Format
Let N be the number of points and Pi be the value of point i. The input file should be of the following format:
P1 P2 ... PN
Example:
1.2 2.1 4.56 2.113 2.2
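A reader for this format could look like the sketch below (Python 3, no imports). The helper names `parse_points` and `read_points` are assumptions. The parser treats any run of whitespace as a separator, so it accepts the values whether they sit on one line or one per line.

```python
def parse_points(text):
    """Convert whitespace-separated numbers into a list of floats."""
    return [float(token) for token in text.split()]


def read_points(filename):
    """Read all point values from the named file."""
    with open(filename) as data_file:
        return parse_points(data_file.read())
```

Keeping the parsing separate from the file access makes it easy to check the parser against the example values without creating a file.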
Sample Program Output
70-510, [semester] [year]
NAME: [put your name here]
PROGRAMMING ASSIGNMENT #2
Enter the name of the input file: prog2-input-data.txt
Enter the name of the output file: prog2-output-data.txt
Enter the number of clusters: 5
Iteration 1
0 [1.8]
1 [4.5, 6.5]
2 [1.1, 0.5]
3 [2.1, 3.2]
4 [9.8, 7.6, 11.32]
Iteration 2
0 [1.8, 2.1]
1 [4.5, 6.5]
2 [1.1, 0.5]
3 [3.2]
4 [9.8, 7.6, 11.32]
Iteration 3
0 [1.8, 2.1]
1 [4.5, 6.5]
2 [1.1, 0.5]
3 [3.2]
4 [9.8, 7.6, 11.32]
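The per-iteration listing above can be produced with a small helper. The sketch below (Python 3) assumes the clusters are held as a list of lists of floats. The name `format_iteration` is hypothetical.

```python
def format_iteration(number, clusters):
    """Return the text shown for one iteration, one cluster per line."""
    lines = ["Iteration " + str(number)]
    for index, members in enumerate(clusters):
        # str() of a list gives the bracketed, comma-separated form.
        lines.append(str(index) + " " + str(members))
    return "\n".join(lines)


first = [[1.8], [4.5, 6.5], [1.1, 0.5], [2.1, 3.2], [9.8, 7.6, 11.32]]
print(format_iteration(1, first))
```

Building the text first and printing it afterwards keeps the formatting easy to compare against the expected output line by line.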
Output File Contents
Point 1.8 in cluster 0
Point 4.5 in cluster 1
Point 1.1 in cluster 2
Point 2.1 in cluster 0
Point 9.8 in cluster 4
Point 7.6 in cluster 4
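The file contents above appear to list the points in input order, each with the cluster it ended up in. A sketch of that step in Python 3 follows. `format_assignments` and `write_assignments` are assumed names, and `labels[i]` is assumed to hold the final cluster index of `points[i]`.

```python
def format_assignments(points, labels):
    """Build one 'Point <value> in cluster <index>' line per point."""
    lines = []
    for point, label in zip(points, labels):
        lines.append("Point " + str(point) + " in cluster " + str(label))
    return lines


def write_assignments(filename, points, labels):
    """Write the final cluster assignments, one per line."""
    with open(filename, "w") as out_file:
        for line in format_assignments(points, labels):
            out_file.write(line + "\n")
```

With the first four points of the sample, `format_assignments([1.8, 4.5, 1.1, 2.1], [0, 1, 2, 0])` yields the first four lines shown above.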
Solution
Python 2.7 code
# The assignment forbids import statements, so none are used.
print "Please enter the input_data filename"
in_file = raw_input().strip()
print "Please enter the output filename"
out_file = raw_input().strip()
print "Please enter the number of clusters"
k = int(raw_input().strip())

# Read every whitespace-separated value in the input file as a point.
with open(in_file) as in_f:
    points = [float(value) for value in in_f.read().split()]

# The first k points serve as the initial centroids.
centroids = points[0:k]
iteration = 0
change = 1000
# Stop once the centroids barely move, or after 1000 iterations.
while change > 0.01 and iteration < 1000:
    iteration = iteration + 1
    print "Iteration", iteration

    # Assign each point to the cluster with the nearest centroid.
    clusters = []
    for i in range(0, k):
        clusters.append([])
    for point in points:
        distances = []
        for c in centroids:
            distances.append(abs(c - point))
        clusters[distances.index(min(distances))].append(point)
    for i in range(0, k):
        print i, clusters[i]

    # Update the centroids and total how far they moved.
    change = 0.0
    for i in range(0, k):
        if len(clusters[i]) > 0:
            new_centroid = sum(clusters[i]) / len(clusters[i])
        else:
            # An empty cluster keeps its previous centroid.
            new_centroid = centroids[i]
        change = change + abs(new_centroid - centroids[i])
        centroids[i] = new_centroid

# Write the final cluster assignments, grouped by cluster.
f = open(out_file, "w")
for i in range(0, len(clusters)):
    for point in clusters[i]:
        f.write("Point " + str(point) + " in cluster " + str(i) + "\n")
f.close()
Sample Input file:
1 840 221 3 4 5 222 223 224 225 2 841 842 843 844
Sample Output on command line:
Please enter the input_data filename
input.txt
Please enter the output filename
output.txt
Please enter the number of clusters
3
Iteration 1
0 [1.0, 3.0, 4.0, 5.0, 2.0]
1 [840.0, 841.0, 842.0, 843.0, 844.0]
2 [221.0, 222.0, 223.0, 224.0, 225.0]
Iteration 2
0 [1.0, 3.0, 4.0, 5.0, 2.0]
1 [840.0, 841.0, 842.0, 843.0, 844.0]
2 [221.0, 222.0, 223.0, 224.0, 225.0]
Sample Output.txt
Point 1.0 in cluster 0
Point 3.0 in cluster 0
Point 4.0 in cluster 0
Point 5.0 in cluster 0
Point 2.0 in cluster 0
Point 840.0 in cluster 1
Point 841.0 in cluster 1
Point 842.0 in cluster 1
Point 843.0 in cluster 1
Point 844.0 in cluster 1
Point 221.0 in cluster 2
Point 222.0 in cluster 2
Point 223.0 in cluster 2
Point 224.0 in cluster 2
Point 225.0 in cluster 2



