@bgarcial wrote:
Usually, numpy.linalg.norm is used to compute the Euclidean distance between data points and cluster centroids. In the KMeans implementation I am analyzing, we have the following:
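    # Imports used by the snippets in this post
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Importing the dataset
    data = pd.read_csv('xclara.csv')
    print(data.shape)
    data.head()

    (3000, 2)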
Store the V1 and V2 columns in the variables f1 and f2:

    # Getting the values and plotting them
    f1 = data['V1'].values
    f2 = data['V2'].values
    # Pair every value of column f1 with the matching value of f2
    # and store the pairs as the rows of an array
    X = np.array(list(zip(f1, f2)))
    # array X
    print(X)

    [[ 2.072345 -3.241693]
     [ 17.93671 15.78481 ]
     [ 1.083576 7.319176]
     ...
     [ 64.46532 -10.50136 ]
     [ 90.72282 -12.25584 ]
     [ 64.87976 -24.87731 ]]

    # Put the data on a scatter plot
    plt.scatter(f1, f2, c='black', s=7)
Euclidean distance calculator
    # Euclidean Distance Calculator
    # https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.linalg.norm.html
    # Getting the Euclidean distance between data points
    def dist(a, b, ax=1):
        return np.linalg.norm(a - b, axis=ax)
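To make the axis argument concrete, here is a minimal sketch on made-up points (the arrays `points` and `centroid` are illustrative and do not come from xclara.csv). It suggests that, with the default ord, np.linalg.norm(a - b, axis=1) is the familiar square root of the summed squared differences, computed once per row.

```python
import numpy as np

def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)

# Hypothetical data: three 2-D points and one centroid (not from the dataset)
points = np.array([[3.0, 4.0], [6.0, 8.0], [0.0, 0.0]])
centroid = np.array([0.0, 0.0])

# Broadcasting subtracts the centroid from every row;
# axis=1 then takes the norm of each row separately
per_row = dist(points, centroid)

# The same quantity written out by hand:
# sqrt((x1 - x2)^2 + (y1 - y2)^2) for each row
by_hand = np.sqrt(((points - centroid) ** 2).sum(axis=1))

print(per_row)  # distances 5, 10 and 0
print(by_hand)  # same values
```

So in this setting the "norm" is just the length of the difference vector a - b, which is the Euclidean distance between a and b.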
How should I understand the vector norm in the context of Euclidean distance?
    # Number of clusters
    k = 3
    # Generate random integers between 0 and the maximum value of the X array
    # (which holds the f1 and f2 columns) minus 20. These values are the
    # X and Y coordinates at which the centroids are initially placed.
    # X coordinates of random centroids
    C_x = np.random.randint(0, np.max(X)-20, size=k)
    # Y coordinates of random centroids
    C_y = np.random.randint(0, np.max(X)-20, size=k)
    print(" x coordinates" +'\n', C_x)
    print("*****")
    print(" y coordinates" +'\n', C_y)
    print("*****")

     x coordinates
     [51 41 25]
    *****
     y coordinates
     [18 76 53]
We combine the arrays C_x and C_y with zip so that a single array holds the (x, y) location of each centroid in the scatter plot:

    C = np.array(list(zip(C_x, C_y)), dtype=np.float32)
    print("Coordinates pair x,y associated inside list to" +'\n', "INITIALIZE RANDOM CENTROIDS" +'\n', C)

    Coordinates pair x,y associated inside list to
     INITIALIZE RANDOM CENTROIDS
     [[51. 18.]
     [41. 76.]
     [25. 53.]]

    # Plotting along with the centroids
    plt.scatter(f1, f2, c='#050505', s=7)
    plt.scatter(C_x, C_y, marker='*', s=200, c='g')
    # To store the value of the centroids when they are updated
    C_old = np.zeros(C.shape)
Why is it necessary to store the old coordinates of the centroids?
    # Cluster labels (0, 1, 2)
    clusters = np.zeros(len(X))
How is this distance obtained, in terms of the vector/matrix norm …
    # Error func. - Distance between new centroids and old centroids
    error = dist(C, C_old, None)
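A minimal sketch of how this error value could be used, assuming the usual KMeans loop (assign every point to its nearest centroid, move each centroid to the mean of its points, repeat until the centroids stop moving). The data, the starting centroids, and the tolerance `tol` are made up for illustration and are not taken from the original implementation. With axis=None, np.linalg.norm of a 2-D array returns a single number (the Frobenius norm: the square root of the sum of all squared entries), so error measures how far all centroids moved in total. Keeping C_old is what makes that comparison possible.

```python
import numpy as np

def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)

# Hypothetical data: two well-separated groups of 2-D points
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
# Hypothetical starting centroids, chosen by hand instead of at random
C = np.array([[0.0, 0.0], [10.0, 10.0]])

C_old = np.zeros(C.shape)               # previous centroid positions
clusters = np.zeros(len(X), dtype=int)  # cluster label of every point
tol = 1e-9                              # assumed stopping tolerance

# With axis=None the norm of the whole 2-D difference is one scalar
error = dist(C, C_old, None)

while error > tol:
    # Assign each point to the nearest centroid (one distance per centroid)
    for i in range(len(X)):
        clusters[i] = np.argmin(dist(X[i], C))
    # Remember where the centroids were before moving them
    C_old = C.copy()
    # Move each centroid to the mean of the points assigned to it
    for j in range(len(C)):
        members = X[clusters == j]
        if len(members) > 0:
            C[j] = members.mean(axis=0)
    # Total movement of all centroids in this iteration
    error = dist(C, C_old, None)

print(clusters)
print(C)
```

Once an iteration leaves every centroid where it was, C equals C_old, the error drops to zero, and the loop ends.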
I have been able to follow the author's implementation, but the topic of vector norms is new to me.