Machine Learning: Clustering, K-Means Clustering (Part 19)
Clustering takes unlabeled data and groups similar points together.
K-Means Clustering
Assume that we have some data.
Step 1: First, we decide how many clusters we want to make. For example, say we want 2 clusters. We then place 2 centroids randomly among the data.
Step 2: We draw a line between the two centroids: the blue centroid owns the region above the line and the red one owns the region below it.
Step 3: We color each data point blue or red depending on its distance to the centroids (blue for the points above the line, red for the points below).
Step 4: We take the average position of the blue points and the average position of the red points. That gives us a mean coordinate for each group.
Step 5: We move the 2 centroids to those average coordinates.
Again, we draw a line between these 2 centroids.
The points above the line become blue and the points below become red (repeat step 3).
Then we repeat step 4, then step 5, then step 3 again, and so on.
These steps repeat until the centroids stop moving and the point assignments no longer change, which gives us our 2 clusters.
So, finally:
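To make the loop concrete, here is a minimal NumPy sketch of the idea above (the function name, the fixed iteration count and the example points are my own, not from any library):

import numpy as np

def simple_kmeans(data, k=2, iterations=10, seed=7):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iterations):
        # Step 3: "color" each point by its nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 4 & 5: move each centroid to the average of its points
        # (keep the old centroid if a cluster happens to be empty)
        centroids = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids

# tiny usage example with made-up 2D points
points = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
                   [8.0, 8.0], [9.0, 11.0], [8.5, 9.0]])
labels, centers = simple_kmeans(points, k=2)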
The Elbow Method
Let's assume this is our data.
How many clusters should we have here?
If we take 1 cluster, it is going to look like this:
If 2 clusters,
If 3 clusters,
Look: the more clusters we have, the smaller the WCSS value gets.
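WCSS stands for Within-Cluster Sum of Squares: for each cluster, add up the squared distances from every point in it to that cluster's centroid, then sum over all clusters:

WCSS = Σ (over each cluster C) Σ (over each point x in C) ||x − centroid of C||²

With more clusters, every point sits closer to its own centroid, so the WCSS keeps dropping. The elbow method picks the number of clusters where this drop suddenly slows down: the "elbow" of the curve.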
K-Means++
In K-Means, we take the data and randomly place some centroids, then build clusters from them.
For example, this can be one output.
But the drawback is that, since the centroids are chosen randomly, different runs can give us different clusters.
And if you look at the data properly, this grouping should not be a cluster at all.
So, this was the data:
and these are the 2 clusterings we got.
The first clustering is better, and it should be the answer.
So, to avoid this randomness, K-Means++ adds some rules for picking the initial centroids:
Let's see how it works:
Step 1 & 2: Choose the first centroid uniformly at random from the data points, then compute D(x), the distance from every other point to its nearest chosen centroid.
Step 3: Choose the next centroid from the points where D(x) is largest (formally, with probability proportional to D(x)²), so far-away points are favored.
Step 4: Repeat steps 2 and 3 until all the centroids are chosen.
Step 5: Run standard K-Means from these starting centroids.
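A minimal sketch of this initialization, assuming NumPy and a 2D data array (the function name is hypothetical, not sklearn's API):

import numpy as np

def kmeans_pp_init(data, k, seed=7):
    rng = np.random.default_rng(seed)
    # Step 1: the first centroid is a random data point
    centroids = [data[rng.integers(len(data))]]
    while len(centroids) < k:
        # Step 2: D(x) = distance from each point to its nearest chosen centroid
        dists = np.min(np.linalg.norm(data[:, None, :] - np.array(centroids)[None, :, :], axis=2), axis=1)
        # Step 3: pick the next centroid with probability proportional to D(x)^2
        probs = dists ** 2 / np.sum(dists ** 2)
        centroids.append(data[rng.choice(len(data), p=probs)])
    return np.array(centroids)

In practice you rarely implement this yourself: scikit-learn's KMeans does it for you when you pass init='k-means++', which is exactly what we use in the code below.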
Problem Statement:
Here we have customer details, and we want to find a pattern among our customers.
We have their Genre, Annual Income, Spending Score, and more, but we just want to find a pattern.
Let's code it down!!
We will import the libraries
But when we load the dataset, this time we don't need any y (dependent variable): all of our data is independent.
Still, it's easier to visualize clusters in 2D, so we will take just 2 columns as the feature matrix.
We will take columns 4 and 5: [:, 3:] means all rows, and the columns at index 3 (column 4) and index 4 (column 5).
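Putting the setup together, a minimal sketch (the file name Mall_Customers.csv is an assumption for illustration; adjust it to your own dataset):

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('Mall_Customers.csv')  # assumed file name
X = dataset.iloc[:, 3:].values  # all rows, columns at index 3 and 4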
Then we use the elbow method to find the optimal number of clusters.
We test 1 to 10 clusters to see which number works best.
from sklearn.cluster import KMeans  # importing KMeans
We will store a different WCSS value for each number of clusters:
wcss = []
for i in range(1, 11):  # trying k = 1 to 10
    # create the KMeans object: n_clusters is the number of clusters (we pass i,
    # increasing it each time), init='k-means++' avoids bad random initialization,
    # and random_state makes the run reproducible
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=7)
    # run the algorithm with the fit function; X (the features) is all it needs
    kmeans.fit(X)
    # append the WCSS value using the built-in inertia_ attribute, which stores it for us
    wcss.append(kmeans.inertia_)
Let's plot it now
plt.plot(range(1, 11), wcss)  # x axis: 1 to 10, y axis: the WCSS values
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
From the graph we can see that after 5 clusters the WCSS starts decreasing slowly, so the optimal number of clusters should be 5.
Training the K-Means model on the dataset
First, create the object:
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=7)
Now we need to train it. But this time it's different: we also want the cluster label of every customer, which fit_predict() returns. This acts as our dependent variable. So, use kmeans.fit_predict(X).
Store it:
y_kmeans = kmeans.fit_predict(X)  # dependent variable done: one cluster label per customer
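To see what we just got (a quick, optional check; it relies on the y_kmeans array above and the usual numpy import):

print(y_kmeans[:10])  # cluster labels (0 to 4) of the first 10 customers
print(np.unique(y_kmeans, return_counts=True))  # how many customers fall into each cluster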
Visualising the clusters
We use the scatter function for the 5 clusters, creating one plot for each of clusters 0, 1, 2, 3 and 4.
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label = 'Cluster 1')  # x and y coordinates, color, label; s = 100 enlarges the points
On the x axis we will have Annual Income and on the y axis, Spending Score.
# For the x coordinates, choose the Annual Income column and the rows that belong to cluster 1. Here y_kmeans == 0 selects the customers in cluster 1, and column 0 of our feature matrix X is Annual Income. So, x coordinates = X[y_kmeans == 0, 0]
For the y coordinates, we take the second column of X for the same customers who belong to cluster 1. So, y coordinates = X[y_kmeans == 0, 1]
Now follow the same process for the other clusters:
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
Let's plot the centroids now. Here, cluster_centers_ is a 2D array that holds the center of every cluster.
So, for the x axis, take all rows and the first column: cluster_centers_[:, 0]
For the y axis, take all rows and the second column: cluster_centers_[:, 1]
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')
#Now the title and labels
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
Let's analyze the output now.
Here, cluster 2 represents people with low income and low spending.
Cluster 1, on the other hand, earns more and might spend more.
And these customers earn more but might not spend that much.
So, that was it!
Practice the code from this repository.