Machine Learning: Clustering, K-Means Clustering (Part 19)

Clustering takes unlabeled data and groups similar points together.

K-Means Clustering

Assume that we have some data.

Step 1: First, we decide how many clusters we want to make. For example, say we want 2 clusters. Then we place 2 centroids at random positions in the data.

Step 2: We draw a line between them. The region above the line belongs to the blue centroid, and the region below belongs to the red one.

Step 3: Then we color each data point blue or red depending on which centroid is closer (blue for points above the line, red for points below it).

Step 4: Now we take the average of the blue points and the average of the red points. That gives us a mean coordinate for each group.

Step 5: Then we move the 2 centroids to those mean coordinates.

Again, we draw a line between the 2 moved centroids; the points above it become blue and the points below it become red (repeat step 3). Then we repeat step 4, then step 5, then step 3 again, and so on.

These repetitions continue until the centroids stop moving, i.e., the assignments no longer change. At that point we finally have our 2 clusters.
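To make the loop concrete, here is a minimal from-scratch sketch of these steps (a sketch only, assuming NumPy and Euclidean distance; it skips edge cases such as a cluster ending up empty):

import numpy as np

def kmeans(X, k=2, n_iters=100, rng=np.random.default_rng(7)):
    # Step 1: place k centroids at randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 2-3: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: take the average coordinate of each group
        means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: move the centroids there; stop once they no longer move
        if np.allclose(means, centroids):
            break
        centroids = means
    return labels, centroids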

The Elbow Method

Let's assume this is our data.

How many clusters should we have here?

If we take 1 cluster, it is going to look like this:

If 2 clusters,

If 3 clusters,

Notice that the more clusters we have, the smaller the WCSS (Within-Cluster Sum of Squares) value gets. WCSS is the sum of squared distances from every point to the centroid of its own cluster. The "elbow" of the curve, where WCSS stops dropping sharply, marks the optimal number of clusters.
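A minimal sketch of the WCSS computation itself (assuming NumPy arrays for the points, the labels, and the centroids):

import numpy as np

def wcss(X, labels, centroids):
    # sum of squared distances from every point to its own cluster's centroid
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centroids))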

K-Means++

In plain K-means, we take the data and randomly appoint some centroids. Then we build the clusters from those.

For example, this can be an output.

But the drawback is that, since the centroids are chosen randomly, different runs can give us different clusterings.

But if you look at the data properly, this should not be a cluster.

So, this was the data:

and these are the 2 clusterings we got.

The first clustering is better, and it should be the answer.

So, to avoid such randomness, K-means++ adds some rules for choosing the initial centroids.

Let's see how it works:

Step 1 & 2: Choose the first centroid at random from the data points, then compute every other point's distance to its nearest chosen centroid.

Step 3: Choose the next centroid from the remaining points, favoring the point whose distance is largest (formally, with probability proportional to the squared distance; see the sketch after these steps).

Step 4: Repeat the distance computation and selection until all centroids are chosen.

Step 5: Run standard K-means with these centroids as the starting positions.
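Here is a minimal sketch of this initialization (the standard k-means++ squared-distance sampling, assuming NumPy; scikit-learn's init='k-means++' is built on the same idea):

import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(7)):
    # Step 1: pick the first centroid at random from the data points
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: squared distance of each point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Steps 3-4: sample the next centroid, favoring the farthest points
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)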

Problem Statement:

Here we have customer details, and we want to find a pattern among our customers.

We have their genre, spending score, and more, but we just want to find a pattern.

Let's code it down!!

We will import the libraries

But when we load the dataset, this time we don't need any y (dependent variable). All of our data is independent.

Still, it's easiest to visualize clusters in 2D, so we will take just 2 columns as the feature matrix.

We will take columns 4 and 5. [:, 3:] means all rows, and columns 4 (index 3) and 5 (index 4).
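The import and loading lines aren't shown above, so here is a minimal sketch of the setup (the file name Mall_Customers.csv is an assumption; adjust it to wherever your copy of the data lives):

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('Mall_Customers.csv')  # file name is an assumption
X = dataset.iloc[:, 3:].values  # all rows; columns 4 and 5 (Annual Income, Spending Score)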

Then we use the elbow method to find the optimal number of clusters.

We test 1 to 10 clusters to see which number fits best.

from sklearn.cluster import KMeans #importing

We compute a different WCSS value for each number of clusters:

wcss = []

for i in range(1,11): # trying 1 to 10 clusters

Let's create our KMeans object now

kmeans = KMeans(n_clusters=i, init='k-means++', random_state=7)

n_clusters is the number of clusters; we pass i each time (increasing the cluster count). init='k-means++' avoids fully random initialization, and random_state fixes the seed for reproducibility.

Run the algorithm using the fit function now:

kmeans.fit(X) # X is our only feature matrix (there is no y)

Append the WCSS value using the built-in attribute inertia_, which stores it for us.

wcss.append(kmeans.inertia_) # inertia_ holds the WCSS of the fitted model

Let's plot it now

plt.plot(range(1,11), wcss) # x axis: 1 to 10 clusters; y axis: the WCSS values
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')

plt.show()

From the graph we can see that after 5 clusters the WCSS decreases only slowly, so the optimal number of clusters should be 5.

Training the K-Means model on the dataset

First create the object:

kmeans = KMeans(n_clusters=5, init='k-means++', random_state=7)

Now we need to train it. But this time it's different: we want a cluster label for every customer (our stand-in for a dependent variable), which we get with fit_predict(). So, use kmeans.fit_predict(X).

Store it

y_kmeans = kmeans.fit_predict(X) # one cluster label per customer
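A quick sanity check of what fit_predict() returned (illustrative):

import numpy as np
print(np.unique(y_kmeans))  # -> [0 1 2 3 4]; y_kmeans holds one of these ids per customer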

Visualising the clusters

We use the scatter function for the 5 clusters, creating one plot for each of clusters 0, 1, 2, 3, and 4.

plt.scatter(X[y_kmeans==0,0], X[y_kmeans==0,1], s=100, c='red', label='Cluster 1') # x and y coordinates, color, label; s=100 enlarges the points

On the x axis we will have Annual Income, and on the y axis, Spending Score.

# For the x coordinates, choose the Annual Income column and the rows that belong to cluster 1. y_kmeans == 0 selects the people in cluster 1, and Annual Income sits at column index 0 of our feature matrix X. So, x coordinates = X[y_kmeans==0, 0].

For the y coordinates, take the second column of X for the same people. So, y coordinates = X[y_kmeans==0, 1].
Now follow the same process for the other clusters:

plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')

plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label = 'Cluster 3')

plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')

plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
Let's plot the centers now. kmeans.cluster_centers_ is a 2D array holding the coordinates of all the centroids.

So, for the x axis take all rows and the first column: cluster_centers_[:, 0]

For the y axis take all rows and the second column: cluster_centers_[:, 1]

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroids')

#Now the title and labels

plt.title('Clusters of customers')

plt.xlabel('Annual Income (k$)')

plt.ylabel('Spending Score (1-100)')
plt.legend()

plt.show()

Let's analyze the output now.

Here, cluster 2 represents people with low income and low spending.

Here, for cluster 1: they earn more and might spend more.

These customers earn more but might not spend that much.

So, that was it!

Practice the code from this repository.