Machine Learning: Clustering, Hierarchical Clustering (HC) (Part 20)

It's almost the same as K-Means, but let's learn it using Agglomerative HC

Now, how do we find the closest clusters? The distance between two points can be calculated with the Euclidean distance.

But how do we measure the distance between 2 clusters? There are several options: the distance between their closest points, the distance between their furthest points, the average distance between all pairs of points, or the distance between their centroids.

Based on your situation, you can choose the option that fits best (a small comparison sketch follows).
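Here's a minimal sketch (my own toy example, not from the original post) comparing the common linkage options with scipy; X_toy and its point values are made up purely for illustration:

import numpy as np
import scipy.cluster.hierarchy as sch

X_toy = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0],
                  [5.2, 4.8], [9.0, 1.0], [8.8, 1.3]]) #toy 2D points

for method in ['single', 'complete', 'average', 'ward']: #closest, furthest, average, minimum variance
    Z = sch.linkage(X_toy, method=method) #Z[:, 2] holds the merge distances
    print(method, '-> last merge distance:', Z[-1, 2])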

Now, let's apply the steps of Agglomerative HC

Assume we have 6 data points

First, we make each point its own cluster, so we start with 6 clusters

Then we take the 2 closest points and merge them into a cluster. So, we have 5 clusters now

Merging the next closest pair leaves us with 4 clusters.

Then 3 clusters,

and finally a single cluster containing all the points.

Now, why did we do this?

Let's understand this with Dendrograms

In the original figure, the points are plotted on the left, and on the right the same points are placed along the x axis of the dendrogram.

For p3 and p2, we made them a cluster first as they had the smallest (Euclidean) distance.

The dendrogram records this merge: the height of the bar joining p2 and p3 represents their dissimilarity, i.e. the distance between them.

Then the same happens for p5 and p6, the next closest pair.

Now p1 is closest to the cluster of p2 and p3.

So, in the dendrogram, we have connected p1 with the (p2, p3) cluster.

Next, p4 is closest to the cluster of p5 and p6.

Finally, the two remaining clusters are joined at the top of the dendrogram.

So, done!!!
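Here's a minimal sketch (a toy example of mine, not the original post's data) that reproduces this walk-through in code, with six made-up points p1 through p6:

import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch

points = np.array([[1.0, 1.0],   #p1
                   [2.0, 1.0],   #p2
                   [2.2, 1.2],   #p3
                   [5.0, 4.0],   #p4
                   [6.0, 4.0],   #p5
                   [6.4, 4.3]])  #p6

Z = sch.linkage(points, method='ward') #build the merge history
sch.dendrogram(Z, labels=['p1', 'p2', 'p3', 'p4', 'p5', 'p6'])
plt.ylabel('Euclidean distances')
plt.show()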

How do we get the most value out of dendrograms?

On the left we have all the points, and on the right we have the dendrogram.

The height in the dendrogram represents the dissimilarity between points.

Let's pick a threshold dissimilarity which should not be exceeded.

Assume the threshold is 1.7: we cut the dendrogram with a horizontal line at that height, and every vertical line it crosses represents one cluster whose internal dissimilarity stays below 1.7.

Here the threshold line crosses 2 vertical lines, so we have 2 clusters with dissimilarity below 1.7.

If we lower the threshold further, the line crosses more vertical lines and we get more clusters,

and at the lowest thresholds, every point ends up as its own cluster.

But how do we find the optimal number of clusters?

Find the longest vertical line that does not cross any horizontal line (imagining each horizontal line extended across the whole plot).

In this dendrogram, that is the longest vertical line not crossed by any horizontal line.

We had other candidate lines too,

but the vertical lines above p1 and p4 can't be considered, as they are crossed by the extended horizontal lines of the (p2, p3) and (p5, p6) merges.

So, back to business.

Take a threshold through the longest line; the number of vertical lines it intersects gives us our cluster number (a rough code sketch of this heuristic follows).
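As a rough sketch (my own illustration, not from the original post), this heuristic can be approximated in code: the merge heights sit in the third column of scipy's linkage matrix, the longest uncrossed vertical line roughly corresponds to the biggest gap between consecutive merge heights, and sch.fcluster cuts the tree at a threshold placed inside that gap:

import numpy as np
import scipy.cluster.hierarchy as sch

def clusters_by_largest_gap(Z):
    heights = Z[:, 2] #merge distances, in increasing order
    gaps = np.diff(heights) #vertical 'line lengths' between merges
    i = np.argmax(gaps) #biggest gap = longest uncrossed line (approximately)
    threshold = (heights[i] + heights[i + 1]) / 2.0 #cut inside the gap
    return sch.fcluster(Z, t=threshold, criterion='distance'), threshold

#toy usage with random points
X_toy = np.random.RandomState(0).rand(10, 2)
labels, t = clusters_by_largest_gap(sch.linkage(X_toy, method='ward'))
print('threshold:', t, 'labels:', labels)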

Now, let's take a test.

Can you say the optimal number of clusters here?

Among all of the green vertical lines, the longest is the one between the clusters of p4, p5, p6 and p7, p8, p9.

Now we draw a threshold through that line, and it intersects 3 vertical lines.

So, 3 clusters.

Let's code this up with the mall customers CSV file

This is again the same dataset we used in the K-Means clustering blog

We will take the columns Annual Income and Spending Score as our feature matrix

Let's start

  1. Import the libraries

  2. Now we import the dataset and take only the 2 columns at index 3 and 4, which are Annual Income and Spending Score (see the sketch below)

    We could have used X = dataset.iloc[:, 3:].values as well; this means column 3 and beyond, and since we only have columns 3 and 4 here, it takes both.
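    A minimal sketch of steps 1 and 2, assuming the file is named Mall_Customers.csv as in the K-Means post (adjust the name to your copy):

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd

    dataset = pd.read_csv('Mall_Customers.csv') #file name assumed
    X = dataset.iloc[:, [3, 4]].values #Annual Income and Spending Score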

  3. Using the dendrogram to find the optimal number of clusters

We import the scipy hierarchy module:

import scipy.cluster.hierarchy as sch

Then we create the dendrogram using sch.dendrogram. Within that we use sch.linkage(feature matrix, method), where method is the clustering technique; here we use the minimum-variance technique, which is called 'ward'.

dendrogram = sch.dendrogram(sch.linkage(X,method='ward')) #creating dendrogram object

Then we plot the dendrogram:

plt.title('Dendrogram')

plt.xlabel('Customers') #observation points

plt.ylabel('Euclidean distances')

plt.show()

Let's understand how to pick an efficient number of clusters from this dendrogram.

We imagine each horizontal line extended across the plot, and then look for the longest vertical line that none of them crosses.

That longest line sits where a threshold through it crosses 5 vertical lines. So, 5 clusters are to be made (a quick check in code is sketched below).
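As a quick check (my own sketch, not part of the original post), scipy can cut the same ward linkage into exactly 5 flat clusters with the 'maxclust' criterion:

import scipy.cluster.hierarchy as sch

Z = sch.linkage(X, method='ward') #same linkage the dendrogram used
labels = sch.fcluster(Z, t=5, criterion='maxclust') #exactly 5 clusters
print(labels[:10]) #cluster id (1..5) for the first 10 customers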

  1. Training the Hierarchical Clustering model on the dataset

    from sklearn.cluster import AgglomerativeClustering #importing

    Now we create an object hc from the class AgglomerativeClustering.

    Its parameters are AgglomerativeClustering(n_clusters = the number of clusters, affinity = the type of distance we used, linkage = the technique we used for the dendrogram, 'ward'). (Note: in newer scikit-learn versions, the affinity parameter has been renamed to metric.)

    hc= AgglomerativeClustering(n_clusters=5,affinity='euclidean',linkage='ward') #creating object

    Then we want to create the dependent variable (the cluster labels) and train at the same time, so we use fit_predict. To only train, we would use fit.

    y_hc=hc.fit_predict(X)

    Here the output [4 3 4 3 .... means customer 0001 will be in cluster 5 (index 4), customer 0002 in cluster 4 (index 3), customer 0003 in cluster 5, and customer 0004 in cluster 4........
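    A quick sanity check (my own addition, not in the original post) is to count how many customers landed in each cluster:

    import numpy as np

    values, counts = np.unique(y_hc, return_counts=True)
    for v, c in zip(values, counts):
        print('Cluster', v + 1, '(label', v, '):', c, 'customers') #sizes of the 5 clusters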

  2. Visualizing the clusters

    plt.scatter(X[y_hc==0,0],X[y_hc==0,1],s=100, c='red',label = 'Cluster 1')

    Here we pass the x coordinates, the y coordinates, the size (s = 100 to enlarge the points), the color, and the label. On the x axis we will have Annual Income and on the y axis we will have Spending Score. For the x coordinates, we choose the Annual Income column and the rows which belong to cluster 1: y_hc == 0 selects the people who belong to cluster 1 (label 0), and the column is set to 0 because, in our feature matrix X, Annual Income is at column 0.

    #So, x coordinate = X[y_hc==0,0]

    #For the y axis, we take the second column of X and the people who belong to cluster 1

    #So, y coordinate = X[y_hc==0,1]

    #Now follow the process for other clusters

    plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')

    plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 100, c = 'green', label = 'Cluster 3')

    plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')

    plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
    #Now the title and labels

    plt.title('Clusters of customers')

    plt.xlabel('Annual Income (k$)')

    plt.ylabel('Spending Score (1-100)')

    plt.legend()

    plt.show()
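If the X[y_hc == 0, 0] indexing looks unfamiliar, here is a tiny standalone demo (my own illustration, with made-up numbers) of NumPy boolean-mask selection:

import numpy as np

A = np.array([[15, 39], [16, 81], [17, 6], [18, 77]]) #toy income/score rows
labels = np.array([0, 1, 0, 1]) #toy cluster labels

print(A[labels == 0, 0]) #column 0 of the rows labeled 0 -> [15 17]
print(A[labels == 0, 1]) #column 1 of the rows labeled 0 -> [39  6]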

So, cluster 3 has a higher income and a higher spending score, whereas cluster 5 has a lower income and a lower spending score.

Done!!

But while working with the dendrogram, we chose 5 clusters, whereas there was another vertical line, very close in length, which gives us 3 clusters.

Let's do it for 3 clusters (a sketch of the changes follows).
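A minimal sketch of the 3-cluster variant, reusing the pipeline above (only n_clusters and the plotting loop change; on newer scikit-learn, use metric='euclidean' instead of affinity):

from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

hc3 = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
y_hc3 = hc3.fit_predict(X) #X is the same feature matrix as before

for i, color in enumerate(['red', 'blue', 'green']): #one color per cluster
    plt.scatter(X[y_hc3 == i, 0], X[y_hc3 == i, 1], s=100, c=color,
                label='Cluster ' + str(i + 1))
plt.title('Clusters of customers (3 clusters)')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()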

Again, cluster 3 gives us the highest Annual Income and highest Spending Score. We can offer those customers various business deals.

Try it out with the code!