Machine Learning: Clustering, Hierarchical Clustering (HC) (Part 20)
It's almost the same as K-Means, but let's learn it using Agglomerative HC.
Now, talking about the closest clusters: we can calculate how close two clusters are using Euclidean distance.
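As a quick refresher, here is a minimal sketch of the Euclidean distance between two made-up points (the points and names are just for illustration):
import numpy as np
p1 = np.array([1.0, 2.0]) #hypothetical point 1
p2 = np.array([4.0, 6.0]) #hypothetical point 2
dist = np.linalg.norm(p1 - p2) #sqrt((4-1)^2 + (6-2)^2) = 5.0
print(dist)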
Now, let's measure the distance between 2 clusters. There are several options (the closest points of the two clusters, the furthest points, the average of all pairwise distances, or the distance between the centroids), and based on your situation you can choose the option that fits; see the sketch below.
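As a sketch, scipy exposes these options as linkage methods (X here is just a random stand-in for whatever feature matrix you have):
import numpy as np
from scipy.cluster.hierarchy import linkage
X = np.random.rand(10, 2) #hypothetical feature matrix
Z_single = linkage(X, method='single') #closest points of the two clusters
Z_complete = linkage(X, method='complete') #furthest points
Z_average = linkage(X, method='average') #average of all pairwise distances
Z_centroid = linkage(X, method='centroid') #distance between centroids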
Now, let's apply the steps of Agglomerative HC.
Assume we have 6 data points
First, we make each point its own cluster, so we have 6 clusters.
Then we take the 2 closest points and merge them into one cluster. So, we have 5 clusters now.
Merging the next closest pair, we have 4 clusters.
Then we keep merging the closest clusters: 3 clusters, then 2, and finally 1 big cluster containing every point. A small sketch of these merges follows below.
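To make the steps concrete, here is a minimal sketch that prints each merge scipy performs on 6 made-up points (the coordinates are arbitrary):
import numpy as np
from scipy.cluster.hierarchy import linkage
X6 = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 5.2], [9, 1], [9.2, 1.1]]) #6 hypothetical points
Z = linkage(X6, method='ward') #one row per merge: 5 merges for 6 points
for step, (a, b, dist, size) in enumerate(Z, start=1):
    print(f"step {step}: merge clusters {int(a)} and {int(b)} at distance {dist:.2f} -> {int(size)} points")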
Now, why did we do this?
Let's understand this with Dendrograms.
Here on the left, you can see the points, and on the right, the same points are placed on the x-axis of the dendrogram.
For p2 and p3, we made them a cluster first as they had the smallest (Euclidean) distance. The dendrogram shows how dissimilar p2 and p3 are: the height of the link between them represents their distance.
Then the same happens for p5 and p6, the next closest pair.
Now p1 is closest to the cluster of p2 and p3. So, in the dendrogram, we have connected p1 with the p2-p3 cluster.
Now, p4 is closest to the cluster of p5 and p6.
Finally, the two remaining clusters are merged into one. So, done!!!
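If you want to reproduce this picture yourself, here is a minimal sketch that draws the dendrogram for 6 made-up points labelled p1..p6:
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
X6 = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 5.2], [9, 1], [9.2, 1.1]]) #hypothetical points
sch.dendrogram(sch.linkage(X6, method='ward'), labels=['p1', 'p2', 'p3', 'p4', 'p5', 'p6'])
plt.ylabel('Euclidean distance')
plt.show()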
How do we get the maximum value out of dendrograms?
On the left, we have all the points, and on the right, we have the dendrogram. The dendrogram represents the dissimilarity between points.
Let's take a threshold dissimilarity which should not be exceeded.
Assume the threshold is at 1.7: any cluster formed below a dissimilarity of 1.7 is kept for now. The threshold line crosses the dendrogram at 2 points, and those 2 points represent 2 clusters. So we have 2 clusters whose dissimilarity is below 1.7 here.
Again, if we lower the threshold below this, we get more clusters, and finally, with a low enough threshold, every point becomes its own cluster.
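As a sketch, scipy's fcluster can apply exactly this kind of threshold cut (the 1.7 value mirrors the example above; the points are made up):
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
X6 = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 5.2], [9, 1], [9.2, 1.1]]) #hypothetical points
Z = linkage(X6, method='ward')
labels = fcluster(Z, t=1.7, criterion='distance') #keep only merges below dissimilarity 1.7
print(labels) #one cluster id per point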
But how do we find the optimal number of clusters?
Find the longest vertical line that is not crossed by any (extended) horizontal line.
Here, this is the longest line which does not overlap with any horizontal line.
But we had other candidate lines too. However, the vertical lines above p1 and p4 can't be considered, as they are crossed by the extended horizontal lines from the p2-p3 and p5-p6 merges. So, back to business.
Take a threshold through that longest line, and the number of vertical lines it intersects gives us our cluster number.
Now, let's take a test.
Can you say the optimal number of clusters here?
Among all of the green lines, the longest is between the cluster of p4, p5, p6 and the cluster of p7, p8, p9. Now we take a threshold through that line, and it intersects 3 vertical lines. So, 3 clusters. A programmatic version of this "longest vertical line" idea is sketched below.
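The longest uncrossed vertical line corresponds to the largest gap between consecutive merge heights in the linkage matrix, so the rule can be approximated in code. A rough sketch, assuming X is your feature matrix (this heuristic is my illustration, not part of the original post):
import numpy as np
from scipy.cluster.hierarchy import linkage
X = np.random.rand(20, 2) #hypothetical feature matrix
Z = linkage(X, method='ward') #Z[i, 2] is the merge height of step i
gaps = np.diff(Z[:, 2]) #vertical gap between consecutive merges
i = int(np.argmax(gaps)) #index of the largest gap
n_clusters = len(X) - (i + 1) #cutting inside that gap leaves this many clusters
print(n_clusters)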
Let's code this using the mall customers CSV file.
This is the same dataset we used in the K-Means clustering blog.
We will take the columns Annual Income and Spending Score as our feature matrix.
Let's start
Import the libraries
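These are presumably the same standard imports as in the earlier posts of this series:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd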
Now we import the dataset and only take the 2 columns at index 3 and 4, which are Annual Income and Spending Score.
We could have used X=dataset.iloc[:,3:].values as well; that means column 3 and beyond, and since only columns 3 and 4 are left, it takes both.
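A minimal sketch of that step, assuming the same Mall_Customers.csv file name as the K-Means post:
dataset = pd.read_csv('Mall_Customers.csv') #assumed file name
X = dataset.iloc[:,[3,4]].values #Annual Income and Spending Score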
Using the dendrogram to find the optimal number of clusters
We import scipy's hierarchical clustering module:
import scipy.cluster.hierarchy as sch
Then we create the dendrogram using sch.dendrogram. Within that, we use sch.linkage(feature matrix, clustering technique). Here we will use the minimum-variance technique, which is called 'ward'.
dendrogram = sch.dendrogram(sch.linkage(X,method='ward'))
#creating dendrogram object
Then we plot the dendrogram:
plt.title('Dendrogram')
plt.xlabel('Customers') #observatation points
plt.ylabel('Euclidean distances')
plt.show()
Let's understand how to get the efficient number of clusters.
We extend the horizontal lines from the top, and then we look for the longest vertical line not crossed by any of them.
We find that longest line where a horizontal threshold through it intersects 5 vertical lines. So, 5 clusters to be made.
Training the Hierarchical Clustering model on the dataset
from sklearn.cluster import AgglomerativeClustering
#importing
Now create an object hc from the AgglomerativeClustering class.
Here, AgglomerativeClustering takes the number of clusters (n_clusters), the type of distance we used (affinity, 'euclidean'), and the method or technique we used for the dendrogram (linkage, 'ward').
hc= AgglomerativeClustering(n_clusters=5,affinity='euclidean',linkage='ward') #creating object
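Note that in recent scikit-learn versions (1.4 and later) the affinity parameter has been removed in favour of metric, so on a newer install the same object would be created like this:
hc = AgglomerativeClustering(n_clusters=5, metric='euclidean', linkage='ward') #scikit-learn >= 1.4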
Then we want to create the dependent variable and train at the same time. So, we will use fit_predict. (To only train, we would use just fit.)
y_hc = hc.fit_predict(X)
Here the output [4 3 4 3 .... means customer 001 will be in cluster 5 (index 4), customer 002 in cluster 4, customer 003 in cluster 5, customer 004 in cluster 4, and so on.
Visualize the clusters
plt.scatter(X[y_hc==0,0],X[y_hc==0,1],s=100, c='red',label = 'Cluster 1')
Here we pass the x-coordinates, the y-coordinates, the color, the label, and the size s set to 100 to enlarge the points. On the x-axis we will have Annual Income and on the y-axis, Spending Score. For the x-coordinates, choose the Annual Income column and the rows which belong to cluster 1: y_hc==0 selects the customers in cluster 1, and the column index is set to 0 because in our feature matrix X, Annual Income is at column 0.
#So, x coordinate= X[y_hc==0,0]
# For the y axis, we take the second column of the X feature matrix and the people who belong to cluster 1
#So, y coordinate= X[y_hc==0,1]
#Now follow the process for other clusters
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
#Now the title and labels
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
So, cluster 3 has a higher income and a higher spending score, whereas cluster 5 has a lower income and a lower spending score.
Done!!
But while working with the dendrogram, we chose 5 clusters, whereas we had another line which was very close and gives us 3 clusters.
Let's do it for 3 clusters.
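A sketch of the 3-cluster run, reusing the same pipeline as above (the colors and label names here are just my choices):
hc3 = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
y_hc3 = hc3.fit_predict(X)
plt.scatter(X[y_hc3 == 0, 0], X[y_hc3 == 0, 1], s = 100, c = 'red', label = 'Cluster 1')
plt.scatter(X[y_hc3 == 1, 0], X[y_hc3 == 1, 1], s = 100, c = 'blue', label = 'Cluster 2')
plt.scatter(X[y_hc3 == 2, 0], X[y_hc3 == 2, 1], s = 100, c = 'green', label = 'Cluster 3')
plt.title('Clusters of customers (3 clusters)')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()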
Again, cluster 3 gave us the highest Annual Income and Spending Score. We can offer them various business deals.
Try it yourself with the code!