Machine Learning: Clustering (Part 18)

Clustering is similar to classification in that both sort data points into groups, but the basis is different: classification learns from labeled examples, whereas clustering works without any labels.

In clustering, you don't know in advance what you are looking for; you are trying to identify segments, or clusters, in your data. When you run a clustering algorithm on your dataset, it can reveal structures and groupings you would never have thought of otherwise.

We will learn about these 2 types of clustering:

  1. K-Means Clustering

  2. Hierarchical Clustering

K-Means Clustering: K-Means is an unsupervised machine learning algorithm used for clustering data points into a predefined number of clusters, k. The algorithm works iteratively to assign each data point to one of k groups based on the features that are provided. Data points are clustered based on feature similarity. The results of K-Means clustering are not hierarchical and do not imply any order of the clusters.

The steps involved in K-Means clustering are:

  1. Initialize k centroids randomly.

  2. Assign each data point to the nearest centroid.

  3. Recalculate the centroids as the mean of all points in a cluster.

  4. Repeat steps 2 and 3 until convergence (the assignments stop changing) or until a fixed number of iterations is reached.
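The four steps above can be sketched in plain Python. This is only a minimal illustration, not a production implementation: the function name k_means, the seed and max_iters parameters, and the toy points are all assumptions made for this example. It initializes the centroids by picking k of the data points at random, which is one common way to do step 1.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Step 2: assign each point to the index of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Toy data (made up for illustration): two well-separated blobs.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 and which gets index 1 depends on the random initialization, which matches the point that K-Means clusters have no inherent order.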

Hierarchical Clustering: Hierarchical clustering is another unsupervised learning algorithm that is used to group similar objects into clusters. The algorithm builds a hierarchy of clusters either in a bottom-up approach (agglomerative) or a top-down approach (divisive). In the agglomerative approach, each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. In divisive, all observations start in one cluster, which is then split recursively as one moves down the hierarchy.

The key output of hierarchical clustering is a dendrogram, which illustrates the arrangement of the clusters produced by the algorithm.
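To make the bottom-up (agglomerative) approach concrete, here is a small sketch in plain Python. It assumes single linkage (the distance between two clusters is the distance between their two closest members), which is just one of several possible linkage choices. The function names and the toy points are made up for this example. Instead of drawing a dendrogram, it records the sequence of merges and their distances, which is the information a dendrogram displays.

```python
def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def single_linkage(cluster_a, cluster_b, points):
    """Distance between the two closest members of two clusters."""
    return min(euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b)


def agglomerative(points):
    # Each observation starts in its own cluster (stored as a list of indices).
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Record the merge: which two clusters were joined, and at what distance.
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges


# Toy data (made up for illustration): two tight pairs and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (20.0, 20.0)]
merges = agglomerative(points)
for left, right, distance in merges:
    print(left, "+", right, "at distance", round(distance, 2))
```

Reading the merges from first to last is like reading a dendrogram from the bottom up: the closest points join first, and the far-away point is attached last. Cutting the merge list at a chosen distance gives a flat set of clusters.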

Both methods have their own use cases and are chosen based on the nature of the data and the desired outcome of the analysis.

To recap, so far we have learned supervised machine learning, which involves both inputs and outputs. We provide the training data together with the expected output for each example, and the model learns to predict the output for new inputs.

In this image, we gave the model some examples of an apple and told it that each one is an apple. When we later showed it an apple and asked, "What is this?", it replied, "It's an apple."
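A toy version of this supervised setup might look like the sketch below. It uses a 1-nearest-neighbour rule purely for illustration; the article does not say which model is used. The two features (weight in grams and length-to-width ratio), the numbers, and the function name are all hypothetical.

```python
def nearest_neighbour_predict(train_features, train_labels, query):
    """Return the label of the training example closest to the query."""
    distances = [
        sum((a - b) ** 2 for a, b in zip(features, query))
        for features in train_features
    ]
    return train_labels[distances.index(min(distances))]


# Hypothetical training data: (weight in grams, length-to-width ratio).
train_features = [(150, 1.0), (170, 1.1), (120, 3.5), (130, 3.8)]
# Supervised learning: every input comes with its known output (label).
train_labels = ["apple", "apple", "banana", "banana"]

# Ask the model about a new, unlabeled fruit.
prediction = nearest_neighbour_predict(train_features, train_labels, (155, 1.0))
print(prediction)
```

The key point is the train_labels list: the model can only answer with a name because we supplied names during training.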

But in clustering, we are given only the data, with no outputs (labels) to train on. The model finds structure in the data and groups it on its own.

Here the inputs are apples, oranges and bananas, but we haven't told the model their names. The model itself groups similar items together.

Done!