Unit 3
Syllabus:
Ensemble Techniques II- C4.5, CHAID (Chi-Square Automatic Interaction
Detection), Random Forest Algorithm.
Unsupervised Learning-Clustering: Measures of distance, k-means, Gaussian Mixture Model
Clustering, Hierarchical Learning- Divisive, Agglomerative Clustering
The C4.5 algorithm is a successor to the ID3 algorithm. C4.5 was developed by Ross
Quinlan with improvements in handling both categorical and continuous data,
as well as handling missing values. While ID3 works well with categorical data,
C4.5 introduces a mechanism to handle continuous data by selecting optimal
thresholds for splitting. This allows C4.5 to be more versatile in handling
different types of datasets.
Data set
We are going to build a decision tree for the following dataset. It records the factors that influenced the decision to play tennis outside over the previous 14 days.
Firstly, we need to calculate the global entropy. There are 14 examples; 9 instances refer to the yes decision, and 5 instances refer to the no decision.
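For reference, the global entropy of this 9-yes / 5-no split works out as follows (a standard entropy calculation):

$$
\text{Entropy}(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940
$$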
CHAID (Chi-Squared
Automatic Interaction Detector):
CHAID is a decision tree algorithm that is based on the chi-squared test
of independence. It was developed by Gordon Kass in 1980, and it has since
become a popular method for building decision trees. CHAID is an acronym for
Chi-Squared Automatic Interaction Detector, which refers to the statistical
test used to determine the significance of the relationships between variables.
The CHAID algorithm works by recursively splitting the data into
subsets based on the categorical predictor variables that have the strongest
association with the response variable. At each step, CHAID calculates the
chi-squared test of independence between the response variable and each of the
categorical predictor variables. The variable with the strongest
association is chosen as the split variable, and the data is divided into
subsets based on the categories of that variable. This process is repeated for
each subset until the stopping criteria are met. The chi-squared test of
independence tests the null hypothesis that there is no association between the
two variables.
If the chi-squared test of independence shows that there is a
significant association between the response variable and a predictor variable,
then that predictor variable is chosen as the split variable. The data is then
split into subsets based on the categories of the split variable. The process
is repeated for each subset until the stopping criteria are met. The stopping
criteria can be based on the depth of the tree, the number of observations in
each leaf node, or other criteria.
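As a rough sketch of how this split selection could look in code, the snippet below compares candidate predictors by the p-value of a chi-squared test of independence against the response. It is a simplified illustration, not the full CHAID procedure (it omits category merging and Bonferroni adjustment), and the column names in the example data are hypothetical.

```python
# Sketch: choosing a CHAID-style split variable via the chi-squared test of independence.
# Simplified illustration only; real CHAID also merges similar categories and
# applies Bonferroni-adjusted p-values. Column names here are hypothetical.
import pandas as pd
from scipy.stats import chi2_contingency

def best_chaid_split(df, predictors, response):
    """Return the categorical predictor most strongly associated with the response."""
    best_var, best_p = None, 1.0
    for col in predictors:
        table = pd.crosstab(df[col], df[response])       # contingency table
        chi2, p_value, dof, expected = chi2_contingency(table)
        if p_value < best_p:                             # smaller p-value = stronger association
            best_var, best_p = col, p_value
    return best_var, best_p

# Example usage on a tiny play-tennis-style dataset (hypothetical values):
data = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"],
    "Windy":   ["False", "True", "False", "False", "True", "True"],
    "Play":    ["No", "No", "Yes", "Yes", "No", "Yes"],
})
print(best_chaid_split(data, ["Outlook", "Windy"], "Play"))
```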
Random Forest Algorithm:
Before
understanding the working of the random forest algorithm in machine learning,
we must look into the ensemble learning technique. Ensemble simply
means combining multiple models. Thus a collection of models is used to make
predictions rather than an individual model.
Ensemble learning uses two main types of methods: Bagging and Boosting.
The Random Forest classifier works on the Bagging principle.
Bagging
Bagging,
also known as Bootstrap Aggregation, serves as the ensemble technique in the
Random Forest algorithm.
- Selection of Subset: Bagging starts by choosing a random sample, or subset, from the entire dataset.
- Bootstrap Sampling: Each model is trained on one of these samples, called bootstrap samples, which are drawn from the original data with replacement. This process is known as row sampling.
- Bootstrapping: The step of row sampling with replacement is referred to as bootstrapping.
- Independent Model Training: Each model is trained independently on its corresponding bootstrap sample. This training process generates results for each model.
- Majority Voting: The final output is obtained by combining the results of all models through majority voting; the most commonly predicted outcome among the models is selected.
- Aggregation: The step of combining all the results and generating the final output based on majority voting is called aggregation.
Now let’s look at an example by breaking it down with the help of the following figure. Here, bootstrap samples are taken from the actual data (Bootstrap sample 01, Bootstrap sample 02, and Bootstrap sample 03) with replacement, which means there is a high possibility that each sample will contain repeated rather than only unique data points. The models (Model 01, Model 02, and Model 03) obtained from these bootstrap samples are trained independently. Each model generates results as shown. Now, the Happy emoji has a majority when compared to the Sad emoji. Thus, based on majority voting, the final output is the Happy emoji.
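The bagging workflow described above (bootstrap row sampling with replacement, independent training, and majority voting) can be sketched in a few lines of Python. This is only an illustrative sketch on a synthetic dataset, not a full Random Forest implementation.

```python
# Sketch of bagging: bootstrap sampling with replacement, independent models,
# and majority voting. Illustrative only, using a synthetic dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
models = []

for _ in range(10):                                   # 10 bootstrap samples / models
    idx = rng.integers(0, len(X), size=len(X))        # row sampling WITH replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregation: majority vote over the predictions of all models
votes = np.array([m.predict(X) for m in models])      # shape (n_models, n_samples)
majority = (votes.mean(axis=0) > 0.5).astype(int)     # works for binary 0/1 labels
print("training accuracy of the bagged ensemble:", (majority == y).mean())
```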
Boosting
Boosting is one of the techniques that use the concept of ensemble learning. A boosting algorithm combines multiple simple models (also known as weak learners or base estimators) to generate the final output. It does this by building the weak models in series, with each new model trying to correct the errors of the ones before it.
There are
several boosting algorithms; AdaBoost was the first really successful boosting
algorithm that was developed for the purpose of binary classification. AdaBoost
is an abbreviation for Adaptive Boosting and is a prevalent boosting technique
that combines multiple “weak classifiers” into a single “strong classifier.”
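As a brief illustration, scikit-learn's AdaBoostClassifier combines many shallow decision trees (weak learners) trained in sequence; the dataset below is synthetic and the parameter values are only examples.

```python
# Illustrative AdaBoost example: weak learners combined in series.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The default base estimator is a depth-1 decision tree (a "decision stump").
clf = AdaBoostClassifier(n_estimators=50, random_state=1).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```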
Random Forest:
Random Forest is a
famous machine learning algorithm that uses supervised learning methods. You
can apply it to both classification and regression problems. It is based
on ensemble learning, which integrates multiple classifiers to solve a complex
issue and increases the model's performance.
In layman's terms, Random Forest is a classifier that contains several decision trees built on various subsets of a given dataset and combines their outputs to enhance the predictive accuracy on that dataset. Instead of relying on a single decision tree, the random forest collects the result from each tree and produces the final output based on the majority vote of the predictions.
Now that you know
what Random Forest Classifier is and why it is one of the most used
classification algorithms in machine learning, let's dive into a real-life
analogy to understand it better.
Working of
Random Forest Algorithm
The Working of the Random Forest
Algorithm is quite intuitive.
It is implemented in two phases:
- The first is to build the random forest by combining N decision trees, and
- the second is to make predictions from each tree created in the first phase.
The following steps can be used to demonstrate the working process:
Step 1: Pick M data points at random from the training set.
Step 2: Create decision trees for the chosen data points (subsets).
Step 3: Each decision tree will produce a result; analyse it.
Step 4: For classification, the final output is based on majority voting; for regression, it is based on averaging.
The flowchart
below will help you understand better:
Example - Consider the following scenario: a dataset containing images of several fruits is given to the Random Forest classifier. Each decision tree is given a subset of the dataset to work with. During the training phase, each decision tree generates a prediction result. When a new data point appears, the Random Forest classifier predicts the final decision based on the majority of outcomes.
Consider the
following illustration:
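In code, the same workflow can be sketched with scikit-learn's RandomForestClassifier. This is a generic sketch on a synthetic dataset (not the fruit-image dataset above), and the parameter values are illustrative.

```python
# Minimal random forest sketch: N trees on bootstrap subsets, majority voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees (N)
    max_features="sqrt",  # random subset of features considered at each split
    random_state=42,
).fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
```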
Clustering Overview:
Applications:
A Categorization of Major Clustering Methods:
➢
Partitioning Methods
➢
Hierarchical Methods
➢
Density-Based Methods
➢
Grid-Based Methods
➢
Model-Based Methods
Partitioning
Methods:
A partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group.
A partitioning method
creates an initial partitioning. It then uses an iterative relocation technique
that attempts to improve the partitioning by moving objects from one group to another.
The general criterion
of a good partitioning is that objects
in the same cluster are close or related to each other,
whereas objects of different
clusters are far apart or very different.
Hierarchical
Methods:
A hierarchical method
creates a hierarchical decomposition of the
given set of data objects. A
hierarchical method can be classified as being either agglomerative or
divisive, based on how the hierarchical decomposition is
formed.
❖
The agglomerative approach, also called the bottom-up approach, starts with each object forming a
separate group. It successively merges the objects or groups that are close to
one another, until all of the groups are merged into one or until a termination condition holds.
❖
The divisive approach, also called the top-down approach, starts with all of the objects
in the same cluster. In each
successive iteration, a cluster is split up into smaller clusters,until eventually each object is in one cluster, or until a termination condition
holds.
Hierarchical methods suffer from the fact
that once a step (merge or split) is done, it can never be undone. This
rigidity is useful in that it leads to smaller computation costs by not having
to worry about a combinatorial
number of different choices.
Density-based methods:
❖
Most partitioning
methods cluster objects based on the distance between objects. Such methods can
find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary
shapes.
❖
Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing
the given cluster as long as the density in the neighborhood exceeds some
threshold; that is, for each data point within a given cluster,
the neighborhood of a given
radius has to contain at least
a minimum number of points.
Such a method can be used to filter out noise (outliers) and discover clusters
of arbitrary shape.
❖
DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis.
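As a short, hedged illustration of a density-based method, the sketch below runs scikit-learn's DBSCAN on synthetic non-spherical ("two moons") data; the eps and min_samples values are only examples.

```python
# Density-based clustering sketch: DBSCAN finds arbitrarily shaped clusters
# and labels sparse points as noise (-1). Parameter values are illustrative.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print("clusters found:", set(labels) - {-1}, "| noise points:", (labels == -1).sum())
```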
Grid-Based Methods:
❖
Grid-based
methods quantize the object space into a finite number of cells that form a
grid structure.
❖
All of the
clustering operations are performed on the grid structure i.e.,
on the quantized space. The main advantage of this approach
is its fast processing time, which is typically
independent of the number of data objects and dependent
only on the number of cells in
each dimension in the quantized space.
❖
STING is a typical
example of a grid-based method.
Wave Cluster applies
wavelet transformation for
clustering analysis and is both grid-based
and density-based.
Model-Based Methods:
❖
Model-based
methods hypothesize a model
for each of the clusters and find the best
fit of the data to the given model.
❖
A
model-based algorithm may locate clusters by constructing a density function
that reflects the spatial
distribution of the data points.
❖
It also leads to a way of automatically determining the number of clusters based on standard statistics, taking "noise" or outliers into account and thus yielding robust clustering methods.
Classical Partitioning Methods:
The most well-known and commonly used partitioning methods
are
❖
k-Means Method
❖
k-Medoids Method
Partitioning Clustering: The K-Means Method:
The k-means algorithm takes the input
parameter, k, and partitions a set of n objects into k clusters so that the
resulting intracluster similarity is high but the intercluster similarity is low.
Cluster similarity is measured in
regard to the mean value of the
objects in a cluster, which can be viewed as the
cluster’s centroid or center of gravity.
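The squared-error criterion that k-means typically minimizes (with m_i the mean of cluster C_i) can be written as:

$$
E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2
$$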
Clustering of a set of objects based on the k-means
method
K-means Clustering Numerical
Example with Solution
Now that we have discussed the algorithm,
let us solve a numerical problem on k means clustering. The problem is as
follows. You are given 15 points in the Cartesian coordinate system as follows.
| Point | Coordinates |
|-------|-------------|
| A1 | (2,10) |
| A2 | (2,6) |
| A3 | (11,11) |
| A4 | (6,9) |
| A5 | (6,4) |
| A6 | (1,2) |
| A7 | (5,10) |
| A8 | (4,9) |
| A9 | (10,12) |
| A10 | (7,5) |
| A11 | (9,11) |
| A12 | (4,6) |
| A13 | (3,10) |
| A14 | (3,8) |
| A15 | (6,11) |
Input Dataset
We are also given the information that we need to make 3 clusters, i.e. K = 3. We will solve this numerical on k-means clustering using the approach discussed below.
First, we will randomly choose 3
centroids from the given data. Let us consider A2 (2,6), A7 (5,10), and A15
(6,11) as the centroids of the initial clusters. Hence, we will consider that
- Centroid 1=(2,6) is associated with cluster 1.
- Centroid 2=(5,10) is associated with cluster
2.
- Centroid 3=(6,11) is associated with cluster
3.
Now we will find the Euclidean distance between each point and the centroids. Based on the minimum distance of each point from the centroids, we will assign the points to a cluster. The distances of the given points from the centroids are tabulated in the following table.
| Point | Distance from Centroid 1 (2,6) | Distance from Centroid 2 (5,10) | Distance from Centroid 3 (6,11) | Assigned Cluster |
|-------|-------------------------------|--------------------------------|--------------------------------|------------------|
| A1 (2,10) | 4 | 3 | 4.123106 | Cluster 2 |
| A2 (2,6) | 0 | 5 | 6.403124 | Cluster 1 |
| A3 (11,11) | 10.29563 | 6.082763 | 5 | Cluster 3 |
| A4 (6,9) | 5 | 1.414214 | 2 | Cluster 2 |
| A5 (6,4) | 4.472136 | 6.082763 | 7 | Cluster 1 |
| A6 (1,2) | 4.123106 | 8.944272 | 10.29563 | Cluster 1 |
| A7 (5,10) | 5 | 0 | 1.414214 | Cluster 2 |
| A8 (4,9) | 3.605551 | 1.414214 | 2.828427 | Cluster 2 |
| A9 (10,12) | 10 | 5.385165 | 4.123106 | Cluster 3 |
| A10 (7,5) | 5.09902 | 5.385165 | 6.082763 | Cluster 1 |
| A11 (9,11) | 8.602325 | 4.123106 | 3 | Cluster 3 |
| A12 (4,6) | 2 | 4.123106 | 5.385165 | Cluster 1 |
| A13 (3,10) | 4.123106 | 2 | 3.162278 | Cluster 2 |
| A14 (3,8) | 2.236068 | 2.828427 | 4.242641 | Cluster 1 |
| A15 (6,11) | 6.403124 | 1.414214 | 0 | Cluster 3 |
Results from 1st iteration of K means clustering
At this point, we have completed the
first iteration of the k-means clustering algorithm and assigned each point into
a cluster.
In the above table, you can observe
that the point that is closest to the centroid of a given cluster is assigned
to the cluster.
Now, we will calculate the new
centroid for each cluster.
- In cluster 1, we have 6 points i.e. A2 (2,6),
A5 (6,4), A6 (1,2), A10 (7,5), A12 (4,6), A14 (3,8). To calculate the new
centroid for cluster 1, we will find the mean of the x and y coordinates
of each point in the cluster. Hence, the new centroid for cluster 1 is
(3.833, 5.167).
- In cluster 2, we have 5 points i.e. A1 (2,10),
A4 (6,9), A7 (5,10) , A8 (4,9), and A13 (3,10). Hence, the new centroid
for cluster 2 is (4, 9.6)
- In cluster 3, we have 4 points i.e. A3
(11,11), A9 (10,12), A11 (9,11), and A15 (6,11). Hence, the new centroid
for cluster 3 is (9, 11.25).
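These centroid updates are simply coordinate-wise means; the values above can be checked with a few lines of NumPy:

```python
# Verifying the new centroids as coordinate-wise means of the assigned points.
import numpy as np
cluster1 = np.array([(2, 6), (6, 4), (1, 2), (7, 5), (4, 6), (3, 8)])
cluster2 = np.array([(2, 10), (6, 9), (5, 10), (4, 9), (3, 10)])
cluster3 = np.array([(11, 11), (10, 12), (9, 11), (6, 11)])
print(cluster1.mean(axis=0), cluster2.mean(axis=0), cluster3.mean(axis=0))
# -> approximately (3.833, 5.167), (4.0, 9.6), (9.0, 11.25)
```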
Now that we have calculated new
centroids for each cluster, we will calculate the distance of each data point
from the new centroids. Then, we will assign the points to clusters based on
their distance from the centroids. The results for this process have been given
in the following table.
| Point | Distance from Centroid 1 (3.833, 5.167) | Distance from Centroid 2 (4, 9.6) | Distance from Centroid 3 (9, 11.25) | Assigned Cluster |
|-------|----------------------------------------|----------------------------------|------------------------------------|------------------|
| A1 (2,10) | 5.169 | 2.040 | 7.111 | Cluster 2 |
| A2 (2,6) | 2.013 | 4.118 | 8.750 | Cluster 1 |
| A3 (11,11) | 9.241 | 7.139 | 2.016 | Cluster 3 |
| A4 (6,9) | 4.403 | 2.088 | 3.750 | Cluster 2 |
| A5 (6,4) | 2.461 | 5.946 | 7.846 | Cluster 1 |
| A6 (1,2) | 4.249 | 8.171 | 12.230 | Cluster 1 |
| A7 (5,10) | 4.972 | 1.077 | 4.191 | Cluster 2 |
| A8 (4,9) | 3.837 | 0.600 | 5.483 | Cluster 2 |
| A9 (10,12) | 9.204 | 6.462 | 1.250 | Cluster 3 |
| A10 (7,5) | 3.171 | 5.492 | 6.562 | Cluster 1 |
| A11 (9,11) | 7.792 | 5.192 | 0.250 | Cluster 3 |
| A12 (4,6) | 0.850 | 3.600 | 7.250 | Cluster 1 |
| A13 (3,10) | 4.904 | 1.077 | 6.129 | Cluster 2 |
| A14 (3,8) | 2.953 | 1.887 | 6.824 | Cluster 2 |
| A15 (6,11) | 6.223 | 2.441 | 3.010 | Cluster 2 |
Results from 2nd iteration of K means clustering
Now, we have completed the second
iteration of the k-means clustering algorithm and assigned each point into an
updated cluster. In the above table, you can observe that the point closest to
the new centroid of a given cluster is assigned to the cluster.
Now, we will calculate the new
centroid for each cluster for the third iteration.
- In cluster 1, we have 5 points i.e. A2 (2,6),
A5 (6,4), A6 (1,2), A10 (7,5), and A12 (4,6). To calculate the new
centroid for cluster 1, we will find the mean of the x and y coordinates
of each point in the cluster. Hence, the new centroid for cluster 1 is (4,
4.6).
- In cluster 2, we have 7 points i.e. A1 (2,10),
A4 (6,9), A7 (5,10) , A8 (4,9), A13 (3,10), A14 (3,8), and A15 (6,11).
Hence, the new centroid for cluster 2 is (4.143, 9.571)
- In cluster 3, we have 3 points i.e. A3
(11,11), A9 (10,12), and A11 (9,11). Hence, the new centroid for cluster 3
is (10, 11.333).
At this point, we have calculated new
centroids for each cluster. Now, we will calculate the distance of each data
point from the new centroids. Then, we will assign the points to clusters based
on their distance from the centroids. The results for this process have been
given in the following table.
| Point | Distance from Centroid 1 (4, 4.6) | Distance from Centroid 2 (4.143, 9.571) | Distance from Centroid 3 (10, 11.333) | Assigned Cluster |
|-------|----------------------------------|----------------------------------------|--------------------------------------|------------------|
| A1 (2,10) | 5.758 | 2.186 | 8.110 | Cluster 2 |
| A2 (2,6) | 2.441 | 4.165 | 9.615 | Cluster 1 |
| A3 (11,11) | 9.485 | 7.004 | 1.054 | Cluster 3 |
| A4 (6,9) | 4.833 | 1.943 | 4.631 | Cluster 2 |
| A5 (6,4) | 2.088 | 5.872 | 8.353 | Cluster 1 |
| A6 (1,2) | 3.970 | 8.197 | 12.966 | Cluster 1 |
| A7 (5,10) | 5.492 | 0.958 | 5.175 | Cluster 2 |
| A8 (4,9) | 4.400 | 0.589 | 6.438 | Cluster 2 |
| A9 (10,12) | 9.527 | 6.341 | 0.667 | Cluster 3 |
| A10 (7,5) | 3.027 | 5.390 | 7.008 | Cluster 1 |
| A11 (9,11) | 8.122 | 5.063 | 1.054 | Cluster 3 |
| A12 (4,6) | 1.400 | 3.574 | 8.028 | Cluster 1 |
| A13 (3,10) | 5.492 | 1.221 | 7.126 | Cluster 2 |
| A14 (3,8) | 3.544 | 1.943 | 7.753 | Cluster 2 |
| A15 (6,11) | 6.705 | 2.343 | 4.014 | Cluster 2 |
Results from 3rd iteration of K means clustering
Now, we have completed the third iteration of the k-means clustering algorithm and assigned each point to an updated cluster. In the above table, you can observe that the point that is closest to the new centroid of a given cluster is assigned to that cluster.
Now, we will once again calculate the new centroid for each cluster.
- In cluster 1, we have 5 points i.e. A2 (2,6),
A5 (6,4), A6 (1,2), A10 (7,5), and A12 (4,6). To calculate the new
centroid for cluster 1, we will find the mean of the x and y coordinates
of each point in the cluster. Hence, the new centroid for cluster 1 is (4,
4.6).
- In cluster 2, we have 7 points i.e. A1 (2,10),
A4 (6,9), A7 (5,10) , A8 (4,9), A13 (3,10), A14 (3,8), and A15 (6,11).
Hence, the new centroid for cluster 2 is (4.143, 9.571)
- In cluster 3, we have 3 points i.e. A3
(11,11), A9 (10,12), and A11 (9,11). Hence, the new centroid for cluster 3
is (10, 11.333).
Here, you can observe that no point has changed its cluster compared to the previous iteration. Due to this, the centroids also remain constant. Therefore, we can say that the clusters have stabilized. Hence, the clusters obtained after the third iteration are the final clusters made from the given dataset. If we plot the clusters on a graph, it looks as follows.
Plot for K-Means Clustering
In the above plot, points in the clusters have been plotted using red, blue, and black markers. The centroids of the clusters have been marked using green circles.
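For reference, the whole worked example can be reproduced with a short NumPy implementation of the k-means iterations, starting from the same initial centroids A2, A7, and A15. This is only a minimal sketch of Lloyd's algorithm.

```python
# Reproducing the worked example: k-means with the initial centroids A2, A7, A15.
import numpy as np

points = np.array([(2, 10), (2, 6), (11, 11), (6, 9), (6, 4), (1, 2), (5, 10), (4, 9),
                   (10, 12), (7, 5), (9, 11), (4, 6), (3, 10), (3, 8), (6, 11)], dtype=float)
centroids = np.array([(2, 6), (5, 10), (6, 11)], dtype=float)   # A2, A7, A15

for iteration in range(10):
    # Assign each point to its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Recompute centroids as the mean of the assigned points
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(3)])
    if np.allclose(new_centroids, centroids):                    # converged
        break
    centroids = new_centroids

print("final centroids:\n", centroids)   # ~ (4, 4.6), (4.143, 9.571), (10, 11.333)
print("assignments:", labels)
```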
The k-Medoids Method:
The k-Medoids Algorithm:
The k-medoids algorithm partitions the data based on medoids, or central objects. The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method.
The K-Medoids algorithm (also called Partitioning Around Medoids, PAM) was proposed in 1987 by Kaufman and Rousseeuw. A medoid can be defined as the point in a cluster whose total dissimilarity to all the other points in the cluster is minimum.
The dissimilarity between a medoid Ci and an object Pi is calculated as
E = |Pi – Ci|
The cost in the K-Medoids algorithm is the sum of these dissimilarities over all points and their assigned medoids:
Cost = Σ (over all clusters Ci) Σ (over all points Pi assigned to Ci) |Pi – Ci|
Algorithm:
1. Initialize: select k of the n data points at random as the medoids.
2. Associate each data point with the closest medoid, using any common distance metric.
3. While the cost decreases, for each medoid m and for each non-medoid data point o:
· Swap m and o, associate each data point with the closest medoid, and recompute the cost.
· If the total cost is more than that in the previous step, undo the swap.
K-Medoids clustering with solved example
Let’s consider the following example with the ten data points tabulated below. If a graph is drawn using these data points, we obtain a scatter plot of the dataset.
Step 1: Select k = 2, and let the two randomly chosen medoids be C1 = (4, 5) and C2 = (8, 5).
Step 2: Calculate the cost. The dissimilarity of each non-medoid point with the medoids is calculated and tabulated:
| Point | X | Y | Dissimilarity from C1 | Dissimilarity from C2 | Cluster Assigned |
|-------|---|---|----------------------|----------------------|------------------|
| 0 | 8 | 7 | 6 | 2 | C2 |
| 1 | 3 | 7 | 3 | 7 | C1 |
| 2 | 4 | 9 | 4 | 8 | C1 |
| 3 | 9 | 6 | 6 | 2 | C2 |
| 4 | 8 | 5 | - | - | (medoid C2) |
| 5 | 5 | 8 | 4 | 6 | C1 |
| 6 | 7 | 3 | 5 | 3 | C2 |
| 7 | 8 | 4 | 5 | 1 | C2 |
| 8 | 4 | 5 | - | - | (medoid C1) |
| 9 | 7 | 5 | 3 | 1 | C2 |
Here the Manhattan distance formula is used to calculate the distances between the medoids and the non-medoid points. That formula is
Distance = |X1 - X2| + |Y1 - Y2|.
Each point is assigned to the cluster of the medoid with the smaller dissimilarity. So points 1, 2, and 5 go to cluster C1, and points 0, 3, 6, 7, and 9 go to cluster C2.
The Cost = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
Step 3: Randomly select one non-medoid point and recalculate the cost. Let the randomly selected point be (8, 4). The dissimilarity of each non-medoid point with the medoids C1 (4, 5) and C2 (8, 4) is calculated and tabulated.
Each point is assigned to the cluster of the medoid with the smaller dissimilarity. So, points 1, 2, and 5 go to cluster C1, and points 0, 3, 4, 6, and 9 go to cluster C2.
The New cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22
Swap Cost = New Cost – Previous Cost = 22 – 20 = 2, and 2 > 0.
As the swap cost is not less than zero, we undo the swap.
Hence (4, 5) and (8, 5) remain the final medoids. The clustering would be as shown in the following figure.
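The cost figures in this example can be double-checked with a small helper function using the same Manhattan distance; this is a verification sketch only, not a full PAM implementation.

```python
# Verifying the k-medoids costs from the example using Manhattan distance.
import numpy as np

points = np.array([(8, 7), (3, 7), (4, 9), (9, 6), (8, 5),
                   (5, 8), (7, 3), (8, 4), (4, 5), (7, 5)])

def total_cost(medoids):
    """Sum of Manhattan distances from every point to its nearest medoid."""
    d = np.abs(points[:, None, :] - np.array(medoids)[None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

print(total_cost([(4, 5), (8, 5)]))   # 20  (original medoids)
print(total_cost([(4, 5), (8, 4)]))   # 22  (after swapping in (8, 4))
```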
Measures
of distance:
There are several types of
distance measures, each with its strengths and weaknesses. Here are some of the
most commonly used distance measures in clustering:
1. Euclidean Distance
The Euclidean distance is the most widely used distance measure in clustering. It calculates the straight-line distance between two points in n-dimensional space. The formula for Euclidean distance is:
d(p, q) = √( (p1 − q1)² + (p2 − q2)² + … + (pn − qn)² )
where,
· p and q are two data points,
· and n is the number of dimensions.
2. Manhattan Distance
The Manhattan distance, sometimes referred to as the L1 distance or city block distance, is the total of the absolute differences between the Cartesian coordinates of two points. Suppose we have two points P and Q; to determine the distance between them, we simply sum the absolute differences of their coordinates along each axis. In a plane with P at coordinate (x1, y1) and Q at (x2, y2), the Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|.
In general, for n dimensions the formula is:
d(P, Q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|
In a plot, the total length of the red line (the path along the grid directions) gives the Manhattan distance between the two points.
3. Cosine Similarity
Rather than concentrating on the exact distance between data points, the cosine similarity measure looks at their orientation. It calculates the cosine of the angle between two data points, with a higher cosine value indicating greater similarity. This measure is often used for text data analysis, where the order of features (words in a sentence) might not be as crucial as their presence. It is used to determine how similar the vectors are, irrespective of their magnitude.
The cosine measure for clustering determines the cosine of the angle between two vectors, given by the following formula:
cos(θ) = (A · B) / (‖A‖ ‖B‖)
Here θ (theta) gives the angle between the two vectors, and A, B are n-dimensional vectors.
4. Minkowski Distance
Minkowski distance is a generalized form of both the Euclidean and Manhattan distances, controlled by a parameter p. In an N-dimensional space, a point is represented as (x1, x2, ..., xN).
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
The Minkowski distance between them is:
D(P1, P2) = ( |X1 − Y1|^p + |X2 − Y2|^p + … + |XN − YN|^p )^(1/p)
With p = 1 this reduces to the Manhattan distance, and with p = 2 to the Euclidean distance.
5. Jaccard Index
The Jaccard index measures the similarity of two sets of items as the size of their intersection divided by the size of their union; the Jaccard distance is 1 minus this index:
J(A, B) = |A ∩ B| / |A ∪ B|, Jaccard distance = 1 − J(A, B)
This measure is ideal for binary data, where features can only take values of 0 or 1.
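All of the measures above are available in SciPy; the short sketch below evaluates each of them on two small illustrative vectors.

```python
# Illustrative comparison of the distance measures above on small vectors.
from scipy.spatial import distance

p, q = [2, 10], [5, 10]
print("Euclidean:", distance.euclidean(p, q))          # sqrt((2-5)^2 + (10-10)^2) = 3.0
print("Manhattan:", distance.cityblock(p, q))          # |2-5| + |10-10| = 3
print("Minkowski (p=3):", distance.minkowski(p, q, 3)) # generalized form
print("Cosine distance:", distance.cosine(p, q))       # 1 - cosine similarity

a, b = [1, 0, 1, 1], [1, 1, 0, 1]                      # binary vectors for Jaccard
print("Jaccard distance:", distance.jaccard(a, b))     # 1 - |A ∩ B| / |A ∪ B|
```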
Agglomerative hierarchical clustering:
1. Agglomerative
Algorithm: Single Link
Single-nearest distance or single linkage is the
agglomerative method that uses the distance between the closest members of the
two clusters. We will now solve a problem to understand it
better:
Question. Find the
clusters using a single link technique. Use
Euclidean distance and draw the dendrogram.
Solution:
Step 1: Compute the distance matrix. We have to find the Euclidean distance between every pair of points; say we first find the Euclidean distance between P1 and P2.
The DISTANCE MATRIX is then built from all of these pairwise distances. Similarly, we find the Euclidean distance for every other pair of points. One point to note is that the distance matrix is symmetric about its diagonal: the distance above and below the diagonal is the same. For example, d(P2, P5) is equivalent to d(P5, P2). So we only need to fill in the lower-triangular section of the matrix.
2. Agglomerative Algorithm: Complete Link
In this algorithm, complete (farthest-distance) linkage is the agglomerative method that uses the distance between the members of the two clusters that are farthest apart.
Example. For the given set of points, identify the clusters using complete link agglomerative clustering.
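Since the example point coordinates are not reproduced here, the sketch below uses a small hypothetical set of 2-D points to show how single-link and complete-link agglomerative clustering can be computed with SciPy.

```python
# Single-link vs complete-link agglomerative clustering (hypothetical points).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

points = np.array([(0.40, 0.53), (0.22, 0.38), (0.35, 0.32),
                   (0.26, 0.19), (0.08, 0.41), (0.45, 0.30)])

print(squareform(pdist(points)).round(2))          # symmetric Euclidean distance matrix

single_link = linkage(points, method="single")     # merge by closest members
complete_link = linkage(points, method="complete") # merge by farthest members

# Cut each dendrogram into 2 clusters and compare the assignments
print("single  :", fcluster(single_link, t=2, criterion="maxclust"))
print("complete:", fcluster(complete_link, t=2, criterion="maxclust"))
# scipy.cluster.hierarchy.dendrogram(single_link) can be used to draw the tree
```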
Divisive hierarchical clustering:
This top-down strategy does the
reverse of agglomerative hierarchical clustering by starting with all objects in
one cluster. It subdivides the cluster into smaller and
smaller pieces, until each object
forms a cluster on its own or until it satisfies certain termination
conditions, such as a desired number of clusters is obtained or the diameter of each cluster is within a certain threshold.
It repeatedly executes the following steps:
1. Start with all of the objects in a single cluster.
2. At each step, split one cluster (for example, the least cohesive one) into two smaller clusters.
These steps are repeated until every object is in its own cluster or a termination condition is reached.
Divisive Hierarchical clustering is precisely the opposite of Agglomerative Hierarchical clustering. In Divisive Hierarchical clustering, we start by treating all of the data points as a single cluster, and in every iteration we separate out the data points that are not similar to the rest of their cluster. In the end, we are left with N clusters (one per data point).
Divisive
Hierarchical clustering
Gaussian Mixture Models (GMM) Clustering:
Take the example of analysing customer buying patterns at a shopping mall. Some shoppers buy luxury goods, some prefer budget-friendly options, and others fall somewhere in between. Traditional clustering methods like K-Means force these customers into rigid groups. But what if a customer is partly a budget shopper and partly a luxury buyer? This is where Gaussian Mixture Models (GMM) shine, allowing soft clustering based on probabilities rather than hard assignments.
At its core, a
GMM is a combination of several Gaussian components. These components are defined by their mean, covariance
matrices, and weights, providing a comprehensive representation of data
distributions. The probability density function of a GMM is a sum of its
components, each weighted accordingly. It assumes that data is generated from a mixture of
multiple Gaussian (normal) distributions.
Notation:
·
K: Number of Gaussian components
·
N: Number of data points
·
D: Dimensionality of the data
GMM Parameters:
·
Means (μ): Center locations of Gaussian components.
·
Covariance Matrices (Σ): Define the shape and spread of each component.
·
Weights (π): Probability of selecting each component.
GMM is trained using the Expectation-Maximization
(EM) algorithm, which iteratively improves cluster assignments.
Step 1: Initialization — Randomly assign Gaussian components.
Step 2: Expectation Step (E-Step) — Compute the probability that
each point belongs to a Gaussian.
Step 3: Maximization Step (M-Step) — Update Gaussian parameters (μ,
Σ, π) to best fit the data.
Step 4: Repeat until Convergence.
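A brief sketch of GMM soft clustering with scikit-learn is shown below; the data is synthetic and the parameter choices are illustrative. The EM iterations described above run inside fit(), and predict_proba() returns the per-component membership probabilities that make the assignments "soft".

```python
# GMM soft clustering sketch on synthetic data (EM runs inside .fit()).
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=7)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7).fit(X)

print("means:\n", gmm.means_)            # mu_k for each component
print("weights:", gmm.weights_)          # pi_k, sum to 1
print("soft assignment of first point:", gmm.predict_proba(X[:1]).round(3))
print("hard labels (argmax):", gmm.predict(X[:5]))
```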
Mathematical
Representation of GMM:
· The likelihood of a data point under a GMM is the
sum of the probabilities of that point belonging to each of the Gaussian
components.
· The log-likelihood of the entire
dataset is the sum of the logarithms of these probabilities for each data
point.
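In symbols (standard GMM notation, consistent with the parameters listed above), the mixture density of a data point and the log-likelihood of the dataset are:

$$
p(x_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k),
\qquad
\log L = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
$$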