Unit 3
Syllabus:
Ensemble Techniques II- C4.5, CHAID (Chi-Square Automatic Interaction
Detection), Random Forest Algorithm.
Unsupervised Learning-Clustering: Measures of distance, k-means, Gaussian Mixture Model
Clustering, Hierarchical Learning- Divisive, Agglomerative Clustering
The C4.5 algorithm is a successor to the ID3 algorithm. C4.5 was developed by Ross
Quinlan with improvements in handling both categorical and continuous data,
as well as handling missing values. While ID3 works well with categorical data,
C4.5 introduces a mechanism to handle continuous data by selecting optimal
thresholds for splitting. This allows C4.5 to be more versatile in handling
different types of datasets.
Data set
We are going to build a decision tree for the following dataset. It records the factors that influenced the decision to play tennis outside over the previous 14 days.
Firstly, we need to calculate the global entropy. There are 14 examples; 9 instances refer to the yes decision, and 5 instances refer to the no decision.
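For reference, the global entropy of this 9-yes / 5-no split works out as follows (a standard entropy calculation):

$$
\text{Entropy}(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940
$$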
CHAID (Chi-Squared
Automatic Interaction Detector):
CHAID is a decision tree algorithm that is based on the chi-squared test
of independence. It was developed by Gordon Kass in 1980, and it has since
become a popular method for building decision trees. CHAID is an acronym for
Chi-Squared Automatic Interaction Detector, which refers to the statistical
test used to determine the significance of the relationships between variables.
The CHAID algorithm works by recursively splitting the data into
subsets based on the categorical predictor variables that have the strongest
association with the response variable. At each step, CHAID calculates the
chi-squared test of independence between the response variable and each of the
categorical predictor variables. The variable with the strongest
association is chosen as the split variable, and the data is divided into
subsets based on the categories of that variable. This process is repeated for
each subset until the stopping criteria are met. The chi-squared test of
independence tests the null hypothesis that there is no association between the
two variables.
If the chi-squared test of independence shows that there is a
significant association between the response variable and a predictor variable,
then that predictor variable is chosen as the split variable. The data is then
split into subsets based on the categories of the split variable. The process
is repeated for each subset until the stopping criteria are met. The stopping
criteria can be based on the depth of the tree, the number of observations in
each leaf node, or other criteria.
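As a rough sketch of how this split selection could look in code, the snippet below compares candidate predictors by the p-value of a chi-squared test of independence against the response. It is a simplified illustration, not the full CHAID procedure (it omits category merging and Bonferroni adjustment), and the column names in the example data are hypothetical.

```python
# Sketch: choosing a CHAID-style split variable via the chi-squared test of independence.
# Simplified illustration only; real CHAID also merges similar categories and
# applies Bonferroni-adjusted p-values. Column names here are hypothetical.
import pandas as pd
from scipy.stats import chi2_contingency

def best_chaid_split(df, predictors, response):
    """Return the categorical predictor most strongly associated with the response."""
    best_var, best_p = None, 1.0
    for col in predictors:
        table = pd.crosstab(df[col], df[response])       # contingency table
        chi2, p_value, dof, expected = chi2_contingency(table)
        if p_value < best_p:                             # smaller p-value = stronger association
            best_var, best_p = col, p_value
    return best_var, best_p

# Example usage on a tiny play-tennis-style dataset (hypothetical values):
data = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"],
    "Windy":   ["False", "True", "False", "False", "True", "True"],
    "Play":    ["No", "No", "Yes", "Yes", "No", "Yes"],
})
print(best_chaid_split(data, ["Outlook", "Windy"], "Play"))
```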
Random Forest Algorithm:
Before
understanding the working of the random forest algorithm in machine learning,
we must look into the ensemble learning technique. Ensemble simply
means combining multiple models. Thus a collection of models is used to make
predictions rather than an individual model.
Ensemble learning uses two main types of methods: Bagging and Boosting.
The Random Forest classifier works on the Bagging principle.
Bagging
Bagging,
also known as Bootstrap Aggregation, serves as the ensemble technique in the
Random Forest algorithm.
- Selection of Subset: Bagging starts by choosing a random sample, or subset, from the entire dataset.
- Bootstrap Sampling: Each model is trained on one of these samples, called bootstrap samples, which are drawn from the original data with replacement. This process is known as row sampling.
- Bootstrapping: The step of row sampling with replacement is referred to as bootstrapping.
- Independent Model Training: Each model is trained independently on its corresponding bootstrap sample. This training process generates results for each model.
- Majority Voting: The final output is obtained by combining the results of all models through majority voting; the most commonly predicted outcome among the models is selected.
- Aggregation: The step of combining all the results and generating the final output based on majority voting is called aggregation.
Now let’s look at an example by breaking it down with the help of the following figure. Here, bootstrap samples are taken from the actual data (Bootstrap sample 01, Bootstrap sample 02, and Bootstrap sample 03) with replacement, which means there is a high possibility that each sample will contain repeated rather than only unique data points. The models (Model 01, Model 02, and Model 03) obtained from these bootstrap samples are trained independently. Each model generates results as shown. Now, the Happy emoji has a majority when compared to the Sad emoji. Thus, based on majority voting, the final output is the Happy emoji.
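The bagging workflow described above (bootstrap row sampling with replacement, independent training, and majority voting) can be sketched in a few lines of Python. This is only an illustrative sketch on a synthetic dataset, not a full Random Forest implementation.

```python
# Sketch of bagging: bootstrap sampling with replacement, independent models,
# and majority voting. Illustrative only, using a synthetic dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
models = []

for _ in range(10):                                   # 10 bootstrap samples / models
    idx = rng.integers(0, len(X), size=len(X))        # row sampling WITH replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregation: majority vote over the predictions of all models
votes = np.array([m.predict(X) for m in models])      # shape (n_models, n_samples)
majority = (votes.mean(axis=0) > 0.5).astype(int)     # works for binary 0/1 labels
print("training accuracy of the bagged ensemble:", (majority == y).mean())
```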
Boosting
Boosting is one of the techniques that use the concept of ensemble learning. A boosting algorithm combines multiple simple models (also known as weak learners or base estimators) to generate the final output. It does this by building the weak models in series, with each new model trying to correct the errors of the ones before it.
There are
several boosting algorithms; AdaBoost was the first really successful boosting
algorithm that was developed for the purpose of binary classification. AdaBoost
is an abbreviation for Adaptive Boosting and is a prevalent boosting technique
that combines multiple “weak classifiers” into a single “strong classifier.”
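As a brief illustration, scikit-learn's AdaBoostClassifier combines many shallow decision trees (weak learners) trained in sequence; the dataset below is synthetic and the parameter values are only examples.

```python
# Illustrative AdaBoost example: weak learners combined in series.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The default base estimator is a depth-1 decision tree (a "decision stump").
clf = AdaBoostClassifier(n_estimators=50, random_state=1).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```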
Random Forest:
Random Forest is a
famous machine learning algorithm that uses supervised learning methods. You
can apply it to both classification and regression problems. It is based
on ensemble learning, which integrates multiple classifiers to solve a complex
issue and increases the model's performance.
In layman's terms, Random Forest is a classifier that contains several decision trees built on various subsets of a given dataset and combines their outputs to enhance the predictive accuracy on that dataset. Instead of relying on a single decision tree, the random forest collects the result from each tree and produces the final output based on the majority vote of the predictions.
Now that you know
what Random Forest Classifier is and why it is one of the most used
classification algorithms in machine learning, let's dive into a real-life
analogy to understand it better.
Working of
Random Forest Algorithm
The Working of the Random Forest
Algorithm is quite intuitive.
It is implemented in two phases:
- The first is to build the random forest by combining N decision trees, and
- the second is to make predictions from each tree created in the first phase.
The following steps can be used to demonstrate the working process:
Step 1: Pick M data points at random from the training set.
Step 2: Create decision trees for the chosen data points (subsets).
Step 3: Each decision tree will produce a result; analyse it.
Step 4: For classification, the final output is based on majority voting; for regression, it is based on averaging.
The flowchart
below will help you understand better:
Example - Consider the following scenario: a dataset containing images of several fruits is given to the Random Forest classifier. Each decision tree is given a subset of the dataset to work with. During the training phase, each decision tree generates a prediction result. When a new data point appears, the Random Forest classifier predicts the final decision based on the majority of outcomes.
Consider the
following illustration:
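In code, the same workflow can be sketched with scikit-learn's RandomForestClassifier. This is a generic sketch on a synthetic dataset (not the fruit-image dataset above), and the parameter values are illustrative.

```python
# Minimal random forest sketch: N trees on bootstrap subsets, majority voting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees (N)
    max_features="sqrt",  # random subset of features considered at each split
    random_state=42,
).fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
```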
Clustering Overview:
Applications:
A Categorization of Major Clustering Methods:
➢
Partitioning Methods
➢
Hierarchical Methods
➢
Density-Based Methods
➢
Grid-Based Methods
➢
Model-Based Methods
Partitioning
Methods:
A partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group.
A partitioning method
creates an initial partitioning. It then uses an iterative relocation technique
that attempts to improve the partitioning by moving objects from one group to another.
The general criterion
of a good partitioning is that objects
in the same cluster are close or related to each other,
whereas objects of different
clusters are far apart or very different.
Hierarchical
Methods:
A hierarchical method
creates a hierarchical decomposition of the
given set of data objects. A
hierarchical method can be classified as being either agglomerative or
divisive, based on how the hierarchical decomposition is
formed.
❖
The agglomerative approach, also called the bottom-up approach, starts with each object forming a
separate group. It successively merges the objects or groups that are close to
one another, until all of the groups are merged into one or until a termination condition holds.
❖
The divisive approach, also called the top-down approach, starts with all of the objects
in the same cluster. In each
successive iteration, a cluster is split up into smaller clusters,until eventually each object is in one cluster, or until a termination condition
holds.
Hierarchical methods suffer from the fact
that once a step (merge or split) is done, it can never be undone. This
rigidity is useful in that it leads to smaller computation costs by not having
to worry about a combinatorial
number of different choices.
Density-based methods:
❖
Most partitioning
methods cluster objects based on the distance between objects. Such methods can
find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary
shapes.
❖
Other clustering methods have been developed based on the notion of density. Their general idea is to continue growing
the given cluster as long as the density in the neighborhood exceeds some
threshold; that is, for each data point within a given cluster,
the neighborhood of a given
radius has to contain at least
a minimum number of points.
Such a method can be used to filter out noise (outliers) and discover clusters
of arbitrary shape.
❖
DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis.
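As a short, hedged illustration of a density-based method, the sketch below runs scikit-learn's DBSCAN on synthetic non-spherical ("two moons") data; the eps and min_samples values are only examples.

```python
# Density-based clustering sketch: DBSCAN finds arbitrarily shaped clusters
# and labels sparse points as noise (-1). Parameter values are illustrative.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print("clusters found:", set(labels) - {-1}, "| noise points:", (labels == -1).sum())
```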
Grid-Based Methods:
❖
Grid-based
methods quantize the object space into a finite number of cells that form a
grid structure.
❖
All of the
clustering operations are performed on the grid structure i.e.,
on the quantized space. The main advantage of this approach
is its fast processing time, which is typically
independent of the number of data objects and dependent
only on the number of cells in
each dimension in the quantized space.
❖
STING is a typical
example of a grid-based method.
Wave Cluster applies
wavelet transformation for
clustering analysis and is both grid-based
and density-based.
Model-Based Methods:
❖
Model-based
methods hypothesize a model
for each of the clusters and find the best
fit of the data to the given model.
❖
A
model-based algorithm may locate clusters by constructing a density function
that reflects the spatial
distribution of the data points.
❖
It also leads to a way of automatically determining the number of clusters based on standard statistics, taking "noise" or outliers into account and thus yielding robust clustering methods.
Classical Partitioning Methods:
The most well-known and commonly used partitioning methods
are
❖
k-Means Method
❖
k-Medoids Method
Partitioning Clustering: The K-Means Method:
The k-means algorithm takes the input
parameter, k, and partitions a set of n objects into k clusters so that the
resulting intracluster similarity is high but the intercluster similarity is low.
Cluster similarity is measured in
regard to the mean value of the
objects in a cluster, which can be viewed as the
cluster’s centroid or center of gravity.
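The squared-error criterion that k-means typically minimizes (with m_i the mean of cluster C_i) can be written as:

$$
E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2
$$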
Clustering of a set of objects based on the k-means
method
K-means Clustering Numerical
Example with Solution
Now that we have discussed the algorithm,
let us solve a numerical problem on k means clustering. The problem is as
follows. You are given 15 points in the Cartesian coordinate system as follows.
| Point | Coordinates |
|-------|-------------|
| A1 | (2,10) |
| A2 | (2,6) |
| A3 | (11,11) |
| A4 | (6,9) |
| A5 | (6,4) |
| A6 | (1,2) |
| A7 | (5,10) |
| A8 | (4,9) |
| A9 | (10,12) |
| A10 | (7,5) |
| A11 | (9,11) |
| A12 | (4,6) |
| A13 | (3,10) |
| A14 | (3,8) |
| A15 | (6,11) |
Input Dataset
We are also given the information that we need to make 3 clusters, i.e. K = 3. We will solve this numerical on k-means clustering using the approach discussed below.
First, we will randomly choose 3
centroids from the given data. Let us consider A2 (2,6), A7 (5,10), and A15
(6,11) as the centroids of the initial clusters. Hence, we will consider that
- Centroid 1=(2,6) is associated with cluster 1.
- Centroid 2=(5,10) is associated with cluster
2.
- Centroid 3=(6,11) is associated with cluster
3.
Now we will find the Euclidean distance between each point and the centroids. Based on the minimum distance of each point from the centroids, we will assign the points to a cluster. The distances of the given points from the centroids are tabulated in the following table.
| Point | Distance from Centroid 1 (2,6) | Distance from Centroid 2 (5,10) | Distance from Centroid 3 (6,11) | Assigned Cluster |
|-------|-------------------------------|--------------------------------|--------------------------------|------------------|
| A1 (2,10) | 4 | 3 | 4.123106 | Cluster 2 |
| A2 (2,6) | 0 | 5 | 6.403124 | Cluster 1 |
| A3 (11,11) | 10.29563 | 6.082763 | 5 | Cluster 3 |
| A4 (6,9) | 5 | 1.414214 | 2 | Cluster 2 |
| A5 (6,4) | 4.472136 | 6.082763 | 7 | Cluster 1 |
| A6 (1,2) | 4.123106 | 8.944272 | 10.29563 | Cluster 1 |
| A7 (5,10) | 5 | 0 | 1.414214 | Cluster 2 |
| A8 (4,9) | 3.605551 | 1.414214 | 2.828427 | Cluster 2 |
| A9 (10,12) | 10 | 5.385165 | 4.123106 | Cluster 3 |
| A10 (7,5) | 5.09902 | 5.385165 | 6.082763 | Cluster 1 |
| A11 (9,11) | 8.602325 | 4.123106 | 3 | Cluster 3 |
| A12 (4,6) | 2 | 4.123106 | 5.385165 | Cluster 1 |
| A13 (3,10) | 4.123106 | 2 | 3.162278 | Cluster 2 |
| A14 (3,8) | 2.236068 | 2.828427 | 4.242641 | Cluster 1 |
| A15 (6,11) | 6.403124 | 1.414214 | 0 | Cluster 3 |
Results from 1st iteration of K means clustering
At this point, we have completed the
first iteration of the k-means clustering algorithm and assigned each point into
a cluster.
In the above table, you can observe
that the point that is closest to the centroid of a given cluster is assigned
to the cluster.
Now, we will calculate the new
centroid for each cluster.
- In cluster 1, we have 6 points i.e. A2 (2,6),
A5 (6,4), A6 (1,2), A10 (7,5), A12 (4,6), A14 (3,8). To calculate the new
centroid for cluster 1, we will find the mean of the x and y coordinates
of each point in the cluster. Hence, the new centroid for cluster 1 is
(3.833, 5.167).
- In cluster 2, we have 5 points i.e. A1 (2,10),
A4 (6,9), A7 (5,10) , A8 (4,9), and A13 (3,10). Hence, the new centroid
for cluster 2 is (4, 9.6)
- In cluster 3, we have 4 points i.e. A3
(11,11), A9 (10,12), A11 (9,11), and A15 (6,11). Hence, the new centroid
for cluster 3 is (9, 11.25).
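These centroid updates are simply coordinate-wise means; the values above can be checked with a few lines of NumPy:

```python
# Verifying the new centroids as coordinate-wise means of the assigned points.
import numpy as np
cluster1 = np.array([(2, 6), (6, 4), (1, 2), (7, 5), (4, 6), (3, 8)])
cluster2 = np.array([(2, 10), (6, 9), (5, 10), (4, 9), (3, 10)])
cluster3 = np.array([(11, 11), (10, 12), (9, 11), (6, 11)])
print(cluster1.mean(axis=0), cluster2.mean(axis=0), cluster3.mean(axis=0))
# -> approximately (3.833, 5.167), (4.0, 9.6), (9.0, 11.25)
```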
Now that we have calculated new
centroids for each cluster, we will calculate the distance of each data point
from the new centroids. Then, we will assign the points to clusters based on
their distance from the centroids. The results for this process have been given
in the following table.
| Point | Distance from Centroid 1 (3.833, 5.167) | Distance from Centroid 2 (4, 9.6) | Distance from Centroid 3 (9, 11.25) | Assigned Cluster |
|-------|----------------------------------------|----------------------------------|------------------------------------|------------------|
| A1 (2,10) | 5.169 | 2.040 | 7.111 | Cluster 2 |
| A2 (2,6) | 2.013 | 4.118 | 8.750 | Cluster 1 |
| A3 (11,11) | 9.241 | 7.139 | 2.016 | Cluster 3 |
| A4 (6,9) | 4.403 | 2.088 | 3.750 | Cluster 2 |
| A5 (6,4) | 2.461 | 5.946 | 7.846 | Cluster 1 |
| A6 (1,2) | 4.249 | 8.171 | 12.230 | Cluster 1 |
| A7 (5,10) | 4.972 | 1.077 | 4.191 | Cluster 2 |
| A8 (4,9) | 3.837 | 0.600 | 5.483 | Cluster 2 |
| A9 (10,12) | 9.204 | 6.462 | 1.250 | Cluster 3 |
| A10 (7,5) | 3.171 | 5.492 | 6.562 | Cluster 1 |
| A11 (9,11) | 7.792 | 5.192 | 0.250 | Cluster 3 |
| A12 (4,6) | 0.850 | 3.600 | 7.250 | Cluster 1 |
| A13 (3,10) | 4.904 | 1.077 | 6.129 | Cluster 2 |
| A14 (3,8) | 2.953 | 1.887 | 6.824 | Cluster 2 |
| A15 (6,11) | 6.223 | 2.441 | 3.010 | Cluster 2 |
Results from 2nd iteration of K means clustering
Now, we have completed the second
iteration of the k-means clustering algorithm and assigned each point into an
updated cluster. In the above table, you can observe that the point closest to
the new centroid of a given cluster is assigned to the cluster.
Now, we will calculate the new
centroid for each cluster for the third iteration.
- In cluster 1, we have 5 points i.e. A2 (2,6),
A5 (6,4), A6 (1,2), A10 (7,5), and A12 (4,6). To calculate the new
centroid for cluster 1, we will find the mean of the x and y coordinates
of each point in the cluster. Hence, the new centroid for cluster 1 is (4,
4.6).
- In cluster 2, we have 7 points i.e. A1 (2,10),
A4 (6,9), A7 (5,10) , A8 (4,9), A13 (3,10), A14 (3,8), and A15 (6,11).
Hence, the new centroid for cluster 2 is (4.143, 9.571)
- In cluster 3, we have 3 points i.e. A3
(11,11), A9 (10,12), and A11 (9,11). Hence, the new centroid for cluster 3
is (10, 11.333).
At this point, we have calculated new
centroids for each cluster. Now, we will calculate the distance of each data
point from the new centroids. Then, we will assign the points to clusters based
on their distance from the centroids. The results for this process have been
given in the following table.
| Point | Distance from Centroid 1 (4, 4.6) | Distance from Centroid 2 (4.143, 9.571) | Distance from Centroid 3 (10, 11.333) | Assigned Cluster |
|-------|----------------------------------|----------------------------------------|--------------------------------------|------------------|
| A1 (2,10) | 5.758 | 2.186 | 8.110 | Cluster 2 |
| A2 (2,6) | 2.441 | 4.165 | 9.615 | Cluster 1 |
| A3 (11,11) | 9.485 | 7.004 | 1.054 | Cluster 3 |
| A4 (6,9) | 4.833 | 1.943 | 4.631 | Cluster 2 |
| A5 (6,4) | 2.088 | 5.872 | 8.353 | Cluster 1 |
| A6 (1,2) | 3.970 | 8.197 | 12.966 | Cluster 1 |
| A7 (5,10) | 5.492 | 0.958 | 5.175 | Cluster 2 |
| A8 (4,9) | 4.400 | 0.589 | 6.438 | Cluster 2 |
| A9 (10,12) | 9.527 | 6.341 | 0.667 | Cluster 3 |
| A10 (7,5) | 3.027 | 5.390 | 7.008 | Cluster 1 |
| A11 (9,11) | 8.122 | 5.063 | 1.054 | Cluster 3 |
| A12 (4,6) | 1.400 | 3.574 | 8.028 | Cluster 1 |
| A13 (3,10) | 5.492 | 1.221 | 7.126 | Cluster 2 |
| A14 (3,8) | 3.544 | 1.943 | 7.753 | Cluster 2 |
| A15 (6,11) | 6.705 | 2.343 | 4.014 | Cluster 2 |
Results from 3rd iteration of K means clustering
Now, we have completed the third iteration of the k-means clustering algorithm and assigned each point to an updated cluster. In the above table, you can observe that the point that is closest to the new centroid of a given cluster is assigned to that cluster.
Now, we will once again calculate the new centroid for each cluster.
- In cluster 1, we have 5 points i.e. A2 (2,6),
A5 (6,4), A6 (1,2), A10 (7,5), and A12 (4,6). To calculate the new
centroid for cluster 1, we will find the mean of the x and y coordinates
of each point in the cluster. Hence, the new centroid for cluster 1 is (4,
4.6).
- In cluster 2, we have 7 points i.e. A1 (2,10),
A4 (6,9), A7 (5,10) , A8 (4,9), A13 (3,10), A14 (3,8), and A15 (6,11).
Hence, the new centroid for cluster 2 is (4.143, 9.571)
- In cluster 3, we have 3 points i.e. A3
(11,11), A9 (10,12), and A11 (9,11). Hence, the new centroid for cluster 3
is (10, 11.333).
Here, you can observe that no point has changed its cluster compared to the previous iteration. Due to this, the centroids also remain constant. Therefore, we can say that the clusters have stabilized. Hence, the clusters obtained after the third iteration are the final clusters made from the given dataset. If we plot the clusters on a graph, it looks as follows.
Plot for K-Means Clustering
In the above plot, points in the clusters have been plotted using red, blue, and black markers. The centroids of the clusters have been marked using green circles.
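For reference, the whole worked example can be reproduced with a short NumPy implementation of the k-means iterations, starting from the same initial centroids A2, A7, and A15. This is only a minimal sketch of Lloyd's algorithm.

```python
# Reproducing the worked example: k-means with the initial centroids A2, A7, A15.
import numpy as np

points = np.array([(2, 10), (2, 6), (11, 11), (6, 9), (6, 4), (1, 2), (5, 10), (4, 9),
                   (10, 12), (7, 5), (9, 11), (4, 6), (3, 10), (3, 8), (6, 11)], dtype=float)
centroids = np.array([(2, 6), (5, 10), (6, 11)], dtype=float)   # A2, A7, A15

for iteration in range(10):
    # Assign each point to its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Recompute centroids as the mean of the assigned points
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(3)])
    if np.allclose(new_centroids, centroids):                    # converged
        break
    centroids = new_centroids

print("final centroids:\n", centroids)   # ~ (4, 4.6), (4.143, 9.571), (10, 11.333)
print("assignments:", labels)
```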
The k-Medoids Method:
The k-Medoids Algorithm:
The k-medoids algorithm partitions the data based on medoids, or central objects. The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method.
The K-Medoids algorithm (also called Partitioning Around Medoids, PAM) was proposed in 1987 by Kaufman and Rousseeuw. A medoid can be defined as the point in a cluster whose total dissimilarity to all the other points in the cluster is minimum.
The dissimilarity between a medoid Ci and an object Pi is calculated as
E = |Pi – Ci|
The cost in the K-Medoids algorithm is the sum of these dissimilarities over all points and their assigned medoids:
Cost = Σ (over all clusters Ci) Σ (over all points Pi assigned to Ci) |Pi – Ci|
Algorithm:
1. Initialize: select k of the n data points at random as the medoids.
2. Associate each data point with the closest medoid, using any common distance metric.
3. While the cost decreases, for each medoid m and for each non-medoid data point o:
· Swap m and o, associate each data point with the closest medoid, and recompute the cost.
· If the total cost is more than that in the previous step, undo the swap.
K-Medoids clustering with solved example
Let’s consider the following example with the ten data points tabulated below. If a graph is drawn using these data points, we obtain a scatter plot of the dataset.
Step 1: Select k = 2, and let the two randomly chosen medoids be C1 = (4, 5) and C2 = (8, 5).
Step 2: Calculate the cost. The dissimilarity of each non-medoid point with the medoids is calculated and tabulated:
| Point | X | Y | Dissimilarity from C1 | Dissimilarity from C2 | Cluster Assigned |
|-------|---|---|----------------------|----------------------|------------------|
| 0 | 8 | 7 | 6 | 2 | C2 |
| 1 | 3 | 7 | 3 | 7 | C1 |
| 2 | 4 | 9 | 4 | 8 | C1 |
| 3 | 9 | 6 | 6 | 2 | C2 |
| 4 | 8 | 5 | - | - | (medoid C2) |
| 5 | 5 | 8 | 4 | 6 | C1 |
| 6 | 7 | 3 | 5 | 3 | C2 |
| 7 | 8 | 4 | 5 | 1 | C2 |
| 8 | 4 | 5 | - | - | (medoid C1) |
| 9 | 7 | 5 | 3 | 1 | C2 |
Here the Manhattan distance formula is used to calculate the distances between the medoids and the non-medoid points. That formula is
Distance = |X1 - X2| + |Y1 - Y2|.
Each point is assigned to the cluster of the medoid with the smaller dissimilarity. So points 1, 2, and 5 go to cluster C1, and points 0, 3, 6, 7, and 9 go to cluster C2.
The Cost = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
Step 3: Randomly select one non-medoid point and recalculate the cost. Let the randomly selected point be (8, 4). The dissimilarity of each non-medoid point with the medoids C1 (4, 5) and C2 (8, 4) is calculated and tabulated.
Each point is assigned to the cluster of the medoid with the smaller dissimilarity. So, points 1, 2, and 5 go to cluster C1, and points 0, 3, 4, 6, and 9 go to cluster C2.
The New cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22
Swap Cost = New Cost – Previous Cost = 22 – 20 = 2, and 2 > 0.
As the swap cost is not less than zero, we undo the swap.
Hence (4, 5) and (8, 5) remain the final medoids. The clustering would be as shown in the following figure.
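The cost figures in this example can be double-checked with a small helper function using the same Manhattan distance; this is a verification sketch only, not a full PAM implementation.

```python
# Verifying the k-medoids costs from the example using Manhattan distance.
import numpy as np

points = np.array([(8, 7), (3, 7), (4, 9), (9, 6), (8, 5),
                   (5, 8), (7, 3), (8, 4), (4, 5), (7, 5)])

def total_cost(medoids):
    """Sum of Manhattan distances from every point to its nearest medoid."""
    d = np.abs(points[:, None, :] - np.array(medoids)[None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

print(total_cost([(4, 5), (8, 5)]))   # 20  (original medoids)
print(total_cost([(4, 5), (8, 4)]))   # 22  (after swapping in (8, 4))
```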
Measures
of distance:
There are several types of
distance measures, each with its strengths and weaknesses. Here are some of the
most commonly used distance measures in clustering:
1. Euclidean Distance
The Euclidean distance is the most widely used distance measure in clustering. It calculates the straight-line distance between two points in n-dimensional space. The formula for Euclidean distance is:
d(p, q) = √( (p1 − q1)² + (p2 − q2)² + … + (pn − qn)² )
where,
· p and q are two data points,
· and n is the number of dimensions.
2. Manhattan Distance
The Manhattan distance, sometimes referred to as the L1 distance or city block distance, is the total of the absolute differences between the Cartesian coordinates of two points. Suppose we have two points P and Q; to determine the distance between them, we simply sum the absolute differences of their coordinates along each axis. In a plane with P at coordinate (x1, y1) and Q at (x2, y2), the Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|.
In general, for n dimensions the formula is:
d(P, Q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|
In a plot, the total length of the red line (the path along the grid directions) gives the Manhattan distance between the two points.
3. Cosine Similarity
Rather than concentrating on the exact distance between data points, the cosine similarity measure looks at their orientation. It calculates the cosine of the angle between two data points, with a higher cosine value indicating greater similarity. This measure is often used for text data analysis, where the order of features (words in a sentence) might not be as crucial as their presence. It is used to determine how similar the vectors are, irrespective of their magnitude.
The cosine measure for clustering determines the cosine of the angle between two vectors, given by the following formula:
cos(θ) = (A · B) / (‖A‖ ‖B‖)
Here θ (theta) gives the angle between the two vectors, and A, B are n-dimensional vectors.
4. Minkowski Distance
Minkowski distance is a generalized form of both the Euclidean and Manhattan distances, controlled by a parameter p. In an N-dimensional space, a point is represented as (x1, x2, ..., xN).
Consider two points P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
The Minkowski distance between them is:
D(P1, P2) = ( |X1 − Y1|^p + |X2 − Y2|^p + … + |XN − YN|^p )^(1/p)
With p = 1 this reduces to the Manhattan distance, and with p = 2 to the Euclidean distance.
5. Jaccard Index
The Jaccard index measures the similarity of two sets of items as the size of their intersection divided by the size of their union; the Jaccard distance is 1 minus this index:
J(A, B) = |A ∩ B| / |A ∪ B|, Jaccard distance = 1 − J(A, B)
This measure is ideal for binary data, where features can only take values of 0 or 1.
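All of the measures above are available in SciPy; the short sketch below evaluates each of them on two small illustrative vectors.

```python
# Illustrative comparison of the distance measures above on small vectors.
from scipy.spatial import distance

p, q = [2, 10], [5, 10]
print("Euclidean:", distance.euclidean(p, q))          # sqrt((2-5)^2 + (10-10)^2) = 3.0
print("Manhattan:", distance.cityblock(p, q))          # |2-5| + |10-10| = 3
print("Minkowski (p=3):", distance.minkowski(p, q, 3)) # generalized form
print("Cosine distance:", distance.cosine(p, q))       # 1 - cosine similarity

a, b = [1, 0, 1, 1], [1, 1, 0, 1]                      # binary vectors for Jaccard
print("Jaccard distance:", distance.jaccard(a, b))     # 1 - |A ∩ B| / |A ∪ B|
```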
Agglomerative hierarchical clustering:
1. Agglomerative
Algorithm: Single Link
Single-nearest distance or single linkage is the
agglomerative method that uses the distance between the closest members of the
two clusters. We will now solve a problem to understand it
better:
Question. Find the
clusters using a single link technique. Use
Euclidean distance and draw the dendrogram.
Solution:
Step 1: Compute the distance matrix. We have to find the Euclidean distance between every pair of points; say we first find the Euclidean distance between P1 and P2.
The DISTANCE MATRIX is then built from all of these pairwise distances. Similarly, we find the Euclidean distance for every other pair of points. One point to note is that the distance matrix is symmetric about its diagonal: the distance above and below the diagonal is the same. For example, d(P2, P5) is equivalent to d(P5, P2). So we only need to fill in the lower-triangular section of the matrix.
2. Agglomerative Algorithm: Complete Link
In this algorithm, complete (farthest-distance) linkage is the agglomerative method that uses the distance between the members of the two clusters that are farthest apart.
Example. For the given set of points, identify the clusters using complete link agglomerative clustering.
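Since the example point coordinates are not reproduced here, the sketch below uses a small hypothetical set of 2-D points to show how single-link and complete-link agglomerative clustering can be computed with SciPy.

```python
# Single-link vs complete-link agglomerative clustering (hypothetical points).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

points = np.array([(0.40, 0.53), (0.22, 0.38), (0.35, 0.32),
                   (0.26, 0.19), (0.08, 0.41), (0.45, 0.30)])

print(squareform(pdist(points)).round(2))          # symmetric Euclidean distance matrix

single_link = linkage(points, method="single")     # merge by closest members
complete_link = linkage(points, method="complete") # merge by farthest members

# Cut each dendrogram into 2 clusters and compare the assignments
print("single  :", fcluster(single_link, t=2, criterion="maxclust"))
print("complete:", fcluster(complete_link, t=2, criterion="maxclust"))
# scipy.cluster.hierarchy.dendrogram(single_link) can be used to draw the tree
```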
Divisive hierarchical clustering:
This top-down strategy does the
reverse of agglomerative hierarchical clustering by starting with all objects in
one cluster. It subdivides the cluster into smaller and
smaller pieces, until each object
forms a cluster on its own or until it satisfies certain termination
conditions, such as a desired number of clusters is obtained or the diameter of each cluster is within a certain threshold.
It repeatedly executes the following steps:
1. Start with all of the objects in a single cluster.
2. At each step, split one cluster (for example, the least cohesive one) into two smaller clusters.
These steps are repeated until every object is in its own cluster or a termination condition is reached.
Divisive Hierarchical clustering is precisely the opposite of Agglomerative Hierarchical clustering. In Divisive Hierarchical clustering, we start by treating all of the data points as a single cluster, and in every iteration we separate out the data points that are not similar to the rest of their cluster. In the end, we are left with N clusters (one per data point).
Divisive
Hierarchical clustering
Gaussian Mixture Models (GMM) Clustering:
Take the example of analysing customer buying patterns at a shopping mall. Some shoppers buy luxury goods, some prefer budget-friendly options, and others fall somewhere in between. Traditional clustering methods like K-Means force these customers into rigid groups. But what if a customer is partly a budget shopper and partly a luxury buyer? This is where Gaussian Mixture Models (GMM) shine, allowing soft clustering based on probabilities rather than hard assignments.
At its core, a
GMM is a combination of several Gaussian components. These components are defined by their mean, covariance
matrices, and weights, providing a comprehensive representation of data
distributions. The probability density function of a GMM is a sum of its
components, each weighted accordingly. It assumes that data is generated from a mixture of
multiple Gaussian (normal) distributions.
Notation:
·
K: Number of Gaussian components
·
N: Number of data points
·
D: Dimensionality of the data
GMM Parameters:
·
Means (μ): Center locations of Gaussian components.
·
Covariance Matrices (Σ): Define the shape and spread of each component.
·
Weights (π): Probability of selecting each component.
GMM is trained using the Expectation-Maximization
(EM) algorithm, which iteratively improves cluster assignments.
Step 1: Initialization — Randomly assign Gaussian components.
Step 2: Expectation Step (E-Step) — Compute the probability that
each point belongs to a Gaussian.
Step 3: Maximization Step (M-Step) — Update Gaussian parameters (μ,
Σ, π) to best fit the data.
Step 4: Repeat until Convergence.
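A brief sketch of GMM soft clustering with scikit-learn is shown below; the data is synthetic and the parameter choices are illustrative. The EM iterations described above run inside fit(), and predict_proba() returns the per-component membership probabilities that make the assignments "soft".

```python
# GMM soft clustering sketch on synthetic data (EM runs inside .fit()).
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.5, random_state=7)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7).fit(X)

print("means:\n", gmm.means_)            # mu_k for each component
print("weights:", gmm.weights_)          # pi_k, sum to 1
print("soft assignment of first point:", gmm.predict_proba(X[:1]).round(3))
print("hard labels (argmax):", gmm.predict(X[:5]))
```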
Mathematical
Representation of GMM:
· The likelihood of a data point under a GMM is the
sum of the probabilities of that point belonging to each of the Gaussian
components.
· The log-likelihood of the entire
dataset is the sum of the logarithms of these probabilities for each data
point.
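In symbols (standard GMM notation, consistent with the parameters listed above), the mixture density of a data point and the log-likelihood of the dataset are:

$$
p(x_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k),
\qquad
\log L = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
$$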