
GMR Institute of Technology                                        GMRIT/ADM/F-44
Rajam, AP                                                          REV.: 00
(An Autonomous Institution Affiliated to JNTUGV, AP)

Cohesive Teaching – Learning Practices (CTLP)

Class: 4th Sem. – B.Tech                 Department: CSE-AI&ML

Course: Fundamentals of Machine Learning         Course Code: 21ML405
Prepared by: Dr. S. Akila Agnes, Ms. Manisha Das
Lecture Topic: Clustering algorithms (k-Means, Agglomerative/Divisive, DBSCAN and Self-Organizing Maps) and Evaluation Metrics, Data Science Tools
Course Outcome(s): CO6                   Program Outcome(s): PO1, PO2, PSO1, PSO2
Duration: 50 Min         Lecture: 42-45          Unit: IV
Pre-requisite(s): Fundamentals of Python

1. Objective

• Understand different clustering techniques.
• Become familiar with commonly used data science tools.

2. Intended Learning Outcomes (ILOs)

At the end of this session the students will be able to:

1. Summarize different types of clustering algorithms.


2. Understand various data science tools.

3. 2D Mapping of ILOs with Knowledge Dimension and Cognitive Learning Levels of RBT

                          Cognitive Learning Levels
Knowledge Dimension | Remember | Understand | Apply | Analyze | Evaluate | Create
Factual             |    ✓     |     ✓      |       |         |          |
Conceptual          |    ✓     |     ✓      |       |         |          |
Procedural          |          |            |       |         |          |
Meta Cognitive      |          |            |       |         |          |

4. Teaching Methodology
• PowerPoint presentation, chalk and talk, visual presentation

5. Evocation
6. Deliverables

Lecture Notes-42:

Agglomerative Clustering:
Hierarchical clustering is a connectivity-based clustering model that groups together data points
that are close to each other, based on a measure of similarity or distance. The assumption is that
data points that are close to each other are more similar or related than data points that are
farther apart.

A dendrogram, the tree-like figure produced by hierarchical clustering, depicts the hierarchical
relationships between groups. Individual data points are located at the bottom of the dendrogram,
while the largest cluster, which includes all the data points, is located at the top. The dendrogram
can be sliced at various heights to generate different numbers of clusters.
The dendrogram is created by iteratively merging or splitting clusters based on a measure of
similarity or distance between data points. Clusters are divided or merged repeatedly until all data
points are contained within a single cluster, or until the predetermined number of clusters is
attained.
To estimate a suitable number of clusters, we can inspect the dendrogram and find the height at
which its branches separate into distinct clusters; slicing the dendrogram at this height gives the
number of clusters.

Types of Hierarchical Clustering

Basically, there are two types of hierarchical Clustering:


1. Agglomerative Clustering
2. Divisive clustering
Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative clustering (HAC). It
produces a structure that is more informative than the unstructured set of clusters returned by
flat clustering, and it does not require us to pre-specify the number of clusters. Bottom-up
algorithms treat each data point as a singleton cluster at the outset and then successively merge
pairs of clusters until all clusters have been merged into a single cluster that contains all the data.
Algorithm:
given a dataset (d1, d2, d3, ..., dN) of size N
# compute the distance matrix
for i = 1 to N:
    # the distance matrix is symmetric about the primary
    # diagonal, so we compute only its lower triangle
    for j = 1 to i:
        dis_mat[i][j] = distance(di, dj)
treat each data point as a singleton cluster
repeat
    merge the two clusters having the minimum distance
    update the distance matrix
until only a single cluster remains
Hierarchical Agglomerative Clustering

Steps:
• Consider each alphabet (data point) as a single cluster and calculate the distance of each
  cluster from all the other clusters.
• In the second step, comparable clusters are merged into a single cluster. Say cluster (B)
  and cluster (C) are very similar to each other; we merge them, and likewise clusters (D)
  and (E). This leaves the clusters [(A), (BC), (DE), (F)].
• We recalculate the proximity according to the algorithm and merge the two nearest
  clusters ([(DE), (F)]) to obtain the new clusters [(A), (BC), (DEF)].
• Repeating the same process, the clusters (DEF) and (BC) are comparable and are merged,
  leaving the clusters [(A), (BCDEF)].
• At last, the two remaining clusters are merged into a single cluster [(ABCDEF)]. A
  runnable sketch of this procedure follows this list.
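
The same bottom-up procedure can be reproduced with standard libraries. Below is a minimal
sketch using SciPy's scipy.cluster.hierarchy module; the six 2-D points standing in for A–F are
made-up values, assumed only for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# toy 2-D points standing in for A..F (illustrative values only)
points = np.array([[0.0, 0.0], [1.0, 0.1], [1.1, 0.2],
                   [5.0, 5.0], [5.1, 5.2], [9.0, 1.0]])
labels = list("ABCDEF")

# 'single' linkage merges the pair of clusters with the minimum
# distance between their closest members, as in the algorithm above
Z = linkage(points, method="single", metric="euclidean")

# cut the merge tree to obtain, say, 3 clusters
assignments = fcluster(Z, t=3, criterion="maxclust")
for label, c in zip(labels, assignments):
    print(label, "-> cluster", c)

# dendrogram(Z, labels=labels)  # draws the merge tree with matplotlib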

Lecture Notes-43:

Hierarchical Divisive clustering

It is also known as the top-down approach. This algorithm, too, does not require us to
pre-specify the number of clusters. Top-down clustering starts with one cluster containing the
whole data, requires a method for splitting a cluster, and proceeds by splitting clusters
recursively until each data point has been placed in its own singleton cluster.
Algorithm:
given a dataset (d1, d2, d3, ..., dN) of size N
start with all data in one cluster at the top
split the cluster using a flat clustering method, e.g. k-Means
repeat
    choose the best cluster among all the clusters to split
    split that cluster with the flat clustering algorithm
until each data point is in its own singleton cluster
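
As a concrete illustration, the sketch below implements one common top-down variant,
bisecting k-Means, which repeatedly splits a cluster in two with flat k-Means. The policy of
always splitting the largest remaining cluster is an assumption; other criteria (e.g. highest
within-cluster error) are equally valid:

import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, n_clusters):
    # start with a single cluster holding every point (indices into X)
    clusters = [np.arange(len(X))]
    while len(clusters) < n_clusters:
        # choose the largest cluster to split (one possible policy)
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        # split it in two with flat k-Means
        halves = KMeans(n_clusters=2, n_init=10).fit_predict(X[members])
        clusters.append(members[halves == 0])
        clusters.append(members[halves == 1])
    return clusters

X = np.random.rand(20, 2)  # toy data
for i, idxs in enumerate(divisive_clustering(X, 4)):
    print("cluster", i, ":", idxs)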


Computing Distance Matrix

While merging two clusters, we check the distance between every pair of clusters and merge
the pair with the least distance (most similarity). But how is that distance determined? There
are different ways of defining inter-cluster distance/similarity. Some of them are:
1. Min distance (single linkage): the minimum distance between any two points, one from
   each cluster.
2. Max distance (complete linkage): the maximum distance between any two points, one
   from each cluster.
3. Group average: the average distance between every pair of points across the two clusters.
4. Ward's method: the similarity of two clusters is based on the increase in squared error
   when the two clusters are merged.
Grouping the same data with different methods can therefore produce different results, as the
short comparison below illustrates.
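
A quick way to see this is to vary the linkage argument of scikit-learn's
AgglomerativeClustering on the same (toy, randomly generated) data:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(30, 2)  # toy data
for link in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(X)
    print(link, "->", labels)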
Lecture Notes-44:
Self Organizing Maps:
A Self Organizing Map (or Kohonen Map, or SOM) is a type of Artificial Neural Network,
inspired by biological models of neural systems from the 1970s. It follows an unsupervised
learning approach and trains its network through a competitive learning algorithm. SOM is
used for clustering and for mapping (dimensionality reduction): it maps multidimensional
data onto a lower-dimensional space, reducing complex problems to a form that is easier to
interpret. A SOM has two layers: an Input layer and an Output layer.

The architecture of a Self Organizing Map with two clusters and n input features per sample
is given below:

How does SOM work?

Consider input data of size (m, n), where m is the number of training examples and n is the
number of features in each example. First, the network initializes weights of size (n, C), where
C is the number of clusters. Then, iterating over the input data, for each training example it
updates the winning vector (the weight vector with the shortest distance, e.g. Euclidean
distance, from the training example). The weight update rule is given by:

wij(new) = wij(old) + α(t) * (xik − wij(old))

where α is the learning rate at time t, j denotes the winning vector, i denotes the ith feature of
the training example, and k denotes the kth training example in the input data. After training
the SOM network, the trained weights are used for clustering new examples: a new example
falls in the cluster of its winning vector.

Algorithm
Training:
Step 1: Initialize the weights wij; small random values may be assumed. Initialize the learning rate α.

Step 2: For each output unit j, calculate the squared Euclidean distance:

D(j) = Σ (wij − xi)^2, where i = 1 to n and j = 1 to m

Step 3: Find the index J for which D(j) is minimum; J is the winning index.

Step 4: For each unit j within a specific neighborhood of J, and for all i, calculate the new weight:

wij(new) = wij(old) + α[xi − wij(old)]

Step 5: Update the learning rate using:

α(t+1) = 0.5 * α(t)

Step 6: Test the stopping condition.
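
A minimal sketch of this training loop is given below. It assumes a flat SOM with C output
units, shrinks the neighborhood to the winner alone (a winner-take-all simplification of
Step 4), and uses the halving learning-rate schedule from Step 5:

import numpy as np

def train_som(X, C, epochs=10, alpha=0.5):
    m, n = X.shape                        # m examples, n features
    rng = np.random.default_rng(0)
    W = rng.random((C, n))                # one weight vector per output unit
    for t in range(epochs):
        for x in X:
            # Steps 2-3: squared Euclidean distance to each unit, pick winner J
            d = ((W - x) ** 2).sum(axis=1)
            J = int(np.argmin(d))
            # Step 4: move the winner's weights toward the example
            W[J] += alpha * (x - W[J])
        # Step 5: decay the learning rate once per epoch
        alpha *= 0.5
    return W

weights = train_som(np.random.rand(100, 4), C=3)
print(weights)  # trained weights; a new example joins its winner's cluster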

Lecture Notes-45:
Data Science Tools:
Data science tools are application software or frameworks that help data science
professionals perform various data science tasks such as analysis, cleansing, visualization,
mining, reporting, and filtering of data. Each tool supports some subset of these tasks. Let us
now look at what these tools are and how they help data scientists and professionals.

General-purpose tools
1. MS Excel:
It is the most fundamental and essential tool that everyone should know. For freshers, this tool
helps in easy analysis and understanding of data. MS Excel comes as part of the MS Office
suite. Freshers and even seasoned professionals can get a basic idea of what the data says
before getting into high-end analytics. It helps in quickly understanding the data, comes with
built-in formulae, and provides various types of data visualization elements such as charts and
graphs. Through MS Excel, data science professionals can represent data simply in rows and
columns, a representation that even a non-technical user can understand.

Cloud-based tools
2. BigML:
BigML is an online, cloud-based, event-driven tool that helps in data science and machine
learning operations. This GUI-based tool allows beginners with little or no previous
experience to create models through drag-and-drop features. For professionals and
companies, BigML can help blend data science and machine learning projects into various
business operations and processes. Many companies use BigML for risk assessment, threat
analysis, weather forecasting, etc. It uses REST APIs to produce user-friendly web interfaces,
and users can also leverage it to generate interactive visualizations of their data. It also comes
with many automation techniques that enable users to eliminate manual data workflows.

3. Google Analytics:
Google Analytics (GA) is a professional data science tool and framework that gives an in-depth
look at the performance of an enterprise website or app, yielding data-driven insights. Data
science professionals work across many industries, one of which is digital marketing; with this
tool, a web admin can easily access, visualize, and analyze website traffic and related data. It
helps businesses understand how customers or end-users interact with a website. It works in
close tandem with other products such as Search Console, Google Ads, and Data Studio, which
makes it a widespread option for anyone leveraging different Google products. Through Google
Analytics, data scientists and marketing leaders can make better marketing decisions, and even
a non-technical data science professional can use its high-end functionality and easy-to-use
interface to perform data analytics.

Multipurpose Data science Tools


4. Apache Spark:
Apache Spark is a well-known data science tool, framework, and library with a robust analytics
engine that provides both stream processing and batch processing. It can analyze data in real
time and perform cluster management, and it is much faster than other analytics workload
tools such as Hadoop. Apart from data analysis, it also helps in machine learning projects: it
offers various built-in machine learning APIs that allow machine learning engineers and data
scientists to create predictive models. In addition, Apache Spark provides APIs that Python,
Java, R, and Scala programmers can leverage in their programs.

5. Matlab:
Matlab is a closed-source, high-performance, multi-paradigm numerical computing and
simulation tool for processing mathematical and data-driven tasks. With it, researchers and
data scientists can perform matrix operations, analyze algorithmic performance, and carry out
statistical modeling of data. The tool combines visualization, mathematical computation,
statistical analysis, and programming in one easy-to-use environment. Data scientists find
many applications for Matlab, especially in signal and image processing, neural network
simulation, and testing of different data science models.

6. SAS:
SAS is a popular data science tool designed by the SAS Institute for advanced analytics,
multivariate analysis, business intelligence (BI), data management operations, and predictive
analytics. This closed-source software provides a wide range of data science functionality
through its graphical interface, its SAS programming language, and Base SAS. Many MNCs
and Fortune 500 companies use it for statistical modeling and data analysis. It allows easy
access to data from database files, online databases, SAS tables, and Microsoft Excel tables,
and it can manipulate existing data sets to derive data-driven insights through its statistical
libraries and tools.
7. KNIME:
KNIME is another widely used open-source, free data science tool that helps in data reporting,
data analysis, and data mining. With it, data science professionals can quickly extract and
transform data. Leveraging its modular data pipelining concept, it integrates various data
analysis and data-related components for machine learning (ML) and data mining. It offers an
excellent graphical interface through which data science professionals can define workflows
between the various predefined nodes provided in its repository; because of this, minimal
programming expertise is needed to carry out data-driven analysis and operations. Its visual
data pipelines help render interactive visuals for a given dataset.

8. Apache Flink:
Flink is another Apache data science tool that helps perform real-time data analysis. It is one
of the most popular open-source data science tools and frameworks, and it uses a distributed
stream-processing engine to perform various data science operations. Data scientists and
professionals often need to perform real-time analysis and computation on data such as users'
web activity, measurements emitted by Internet of Things (IoT) devices, location-tracking
feeds, and financial transactions from apps or services. This is where Flink can deliver both
parallel and pipelined execution of data flows at low latency. The same engine handles
enormous data streams (unbounded data, with no fixed start and end point) as well as stored
datasets (bounded data). Apache Flink has a reputation for high-speed processing and analysis
while reducing the complexity of real-time data processing.

Programming Language-driven Tools


9. Python:
Python is, by far, the most widely used data science programming language. Also considered a
data science tool, Python helps data science professionals perform data analysis over large
datasets and over data of different kinds (structured, semi-structured, and unstructured). This
high-level, general-purpose, dynamic, interpreted programming language has built-in data
structures and a massive collection of libraries that help in data analysis, data cleaning, data
visualization, etc. It has a simple syntax, is easy to learn, and reduces the cost of maintaining
data science programs. Since the language supports developing mobile, desktop, and web
applications alongside data science work, many prefer to learn it to gain both data science and
software development capabilities. It has outstanding community support, and contributors
keep developing modules and libraries that make data science and programming tasks easier.

10. R Programming:
R is a robust programming language that competes with Python when it comes to data science.
Professionals and companies widely use it for statistical computing and data analysis. It has an
excellent user interface, which is regularly updated for a better programming and data analysis
experience. Exceptional contributor and community support make it a valuable tool for data
science. It scales well because it has a huge collection of data science packages and libraries
such as tidyr, dplyr, readr, SparkR, data.table, ggplot2, etc. Apart from statistical and data
science operations, R also offers powerful machine learning algorithms in a simple and fast
manner. This open-source programming language comes with 7,800 packages and
object-oriented features, and RStudio is its most widely used development environment.

11. Jupyter Notebook:


This computational notebook is a popular data science web application that helps users manage
and interact with data effectively. Apart from data science professionals, researchers,
mathematicians, and even beginners in Python leverage this tool. It is popular mostly for its
easy data visualization features and computational abilities. Data science professionals and
analysts can run a single line or multiple lines of code at a time. It is a spin-off of the IPython
project and supports programming languages such as Julia, Python, and R.

12. MongoDB:
MongoDB is a cross-platform, open-source, document-oriented NoSQL database management
system that allows data science professionals to manage semi-structured and unstructured
data. It acts as an alternative to a traditional relational database management system, in which
all data must be structured. MongoDB helps data science professionals manage
document-oriented data, storing and retrieving information as and when required. It easily
handles large volumes of data and offers many of SQL's querying capabilities and more,
including support for dynamic queries. MongoDB stores data as documents in a JSON-like
format and delivers high-level data replication capabilities. Handling Big Data has become
easier with MongoDB because it increases data availability. Beyond basic database queries,
MongoDB can also execute advanced analytics, and its data scalability makes it one of the most
widely used data science tools.

7. Keywords
• Clustering
• Agglomerative
• DBSCAN

8. Sample Questions

Remember:
1. List any four data science tools.
2. What is DBSCAN?
Understand:
1. Explain Self-Organizing Maps with an example.
2. Explain the Divisive algorithm with an example.
Apply:
1. Apply agglomerative single-link clustering to the following distance matrix and draw the
   dendrogram.

        A    B    C    D    E    F
   A    0
   B    5    0
   C   14    9    0
   D   11   20   13    0
   E   18   15    6    3    0
   F   10   16    8   10   11    0
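
For checking an exercise like this, single-link clustering can be run directly on a distance
matrix with SciPy. The sketch below uses the matrix above, mirrored into its full symmetric
form:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

labels = list("ABCDEF")
D = np.array([[ 0,  5, 14, 11, 18, 10],
              [ 5,  0,  9, 20, 15, 16],
              [14,  9,  0, 13,  6,  8],
              [11, 20, 13,  0,  3, 10],
              [18, 15,  6,  3,  0, 11],
              [10, 16,  8, 10, 11,  0]], dtype=float)

# squareform converts the square matrix to the condensed form linkage expects
Z = linkage(squareform(D), method="single")
print(Z)  # each row: [cluster_i, cluster_j, merge_distance, new_cluster_size]
# dendrogram(Z, labels=labels)  # draws the dendrogram with matplotlib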

9. Stimulating Question (s)


1. What is the need for clustering?
10. Mind Map
11. Student Summary

At the end of this session, the facilitator (teacher) shall randomly pick a few students to
summarize the deliverables.

12. Reading Materials


1. Stephen Marsland, "Machine Learning: An Algorithmic Perspective", CRC Press, 2009.
2. Tom M. Mitchell, "Machine Learning", Tata McGraw Hill, 1997.

13. Scope for Mini Project

NIL

---------------
