Rajam, AP REV.: 00
(An Autonomous Institution Affiliated to JNTUGV, AP)
1. Objective
3. 2D Mapping of ILOs with Knowledge Dimension and Cognitive Learning Levels of RBT
4. Teaching Methodology
PowerPoint presentation, chalk and talk, visual presentation
5. Evocation
6. Deliverables
Lecture Notes-42:
Agglomerative Clustering:
Hierarchical clustering is a connectivity-based clustering model that groups together data points
that are close to each other, based on a measure of similarity or distance. The assumption is that
data points close to each other are more similar or related than data points that are farther
apart.
Steps:
1. Treat each point (here, each letter A-F) as a single cluster and calculate the distance from
each cluster to all the other clusters.
2. Merge the closest comparable clusters into a single cluster. Say cluster (B) and cluster (C)
are very similar to each other, so we merge them; likewise (D) and (E). This leaves the
clusters [(A), (BC), (DE), (F)].
3. Recalculate the proximities according to the linkage criterion and merge the two nearest
clusters ((DE) and (F)) to form the new clusters [(A), (BC), (DEF)].
4. Repeat the same process: clusters (BC) and (DEF) are now the closest pair and are merged,
leaving the clusters [(A), (BCDEF)].
5. Finally, the two remaining clusters are merged into a single cluster [(ABCDEF)].
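The steps above can be sketched in a few lines of Python. The pairwise distances used here are illustrative values chosen so that the merge order mirrors the A..F walk-through; they are not taken from the notes.

```python
# From-scratch sketch of single-link agglomerative clustering.
# Single link: distance between clusters = minimum pairwise point distance.

def single_link_merges(dist, labels):
    """Repeatedly merge the two closest clusters; return the list of
    cluster states after each merge."""
    clusters = [frozenset(l) for l in labels]
    history = []
    while len(clusters) > 1:
        pairs = [(a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]]
        # pick the pair with the smallest single-link distance
        a, b = min(pairs, key=lambda p: min(dist[x, y] for x in p[0] for y in p[1]))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append(sorted("".join(sorted(c)) for c in clusters))
    return history

# Illustrative symmetric distances between the points A..F (assumed values).
raw = {("B", "C"): 1, ("D", "E"): 2, ("E", "F"): 3, ("D", "F"): 3.5,
       ("C", "D"): 4, ("C", "E"): 4.5, ("C", "F"): 4.5,
       ("B", "D"): 5, ("B", "E"): 5.5, ("B", "F"): 5.5,
       ("A", "B"): 6, ("A", "C"): 7, ("A", "D"): 8, ("A", "E"): 9, ("A", "F"): 10}
dist = {**raw, **{(y, x): d for (x, y), d in raw.items()}}

history = single_link_merges(dist, "ABCDEF")
# history passes through [(A),(BC),(DE),(F)], [(A),(BC),(DEF)],
# [(A),(BCDEF)], and finally [(ABCDEF)], matching the walk-through.
```

With these distances the merge sequence reproduces the narrative exactly, which makes the sketch a convenient way to trace the algorithm by hand.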
Lecture Notes-43:
Self Organizing Maps (SOM):
The architecture of the Self Organizing Map with two cluster units and n input features of any
sample is given below:
Algorithm
Training:
Step 1: Initialize the weights wij (small random values may be assumed). Initialize the learning rate α.
Step 2: For each input vector x, calculate the squared Euclidean distance to every cluster unit j:
D(j) = Σi (xi − wij)².
Step 3: Find the index J for which D(J) is minimum; unit J is the winning unit.
Step 4: For each unit j within a specified neighborhood of J, and for all i, calculate the new weight:
wij(new) = wij(old) + α[xi − wij(old)].
Step 5: Reduce the learning rate, for example:
α(t+1) = 0.5 * α(t)
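The training steps above can be sketched with NumPy. This is a minimal sketch assuming 2 cluster units, 4 input features, and a tiny made-up dataset; a neighborhood of size zero (winner only) is used for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2                      # n input features, m cluster units
W = rng.random((n, m))           # Step 1: random initial weights w_ij
alpha = 0.5                      # Step 1: initial learning rate

# Illustrative input vectors (assumed data, not from the notes).
X = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)

for epoch in range(10):
    for x in X:
        D = ((W - x[:, None]) ** 2).sum(axis=0)  # Step 2: D(j) = sum_i (x_i - w_ij)^2
        J = int(np.argmin(D))                    # Step 3: winning unit J
        W[:, J] += alpha * (x - W[:, J])         # Step 4: move winner toward x
    alpha *= 0.5                                 # Step 5: alpha(t+1) = 0.5 * alpha(t)
```

After training, each column of W has drifted toward the inputs its unit wins, which is the clustering behaviour the lecture describes.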
Lecture Notes-45:
Data Science Tools:
Data science tools are application software or frameworks that help data science
professionals perform various tasks such as analysis, cleansing, visualization,
mining, reporting, and filtering of data. Each tool supports a subset of these tasks.
Let us now look at what these tools are and how they help data scientists and other
professionals.
General-purpose tools
1. MS Excel:
It is the most fundamental and essential tool that everyone should know. For freshers, this tool
helps in the easy analysis and understanding of data. MS Excel comes as part of the MS Office
suite. Freshers and even seasoned professionals can get a basic idea of what the data says
before getting into high-end analytics. It helps in quickly understanding data, comes with
built-in formulae, and provides various data visualization elements such as charts and graphs.
Through MS Excel, data science professionals can represent data simply through rows and
columns, a representation that even a non-technical user can understand.
Cloud-based tools
2. BigML:
BigML is an online, cloud-based, event-driven tool that helps with data science and machine
learning operations. This GUI-based tool lets beginners with little or no prior experience
create models through drag-and-drop features. For professionals and companies, BigML can
help blend data science and machine learning projects into various business operations and
processes. Many companies use BigML for risk assessment, threat analysis, weather
forecasting, etc. It uses REST APIs to produce user-friendly web interfaces, and users can
also leverage it to generate interactive visualizations over data. It also comes with many
automation features that enable users to eliminate manual data workflows.
3. Google Analytics:
Google Analytics (GA) is a professional data science tool and framework that gives an in-depth
look at the performance of an enterprise website or app for data-driven insights. Data science
professionals work across many industries, and digital marketing is one of them. In digital
marketing, a web admin can easily access, visualize, and analyze website traffic and related
data via Google Analytics. It helps businesses understand how customers or end users interact
with a website. The tool works in close tandem with other products such as Search Console,
Google Ads, and Data Studio, which makes it a widespread option for anyone using different
Google products. Through Google Analytics, data scientists and marketing leaders can make
better marketing decisions, and even a non-technical professional can perform data analytics
through its high-end functionality and easy-to-use interface.
5. Matlab:
MATLAB is a closed-source, high-performance, multi-paradigm numerical computing and
simulation environment for mathematical and data-driven tasks. Through this tool, researchers
and data scientists can perform matrix operations, analyze algorithmic performance, and carry
out statistical modeling of data. It combines visualization, mathematical computation,
statistical analysis, and programming in an easy-to-use ecosystem. Data scientists find various
applications for MATLAB, especially in signal and image processing, neural network
simulation, and testing of different data science models.
6. SAS:
SAS is a popular data science tool designed by the SAS Institute for advanced analytics,
multivariate analysis, business intelligence (BI), data management operations, and predictive
analytics for future insights. This closed-source software caters to a wide range of data
science functionality through its graphical interface, its SAS programming language, and Base
SAS. Many MNCs and Fortune 500 companies use this tool for statistical modeling and data
analysis. It allows easy access to data from database files, online databases, SAS tables, and
Microsoft Excel tables, and it is also used to manipulate existing data sets for data-driven
insights by leveraging its statistical libraries and tools.
7. KNIME:
KNIME is another widely used open-source and free data science tool that helps in data
reporting, data analysis, and data mining. With this tool, data science professionals can quickly
extract and transform data. Its modular data-pipelining concept allows various data analysis
and data-related components to be integrated for machine learning (ML) and data mining
objectives. It provides an excellent graphical interface through which data science
professionals can define workflows between the various predefined nodes provided in its
repository, so minimal programming expertise is required to carry out data-driven analysis
and operations. Its visual data pipelines help render interactive visuals for a given dataset.
8. Apache Flink:
Flink is another of Apache's data science frameworks, used for real-time data analysis. It
is one of the most popular open-source stream-processing tools and frameworks, using a
distributed stream-processing engine to perform various data science operations. Data
scientists and professionals often need to perform real-time analysis and computation on data
such as users' web activity, measurements emitted by Internet of Things (IoT) devices,
location-tracking feeds, or financial transactions from apps and services. That is where Flink
delivers parallel and pipelined execution of data flows at low latency. It treats batch
processing as a special case of stream processing, so the same engine handles both unbounded
data streams (those without a fixed start and end) and bounded, stored datasets. Apache Flink
has a reputation for high-speed processing and analysis while reducing the complexity of
dealing with real-time data.
10. R Programming:
R is a robust programming language that competes with Python when it comes to data science.
Professionals and companies widely use it for statistical computing and data analysis. It is
actively developed, with strong contributor and community support, which makes it a valuable
tool for data science. It scales well thanks to its huge collection of data science packages
and libraries such as tidyr, dplyr, readr, SparkR, data.table, and ggplot2. Apart from
statistical and data science operations, R also provides powerful machine learning algorithms
in a simple and fast manner. This open-source language ships with thousands of packages and
object-oriented features, and RStudio is its most widely used development environment.
12. MongoDB:
MongoDB is a cross-platform, open-source, document-oriented NoSQL database management
system that allows data science professionals to manage semi-structured and unstructured
data. It acts as an alternative to a traditional relational database system, where all the data
has to be structured. MongoDB helps data science professionals manage document-oriented
data and store and retrieve information as and when required. It can easily handle large
volumes of data, offers many of the query capabilities of SQL, and supports dynamic queries.
MongoDB stores data as documents in a JSON-like format (BSON) and delivers high-level data
replication capabilities. Handling Big Data has become easier with the advent of MongoDB as
it enables increased data availability. Apart from basic database queries, MongoDB can also
support advanced analytics. It also allows data scalability, making it one of the widely used
data science tools.
7. Keywords
Clustering
Agglomerative
DBSCAN
8. Sample Questions
Remember:
1. List any four data science tools.
2. What is DBSCAN?
Understand:
1. Explain Self Organizing Maps with an example.
2. Explain the divisive clustering algorithm with an example.
Apply:
1. Apply agglomerative single-link clustering to the following distance matrix and draw the
dendrogram.
      A   B   C   D   E   F
A     0
B     5   0
C    14   9   0
D    11  20  13   0
E    18  15   6   3   0
F    10  16   8  10  11   0
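One way to check the resulting dendrogram is to run single-link clustering on this distance matrix with SciPy (a sketch, assuming SciPy is available; `squareform` converts the symmetric matrix to the condensed form `linkage` expects):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

labels = list("ABCDEF")
# The distance matrix from the question, written out symmetrically.
M = np.array([
    [ 0,  5, 14, 11, 18, 10],
    [ 5,  0,  9, 20, 15, 16],
    [14,  9,  0, 13,  6,  8],
    [11, 20, 13,  0,  3, 10],
    [18, 15,  6,  3,  0, 11],
    [10, 16,  8, 10, 11,  0],
], dtype=float)

Z = linkage(squareform(M), method="single")
# Column 2 of Z holds the merge heights: D+E at 3, A+B at 5,
# C joins (DE) at 6, F joins (CDE) at 8, (AB) joins (CDEF) at 9.
# from scipy.cluster.hierarchy import dendrogram
# dendrogram(Z, labels=labels)   # draws the dendrogram (needs matplotlib)
```

Tracing the merges by hand gives the same order, so the dendrogram has leaves D, E, C, F on one branch and A, B on the other, joined at height 9.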
At the end of this session, the facilitator (teacher) shall randomly pick a few students to
summarize the deliverables.
NIL
---------------