Week 15 Lecture Notes


MACHINE LEARNING

A JOURNEY FROM DATA TO DECISIONS

DEPARTMENT OF COMPUTER SCIENCE


Unsupervised Learning

Types of Unsupervised Learning Algorithms

Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities stay in the same group and have few or no similarities with objects in other groups.

Association: An association rule is an unsupervised learning method used to find relationships between variables in large databases.

K-Means Clustering

What is K-Means Clustering?

It is an iterative algorithm that divides an unlabeled dataset into k different clusters, such that each data point belongs to only one group and the points within a group share similar properties.
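The assignment/update loop behind this can be sketched in a few lines of plain Python; the 2-D points and the two starting centroids below are made-up values for illustration:

```python
# A minimal K-Means sketch (illustrative data and starting centroids).
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 7.5)]
final_centroids, final_clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(final_centroids)
```

In practice a library routine such as scikit-learn's KMeans would be used; this sketch only shows the two alternating steps.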
How does the K-Means Algorithm
Work?

Final Clusters

Example K-Means Clustering

Types of Clustering Methods

▰Partitioning Clustering

▰Density-Based Clustering

▰Distribution Model-Based Clustering

▰Hierarchical Clustering
Partitioning Clustering

In this type, the dataset is divided into k groups, where k is the pre-defined number of groups.

The cluster centers are chosen so that the distance from the data points of a cluster to their own centroid is smaller than their distance to any other cluster's centroid.
Density-Based Clustering

It connects highly dense areas into clusters, so arbitrarily shaped clusters can be formed as long as the dense regions stay connected.

The algorithm does this by identifying the different dense regions in the dataset and joining connected areas of high density into clusters.

The dense areas in the data space are separated from each other by sparser areas.
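A toy version of this idea, in the spirit of DBSCAN, can be sketched in plain Python on made-up 1-D values; eps is the neighbourhood radius and min_pts the density threshold, and both values are chosen only for illustration:

```python
# A toy density-based clustering sketch in the spirit of DBSCAN, on made-up
# 1-D values. eps is the neighbourhood radius, min_pts the density threshold.
def density_cluster(points, eps=1.5, min_pts=2):
    labels = [None] * len(points)       # None = unvisited, -1 = noise
    cluster_id = -1

    def neighbours(i):
        return [j for j, q in enumerate(points) if abs(points[i] - q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1              # too sparse: mark as noise
            continue
        cluster_id += 1
        labels[i] = cluster_id
        while seeds:                    # grow through connected dense areas
            j = seeds.pop()
            if labels[j] in (None, -1):
                labels[j] = cluster_id
                more = neighbours(j)
                if len(more) >= min_pts:
                    seeds.extend(m for m in more if labels[m] is None)
    return labels

labels = density_cluster([1.0, 1.5, 2.0, 10.0, 10.5, 25.0])
print(labels)  # [0, 0, 0, 1, 1, -1]
```

The isolated point 25.0 ends up labeled -1 (noise), which is exactly the "separated by sparser areas" behaviour described above.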
Distribution Model-Based
Clustering

The data is divided based on the probability that each data point belongs to a particular distribution.

The grouping is done by assuming the data follows certain distributions, most commonly the Gaussian distribution.
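The probabilistic "soft" assignment behind this can be sketched with two assumed 1-D Gaussian components whose weights, means, and standard deviations are made up; a full Gaussian mixture model would also re-estimate these parameters (the EM algorithm), which is omitted here:

```python
import math

# Toy soft assignment under two assumed 1-D Gaussian components.
# The weights, means, and standard deviations are made up for illustration.
def gaussian_pdf(x, mean, std):
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, components):
    # components: list of (weight, mean, std) triples
    likelihoods = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(likelihoods)
    # Posterior probability that x came from each component.
    return [l / total for l in likelihoods]

components = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]
r = responsibilities(1.0, components)
print(r)  # the point at x = 1 is far more likely under the first Gaussian
```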
Hierarchical Clustering

In this technique, the dataset is divided into clusters that form a tree-like structure, also called a dendrogram.

Any desired number of clusters can then be obtained by cutting the tree at the appropriate level.
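A toy bottom-up (agglomerative) sketch of this on made-up 1-D values, using single linkage; stopping the merging when k clusters remain plays the role of cutting the dendrogram at a level:

```python
# A toy agglomerative (bottom-up) sketch on made-up 1-D values, using
# single linkage; stopping when k clusters remain is the "cut" of the tree.
def agglomerate(values, k):
    clusters = [[v] for v in values]    # start: every point is its own cluster
    while len(clusters) > k:
        best = None
        # Find the pair of clusters with the smallest single-linkage distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return [sorted(c) for c in clusters]

clusters3 = agglomerate([1.0, 1.2, 5.0, 5.3, 9.9], k=3)
print(clusters3)  # [[1.0, 1.2], [5.0, 5.3], [9.9]]
```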

Machine Learning Process

Association

Apriori Algorithm

Steps for Apriori Algorithm

▰Step-1: Determine the support of the itemsets in the transactional database, and choose the minimum support and minimum confidence.
▰Step-2: Keep all itemsets in the transactions whose support is higher than the minimum (selected) support value.
▰Step-3: Find all rules over these subsets whose confidence is higher than the minimum confidence threshold.
▰Step-4: Sort the rules in decreasing order of lift.
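Steps 1 and 2 can be sketched in plain Python; the transaction list and minimum support below are made up for illustration and are not the dataset used on the following slides:

```python
# Apriori sketch: count itemset supports level by level, keep the itemsets
# meeting the minimum support, and grow candidates from the survivors.
# The transactions and minimum support are made up for illustration.
transactions = [
    {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"},
    {"A", "B", "C"}, {"A"}, {"B"},
]
min_support = 2

def frequent_itemsets(transactions, min_support):
    items = sorted(set().union(*transactions))
    frequent, size = {}, 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        # Support of a candidate = number of transactions that contain it.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Next candidates: unions of surviving itemsets, one item larger.
        candidates = list({a | b for a in level for b in level
                           if len(a | b) == size})
    return frequent

freq = frequent_itemsets(transactions, min_support)
print(freq[frozenset({"A", "B", "C"})])  # 2
```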
Apriori Algorithm Working

Suppose we have the following dataset containing various transactions; from this dataset, we need to find the frequent itemsets and generate the association rules using the Apriori algorithm.

Step-1: Calculating C1 and L1

Candidate set C1 and frequent itemset L1

Step-2: Candidate Generation C2 and L2

Candidate set C2 and frequent itemset L2

Step-3: Candidate Generation C3 and L3

Candidate set C3
As the C3 table shows, only one itemset combination has a support count equal to the minimum support count.
So L3 will contain only one combination, i.e., {A, B, C}.

Step-4: Finding the association
rules for the subsets

Rule       Support  Confidence
A^B → C    2        sup{(A^B)^C}/sup(A^B) = 2/4 = 0.5 = 50%
B^C → A    2        sup{(B^C)^A}/sup(B^C) = 2/4 = 0.5 = 50%
A^C → B    2        sup{(A^C)^B}/sup(A^C) = 2/4 = 0.5 = 50%
C → A^B    2        sup{C^(A^B)}/sup(C) = 2/5 = 0.4 = 40%
A → B^C    2        sup{A^(B^C)}/sup(A) = 2/6 = 0.33 = 33.33%
B → A^C    2        sup{B^(A^C)}/sup(B) = 2/7 = 0.29 = 28.57%

As the given threshold (minimum confidence) is 50%, the first three rules (A^B → C, B^C → A, and A^C → B) can be considered strong association rules for the given problem.
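The confidence column can be reproduced directly from the support counts stated in the table (sup(A^B) = sup(B^C) = sup(A^C) = 4, sup(C) = 5, sup(A) = 6, sup(B) = 7, and support 2 for each full rule):

```python
# Rules as (support of the full itemset, support of the antecedent),
# using the counts stated in the table above.
rules = {
    "A^B -> C": (2, 4),
    "B^C -> A": (2, 4),
    "A^C -> B": (2, 4),
    "C -> A^B": (2, 5),
    "A -> B^C": (2, 6),
    "B -> A^C": (2, 7),
}
min_confidence = 0.5

# Confidence = sup(antecedent AND consequent) / sup(antecedent).
strong = [rule for rule, (sup, sup_ante) in rules.items()
          if sup / sup_ante >= min_confidence]
print(strong)  # the three 50%-confidence rules survive the threshold
```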

Splitting the Dataset - Holdout
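The holdout idea (reserve part of the data for testing, train on the rest) can be sketched in plain Python; the dataset, the 25% test fraction, and the fixed seed below are made-up choices for illustration:

```python
import random

# A minimal holdout split: shuffle the indices and reserve a fraction
# (here 25%) of a made-up dataset for testing.
def holdout_split(data, test_fraction=0.25, seed=0):
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)         # fixed seed: reproducible
    cut = int(len(data) * (1 - test_fraction))
    train = [data[i] for i in indices[:cut]]
    test = [data[i] for i in indices[cut:]]
    return train, test

train, test = holdout_split(list(range(8)))
print(len(train), len(test))  # 6 2
```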

Stratified Sampling
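A sketch of stratified splitting on made-up samples and labels: each class is split separately, so the train/test label proportions match the full dataset:

```python
from collections import defaultdict

# Stratified holdout sketch: split each class separately so the train/test
# label proportions match the full dataset. Samples and labels are made up.
def stratified_split(samples, labels, test_fraction=0.25):
    by_class = defaultdict(list)
    for s, y in zip(samples, labels):
        by_class[y].append(s)
    train, test = [], []
    for y, group in by_class.items():
        cut = int(len(group) * (1 - test_fraction))  # per-class cut point
        train += [(s, y) for s in group[:cut]]
        test += [(s, y) for s in group[cut:]]
    return train, test

samples = list(range(8))
labels = ["pos"] * 4 + ["neg"] * 4
train, test = stratified_split(samples, labels)
print([y for _, y in test])  # ['pos', 'neg'] -- one from each class
```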

Underfitting and Overfitting

Bias vs Variance

• Bias is the error caused by overly simple model assumptions: the difference between the model's average predicted value and the true value it is trying to predict.
• Variance is the model's sensitivity to the particular training set, commonly seen as the difference in performance on the training set versus the test set.
Bias vs Variance

We generally want to minimize both bias and variance, i.e., build a model that not only fits the training data well but also generalizes well to test/validation data.
Enrich the Dataset

Improve Model Efficiency –
K-Fold Testing
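A minimal sketch of how k-fold splitting assigns indices (sizes made up, no shuffling); each sample lands in the test fold exactly once across the k rounds:

```python
# A minimal sketch of k-fold index splitting (no shuffling): the data is
# cut into k folds, and each fold serves as the test set exactly once.
def k_fold_indices(n, k):
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))          # held-out fold
        train = [i for i in range(n) if i not in test]   # everything else
        folds.append((train, test))
        start += size
    return folds

for train, test in k_fold_indices(n=6, k=3):
    print(test)  # [0, 1] then [2, 3] then [4, 5]
```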

Model Selection

Anaconda Environment

Value Addition

Sample Dataset - Iris

Dataset Types

Facets of data

■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming
Data Preprocessing
Techniques - Missing Data
Two common ways to deal with missing data:

1. Delete the rows that contain the missing values.
2. Impute the missing values, e.g., by calculating the mean of the column.
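Both options can be shown on a made-up numeric column where None marks the missing entry:

```python
# A made-up numeric column where None marks the missing entry.
column = [10.0, 12.0, None, 14.0]

# Option 1: delete the rows with missing values.
dropped = [v for v in column if v is not None]

# Option 2: impute the missing values with the column mean.
present = [v for v in column if v is not None]
mean = sum(present) / len(present)               # (10 + 12 + 14) / 3 = 12.0
imputed = [v if v is not None else mean for v in column]

print(dropped)   # [10.0, 12.0, 14.0]
print(imputed)   # [10.0, 12.0, 12.0, 14.0]
```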

Encoding Categorical Data
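A common way to encode categorical data is one-hot encoding, where each category becomes its own 0/1 indicator column; a sketch on made-up colour names:

```python
# One-hot encode a categorical column: each category becomes a 0/1 column.
# The colour names below are made up for illustration.
def one_hot(values):
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

categories, encoded = one_hot(["red", "green", "red", "blue"])
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```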

Feature Scaling

• Scaling data means transforming it so that the values fit within some range or
scale, such as 0–100 or 0–1.

• Imagine you have an image represented as a set of RGB values ranging from 0 to
255. We can scale the range of the values from 0–255 down to a range of 0–1.

• This scaling process will not affect the algorithm output since every value is scaled
in the same way.

• But it can speed up the training process, because now the algorithm only needs to
handle numbers less than or equal to 1.
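The 0–255 to 0–1 transformation described above is min-max scaling, which can be sketched in one line:

```python
# Min-max scaling: map values from [low, high] down to [0, 1].
def min_max_scale(values, low=0.0, high=255.0):
    return [(v - low) / (high - low) for v in values]

print(min_max_scale([0, 51, 255]))  # [0.0, 0.2, 1.0]
```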

Example Dataset

Machine Learning with R

Dataset Resources

Open Data Resources

Technologies
Tools for Data Science

Applications
Image Processing

Banking and Finance

Sports

Digital Advertisements

Health Care

Speech Recognition

Internet Search

Recommender
System

Gaming

Augmented Reality

Self-Driving Cars

Robots

Questions & Answers Session

