
Madda Walabu University

College of Computing
Department of Information Science
Introduction to Data Science
Dibaba A. (MSc)

Chapter 5: Standard Data Science Tasks


Discussion Outline
 Clustering (or segmentation)
 Anomaly (or outlier) detection
 Association-rule mining
 Prediction (including the subproblems of classification and regression)
Clustering

 Clustering, or cluster analysis, is a machine learning technique that groups an unlabeled dataset.
 It can be defined as "a way of grouping data points into different clusters consisting of similar data points; objects with possible similarities remain in a group that has few or no similarities with any other group."
 It is an unsupervised learning method: no supervision is provided to the algorithm, and it operates on unlabeled data.
 After applying a clustering technique, each cluster or group is assigned a cluster ID.
Clustering Cont’d…

 The clustering technique is widely used in various tasks. Some of the most common uses are:
• Market segmentation
• Statistical data analysis
• Social network analysis
• Image segmentation
• Anomaly detection, etc.

Clustering Cont’d…

 The diagram below illustrates how a clustering algorithm works: different fruits are divided into several groups with similar properties.
When and why would we want to do clustering?

 Useful for:
• Automatically organizing data.
• Representing high-dimensional data in a low-dimensional space (e.g., for visualization purposes).
• Understanding hidden structure in data.
• Preprocessing for further analysis.

Types of Clustering Methods

 Broadly speaking, clustering can be divided into two subgroups:
Hard Clustering: each data point either belongs to a cluster completely or not at all (a data point belongs to exactly one group).
Soft Clustering: instead of assigning each data point to exactly one cluster, a probability or likelihood of the data point belonging to each cluster is assigned (a data point can partially belong to several groups).
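The hard/soft distinction can be sketched in a few lines of Python (toy 1-D data with two fixed centroids; the softmax-over-negative-distance weighting is just one illustrative choice, not a standard algorithm):

```python
import math

# Two fixed 1-D cluster centres (hypothetical toy values).
centroids = [0.0, 10.0]

def hard_assign(x):
    # Hard clustering: the point belongs to exactly one cluster,
    # namely the one with the nearest centroid.
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))

def soft_assign(x):
    # Soft clustering: a membership weight for every cluster,
    # here via a softmax over negative distances; weights sum to 1.
    w = [math.exp(-abs(x - c)) for c in centroids]
    s = sum(w)
    return [wi / s for wi in w]

cluster = hard_assign(3.0)   # a single cluster index
weights = soft_assign(3.0)   # one weight per cluster, summing to 1
```

A point at 3.0 is hard-assigned wholly to cluster 0, while its soft assignment keeps a small nonzero weight for cluster 1.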
Cont’d…

 There are also other clustering methods in machine learning. The main clustering methods used in machine learning are:
Partitioning Clustering
Density-Based Clustering
Distribution Model-Based Clustering
Hierarchical Clustering
Fuzzy Clustering
Clustering Algorithms

Popular clustering algorithms in machine learning are:

 K-Means algorithm: the k-means algorithm is one of the most popular clustering algorithms. It partitions the samples into clusters so as to minimize the within-cluster variance. The number of clusters must be specified in advance.
 Mean-shift algorithm: the mean-shift algorithm tries to find dense areas in a smoothed density of data points. It is an example of a centroid-based model that works by updating candidate centroids to be the mean of the points within a given region.
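The k-means assign/update loop can be sketched in plain Python (toy 2-D data; real implementations use random or k-means++ initialization, while this sketch takes the first k points for determinism):

```python
def kmeans(points, k, iters=20):
    # Simplified initialization: take the first k points as centroids.
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins the nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k),
                    key=lambda c: (x - centroids[c][0]) ** 2
                                  + (y - centroids[c][1]) ** 2)
            clusters[i].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

# Two well-separated blobs, so k=2 should recover them.
data = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
        (8.0, 8.2), (7.9, 8.1), (8.3, 7.8)]
centroids, clusters = kmeans(data, k=2)
```

Even with both initial centroids inside the first blob, the update step pulls one centroid toward the second blob and the algorithm converges to the two true groups.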
Clustering Algorithms Cont’d…

 DBSCAN Algorithm: DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a density-based model similar to mean-shift, but with some remarkable advantages. In this algorithm, areas of high density are separated by areas of low density, so clusters can take any arbitrary shape.
 Expectation-Maximization Clustering using GMM: this algorithm can be used as an alternative to k-means, or for cases where k-means fails. In a Gaussian Mixture Model (GMM), the data points are assumed to be Gaussian distributed.
Clustering Algorithms Cont’d…

 Agglomerative Hierarchical algorithm: the agglomerative hierarchical algorithm performs bottom-up hierarchical clustering. Each data point is treated as a single cluster at the outset, and clusters are then successively merged. The resulting cluster hierarchy can be represented as a tree structure.
 Affinity Propagation: it differs from other clustering algorithms in that it does not require specifying the number of clusters. Pairs of data points exchange messages until convergence.
Anomaly Detection

 Anomaly detection is the process of finding the rare items, data points, events, or observations that raise suspicion by differing significantly from the rest of the data.
 Anomaly detection is also known as outlier detection.
 Finding anomalies depends on the ability to define what is normal.

Anomaly Detection Cont’d…

 In the image below, the green vehicle is an anomaly among the red vehicles.
Anomaly Detection Cont’d…

 Common reasons for outliers are:
• Data preprocessing errors;
• Noise;
• Fraud;
• Attacks.
Types of Anomalies or Outliers

 Point Anomaly (Global Anomaly)
 A data point is a point anomaly if it lies far away from the rest of the data.
 Example: a sudden transaction of a huge amount on a credit card.
 Contextual Anomaly
 Contextual anomalies are also known as conditional outliers. An observation is a contextual anomaly if it is unusual in a specific context; the same value may not be an anomaly in another context.
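A point anomaly like the credit-card example can be flagged with a simple z-score rule. This is only a sketch: the threshold of 2 is a hypothetical choice, and real systems often prefer robust statistics because an extreme outlier inflates the standard deviation itself.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    # Flag values whose distance from the mean exceeds
    # `threshold` standard deviations.
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

# Hypothetical card transactions: one huge amount stands out.
amounts = [20, 35, 18, 42, 27, 31, 25, 5000]
suspicious = zscore_outliers(amounts)   # flags only the 5000 transaction
```

Note that with the textbook threshold of 3 the 5000 transaction would slip through here, precisely because it drags the standard deviation up; that sensitivity is why robust alternatives (e.g. median-based rules) are common in practice.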
Types of Anomalies or Outliers Cont’d…

 Collective Anomaly
Collective anomalies occur when a collection of related data points is anomalous with respect to the whole dataset; such values are known as collective outliers.
In such anomalies, the specific or individual values are not anomalous on their own, either globally or contextually.
Prediction (Classification and Regression)

 There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends.
 These two forms are as follows:
1. Classification
2. Prediction
 We use classification and prediction to extract a model representing the data classes and to predict future data trends.
 Classification predicts the categorical labels of data with the prediction models.
 This analysis provides us with the best understanding of the data at a large scale.
Cont’d…

 Classification models predict categorical class labels, while prediction models predict continuous-valued functions.
 For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment, given their income and occupation.
Cont’d…

 The figure below shows the classification and prediction process:
What is Classification?

 Classification identifies the category, or class label, of a new observation.
 First, a set of data is used as training data.
 The set of input data and the corresponding outputs are given to the algorithm, so the training data set includes the input data and their associated class labels.
Classification Cont’d…

 Using the training dataset, the algorithm derives a model, or classifier. The derived model can be a decision tree, a mathematical formula, or a neural network.
 In classification, when unlabeled data is given to the model, it should find the class to which it belongs.
 The new data provided to the model is the test data set.
 Sometimes there can be more than two classes to classify; that is called multiclass classification.
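A minimal instance of this train-then-classify flow is a 1-nearest-neighbour classifier, sketched here on hypothetical loan records of the form (income, debt) labelled safe/risky (the bank-loan framing echoes the earlier example; the numbers are invented):

```python
def predict(train, point):
    # "Training" is simply storing the labelled records; classification
    # returns the label of the closest training point (squared distance).
    nearest = min(train,
                  key=lambda rec: (rec[0][0] - point[0]) ** 2
                                  + (rec[0][1] - point[1]) ** 2)
    return nearest[1]

# Hypothetical training set: (income, debt) -> class label.
train = [((80, 10), "safe"),  ((75, 15), "safe"),
         ((20, 40), "risky"), ((25, 35), "risky")]

# A new, unlabeled test observation gets the label of its neighbourhood.
label = predict(train, (78, 12))
```

A test point near the high-income, low-debt records is labelled "safe", while one near the low-income, high-debt records is labelled "risky".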
How Does Classification Work?

 There are two stages in a data classification system: classifier or model creation, and applying the classifier for classification.
Cont’d…

 Developing the classifier or model creation: this stage is the learning stage, or learning process. The classification algorithm constructs the classifier in this stage.
 A classifier is constructed from a training set composed of database records and their corresponding class names.
 Each group of records that makes up the training set is referred to as a category or class.
Cont’d…

 Applying the classifier for classification: the classifier is used for classification at this stage. The test data are used here to estimate the accuracy of the classification algorithm. If the accuracy is deemed sufficient, the classification rules can be applied to new data records. Applications include:
Sentiment Analysis: sentiment analysis is highly helpful in social media monitoring; we can use it to extract social media insights. With advanced machine learning algorithms, we can build sentiment analysis models that read and analyze even misspelled words.
Cont’d…

Document Classification: we can use document classification to organize documents into sections according to their content. Document classification is a form of text classification: we classify the words in the entire document, and with the help of machine learning classification algorithms we can execute it automatically.
Image Classification: image classification assigns an image to one of a set of trained categories, which could be based on the image's caption, a statistical value, or a theme. You can tag images to train your model for the relevant categories by applying supervised learning algorithms.
Regression

 Regression is generally used for prediction. Predicting the value of a house from facts such as the number of rooms, the total area, etc., is an example of prediction.
 Regression is the process of finding the correlations between dependent and independent variables. It helps in predicting continuous variables, such as market trends or house prices.
 The task of a regression algorithm is to find the mapping function that maps the input variable (x) to the continuous output variable (y).
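The mapping from x to a continuous y can be illustrated with an ordinary least-squares fit of a straight line. The data are hypothetical house records constructed so that price is exactly twice the area, so the fit should recover slope 2 and intercept 0:

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b:
    # a = cov(x, y) / var(x),  b = mean(y) - a * mean(x).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

areas = [50, 70, 90, 110]       # input variable x (square metres)
prices = [100, 140, 180, 220]   # continuous output y (price, thousands)
a, b = fit_line(areas, prices)
predicted = a * 100 + b         # prediction for a 100 m^2 house
```

With real, noisy data the line would not pass exactly through the points; least squares then gives the line minimizing the summed squared errors.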
Difference between Regression and Classification
