
Madda Walabu University

College of Computing
Department of Information Science
Introduction to Data Science
Dibaba A. (MSc)

Chapter 5: Standard Data Science Tasks


Discussion Outline
 Clustering (or segmentation)
 Anomaly (or outlier) detection
 Association-rule mining
 Prediction (including the subproblems of classification and regression)
Clustering

 Clustering, or cluster analysis, is a machine learning technique that groups an unlabeled dataset.
 It can be defined as "a way of grouping data points into different clusters consisting of similar data points; objects with possible similarities remain in a group that has few or no similarities with any other group."
 It is an unsupervised learning method: no supervision is provided to the algorithm, and it operates on unlabeled data.
 After applying a clustering technique, each cluster or group is assigned a cluster ID.
Clustering Cont’d…

 The clustering technique is widely used in various tasks. Some of the most common uses are:
• Market segmentation
• Statistical data analysis
• Social network analysis
• Image segmentation
• Anomaly detection, etc.

Clustering Cont’d…

 The diagram below illustrates how a clustering algorithm works: different fruits are divided into several groups with similar properties.
When and why would we want to do clustering?

 Useful for:
• Automatically organizing data.
• Representing high-dimensional data in a low-dimensional space (e.g., for visualization purposes).
• Understanding hidden structure in data.
• Preprocessing for further analysis.

Types of Clustering Methods

 Broadly speaking, clustering can be divided into two subgroups:
Hard Clustering: each data point either belongs to a cluster completely or not at all (a data point belongs to exactly one group).
Soft Clustering: instead of assigning each data point to exactly one cluster, a probability or likelihood of the data point belonging to each cluster is assigned (a data point can partially belong to several groups).
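The hard/soft distinction can be sketched in a few lines of Python (toy 1-D data with two fixed centroids; the softmax-over-negative-distance weighting is just one illustrative choice, not a standard algorithm):

```python
import math

# Two fixed 1-D cluster centres (hypothetical toy values).
centroids = [0.0, 10.0]

def hard_assign(x):
    # Hard clustering: the point belongs to exactly one cluster,
    # namely the one with the nearest centroid.
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))

def soft_assign(x):
    # Soft clustering: a membership weight for every cluster,
    # here via a softmax over negative distances; weights sum to 1.
    w = [math.exp(-abs(x - c)) for c in centroids]
    s = sum(w)
    return [wi / s for wi in w]

cluster = hard_assign(3.0)   # a single cluster index
weights = soft_assign(3.0)   # one weight per cluster, summing to 1
```

A point at 3.0 is hard-assigned wholly to cluster 0, while its soft assignment keeps a small nonzero weight for cluster 1.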
Cont’d…

 There are also other clustering methods in machine learning. The main clustering methods used in machine learning are:
Partitioning Clustering
Density-Based Clustering
Distribution Model-Based Clustering
Hierarchical Clustering
Fuzzy Clustering
Clustering Algorithms

Popular clustering algorithms in machine learning are:

 K-Means algorithm: the k-means algorithm is one of the most popular clustering algorithms. It partitions the samples into clusters so as to minimize the within-cluster variance. The number of clusters must be specified in advance.
 Mean-shift algorithm: the mean-shift algorithm tries to find dense areas in a smoothed density of data points. It is an example of a centroid-based model that works by updating candidate centroids to be the mean of the points within a given region.
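The k-means assign/update loop can be sketched in plain Python (toy 2-D data; real implementations use random or k-means++ initialization, while this sketch takes the first k points for determinism):

```python
def kmeans(points, k, iters=20):
    # Simplified initialization: take the first k points as centroids.
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins the nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k),
                    key=lambda c: (x - centroids[c][0]) ** 2
                                  + (y - centroids[c][1]) ** 2)
            clusters[i].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

# Two well-separated blobs, so k=2 should recover them.
data = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
        (8.0, 8.2), (7.9, 8.1), (8.3, 7.8)]
centroids, clusters = kmeans(data, k=2)
```

Even with both initial centroids inside the first blob, the update step pulls one centroid toward the second blob and the algorithm converges to the two true groups.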
Clustering Algorithms Cont’d…

 DBSCAN Algorithm: DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a density-based model similar to mean-shift, but with some remarkable advantages. In this algorithm, areas of high density are separated by areas of low density, so clusters can take any arbitrary shape.
 Expectation-Maximization Clustering using GMM: this algorithm can be used as an alternative to k-means, or for cases where k-means fails. In a Gaussian Mixture Model (GMM), the data points are assumed to be Gaussian distributed.
Clustering Algorithms Cont’d…

 Agglomerative Hierarchical algorithm: the agglomerative hierarchical algorithm performs bottom-up hierarchical clustering. Each data point is treated as a single cluster at the outset, and clusters are then successively merged. The resulting cluster hierarchy can be represented as a tree structure.
 Affinity Propagation: it differs from other clustering algorithms in that it does not require specifying the number of clusters. Pairs of data points exchange messages until convergence.
Anomaly Detection

 Anomaly detection is the process of finding the rare items, data points, events, or observations that raise suspicion by differing significantly from the rest of the data.
 Anomaly detection is also known as outlier detection.
 Finding anomalies depends on the ability to define what is normal.

Anomaly Detection Cont’d…

 In the image below, the green vehicle is an anomaly among the red vehicles.
Anomaly Detection Cont’d…

 Common reasons for outliers are:
• Data preprocessing errors;
• Noise;
• Fraud;
• Attacks.
Types of Anomalies or Outliers

 Point Anomaly (Global Anomaly)
 A data point is a point anomaly if it lies far away from the rest of the data.
 Example: a sudden transaction of a huge amount on a credit card.
 Contextual Anomaly
 Contextual anomalies are also known as conditional outliers. An observation is a contextual anomaly if it is unusual in a specific context; the same value may not be an anomaly in another context.
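A point anomaly like the credit-card example can be flagged with a simple z-score rule. This is only a sketch: the threshold of 2 is a hypothetical choice, and real systems often prefer robust statistics because an extreme outlier inflates the standard deviation itself.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.0):
    # Flag values whose distance from the mean exceeds
    # `threshold` standard deviations.
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

# Hypothetical card transactions: one huge amount stands out.
amounts = [20, 35, 18, 42, 27, 31, 25, 5000]
suspicious = zscore_outliers(amounts)   # flags only the 5000 transaction
```

Note that with the textbook threshold of 3 the 5000 transaction would slip through here, precisely because it drags the standard deviation up; that sensitivity is why robust alternatives (e.g. median-based rules) are common in practice.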
Types of Anomalies or Outliers Cont’d…

 Collective Anomaly
Collective anomalies occur when a collection of related data points is anomalous with respect to the whole dataset; such values are known as collective outliers.
In such anomalies, the specific or individual values are not anomalous on their own, either globally or contextually.
Prediction (Classification and Regression)

 There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends.
 These two forms are as follows:
1. Classification
2. Prediction
 We use classification and prediction to extract a model representing the data classes and to predict future data trends.
 Classification predicts the categorical labels of data with the prediction models.
 This analysis provides us with the best understanding of the data at a large scale.
Cont’d…

 Classification models predict categorical class labels, while prediction models predict continuous-valued functions.
 For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment, given their income and occupation.
Cont’d…

 The figure below shows the classification and prediction process:
What is Classification?

 Classification identifies the category, or class label, of a new observation.
 First, a set of data is used as training data.
 The set of input data and the corresponding outputs are given to the algorithm, so the training data set includes the input data and their associated class labels.
Classification Cont’d…

 Using the training dataset, the algorithm derives a model, or classifier. The derived model can be a decision tree, a mathematical formula, or a neural network.
 In classification, when unlabeled data is given to the model, it should find the class to which it belongs.
 The new data provided to the model is the test data set.
 Sometimes there can be more than two classes to classify; that is called multiclass classification.
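A minimal instance of this train-then-classify flow is a 1-nearest-neighbour classifier, sketched here on hypothetical loan records of the form (income, debt) labelled safe/risky (the bank-loan framing echoes the earlier example; the numbers are invented):

```python
def predict(train, point):
    # "Training" is simply storing the labelled records; classification
    # returns the label of the closest training point (squared distance).
    nearest = min(train,
                  key=lambda rec: (rec[0][0] - point[0]) ** 2
                                  + (rec[0][1] - point[1]) ** 2)
    return nearest[1]

# Hypothetical training set: (income, debt) -> class label.
train = [((80, 10), "safe"),  ((75, 15), "safe"),
         ((20, 40), "risky"), ((25, 35), "risky")]

# A new, unlabeled test observation gets the label of its neighbourhood.
label = predict(train, (78, 12))
```

A test point near the high-income, low-debt records is labelled "safe", while one near the low-income, high-debt records is labelled "risky".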
How Does Classification Work?

 There are two stages in a data classification system: classifier or model creation, and applying the classifier for classification.
Cont’d…

 Developing the classifier or model creation: this stage is the learning stage, or learning process. The classification algorithm constructs the classifier in this stage.
 A classifier is constructed from a training set composed of database records and their corresponding class names.
 Each group of records that makes up the training set is referred to as a category or class.
Cont’d…

 Applying the classifier for classification: the classifier is used for classification at this stage. The test data are used here to estimate the accuracy of the classification algorithm. If the accuracy is deemed sufficient, the classification rules can be applied to new data records. Applications include:
Sentiment Analysis: sentiment analysis is highly helpful in social media monitoring; we can use it to extract social media insights. With advanced machine learning algorithms, we can build sentiment analysis models that read and analyze even misspelled words.
Cont’d…

Document Classification: we can use document classification to organize documents into sections according to their content. Document classification is a form of text classification: we classify the words in the entire document, and with the help of machine learning classification algorithms we can execute it automatically.
Image Classification: image classification assigns an image to one of a set of trained categories, which could be based on the image's caption, a statistical value, or a theme. You can tag images to train your model for the relevant categories by applying supervised learning algorithms.
Regression

 Regression is generally used for prediction. Predicting the value of a house from facts such as the number of rooms, the total area, etc., is an example of prediction.
 Regression is the process of finding the correlations between dependent and independent variables. It helps in predicting continuous variables, such as market trends or house prices.
 The task of a regression algorithm is to find the mapping function that maps the input variable (x) to the continuous output variable (y).
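The mapping from x to a continuous y can be illustrated with an ordinary least-squares fit of a straight line. The data are hypothetical house records constructed so that price is exactly twice the area, so the fit should recover slope 2 and intercept 0:

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b:
    # a = cov(x, y) / var(x),  b = mean(y) - a * mean(x).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

areas = [50, 70, 90, 110]       # input variable x (square metres)
prices = [100, 140, 180, 220]   # continuous output y (price, thousands)
a, b = fit_line(areas, prices)
predicted = a * 100 + b         # prediction for a 100 m^2 house
```

With real, noisy data the line would not pass exactly through the points; least squares then gives the line minimizing the summed squared errors.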
Difference between Regression and Classification
