Classification, Clustering, and Recommender Systems

Classification

Classification is a task that requires the use of machine learning algorithms that learn
how to assign a class label to examples from the problem domain. An easy-to-understand
example is classifying emails as “spam” or “not spam.”
There are many different types of classification tasks that you may encounter in machine
learning, and specialized approaches to modelling may be used for each.

• Classification predictive modelling involves assigning a class label to input
examples.
• Binary classification refers to predicting one of two classes, and multi-class
classification involves predicting one of more than two classes.
• Multi-label classification involves predicting one or more classes for each
example, and imbalanced classification refers to classification tasks where the
distribution of examples across the classes is not equal.

Classification Predictive Modelling


In machine learning, classification refers to a predictive modelling problem where a class
label is predicted for a given example of input data.
Examples of classification problems include:

• Given an example, classify if it is spam or not.
• Given a handwritten character, classify it as one of the known characters.
• Given recent user behaviour, classify it as churn or not.

From a modelling perspective, classification requires a training dataset with many
examples of inputs and outputs from which to learn.
A model will use the training dataset and will calculate how to best map examples of
input data to specific class labels. As such, the training dataset must be sufficiently
representative of the problem and have many examples of each class label.

There are perhaps four main types of classification tasks:


• Binary Classification
• Multi-Class Classification
• Multi-Label Classification
• Imbalanced Classification

Binary Classification
Binary classification refers to those classification tasks that have two class labels.

Examples include:
• Email spam detection (spam or not).
• Churn prediction (churn or not).
• Conversion prediction (buy or not).
Typically, binary classification tasks involve one class that is the normal state and
another class that is the abnormal state.

For example, “not spam” is the normal state and “spam” is the abnormal state. Similarly,
“cancer not detected” is the normal state for a task that involves a medical test, and
“cancer detected” is the abnormal state.

By convention, the class for the normal state is assigned the class label 0, and the
class for the abnormal state is assigned the class label 1.

It is common to model a binary classification task with a model that predicts a Bernoulli
probability distribution for each example.
The Bernoulli distribution is a discrete probability distribution that covers the case
where an event has a binary outcome of either 0 or 1. For classification, this means that
the model predicts the probability of an example belonging to class 1, the abnormal
state.

Popular algorithms that can be used for binary classification include:

• Logistic Regression
• k-Nearest Neighbors
• Decision Trees
• Support Vector Machine
• Naive Bayes
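
As an illustration, here is a minimal sketch of a binary classification workflow;
scikit-learn and the synthetic make_classification dataset are assumptions here, since
the text does not prescribe any library:

```python
# A minimal binary classification sketch (scikit-learn is an assumption).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic dataset with two classes (0 = normal, 1 = abnormal).
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba returns the Bernoulli probabilities; column 1 is the
# predicted probability of the abnormal class (label 1).
print(model.predict_proba(X_test[:3])[:, 1])
print(model.score(X_test, y_test))  # accuracy on held-out data
```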
Multi-Class Classification
Multi-class classification refers to those classification tasks that have more than two
class labels.
Examples include:

• Face classification.
• Plant species classification.
• Optical character recognition.
Unlike binary classification, multi-class classification does not have the notion of normal
and abnormal outcomes. Instead, examples are classified as belonging to one among a
range of known classes.

The number of class labels may be very large on some problems. For example, a model
may predict a photo as belonging to one among thousands or tens of thousands of faces
in a face recognition system.

Problems that involve predicting a sequence of words, such as text translation models,
may also be considered a special type of multi-class classification. Each word in the
sequence of words to be predicted involves a multi-class classification where the size of
the vocabulary defines the number of possible classes that may be predicted and could
be tens or hundreds of thousands of words in size.

It is common to model a multi-class classification task with a model that predicts
a Multinoulli probability distribution for each example.
The Multinoulli distribution is a discrete probability distribution that covers the case
where an event has a categorical outcome, e.g. k in {1, 2, 3, …, K}. For classification,
this means that the model predicts the probability of an example belonging to each class
label.

Popular algorithms that can be used for multi-class classification include:

• k-Nearest Neighbors.
• Decision Trees.
• Naive Bayes.
• Random Forest.
• Gradient Boosting.
(Figure omitted: scatter plot of a multi-class classification dataset.)
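
A minimal sketch of the multi-class case, again assuming scikit-learn; predict_proba
returns the predicted Multinoulli distribution described above:

```python
# A minimal multi-class sketch: the model outputs a Multinoulli
# (categorical) distribution over K classes for each example.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Three-class synthetic dataset (n_informative raised so 3 classes fit).
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           random_state=1)

model = RandomForestClassifier(random_state=1)
model.fit(X, y)

# One row per example: probability of membership in each of the 3 classes,
# summing to 1 -- a predicted Multinoulli distribution.
print(model.predict_proba(X[:2]))
```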

Multi-Label Classification
Multi-label classification refers to those classification tasks that have two or more class
labels, where one or more class labels may be predicted for each example.

Consider the example of photo classification, where a given photo may have multiple
objects in the scene and a model may predict the presence of multiple known objects in
the photo, such as “bicycle,” “apple,” “person,” etc.

This is unlike binary classification and multi-class classification, where a single class
label is predicted for each example.

It is common to model multi-label classification tasks with a model that predicts
multiple outputs, with each output predicted as a Bernoulli probability distribution.
This is essentially a model that makes multiple binary classification predictions for
each example.

Classification algorithms used for binary or multi-class classification cannot be used
directly for multi-label classification. Specialized versions of standard classification
algorithms can be used, so-called multi-label versions of the algorithms, including:

• Multi-label Decision Trees
• Multi-label Random Forests
• Multi-label Gradient Boosting
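
A minimal multi-label sketch, assuming scikit-learn, whose RandomForestClassifier
accepts a two-dimensional label matrix and so acts as a multi-label random forest:

```python
# A minimal multi-label sketch: y is a binary indicator matrix with one
# column per label, and several labels may be 1 for the same example.
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_multilabel_classification(n_samples=500, n_classes=4,
                                      random_state=1)

model = RandomForestClassifier(random_state=1)
model.fit(X, y)

# Each prediction is a 0/1 vector -- one Bernoulli-style decision per label.
print(model.predict(X[:2]))
```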

Imbalanced Classification
Imbalanced classification refers to classification tasks where the number of examples in
each class is unequally distributed.
Typically, imbalanced classification tasks are binary classification tasks where the
majority of examples in the training dataset belong to the normal class and a minority of
examples belong to the abnormal class.

Examples include:

• Fraud detection.
• Outlier detection.
• Medical diagnostic tests.
These problems are modelled as binary classification tasks, although they may require
specialized techniques.

Specialized techniques may be used to change the composition of samples in the
training dataset by undersampling the majority class or oversampling the minority class.
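
A minimal sketch of oversampling the minority class, assuming scikit-learn's resample
utility (the dedicated imbalanced-learn library offers richer tools, but this keeps to
one library):

```python
# A minimal oversampling sketch using sklearn.utils.resample.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

# 95% of examples in class 0 (normal), 5% in class 1 (abnormal).
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=1)

X_min, y_min = X[y == 1], y[y == 1]
X_maj, y_maj = X[y == 0], y[y == 0]

# Oversample the minority class (with replacement) to match the majority.
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=len(y_maj), random_state=1)

X_bal = np.vstack([X_maj, X_up])
y_bal = np.concatenate([y_maj, y_up])
print(np.bincount(y), "->", np.bincount(y_bal))  # class counts before/after
```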

Learners in Classification Problems:


In classification problems, there are two types of learners:

1. Lazy Learners: A lazy learner first stores the training dataset and waits until it
receives the test dataset. Classification is then done on the basis of the most
closely related data stored in the training dataset. Lazy learners take less time in
training but more time for predictions (see the sketch after this list).
Example: k-NN algorithm, case-based reasoning.
2. Eager Learners: Eager learners develop a classification model based on a
training dataset before receiving a test dataset. Opposite to lazy learners, eager
learners take more time in learning and less time in
prediction. Example: decision trees, Naïve Bayes, ANN.
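
A small sketch, assuming scikit-learn, that contrasts the two kinds of learner; the
timings are illustrative only, not benchmarks:

```python
# Lazy learner (k-NN): fitting just stores the data, predicting is slow.
# Eager learner (decision tree): fitting builds a model, predicting is fast.
import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, random_state=1)

for name, model in [("lazy k-NN", KNeighborsClassifier()),
                    ("eager tree", DecisionTreeClassifier())]:
    t0 = time.time(); model.fit(X, y); fit_t = time.time() - t0
    t0 = time.time(); model.predict(X); pred_t = time.time() - t0
    print(f"{name}: fit {fit_t:.3f}s, predict {pred_t:.3f}s")
```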

Types of ML Classification Algorithms:


Classification algorithms can be further divided into two main categories:

o Linear Models
   o Logistic Regression
   o Support Vector Machines
o Non-linear Models
   o K-Nearest Neighbours
   o Kernel SVM
   o Naïve Bayes
   o Decision Tree Classification
   o Random Forest Classification

Clustering
It is basically a type of unsupervised learning method. An unsupervised
learning method is one in which we draw inferences from datasets
consisting of input data without labelled responses. Generally, it is used as a
process to find meaningful structure, explanatory underlying processes,
generative features, and groupings inherent in a set of examples.

Clustering is the task of dividing the population or data points into a number
of groups such that data points in the same group are more similar to each
other than to data points in other groups. It is, in essence, a grouping of
objects on the basis of the similarity and dissimilarity between them.
For example, data points that lie close together in a scatter plot can be
grouped as a single cluster; if three distinct clumps of points are visible, we
can identify three clusters.

The clustering technique is commonly used for statistical data analysis.

Example: Let's understand the clustering technique with the real-world example of a
shopping mall. When we visit a shopping mall, we can observe that things with similar
usage are grouped together: t-shirts are grouped in one section and trousers in
another; similarly, in the produce section, apples, bananas, mangoes, etc. are grouped
separately so that we can easily find things. The clustering technique works in the
same way. Another example of clustering is grouping documents according to topic.

The clustering technique can be widely used in various tasks. Some most common uses
of this technique are:

o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.

Apart from these general usages, clustering is used by Amazon in its recommendation
system to provide recommendations based on a user's past searches for
products. Netflix also uses this technique to recommend movies and web series
to its users based on their watch history.

As an illustration of how a clustering algorithm works, imagine a mixed collection of
fruits: the algorithm divides the fruits into several groups with similar properties.

Types of Clustering Methods


The clustering methods are broadly divided into hard clustering (each data point
belongs to only one group) and soft clustering (a data point can belong to more than
one group).

Below are the main clustering methods used in Machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means Clustering algorithm.

In this type, the dataset is divided into a set of k groups, where k defines the
number of pre-defined groups. The cluster centres are created in such a way that each
data point is closer to its own cluster centroid than to any other cluster centroid.
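
A minimal K-Means sketch, assuming scikit-learn and synthetic blob data:

```python
# A minimal K-Means sketch: k is pre-defined, and each point is assigned
# to the nearest of the k centroids.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=1)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # one centroid per cluster
print(labels[:10])              # cluster assignment for each point
```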

Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, so
arbitrarily shaped clusters can be formed as long as the dense regions can be
connected. The algorithm does this by identifying different dense regions in the
dataset and connecting the areas of high density into clusters. The dense areas in
data space are separated from each other by sparser areas.

These algorithms can have difficulty clustering the data points if the dataset has
varying densities or high dimensionality.
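
A minimal density-based sketch using DBSCAN (described further below), assuming
scikit-learn; the half-moon dataset shows the arbitrary cluster shapes this method can
recover:

```python
# A minimal DBSCAN sketch: clusters grow from dense regions, and points
# in sparse areas are labelled -1 (noise).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-convex shapes that k-means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=1)

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(set(labels))  # e.g. {0, 1}, plus -1 for any noise points
```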

Distribution Model-Based Clustering


In the distribution model-based clustering method, the data is divided based on the
probability that a data point belongs to a particular distribution. The grouping is
done by assuming certain distributions, most commonly the Gaussian distribution.
An example of this type is the Expectation-Maximization clustering algorithm, which
uses Gaussian Mixture Models (GMM).
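
A minimal distribution-model sketch, assuming scikit-learn's GaussianMixture, which is
fit by Expectation-Maximization:

```python
# A minimal GMM sketch: EM fits the mixture, and predict_proba gives the
# soft probability of membership in each Gaussian component.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

gmm = GaussianMixture(n_components=3, random_state=1).fit(X)

print(gmm.predict(X[:5]))        # hard assignments
print(gmm.predict_proba(X[:5]))  # probability of each component per point
```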
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as
there is no requirement to pre-specify the number of clusters to be created. In this
technique, the dataset is divided into clusters to create a tree-like structure, also
called a dendrogram. Any number of clusters can then be selected by cutting the tree
at the appropriate level. The most common example of this method is
the Agglomerative Hierarchical algorithm.
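
A minimal hierarchical sketch, assuming SciPy for the linkage tree; cutting the
dendrogram with fcluster selects the clusters:

```python
# A minimal agglomerative clustering sketch: linkage builds the tree
# (dendrogram) bottom-up, and fcluster cuts it into a chosen number of
# clusters.
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

Z = linkage(X, method="ward")                    # successive bottom-up merges
labels = fcluster(Z, t=3, criterion="maxclust")  # cut tree into 3 clusters
print(labels)
```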
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more
than one group or cluster. Each data point has a set of membership coefficients that
indicate its degree of membership in each cluster. The Fuzzy C-means algorithm is an
example of this type of clustering; it is sometimes also known as the fuzzy k-means
algorithm.
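
Fuzzy C-means has no implementation in scikit-learn, so the sketch below is a minimal
NumPy rendering of the standard update equations; the fuzzifier m is an assumption set
to the common default of 2:

```python
# A minimal fuzzy C-means sketch: alternate between updating weighted
# cluster centres and updating soft memberships.
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=1):
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)        # memberships sum to 1 per point
    for _ in range(n_iter):
        um = u ** m                          # fuzzified memberships
        centers = um.T @ X / um.sum(axis=0)[:, None]  # weighted cluster means
        d = np.linalg.norm(X[:, None, :] - centers, axis=2) + 1e-10
        u = 1.0 / d ** (2.0 / (m - 1.0))     # closer centres -> higher weight
        u /= u.sum(axis=1, keepdims=True)
    return centers, u

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, u = fuzzy_c_means(X, c=2)
print(centers)  # two cluster centres, near (0, 0) and (5, 5)
print(u[:3])    # soft memberships: each row sums to 1
```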

Clustering Algorithms
Clustering algorithms can be divided based on the models explained
above. Many different clustering algorithms have been published, but only a few are
commonly used. The choice of algorithm depends on the kind of data we are
using: some algorithms need the number of clusters in the given dataset to be
specified, whereas others work by finding the distances between the observations of
the dataset.

1. K-Means algorithm: The k-means algorithm is one of the most popular
clustering algorithms. It classifies the dataset by dividing the samples into
different clusters of roughly equal variance. The number of clusters must be
specified in this algorithm. It is fast, requiring relatively few computations, with
linear complexity O(n).
2. Mean-shift algorithm: The mean-shift algorithm tries to find dense areas in a
smooth density of data points. It is an example of a centroid-based model
that works by updating candidate centroids to be the mean of the points
within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model similar to
the mean-shift, but with some remarkable advantages. In this algorithm, the
areas of high density are separated by the areas of low density. Because of this,
the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be
used as an alternative to the k-means algorithm, or for cases where k-means
can fail. In GMM, it is assumed that the data points are Gaussian
distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical
algorithm performs the bottom-up hierarchical clustering. In this, each data
point is treated as a single cluster at the outset and then successively merged.
The cluster hierarchy can be represented as a tree-structure.
6. Affinity Propagation: It differs from other clustering algorithms in that it does
not require the number of clusters to be specified. In this algorithm, each pair
of data points exchanges messages until convergence. It has O(N²T) time
complexity (for N points and T iterations), which is the main drawback of this
algorithm (see the sketch below).
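
A minimal Affinity Propagation sketch, assuming scikit-learn; note that no cluster
count is passed:

```python
# A minimal Affinity Propagation sketch: the number of clusters emerges
# from the message passing between point pairs.
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=1)

ap = AffinityPropagation(random_state=1).fit(X)
print(len(ap.cluster_centers_indices_))  # number of clusters found
print(ap.labels_[:10])
```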

Recommender Systems
Recommender systems are systems designed to recommend things to the user based
on many different factors. These systems predict the products that users are most
likely to purchase or be interested in. Companies like Netflix, Amazon, etc. use
recommender systems.

Benefits of Recommender systems

• Benefit users by helping them find items of interest.
• Help item providers deliver their items to the right users.
• Identify products that are most relevant to users.
• Provide personalized content.
• Help websites improve user engagement.

Methods for Building Recommender systems

1. Content-Based Recommendation: The goal of content-based recommendation is to
predict the scores of items the user has not yet rated. The basic idea behind content
filtering is that each item has some features x. For example, a movie may have a high
score for a feature x1 but a low score for a feature x2 (a sketch appears at the end
of this section).

2. Collaborative filtering: A disadvantage of content filtering is that it needs
side information for each item. Collaborative filtering is instead based on users'
behaviour; the history of the user plays an important role. It is of two types.

I. User-User collaborative filtering: In this approach, a user vector includes all the
items purchased by the user and the rating given to each particular product, and
recommendations are drawn from similar users.

II. Item-Item collaborative filtering: In this approach, rather than considering
similar users, similar items are considered (see the sketch at the end of this
section).
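
A minimal, hypothetical sketch of content-based recommendation (point 1): the item
features, ratings, and the use of a per-user linear model are all illustrative
assumptions, not a prescribed method:

```python
# A content-based sketch: each item has a feature vector x, and a linear
# model fit on the user's own ratings predicts scores for unrated items.
import numpy as np
from sklearn.linear_model import LinearRegression

# Item features; e.g. x1 and x2 could be two content attributes per movie.
item_features = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
user_ratings = np.array([5.0, 4.5, 1.0])  # user's ratings for first 3 items

model = LinearRegression().fit(item_features[:3], user_ratings)
print(model.predict(item_features[3:]))   # predicted score for unrated item
```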
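
And a minimal item-item collaborative-filtering sketch (point 2), assuming cosine
similarity over a toy user-item rating matrix; the similarity measure and the weighted
average are illustrative choices:

```python
# Item-item CF: item similarity is computed from the rating matrix alone
# (no side information), and an unrated item is scored from the user's
# ratings of similar items.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = items; 0 means "not rated".
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

sim = cosine_similarity(R.T)      # item-item similarity matrix
user, item = 0, 2                 # predict user 0's rating of item 2
rated = R[user] > 0               # items this user has rated
weights = sim[item, rated]
pred = weights @ R[user, rated] / (weights.sum() + 1e-10)
print(round(pred, 2))             # similarity-weighted average rating
```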
