404-BA-chapter IV

4. Machine Learning and Data Mining
Machine Learning
• Tom Mitchell: A computer program is said to learn from experience ‘E’ with respect to some class of tasks ‘T’
and performance measure ‘P’, if its performance at tasks in ‘T’, as measured by ‘P’, improves with experience
‘E’.
• Investopedia: Machine learning is the concept that a computer program can learn and adapt to new data
without human interference. Machine learning is a field of artificial intelligence (AI) that keeps a computer’s
built-in algorithms current regardless of changes in the worldwide economy.
• There are various approaches to implementing machine learning. We list the three main approaches:

• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
• What is Supervised Learning?
• Supervised learning algorithms create a mathematical model for a set of data that contains both the inputs
and the desired outputs. We train the machine using data that is “labelled,” meaning each example is tagged
with the required answer. It is like a learning process that takes place in the presence of a supervisor. A
supervised learning algorithm learns from labelled training data, helping you to predict outcomes for new
data.
• What is Unsupervised Learning?
• Unsupervised learning algorithms take a set of data that contains only inputs and identify structure in the
data, such as groupings or clusters of data points. Unsupervised learning is a machine learning technique in
which you do not supervise the model; instead, you allow the model to work on its own to discover
information. It mainly deals with unlabelled data. Unsupervised learning algorithms allow you to perform
more complex processing tasks compared to supervised learning.
• Applications of Supervised Learning
• Natural language processing, computer vision, and data analytics are filled with supervised learning tasks. The growth in the
field of AI is largely attributed to supervised learning. Machines keep getting better at learning what they are taught, and on
some narrow tasks they can match or exceed human performance. Tasks that come under this category include:
• Image Recognition
• Image Segmentation
• Sentiment Analysis
• Text Prediction on Search Engines
• Applications of Unsupervised Learning
• A common application of unsupervised learning is the recommendation system. The purpose of a recommender system
is to suggest relevant items to users: movies to watch, text to read, products to buy, or anything else that makes the
customer engage more with the platform. The output, i.e. the recommended product or content, is learned by the machine by
categorizing the user's profile and past behaviour.
• Supervised Learning algorithms
• K-Nearest Neighbours
• SVM
• Linear Regression
• Logistic Regression
• Linear Discriminant analysis
• Decision Trees
• Naive Bayes
• Neural Networks
• Unsupervised Learning algorithms
• Principal component analysis (PCA)
• K-Means
• Singular value decomposition
• Apriori algorithm for association rule learning problems
Reinforcement Learning

• Reinforcement Learning is a subfield of machine learning that teaches an agent how to choose an action from its action
space, within a particular environment, in order to maximize rewards over time.
• Reinforcement learning is a machine learning training method based on rewarding desired behaviors and/or punishing
undesired ones. In general, a reinforcement learning agent is able to perceive and interpret its environment, take actions
and learn through trial and error.
• Reinforcement Learning has four essential elements:
• Agent. The program you train, with the aim of doing a job you specify.
• Environment. The world, real or virtual, in which the agent performs actions.
• Action. A move made by the agent, which causes a status change in the environment.
• Rewards. The evaluation of an action, which can be positive or negative.
• Example
• Creating A Personalized Learning System
• Agent: The program that decides what to show next in an online learning catalog.
• Environment: The learning system.
• Action: Playing a new class video and an advertisement.
• Reward: Positive if the user chooses to click the class video presented; greater positive reward if the user chooses to click the
advertisement; negative if the user goes away.
• This program can make a personalized class system more valuable. The user can benefit from more effective learning and the system
can benefit through more effective advertising.
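To make the four elements concrete, here is a minimal sketch of the agent-environment loop for the personalized learning example. Everything in it (the two actions, the simulated user reactions, the reward values, the epsilon-greedy rule) is an invented illustration, not a real recommender.

import random

# Hypothetical environment: the agent picks what to show next and receives a reward.
ACTIONS = ["class_video", "advertisement"]

def environment_step(action):
    """Simulate the user's reaction and return a reward (illustrative values only)."""
    if action == "class_video":
        return 1 if random.random() < 0.6 else -1   # user clicks the video, or leaves
    else:
        return 2 if random.random() < 0.2 else -1   # ad clicks are rarer but worth more

# A very simple agent: keep a running value estimate per action and
# mostly pick the action with the highest estimate (epsilon-greedy).
values = {a: 0.0 for a in ACTIONS}
counts = {a: 0 for a in ACTIONS}

for step in range(1000):
    if random.random() < 0.1:                 # explore occasionally
        action = random.choice(ACTIONS)
    else:                                     # otherwise exploit the best estimate
        action = max(values, key=values.get)
    reward = environment_step(action)
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]  # incremental mean

print(values)   # learned estimate of the long-run reward for each action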
• Correlation — It is the phenomenon that helps us understand the strength of the relationship between two variables: it
indicates the degree to which variation in one variable is accompanied by variation in the other.
• 1 — Positive Correlation:
In this case, an increase in the value of one variable is consistently accompanied by an increase in the value of the other
variable. For instance, a person's weight tends to increase as they consume more and more calories.
• 2 — Negative Correlation:
In this case, an increase in the value of one variable is consistently accompanied by a decrease in the value of the other
variable. For instance, a student's GPA tends to decrease as the number of hours spent playing video games increases.
• 3 — No Correlation:
In this case, changes in the value of one variable are not associated with any consistent change in the value of the other
variable. For example, the happiness of a person does not depend on the amount of money he or she has.
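As a quick illustration of these three cases, the Pearson correlation coefficient can be computed with numpy; the small datasets below are invented solely for demonstration.

import numpy as np

# Invented data: calories vs weight (positive), gaming hours vs GPA (negative),
# money vs a "happiness" score with no consistent relationship.
calories = np.array([1800, 2000, 2200, 2500, 2800, 3000])
weight   = np.array([60, 63, 66, 71, 75, 78])

gaming_hours = np.array([0, 1, 2, 3, 4, 5])
gpa          = np.array([3.9, 3.7, 3.4, 3.1, 2.8, 2.5])

money     = np.array([10, 20, 30, 40, 50, 60])
happiness = np.array([7, 4, 8, 5, 7, 6])

print(np.corrcoef(calories, weight)[0, 1])      # close to +1: positive correlation
print(np.corrcoef(gaming_hours, gpa)[0, 1])     # close to -1: negative correlation
print(np.corrcoef(money, happiness)[0, 1])      # close to 0: no (linear) correlation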

• Causation or Causal relationship — It is the phenomenon where the occurrence of one event causes the occurrence of a
second event. It is also called a cause-and-effect relationship, because the first event is called the cause and the second
event is called the effect. For instance, the more time a student spends studying, the higher the grades they tend to earn.
Regression
• Regression comes from the term ‘regress’, and it means predicting one variable from a set of other variables. The
variable to be predicted is called the dependent variable, and the variables used to make the prediction are
called independent variables. Regression typically involves identifying relationships or correlations
between the dependent and independent variables; however, it cannot by itself establish any causal relationship
between them. An example of regression would be trying to predict the price of houses in Florida based on
independent variables such as the size of the house and the number of rooms.
What is Linear Regression?

• Linear regression attempts to model the relationship between two variables by fitting a linear equation to
observed data. One variable is considered to be an explanatory variable or an independent variable, and the
other is considered to be a dependent variable.
• Business Applications
• Linear regression is used for a wide array of business prediction problems:
• Stock prediction
• Predict future prices/costs
• Predict future revenue
• Comparing performance of new products
• Benefits of Linear Regression
• Ease
• Interpretability
• Scalability
• Deploys and performs well in online settings
• Machine Learning approaches to Linear Regression
• Simple and Multiple Linear Regression
• Polynomial Regression
• Ridge Regression and Lasso Regression (upgrades to Linear Regression)
• Decision Trees Regression
• Support Vector Regression (SVR)
Example Linear Regression

Suppose you want to sell your house. You know the number of bedrooms (X) in that house, and the goal is to estimate the
price of the house (Y).
Linear regression creates an equation into which you input your given numbers (X) and which outputs the target variable that you
want to find out (Y).
In this case, we would use a dataset containing historic records of house purchases in the form of (“number of bedrooms”,
“selling price”).
If we visualize this data as a scatter plot, it seems that there is a trend: the more bedrooms a house has, the higher its selling price
(which is not surprising, to be honest).

Now, let’s say that we trained a linear regression model to get an equation in the form:
Selling price = 77,143 * (Number of bedrooms) - 74,286

The equation acts as a prediction. If you input the number of bedrooms, you get the predicted value for the price at which the
house is sold.

For the specific example above:


Your selling price = 77,143 * 2 bedrooms - 74,286 = 80,000

In other words, you could sell your 2-bedroom house for approximately $80,000. But linear regression does more than just
that: we can also read off the fitted line what the price would be for houses with a different number of bedrooms, because
linear regression tries to find the straight line that best fits the data. Linear regression is not limited
to real-estate problems: it can also be applied to a variety of business use cases.
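Here is a minimal sketch of the bedrooms-versus-price example using scikit-learn (an assumed library choice). The training records are invented, so the fitted slope and intercept will only roughly resemble the equation quoted above.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical records: (number of bedrooms, selling price in dollars).
bedrooms = np.array([[1], [2], [2], [3], [3], [4]])
price    = np.array([10_000, 78_000, 82_000, 155_000, 160_000, 235_000])

model = LinearRegression()
model.fit(bedrooms, price)

print("slope (price per extra bedroom):", model.coef_[0])
print("intercept:", model.intercept_)
print("predicted price of a 2-bedroom house:", model.predict([[2]])[0])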
Mathematical Interpretation: Simple Linear Regression
• Linear regression is such a useful and established algorithm, that it is both a statistical model and a machine
learning model. Here, we will focus mainly on the machine learning side, but we will also draw some parallels
to statistics in order to paint a complete picture.
• Once trained, the model takes the form of a linear regression equation of the type written out after the terms below:
• Terms in this equation —
• y is the output variable. It is also called the target variable in machine learning, or the dependent variable in statistical
modeling. It represents the continuous value that we are trying to predict.
• x is the input variable. In machine learning, x is referred to as the feature, while in statistics, it is called the independent
variable. It represents the information given to us at any given time.
• w0 is the bias term or y-axis intercept.
• w1 is the regression coefficient or scale factor. In classical statistics, it is the equivalent of the slope on the best-fit straight
line that is produced after the linear regression model has been fitted.
• wi are called weights in general.
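Putting these terms together, the simple linear regression equation in its standard form is:

y = w0 + w1 * x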
Multiple Linear Regression

• Both simple and multiple linear regressions assume that there is a linear relationship between the input
variable(s) and the output target variable.
• The main difference is the number of independent variables that they take as inputs. Simple linear regression just
takes a single feature, while multiple linear regression takes multiple x values. The above formula can be
rewritten for a model with n-input variables as:
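Written out with n input variables x1, x2, ..., xn and weights w1, w2, ..., wn, this becomes:

y = w0 + w1 * x1 + w2 * x2 + ... + wn * xn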

• The simple linear regression model can be represented graphically as a best-fit line through the data points, while the
multiple linear regression model can be represented as a plane (with two input variables) or a hyperplane (in higher
dimensions).
• Despite their differences, both the simple and multiple regression models are linear models — they adopt the form of
a linear equation. This is called the linear assumption.
• Quite simply, it means that we assume that the relationship between the set of independent variables and the
dependent variable is linear.
K-Nearest Neighbor

K-nearest neighbors (KNN) is a type of supervised learning algorithm used for both regression and classification. KNN tries to predict the
correct class for the test data by calculating the distance between the test data and all the training points, and then selecting the K
points that are closest to the test data. The KNN algorithm calculates the probability of the test data belonging to each of the classes of the ‘K’ training
points, and the class with the highest probability is selected. In the case of regression, the predicted value is the mean of the ‘K’ selected training
points.
• Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a
cat or a dog. For this identification we can use the KNN algorithm, as it works on a similarity measure. Our
KNN model will find the features of the new image that are most similar to those of the cat and dog images, and based on the
most similar features it will put the image in either the cat or the dog category.
Why do we need a K-NN Algorithm?
• Suppose there are two categories, i.e., Category A and Category B, and we have a new data point x1. In which of these
categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With
the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below
diagram:
How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

• Step-1: Select the number K of the neighbors

• Step-2: Calculate the Euclidean distance of K number of neighbors

• Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

• Step-4: Among these K neighbors, count the number of data points in each category.

• Step-5: Assign the new data point to the category for which the number of neighbors is maximum.

• Step-6: Our model is ready.


• Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
• Firstly, we will choose the number of neighbors, so we will choose k = 5.
• Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the straight-line distance between
two points, which we have already studied in geometry: the square root of the sum of the squared differences between the two points' coordinates.
• By calculating the Euclidean distance we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:

• As we can see, the majority (3 of the 5) of the nearest neighbors are from category A, hence this new
data point must belong to category A.
How to choose a K value?
• The K value indicates the count of nearest neighbors to consider. We have to compute the distances between the test point
and the training points. Because KNN defers all of this distance computation to prediction time rather than learning an explicit
model up front, it is called a lazy learning algorithm, and this computation can be expensive.

• As you can verify from the above image, if we proceed with K=3, then we predict that test input belongs to
class B, and if we continue with K=7, then we predict that test input belongs to class A.
• That’s how you can imagine that the K value has a powerful effect on KNN performance.
Then how to select the optimal K value?
• There are no pre-defined statistical methods to find the most favorable value of K.
• Initialize a random K value and start computing.
• Choosing a small value of K leads to unstable decision boundaries.
• A larger K value is often better for classification, as it smooths the decision boundaries.
• Plot the error rate against K over a defined range, then choose the K value with the minimum error rate.
• You will get a feel for choosing the optimal K value by implementing the model.
Calculating distance:
• The first step is to calculate the distance between the new point and each training point. There are various methods for
calculating this distance, of which the most commonly known are Euclidean and Manhattan distance (for continuous variables) and
Hamming distance (for categorical variables).
• Euclidean Distance: Euclidean distance is calculated as the square root of the sum of the squared differences between a
new point (x) and an existing point (y).
• Manhattan Distance: This is the distance between real vectors, calculated as the sum of their absolute differences.

• Hamming Distance: It is used for categorical variables. If the value (x) and the value (y) are the same, the distance D is
equal to 0; otherwise D = 1.
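A minimal sketch of KNN classification using the Euclidean distance described above; the toy points, labels, and query point are all invented for illustration.

import math
from collections import Counter

def euclidean(p, q):
    """Square root of the sum of squared differences between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_points, train_labels, test_point, k=5):
    """Classify test_point by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: euclidean(pair[0], test_point))
    k_labels = [label for _, label in ranked[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Toy data: two categories, A and B.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (2, 2), k=3))   # -> "A" (its nearest neighbours are all A)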
K-means Clustering
Clustering: grouping data based on similarity patterns, typically measured by a distance.
What is K-means ?
• K-means clustering is a simple and elegant approach for partitioning a
data set into K distinct, non-overlapping clusters. To perform K-means
clustering, we must first specify the desired number of clusters K; then the
K-means algorithm will assign each observation to exactly one of the K
clusters

• The K-means algorithm clusters data by trying to separate samples into groups of equal
variance, minimizing a criterion known as the inertia, or within-cluster sum-of-squares. The K-
means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-
of-squares criterion.
How does it work?
• There are several steps to the algorithm. For instance, consider a small example dataset of unlabelled points.

• Here is how k-means works, step by step:


• Step 1. Determine the value “K”; the value “K” represents the number of clusters.
• In this case, we’ll select K = 3. That is to say, we want to identify 3 clusters. Is there any way to determine the value of K?
Yes, there is, but we’ll talk about it later.
• Step 2. Randomly select 3 distinct centroids (new data points that serve as the cluster initialization).
• For example, in attempt 1, “K” is equal to 3, so there are 3 centroids, which serve as the cluster initialization.
• Step 3. Measure the distance (Euclidean distance) between each point and each centroid.
• For example, measure the distance between the first point and each of the centroids.

• Step 4. Assign each point to the nearest cluster.


• For example, assign the first point to the cluster whose centroid is closest to it.
• Do the same for the other unlabelled points, until every point has been assigned.

• Step 5. Calculate the mean of each cluster and use it as the new centroid.

• Update each centroid to the mean of its cluster.


• Step 6. Repeat steps 3–5 with the new cluster centers.

• Repeat until a stopping condition is met:


• Convergence (no further changes in the assignments).
• Maximum number of iterations reached.
• Since the clustering did not change at all during the last iteration, we’re done.

• Is a single run enough? Of course not: the result depends on the random initialization, so the procedure is usually
repeated with different starting centroids and the best solution is kept.


• Remember, the K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares
criterion.
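A minimal sketch of these steps using scikit-learn's KMeans (an assumed library choice) on invented 2-D points; n_init controls how many random initializations are tried, and the run with the lowest inertia is kept.

import numpy as np
from sklearn.cluster import KMeans

# Invented 2-D points forming three rough groups.
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 9], [8, 9],
              [20, 2], [21, 3], [20, 3]])

# Step 1: choose K = 3. KMeans then initializes 3 centroids, assigns each point
# to its nearest centroid, recomputes each centroid as the mean of its cluster,
# and repeats until the assignments stop changing or max_iter is reached.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # final centroids (mean of each cluster)
print(kmeans.inertia_)          # within-cluster sum-of-squares that K-means minimizes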
K-Means Clustering flowchart
Decision Tree
• A Decision Tree is a simple representation for classifying examples. It is a supervised machine learning method in which
the data is repeatedly split according to a certain parameter.
• Decision Tree consists of :
• Nodes : Test for the value of a certain attribute.
• Edges/ Branch : Correspond to the outcome of a test and connect to the next node or leaf.
• Leaf nodes : Terminal nodes that predict the outcome (represent class labels or class distribution).

• To understand the concept of a Decision Tree, consider the above example. Let’s say you want to predict
whether a person is fit or unfit, given information such as their age, eating habits, physical activity, etc. The
decision nodes are questions like ‘What’s the age?’, ‘Does he exercise?’, ‘Does he eat a lot of pizza?’, and
the leaves represent outcomes such as ‘fit’ or ‘unfit’.
• There are two main types of Decision Trees:
• Classification Trees.
• Regression Trees.
• 1. Classification trees (Yes/No types) :
• What we’ve seen above is an example of classification tree, where the outcome was a variable like ‘fit’ or ‘unfit’. Here the
decision variable is Categorical/ discrete.
• Such a tree is built through a process known as binary recursive partitioning. This is an iterative process of splitting the
data into partitions, and then splitting it up further on each of the branches.
• 2. Regression trees (Continuous data types) :
• Decision trees where the target variable can take continuous values (typically real numbers) are
called regression trees. (e.g. the price of a house, or a patient’s length of stay in a hospital)
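A minimal sketch, assuming scikit-learn, of the two tree types: a classification tree on invented fit/unfit-style records and a regression tree on invented house prices.

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: features are [age, exercises (0/1), pizzas per week],
# labels are 'fit' or 'unfit' (all values invented for illustration).
X_cls = [[25, 1, 1], [40, 0, 6], [30, 1, 2], [55, 0, 5], [35, 1, 0], [50, 0, 7]]
y_cls = ["fit", "unfit", "fit", "unfit", "fit", "unfit"]
clf = DecisionTreeClassifier(max_depth=3).fit(X_cls, y_cls)
print(clf.predict([[28, 1, 1]]))      # categorical outcome, e.g. ['fit']

# Regression tree: feature is house size in square metres, target is a continuous price.
X_reg = [[50], [70], [90], [110], [130], [150]]
y_reg = [100_000, 140_000, 180_000, 220_000, 260_000, 300_000]
reg = DecisionTreeRegressor(max_depth=3).fit(X_reg, y_reg)
print(reg.predict([[100]]))           # predicted price for a 100 m² house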
• Advantages of Classification with Decision Trees:
• Inexpensive to construct.
• Extremely fast at classifying unknown records.
• Easy to interpret for small-sized trees
• Accuracy comparable to other classification techniques for many simple data sets.
• Excludes unimportant features.
• Disadvantages of Classification with Decision Trees:
• Easy to overfit.
• Decision boundaries are restricted to being parallel to the attribute axes.
• Decision tree models are often biased toward splits on features having a large number of levels.
• Small changes in the training data can result in large changes to decision logic.
• Large trees can be difficult to interpret and the decisions they make may seem counter intuitive.
Random Forest Classifier
• It is an ensemble, tree-based learning algorithm. A Random Forest Classifier is a set of decision trees built from randomly
selected subsets of the training set. It aggregates the votes from the different decision trees to decide the final class of the test
object.
• Random forest is a supervised machine learning algorithm that is used widely in classification and regression problems.
It builds decision trees on different samples and takes their majority vote for classification and their average in the case of
regression.
• One of the most important features of the Random Forest algorithm is that it can handle data sets
containing continuous variables, as in the case of regression, and categorical variables, as in the case of classification. It
tends to give particularly good results on classification problems.
• Ensemble Algorithm
• Ensemble algorithms are those that combine more than one algorithm of the same or different kinds for classifying objects. For
example, running predictions with Naive Bayes, SVM and a Decision Tree and then taking a vote to decide the final class of the test
object.
Ensemble Technique
• Before understanding the working of the random forest we must look into the ensemble
technique. Ensemble simply means combining multiple models. Thus a collection of models is used to make predictions
rather than an individual model.
• Ensemble uses two types of methods:
• 1. Bagging – It creates different training subsets from the sample training data with replacement, and the final output is based
on majority voting. For example, Random Forest.
• 2. Boosting – It combines weak learners into strong learners by creating sequential models such that the final model has
the highest accuracy. For example, AdaBoost, XGBoost.
• Bagging
• Bagging, also known as Bootstrap Aggregation, is the ensemble technique used by random forest. Bagging chooses
random samples from the data set: each model is generated from a sample (a bootstrap sample) drawn from the
original data with replacement, which is known as row sampling. This step of row sampling with replacement is called bootstrapping.
Each model is then trained independently and generates its own result. The final output is based on majority voting after
combining the results of all models. This step, which involves combining all the results and generating the output based on
majority voting, is known as aggregation.

• Now let’s look at an example by breaking it down with the help of the following figure. Here bootstrap samples are taken
from the actual data (Bootstrap sample 01, Bootstrap sample 02, and Bootstrap sample 03) with replacement, which means
a sample may well contain repeated rows rather than only unique data. The models (Model 01, Model 02, and Model 03)
obtained from these bootstrap samples are trained independently, and each model generates a result as shown. The Happy emoji
has the majority over the Sad emoji, so based on majority voting the final output is the Happy emoji.
• Steps involved in random forest algorithm:
• Step 1: In a random forest, n random records are taken from a data set having k records.
• Step 2: Individual decision trees are constructed for each sample.
• Step 3: Each decision tree will generate an output.
• Step 4: Final output is considered based on Majority Voting or Averaging for Classification and regression respectively.
• For example, consider a fruit basket as the data, as shown in the figure below. Now n samples are taken from
the fruit basket and an individual decision tree is constructed for each sample. Each decision tree generates an output as
shown in the figure. The final output is decided by majority voting: in the figure, the majority of the decision trees
output apple rather than banana, so the final output is taken as apple.
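A minimal sketch, assuming scikit-learn, of the idea above: a RandomForestClassifier trains each tree on a bootstrap sample of the rows and aggregates by majority vote. The fruit-like records are invented.

from sklearn.ensemble import RandomForestClassifier

# Invented fruit features: [is_yellow, is_big, is_sweet] encoded as 0/1.
X = [[1, 0, 1], [1, 1, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y = ["banana", "banana", "apple", "apple", "other", "other"]

# n_estimators decision trees are each trained on a bootstrap sample of the rows
# (bagging); the final class is decided by majority voting across the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict([[1, 1, 1]]))    # majority vote of the individual trees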
• Features and Advantages of Random Forest :
• 1. Diversity- Not all attributes/variables/features are considered while making an individual tree, each tree is different.
• 2. Immune to the curse of dimensionality- Since each tree does not consider all the features, the feature space is
reduced.
• 3. Parallelization-Each tree is created independently out of different data and attributes. This means that we can make full
use of the CPU to build random forests.
• 4. Train-Test split – In a random forest we don’t have to segregate the data for training and testing, as each tree is trained on a
bootstrap sample and therefore roughly a third of the data (the out-of-bag samples) is never seen by that tree.
• 5. Stability- Stability arises because the result is based on majority voting/ averaging.

• Disadvantages of Random Forest :

• Random forests have been observed to overfit for some datasets with noisy classification/regression tasks.
• For data including categorical variables with different numbers of levels, random forests are biased in favor of
attributes with more levels. Therefore, the variable importance scores from a random forest are not reliable for this type of
data.
Difference Between Decision Tree & Random Forest

1. Decision trees: A decision tree normally suffers from the problem of overfitting if it is allowed to grow without any control.
   Random Forest: A random forest is created from subsets of the data and the final output is based on averaging or majority ranking, hence the problem of overfitting is taken care of.

2. Decision trees: A single decision tree is faster in computation.
   Random Forest: A random forest is comparatively slower.

3. Decision trees: When a data set with features is taken as input by a decision tree, it formulates a set of rules to make predictions.
   Random Forest: A random forest randomly selects observations, builds decision trees, and takes the average result. It doesn’t use any set of formulas.
Underfitting VS Overfitting
• Underfitting
• A statistical model is said to underfit when it cannot capture the underlying trend of the data. It is like sending a
third-grade kid, who only knows the basic arithmetic operations, to a differential calculus class. If the data contains
more structure than the model can capture, the model is going to underfit for sure.
• A model is said to be underfit if it is unable to learn the patterns in the data properly. An underfit model doesn’t fully learn
each and every example in the dataset. In such cases, we see a low score on both the training set and test/validation set.
• It usually happens if we have too little data to train the model but quite a high number of features, or when we try to fit a
linear model to non-linear data. In such cases the rules of the machine learning model are too simple to capture the
structure of the data, and therefore the model will probably make a lot of wrong predictions.
• Specifically, underfitting occurs if the model or algorithm shows low variance but high bias. Underfitting is often a result of
an excessively simple model.
• Overfitting
• A model is said to be overfit if it is overtrained on the data to the point that it even learns the noise in it. An overfit model
learns each and every training example so perfectly that it misclassifies unseen/new examples. For a model that is overfit, we have
a perfect or close-to-perfect training set score but a poor test/validation score.
• Overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data. Intuitively,
overfitting occurs when the model or the algorithm fits the data too well. Specifically, overfitting occurs if the model or
algorithm shows low bias but high variance. Overfitting is often a result of an excessively complicated model applied to a
not so complicated dataset.
• Reasons behind underfitting:
• Using a simple model for a complex problem which doesn’t learn all the patterns in the data. Example: Using
a logistic regression for image classification
• The underlying data has no inherent pattern. For example, trying to predict a student’s marks from his father’s
weight.
• Reasons behind overfitting:
• Using a complex model for a simple problem which picks up the noise from the data. Example: Fitting a
neural network to the Iris dataset.
• Small datasets, as the training set may not be a right representation of the universe.
How to avoid them-Underfitting VS Overfitting

• Underfitting is comparatively simple to overcome: it can usually be addressed by training on more data, adding more
informative features, or using a more complex model.
• But when it comes to overfitting, we can either try to build a simpler model with more bias, or use one of several
common methods:
• Cross-validation
• Cross-validation is a powerful preventative measure against overfitting.
• The idea is clever: Use your initial training data to generate multiple mini train-test splits. Use these splits to tune your
model.
• In standard k-fold cross-validation, we partition the data into k subsets, called folds. Then, we iteratively train the algorithm
on k-1 folds while using the remaining fold as the test set (called the “holdout fold”).
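A minimal sketch of k-fold cross-validation with scikit-learn; the logistic regression model and the built-in Iris dataset are just illustrative choices.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: the data is split into 5 folds; the model is trained
# on 4 folds and scored on the held-out fold, rotating through all folds.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)

print(scores)          # accuracy on each holdout fold
print(scores.mean())   # average score: a less optimistic estimate than training accuracy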
• Train with more data
• It won’t work every time, but training with more data can help algorithms detect the signal better. Of course, that’s not always the case. If
we just add more noisy data, this technique won’t help. That’s why you should always ensure your data is clean and relevant.
• Remove features
• You can manually improve a model's generalizability by removing irrelevant input features.
• Early stopping
• When you’re training a learning algorithm iteratively, you can measure how well each iteration of the model performs.
• Up until a certain number of iterations, new iterations improve the model. After that point, however, the model’s ability to generalize can
weaken as it begins to overfit the training data.
• Early stopping refers to stopping the training process before the learner passes that point.

• An epoch refers to one full pass of the model over the training data. This technique is mostly
used in deep learning, while other techniques (e.g. regularization) are preferred for classical machine learning.
Deep Learning
• Deep learning is a sub-field of machine learning dealing with algorithms inspired by the structure and
function of the brain, called artificial neural networks. In other words, it mirrors the functioning of our brains.
Deep learning algorithms are structured similarly to the nervous system, where each neuron is connected to
the others and passes information along.
• One of the differences between machine learning and deep learning models is in the area of feature extraction.
Feature extraction is done by humans in machine learning, whereas a deep learning model figures out the features by itself.
• Deep learning models work in layers, and a typical model has at least three layers. Each layer accepts
information from the previous layer and passes it on to the next one.

• Deep learning models tend to keep improving as the amount of data grows, whereas older machine learning models stop
improving after a saturation point.
• Deep learning algorithms are constructed with three layers,
• The first layer is called the Input Layer
• The last layer is called the Output Layer
• All layers in between are called Hidden Layers.
• Each hidden layer is composed of neurons. The neurons are connected to each other. A neuron processes
the input signal it receives from the layer above it and then propagates it onward. The strength of the signal given to a neuron
in the next layer depends on the weight, bias and activation function.
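A minimal numpy sketch of the input, hidden, and output layers described above, showing how weights, biases, and an activation function determine the signal passed between layers; the layer sizes and random values are arbitrary.

import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)            # a common activation function

# One input layer (4 features), one hidden layer (5 neurons), one output neuron.
x  = rng.normal(size=4)                                  # input signal
W1 = rng.normal(size=(5, 4)); b1 = rng.normal(size=5)    # hidden layer weights and biases
W2 = rng.normal(size=(1, 5)); b2 = rng.normal(size=1)    # output layer weights and bias

hidden = relu(W1 @ x + b1)    # each hidden neuron weighs its inputs, adds a bias,
                              # and applies the activation function
output = W2 @ hidden + b2     # the output layer combines the hidden signals
print(output)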
Convolutional Neural Networks (CNN)
• A CNN (ConvNet) is a type of feed-forward artificial neural network with a unique architecture designed to extract increasingly complex
features of the data at each layer to determine the output. CNNs are well suited for perceptual tasks.
• CNN models are being used across different applications and domains and especially prevalent in image and video processing projects.
• CNN consists of four layers,
• Convolution layer: the first layer, which extracts features from an input image
• ReLU: applies an element-wise activation function such as max(0, x), thresholding at zero
• Pooling: down-samples the previous layer's feature maps
• Fully connected layer: flattens the feature maps into a vector and feeds it into a fully connected layer
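A minimal sketch of these four layers using Keras (an assumed framework choice), for hypothetical 28x28 grayscale inputs and 10 output classes.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                  # e.g. 28x28 grayscale images
    layers.Conv2D(32, (3, 3), activation="relu"),     # convolution + ReLU: extract features
    layers.MaxPooling2D((2, 2)),                      # pooling: down-sample the feature maps
    layers.Flatten(),                                 # flatten feature maps into a vector
    layers.Dense(10, activation="softmax"),           # fully connected layer: class scores
])
model.summary()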
Recurrent Neural Network
• RNNs are a type of neural network where the output from the previous step is fed as input to the current step.
• The main and most important feature of an RNN is the hidden state, which remembers some information about a
sequence.
• RNNs are commonly used in speech recognition and natural language processing.
Naive Bayes Classifier

• This algorithm is called “Naive” because it makes the naive assumption that each feature is independent of the
other features, which is rarely true in real life.
• As for the “Bayes” part, it refers to the statistician and philosopher, Thomas Bayes and the theorem named
after him, Bayes’ theorem, which is the base for Naive Bayes Algorithm.
• What is Naive Bayes Algorithm?
• Summarizing the points above, the Naive Bayes algorithm can be defined as a supervised classification algorithm
based on Bayes’ theorem, with an assumption of independence among features.
• Bayes Theorem helps us to find the probability of a hypothesis given our prior knowledge.
• As per Wikipedia: in probability theory and statistics, Bayes’ theorem (alternatively Bayes’ law or Bayes’ rule, also written
as Bayes’s theorem) describes the probability of an event based on prior knowledge of conditions that might be related to
the event.
• Let’s look at the equation for Bayes’ theorem, written out after the definitions below.
• Where,
• P(A|B) is the probability of hypothesis A given the data B. This is called the posterior probability.
• P(B|A) is the probability of data B given that the hypothesis A was true.
• P(A) is the probability of hypothesis A being true (regardless of the data). This is called the prior probability of A.
• P(B) is the probability of the data (regardless of the hypothesis).
• If you are wondering what P(A|B) or P(B|A) is: these are conditional probabilities, with the formula:
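In standard notation, the conditional probabilities are

P(A|B) = P(A ∩ B) / P(B)   and   P(B|A) = P(A ∩ B) / P(A),

and Bayes’ theorem itself reads

P(A|B) = P(B|A) * P(A) / P(B).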

• In short, Bayes’ theorem relates the posterior probability P(A|B) to the prior P(A) through the likelihood P(B|A).


How does Naive Bayes Algorithm work?
• Let us take an example to understand how the Naive Bayes algorithm works.
• Suppose we have a training dataset of 1025 fruits. The features in the dataset are: Yellow_color, Big_Size,
Sweet_Taste. There are three different classes: apple, banana & others.
• Step 1: Create a frequency table for all features against all classes
• What can we conclude from the above table?
• Out of 1025 fruits, 400 are apples, 525 are bananas, and 100 are others.
• 175 of the total 400 apples are Yellow and the rest are not and so on.
• 400 fruits are Yellow, 425 are big in size and 200 are sweet from a total of 600 fruits.
• Step 2: Draw the likelihood table for the features against the classes.

• In our likelihood table Total_Probability of banana is maximum(0.1544) when the fruit is of Yellow_Color,Big in size and Sweet
in taste.Therefore as per Naive Bayes algorithm a fruit which is Yellow in color,big in size and sweet in taste is Banana.
• In a nutshell, we say that a new element will belong to the class which will have the maximum conditional probability
described above.
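A minimal sketch of this idea, assuming scikit-learn's BernoulliNB on 0/1 fruit features; the records below are invented and are not the counts from the table above. The classifier returns the class with the maximum posterior probability.

from sklearn.naive_bayes import BernoulliNB

# Invented training data: [Yellow_color, Big_Size, Sweet_Taste] as 0/1 features.
X = [[1, 0, 1], [1, 1, 1], [1, 1, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
y = ["banana", "banana", "banana", "apple", "apple", "other", "other"]

clf = BernoulliNB()
clf.fit(X, y)

# For a yellow, big, sweet fruit the classifier compares the posterior probability
# of each class and returns the one with the maximum value.
print(clf.predict([[1, 1, 1]]))
print(dict(zip(clf.classes_, clf.predict_proba([[1, 1, 1]])[0])))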
Clustering

• As the name suggests, it involves dividing the data points into groups such that each group consists of similar data
points. In theory, data points that are in the same group should have similar properties, while data points in
different groups should have highly dissimilar properties. Clustering is an unsupervised learning problem: it
deals with finding structure in a collection of unlabelled data.
• WHY CLUSTERING?
• The purpose of clustering is to make sense of, and extract value from, large sets of structured and unstructured data. It
helps you to glance through the data to pull out patterns and structure before going deeper into nuts-and-bolts
analysis. For example, clustering can be used in the field of medical science and can also be used for customer classification
in marketing research.
Clustering Methods
• Hierarchical clustering: It is a tree-based clustering method where the observations are organized into a tree-like structure
using distance as a measure.
• Centroid based clustering: In this method, the observations are partitioned into a predefined number of clusters (k)
such that the within-cluster variance is minimized. K-means clustering is the most commonly used centroid based clustering
algorithm.
• Density-based clustering: In this method, the observations are grouped together based on density of points that are
closely packed. DBSCAN and OPTICS are two popular density-based algorithms
• Distribution-based clustering: In this method, the observations are grouped based on the distribution models and
observations in the same cluster are most likely coming from the same distribution. Gaussian mixture algorithm is an
example of Distribution-based clustering
• The centroid-based, density-based, and distribution-based methods described above are also categorized as Non-Hierarchical Clustering.
• Hierarchical Clustering bridges a gap left by k-means clustering: it takes away the problem of having to pre-define the number of
clusters at the beginning of the model. Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters.
• There are mainly two types of Hierarchical clustering :
• Agglomerative Clustering begins with each data point as a separate cluster; at each iteration we merge the closest pair of clusters
and repeat this step until only a single cluster is left. Basically, it uses a bottom-up approach. In hierarchical clustering we have a concept
called the proximity matrix, which stores the distance between every pair of points. There are multiple distance metrics that can be used for
deciding the closeness of two clusters.
• Euclidean distance: distance((x, y), (a, b)) = √((x − a)² + (y − b)²)
• Manhattan distance: distance between (x1, y1) and (x2, y2) = |x1 − x2| + |y1 − y2|
• 2. Divisive Clustering, or the top-down approach, works in the opposite way. Instead of starting with n clusters (in the case of n
observations), it starts with a single cluster and assigns all the points to that cluster. Therefore, it doesn’t matter if we have 10 or 1000
data points; all of them belong to the same cluster at the beginning. At each iteration, we then split off the farthest point in the
cluster and repeat the process until each cluster contains only a single data point.
• The dendrogram is another concept in hierarchical clustering: it is a tree-like diagram that records the sequence of merges or
splits. Whenever two clusters or data points are merged, we join them in the dendrogram, and the height of the join
is the distance between them. We can then set a threshold distance and draw a horizontal line; the number of
clusters is the number of vertical lines intersected by the threshold line.

(Figure: a dendrogram illustrating how the chosen threshold determines the number of clusters.)
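A minimal sketch of agglomerative clustering and the dendrogram, assuming scipy and matplotlib are available; the 2-D points are invented.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

# Invented 2-D points forming two loose groups.
X = np.array([[1, 1], [1.5, 1.2], [1.2, 0.8],
              [8, 8], [8.5, 8.2], [7.8, 8.4]])

# Agglomerative clustering: start with each point as its own cluster and
# repeatedly merge the closest pair (here using Ward linkage on Euclidean distance).
Z = linkage(X, method="ward")

# Cut the tree with a distance threshold to obtain flat cluster labels.
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)

# The dendrogram records the sequence of merges; the height of each join
# is the distance between the merged clusters.
dendrogram(Z)
plt.show()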
• 3. Density-Based Spatial Clustering of Applications with Noise, or DBSCAN, is not only able to cluster the data points
correctly but also detects noise in the dataset. It groups densely packed data points by drawing a circle of a given radius around each point,
and its most important feature is that it is robust to outliers. The DBSCAN clustering algorithm is used to find
associations and structures in data that are hard to find manually.
• The first question that comes to mind is: why do we need DBSCAN clustering? Answer: K-means and
hierarchical clustering both struggle to create clusters of arbitrary shapes and are not able to form clusters based on varying
densities.
• DBSCAN clustering requires only two parameters :
• eps: it is the radius of the circle to be created around each data point to check the density.
• minPoints: it is the minimum number of data points required inside the circle which is created around every data point.
• DBSCAN clustering consists of creating a circle of radius epsilon around every data point and classifying each point as a Core
point, Border point or Noise. A data point is a Core point if the circle around it contains at least ‘minPoints’ points. If the
number of points inside the circle is less than ‘minPoints’ but the point lies within the circle of a Core point, it is classified as a
Border point. If neither condition holds, the point is treated as Noise.
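A minimal sketch, assuming scikit-learn, showing the two parameters: eps is the circle radius and min_samples is scikit-learn's name for minPoints. Points labelled -1 are noise; the coordinates are invented.

import numpy as np
from sklearn.cluster import DBSCAN

# Invented points: two dense groups plus one isolated outlier.
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3], [1.1, 0.8],
              [8.0, 8.0], [8.2, 8.1], [7.9, 8.3], [8.1, 7.8],
              [25.0, 25.0]])                      # far from everything: noise

# eps is the radius of the circle around each point; min_samples is the minimum
# number of points inside that circle for the point to count as a core point.
db = DBSCAN(eps=1.0, min_samples=3)
labels = db.fit_predict(X)

print(labels)   # e.g. [0 0 0 0 1 1 1 1 -1]; -1 marks the noise point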
• Advantages of Hierarchical Clustering:
• We do not have to specify the number of clusters as input to the algorithm
• Easy to interpret the results in the dendrogram
• Limitations of Hierarchical Clustering:
• It is slower than k-means clustering
• It is affected by noise and outliers in the data
• It cannot be used for large datasets and is not scalable
• It is difficult to handle different sized clusters
