Big-Data Unit-3
Machine Learning
Unit-3
Introduction
• It is a subfield of artificial intelligence.
• The goal of machine learning is to understand the structure of data and fit that data into models that can be understood and utilized by people.
• In traditional computing, algorithms are sets of instructions used by computers to calculate or solve the simple tasks assigned to them.
• Machine learning is used by many industries for automating tasks and performing complex data analysis.
• It focuses mainly on designing systems that can learn from data and make predictions based on a set of metrics.
Definition
• Machine learning is the science of getting computers to learn and act like humans do, and to improve their learning over time in an autonomous fashion, by feeding them data and information in the form of observations and real-world interactions.
• Artificial Intelligence: a program that can sense, reason, act and adapt.
• Machine Learning: algorithms whose performance improves as they are exposed to more data over time.
• Deep Learning: a subset of machine learning in which multi-layered neural networks learn from large amounts of data.
• Machine Learning: It is a branch of artificial intelligence which aims to create intelligent systems that perform human-like jobs by learning from large amounts of relevant data.
• Deep Learning: It is a subset of machine learning in artificial intelligence whose networks are capable of learning, even unsupervised, from data that is unstructured. It is also known as deep neural learning or deep neural networks.
• Artificial Intelligence: It refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions.
How Machine Learning Differs from Traditional Programming
• Traditional Computing: Algorithms are sets of explicitly programmed instructions used by computers to calculate or solve a problem.
• Machine learning algorithms instead allow computers to train on data inputs and use statistical analysis in order to output values that fall within a specific range.
• Traditional Programming: Data and a program are run on the computer to produce the output.
• Machine Learning: Data and the desired output are run on the computer to create a program. This program can then be used in traditional programming (see the sketch below).
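As an illustration of this contrast, the following sketch (an assumption added for illustration, not part of the original notes) hard-codes a temperature-conversion rule and then lets scikit-learn's LinearRegression recover the same rule from example input/output pairs.

# A minimal sketch contrasting traditional programming with machine learning.
# Assumes NumPy and scikit-learn are installed; the conversion task is only illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

# Traditional programming: the rule (program) is written by hand.
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

# Machine learning: the rule is learned from data (inputs and known outputs).
celsius = np.array([[0], [10], [20], [30], [40]])     # input data
fahrenheit = np.array([32, 50, 68, 86, 104])          # desired output
model = LinearRegression().fit(celsius, fahrenheit)   # "create a program" from data

print(celsius_to_fahrenheit(25))   # 77.0, from the hand-written rule
print(model.predict([[25]]))       # approximately 77, from the learned rule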
Goals of Machine learning
• The primary goal of machine learning is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly.
• The goal of machine learning generally is to understand the structure of data and fit that data into models that can be understood and utilized by people.
• The goal of machine learning is to facilitate computers in building models from sample data.
• The goal of machine learning is to develop general-purpose algorithms of practical value.
Application of Machine Learning
• Image recognition
• Speech recognition
• Online fraud detection
• Stock Market trading
• Automatic Language translation.
Machine Learning life cycle
• Machine learning life cycle is a cyclic process to build an efficient
machine learning project.
• The main purpose of the life cycle is to find a solution to the problem or project.
• Life cycle steps:
• Gathering Data
• Data preparation
• Data wrangling
• Analyse data
• Train the model
• Test the model
• Deployment
Gathering Data
• This step is to identify the data-related problems and obtain all the relevant data.
• Identify the different data sources, as data can be collected from various sources such as files, databases, the internet or mobile devices.
• The quality and quantity of the collected data will determine the efficiency of the output.
• The more data there is, the more accurate the prediction will be.
Data Preparation
• It is the step where we put our data into a suitable place and prepare it for use in machine learning training.
• In this step, we put all the data together and then randomize the ordering of the data.
• This step can be further divided into two processes:
• Data Exploration: understanding the nature of the data that we have to work with. We need to understand the characteristics, format and quality of the data. In this step we find correlations, general trends and outliers (a minimal sketch follows after this list).
• Data pre-processing: the next step is pre-processing of the data for its analysis.
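A minimal sketch of the exploration step, assuming pandas is available; the file name sales.csv and its columns are hypothetical and used only for illustration.

# A minimal data-exploration sketch; 'sales.csv' and its columns are hypothetical.
import pandas as pd

df = pd.read_csv("sales.csv")

print(df.shape)                      # how much data we have
print(df.dtypes)                     # format / characteristics of each column
print(df.describe())                 # summary statistics, useful for spotting outliers
print(df.corr(numeric_only=True))    # correlations between numeric columns

# Randomize the ordering of the data before training.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)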
Data Wrangling
• It is the process of cleaning and converting raw data into a usable format.
• It is the process of cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step.
• Collected data may have various issues such as missing values, duplicate data, invalid data and noise, so we use various filtering techniques to clean the data (see the sketch below).
• It is mandatory to remove these issues because they negatively affect the outcome.
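A minimal wrangling sketch, again assuming pandas and the hypothetical sales.csv file; the column names "target" and "price", and the choice of median imputation, are assumptions for illustration.

# A minimal data-wrangling sketch; the file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("sales.csv")

df = df.drop_duplicates()                                  # duplicate data
df = df.dropna(subset=["target"])                          # rows where the label is missing
df["price"] = df["price"].fillna(df["price"].median())     # impute missing feature values
df = df[df["price"] >= 0]                                  # drop invalid (negative) prices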
Data Analysis
• This step involves:
• Selection of analytical techniques
• Building models
• Review results.
• It starts with determining the type of problem, where we select a machine learning technique such as classification, regression, cluster analysis or association; we then build the model using the prepared data and evaluate it.
Train Model
• We train our model to improve its performance and obtain a better outcome for the problem.
• We use datasets to train the model using machine learning algorithms.
• Training a model is required so that it can understand the various patterns, rules and features.
Test Model
• We check the accuracy of our model by providing a test dataset to it.
• Testing the model determines the percentage accuracy of the model as per the requirements of the project or problem (see the sketch below).
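A minimal sketch covering the train and test steps together, assuming scikit-learn and its bundled Iris dataset; the choice of logistic regression and the 80/20 split are assumptions for illustration.

# A minimal train/test sketch using scikit-learn's bundled Iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out part of the data so the model is tested on examples it has not seen.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)   # train the model on the training split
model.fit(X_train, y_train)

y_pred = model.predict(X_test)              # test the model on the held-out split
print("Accuracy:", accuracy_score(y_test, y_pred))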
Deployment
• The last step of the machine learning life cycle is deployment, where we deploy the model in a real-world system.
Advantages of Machine Learning
• Identifies trends & patterns easily.
• No human interference is required- automation
• Continuous improvement
• Handles multidimensional and large amounts of multi-variety data.
Disadvantages of Machine Learning
• Data acquisition
• Time & resources
• Interpretation of results
• High error-susceptibility
Types of Machine learning
• Supervised Machine Learning
• Unsupervised Machine Learning
Supervised Machine Learning
• The type of learning algorithm where the input and the desired output are provided is known as a supervised learning algorithm.
• It uses labeled data to train machines so that they learn and establish relationships between the given inputs and outputs.
• The objective of a supervised learning model is to predict the correct label for newly presented input data.
• Y = f(X), where Y is the predicted output, determined by a mapping function f that assigns a class to an input value X.
• It is a fast learning mechanism with high accuracy.
Labeled Dataset
• The dataset where the output is known for a given input is called a labeled dataset.
• E.g. an image of a fruit together with the fruit's name. When a new image of a fruit is shown, the model compares it with the training set to predict the answer.
How does Supervised Machine Learning work?
• In this type the output is already known; there is a mapping of inputs to desired outputs. Hence, to create a model, the machine is fed with lots of training input data.
• 1. The training data helps to achieve accuracy for the created model. The generated model is then ready to be fed with new input data and predict the outcomes.
• 2. During training, the algorithm searches for patterns in the data that match the desired output. The training process continues until the model achieves a desired level of accuracy on the training data.
• 3. After training, a supervised learning algorithm takes in new, unseen inputs and determines which label the new inputs should be classified as, based on the prior training data.
Supervised Learning is Classified into
• Classification
• Regression
Classification
• It means to group the output into a class.
• If the output is discrete, Boolean or categorical, then it is a classification problem.
• Classification problems require the algorithm to predict a discrete value, identifying the input data as belonging to a specific category or group.
• This technique can be used to classify products by department, category, subcategory, etc.
• Email Spam Detection
• Problem: You want to build a model that can automatically classify incoming emails as either "Spam" or "Not Spam" (also known as "Ham"). A minimal sketch follows below.
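A minimal sketch of such a spam classifier, assuming scikit-learn; the tiny hand-written example emails and the choice of a Naïve Bayes model are assumptions for illustration.

# A minimal spam/ham classification sketch; the example emails are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now", "limited offer, claim your reward",
    "meeting rescheduled to monday", "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# Turn raw text into word counts, then fit a probabilistic classifier on them.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

print(model.predict(["claim your free reward now"]))   # expected: ['spam']
print(model.predict(["report for monday meeting"]))    # expected: ['ham']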
Regression
• A regression problem has a real number (a number with a decimal point) as its output.
• It is mostly used for finding the relationship between variables and for forecasting (see the sketch below).
• Regression is a fundamental concept in supervised learning used to
predict a continuous outcome or target variable based on one or
more input features. Unlike classification, which predicts discrete class
labels, regression models predict numerical values.
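A minimal regression sketch, assuming scikit-learn and NumPy; the house-size and price values are invented for illustration.

# A minimal regression sketch; the size/price data points are invented.
import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([[50], [80], [110], [140]])     # input feature (square metres)
prices = np.array([150.0, 240.0, 330.0, 420.0])  # continuous target (thousands)

model = LinearRegression().fit(sizes, prices)
print(model.predict([[100]]))   # a numerical value, not a class label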
K-Nearest Neighbor (KNN) Algorithm for Machine Learning
• K-Nearest Neighbor is one of the simplest Machine Learning algorithms based on Supervised Learning
technique.
• The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category by using K-NN.
• K-NN can be used for Regression as well as for Classification, but it is mostly used for Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying (core) data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at the time of classification.
• The KNN algorithm, at the training phase, just stores the dataset; when it gets new data, it classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are similar to the cat and dog images, and based on the most similar features it will put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1; we need to decide which of these two categories this data point belongs to. To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
How does K-NN work?
• The K-NN working can be explained on the basis of the below algorithm:
• Step-1: Select the number K of the neighbors
• Step-2: Calculate the Euclidean distance of K number of neighbors
• Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
• Step-4: Among these k neighbors, count the number of data points in each category.
• Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
• Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category.
o Firstly, we will choose the number of neighbors, so we will choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2) it is calculated as d = sqrt((x2 - x1)^2 + (y2 - y1)^2).
o By calculating the Euclidean distance we get the nearest neighbors: say three nearest neighbors in category A and two nearest neighbors in category B.
o As the 3 nearest neighbors are from category A, this new data point must belong to category A. A from-scratch sketch of these steps follows below.
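A minimal from-scratch sketch of these steps for 2-D points; the category-A and category-B sample points are invented so that, as in the example above, three of the five nearest neighbours belong to category A.

# A minimal K-NN sketch for 2-D points; the sample points are invented.
import math
from collections import Counter

def knn_classify(train_points, new_point, k=5):
    # Step 2-3: compute Euclidean distances and keep the k nearest neighbours.
    nearest = sorted(train_points, key=lambda p: math.dist(p[0], new_point))[:k]
    # Step 4-5: count neighbours per category and pick the majority category.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1, 2), "A"), ((2, 3), "A"), ((3, 1), "A"),
         ((7, 8), "B"), ((8, 8), "B"), ((9, 7), "B")]
print(knn_classify(train, (2, 2), k=5))   # expected: 'A' (3 A-neighbours vs 2 B-neighbours)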
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
There is no particular way to determine the best value for "K", so we need to try some values to find the best of them; the most commonly preferred starting value for K is 5 (see the sketch below).
o A very low value for K, such as K=1 or K=2, can be noisy and make the model sensitive to outliers.
o Larger values for K are more robust to noise, but the model may then have difficulty capturing finer class boundaries, and the computation increases.
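A minimal sketch of trying several K values with cross-validation, assuming scikit-learn and the Iris dataset; the candidate K list is arbitrary.

# A minimal sketch for choosing K by cross-validation; the candidate values are arbitrary.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())   # pick the K with the best average accuracy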
• Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to noisy training data.
• It can be more effective if the training data is large.
• Disadvantages of KNN Algorithm:
• It always needs a value of K to be determined, which may be complex at times.
• The computation cost is high because the distance between the new data point and all the training samples must be calculated.
Naive Bayes
• Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes theorem and used for solving classification problems.
• It is mainly used in text classification that includes a high-dimensional
training dataset.
• The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of
the probability of an object.
• Some popular applications of the Naïve Bayes algorithm are spam filtration, sentiment analysis, and classifying articles.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as (a numeric sketch follows below):
P(A|B) = P(B|A) × P(A) / P(B)
Where,
o P(A|B) is the posterior probability: the probability of hypothesis A given the observed evidence B.
o P(B|A) is the likelihood probability: the probability of the evidence B given that hypothesis A is true.
o P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
o P(B) is the marginal probability: the probability of the evidence.
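A minimal numeric sketch of applying the formula; the probabilities below are invented for a spam-filtering style example.

# A minimal Bayes' theorem sketch; the probabilities are invented for illustration.
p_spam = 0.2             # P(A): prior probability that any email is spam
p_word_given_spam = 0.6  # P(B|A): probability the word "free" appears in a spam email
p_word = 0.25            # P(B): probability the word "free" appears in any email

# P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)  # 0.48: posterior probability the email is spam given the word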
Decision Tree
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
Why use Decision Trees?
• There are various algorithms in Machine learning, so choosing the
best algorithm for the given dataset and problem is the main point to
remember while creating a machine learning model. Below are the
two reasons for using the Decision tree:
• Decision Trees usually mimic human thinking ability while making a
decision, so it is easy to understand.
• The logic behind the decision tree can be easily understood because it
shows a tree-like structure.
Decision Tree Terminologies
• Root Node: Root node is from where the decision tree starts. It represents
the entire dataset, which further gets divided into two or more
homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root node into
sub-nodes according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from
the tree.
• Parent/Child node: The root node of the tree is called the parent node, and
other nodes are called the child nodes.
How does the Decision Tree
algorithm Work?
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
• Step-3: Divide S into subsets that contain the possible values of the best attribute.
• Step-4: Generate the decision tree node, which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; the final nodes are then called leaf nodes. A minimal sketch follows below.
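A minimal sketch using scikit-learn's decision tree on the Iris dataset; scikit-learn's default Gini impurity criterion stands in for the attribute selection measure mentioned above, and the depth limit is an arbitrary choice.

# A minimal decision-tree sketch on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion='gini' is scikit-learn's default attribute selection measure.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X, y)

# Print the learned tree: root node, splits, and leaf nodes.
print(export_text(tree, feature_names=load_iris().feature_names))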
Advantages of the Decision Tree
• It is simple to understand as it follows the same process which a human follows while making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps to think about all the possible outcomes for a problem.
• There is less requirement of data cleaning compared to other
algorithms.
Disadvantages of the Decision Tree
• The decision tree contains lots of layers, which makes it complex.
• It may have an overfitting issue, which can be resolved using
the Random Forest algorithm.
• For more class labels, the computational complexity of the decision
tree may increase.
Support Vector Machine Algorithm
• Support Vector Machine or SVM is one of the most popular
Supervised Learning algorithms, which is used for Classification as
well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.
• The idea behind SVM is to find the optimal boundary (called a
hyperplane) that best separates data points of different classes in the
feature space.
Components of the SVM
• Hyperplane: In SVM, a hyperplane is a decision boundary that separates
different classes in the feature space. The hyperplane is a line in 2D
space, a plane in 3D space, and a hyperplane in higher-dimensional
spaces.
• Support Vectors: Support vectors are the data points that are closest to
the hyperplane. These points are critical in defining the position and
orientation of the hyperplane. The SVM algorithm seeks to maximize
the margin (distance) between the hyperplane and the support vectors.
• Margin: The margin is the distance between the hyperplane and the
nearest data points from either class. SVM aims to find the hyperplane
that maximizes this margin, ensuring that the model is as general as
possible.
Types of SVM
• Linear SVM: Used when the data can be separated by a straight line
(or a hyperplane in higher dimensions). The algorithm finds the
optimal hyperplane that separates the classes.
• Non-Linear SVM: Used when the data cannot be separated by a
straight line. In this case, SVM uses a technique called the kernel trick
to map the data into a higher-dimensional space where a hyperplane
can separate the classes.
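A minimal sketch of both variants using scikit-learn's SVC; the make_moons dataset is chosen only because it is not linearly separable, and the noise level is arbitrary.

# A minimal sketch of linear vs non-linear SVM; make_moons gives non-linearly-separable data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)   # straight-line boundary
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)         # kernel trick: non-linear boundary

print("linear:", linear_svm.score(X_test, y_test))
print("rbf:   ", rbf_svm.score(X_test, y_test))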
Advantages of SVM:
• Effective in High Dimensions: SVM works well when the number of
dimensions (features) is greater than the number of data points.
• Memory Efficient: SVM uses a subset of training points (support
vectors) to make decisions, which is memory efficient.
• Robust to Overfitting: Especially in high-dimensional space, if
appropriately regularized.
Disadvantages of SVM:
• Not Suitable for Large Datasets: SVM is computationally intensive and
may not perform well with very large datasets.
• Difficult to Choose the Right Kernel: The performance of SVM is highly
dependent on the choice of kernel and parameters.
• Interpretability: SVM models are not as easily interpretable as some
other models like decision trees.
Unsupervised Machine learning
• In this type of machine learning, models are trained on an unlabelled dataset and are allowed to act on that data without any supervision.
• Unsupervised learning cannot be directly applied to a regression or
classification problem because unlike supervised learning, we have
the input data but no corresponding output data. The goal of
unsupervised learning is to find the underlying structure of dataset,
group that data according to similarities, and represent that dataset
in a compressed format.
• Here, we take unlabeled input data, which means it is not categorized and no corresponding outputs are given. This unlabeled input data is fed to the machine learning model in order to train it. First, the model interprets the raw data to find hidden patterns in the data, and then a suitable algorithm such as k-means clustering, hierarchical clustering, etc. is applied.
• Once the suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects (see the sketch below).
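A minimal unsupervised sketch, assuming scikit-learn; make_blobs generates unlabeled synthetic points, and k-means groups them purely by similarity.

# A minimal k-means clustering sketch on unlabeled synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate points and discard the returned labels: the model only ever sees X.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
groups = kmeans.fit_predict(X)      # each point is assigned to one of 3 groups

print(groups[:10])                  # cluster index per point
print(kmeans.cluster_centers_)      # the centre of each discovered group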