
Silver Oak College of Engineering and Technology

Unit 1 :
Introduction to Machine Learning

1 Prof. Monali Suthar (SOCET-CE)


Outline
 Overview of Human Learning and Machine Learning
 Types of Machine Learning
 Applications of Machine Learning
 Tools and Technology for Machine Learning

2 Prof. Monali Suthar (SOCET-CE)


What is Human Learning ?

3 Prof. Monali Suthar (SOCET-CE)


What is Machine Learning ?

4 Prof. Monali Suthar (SOCET-CE)


What Is Machine Learning?

 Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
 Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
 The primary aim is to allow computers to learn automatically, without human intervention or assistance, and adjust their actions accordingly.

5 Prof. Monali Suthar (SOCET-CE)


Introduction To Analytics,
Machine Learning and deep learning

 Artificial Intelligence: Algorithms and systems that exhibit human-like intelligence.
 Machine Learning: Subset of AI that can learn to perform a task with extracted data and/or models.
 Deep Learning: Subset of machine learning that imitates the functioning of the human brain to solve problems.

6 Prof. Monali Suthar (SOCET-CE)


Types of Machine learning algorithms
1. Supervised Learning Algorithms: Require knowledge of both the outcome variable (dependent variable) and the features (independent or input variables). E.g., linear regression, logistic regression, discriminant analysis. In the case of multiple linear regression, the regression parameters are typically estimated by the method of ordinary least squares (see the formula after this list).
2. Unsupervised Learning Algorithms: Set of algorithms which do not have knowledge of the outcome variable in the dataset. E.g., clustering, principal component analysis.
3. Reinforcement Learning Algorithms: Algorithms that have to take sequential actions (decisions) to maximize a cumulative reward. E.g., techniques such as Markov chains and Markov decision processes.
4. Evolutionary Learning Algorithms: Algorithms that imitate natural evolution to solve a problem. E.g., techniques such as genetic algorithms and ant colony optimization.
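As a point of reference (a standard textbook result, not spelled out on the slide), the ordinary least squares estimate of the regression parameters in multiple linear regression is

\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y

where X is the matrix of input variables (with a leading column of ones for the intercept) and y is the vector of observed outcome values.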
7 Prof. Monali Suthar (SOCET-CE)
Types of Machine learning algorithms

 Machine learning algorithms are often categorized as supervised or unsupervised.

Learning System → Supervised Learning | Unsupervised Learning | Semi-supervised Learning | Reinforcement Learning

8 Prof. Monali Suthar (SOCET-CE)


Supervised machine learning algorithms
 Supervised machine learning algorithms can apply what has been learned
in the past to new data using labeled examples to predict future events.
 Starting from the analysis of a known training dataset, the learning
algorithm produces an inferred function to make predictions about the
output values.
 The system is able to provide targets for any new input after sufficient
training.
 The learning algorithm can also compare its output with the correct,
intended output and find errors in order to modify the model accordingly.

9 Prof. Monali Suthar (SOCET-CE)


10 Prof. Monali Suthar (SOCET-CE)
Supervised machine learning algorithms

11 Prof. Monali Suthar (SOCET-CE)


Types of supervised Machine learning

 Regression: Regression algorithms are used if there is a relationship between the input variable and the output variable. They are used for the prediction of continuous variables, such as weather forecasting, market trends, etc. (In the accompanying diagram, supervised learning branches into Regression and Classification.) Below are some popular regression algorithms which come under supervised learning (a short sketch follows this list):
 Linear Regression
 Regression Trees
 Non-Linear Regression
 Bayesian Linear Regression
 Polynomial Regression
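A minimal regression sketch (the synthetic data and the choice of scikit-learn's LinearRegression are illustrative assumptions, not part of the slide):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy continuous-target data: y is roughly 3*x + 4 plus noise
rng = np.random.RandomState(0)
X = rng.rand(100, 1) * 10
y = 3 * X.ravel() + 4 + rng.randn(100)

model = LinearRegression()
model.fit(X, y)                       # learn the relationship between input and output
print(model.coef_, model.intercept_)  # estimated slope and intercept
print(model.predict([[5.0]]))         # predict a continuous value for a new input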

12 Prof. Monali Suthar (SOCET-CE)


Types of supervised Machine learning

 Classification: Classification algorithms are used when the output variable is categorical, which means there are two classes such as Yes-No, Male-Female, True-False, etc. A typical application is spam filtering. Popular classification algorithms include (a short sketch follows this list):
 Random Forest
 Decision Trees
 Logistic Regression
 Support Vector Machines
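A minimal classification sketch (the built-in breast cancer dataset and logistic regression are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Two-class problem: each sample is labeled 0 or 1
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale the features, then fit a logistic regression classifier
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))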

13 Prof. Monali Suthar (SOCET-CE)


Advantages and Disadvantages
 Advantages:
 With the help of supervised learning, the model can predict the output on
the basis of prior experiences.
 In supervised learning, we can have an exact idea about the classes of
objects.
 Supervised learning model helps us to solve various real-world problems
such as fraud detection, spam filtering, etc.
 Disadvantages :
 Supervised learning models are not suitable for handling complex tasks.
 Supervised learning cannot predict the correct output if the test data is different from the training dataset.
 Training requires a lot of computation time.
 In supervised learning, we need enough knowledge about the classes of objects.

14 Prof. Monali Suthar (SOCET-CE)


Unsupervised machine learning
Algorithms
 In contrast, unsupervised machine learning algorithms are used when the
information used to train is neither classified nor labeled.
 Unsupervised learning studies how systems can infer a function to describe a
hidden structure from unlabeled data.
 The system doesn’t figure out the right output, but it explores the data and can
draw inferences from datasets to describe hidden structures from unlabeled
data.
 “Unsupervised learning is a type of machine learning in which models are
trained using unlabeled dataset and are allowed to act on that data without any
supervision.”

15 Prof. Monali Suthar (SOCET-CE)


Unsupervised Learning

16 Prof. Monali Suthar (SOCET-CE)


Types of Unsupervised Learning
 Clustering:
 Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in a group and have little or no similarity with the objects of another group (see the sketch below).
 Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
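A minimal clustering sketch (the six toy points and k-means with two clusters are illustrative assumptions):

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled points: there is no outcome variable, only features
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Group the points into 2 clusters by similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assigned to each point
print(kmeans.cluster_centers_)  # coordinates of the cluster centres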

17 Prof. Monali Suthar (SOCET-CE)


Types of Unsupervised Learning
 Association :
 An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database.
 It determines the sets of items that occur together in the dataset. Association rules make a marketing strategy more effective.
 For example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam).
 A typical example of association rules is Market Basket Analysis (a sketch follows).
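A small market-basket sketch (this assumes the third-party mlxtend package is installed; the transactions are made up for illustration):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Made-up market-basket transactions
transactions = [["bread", "butter"], ["bread", "jam"],
                ["bread", "butter", "jam"], ["milk", "bread"]]

# One-hot encode the transactions, then mine frequent itemsets and rules
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])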

18 Prof. Monali Suthar (SOCET-CE)


Unsupervised Learning Algorithm
 Below is the list of some popular unsupervised learning algorithms:
 K-means clustering
 KNN (k-nearest neighbors)
 Hierarchical clustering
 Anomaly detection
 Neural Networks
 Principal Component Analysis
 Independent Component Analysis

19 Prof. Monali Suthar (SOCET-CE)


Advantages & Disadvantages
 Advantages :

 Unsupervised learning is used for more complex tasks as compared to


supervised learning because, in unsupervised learning, we don't have labeled
input data.
 Unsupervised learning is preferable as it is easy to get unlabeled data in
comparison to labeled data.

 Disadvantages :

 Unsupervised learning is intrinsically more difficult than supervised learning


as it does not have corresponding output.
 The result of the unsupervised learning algorithm might be less accurate as
input data is not labeled, and algorithms do not know the exact output in
advance.

20 Prof. Monali Suthar (SOCET-CE)


Difference between SL and USL

21 Prof. Monali Suthar (SOCET-CE)


Difference between SL and USL
SUPERVISED LEARNING vs UNSUPERVISED LEARNING

Input Data: Supervised uses known and labeled data as input | Unsupervised uses unknown (unlabeled) data as input
Computational Complexity: Supervised is the simpler method | Unsupervised is computationally more complex
Real Time: Supervised uses off-line analysis | Unsupervised uses real-time analysis of data
Number of Classes: Supervised: number of classes is known | Unsupervised: number of classes is not known
Accuracy of Results: Supervised gives accurate and reliable results | Unsupervised gives moderately accurate and reliable results
22 Prof. Monali Suthar (SOCET-CE)


Semi-supervised machine learning
algorithms
 Semi-supervised machine learning algorithms fall somewhere in between
supervised and unsupervised learning, since they use both labeled and
unlabeled data for training – typically a small amount of labeled data and a
large amount of unlabeled data.
 The systems that use this method are able to considerably improve learning
accuracy.
 Usually, semi-supervised learning is chosen when acquiring labeled data requires skilled and relevant resources in order to train from it, whereas acquiring unlabeled data generally does not require additional resources (see the sketch below).
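A minimal semi-supervised sketch (label propagation on the built-in iris data is an illustrative assumption; unlabeled points are marked with -1, as scikit-learn expects):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)

# Pretend most labels are unknown: unlabeled points are marked with -1
rng = np.random.RandomState(0)
y_partial = np.copy(y)
y_partial[rng.rand(len(y)) < 0.7] = -1

# Fit on the labeled and unlabeled points together
model = LabelPropagation()
model.fit(X, y_partial)
print("Accuracy against the true labels:", model.score(X, y))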

23 Prof. Monali Suthar (SOCET-CE)


Reinforcement machine learning
algorithms
 Reinforcement machine learning is a learning method in which an agent interacts with its environment by producing actions and discovers errors or rewards.
 Trial and error search and delayed reward are the most relevant characteristics
of reinforcement learning.
 This method allows machines and software agents to automatically determine
the ideal behavior within a specific context in order to maximize its
performance.
 Simple reward feedback is required for the agent to learn which action is best;
this is known as the reinforcement signal.

24 Prof. Monali Suthar (SOCET-CE)


Reinforcement machine learning
algorithms
 Though both supervised and reinforcement learning use a mapping between input and output, they differ in the feedback given: in supervised learning the agent is given the correct set of actions for performing a task, whereas reinforcement learning uses rewards and punishments as signals for positive and negative behavior (a toy sketch follows the applications below).
 Application:
 RL is quite widely used in building AI for playing computer games.
 In robotics and industrial automation, RL is used to enable the robot to
create an efficient adaptive control system for itself which learns from its
own experience and behavior.
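A toy illustration of the trial-and-error, reward-driven idea (a tabular Q-learning sketch on an invented 5-state corridor; the environment, hyperparameters and update rule are assumptions for illustration):

import numpy as np

# Toy 1-D world: states 0..4; reaching state 4 yields a reward of +1
n_states, n_actions = 5, 2              # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))     # table of action-value estimates
alpha, gamma, epsilon = 0.1, 0.9, 0.5   # learning rate, discount, exploration rate

rng = np.random.RandomState(0)
for episode in range(300):
    s = 0
    for _ in range(100):                # cap the episode length
        # epsilon-greedy: explore often in this tiny problem, otherwise exploit
        a = rng.randint(n_actions) if rng.rand() < epsilon else int(np.argmax(Q[s]))
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        reward = 1.0 if s_next == n_states - 1 else 0.0
        # reinforcement signal: move the estimate towards reward + discounted future value
        Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        if s == n_states - 1:           # goal reached, episode ends
            break

print(np.argmax(Q, axis=1))  # states 0-3 should learn to prefer action 1 (move right)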

25 Prof. Monali Suthar (SOCET-CE)


Reinforcement machine learning

26 Prof. Monali Suthar (SOCET-CE)


Methods of ML

Learning System → Supervised Learning | Semi-supervised Learning | Unsupervised Learning | Reinforcement Learning
Supervised Learning → Regression, Classification
Unsupervised Learning → Clustering, Dimension reduction

27 Prof. Monali Suthar (SOCET-CE)


Applications of ML

28 Prof. Monali Suthar (SOCET-CE)


Applications of ML
1. Image Recognition:
 Image recognition is one of the most common applications of machine learning. It is
used to identify objects, persons, places, digital images, etc. The popular use case of
image recognition and face detection is, Automatic friend tagging suggestion:
 Facebook provides us a feature of auto friend tagging suggestion. Whenever we
upload a photo with our Facebook friends, then we automatically get a tagging
suggestion with name, and the technology behind this is machine learning's face
detection and recognition algorithm.
 It is based on the Facebook project named "Deep Face," which is responsible for face
recognition and person identification in the picture.

29 Prof. Monali Suthar (SOCET-CE)


Applications of ML
2. Speech Recognition
 While using Google, we get an option of "Search by voice," it comes
under speech recognition, and it's a popular application of machine
learning.
 Speech recognition is a process of converting voice instructions into text,
and it is also known as "Speech to text", or "Computer speech
recognition." At present, machine learning algorithms are widely used
by various applications of speech recognition. Google
assistant, Siri, Cortana, and Alexa are using speech recognition
technology to follow the voice instructions.

30 Prof. Monali Suthar (SOCET-CE)


Applications of ML
3.Traffic prediction:
 If we want to visit a new place, we take help of Google Maps, which
shows us the correct path with the shortest route and predicts the
traffic conditions.
 It predicts the traffic conditions such as whether traffic is cleared, slow-
moving, or heavily congested with the help of two ways:
 Real-time location of the vehicle from the Google Maps app and sensors
 Average time taken on past days at the same time.
 Everyone who is using Google Maps is helping this app to make it better. It takes information from the user and sends it back to its database to improve the performance.

31 Prof. Monali Suthar (SOCET-CE)


Applications of ML
4. Product recommendations:
 Machine learning is widely used by various e-commerce and entertainment companies such as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for some product on Amazon, we start getting advertisements for the same product while surfing the internet in the same browser, and this is because of machine learning.
 Google understands the user's interests using various machine learning algorithms and suggests products as per customer interest.
 Similarly, when we use Netflix, we find recommendations for entertainment series, movies, etc., and this is also done with the help of machine learning.

32 Prof. Monali Suthar (SOCET-CE)


Applications of ML
5. Email Spam and Malware Filtering:
 Whenever we receive a new email, it is filtered automatically as important, normal, or spam. We always receive important mail in our inbox with the important symbol and spam emails in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:
 Content filter
 Header filter
 General blacklists filter
 Rules-based filters
 Permission filters
 Some machine learning algorithms such as Multi-Layer
Perceptron, Decision tree, and Naïve Bayes classifier are used for email
spam filtering and malware detection.

33 Prof. Monali Suthar (SOCET-CE)


Applications of ML
 6. Self-driving cars:
 7.Virtual Personal Assistant:
 Google assistant, Alexa, Cortana, Siri.
 8. Online Fraud Detection:
 9. Stock Market trading:
 10. Medical Diagnosis:
 11. Automatic Language Translation:
 Google's GNMT (Google Neural Machine Translation)

34 Prof. Monali Suthar (SOCET-CE)


Tools for Machine Learning
 Scikit-learn:
 Scikit-learn is a library for machine learning development in the Python programming language.
 Features:
 It helps in data mining and data analysis.
 It provides models and algorithms for Classification, Regression,
Clustering, Dimensional reduction, Model selection, and Pre-
processing.
 Pros:
 Easily understandable documentation is provided.
 Parameters for any specific algorithm can be changed while calling
objects.
 Official Website: http://scikit-learn.org/stable/

35 Prof. Monali Suthar (SOCET-CE)


Tools for Machine Learning
 PyTorch:
 PyTorch is a Torch-based Python machine learning library. Torch is a Lua-based computing framework, scripting language, and machine learning library.

 Features:
 It helps in building neural networks through the Autograd module.
 It provides a variety of optimization algorithms for building neural networks.
 PyTorch can be used on cloud platforms and provides distributed training, various tools, and libraries.
 Pros:
 It helps in creating computational graphs.
 Ease of use because of the hybrid front-end.

36 Prof. Monali Suthar (SOCET-CE)


Tools for Machine Learning
 TensorFlow:
 TensorFlow provides a JavaScript library (TensorFlow.js) which helps in machine learning. Its APIs will help you to build and train models.
 Features:
 Helps in training and building your models.
 You can run your existing models with the help of TensorFlow.js, which includes a model converter.
 It helps in working with neural networks.
 Pros:
 You can use it in two ways, i.e. by script tags or by installing through NPM.
 It can even help with human pose estimation.
 Official Website: https://www.tensorflow.org/
 Cons:
 It is difficult to learn.

37 Prof. Monali Suthar (SOCET-CE)


Tools for Machine Learning
 Weka
 It is also open-source software.
 One can access it through a graphical user interface.
 The software is very user-friendly.
 The application of this tool is in research and teaching.
 Along with this, Weka lets you access other machine learning tools as well.
 For example, R, Scikit-learn, etc.

38 Prof. Monali Suthar (SOCET-CE)


Tools for Machine Learning
 Jupyter Notebook
 Jupyter Notebook is one of the most widely used machine learning tools of all.
 It is a very fast as well as efficient processing platform.
 It supports three languages, viz. Julia, R and Python; the name Jupyter is formed from the combination of these three programming languages.
 Jupyter Notebook allows the user to store and share live code in the form of notebooks. One can also access it through a GUI, for example WinPython Navigator, Anaconda Navigator, etc.

39 Prof. Monali Suthar (SOCET-CE)


Tools for Machine Learning
 Accord.net
 Accord.net is a computational machine learning framework. It comes with
an image as well as audio packages.
 Such packages assist in training the models and in creating interactive
applications.
 For example, audition, computer vision, etc.
 As .NET is present in the name of the tool, the base library of this framework is written in the C# language.
 Accord libraries are very useful for testing as well as manipulating audio files.

40 Prof. Monali Suthar (SOCET-CE)


Programming Language for Machine
Learning
 Python Programming Language
 With over 8.2 million developers across the world using Python for coding,
Python ranks first in the latest annual ranking of popular programming
languages by IEEE Spectrum with a score of 100.
 Stack Overflow programming language trends clearly show that it is the only language that has been on the rise for the last five years.
 Features:
 Extensive Collection of Libraries and Packages
 Code Readability
 Flexibility

41 Prof. Monali Suthar (SOCET-CE)


Programming Language for Machine
Learning
 Python Programming Language
 Python Libraries:
 Working with textual data – use NLTK, scikit-learn, and NumPy
 Working with images – use scikit-image and OpenCV
 Working with audio – use Librosa
 Implementing deep learning – use TensorFlow, Keras, PyTorch
 Implementing basic machine learning algorithms – use scikit-learn
 Want to do scientific computing – use SciPy
 Want to visualise the data clearly – use Matplotlib, scikit-learn, and Seaborn

42 Prof. Monali Suthar (SOCET-CE)


Core Python Libraries for Machine Learning

43 Prof. Monali Suthar (SOCET-CE)


Getting Started With Anaconda Platform
 Step 1: Go to Anaconda Site
 Go to https://www.anaconda.com/distribution/ using your browser window.
 Step 2: Download Anaconda Installer for your Environment
 Select your OS environment and choose Python 3.7 version to download
the installation files as shown in figure.

44 Prof. Monali Suthar (SOCET-CE)


Getting Started With Anaconda Platform
 Step 3: Install Anaconda
 Double click on the downloaded file and follow the on-screen installation instructions, leaving
options as set by default. This will take a while and complete the installation process.

 Step 4: Start Jupyter Notebook


 Open the command terminal window as per your OS environment and type the following command:
jupyter notebook --ip=*
This should start the Jupyter notebook and open a browser window in your default browser software.
 Step 5: Create a New Python Program
 On the browser window, select “New” to open a menu. Clicking on “Folder” will create a directory in the current directory. To create a Python program, click on “Python 3”. It will open a new window, which will be the program editor for the new Python program, as shown in the figure.
45 Prof. Monali Suthar (SOCET-CE)
Getting Started With Anaconda Platform
 Step 6: Rename the Program
 By default, the program name will be
“Untitled”. Click on it to rename the
program and name as per your
requirement. For example, we have
renamed it to “My First Program”.
 Step 7: Write and Execute Code
 Write Python code in the cell and then
press SHIFT+ENTER to execute the
cell.
 Step 8: Basic Commands for Working
with Jupyter Notebook
 Click on “User Interface Tour” for a
quick tour of Jupyter notebook
features. Or click on “Keyboard
Shortcuts” for basic editor commands
as shown in the figure.
46 Prof. Monali Suthar (SOCET-CE)
Programming Language for Machine
Learning
 R Programming Language
 R is an incredible programming language for machine learning written by
a statistician for statisticians.
 The R language can also be used by non-programmers, including data miners, data analysts, and statisticians.

47 Prof. Monali Suthar (SOCET-CE)


Programming Language for Machine
Learning
 R Programming Language
 R language provides a variety of tools to train and evaluate machine
learning algorithms for predicting future events making machine learning
easy and approachable. R has an exhaustive list of packages for machine
learning –
 MICE for dealing with missing values.
 CARET for working with classification and regression problems.
 party and rpart for recursive partitioning (decision trees).
 randomForest for building random forest models.
 dplyr and tidyr for data manipulation.
 ggplot2 for creating beautiful visualisations.
 Rmarkdown and Shiny for communicating insights through reports.

48 Prof. Monali Suthar (SOCET-CE)


Programming Language for Machine
Learning
 Java and JavaScript Programming Languages
 Using Java for machine learning projects makes it easier for machine
learning engineers to integrate with existing code repositories.
 Features like the ease of use, package services, better user interaction,
easy debugging, and graphical representation of data make it a machine
learning language of choice
 Java has plenty of third-party libraries for machine learning:
 JavaML: provides a collection of machine learning algorithms implemented in Java
 Arbiter: a Java library for hyperparameter tuning, which is an integral part of making ML algorithms run effectively
 Deeplearning4J: a library which supports popular machine learning algorithms like k-nearest neighbor and neural networks

49 Prof. Monali Suthar (SOCET-CE)


50 Prof. Monali Suthar (SOCET-CE)
Reference
 https://www.javatpoint.com/applications-of-machine-learning
 https://www.edureka.co/blog/machine-learning-applications/
 https://www.geeksforgeeks.org/machine-learning-introduction/
 https://towardsdatascience.com/10-most-popular-machine-learning-software-tools-in-2019-678b80643ceb
 https://in.springboard.com/blog/best-language-for-machine-learning/
 Machine Learning: A Probabilistic Perspective, Kevin P. Murphy

51 Prof. Monali Suthar (SOCET-CE)


52 Prof. Monali Suthar (SOCET-CE)
Silver Oak College of Engineering and Technology

Unit 2 :
Preparing to Model

1 Prof. Monali Suthar (SOCET-CE)


Outline
 Machine Learning activities,
 Types of data in Machine Learning,
 Structures of data,
 Data quality and remediation,
 Data Pre-Processing: Dimensionality reduction, Feature
subset selection

2 Prof. Monali Suthar (SOCET-CE)


Framework For Developing Machine Learning Models

Problem or Opportunity Identification → Feature Extraction → Data Preprocessing → Model Building → Communication and Deployment of Data Analysis

3 Prof. Monali Suthar (SOCET-CE)


Machine Learning activities

4 Prof. Monali Suthar (SOCET-CE)


Types of data in Machine Learning
 Most data can be categorized into
2 basic types from a Machine
Learning perspective:
1. Qualitative Data
Type/Categorical data
2. Quantitative Data
Type/Numerical data

5 Prof. Monali Suthar (SOCET-CE)


Types of data in Machine Learning
 Qualitative Data
/Categorical data
 Qualitative or Categorical Data
describes the object under
consideration using a finite set of
discrete classes.
It means that this type of data can't be counted or measured easily using numbers and is therefore divided into categories.
 Ex: The gender of a person (male,
female, or others).
 There are two subcategories under
this:
 Nominal data
 Ordinal data

6 Prof. Monali Suthar (SOCET-CE)


Types of data in Machine Learning
 Qualitative Data /Categorical
data
 Nominal data
 These are the set of values that don't possess a natural ordering.
 With the nominal data type there is no comparison among the categories.
 Ex: The color of a smartphone can be considered a nominal data type, as we can't compare one color with another.
 Ex: The gender of a person is another one, where we can't differentiate between male, female, or others.

7 Prof. Monali Suthar (SOCET-CE)


Types of data in Machine Learning
 Qualitative Data /Categorical data
 Ordinal data
 These types of values have a natural ordering while maintaining their class of values.
 These categories help us decide which encoding strategy can be applied to which type of data (see the sketch below).
 Ex: Sizes can be compared as small < medium < large, unlike the nominal data type where there is no comparison among the categories.
 Data encoding for qualitative data is important because machine learning models can't handle these values directly and they need to be converted to numerical types, as the models are mathematical in nature.
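A small sketch of encoding both kinds of categorical data (the tiny phone-style table and the pandas / scikit-learn encoders are illustrative assumptions):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

df = pd.DataFrame({"color": ["black", "white", "black"],    # nominal: no order
                   "size": ["small", "large", "medium"]})   # ordinal: small < medium < large

# Ordinal column: map categories to integers that respect the natural order
ord_enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_code"] = ord_enc.fit_transform(df[["size"]]).ravel()

# Nominal column: one-hot encode, since the categories cannot be compared
onehot = OneHotEncoder().fit_transform(df[["color"]]).toarray()
print(df)
print(onehot)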

8 Prof. Monali Suthar (SOCET-CE)


Types of data in Machine Learning
 Quantitative/ Numeric
Data
 This data type tries to quantify things, and it does so by considering numerical values that make it countable in nature.
 Discrete
 Continuous

9 Prof. Monali Suthar (SOCET-CE)


Types of data in Machine Learning
 Quantitative/ Numeric
Data
 Discrete
 Numerical values which are integers or whole numbers are placed under this category. The number of speakers in the phone, cameras, cores in the processor, and the number of SIMs supported are some examples of the discrete data type.

10 Prof. Monali Suthar (SOCET-CE)


Types of data in Machine Learning
 Quantitative/ Numeric
Data
 Continuous
 The fractional numbers
are considered as
continuous values. These
can take the form of the
operating frequency of the
processors, the android
version of the phone, wifi
frequency, temperature of
the cores, and so on.

11 Prof. Monali Suthar (SOCET-CE)


Types of data in Machine Learning

12 Prof. Monali Suthar (SOCET-CE)


Structures of data
 The term structured data refers to data that resides in a
fixed field within a file or record. Structured data is
typically stored in a relational database (RDBMS).
 It can consist of numbers and text, and sourcing can
happen automatically or manually, as long as it's within an
RDBMS structure.
 It depends on the creation of a data model, defining what
types of data to include and how to store and process it.

13 Prof. Monali Suthar (SOCET-CE)


Data quality and remediation
 Data quality is an assessment or a perception
of data's fitness to fulfill its purpose. Simply put, data is said
to be high quality if it satisfies the requirements of its
intended purpose.
 There are many aspects to data quality, including consistency,
integrity, accuracy, and completeness.
 Achieving the data quality required for machine learning
 This includes checking for consistency, accuracy, compatibility,
completeness, timeliness, and duplicate or corrupted records.
 At the scale required for a typical ML project, adequately
cleansing training or production data manually is a near
impossibility.

14 Prof. Monali Suthar (SOCET-CE)


Importance of Data quality
 Data Quality matters for machine learning.
Unsupervised machine learning is a savior when the
desired quality of data is missing to reach the requirements
of the business.
 It is capable of delivering precise business insights by
evaluating data for AI-based programs.
 Improved data quality leads to better decision-making across
an organization.
 The more high-quality data you have, the more confidence
you can have in your decisions.
 Data quality is of critical importance especially in the era of
automated decisions, ML, and continuous process optimization

15 Prof. Monali Suthar (SOCET-CE)


Importance of Data quality
 Confusion, limited trust, poor decisions
 Data quality issues explain limited trust in data from corporate
users, waste of resources, or even poor decisions.
 Failures due to low data quality
 Users need to trust the data — if they don‘t, they will gradually
abandon the system impacting its major KPIs and success
criteria.

16 Prof. Monali Suthar (SOCET-CE)


Data quality issues
 Data quality issues can take many forms, for example:
 particular properties in a specific object have invalid or missing
values
 a value coming in an unexpected or corrupted format
 duplicate instances
 inconsistent references or unit of measures
 incomplete cases
 broken URLs
 corrupted binary data
 missing packages of data
 gaps in the feeds
 incorrectly -mapped properties

17 Prof. Monali Suthar (SOCET-CE)


Data quality
 Data quality issues are typically the result of:
 poor software implementations: bugs or improper
handling of particular cases
 system-level issues: failures in certain processes
 changes in data formats, impacting the source and/or
target data stores

18 Prof. Monali Suthar (SOCET-CE)


Data remediation
 Data remediation is the process of cleansing, organizing and migrating data so that it's properly protected and best serves its intended purpose. ... Since the core initiative is to correct data, the data remediation process typically involves replacing, modifying, cleansing or deleting any "dirty" data.
 It can be performed manually, with cleansing tools, as a
batch process (script), through data migration or a
combination of these methods.

19 Prof. Monali Suthar (SOCET-CE)


Data remediation
 Need for data remediation: Consider these additional
factors that will drive the need for data remediation
 Moving to a new system or environment
 Eliminating personally identifiable information (a.k.a.
PII)
 Dealing with mergers and acquisitions activity
 Addressing human errors
 Remedying errors in reports
 Other business drivers

20 Prof. Monali Suthar (SOCET-CE)


Data remediation terminology
 Data Migration – The process of moving data between two or more systems, data formats
or servers.
 Data Discovery – A manual or automated process of searching for patterns in data sets to
identify structured and unstructured data in an organization‘s systems.
 ROT – An acronym that stands for redundant, obsolete and trivial data. According to the
Association for Intelligent Information Management, ROT data accounts for nearly 80 percent
of the unstructured data that is beyond its recommended retention period and no longer
useful to an organization.
 Dark Data – Any information that businesses collect, process and store, but do not use for
other purposes. Some examples include customer call records, raw survey data or email
correspondences. Often, the storing and securing of this type of data incurs more expense
and sometimes even greater risk than it does value.
 Dirty Data – Data that damages the integrity of the organization‘s complete dataset. This can
include data that is unnecessarily duplicated, outdated, incomplete or inaccurate.
 Data Overload – This is when an organization has acquired too much data, including low-
quality or dark data. Data overload makes the tasks of identifying, classifying and remediating
data laborious.
 Data Cleansing – Transforming data in its native state to a predefined standardized format.
 Data Governance – Management of the availability, usability, integrity and security of the data stored within an organization.
21 Prof. Monali Suthar (SOCET-CE)
Stages of data remediation
 Data remediation is an involved process. After all, it's more than simply purging your organization's systems of dirty data.
 It requires knowledgeable assessment of how to most effectively resolve unclean data.
 Assessment:
 you need to have a complete understanding of the data you possess.
 Organizing and segmentation:
 Not all data is created equally, which means that not all pieces of
data require the same level of protection or storage features.
 A key consideration when creating segments is determining which historical data is essential to business operations and needs to be stored in an archive system, versus data that can be safely deleted.

22 Prof. Monali Suthar (SOCET-CE)


Stages of data remediation
 Indexation and classification:
 These steps build off of the data segments you have created and
helps you determine action steps.
 organizations will focus on segments containing non-ROT data
and classify the level of sensitivity of this remaining data.
 Migrating:
 If an organization‘s end goal is to consolidate their data into a new,
cleansed storage environment, then migration is an essential step in
the data remediation process.
 Data cleansing:
 The final task for your organization‘s data may not always involve
migration.
 There may be other actions better suited for the data depending on
what segmentation group it falls under and its classification.
 A few vital actions that a team may proceed with include shredding,
redacting, quarantining, ACL removal and script execution to clean up
data.
23 Prof. Monali Suthar (SOCET-CE)
Benefits of data remediation
 Reduced data storage costs
 Protection for unstructured sensitive data
 Reduced sensitive data footprint
 Adherence to compliance laws and regulations
 Increased staff productivity
 Minimized cyberattack risks
 Improved overall data security

24 Prof. Monali Suthar (SOCET-CE)


Dimensionality reduction
 Dimensionality reduction
 The number of input variables or features for a dataset is referred to as its dimensionality.
 Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset.
 More input features often make a predictive modeling task more challenging to model, more generally referred to as the curse of dimensionality.
 High-dimensionality statistics and dimensionality reduction techniques are often used for data visualization. Nevertheless, these techniques can be used in applied machine learning to simplify a classification or regression dataset in order to better fit a predictive model.

25 Prof. Monali Suthar (SOCET-CE)


Why dimensionality reduction needed?
 Some features (dimensions) bear little or no useful information (e.g. color of hair for a car selection)
 Can drop some features
 Have to estimate which features can be dropped from the data
 Several features can be combined together without loss, or even with a gain, of information (e.g. income of all family members for a loan application)
 Some features can be combined together
 Have to estimate which features to combine from the data
26 Prof. Monali Suthar (SOCET-CE)


Feature selection vs extraction

 Feature selection: Choosing k < d important features, ignoring the remaining d – k
 Subset selection algorithms
 Feature extraction: Project the original x_i, i = 1,...,d dimensions onto new k < d dimensions z_j, j = 1,...,k
 Principal Components Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Factor Analysis (FA)

27 Prof. Monali Suthar (SOCET-CE)


Principal Components Analysis (PCA)
 The Principal Component Analysis is a popular unsupervised learning
technique for reducing the dimensionality of data.
 It increases interpretability yet, at the same time, it minimizes information
loss.
 It helps to find the most significant features in a dataset and makes the data
easy for plotting in 2D and 3D.
 PCA helps in finding a sequence of linear combinations of variables.

28 Prof. Monali Suthar (SOCET-CE)


Principal Components
 The principal components are straight lines that capture most of the variance of the data.
 They have a direction and magnitude. Principal components are orthogonal projections (perpendicular) of data onto a lower-dimensional space.

29 Prof. Monali Suthar (SOCET-CE)


Application of PCA
• PCA is used to visualize multidimensional data.
• It is used to reduce the number of dimensions in healthcare data.
• PCA can help resize an image.
• It can be used in finance to analyze stock data and forecast returns.
• PCA helps to find patterns in the high-dimensional datasets.

30 Prof. Monali Suthar (SOCET-CE)


Uses of PCA
• To reduce the number of dimensions in the dataset.
• To find patterns in the high-dimensional dataset
• To visualize the data of high dimensionality
• To ignore noise
• To improve classification
• To get a compact description
• To capture as much of the original variance in the data as possible

31 Prof. Monali Suthar (SOCET-CE)


objective of PCA
• Find an orthonormal basis for the data.
• Sort dimensions in the order of importance.
• Discard the low significance dimensions.
• Focus on uncorrelated and Gaussian components.

32 Prof. Monali Suthar (SOCET-CE)


Working of PCA
1. Normalize the data
Standardize the data before performing PCA. This will ensure that each feature has a mean = 0 and variance = 1.

2. Build the covariance matrix


Construct a square matrix to express the correlation between two or more features in a multidimensional dataset.

33 Prof. Monali Suthar (SOCET-CE)


Working of PCA
3. Find the Eigenvectors and Eigenvalues
Calculate the eigenvectors/unit vectors and eigenvalues. Eigenvalues are
scalars by which we multiply the eigenvector of the covariance matrix.

4. Sort the eigenvectors in highest to lowest order and select the number
of principal components.
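A from-scratch sketch of these four steps using NumPy (the toy data is invented; the scikit-learn PCA example on the following slide wraps the same idea):

import numpy as np

# Toy data: rows = samples, columns = features
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# 1. Normalize: zero mean and unit variance per feature
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Build the covariance matrix of the standardized features
cov = np.cov(Xs, rowvar=False)

# 3. Find the eigenvectors and eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort components from highest to lowest eigenvalue and project the data
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order]
X_reduced = Xs @ components[:, :1]   # keep only the first principal component
print(eigvals[order])
print(X_reduced.shape)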

34 Prof. Monali Suthar (SOCET-CE)


PCA in python
import numpy as np
from sklearn.decomposition import PCA

# Six 2-D points used as a toy dataset
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Fit PCA, keeping both principal components
pca = PCA(n_components=2)
pca.fit(X)

# Proportion of variance explained by each principal component
print(pca.explained_variance_ratio_)

# Singular values corresponding to each component
print(pca.singular_values_)

35 Prof. Monali Suthar (SOCET-CE)




LDA : Linear Discriminant Analysis
 LDA is most commonly used as a dimensionality reduction technique.
 It is very similar to PCA.
 Where PCA involves finding the component axes that maximize the variance of the entire data, LDA involves finding the axes that maximize the separation between multiple classes.
 LDA projects a feature space of size n onto a smaller subspace of size k (where k ≤ n − 1) while maintaining the class-discriminatory information.
 Avoid overfitting
 Reduce computational cost.
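A minimal LDA sketch (the built-in iris data is an illustrative assumption; note that LDA, unlike PCA, needs the class labels):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA uses the class labels to find axes that maximize class separation
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)   # (150, 2): at most n_classes - 1 components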

37 Prof. Monali Suthar (SOCET-CE)


ICA : Independent Component Analysis
 ICA is the method for finding underlying factors or
components from multivariate statistical data.
 It looks for components that are both statistically
independent and non-Gaussian.
 Statistically independent and non-Gaussian components can be separated using a blind source separation method. Nonlinear decorrelation and maximum non-Gaussianity are the methods used in ICA.
 Methods for Estimating ICA
1. Nonlinear decorrelation:
2. Maximum non-Gaussianity:

38 Prof. Monali Suthar (SOCET-CE)


ICA : Independent Component Analysis
 Methods for Estimating ICA
1. Nonlinear decorrelation:
 The nonlinear decorrelation method involves finding the matrix W so that for any i ≠ j, the components yi and yj are completely uncorrelated.
 The transformed components g(yi) and h(yj) are also uncorrelated,
where g and h are some suitable nonlinear functions.
 In this method of estimating ICA, if the nonlinearities are properly
chosen, the method does find the independent components.
 The main problem in this method is to address how the
nonlinearities g and h are chosen. One of the approaches to select
the nonlinear functions is to use maximum likelihood method in
information theory.

39 Prof. Monali Suthar (SOCET-CE)


ICA : Independent Component Analysis
 Methods for Estimating ICA
2. Maximum non-Gaussianity:
 The maximum non-Gaussianity approach for estimating independent
component involves finding the local maxima of non-Gaussianity of a
linear combination, y = Σbixi, under the constraint that the variance of
y is constant.
 Each local maximum corresponds to one independent component. In
practice, kurtosis is used to measure non-Gaussianity. Kurtosis is a higher-order cumulant, which generalizes the notion of variance using higher-order polynomials.
 Cumulants are used for ICA as they have important algebraic and
statistical properties.
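A minimal ICA sketch (the two mixed signals are invented; scikit-learn's FastICA, which maximizes non-Gaussianity, is used as the estimator):

import numpy as np
from sklearn.decomposition import FastICA

# Two independent, non-Gaussian sources mixed together (e.g. two microphones)
rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]   # the hidden sources
A = np.array([[1.0, 0.5], [0.5, 2.0]])             # mixing matrix
X = S @ A.T                                        # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                       # recovered independent components
print(S_est.shape)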

40 Prof. Monali Suthar (SOCET-CE)


High Dimensional data
 High-dimensional refers to a high number of variables, attributes or features present in certain data sets, more so in certain domains.
 Ex: DNA analysis, geographic information systems (GIS), etc.
 A model built on an extremely high number of features may be
very difficult to understand.
 So, Start with Feature selection
 Benefits :
1. Having a faster and more cost-effective (less need for
computational resources) learning model
2. Having a better understanding of the underlying model that
generates the data.
3. Improving the efficacy of the learning model.

41 Prof. Monali Suthar (SOCET-CE)


Feature subset selection
 Feature Selection is the most critical pre-processing activity in any machine learning process.
 It intends to select a subset of attributes or features that makes the most meaningful contribution to a machine learning activity. In order to understand it, let us consider a small example:
 Predict the weight of students based on past information about similar students, which is captured inside a 'Student Weight' data set.
 The data set has 4 features: Roll Number, Age, Height and Weight.
 Roll Number has no effect on the weight of the students, so we eliminate this feature (giving a reduced data set).
 So now the new data set will have only 3 features.
 This subset of the data set is expected to give better results than the full set.

42 Prof. Monali Suthar (SOCET-CE)


Feature Subset Selection
 The Goal of Feature Subset Selection is to find the optimal
feature subset.
 Feature Subset Selection Methods can be classified into three
broad categories:
 Filter Methods
 Wrapper Methods
 Embedded Methods
 Requirements:
 A measure for assessing the goodness of a feature subset
(scoring function)
 A strategy to search the space of possible feature subsets
 Finding a minimal optimal feature set for an arbitrary
target concept is hard. It would need Good Heuristics.

43 Prof. Monali Suthar (SOCET-CE)


Filter Methods
 In this method, we select subsets of variables as a pre-processing step, independently of the classifier that will be used.
 It is worth noting that Variable Ranking feature selection is a Filter Method.

 Key features of Filter Methods (a short sketch follows):
 Filter Methods are usually fast.
 Filter Methods provide a generic selection of features, not tuned to a given learner (universal).
 Filter Methods are also often criticized (the feature set is not optimized for the classifier used).
 Filter Methods are sometimes used as a pre-processing step for other methods.
44 Prof. Monali Suthar (SOCET-CE)
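A minimal filter-method sketch (univariate scoring with SelectKBest on the built-in iris data is an illustrative assumption):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature independently of any learner, then keep the best k
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(selector.scores_)   # per-feature scores
print(X_new.shape)        # (150, 2) after selection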
Wrapper Methods
 In Wrapper Methods, the learner is considered a black box. The interface of the black box is used to score subsets of variables according to the predictive power of the learner when using those subsets.
 Results vary for different learners.
 One needs to define: how to search the space of all possible variable subsets, and how to assess the prediction performance of a learner (a short sketch follows).
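A wrapper-style sketch (recursive feature elimination, which repeatedly refits the chosen learner to score feature subsets; the iris data and logistic regression learner are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# The learner is used as a black box: features are eliminated based on its fitted weights
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator, n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger numbers were eliminated earlier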

45 Prof. Monali Suthar (SOCET-CE)


Embedded Methods
 Embedded Methods are specific to a given learning
machine
 Performs variable selection (implicitly) in the process of
training
 E.g. WINNOW-algorithm (linear unit with multiplicative
updates).

46 Prof. Monali Suthar (SOCET-CE)


47 Prof. Monali Suthar (SOCET-CE)
 https://medium.com/ml-research-lab/chapter-2-data-and-its-different-types-3dfebcbb4dbe
 https://blog.statsbot.co/data-structures-related-to-machine-learning-algorithms-5edf77c8bbf4
 https://www.upgrad.com/blog/types-of-data/
 https://www.spirion.com/data-remediation/

48 Prof. Monali Suthar (SOCET-CE)


49 Prof. Monali Suthar (SOCET-CE)
Silver Oak College of Engineering and Technology

Unit 3 :
Modeling and Evaluation:

1 Prof. Monali Suthar (SOCET-CE)


Outline
 Selecting a Model: Predictive/Descriptive,
 Training a Model for supervised learning,
 Model representation and interpretability,
 Evaluating performance of a model,
 Improving performance of a model.

2 Prof. Monali Suthar (SOCET-CE)


Model Selection
 Model selection is the process of selecting one final
machine learning model from among a collection of
candidate machine learning models for a training dataset.
 Model selection is a process that can be applied both across different types of models (e.g. logistic regression, SVM, KNN, etc.) and across models of the same type configured with different hyperparameters.
 Model selection is the process of choosing one of the models as the final model that addresses the problem.
 Model selection is different from model assessment.

What is a "good enough" model?

3 Prof. Monali Suthar (SOCET-CE)


Model Selection
 A “good enough” model may refer to many things and is
specific to your project, such as:
1. A model that meets the requirements and constraints of
project stakeholders.
2. A model that is sufficiently skillful given the time and
resources available.
3. A model that is skillful as compared to naive models.
4. A model that is skillful relative to other tested models.
5. A model that is skillful relative to the state-of-the-art.

4 Prof. Monali Suthar (SOCET-CE)


Model Selection Techniques
 The best approach to model selection requires “sufficient”
data, which may be nearly infinite depending on the
complexity of the problem.
 There are two main classes of techniques to approximate
the ideal case of model selection; they are:
1. Probabilistic Measures: Choose a model via in-sample
error and complexity.
2. Resampling Methods: Choose a model via estimated out-
of-sample error.

5 Prof. Monali Suthar (SOCET-CE)


Model Selection Techniques
 Probabilistic Measures
 Probabilistic measures involve analytically scoring a candidate model using both its
performance on the training dataset and the complexity of the model.
 It is known that training error is optimistically biased, and therefore is not a good
basis for choosing a model.
 The performance can be penalized based on how optimistic the training error is
believed to be.
 This is typically achieved using algorithm-specific methods, often linear, that penalize
the score based on the complexity of the model.
 A model with fewer parameters is less complex, and because of this, is preferred
because it is likely to generalize better on average.
 Four commonly used probabilistic model selection measures include:
 Akaike Information Criterion (AIC).
 Bayesian Information Criterion (BIC).
 Minimum Description Length (MDL).
 Structural Risk Minimization (SRM).
 Probabilistic measures are appropriate when using simpler linear models like linear regression or logistic regression, where the calculation of the model complexity penalty (e.g. in-sample bias) is known and tractable.

6 Prof. Monali Suthar (SOCET-CE)


Model Selection Techniques
 Resampling Methods
 Resampling methods seek to estimate the performance of a model (or
more precisely, the model development process) on out-of-sample data.
 This is achieved by splitting the training dataset into sub train and test sets,
fitting a model on the sub train set, and evaluating it on the test set.
 This process may then be repeated multiple times and the mean
performance across each trial is reported.
 It is a type of Monte Carlo estimate of model performance on out-of-
sample data, although each trial is not strictly independent as depending on
the resampling method chosen, the same data may appear multiple times in
different training datasets, or test datasets.
 Three common resampling model selection methods include:
 Random train/test splits.
 Cross-Validation (k-fold, LOOCV, etc.).
 Bootstrap.
 Most of the time probabilistic measures (described in the previous section)
are not available, therefore resampling methods are used.

7 Prof. Monali Suthar (SOCET-CE)


Types of model

8 Prof. Monali Suthar (SOCET-CE)


Types of model
 Predictive Analytics
 Predictive Analytics aims to say something about future results, not just current behavior.
 Predictive Analytics helps an organization to know what might happen next; it predicts the future based on the data available at present.
 It analyzes the data and provides statements about events that have not happened yet. It makes all kinds of predictions that you want to know, and all predictions are probabilistic in nature.
 It uses the supervised learning functions which are used to predict
the target value.
 The methods come under this type of mining category are called
classification, time-series analysis and regression.
 Modeling of data is the necessity of the predictive analysis, and it
works by utilizing a few variables of the present to predict the
future not known data values for other variables.

9 Prof. Monali Suthar (SOCET-CE)


Predictive Model
 There are different models developed for design-specific
functions.
1. Forecast models
2. Classification models
3. Outliers Detection Models
4. Time series model
5. Clustering Model
6. Neural Network algorithms
7. Decision Trees Algorithms

10 Prof. Monali Suthar (SOCET-CE)


PREDICTIVE MODELLING PROCESS
 The process involves running algorithms on the data set in which the
prediction is going to take place.
 The process involves training the model, multiple models being used on the
same data set and finally arriving on the model which is the best fit based
on the business data understanding.
 The predictive models’ category includes predictive, descriptive, and
decision models.
 The predictive modelling process goes as follows:
1. Pre-processing.
2. Data mining.
3. Results validation.
4. Understand business & data.
5. Prepare data.
6. Model data.
7. Evaluation.
8. Deployment.
9. Monitor & improve.

11 Prof. Monali Suthar (SOCET-CE)


PREDICTIVE MODELLING
 FEATURES :
1. Data analysis & manipulation: Create new data sets, tools for data
analysis, categorize, club, merge and filter data sets.
2. Visualization: This includes interactive graphics and reports.
3. Statistics: To confirm and create relationships between variables in
the data.
4. Hypothesis testing: Creating models, evaluating and choosing the
right models.
 Limitations
1. Errors in data labeling
2. Shortage of massive data sets needed to train machine learning
3. The machine’s inability to explain what and why it did what it did
4. Generalizability of learning, or rather lack thereof
5. Bias in data and algorithms

12 Prof. Monali Suthar (SOCET-CE)


Types of model
 Descriptive Analytics
 Descriptive Analytics will help an organization to know what has
happened in the past, it would give you the past analytics using the
data that are stored.
 For a company, it is necessary to know the past events that help
them to make decisions based on the statistics using historical data.
 This term is basically used to produce correlation, cross-tabulation,
frequency etc.
 These technologies are used to determine the similarities in the
data and to find existing patterns.
 One more application of descriptive analysis is to develop the
captivating subgroups in the major part of the data available.
 This analytics emphasis on the summarization and transformation of
the data into meaningful information for reporting and monitoring.
 For example, you might want to know how much money you lost
due to fraud and many more.
13 Prof. Monali Suthar (SOCET-CE)
Descriptive Model
 The descriptive analysis mainly uses unsupervised learning approaches for summarizing, classifying and extracting rules to answer what happened in the past.
 The descriptive models are different in nature from predictive
models since they don’t need to perform as accurately as the
predictive models need to.
 Since predictions are for a potential future event and business wants
to exploit that knowledge and take actions on the predictions, the
reliability of the prediction matters a lot.
 It describes data in clusters or association rules so it doesn’t need
to be accurate, just approximate.
 Ex: Association rules or market-basket analysis,Clustering, Feature
extraction

14 Prof. Monali Suthar (SOCET-CE)


Descriptive Vs Predictive
Basic: Descriptive determines what happened in the past by analyzing stored data. | Predictive determines what can happen in the future with the help of past data analysis.
Preciseness: Descriptive provides accurate data. | Predictive produces results that do not ensure accuracy.
Practical analysis methods: Descriptive uses standard reporting, query/drill down and ad-hoc reporting. | Predictive uses predictive modelling, forecasting, simulation and alerts.
Requirements: Descriptive requires data aggregation and data mining. | Predictive requires statistics and forecasting methods.
Type of approach: Descriptive is a reactive approach. | Predictive is a proactive approach.
Scope: Descriptive describes the characteristics of the data in a target data set. | Predictive carries out induction over current and past data so that predictions can be made.
Typical questions: Descriptive – what happened? where exactly is the problem? what is the frequency of the problem? | Predictive – what will happen next? what is the outcome if these trends continue? what actions are required to be taken?
15 Prof. Monali Suthar (SOCET-CE)


Train a Supervised Machine Learning Model
 Steps Involved in Supervised Learning:
1. First Determine the type of training dataset
2. Collect/Gather the labelled training data.
3. Split the training dataset into training dataset, test dataset, and
validation dataset.
4. Determine the input features of the training dataset, which should
have enough knowledge so that the model can accurately predict the
output.
5. Determine the suitable algorithm for the model, such as support
vector machine, decision tree, etc.
6. Execute the algorithm on the training dataset. Sometimes we need
validation sets as the control parameters, which are the subset of
training datasets.
7. Evaluate the accuracy of the model by providing the test set. If the model predicts the correct output, it means our model is accurate. (A minimal sketch of these steps follows.)
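A minimal sketch of steps 3-7 (the iris data, the three-way split and the decision tree are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 3: split the labeled data into training, validation and test sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1)

# Steps 5-6: choose an algorithm and execute it on the training set
model = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)

# Validation set for tuning control parameters; step 7: check accuracy on the test set
print("Validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))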

16 Prof. Monali Suthar (SOCET-CE)


Supervised Machine Learning
 Regression
1. Linear Regression
2. Regression Trees
3. Non-Linear Regression
4. Bayesian Linear Regression
5. Polynomial Regression
 Classification
 Random Forest
 Decision Trees
 Logistic Regression
 Support vector Machines

17 Prof. Monali Suthar (SOCET-CE)


18
Predictive model
Predictive modeling is the process of taking known
results and developing a model that can predict values
for new occurrences.
It uses historical data to predict future events.
There are many different types of predictive modeling
techniques including linear regression (ordinary least
squares), logistic regression, ridge regression, time
series, decision trees, neural network.

19
Application of Predictive method

20
Process of Predictive model
Step 1: Data collection and purification: Data is accumulated from all the sources to extract the required information, cleaning the data with operations that eliminate noisy data to get accurate estimations. The sources include transaction and customer-assistance data, survey data and economic data.

21
Process of Predictive model
Step 2: Data transformation: Data need to be
transformed through accurate processing to get
normalized data. The values are scaled in a provided
range of normalized data, extraneous elements get
removed by correlation analysis to conclude the final
decision.

22
Process of Predictive model
Step 3: Formulation of the predictive model: Any
predictive model often employs regression techniques
to design a predictive model by using the classification
algorithm. During this process, test data is recognized,
classification decisions get implemented on test data
to determine the performance of the model.

23
Process of Predictive model
Step 4: Performance analysis or conclusion: Finally, inferences are drawn from the model; for this, cluster analysis may be performed. After the model is built, ongoing analysis is important for maintaining it.

24
Steps in building regression model
STEP 1: Collect/Extract Data
The first step in building a regression model is to collect or extract data on the dependent (outcome) variable and independent (feature) variables from different data sources. Data collection in many cases can be time-consuming and expensive, even when the organization has a well-designed enterprise resource planning (ERP) system.
STEP 2: Pre-Process the Data
Before the model is built, it is essential to ensure the quality of the data for issues such as
reliability, completeness, usefulness, accuracy, missing data, and outliers.
1. Data imputation techniques may be used to deal with missing data. Descriptive statistics and visualization (such as box plots and scatter plots) may be used to identify the existence of outliers and variability in the dataset.

25
Steps in building regression model
2. Many new variables (such as the ratio of variables or product of variables) can be derived (aka
feature engineering) and also used in model building.
3. Categorical data must be pre-processed using dummy variables (part of feature engineering) before it is used in the regression model.

STEP 3: Dividing Data into Training and Validation Datasets


In this stage the data is divided into two subsets (sometimes more than two subsets): training
dataset and validation or test dataset. The proportion of training dataset is usually between 70%
and 80% of the data and the remaining data is treated as the validation data.

STEP 4: Perform Descriptive Analytics or Data Exploration


 It is always a good practice to perform descriptive analytics before moving to building a predictive analytics model. Descriptive statistics will help us understand the variability in the data, and visualization of the data through, say, a box plot will show whether there are any outliers in the data.

26
Steps in building regression model
STEP 5: Build the Model
The model is built using the training dataset to estimate the regression parameters. The method of
Ordinary Least Squares (OLS) is used to estimate the regression parameters.

STEP 6: Perform Model Diagnostics
Regression is often misused since many times the modeler fails to perform the necessary diagnostic tests before applying the model. Before it can be applied, it is necessary that the model created is validated for all model assumptions, including the definition of the functional form. If the model assumptions are violated, then the modeler must use remedial measures.

STEP 7: Validate the Model and Measure Model Accuracy


 A major concern in analytics is over-fitting, that is, the model may perform very well on the training dataset but badly on the validation dataset. It is important to ensure that the model performance on the validation dataset is consistent with that on the training dataset. In fact, the model may be cross-validated using multiple training and test datasets.

27
linear regression model
 Linear regression is a quite simple statistical regression method used for predictive analysis that shows the relationship between continuous variables.
 Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis), hence the name linear regression.
 If there is a single input variable (x), such linear regression is called simple linear
regression. And if there is more than one input variable, such linear regression is
called multiple linear regression.
 The linear regression model gives a sloped straight line describing the relationship
within the variables.

29
Cost function
A cost function, also called a loss function, is used to define and measure the error of a model. The
differences between the prices predicted by the model and the observed prices of the pizzas in the
training set are called residuals or training errors.
Cost function optimizes the regression coefficients or weights and measures how a linear
regression model is performing. The cost function is used to find the accuracy of the mapping
function that maps the input variable to the output variable. This mapping function is also known
as the Hypothesis function.

In linear regression, the Mean Squared Error (MSE) cost function is used, which is the average of the squared errors between the predicted values and the actual values.

For the simple linear equation y = mx + b, the MSE is calculated as
MSE = (1/N) Σᵢ (yᵢ − ŷᵢ)²
where yᵢ are the actual values, ŷᵢ are the predicted values, and N is the number of training examples.
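A minimal numerical sketch of the MSE calculation with NumPy (the predicted values are illustrative):

import numpy as np

y_actual = np.array([7, 9, 13, 17.5, 18])    # observed values
y_pred = np.array([8, 10, 12, 16, 19])       # illustrative model predictions

mse = np.mean((y_actual - y_pred) ** 2)      # average of the squared residuals
print('MSE:', mse)                           # 1.25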

30
EXAMPLE:
 Let's assume that you have recorded the diameters and prices of pizzas that
you have previously eaten in your pizza journal. These observations
comprise our training data

import matplotlib.pyplot as plt


X = [[6], [8], [10], [14], [18]]
y = [[7], [9], [13], [17.5], [18]]
plt.figure()
plt.title('Pizza price plotted against diameter')
plt.xlabel('Diameter in inches')
31
EXAMPLE:
plt.ylabel('Price in dollars')
plt.plot(X, y, 'k.')
plt.axis([0, 25, 0, 25])
plt.grid(True)
plt.show()

32
EXAMPLE:
from sklearn.linear_model import LinearRegression
# Training data
X = [[6], [8], [10], [14], [18]]
y = [[7], [9], [13], [17.5], [18]]
# Create and fit the model
model = LinearRegression()
model.fit(X, y)
print('A 12" pizza should cost: $%.2f' % model.predict([[12]])[0][0])
A 12" pizza should cost: $13.68

33
EVALUATING THE FITNESS OF MODEL
We can produce the best pizza-price predictor by minimizing the sum of the squared residuals. That is, our model fits if the values it predicts for the response variable are close to the observed values for all of the training examples. This measure of the model's fitness is called the residual sum of squares cost function. Formally, this function assesses the fitness of a model by summing the squared residuals for all of our training examples, as in the following equation:
RSS = Σᵢ (yᵢ − f(xᵢ))²
where yᵢ is the observed value and f(xᵢ) is the predicted value.

34
EVALUATING THE MODEL

R-squared measures how well the observed values of the response variable are predicted by the model. More concretely, r-squared is the proportion of the variance in the response variable that is explained by the model. An r-squared score of one indicates that the response variable can be predicted without any error using the model. In the case of simple linear regression, r-squared is equal to the square of the Pearson product-moment correlation coefficient, or Pearson's r.

35
CALCULATION
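For simple linear regression, r-squared can be computed as
R² = 1 − SS_res / SS_tot
where SS_res = Σᵢ (yᵢ − ŷᵢ)² is the residual sum of squares and SS_tot = Σᵢ (yᵢ − ȳ)² is the total sum of squares (ȳ is the mean of the observed values).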

36
PYTHON IMPLEMENTATION
from sklearn.linear_model import LinearRegression
X = [[6], [8], [10], [14], [18]]
y = [[7], [9], [13], [17.5], [18]]
X_test = [[8], [9], [11], [16], [12]]
y_test = [[11], [8.5], [15], [18], [11]]
model = LinearRegression()
model.fit(X, y)
print('R-squared: %.4f' % model.score(X_test, y_test))

An r-squared score of 0.6620 indicates that a large proportion of the variance in the test instances' prices is explained by the model.

37
 https://medium.com/ml-research-lab/chapter-2-data-and-its-different-types-3dfebcbb4dbe
 https://blog.statsbot.co/data-structures-related-to-machine-learning-algorithms-5edf77c8bbf4
 https://www.upgrad.com/blog/types-of-data/
 https://www.spirion.com/data-remediation/

38
 https://seleritysas.com/blog/2019/12/12/types-of-predictive-analytics-models-and-how-they-work/
 https://towardsdatascience.com/selecting-the-correct-predictive-modeling-technique-ba459c370d59
 https://www.netsuite.com/portal/resource/articles/financial-management/predictive-modeling.shtml
 https://www.dezyre.com/article/types-of-analytics-descriptive-predictive-prescriptive-analytics/209#toc-2
 https://www.sciencedirect.com/topics/computer-science/descriptive-model

39 Prof. Monali Suthar (SOCET-CE)


40 Prof. Monali Suthar (SOCET-CE)
Silver Oak College Of Engineering And Technology

Unit 4 :
Basics of Feature Engineering:

1
Outline
 Feature and Feature Engineering,
 Feature transformation:
 Construction
 Extraction,
 Feature subset selection :
 Issues in high-dimensional data,
 key drivers,
 measure
 overall process

2
Feature and Feature Engineering
 Features are the inputs to machine learning, usually in the form of structured columns.
 Algorithms require features with some specific characteristics to work properly.
 Feature Engineering?
 Feature engineering is the process of transforming raw data into
features that better represent the underlying problem to the
predictive models, resulting in improved model accuracy on unseen
data.
 Goals of Feature Engineering
1. Preparing the proper input dataset, compatible with the
machine learning algorithm requirements.
2. Improving the performance of machine learning models.
3 Prof. Monali Suthar (SOCET-CE)
Feature Engineering Category
 Feature Engineering is divided into 3 broad categories:-
1. Feature Selection:
 It is all about selecting a small subset of features from a large pool of
features.
 We select those attributes which best explain the relationship of an
independent variable with the target variable.
 There are certain features which are more important than other
features to the accuracy of the model.
 It is different from dimensionality reduction because the
dimensionality reduction method does so by combining existing
attributes, whereas the feature selection method includes or excludes
those features.
 Ex: Chi-squared test, correlation coefficient scores, LASSO, Ridge
regression etc.

4
Feature Engineering Category
II. Feature Transformation:
 It means transforming our original features into functions of the original features.
 Ex: Scaling, discretization, binning and filling missing data values are
the most common forms of data transformation.
 To reduce right skewness of the data, we use log.
III. Feature Extraction:
 When the data to be processed through an algorithm is too large,
it’s generally considered redundant.
 Analysis with a large number of variables uses a lot of computation
power and memory, therefore we should reduce the dimensionality
of these types of variables.
 It is a term for constructing combinations of the variables.
 For tabular data, we use PCA to reduce features.
 For image, we can use line or edge detection.

5
Feature transformation
 Feature transformation is the process of modifying
your data but keeping the information.
 These modifications make the data easier for machine learning algorithms to understand, which delivers better results.
 But why would we transform our features?
 data types are not suitable to be fed into a machine learning
algorithm, e.g. text, categories
 feature values may cause problems during the learning process,
e.g. data represented in different scales
 we want to reduce the number of features to plot and visualize
data, speed up training or improve the accuracy of a specific
model

6
Feature Engineering Techniques
 List of Techniques
1.Imputation
2.Handling Outliers
3.Binning
4.Log Transform
5.One-Hot Encoding
6.Grouping Operations
7.Feature Split
8.Scaling
9.Extracting Date

7
Imputation Using (Mean/Median) Values
 This works by calculating the mean/median of the
non-missing values in a column and then replacing
the missing values within each column separately
and independently from the others. It can only be
used with numeric data.
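A minimal sketch of mean imputation with scikit-learn's SimpleImputer (the column values are illustrative):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[7.0], [9.0], [np.nan], [13.0]])    # numeric column with one missing value

imputer = SimpleImputer(strategy='mean')          # strategy='median' works the same way
X_imputed = imputer.fit_transform(X)
print(X_imputed.ravel())                          # the NaN is replaced by the column mean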

8
Pros and Cons
 Pros:
• Easy and fast.
• Works well with small numerical datasets.
 Cons:
• Doesn't factor the correlations between features. It only works on the column level.
• Will give poor results on encoded categorical features (do NOT use it on categorical features).
• Not very accurate.
• Doesn't account for the uncertainty in the imputations.

9
Imputation Using (Most Frequent) or
(Zero/Constant) Values:
 Most Frequent is another statistical strategy to
impute missing values and YES!! It works with
categorical features (strings or numerical
representations) by replacing missing data with the
most frequent values within each column.
 Pros:
• Works well with categorical features.
 Cons:
• It also doesn't factor the correlations between features.
• It can introduce bias in the data.
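A minimal sketch of most-frequent imputation with scikit-learn's SimpleImputer (the colour column is illustrative):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'colour': ['red', 'blue', np.nan, 'red']})

imputer = SimpleImputer(strategy='most_frequent')    # or strategy='constant' with a fill_value
df['colour'] = imputer.fit_transform(df[['colour']]).ravel()
print(df['colour'].tolist())                         # ['red', 'blue', 'red', 'red']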
11
Imputation Using (Most Frequent) or
(Zero/Constant) Values:

12
Imputation Using k-NN
 The k nearest neighbors (k-NN) algorithm is used for simple classification. The algorithm uses 'feature similarity' to predict the values of any new data points.

 This means that a new point is assigned a value based on how closely it resembles the points in the training set. This can be very useful for imputing missing values: find the k closest neighbors of the observation with missing data and then impute them based on the non-missing values in the neighborhood.
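A minimal sketch with scikit-learn's KNNImputer (the array is illustrative):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])

imputer = KNNImputer(n_neighbors=2)    # impute from the 2 most similar rows
print(imputer.fit_transform(X))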

13
Pros and Cons
 Pros:
• Can be much more accurate than the mean, median
or most frequent imputation methods (It depends on
the dataset).
 Cons:
• Computationally expensive. KNN works by storing
the whole training dataset in memory.
• K-NN is quite sensitive to outliers in the data (unlike
SVM)

14
Handling outlier
• Incorrect data entry or error during data processing
• Missing values in a dataset.
• Data did not come from the intended sample.
• Errors occur during experiments.
• Not an error; the value may simply be unusual compared with the rest of the data.
• A distribution more extreme than normal.

15
Handling outlier
Univariate method:
 Univariate analysis is the simplest form of analyzing data.
“Uni” means “one”, so in other words your data has only one
variable.
 It doesn't deal with causes or relationships (unlike regression), and its major purpose is to describe: it takes data, summarizes that data, and finds patterns in the data.

 Univariate and multivariate represent two approaches to


statistical analysis.
 Univariate involves the analysis of a single variable
while multivariate analysis examines two or more variables.
 Most multivariate analysis involves a dependent variable and
multiple independent variables.

16
Handling outlier with Z score
 The Z-score is the signed number of standard deviations by which
the value of an observation or data point is above the mean value of
what is being observed or measured.
 Z score is an important concept in statistics. Z score is also called
standard score. This score helps to understand if a data value is
greater or smaller than mean and how far away it is from the mean.
More specifically, Z score tells how many standard deviations away a
data point is from the mean.

 The intuition behind Z-score is to describe any data point by finding


their relationship with the Standard Deviation and Mean of the
group of data points. Z-score is finding the distribution of data
where mean is 0 and standard deviation is 1 i.e. normal distribution.
 Z score = (x -mean) / std. deviation
 If the z score of a data point is more than 3, it indicates that the data
point is quite different from the other data points. Such a data point
can be an outlier.
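A minimal sketch of z-score based outlier detection with NumPy (the data values are illustrative):

import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 12, 10, 13, 11, 12, 95])   # 95 is far from the rest

z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 3]    # points more than 3 standard deviations from the mean
print(outliers)                          # [95]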
17
Binning
 Data binning, bucketing is a data pre-processing method
used to minimize the effects of small observation errors.
 The original data values are divided into small intervals
known as bins and then they are replaced by a general
value calculated for that bin.
 This has a smoothing effect on the input data and may
also reduce the chances of overfitting in case of small
datasets.
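A minimal sketch of binning with pandas (the ages and bin edges are illustrative):

import pandas as pd

ages = pd.DataFrame({'age': [3, 17, 25, 41, 60, 82]})

# Replace raw ages with the general label of the bin they fall into
ages['age_group'] = pd.cut(ages['age'], bins=[0, 18, 40, 65, 100],
                           labels=['child', 'young', 'middle-aged', 'senior'])
print(ages)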

18
Log Transform
 The Log Transform is one of the most popular
Transformation techniques out there.
 It is primarily used to convert a skewed distribution to a
normal distribution/less-skewed distribution.
 In this transform, we take the log of the values in a
column and use these values as the column instead.
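A minimal sketch of the log transform with NumPy and pandas (the income column is illustrative; log1p, i.e. log(1 + x), is used so that zeros are handled safely):

import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [20000, 35000, 50000, 1200000]})   # right-skewed column
df['income_log'] = np.log1p(df['income'])                       # much less skewed
print(df)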

19
Standard Scaler
 The Standard Scaler is another popular scaler that is very
easy to understand and implement.
 For each feature, the Standard Scaler scales the values
such that the mean is 0 and the standard deviation is 1(or
the variance).
x_scaled = (x − mean) / std_dev

 However, Standard Scaler assumes that the distribution of the variable is normal. Thus, in case the variables are not normally distributed, we either choose a different scaler or first convert the variables to a normal distribution and then apply this scaler.

20
import pandas as pd
from sklearn.preprocessing import StandardScaler

features = pd.DataFrame({'age': [20, 30, 40, 50], 'salary': [20000, 40000, 60000, 80000]})  # illustrative data
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(features.values), columns=features.columns)
print(df_scaled)

21
One-Hot Encoding

 A one hot encoding allows the representation of categorical data to be more expressive.
 Many machine learning algorithms cannot work with
categorical data directly.
 The categories must be converted into numbers.
 This is required for both input and output variables that
are categorical.
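A minimal sketch of one-hot encoding with pandas (the colour column is illustrative):

import pandas as pd

df = pd.DataFrame({'colour': ['red', 'green', 'blue', 'green']})

# Each category becomes its own 0/1 column
one_hot = pd.get_dummies(df['colour'], prefix='colour')
print(one_hot)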

22
Feature subset selection
 Feature Selection is the most critical pre-processing
activity in any machine learning process. It intends to
select a subset of attributes or features that makes the
most meaningful contribution to a machine learning
activity.

23
High dimensional data
 High Dimensional refers to the high number of variables or
attributes or features present in certain data sets, more so in the
domains like DNA analysis, geographic information system (GIS),
etc. It may sometimes have hundreds or thousands of dimensions, which is not good from the machine learning aspect because it can be a big challenge for any ML algorithm to handle. Moreover, a large amount of computation and time will be required, and a model built on an extremely high number of features may be very difficult to understand. For these reasons, it is necessary to take a subset of the features instead of the full set. So we can deduce that the objectives of feature selection are:
1. Having a faster and more cost-effective (less need for computational
resources) learning model
2. Having a better understanding of the underlying model that generates
the data.
3. Improving the efficacy of the learning model.

24
Feature subset selection methods
1. Wrapper methods
 Wrapping methods compute models with a certain subset of
features and evaluate the importance of each feature.
 Then they iterate and try a different subset of features until the
optimal subset is reached.
 Two drawbacks of this method are the large computation time
for data with many features, and that it tends to overfit the
model when there is not a large amount of data points.
 The most notable wrapper methods of feature selection
are forward selection, backward selection, and stepwise
selection.

25
Feature subset selection methods
1. Wrapper methods
 Forward selection starts with zero features; then, for each individual feature, it runs a model and determines the p-value associated with the t-test or F-test performed. It then selects the feature with the lowest p-value and adds it to the working model (a minimal code sketch of forward selection follows this list).
 Backward selection starts with all features contained in the
dataset. It then runs a model and calculates a p-value
associated with the t-test or F-test of the model for each
feature.
 Stepwise selection is a hybrid of forward and backward
selection. It starts with zero features and adds the one feature
with the lowest significant p-value as described above.
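A minimal sketch of forward selection with scikit-learn's SequentialFeatureSelector (available in recent scikit-learn versions). Note that this implementation scores candidate features by cross-validation rather than by the t-test or F-test p-values described above; the dataset and estimator are illustrative assumptions:

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Start from zero features and greedily add the ones that help a linear model most
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4, direction='forward')
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected features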
26
Feature subset selection methods
2. Filter methods
 Filter methods use a measure other than error rate to
determine whether that feature is useful.
 Rather than tuning a model (as in wrapper methods), a subset
of the features is selected through ranking them by a useful
descriptive measure.
 Benefits of filter methods are that they have a very low
computation time and will not overfit the data.
 However, one drawback is that they are blind to any
interactions or correlations between features.
 This will need to be taken into account separately, which will
be explained below. Three different filter methods
are ANOVA, Pearson correlation, and variance
thresholding.

27
Feature subset selection methods
2. Filter methods
 The ANOVA (Analysis of variance) test looks at the variation within the treatments of a feature and also between the treatments.
 The Pearson correlation coefficient is a measure of the
similarity of two features that ranges between -1 and 1. A value
close to 1 or -1 indicates that the two features have a high
correlation and may be related.
 The variance of a feature determines how much predictive power it contains. The lower the variance, the less information the feature contains, and the less value it has in predicting the response variable (a minimal code sketch of these filter measures is shown after this list).
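A minimal sketch of these filter measures with scikit-learn (the iris dataset and the thresholds are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

X, y = load_iris(return_X_y=True)

# ANOVA F-test: keep the 2 features most strongly related to the class label
X_best = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Variance threshold: drop features whose variance falls below 0.5
X_var = VarianceThreshold(threshold=0.5).fit_transform(X)

print(X_best.shape, X_var.shape)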

28
Feature subset selection methods
3. Embedded Methods
 Embedded methods perform feature selection as a part of the
model creation process.
 This generally leads to a happy medium between the two
methods of feature selection previously explained, as the
selection is done in conjunction with the model tuning
process.
 Lasso and Ridge regression are the two most common
feature selection methods of this type, and Decision tree also
creates a model using different types of feature selection.

29
Feature subset selection methods
3. Embedded Methods
 Lasso Regression is another way to penalize the beta coefficients in a
model, and is very similar to Ridge regression. It also adds a penalty term
to the cost function of a model, with a lambda value that must be tuned.
 The smaller number of features a model has, the lower the complexity.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso

# Illustrative data; in practice use your own standardized features
X_train, X_test, y_train, y_test = train_test_split(*load_diabetes(return_X_y=True), random_state=0)
lasso = Lasso()
lasso.fit(X_train, y_train)
train_score = lasso.score(X_train, y_train)
test_score = lasso.score(X_test, y_test)
coeff_used = np.sum(lasso.coef_ != 0)   # number of features kept by the L1 penalty
 An important note for Ridge and Lasso regression is that all of your features must be standardized.

30
Feature subset selection methods
3. Embedded Methods
 Ridge regression performs feature selection by penalizing the beta coefficients of a model for being too large. Basically, it scales back the strength of correlation with variables that may not be as important as others. Ridge regression is done by adding a penalty term (also called a ridge estimator or shrinkage estimator) to the cost function of the regression. The penalty term takes all of the betas and scales them by a term lambda (λ) that must be tuned (usually with cross-validation, which compares the same model with different values of lambda).
from sklearn.linear_model import Ridge
# X_train and y_train as prepared in the Lasso example above
rr = Ridge(alpha=0.01)
rr.fit(X_train, y_train)
print(rr.coef_)   # shrunken regression coefficients

31
32
 https://seleritysas.com/blog/2019/12/12/types-of-predictive-analytics-models-and-how-they-work/
 https://towardsdatascience.com/selecting-the-correct-predictive-modeling-technique-ba459c370d59
 https://www.netsuite.com/portal/resource/articles/financial-management/predictive-modeling.shtml
 https://www.dezyre.com/article/types-of-analytics-descriptive-predictive-prescriptive-analytics/209#toc-2
 https://www.sciencedirect.com/topics/computer-science/descriptive-model
 https://towardsdatascience.com/intro-to-feature-selection-methods-for-data-science-4cae2178a00a

33 Prof. Monali Suthar (SOCET-CE)


Silver Oak College Of Engineering And Technology

Unit 5 :
Overview of Probability :

1
Outline

 Statistical tools in Machine Learning,


 Concepts of probability,
 Random variables,
 Discrete distributions,
 Continuous distributions,
 Multiple random variables,
 Central limit theorem,
 Sampling distributions,
 Hypothesis testing,
 Monte Carlo Approximation

2
Concepts of probability
 Probability represents the certainty factor: the degree of certainty that you would assign to an event happening.
 Probability is the Bedrock of Machine Learning.
 Algorithms are designed using probability (e.g. Naive Bayes).
 Learning algorithms will make decisions using probability (e.g. information
gain).
 Sub-fields of study are built on probability (e.g. Bayesian networks).
1. Probability of a union of two events: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
2. Joint probability: P(A, B) = P(A ∩ B) = P(A | B) P(B)
3. Conditional probability: P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0
4. Bayes rule: P(A | B) = P(B | A) P(A) / P(B)

3
Probability Theory – Terminology
 Random Experiment – This is an experiment in which the outcome is
not known with certainty.
 Sample Space – This is the universal set that consists of all possible
outcomes of an experiment. It is usually represented using the letter “S”.
Individual outcomes are called elementary events. Sample space can be
finite or infinite.
 Event – It is a subset of a sample space and the probability is usually
calculated with respect to an event. Examples of events include:
 Number of cancellations of orders placed at an e-commerce portal exceeding 10%.
 The number of fraudulent credit card transactions exceeding 1%.

4
Random variables
 Random variables play an important role in describing, measuring, and
analyzing uncertain events such as customer churn, employee attrition, and
demand for a product. It is a function that maps every outcome in the
sample space to a real number.
 If random variable X can assume only a finite or countably infinite set of
values, then it is a discrete random variable. E.g., number of orders received
at an e-commerce retailer. These variables are described using probability
mass function (PMF) and cumulative distribution function (CDF).
 Random variable X that can take a value from an infinite set of values is a
continuous random variable. E.g., percentage of attrition of employees.
Continuous random variables are described using probability density
function (PDF) and cumulative distribution function (CDF).
 PDF is the probability that a continuous random variable takes value in a
small neighborhood of “x”:

5
Continuous random variables
 Suppose X is some uncertain continuous quantity. The probability that X
lies in any interval a ≤ X ≤ b can be computed as follows. Define the events
A = (X ≤ a), B = (X ≤ b) and W = (a < X ≤ b). We have that B = A ∨ W, and since A and W are mutually exclusive, the sum rule gives p(B) = p(A) + p(W), hence p(a < X ≤ b) = p(X ≤ b) − p(X ≤ a).
 Define the function F(q) ≜ p(X ≤ q). This is called the cumulative distribution function or cdf of X. This is obviously a monotonically increasing function.

 Using this notation we have p(a < X ≤ b) = F(b) − F(a).
6
Continuous random variables
 Now define f(x) = (d/dx) F(x) (we assume this derivative exists); this is called the probability density function or pdf.

7
Binomial Distribution
 Binomial distribution is a discrete probability distribution.
 It has several applications in many business contexts.
 Random variable X is said to follow a binomial distribution when:
1. The random variable can have only two outcomes − success and failure
(also known as Bernoulli trials).
2. The objective is to find the probability of getting x successes out of n
trials.
3. The probability of success is p and thus the probability of failure is (1 −
p).
4. The probability p is constant and does not change between trials.
 The PMF of the binomial distribution (probability that the number of successes will be exactly x out of n trials) is given by
PMF(x) = C(n, x) p^x (1 − p)^(n − x), where C(n, x) = n! / (x! (n − x)!)

 The CDF of the binomial distribution (probability that the number of successes will be x or less than x out of n trials) is given by
CDF(x) = Σ (k = 0 to x) C(n, k) p^k (1 − p)^(n − k)
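A quick numerical check of these formulas with SciPy (the values n = 10, p = 0.3 and x = 4 are illustrative):

from scipy.stats import binom

n, p, x = 10, 0.3, 4
print(binom.pmf(x, n, p))   # probability of exactly 4 successes in 10 trials
print(binom.cdf(x, n, p))   # probability of 4 or fewer successes in 10 trials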

8
Poisson Distribution
 In many situations, we may be interested in calculating the number of
events that may occur over a period of time or space.
 E.g., number of cancellation of orders by customers at an e-commerce
portal, number of customer complaints, number of cash withdrawals at an
ATM, number of typographical errors in a book, number of potholes on
Bangalore roads
 To find the probability of the number of events, we use the Poisson distribution. The PMF of a Poisson distribution is given by
PMF(x) = (e^(−λ) λ^x) / x!, where λ is the average number of events in the interval.

9
Exponential Distribution
 Exponential distribution is a single parameter
continuous distribution that is traditionally used for
modeling time-to-failure of electronic components.
 It represents a process in which events occur
continuously and independently at a constant average
rate.
 The probability density function is given by f(x) = λ e^(−λx) for x ≥ 0, where λ is the rate parameter.

10
Normal DISTRIBUTION
 Normal distribution is also known as Gaussian distribution or bell curve (as
it is shaped like a bell).
 It is one of the most popular continuous distribution in the field of analytics
especially due to its use in multiple contexts.
 Normal distribution is observed across many naturally occurring measures
such as age, salary, sales volume, birth weight and height.
 Normal distribution is parameterized by two parameters: the mean of the distribution µ and the variance σ².

The sample mean of a normal distribution is given by x̄ = (1/n) Σᵢ xᵢ.

11
Central Limit Theorem
 It is one of the most important theorems in statistics.
 CLT is key to hypothesis testing, which primarily deals with sampling
distribution.
 Let S1, S2, …, Sk be samples of size n drawn from an independent and
identically distributed population with mean µ and standard deviation σ.
 Let X1, X2, …, Xk be the sample means (of the samples S1, S2, …, Sk).
 According to the CLT, the distribution of X1, X2, …, Xk follows a normal
distribution with mean µ and standard deviation of σ/√n.

12
Hypothesis Test
 Hypothesis testing consists of two complementary statements - null hypothesis and
alternative hypothesis.
Null hypothesis is an existing belief and alternate hypothesis is what we intend to establish
with new evidences (samples).
 Objective of hypothesis testing is to either reject or retain a null hypothesis with the help of
data.
 Hypothesis tests are broadly classified into parametric tests and non-parametric tests.
1. Parametric tests are about population parameters of a distribution such as mean,
proportion, and standard deviation.
2. Non-parametric tests are about other characteristics such as independence of events or data following certain distributions such as the normal distribution.
Steps for hypothesis tests:
1. Define null and alternative hypotheses. Normally, H0 is used to denote null hypothesis and
HA for alternate hypothesis.
2. Identify the test statistic to be used for testing the validity of the null hypothesis (e.g., Z-test
or t-test).
3. Decide the criteria for rejection and retention of null hypothesis. This is called significance
value (α). Typical value used for α is 0.05.
4. Calculate the p-value, which is the conditional probability of observing the test statistic value
when the null hypothesis is true.
5. Take the decision to reject or retain the null hypothesis based on the p-value and α (a minimal sketch using SciPy follows).
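A minimal sketch of these steps using SciPy's one-sample t-test (the sample data, the hypothesized mean of 50, and α = 0.05 are illustrative assumptions):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=52, scale=10, size=30)    # illustrative sample data

# Step 1: H0: population mean = 50, HA: population mean != 50
# Step 2: the test statistic is a one-sample t-test
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

# Steps 3-5: compare the p-value with the significance level
alpha = 0.05
print('p-value:', p_value)
print('Reject H0' if p_value < alpha else 'Retain H0')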

13
Analysis Of Variance (Anova)
 One-way ANOVA can be used to study the impact of a single treatment
(also known as factor) at different levels (thus forming different groups) on
a continuous response variable (or outcome variable).
 The null and alternative hypotheses for one-way ANOVA comparing 3 groups are given by H0: µ1 = µ2 = µ3 and HA: not all of the means are equal.

14
Monte Carlo Approximation
 Monte Carlo methods are a class of techniques for randomly sampling a
probability distribution.
 Often, we cannot calculate a desired quantity in probability, but we can
define the probability distributions for the random variables directly or
indirectly.
 Monte Carlo sampling is a class of methods for randomly sampling from a probability distribution.
 Monte Carlo sampling provides the foundation for many machine learning
methods such as resampling, hyperparameter tuning, and ensemble learning.
 In principle, Monte Carlo methods can be used to solve any problem having
a probabilistic interpretation.
 By the law of large numbers, integrals described by the expected value of some random variable can be approximated by taking the empirical mean (a.k.a. the sample mean) of independent samples of the variable (a minimal sketch of this idea is shown below).
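A minimal sketch of this idea with NumPy, approximating E[X²] for X ~ Normal(0, 1), whose exact value is 1 (the sample size is an illustrative choice):

import numpy as np

rng = np.random.default_rng(42)

samples = rng.normal(loc=0.0, scale=1.0, size=100_000)   # independent draws of X
mc_estimate = np.mean(samples ** 2)                      # empirical mean of X^2
print(mc_estimate)                                       # close to the exact value 1.0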

15
Monte Carlo Approximation
 Need for Sampling
 There are many problems in probability, and more broadly in machine
learning, where we cannot calculate an analytical solution directly.
 In fact, there may be an argument that exact inference may be intractable
for most practical probabilistic models.
 Sampling provides a flexible way to approximate many sums and
integrals at reduced cost.

16
Monte Carlo Methods
 Monte Carlo methods, or MC for short, are a class of
techniques for randomly sampling a probability distribution.
 There are three main reasons to use Monte Carlo methods to
randomly sample a probability distribution; they are:
 Estimate density, gather samples to approximate the distribution of a
target function.
 Approximate a quantity, such as the mean or variance of a
distribution.
 Optimize a function, locate a sample that maximizes or minimizes the
target function.

17
Monte Carlo Methods
 Monte Carlo methods are defined in terms of the way that samples are drawn
or the constraints imposed on the sampling process.
 Some examples of Monte Carlo sampling methods include: direct
sampling, importance sampling, and rejection sampling.
 Direct Sampling. Sampling the distribution directly without prior
information.
 Importance Sampling. Sampling from a simpler approximation of the
target distribution.
 Rejection Sampling. Sampling from a broader distribution and only
considering samples within a region of the sampled distribution.
 It’s a huge topic with many books dedicated to it. Next, let’s make the idea of
Monte Carlo sampling concrete with some familiar examples.
 For example, Monte Carlo methods can be used for:
1. Calculating the probability of a move by an opponent in a complex game.
2. Calculating the probability of a weather event in the future.
3. Calculating the probability of a vehicle crash under specific conditions.

18
Monte Carlo Methods

19
20
 https://machinelearningmastery.com/monte-carlo-sampling-for-probability/
 https://seleritysas.com/blog/2019/12/12/types-of-predictive-analytics-models-and-how-they-work/
 https://towardsdatascience.com/selecting-the-correct-predictive-modeling-technique-ba459c370d59
 https://www.netsuite.com/portal/resource/articles/financial-management/predictive-modeling.shtml
 https://www.dezyre.com/article/types-of-analytics-descriptive-predictive-prescriptive-analytics/209#toc-2
 https://www.sciencedirect.com/topics/computer-science/descriptive-model
 https://towardsdatascience.com/intro-to-feature-selection-methods-for-data-science-4cae2178a00a

21 Prof. Monali Suthar (SOCET-CE)
