Machine Learning
Unit 1 :
Introduction to Machine Learning
Learning System
Regression Trees
Non-Linear Regression
Bayesian Linear Regression
Polynomial Regression
Disadvantages :
Input Data: supervised learning uses known and labeled data as input, whereas unsupervised learning uses unknown (unlabeled) data as input.
Computational Complexity: supervised learning is the simpler method, whereas unsupervised learning is computationally complex.
Learning System
Features:
It helps in training and building your models.
You can run your existing models with the help of TensorFlow.js, which includes a model converter.
It helps in building neural networks.
Pros:
It can be used in two ways, i.e. via script tags or by installing through NPM.
It can even help with human pose estimation.
Unit 2 :
Preparing to Model
Problem or Opportunity
Identification
Feature Extraction
Data Preprocessing
Model Building
More input features often make a predictive modeling task more challenging to model; this is generally referred to as the curse of dimensionality.
4. Sort the eigenvectors from highest to lowest eigenvalue and select the number of principal components to retain.
print(pca.explained_variance_ratio_)
print(pca.singular_values_)
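For context, a minimal sketch of where these print statements fit, assuming an illustrative data matrix X (any real dataset with samples as rows and features as columns would do):
import numpy as np
from sklearn.decomposition import PCA
# Illustrative data matrix (rows = samples, columns = features)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
pca = PCA(n_components=2)   # keep the top 2 principal components
pca.fit(X)
print(pca.explained_variance_ratio_)  # fraction of variance explained by each component
print(pca.singular_values_)           # corresponding singular values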
Unit 3 :
Modeling and Evaluation:
Basic: Descriptive analytics determines what happened in the past by analyzing stored data, whereas predictive analytics determines what can happen in the future with the help of past data analysis.
Application of Predictive method
Process of Predictive model
Step 1: Data collection and purification: Data is accumulated from all the sources to extract the required information, and it is cleaned with operations that eliminate noisy data to get accurate estimations. Sources include transaction and customer-assistance data, survey data, and economic data.
Process of Predictive model
Step 2: Data transformation: Data needs to be transformed through appropriate processing to obtain normalized data. The values are scaled to a given range, and extraneous elements are removed through correlation analysis so that they do not affect the final decision.
Process of Predictive model
Step 3: Formulation of the predictive model: A predictive model often employs regression techniques or a classification algorithm. During this process, test data is identified, and classification decisions are applied to the test data to determine the performance of the model.
Process of Predictive model
Step 4: Performance analysis or conclusion: Finally, inferences are drawn from the model; for this, cluster analysis may be performed. After the model is built, ongoing analysis is important for maintaining it.
Steps in building regression model
STEP 1: Collect/Extract Data
The first step in building a regression model is to collect or extract data on the dependent (outcome) variable and independent (feature) variables from different data sources. Data collection in many cases can be time-consuming and expensive, even when the organization has a well-designed enterprise resource planning (ERP) system.
STEP 2: Pre-Process the Data
Before the model is built, it is essential to ensure the quality of the data for issues such as
reliability, completeness, usefulness, accuracy, missing data, and outliers.
1. Data imputation techniques may be used to deal with missing data. Descriptive statistics and visualization (such as box plots and scatter plots) may be used to identify the existence of outliers and variability in the dataset.
Steps in building regression model
2. Many new variables (such as the ratio of variables or product of variables) can be derived (aka
feature engineering) and also used in model building.
3. Categorical data must be pre-processed using dummy variables (part of feature engineering) before it is used in the regression model.
Steps in building regression model
STEP 5: Build the Model
The model is built using the training dataset to estimate the regression parameters. The method of Ordinary Least Squares (OLS) is used to estimate the regression parameters.
STEP 6: Perform Model Diagnostics
Regression is often misused since many times the modeler fails to perform the necessary diagnostic tests before applying the model. Before it can be applied, it is necessary that the model created is validated for all model assumptions, including the definition of the functional form. If the model assumptions are violated, then the modeler must use remedial measures.
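As a minimal sketch of Steps 5 and 6 (assuming a hypothetical pandas DataFrame df with a feature column x and an outcome column y), statsmodels can estimate the OLS parameters and print diagnostic output:
import statsmodels.api as sm
# df, 'x' and 'y' are hypothetical names used only for illustration
X = sm.add_constant(df[['x']])        # add the intercept term
ols_model = sm.OLS(df['y'], X).fit()  # estimate regression parameters by OLS
print(ols_model.summary())            # diagnostics: R-squared, t-tests, residual statistics
The summary output is a convenient starting point for the diagnostic checks described in Step 6.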
linear regression model
Linear regression is a quite simple statistical regression method used for predictive analysis; it shows the relationship between continuous variables.
Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis), hence the name linear regression.
If there is a single input variable (x), it is called simple linear regression; if there is more than one input variable, it is called multiple linear regression.
The linear regression model gives a sloped straight line describing the relationship between the variables.
Cost function
A cost function, also called a loss function, is used to define and measure the error of a model. The
differences between the prices predicted by the model and the observed prices of the pizzas in the
training set are called residuals or training errors.
Cost function optimizes the regression coefficients or weights and measures how a linear
regression model is performing. The cost function is used to find the accuracy of the mapping
function that maps the input variable to the output variable. This mapping function is also known
as the Hypothesis function.
In linear regression, the Mean Squared Error (MSE) cost function is used, which is the average of the squared errors between the predicted values and the actual values.
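For reference, a small sketch of computing MSE; the predicted and actual values below are made-up numbers:
import numpy as np
y_actual = np.array([7.0, 9.0, 13.0, 17.5, 18.0])     # observed values (illustrative)
y_predicted = np.array([8.0, 9.5, 12.0, 16.5, 19.0])  # model predictions (illustrative)
mse = np.mean((y_actual - y_predicted) ** 2)          # average of squared errors
print('MSE:', mse)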
EXAMPLE:
Let's assume that you have recorded the diameters and prices of pizzas that
you have previously eaten in your pizza journal. These observations
comprise our training data
EXAMPLE:
from sklearn.linear_model import LinearRegression
# Training data
X = [[6], [8], [10], [14], [18]]
y = [[7], [9], [13], [17.5], [18]]
# Create and fit the model
model = LinearRegression()
model.fit(X, y)
print('A 12" pizza should cost: $%.2f' % model.predict([[12]])[0][0])
A 12" pizza should cost: $13.68
EVALUATING THE FITNESS OF MODEL
We can produce the best pizza-price predictor by minimizing the sum of the squared residuals. That is, our model fits if the values it predicts for the response variable are close to the observed values for all of the training examples. This measure of the model's fitness is called the residual sum of squares (RSS) cost function. Formally, this function assesses the fitness of a model by summing the squared residuals for all of our training examples:
RSS = sum over i of (y_i − f(x_i))²
where y_i is the observed value and f(x_i) is the predicted value for the i-th training example.
EVALUATING THE MODEL
R-squared measures how well the observed values of the response variable are predicted by the model. More concretely, r-squared is the proportion of the variance in the response variable that is explained by the model. An r-squared score of one indicates that the response variable can be predicted without any error using the model.
CALCULATION
PYTHON IMPLEMENTATION
from sklearn.linear_model import LinearRegression
X = [[6], [8], [10], [14], [18]]
y = [[7], [9], [13], [17.5], [18]]
X_test = [[8], [9], [11], [16], [12]]
y_test = [[11], [8.5], [15], [18], [11]]
model = LinearRegression()
model.fit(X, y)
print('R-squared: %.4f' % model.score(X_test, y_test))
Unit 4 :
Basics of Feature Engineering:
Outline
Feature and Feature Engineering,
Feature transformation:
Construction
Extraction,
Feature subset selection :
Issues in high-dimensional data,
key drivers,
measure
overall process
Feature and Feature Engineering
Features are the inputs in machine learning, usually in the form of structured columns.
Algorithms require features with some specific characteristics to work properly.
What is Feature Engineering?
Feature engineering is the process of transforming raw data into
features that better represent the underlying problem to the
predictive models, resulting in improved model accuracy on unseen
data.
Goals of Feature Engineering
1. Preparing the proper input dataset, compatible with the
machine learning algorithm requirements.
2. Improving the performance of machine learning models.
Feature Engineering Category
Feature Engineering is divided into 3 broad categories:
I. Feature Selection:
It is all about selecting a small subset of features from a large pool of
features.
We select those attributes which best explain the relationship of an
independent variable with the target variable.
There are certain features which are more important than other
features to the accuracy of the model.
It is different from dimensionality reduction: dimensionality reduction reduces the number of attributes by combining existing attributes, whereas feature selection simply includes or excludes features.
Ex: Chi-squared test, correlation coefficient scores, LASSO, Ridge
regression etc.
Feature Engineering Category
II. Feature Transformation:
It means transforming our original feature to the functions of
original features.
Ex: Scaling, discretization, binning and filling missing data values are
the most common forms of data transformation.
To reduce right skewness of the data, we use the log transform.
III. Feature Extraction:
When the data to be processed by an algorithm is too large, much of it is generally redundant.
Analysis with a large number of variables uses a lot of computation
power and memory, therefore we should reduce the dimensionality
of these types of variables.
It is a term for constructing combinations of the variables.
For tabular data, we use PCA to reduce features.
For image, we can use line or edge detection.
Feature transformation
Feature transformation is the process of modifying
your data but keeping the information.
These modifications make the data easier for machine learning algorithms to understand, which delivers better results.
But why would we transform our features?
data types are not suitable to be fed into a machine learning
algorithm, e.g. text, categories
feature values may cause problems during the learning process,
e.g. data represented in different scales
we want to reduce the number of features to plot and visualize
data, speed up training or improve the accuracy of a specific
model
Feature Engineering Techniques
List of Techniques
1. Imputation
2. Handling Outliers
3. Binning
4. Log Transform
5. One-Hot Encoding
6. Grouping Operations
7. Feature Split
8. Scaling
9. Extracting Date
Imputation Using (Mean/Median) Values
This works by calculating the mean/median of the
non-missing values in a column and then replacing
the missing values within each column separately
and independently from the others. It can only be
used with numeric data.
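A minimal sketch using scikit-learn's SimpleImputer; the small array with missing values is illustrative, and strategy can also be 'median', 'most_frequent', or 'constant':
import numpy as np
from sklearn.impute import SimpleImputer
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])  # illustrative data with missing values
imputer = SimpleImputer(strategy='mean')   # replace NaNs with the column mean
print(imputer.fit_transform(X))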
Pros and Cons
Pros:
• Easy and fast.
• Works well with small numerical datasets.
Cons:
• Doesn't factor the correlations between features. It only works on the column level.
• Will give poor results on encoded categorical features (do NOT use it on categorical features).
• Not very accurate.
• Doesn't account for the uncertainty in the imputations.
Imputation Using (Most Frequent) or
(Zero/Constant) Values:
Most Frequent is another statistical strategy to
impute missing values and YES!! It works with
categorical features (strings or numerical
representations) by replacing missing data with the
most frequent values within each column.
Pros:
• Works well with categorical features.
Cons:
• It also doesn't factor the correlations between
features.
• It can introduce bias in the data.
Imputation Using k-NN
The k-nearest neighbors (k-NN) algorithm is used for simple classification. The algorithm uses 'feature similarity' to predict the values of any new data points.
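A minimal sketch of k-NN-based imputation using scikit-learn's KNNImputer; the data matrix is illustrative:
import numpy as np
from sklearn.impute import KNNImputer
X = np.array([[1.0, 2.0, np.nan], [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0], [8.0, 8.0, 7.0]])  # illustrative data with missing values
imputer = KNNImputer(n_neighbors=2)  # impute each missing value from the 2 nearest rows
print(imputer.fit_transform(X))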
Pros and Cons
Pros:
• Can be much more accurate than the mean, median
or most frequent imputation methods (It depends on
the dataset).
Cons:
• Computationally expensive. KNN works by storing
the whole training dataset in memory.
• K-NN is quite sensitive to outliers in the data (unlike
SVM)
Handling outlier
• Incorrect data entry or error during data processing
• Missing values in a dataset.
• Data did not come from the intended sample.
• Errors occur during experiments.
• Not an error, but a value that is unusual compared to the original data.
• A more extreme distribution than normal.
Handling outlier
Univariate method:
Univariate analysis is the simplest form of analyzing data.
“Uni” means “one”, so in other words your data has only one
variable.
It doesn't deal with causes or relationships (unlike regression), and its major purpose is to describe: it takes data, summarizes that data, and finds patterns in the data.
Handling outlier with Z score
The Z-score is the signed number of standard deviations by which
the value of an observation or data point is above the mean value of
what is being observed or measured.
Z score is an important concept in statistics. Z score is also called
standard score. This score helps to understand if a data value is
greater or smaller than mean and how far away it is from the mean.
More specifically, Z score tells how many standard deviations away a
data point is from the mean.
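A minimal sketch of flagging outliers by z-score; the sample values and the |z| > 2 cutoff are illustrative (|z| > 3 is also common):
import numpy as np
data = np.array([10, 12, 11, 13, 12, 11, 95])  # illustrative sample with one extreme value
z_scores = (data - data.mean()) / data.std()   # signed distance from the mean in standard deviations
outliers = data[np.abs(z_scores) > 2]          # flag points more than 2 standard deviations away
print(outliers)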
Log Transform
The Log Transform is one of the most popular
Transformation techniques out there.
It is primarily used to convert a skewed distribution to a
normal distribution/less-skewed distribution.
In this transform, we take the log of the values in a
column and use these values as the column instead.
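A minimal sketch of the log transform; the right-skewed values are illustrative, and log1p (log of 1 + x) is used so that zeros do not cause errors:
import numpy as np
values = np.array([1, 10, 100, 1000, 10000])  # illustrative right-skewed column
log_values = np.log1p(values)                 # compressed, less-skewed version of the column
print(log_values)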
Standard Scaler
The Standard Scaler is another popular scaler that is very
easy to understand and implement.
For each feature, the Standard Scaler scales the values such that the mean is 0 and the standard deviation is 1 (and hence the variance is also 1).
x_scaled = (x – mean) / std_dev
from sklearn.preprocessing import StandardScaler
# features: a DataFrame of numeric columns; col_names: a list of those column names (assumed to be defined)
scaler = StandardScaler()
df_scaled = features.copy()
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled
One-Hot Encoding
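One-hot encoding converts a categorical feature into one binary column per category, so that algorithms expecting numeric input can use it. A minimal sketch with pandas; the city column is illustrative:
import pandas as pd
df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Delhi', 'Chennai']})  # illustrative categorical column
one_hot = pd.get_dummies(df['city'], prefix='city')  # one binary column per category
print(one_hot)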
Feature subset selection
Feature Selection is the most critical pre-processing
activity in any machine learning process. It intends to
select a subset of attributes or features that makes the
most meaningful contribution to a machine learning
activity.
High dimensional data
High Dimensional refers to the high number of variables or
attributes or features present in certain data sets, more so in the
domains like DNA analysis, geographic information system (GIS),
etc. Such data sets may sometimes have hundreds or thousands of dimensions, which is not good from the machine learning aspect because handling them is a big challenge for any ML algorithm. Moreover, a high amount of computation and a high amount of time will be required. Also, a model built on an extremely high number of features may be very difficult to understand. For these reasons, it is necessary to take a subset of the features instead of the full set. So we can deduce that the objectives of feature selection are:
1. Having a faster and more cost-effective (less need for computational
resources) learning model
2. Having a better understanding of the underlying model that generates
the data.
3. Improving the efficacy of the learning model.
Feature subset selection methods
1. Wrapper methods
Wrapper methods compute models with a certain subset of
features and evaluate the importance of each feature.
Then they iterate and try a different subset of features until the
optimal subset is reached.
Two drawbacks of this method are the large computation time
for data with many features, and that it tends to overfit the
model when there is not a large amount of data points.
The most notable wrapper methods of feature selection
are forward selection, backward selection, and stepwise
selection.
Feature subset selection methods
1. Wrapper methods
Forward selection starts with zero features, then, for each
individual feature, runs a model and determines the p-value
associated with the t-test or F-test performed. It then selects
the feature with the lowest p-value and adds that to the
working model.
Backward selection starts with all features contained in the
dataset. It then runs a model and calculates a p-value
associated with the t-test or F-test of the model for each
feature.
Stepwise selection is a hybrid of forward and backward
selection. It starts with zero features and adds the one feature
with the lowest significant p-value as described above.
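As a hedged sketch of forward and backward selection, scikit-learn's SequentialFeatureSelector can be used (note that this implementation ranks candidate features by cross-validated score rather than by p-values); the estimator and dataset here are illustrative:
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
X, y = load_diabetes(return_X_y=True)
# Forward selection: start with zero features and greedily add the best one at each step
selector = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3, direction='forward')
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected features (use direction='backward' for backward selection)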
Feature subset selection methods
2. Filter methods
Filter methods use a measure other than error rate to
determine whether that feature is useful.
Rather than tuning a model (as in wrapper methods), a subset
of the features is selected through ranking them by a useful
descriptive measure.
Benefits of filter methods are that they have a very low
computation time and will not overfit the data.
However, one drawback is that they are blind to any
interactions or correlations between features.
This will need to be taken into account separately, which will
be explained below. Three different filter methods
are ANOVA, Pearson correlation, and variance
thresholding.
Feature subset selection methods
2. Filter methods
The ANOVA (Analysis of Variance) test looks at the variation within the treatments of a feature and also between the treatments.
The Pearson correlation coefficient is a measure of the
similarity of two features that ranges between -1 and 1. A value
close to 1 or -1 indicates that the two features have a high
correlation and may be related.
The variance of a feature determines how much predictive
power it contains. The lower the variance is, the less
information contained in the feature, and the less value it has in
predicting the response variable.
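A brief sketch of two of these filters using scikit-learn and pandas; the tiny DataFrame is illustrative:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
df = pd.DataFrame({'f1': [1, 1, 1, 1, 2],        # nearly constant feature (low variance)
                   'f2': [10, 20, 30, 40, 50],
                   'y':  [11, 19, 33, 41, 48]})
vt = VarianceThreshold(threshold=0.5)            # variance thresholding filter
vt.fit(df[['f1', 'f2']])
print(vt.get_support())                          # which features pass the variance cutoff
print(df[['f1', 'f2']].corrwith(df['y']))        # Pearson correlation of each feature with the response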
Feature subset selection methods
3. Embedded Methods
Embedded methods perform feature selection as a part of the
model creation process.
This generally leads to a happy medium between the two
methods of feature selection previously explained, as the
selection is done in conjunction with the model tuning
process.
Lasso and Ridge regression are the two most common feature selection methods of this type, and decision trees also perform a form of feature selection as part of model creation.
Feature subset selection methods
3. Embedded Methods
Lasso Regression is another way to penalize the beta coefficients in a
model, and is very similar to Ridge regression. It also adds a penalty term
to the cost function of a model, with a lambda value that must be tuned.
The fewer features a model has, the lower its complexity.
import numpy as np
from sklearn.linear_model import Lasso
# X_train, y_train, X_test, y_test are assumed to be pre-defined, standardized datasets
lasso = Lasso()
lasso.fit(X_train, y_train)
train_score = lasso.score(X_train, y_train)
test_score = lasso.score(X_test, y_test)
coeff_used = np.sum(lasso.coef_ != 0)  # number of features with non-zero coefficients
An important note for Ridge and Lasso regression is that all of your features must be standardized.
Feature subset selection methods
3. Embedded Methods
Ridge regression can do this by penalizing the beta coefficients of a model
for being too large. Basically, it scales back the strength of correlation with
variables that may not be as important as others. Ridge regression is done
by adding a penalty term (also called ridge estimator or shrinkage estimator)
to the cost function of the regression. The penalty term takes all of the betas
and scales them by a term lambda (λ) that must be tuned (usually with cross
validation: compares the same model but with different values of lambda).
from sklearn.linear_model import Ridge
rr = Ridge(alpha=0.01)
rr.fit(X_train, y_train)
Unit 5 :
Overview of Probability :
Outline
Concepts of probability
Probability represents the certainty factor. Certainty is the degree of belief you would assign to an event happening.
Probability is the bedrock of machine learning.
Algorithms are designed using probability (e.g. Naive Bayes).
Learning algorithms will make decisions using probability (e.g. information
gain).
Sub-fields of study are built on probability (e.g. Bayesian networks).
1. Probability of a union of two events:
2. Joint probabilities :
3. Conditional probability :
4. Bayes rule :
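For reference, the standard forms of these four rules, for events A and B:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)        (union of two events)
P(A, B) = P(A ∩ B) = P(A | B) P(B)        (joint probability)
P(A | B) = P(A, B) / P(B)                 (conditional probability, P(B) > 0)
P(A | B) = P(B | A) P(A) / P(B)           (Bayes rule)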
Probability Theory – Terminology
Random Experiment – This is an experiment in which the outcome is
not known with certainty.
Sample Space – This is the universal set that consists of all possible
outcomes of an experiment. It is usually represented using the letter “S”.
Individual outcomes are called elementary events. Sample space can be
finite or infinite.
Event – It is a subset of a sample space and the probability is usually
calculated with respect to an event. Examples of events include:
The number of cancellations of orders placed at an e-commerce portal exceeding 10%.
The number of fraudulent credit card transactions exceeding 1%.
Random variables
Random variables play an important role in describing, measuring, and
analyzing uncertain events such as customer churn, employee attrition, and
demand for a product. It is a function that maps every outcome in the
sample space to a real number.
If random variable X can assume only a finite or countably infinite set of
values, then it is a discrete random variable. E.g., number of orders received
at an e-commerce retailer. These variables are described using probability
mass function (PMF) and cumulative distribution function (CDF).
Random variable X that can take a value from an infinite set of values is a
continuous random variable. E.g., percentage of attrition of employees.
Continuous random variables are described using probability density
function (PDF) and cumulative distribution function (CDF).
The PDF describes the probability that a continuous random variable takes a value in a small neighborhood of x: for a small interval of width dx, P(x < X ≤ x + dx) ≈ f(x) dx.
Continuous random variables
Suppose X is some uncertain continuous quantity. The probability that X lies in any interval a ≤ X ≤ b can be computed as follows. Define the events A = (X ≤ a), B = (X ≤ b) and W = (a < X ≤ b). We have that B = A ∨ W, and since A and W are mutually exclusive, the sum rule gives P(B) = P(A) + P(W), and hence P(a < X ≤ b) = P(B) − P(A).
Define the function F(q) ≜ P(X ≤ q). This is called the cumulative distribution function or CDF of X. This is obviously a monotonically increasing function.
Continuous random variables
Now define f(x) = dF(x)/dx (we assume this derivative exists); this is called the probability density function or pdf.
Binomial Distribution
Binomial distribution is a discrete probability distribution.
It has several applications in many business contexts.
Random variable X is said to follow a binomial distribution when:
1. The random variable can have only two outcomes − success and failure
(also known as Bernoulli trials).
2. The objective is to find the probability of getting x successes out of n
trials.
3. The probability of success is p and thus the probability of failure is (1 −
p).
4. The probability p is constant and does not change between trials.
The PMF of the binomial distribution (the probability that the number of successes will be exactly x out of n trials) is given by
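For reference, the standard binomial PMF is
P(X = x) = C(n, x) · p^x · (1 − p)^(n − x),   x = 0, 1, …, n,
where C(n, x) = n! / (x! (n − x)!) is the number of ways to choose x successes out of n trials.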
Poisson Distribution
In many situations, we may be interested in calculating the number of
events that may occur over a period of time or space.
E.g., the number of order cancellations by customers at an e-commerce portal, the number of customer complaints, the number of cash withdrawals at an ATM, the number of typographical errors in a book, or the number of potholes on Bangalore roads.
To find the probability of number of events, we use Poisson distribution.
The PMF of a Poisson distribution is given by
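For reference, the standard Poisson PMF is
P(X = k) = e^(−λ) · λ^k / k!,   k = 0, 1, 2, …,
where λ is the average number of events per interval of time or space.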
Exponential Distribution
Exponential distribution is a single parameter
continuous distribution that is traditionally used for
modeling time-to-failure of electronic components.
It represents a process in which events occur
continuously and independently at a constant average
rate.
The probability density function is given by
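For reference, the standard exponential PDF is
f(x) = λ e^(−λx) for x ≥ 0 (and 0 otherwise),
where λ is the constant average rate at which events occur.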
Normal DISTRIBUTION
Normal distribution is also known as Gaussian distribution or bell curve (as
it is shaped like a bell).
It is one of the most popular continuous distributions in the field of analytics, especially due to its use in multiple contexts.
Normal distribution is observed across many naturally occurring measures
such as age, salary, sales volume, birth weight and height.
Normal distribution is parameterized by two parameters: the mean of the
distribution µ and the variance σ2.
Central Limit Theorem
It is one of the most important theorems in statistics.
CLT is key to hypothesis testing, which primarily deals with sampling
distribution.
Let S1, S2, …, Sk be samples of size n drawn from an independent and
identically distributed population with mean µ and standard deviation σ.
Let X1, X2, …, Xk be the sample means (of the samples S1, S2, …, Sk).
According to the CLT, the distribution of X1, X2, …, Xk follows a normal
distribution with mean µ and standard deviation of σ/√n.
Hypothesis Test
Hypothesis testing consists of two complementary statements - null hypothesis and
alternative hypothesis.
The null hypothesis is an existing belief, and the alternate hypothesis is what we intend to establish with new evidence (samples).
Objective of hypothesis testing is to either reject or retain a null hypothesis with the help of
data.
Hypothesis tests are broadly classified into parametric tests and non-parametric tests.
1. Parametric tests are about population parameters of a distribution such as mean,
proportion, and standard deviation.
2. Non-parametric tests are about other characteristics, such as independence of events or whether data follow a certain distribution (e.g., the normal distribution).
Steps for hypothesis tests:
1. Define null and alternative hypotheses. Normally, H0 is used to denote null hypothesis and
HA for alternate hypothesis.
2. Identify the test statistic to be used for testing the validity of the null hypothesis (e.g., Z-test
or t-test).
3. Decide the criteria for rejection and retention of null hypothesis. This is called significance
value (α). Typical value used for α is 0.05.
4. Calculate the p-value, which is the conditional probability of observing the test statistic value
when the null hypothesis is true.
5. Take the decision to reject or retain the null hypothesis based on p-value and α.
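A minimal sketch of these steps with SciPy; the sample values and the hypothesized mean of 10 are illustrative:
import numpy as np
from scipy import stats
sample = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.3, 9.7])  # illustrative sample
# H0: population mean = 10, HA: population mean != 10 (two-sided one-sample t-test)
t_stat, p_value = stats.ttest_1samp(sample, popmean=10)
alpha = 0.05  # significance level
print('t =', t_stat, 'p =', p_value)
print('Reject H0' if p_value < alpha else 'Retain H0')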
Analysis Of Variance (Anova)
One-way ANOVA can be used to study the impact of a single treatment
(also known as factor) at different levels (thus forming different groups) on
a continuous response variable (or outcome variable).
Then the null and alternative hypotheses for one-way ANOVA for
comparing 3 groups are given by
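For reference, with group means µ1, µ2 and µ3, these hypotheses are
H0: µ1 = µ2 = µ3    versus    HA: not all of µ1, µ2, µ3 are equal.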
Monte Carlo Approximation
Monte Carlo methods are a class of techniques for randomly sampling a
probability distribution.
Often, we cannot calculate a desired quantity in probability, but we can
define the probability distributions for the random variables directly or
indirectly.
Monte Carlo sampling is a class of methods for randomly sampling from a probability distribution.
Monte Carlo sampling provides the foundation for many machine learning
methods such as resampling, hyperparameter tuning, and ensemble learning.
In principle, Monte Carlo methods can be used to solve any problem having
a probabilistic interpretation.
By the law of large numbers, integrals described by the expected value of
some random variable can be approximated by taking the empirical mean
(a.k.a. the sample mean) of independent samples of the variable.
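A minimal sketch of this idea, approximating E[X²] for a standard normal X by the empirical mean of random draws (the sample size and seed are arbitrary):
import numpy as np
rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)  # independent draws from the target distribution
estimate = np.mean(samples ** 2)        # empirical mean approximates E[X^2] = 1
print(estimate)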
Monte Carlo Approximation
Need for Sampling
There are many problems in probability, and more broadly in machine
learning, where we cannot calculate an analytical solution directly.
In fact, there may be an argument that exact inference may be intractable
for most practical probabilistic models.
Sampling provides a flexible way to approximate many sums and
integrals at reduced cost.
Monte Carlo Methods
Monte Carlo methods, or MC for short, are a class of
techniques for randomly sampling a probability distribution.
There are three main reasons to use Monte Carlo methods to
randomly sample a probability distribution; they are:
Estimate density, gather samples to approximate the distribution of a
target function.
Approximate a quantity, such as the mean or variance of a
distribution.
Optimize a function, locate a sample that maximizes or minimizes the
target function.
Monte Carlo Methods
Monte Carlo methods are defined in terms of the way that samples are drawn
or the constraints imposed on the sampling process.
Some examples of Monte Carlo sampling methods include: direct
sampling, importance sampling, and rejection sampling.
Direct Sampling. Sampling the distribution directly without prior
information.
Importance Sampling. Sampling from a simpler approximation of the
target distribution.
Rejection Sampling. Sampling from a broader distribution and only
considering samples within a region of the sampled distribution.
It’s a huge topic with many books dedicated to it. Next, let’s make the idea of
Monte Carlo sampling concrete with some familiar examples.
For example, Monte Carlo methods can be used for:
1. Calculating the probability of a move by an opponent in a complex game.
2. Calculating the probability of a weather event in the future.
3. Calculating the probability of a vehicle crash under specific conditions.