Data science-Unit-3-Complete


Unit – 3 Syllabus

Data analysis:
Introduction,
Terminology and Concepts,
Introduction to statistics,
Central tendencies and distributions,
Variance,
Distribution properties and arithmetic,
Samples/CLT

Data Pre-process:
Data Cleaning,
Consistency checking,
Heterogeneous and missing data,
Data Transformation & Segmentation,
Machine Learning algorithms - Linear Regression, SVM, Naïve Bayes

Data analysis - Introduction

What is Data Analysis?
Data analysis is defined as the process of cleaning, transforming and processing raw data and modeling it to discover useful information for business decision-making.

Purpose of Data Analysis
• Extract useful information from data.
• Make decisions based upon the data analysis.

Why Data Analysis?
• Data analysis plays an important role in:
• Discovering useful information
• Answering questions from data
• Predicting the future or unknown

To grow your business, all you need to do is analysis!
• If your business is not growing, look back into the data to find mistakes.
• If your business is growing, look forward to making the business grow even more.
All you need to do is analyze your business data and business processes.

Types of Data analysis

An analysis of any event can be done in one of two ways:
1. Quantitative Analysis: Quantitative or Statistical Analysis is the science of collecting and interpreting data with numbers and graphs to identify patterns and trends.
2. Qualitative Analysis: Qualitative or Non-Statistical Analysis gives generic information and uses text, sound and other forms of media to do so.

For example, if I want to purchase a coffee from a shop that offers Short, Tall and Grande sizes, this is an example of Qualitative Analysis.
But if a store sells 70 regular coffees a week, that is Quantitative Analysis, because we have a number representing the coffees sold per week.

Although the purpose of both these analyses is to provide results, Quantitative Analysis provides a clearer picture, hence making it crucial in analytics.

Categories of Data Analysis

1. Statistical Analysis: shows "what happened" by using past data in the form of dashboards.
There are two categories of this type:
a. Descriptive Analysis: analyses the complete data. It shows the mean and deviation for continuous data, and the percentage and frequency for categorical data.
b. Inferential Analysis: analyses samples drawn from the complete data and uses probability to arrive at a conclusion.

2. Diagnostic Analysis: shows "why did it happen?"
3. Predictive Analysis: predicts "what is likely to happen in the future?"
4. Prescriptive Analysis: answers "what is the best course of action to be taken?"
5. Text Analysis: also referred to as Data Mining. It is one of the methods of data analysis used to discover patterns in large data sets.

EDA & CDA

In statistical or real applications, we have:
1. Exploratory data analysis (EDA): focuses on discovering new features in the data.
2. Confirmatory data analysis (CDA): focuses on confirming or falsifying existing hypotheses.

EDA = used to learn
- for deriving conclusions and predictions
- observations and trials alone do not constitute a source of evidence.

CDA = used to confirm
- entirely pre-planned - data is collected and analyzed with carefully designed experimental RCTs (randomized controlled trials).
Example: in the pharmaceutical industry, drug evaluation and approval is based solely on CDA.

Introduction to Statistics

To become a successful Data Scientist you must know your basics - Maths and Statistics - the building blocks of Machine Learning algorithms.

Statistics is the science of analyzing data.

When we have created a model for prediction, we must assess the prediction's reliability. After all, what is a prediction worth if we cannot rely on it?

Descriptive Statistics: summarizes important features of a data set, such as:
1. Count
2. Sum
3. Standard Deviation
4. Percentile
5. Average, etc.

Branches of Statistics

Descriptive Statistics

Central tendencies

Measures of central tendency:
Mean: the measure of the average of all the values in a sample is called the Mean.
Median: the measure of the central value of the sample set is called the Median.
Mode: the value most recurrent in the sample set is known as the Mode.

Using descriptive analysis, you can analyze each of the variables in the sample data set for mean, standard deviation, minimum and maximum.

Example (cars dataset):
The mean or average horsepower of the cars among the population of cars:
Mean = (110+110+93+96+90+110+110+110)/8 = 103.625

Median: If we want to find the center value of mpg among the population of cars, we arrange the mpg values in ascending or descending order and choose the middle value.
The mpg for 8 cars: 21, 21, 21.3, 22.8, 23, 23, 23, 23
Median = (22.8+23)/2 = 22.9

Mode: If we want to find the most common type of cylinder among the population of cars, we check the value that is repeated the most. Here the cylinders come in two values, 4 and 6, and looking at the data set, the most recurring value is 6. Hence 6 is our Mode.
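The same three measures can be reproduced in Python with the standard library's statistics module. A minimal sketch: the horsepower and mpg lists come from the slides, while the cylinder list is hypothetical (only 4s and 6s, with 6 most frequent, to match the Mode example):

import statistics

horsepower = [110, 110, 93, 96, 90, 110, 110, 110]
mpg = [21, 21, 21.3, 22.8, 23, 23, 23, 23]
cylinders = [6, 6, 4, 6, 4, 6, 6, 4]   # hypothetical values, 4s and 6s only

print(statistics.mean(horsepower))   # 103.625
print(statistics.median(mpg))        # 22.9
print(statistics.mode(cylinders))    # 6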

Measures of Central Tendency

Mean

Median

Mode

Statistics for Data Science - Terminology

• Population: the set of sources from which data has to be collected.
• Sample: a subset of the Population.
• Variable: any characteristic, number, or quantity that can be measured or counted. A variable may also be called a data item.
• A statistical parameter or population parameter: a quantity that indexes a family of probability distributions. For example, the mean, median, etc. of a population.

• Getting familiar with the data in a dataset:
• We can use the describe() function in Python to summarize the data:
• Example
• print(full_health_data.describe())
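As a rough sketch of the describe() call above, assuming the dataset is available as a CSV file named full_health_data.csv (the file name and loading step are assumptions; the slides do not show how the DataFrame is created):

import pandas as pd

full_health_data = pd.read_csv("full_health_data.csv")   # hypothetical file name

# describe() reports count, mean, std, min, the 25%/50%/75% percentiles and max
# for each numeric column.
print(full_health_data.describe())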

Population and Sample

Complete sample

Partial sample

25%, 50% and 75% - Percentiles

Percentiles are used in statistics to give you a number that describes the value that a given percent of the values are lower than.
Let us try to explain it with some examples, using Average_Pulse.

The 25% percentile of Average_Pulse means that 25% of all the training sessions have an average pulse of <= 100 beats per minute; equivalently, 75% of all the training sessions have an average pulse of >= 100 beats per minute.
The 75% percentile of Average_Pulse means that 75% of all the training sessions have an average pulse of <= 111; equivalently, 25% of all the training sessions have an average pulse of >= 111 beats per minute.
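A short sketch of computing percentiles with NumPy. The pulse values below are the 10-observation Average_Pulse sample used later in the variance example; percentiles on this small sample will of course differ from the full_health_data figures quoted above:

import numpy as np

average_pulse = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])

print(np.percentile(average_pulse, 25))   # 25th percentile
print(np.percentile(average_pulse, 50))   # 50th percentile (the median)
print(np.percentile(average_pulse, 75))   # 75th percentile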

Measures of the Spread

Just like the measures of center (mean, median and mode), we also have measures of the spread, which comprise the following measures:

Range: the measure of how spread apart the values in a data set are.
Inter Quartile Range (IQR): the measure of variability based on dividing a data set into quartiles.
Variance: describes how much a random variable differs from its expected value. It entails computing the squares of the deviations.

Deviation: the difference between each element and the mean.

Population Variance: the average of the squared deviations (the sum of squared deviations divided by N).
Sample Variance: the average of the squared differences from the mean, computed with n-1 in the denominator to correct for bias.
Standard Deviation: the measure of the dispersion of a set of data from its mean.

Range

Quartiles

Measure of Dispersion (Spread)

Variance

Variance is a number that indicates how the values are spread around the mean.
In fact, if you take the square root of the variance, you get the standard deviation; or, the other way around, if you multiply the standard deviation by itself, you get the variance.

We will first use the data set with 10 observations (from the full_health_data dataset) to give an example of how we can calculate the variance:

Duration  Average_Pulse  Max_Pulse  Calorie_Burnage  Hours_Work  Hours_Sleep
   30          80           120          240             10           7
   30          85           120          250             10           7
   45          90           130          260              8           7
   45          95           130          270              8           7
   45         100           140          280              0           7
   60         105           140          290              7           8
   60         110           145          300              7           8
   60         115           145          310              8           8
   75         120           150          320              0           8
   75         125           150          330              8           8

Step 1 to Calculate the Variance: Find the Mean
Suppose we want to find the variance of Average_Pulse.
1. Find the mean:
(80+85+90+95+100+105+110+115+120+125) / 10 = 102.5
The mean is 102.5.

Step 2: For Each Value - Find the Difference From the Mean
2. Find the difference from the mean for each value:
80 - 102.5 = -22.5
85 - 102.5 = -17.5
90 - 102.5 = -12.5
95 - 102.5 = -7.5
100 - 102.5 = -2.5
105 - 102.5 = 2.5
110 - 102.5 = 7.5
115 - 102.5 = 12.5
120 - 102.5 = 17.5
125 - 102.5 = 22.5

Step 3: Find the square value for each difference:
(-22.5)^2 = 506.25
(-17.5)^2 = 306.25
(-12.5)^2 = 156.25
(-7.5)^2 = 56.25
(-2.5)^2 = 6.25
2.5^2 = 6.25
7.5^2 = 56.25
12.5^2 = 156.25
17.5^2 = 306.25
22.5^2 = 506.25
Note:
1. We must square the values to get the total spread.
2. We cannot have negative values; otherwise the total would be 0.

Step 4: The Variance is the Average of These Squared Values
4. Sum the squared values and find the average:
(506.25 + 306.25 + 156.25 + 56.25 + 6.25 + 6.25 + 56.25 + 156.25 + 306.25 + 506.25) / 10 = 206.25
The variance is 206.25.

Standard Deviation

A mathematical function will have difficulties predicting precise values if the observations are "spread".
Variance is a common measure of data dispersion, but in most cases its values are pretty large and hard to compare (as the values are squared).
In most analyses, standard deviation is much more meaningful than variance.

standard deviation = √variance

• Standard deviation is a measure of uncertainty.
• A low standard deviation means that most of the numbers are close to the mean (average) value.
• A high standard deviation means that the values are spread out over a wider range.

Standard Deviation is often represented by the symbol sigma: σ

Example python code:
import numpy as np
std = np.std(full_health_data)
print(std)
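The worked example can be checked with NumPy. Note that np.var and np.std divide by N by default (population variance), which matches the manual calculation above:

import numpy as np

average_pulse = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])

variance = np.var(average_pulse)   # 206.25, same as the manual calculation
std_dev = np.std(average_pulse)    # about 14.36, the square root of 206.25

print(variance, std_dev)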

Standard Deviation

Standard deviation is the average deviation of each value in a data set from the mean.

Intuition:
If the variance is high, you have larger variability in your dataset (more values are spread out around the mean value).
Standard deviation represents the average distance of an observation from the mean.
The larger the standard deviation, the larger the variability of the data.

To describe a variable whose values follow a normal distribution, you need to know its mean and standard deviation; if the distribution is indeed normal, the plotted data will be close to the familiar bell-shaped curve.


Properties of Normal Distribution

1. The distribution is symmetric.
2. The mean, median, and mode are all the same value.
3. The approximate proportion of the data distributed around the mean, for increasing numbers of standard deviations, is given by the empirical rule:
   a) 68% within +/- one standard deviation
   b) 95% within +/- two standard deviations
   c) 99.7% within +/- three standard deviations
   Data points outside of three standard deviations are considered outliers, as they are very unlikely to occur.
4. Skewness and kurtosis
   Skewness and kurtosis are coefficients that measure how different a distribution is from a normal distribution. Skewness measures the symmetry of the distribution, while kurtosis measures the thickness of the tail ends relative to the tails of a normal distribution.
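A quick sketch verifying the empirical (68-95-99.7) rule on simulated normal data:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=1, size=100_000)   # mean 0, standard deviation 1

for k in (1, 2, 3):
    share = np.mean(np.abs(x) <= k)            # fraction within k standard deviations
    print(k, round(share, 3))                  # close to 0.68, 0.95 and 0.997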


Skewness

What is Skewness?
Skewness is a measure of the asymmetry or distortion of a symmetric distribution.
It measures the deviation of the given distribution of a random variable from a symmetric normal distribution.
A normal distribution is without any skewness, as it is symmetrical on both sides.
Hence, a curve is regarded as skewed if it is shifted towards the right or the left.

Kurtosis

What is Kurtosis?
Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution.
In other words, kurtosis identifies whether the tails of a given distribution contain extreme values.
Along with skewness, kurtosis is an important descriptive statistic of data distribution. However, the two concepts must not be confused with each other: skewness essentially measures the symmetry of the distribution, while kurtosis determines the heaviness of the distribution tails.

What is Excess Kurtosis?
Excess kurtosis is a metric that compares the kurtosis of a distribution against the kurtosis of a normal distribution. The kurtosis of a normal distribution equals 3. Therefore, the excess kurtosis is found using the formula below:

Excess Kurtosis = Kurtosis - 3

Types of Kurtosis

The types of kurtosis are determined by the excess kurtosis of a particular distribution. The excess kurtosis can take positive or negative values, as well as values close to zero.
1. Mesokurtic: data that follows a mesokurtic distribution shows an excess kurtosis of zero or close to zero. This means that if the data follows a normal distribution, it follows a mesokurtic distribution.
2. Leptokurtic: indicates a positive excess kurtosis. The distribution shows heavy tails on either side, indicating large outliers. In finance, a leptokurtic distribution shows that the investment returns may be prone to extreme values on either side. Therefore, an investment whose returns follow a leptokurtic distribution is considered risky.
3. Platykurtic: the distribution shows a negative excess kurtosis, revealing a distribution with flat (thin) tails. The flat tails indicate small outliers in the distribution. In the finance context, a platykurtic distribution of investment returns is desirable for investors because there is a small probability that the investment would experience extreme returns.
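A hedged sketch of measuring skewness and excess kurtosis with SciPy (scipy.stats.kurtosis returns excess kurtosis by default, so a normal sample gives a value near 0):

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)
normal_sample = rng.normal(size=50_000)
skewed_sample = rng.exponential(size=50_000)   # right-skewed with a heavy right tail

print(skew(normal_sample), kurtosis(normal_sample))   # both close to 0 (mesokurtic)
print(skew(skewed_sample), kurtosis(skewed_sample))   # positive skew, positive excess kurtosis (leptokurtic)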


Coefficient of Variation

The coefficient of variation is used to get an idea of how large the standard deviation is.

Mathematically, the coefficient of variation is defined as:

Coefficient of Variation = Standard Deviation / Mean

In Python we can compute it with the following code:

import numpy as np

cv = np.std(full_health_data) / np.mean(full_health_data)
print(cv)


Correlation Coefficient

Correlation
Correlation measures the relationship between two variables.
We mentioned that a function has the purpose of predicting a value by converting an input (x) to an output (f(x)). We can also say that a function uses the relationship between two variables for prediction: y = f(x).

Correlation Coefficient
The correlation coefficient measures the relationship between two variables.
The correlation coefficient can never be less than -1 or higher than 1.
1 = there is a perfect linear relationship between the variables (like Average_Pulse against Calorie_Burnage)
0 = there is no linear relationship between the variables
-1 = there is a perfect negative linear relationship between the variables (e.g. fewer hours worked leads to higher calorie burnage during a training session)

Example of a Perfect Linear Relationship (Correlation Coefficient = 1)
We will use a scatterplot to visualize the relationship between Average_Pulse and Calorie_Burnage (we have used the small data set of the sports watch with 10 observations).
This time we want scatter plots, so we change kind to "scatter":

Example
import matplotlib.pyplot as plt
health_data.plot(x='Average_Pulse', y='Calorie_Burnage', kind='scatter')
plt.show()

Correlation Coefficient

Example of a Perfect Linear Relationship (Correlation Coefficient = 1)
• We use a scatterplot to visualize the relationship between Average_Pulse and Calorie_Burnage (using the small data set of the sports watch with 10 observations).
• A perfect linear relationship exists between Average_Pulse and Calorie_Burnage.
• Example: Python code
import matplotlib.pyplot as plt
health_data.plot(x='Average_Pulse', y='Calorie_Burnage', kind='scatter')
plt.show()

Example of a Perfect Negative Linear Relationship (Correlation Coefficient = -1)
We have plotted fictional data here. The x-axis represents the number of hours worked at our job before a training session. The y-axis is Calorie_Burnage.
If we work longer hours, we tend to have lower calorie burnage because we are exhausted before the training session.
The correlation coefficient here is -1.

Correlation Coefficient

Example of No Linear Relationship (Correlation Coefficient = 0)
Here, we have plotted Max_Pulse against Duration from the full_health_data set.

As you can see, there is no linear relationship between the two variables; a longer training session does not lead to a higher Max_Pulse.

The correlation coefficient here is 0.

Example
import matplotlib.pyplot as plt
full_health_data.plot(x='Duration', y='Max_Pulse', kind='scatter')
plt.show()
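The correlation coefficients behind these plots can be computed directly with pandas; a sketch, again assuming the full_health_data DataFrame is loaded from a hypothetical CSV file as earlier:

import pandas as pd

full_health_data = pd.read_csv("full_health_data.csv")   # hypothetical file name

# Pairwise correlation matrix; every value lies between -1 and 1.
print(full_health_data.corr())

# A single pair, e.g. Duration vs. Max_Pulse (expected to be close to 0 here):
print(full_health_data["Duration"].corr(full_health_data["Max_Pulse"]))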

Central Limit Theorem (CLT)

What is the Central Limit Theorem (CLT)?
This section covers the definition of the CLT, its significance (statistical significance and practical applications), and the assumptions behind the central limit theorem.

Definition of the CLT: given a dataset with an unknown distribution (it could be uniform, binomial or completely random), the sample means will approximate the normal distribution.

Let's understand the central limit theorem with the help of an example. This will help you intuitively grasp how the CLT works underneath.

Consider that there are 15 sections in the science department of a university and each section hosts around 100 students. Our task is to calculate the average weight of the students in the science department. Sounds simple, right?

Aspiring data scientists just calculate the average:
• First, measure the weights of all the students in the science department.
• Add all the weights.
• Finally, divide the total sum of weights by the total number of students to get the average.

But what if the size of the data is huge? Does this approach make sense?
Not really - measuring the weight of all the students will be a very tiresome and long process. So, what can we do instead? Let's look at an alternate approach.

First, draw groups of students at random from the class. We will call this a sample. We'll draw multiple samples, each consisting of 30 students.
1. Calculate the individual mean of each of these samples.
2. Calculate the mean of these sample means.
3. This value will give us the approximate mean weight of the students in the science department.
4. Additionally, the histogram of the sample mean weights of the students will resemble a bell curve (or normal distribution).
This, in a nutshell, is what the central limit theorem is all about. A simulation of this idea is sketched below.
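A small simulation sketch: even though the "weights" below are drawn from a uniform (non-normal) population, the distribution of the sample means is approximately normal. All numbers here are made up for illustration:

import numpy as np

rng = np.random.default_rng(42)
population = rng.uniform(45, 95, size=1500)   # hypothetical weights of 15 x 100 students

# Draw many samples of 30 students and record the mean of each sample.
sample_means = [rng.choice(population, size=30, replace=False).mean() for _ in range(1000)]

print(np.mean(sample_means))   # close to the true population mean
print(population.mean())
# A histogram of sample_means (e.g. with matplotlib) resembles a bell curve.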


Significance of the Central Limit Theorem
The central limit theorem has both statistical significance and practical applications.
We'll look at both aspects to gauge where we can use them.

Statistical Significance of CLT
• Analyzing data involves statistical methods like hypothesis testing and constructing confidence intervals.
• These methods assume that the population is normally distributed.
• In the case of unknown or non-normal distributions, we treat the sampling distribution as normal according to the central limit theorem.
• If we increase the number of samples drawn from the population, the standard deviation of the sample means will decrease. This helps us estimate the population mean much more accurately.
• Also, the sample mean can be used to create the range of values known as a confidence interval (which is likely to contain the population mean).

Practical Applications of CLT

1. Political/election polls are prime CLT applications.
2. These polls estimate the percentage of people who support a particular candidate.
3. You might have seen these results on news channels, reported with confidence intervals. The central limit theorem helps calculate those confidence intervals.
4. Calculating the mean family income for a particular region.
5. The central limit theorem has many applications in different fields.

Assumptions Behind the Central Limit Theorem

It's important to understand the assumptions behind this technique:
1. The data must follow the randomization condition: it must be sampled randomly.
2. Samples should be independent of each other. One sample should not influence the other samples.
3. The sample size should be no more than 10% of the population when sampling is done without replacement.
4. The sample size should be sufficiently large. How do we figure out how large this size should be? It depends on the population. When the population is skewed or asymmetric, the sample size should be large. If the population is symmetric, we can draw small samples as well. In general, a sample size of 30 is considered sufficient when the population is symmetric.

The mean of the sample means is denoted as:

µX̄ = µ

where,
µX̄ = mean of the sample means
µ = population mean

And the standard deviation of the sample means is denoted as:

σX̄ = σ / sqrt(n)

where,
σX̄ = standard deviation of the sample means
σ = population standard deviation
n = sample size

Statistics – Probability

Statistical concepts make your journey pleasant and lead you to success in the field of Data Science.
Statistics is a powerful tool while performing the art of Machine Learning and Data Science.

Data Preprocessing

Fact: high quality data leads to better models and predictions. So data pre-processing has become vital - the fundamental step in the data science / machine learning / AI pipeline.

Measures for data quality (a multidimensional view):
1. Accuracy: correct or wrong, accurate or not
2. Completeness: not recorded, unavailable, ...
3. Consistency: some modified but some not, dangling, ...
4. Timeliness: is the data updated in a timely manner?
5. Believability: how much is the data trustable?
6. Interpretability: how easily can the data be understood by the stakeholders?
(Remember: ACCTBI)

To ensure high quality data, it's crucial to preprocess it.
To make the process easier, data preprocessing is divided into four stages:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation

1. Data cleaning: refers to techniques to 'clean' data by removing outliers, replacing missing values, smoothing noisy data, and correcting inconsistent data.

Major Tasks in Data Preprocessing

1. Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data gets recorded.
• 1. Incomplete data: lacking:
  1) attribute values, e.g., Occupation=" " (missing data)
  2) certain attributes of interest (missing columns in a table)
  3) or containing only aggregate data (e.g., only an average value)
• 2. Noisy data: containing noise, errors or outliers.
  • Noise: data that is corrupted, distorted or meaningless.
  • Errors: incorrect data (e.g., Salary="-10")
  • Outliers: out-of-range data or data not belonging to the dataset (rows, records or tuples).
  • Example: { 1, 2, 40, 45, 50, 55, 60, 65, 70, 80, 80000, 10000 }

1. Data Cleaning (continued)

• 3. Inconsistent data: containing discrepancies (differences or inconsistencies) in codes or names, e.g.,
  • Age="42", Birthday="03/07/2010"
  • Was rating "1, 2, 3", now rating "A, B, C"
  • discrepancy between duplicate records
• 4. Intentional data: users purposely submit incorrect data when they do not want to disclose personal information.
  • Jan. 1 as everyone's birthday?
  • This is also known as disguised missing data.

Missing Data
• Already covered.

Noisy Data

• Noise: random error or variance in a measured variable.
• Noisy data (incorrect data) occurs due to:
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitations
  • inconsistency in naming conventions
• Other data problems which require data cleaning:
  • duplicate records
  • incomplete data
  • inconsistent data

How to Handle Noisy Data?

• Q: How can we smooth out the data to remove noise?
• The following are the data smoothing techniques:
• 1. Binning
  • First sort the data and partition it into (equal-frequency) bins.
  • Then one can:
    • smooth by bin means: each value is replaced by the bin's mean value
    • smooth by bin medians: each value is replaced by the bin's median value
    • smooth by bin boundaries: each value is replaced by the closest boundary value.
  A small sketch of binning is shown below.
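A minimal sketch of equal-frequency binning and smoothing by bin means, using a short hypothetical list of values:

import numpy as np

data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])   # hypothetical values
bins = np.array_split(data, 4)   # 4 equal-frequency bins of 3 values each

# Smoothing by bin means: every value in a bin is replaced by the bin's mean.
smoothed = [[round(float(np.mean(b)), 2)] * len(b) for b in bins]

print([list(b) for b in bins])
print(smoothed)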

How to Handle Noisy Data? (continued)

• 2. Regression:
  • Data can be smoothed by fitting the data to regression functions.
  • Linear regression: finding the "best line" to fit two attributes, so that one attribute can be used to predict the other.
  • Multiple linear regression: more than two attributes are involved and the data is fit to a multidimensional surface.
• 3. Clustering:
  • Similar values are organized into "groups" or "clusters".
  • Values that fall outside of the clusters are "outliers".
  • Clustering is used to detect and remove outliers.

How to Handle Noisy Data? (continued)

• 4. Combined computer and human inspection
  • Detect suspicious values and check them manually (by humans), e.g., when dealing with possible outliers.

Exception: in some applications the rare events can be more interesting than the more regularly occurring ones. The analysis of outlier data is referred to as outlier analysis or anomaly mining.
Example: fraud detection with credit cards (track repeated attempts made by a hacker to withdraw a large amount).

2. Data Integration

Since data is collected from multiple sources, data integration has become a vital part of the process.
This may lead to redundant and inconsistent data, which could result in poor accuracy and speed of the data model.
To deal with these issues and maintain data integrity, approaches such as tuple duplication detection and data conflict detection are used.
The most common approaches to integrate data are explained below:
1. Data consolidation: the data is physically brought together into one data store (this usually involves data warehousing).
2. Data propagation: copying data from one location to another using applications is called data propagation. It can be synchronous or asynchronous and is event-driven.
3. Data virtualization: an interface is used to provide a real-time and unified view of data from multiple sources. The data can be viewed from a single point of access.

Handling redundancy in Data Integration
Redundant data occurs often when multiple databases are integrated:
• Object identification: the same attribute or object may have different names in different databases.
• Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue.
Redundant attributes may be detected by correlation analysis and covariance analysis.
Careful integration of the data from multiple sources may
- help reduce/avoid redundancies and inconsistencies,
- and improve processing speed and quality.

3. Data Reduction

• Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant or redundant to the mining task.
• Data reduction often refers to techniques that reduce dimensionality by creating new attributes that are combinations of the old attributes.
• Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
• Why data reduction?
  • A database/data warehouse may store terabytes of data.
  • Complex data analysis may take a very long time to run on the complete data set.

Data reduction strategies
1. Dimensionality reduction: the process of reducing the number of random variables or attributes under consideration, e.g., removing unimportant attributes.
   Methods used:
   • Wavelet transforms
   • Principal Components Analysis (PCA)
   • Feature subset selection, feature creation
2. Numerosity reduction: techniques that replace the original data volume by alternative, smaller forms of data representation. There are two kinds of techniques:
   a) Parametric methods: a model is used to estimate the data, so that only the model parameters need to be stored instead of the actual data. Examples: regression and log-linear models.
   b) Non-parametric methods: store reduced representations of the data. Examples: histograms, clustering, sampling, data cube aggregation.
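A brief sketch of dimensionality reduction with PCA in scikit-learn, reducing four made-up attributes to two principal components:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))           # 100 rows with 4 attributes (made-up data)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # share of the variance kept by each component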

3. Data Reduction (continued)

• 3. Data compression: transformations applied to obtain a reduced or "compressed" representation of the original data.
• Lossless: if the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless.
• Lossy: if we can reconstruct only an approximation of the original data, the data reduction is called lossy.

4. Data Transformation

The final step of data preprocessing is transforming the data into a form appropriate for data modeling.
Strategies that enable data transformation include:
1. Smoothing
2. Attribute/feature construction: new attributes are constructed from the given set of attributes.
3. Aggregation: summary and aggregation operations are applied to the given set of attributes to come up with new attributes.
4. Normalization: the data in each attribute is scaled to a smaller range, e.g. 0 to 1 or -1 to 1.
5. Discretization: raw values of numeric attributes are replaced by discrete or conceptual intervals, which can in turn be further organized into higher-level intervals.
6. Concept hierarchy generation for nominal data: values for nominal data are generalized to higher-order concepts.
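A short sketch of min-max normalization (strategy 4 above), scaling a hypothetical attribute to the 0-1 range:

import numpy as np

ages = np.array([18, 22, 25, 30, 40, 60], dtype=float)   # hypothetical attribute values

normalized = (ages - ages.min()) / (ages.max() - ages.min())
print(normalized)   # 18 -> 0.0, 60 -> 1.0, all other values in between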

Data Segmentation

Data Segmentation is the process of taking the data and dividing it into groups of similar data in order to use it more efficiently.
Examples of data segmentation could be:
1. Gender
2. Customers vs. Prospects
3. Industry

Why is Data Segmentation important?
The key benefits of data segmentation are:
• You will be able to create messaging that is tailored and sophisticated enough to suit your target market, appealing to their needs better.
• It allows analysis of the data stored in your database, helping to identify potential opportunities and challenges within it.
• It enables you to mass-personalise your marketing communications, reducing costs.

How can you apply segmentation to your data?
Implementing the right kind of data segmentation, and communicating more effectively with your target group, requires a blend of the right processes and technology (such as data quality tools and customer data validation).
This allows you to analyze and profile your current database, whilst ensuring any incoming data is also segmented accordingly. A key requirement of data segmentation is high-quality data, in terms of it being accurate and not lacking basic information such as "name" or "address".

Linear Regression in Machine Learning

Regression is a method of modelling a target value based on independent predictors.
Linear regression is one of the easiest and most popular supervised machine learning algorithms for predictive analysis.

Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (X), hence the name linear regression. That is, it finds how the value of the dependent variable changes according to the value of the independent variable.

The linear regression model provides a sloped straight line representing the relationship between the variables.

Linear Regression in Machine Learning

Mathematically, we can represent a linear regression as:

y = a0 + a1*x + ε

Here,
y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error

The values for the x and y variables are the training datasets used for the linear regression model representation.

Types of Linear Regression
Linear regression can be further divided into two types of algorithms:
• Simple Linear Regression: if a single independent variable is used to predict the value of a numerical dependent variable, the algorithm is called Simple Linear Regression.
• Multiple Linear Regression: if more than one independent variable is used to predict the value of a numerical dependent variable, the algorithm is called Multiple Linear Regression.

Linear Regression line

A regression line can show two types of relationship:
a) Positive Linear Relationship: if the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, the relationship is termed a positive linear relationship.
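A hedged sketch of simple linear regression with scikit-learn on synthetic data generated from y = a0 + a1*x + ε, recovering the intercept and coefficient:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))             # single independent variable
y = 3.0 + 2.0 * x[:, 0] + rng.normal(0, 1, 100)   # a0 = 3, a1 = 2, plus random error

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_)   # estimates close to 3 and [2]
print(model.predict([[5.0]]))          # prediction for x = 5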

Linear Regression line (continued)

b) Negative Linear Relationship: if the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, the relationship is called a negative linear relationship.

Finding the best fit line:
When working with linear regression, our main goal is to find the best fit line, which means that the error between the predicted values and the actual values should be minimized. The best fit line will have the least error.
Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best fit line; to calculate this we use the cost function.

Cost function: the cost function is used to estimate the values of the coefficients (a0, a1) for the best fit line.
It measures how well a linear regression model is performing.
We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the Hypothesis function.
For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values.

For the above linear equation, the MSE can be calculated as:

MSE = (1/N) * Σ (yi - (a1*xi + a0))^2

Where,
N = total number of observations
yi = actual value
(a1*xi + a0) = predicted value

Residuals: the distance between an actual value and the corresponding predicted value is called a residual.
If the observed points are far from the regression line, the residuals will be high, and so the cost function will be high.
If the scatter points are close to the regression line, the residuals will be small, and hence so will the cost function.

Gradient Descent: gradient descent is used to minimize the MSE by iteratively updating the coefficients (a0, a1) of the line.

Model Performance:
The goodness of fit determines how well the line of regression fits the set of observations. The process of finding the best model out of various models is called optimization.
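A minimal sketch of estimating a0 and a1 by gradient descent on the MSE cost, on synthetic data; the learning rate and iteration count are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 200)   # true a0 = 3, a1 = 2

a0, a1, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    error = (a0 + a1 * x) - y           # predicted minus actual
    a0 -= lr * 2 * error.mean()         # gradient of MSE with respect to a0
    a1 -= lr * 2 * (error * x).mean()   # gradient of MSE with respect to a1

mse = ((y - (a0 + a1 * x)) ** 2).mean()
print(a0, a1, mse)                      # a0 close to 3, a1 close to 2, small MSE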

SVM machine learning algorithm

At the end of the session you will be able to know:
• What is a Support Vector Machine?
• How does it work?
• How to implement SVM in Python?
• Pros and cons associated with SVM

What is a Support Vector Machine?
"Support Vector Machine" (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges. However, it is mostly used in classification problems.

Example: Consider a population composed of 50%-50% males and females. Using a sample of this population, you want to create a set of rules which will guide us to predict the gender class for the rest of the population.
Using this algorithm, we intend to build a robot which can identify whether a person is a male or a female. This is a sample problem of classification analysis.
Let's assume we use two features: the height of the individual and hair length. Following is a scatter plot of the sample.
The blue circles in the plot represent females and the green squares represent males. A few expected insights from the graph are:
1. Males in our population have a higher average height.
2. Females in our population have longer scalp hair.
If we were to see an individual with a height of 180 cm and hair length of 4 cm, our best guess would be to classify this individual as a male. This is how we do a classification analysis.

SVM machine learning algorithm

What is a Support Vector and what is an SVM?
Support vectors are simply the coordinates of individual observations. For instance, (150, 45) is a support vector which corresponds to a female.
A Support Vector Machine is a frontier (border) line which best segregates (separates) the males from the females.
In this case, the two classes are well separated from each other, hence it is easier to find an SVM.

How to find the Support Vector Machine for the case in hand?
There are many possible frontiers (borders) which can classify the problem in hand. Following are three possible frontiers.

How do we decide which is the best frontier for this particular problem statement?
The objective in an SVM is to look at the minimum distance of each frontier from its closest support vector (which can belong to any class).
For instance, the orange frontier is closest to the blue circles, and the closest blue circle is 2 units away from the frontier. Once we have these distances for all the frontiers, we simply choose the frontier with the maximum distance from its closest support vector.
Out of the three shown frontiers, we see the black frontier is farthest from its nearest support vector (i.e. 15 units).

What if we do not find a clean frontier which segregates the classes?
Our job was relatively easy finding the SVM in this business case. What if the distribution looked something like the following?

SVM machine learning algorithm

How does it work? Let's understand through scenarios.

Identify the right hyper-plane (Scenario 1): Here, we have three hyper-planes (A, B and C). Now, identify the right hyper-plane to classify the stars and circles.
You need to remember a thumb rule to identify the right hyper-plane: "Select the hyper-plane which segregates (separates or sets apart) the two classes better."
In this scenario, hyper-plane "B" has performed this job excellently.

Identify the right hyper-plane (Scenario 2): Here, we have three hyper-planes (A, B and C) and all are segregating the classes well.

SVM machine learning algorithm

How can we identify the right hyper-plane?
Here, maximizing the distance between the nearest data points (of either class) and the hyper-plane will help us decide the right hyper-plane. This distance is called the Margin. Let's look at the snapshot below:
Above, you can see that the margin for hyper-plane C is high compared to both A and B. Hence, we name the right hyper-plane as C.
Another important reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane with a low margin, there is a high chance of misclassification.

In the next scenario, you may be tempted to select hyper-plane B as it has a higher margin compared to A.
But here is the catch: SVM selects the hyper-plane which classifies the classes accurately prior to maximizing the margin.
Here, hyper-plane B has a classification error while A has classified all points correctly. Therefore, the right hyper-plane is A.

SVM machine learning algorithm

Can we classify two classes (Scenario 4)?
Below, we are unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class, as an outlier.
One star at the other end is like an outlier for the star class.
The SVM algorithm has a feature to ignore outliers and find the hyper-plane that has the maximum margin. Hence, we can say that SVM classification is robust to outliers.

Find the hyper-plane to segregate the classes (Scenario 5):
In the scenario below, we can't have a linear hyper-plane between the two classes, so how does SVM classify these two classes?
Till now, we have only looked at linear hyper-planes.

SVM machine learning algorithm

SVM can solve this problem. Easily! It solves this problem by introducing an additional feature. Here, we will add a new feature z = x^2 + y^2.
Now, let's plot the data points on the x and z axes.
In the above plot, the points to consider are:
1. All values for z will always be positive, because z is the squared sum of both x and y.
2. In the original plot, red circles appear close to the origin of the x and y axes, leading to lower values of z, while the stars are relatively far away from the origin, resulting in higher values of z.

In the SVM classifier, it is easy to have a linear hyper-plane between these two classes.
Burning question: do we need to add this feature manually to have a hyper-plane?
Answer: No, the SVM algorithm has a technique called the kernel trick.
The SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts a non-separable problem into a separable problem. It is mostly useful in non-linear separation problems.
It does some extremely complex data transformations, then finds out the process to separate the data based on the labels or outputs you've defined.

When we look at the hyper-plane in the original input space, it looks like a circle.
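A hedged sketch of the kernel trick with scikit-learn: an RBF-kernel SVM separates two classes arranged in concentric circles, where no straight line in the original (x, y) space could:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

print(linear_svm.score(X, y))   # poor: no separating straight line exists
print(rbf_svm.score(X, y))      # close to 1.0 thanks to the kernel trick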

SVM machine learning algorithm

Pros and Cons associated with SVM
Pros:
• It works really well with a clear margin of separation.
• It is effective in high dimensional spaces.
• It is effective in cases where the number of dimensions is greater than the number of samples.
• It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
Cons:
• It doesn't perform well when we have a large data set, because the required training time is higher.
• It also doesn't perform very well when the data set has more noise, i.e. the target classes are overlapping.
• SVM doesn't directly provide probability estimates; these are calculated using an expensive five-fold cross-validation. (This is included in the related SVC method of the Python scikit-learn library.)

Naïve Bayes Classifier

• What is the Naive Bayes algorithm? How do Naive Bayes algorithms work?
• What are the pros and cons of using Naive Bayes?
• 4 applications of the Naive Bayes algorithm

Example Data Set

Here X (feature matrix) = ('Outlook', 'Temp', 'Humidity', 'Windy') and y (response vector) = 'Play Golf'.

Naïve Bayes Classifier
The dataset is divided into two parts, namely the feature matrix and the response vector.
• The feature matrix (X = (x1, x2, ..., xn)) contains all the vectors (rows) of the dataset, in which each vector (row) consists of the values of the dependent features. In the above dataset, the features are 'Outlook', 'Temperature', 'Humidity' and 'Windy'.
• The response vector (y) contains the value of the class variable (prediction or output) for each row of the feature matrix. In the above dataset, the class variable name is 'Play golf'.

Assumption:
The fundamental Naive Bayes assumption is that each feature makes an:
1. independent
2. equal
contribution to the outcome.

Naïve Bayes Classifier - Assumptions

With relation to our dataset, this concept can be understood as:
1. We assume that the features are independent. For example, the temperature being 'Hot' has nothing to do with the humidity, and the outlook being 'Rainy' has no effect on the winds. Hence, the features are assumed to be independent.
2. Secondly, each feature is given the same weight (or importance). For example, knowing only the temperature and humidity alone cannot predict the outcome accurately. None of the attributes is irrelevant, and all are assumed to contribute equally to the outcome.

Note: The assumptions made by Naive Bayes are not generally correct in real-world situations. In fact, the independence assumption is never correct, but it often works well in practice.

Bayes Theorem

Bayes' Theorem finds the probability of an event occurring given the probability of another event that has already occurred.
Bayes' theorem is stated mathematically as the following equation:

P(A|B) = P(B|A) * P(A) / P(B)

where A and B are events and P(B) != 0.
Basically, we are trying to find the probability of event A, given that event B is true. Event B is also termed the evidence.
P(A) is the priori of A (the prior probability, i.e. the probability of the event before the evidence is seen). The evidence is an attribute value of an unknown instance (here, event B).
P(A|B) is the a posteriori probability of A, i.e. the probability of the event after the evidence is seen.

Bayes Theorem (continued)

Where:
• P(A|B) - the probability of event A occurring, given event B has occurred
• P(B|A) - the probability of event B occurring, given event A has occurred
• P(A) - the probability of event A
• P(B) - the probability of event B

Naïve Bayes Classifier - Example

Now, with regards to our dataset, we can apply Bayes' theorem in the following way:

P(y|X) = P(X|y) * P(y) / P(X)

where y is the class variable and X is a dependent feature vector (of size n), where X = (x1, x2, ..., xn).
Just to be clear, an example of a feature vector and corresponding class variable can be (refer to the 1st row of the dataset):
X = (Rainy, Hot, High, False)
y = No
So basically, P(y|X) here means the probability of "not playing golf" given that the weather conditions are "Rainy outlook", "Temperature is hot", "high humidity" and "no wind".

Naïve Bayes Classifier - Example (continued)

Assumption
Now it is time to apply the naive assumption to Bayes' theorem, which is independence among the features. So now, we split the evidence into its independent parts.
If any two events A and B are independent, then:
P(A,B) = P(A)P(B)
Hence, we reach the result:

P(y|x1, x2, ..., xn) = [P(x1|y) P(x2|y) ... P(xn|y) P(y)] / [P(x1) P(x2) ... P(xn)] = P(y) ∏ P(xi|y) / [P(x1) P(x2) ... P(xn)]

Now, as the denominator remains constant for a given input, we can remove that term:

P(y|x1, x2, ..., xn) ∝ P(y) ∏ P(xi|y)

Now, we need to create a classifier model.
For this, we find the probability of a given set of inputs for all possible values of the class variable y and pick the output with maximum probability, i.e. the class y that maximizes P(y) ∏ P(xi|y).

So, finally, we are left with the task of calculating P(y) and P(xi|y).
Please note that P(y) is also called the class probability and P(xi|y) is called the conditional probability.

Let us try to apply the above formula manually on our weather dataset. For this, we need to do some precomputations on our dataset.
We need to find P(xi|yj) for each xi in X and yj in y. All these calculations have been demonstrated in the tables below.

Naïve Bayes Classifier

For example, the probability of playing golf given that the temperature is cool, i.e. P(temp = cool | play golf = Yes) = 3/9.

Also, we need to find the class probabilities P(y), which have been calculated in table 5. For example, P(play golf = Yes) = 9/14.

So now, we are done with our pre-computations and the classifier is ready!
Let us test it on a new set of features (let us call it "today"):
today = (Sunny, Hot, Normal, False)

So, the probability of playing golf is given by:

P(Yes|today) = P(Sunny Outlook|Yes) * P(Hot Temperature|Yes) * P(Normal Humidity|Yes) * P(No Wind|Yes) * P(Yes) / P(today)

and the probability of not playing golf is given by:

P(No|today) = P(Sunny Outlook|No) * P(Hot Temperature|No) * P(Normal Humidity|No) * P(No Wind|No) * P(No) / P(today)
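The same calculation can be sketched in pandas, assuming the 14-row weather dataset from the slides is available as a hypothetical file golf.csv with columns Outlook, Temp, Humidity, Windy and Play:

import pandas as pd

df = pd.read_csv("golf.csv")   # hypothetical file holding the weather dataset
today = {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "Normal", "Windy": False}

scores = {}
for label in df["Play"].unique():                   # "Yes" and "No"
    subset = df[df["Play"] == label]
    score = len(subset) / len(df)                   # class probability P(y)
    for feature, value in today.items():
        score *= (subset[feature] == value).mean()  # conditional probability P(xi | y)
    scores[label] = score

print(scores)                        # proportional probabilities (P(today) is ignored)
print(max(scores, key=scores.get))   # predicted class ("Yes" in the worked example above)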

Naïve Bayes Classifier

Since P(today) is common to both probabilities, we can ignore P(today) and compare the proportional probabilities instead.
So, the prediction is that golf would be played: 'Yes'.
The method that we discussed above is applicable for discrete data.

What are the Pros and Cons of Naive Bayes?
Pros:
1. It is easy and fast to predict the class of a test data set.
2. It also performs well in multi-class prediction.
3. When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and you need less training data.
4. It performs well in the case of categorical input variables compared to numerical variable(s).
5. For numerical variables, a normal distribution is assumed (the bell curve, which is a strong assumption).
Cons:
• If a categorical variable has a category (in the test data set) which was not observed in the training data set, the model will assign it a 0 (zero) probability and will be unable to make a prediction. This is often known as "Zero Frequency". To solve this, we can use a smoothing technique; one of the simplest smoothing techniques is called Laplace estimation.
• On the other side, Naive Bayes is also known as a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.
• Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors which are completely independent.

Naïve Bayes Classifier

Applications of Naive Bayes Algorithms
1. Real-time Prediction: Naive Bayes is an eager learning classifier and hence it is fast. Thus, it can be used for making predictions in real time.
2. Multi-class Prediction: this algorithm is also well known for its multi-class prediction feature. Here we can predict the probability of multiple classes of the target variable.
3. Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are mostly used in text classification (due to better results in multi-class problems and the independence rule) and have a higher success rate compared to other algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive and negative customer sentiments).
4. Recommendation Systems: a Naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and predict whether a user would like a given resource or not.
