20IT503 - Big Data Analytics - Unit2


Please read this disclaimer before proceeding:

This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only to the respective group /
learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
20IT503

Big Data Analytics


Department: IT
Batch/Year: 2020-24/III
Created by: K.SELVI,
Assistant Professor
Date: 30.07.2022
Table of Contents

S NO  CONTENTS
1     Table of Contents
2     Course Objectives
3     Pre Requisites (Course Names with Code)
4     Syllabus (With Subject Code, Name, LTPC details)
5     Course Outcomes
6     CO-PO/PSO Mapping
7     Lecture Plan
8     Activity Based Learning
9     2 DESCRIPTIVE ANALYTICS USING STATISTICS
      2.1  Mean, Median and Mode
      2.2  Standard Deviation and Variance
      2.3  Probability
      2.4  Probability Density Function
      2.5  Percentiles and Moments
      2.6  Correlation and Covariance
      2.7  Conditional Probability
      2.8  Bayes' Theorem
      2.9  Introduction to Univariate, Bivariate and Multivariate Analysis
      2.10 Dimensionality Reduction using Principal Component Analysis (PCA)
      2.11 Dimensionality Reduction using LDA
10    Reference Videos
11    Assignments
12    Part A (Q & A)
13    Part B Qs
14    Supportive Online Certification Courses
15    Real time applications in day to day life and to Industry
16    Mini Project Suggestions
17    Prescribed Text Books & Reference Books
Course Objectives
To Understand the Big Data Platform and its Use cases

To Provide an overview of Apache Hadoop

To Provide HDFS Concepts and Interfacing with HDFS

To Understand Map Reduce Jobs


PreRequisites
CS8391 – Data Structures

CS8492 – Database Management System


Syllabus
20IT503 BIG DATA ANALYTICS  L T P C
3 0 0 3

UNIT I INTRODUCTION TO BIG DATA 9

Data Science – Fundamentals and Components –Types of Digital Data – Classification


of Digital Data – Introduction to Big Data – Characteristics of Data – Evolution of Big
Data – Big Data Analytics – Classification of Analytics – Top Challenges Facing Big
Data – Importance of Big Data Analytics.

UNIT II DESCRIPTIVE ANALYTICS USING STATISTICS 9

Mean, Median and Mode – Standard Deviation and Variance – Probability –


Probability Density Function – Percentiles and Moments – Correlation and Covariance
– Conditional Probability – Bayes’ Theorem – Introduction to Univariate, Bivariate and
Multivariate Analysis – Dimensionality Reduction using Principal Component Analysis
(PCA) and LDA.

UNIT III PREDICTIVE MODELING AND MACHINE LEARNING 9

Linear Regression – Polynomial Regression – Multivariate Regression –Bias/Variance


Trade Off – K Fold Cross Validation – Data Cleaning and Normalization – Cleaning
Web Log Data – Normalizing Numerical Data – Detecting Outliers – Introduction to
Supervised And Unsupervised Learning – Reinforcement Learning – Dealing with
Real World Data – Machine Learning Algorithms –Clustering.
Syllabus
UNIT IV BIG DATA HADOOP FRAMEWORK 9

Introducing Hadoop –Hadoop Overview – RDBMS versus Hadoop – HDFS (Hadoop


Distributed File System): Components and Block Replication – Processing Data with
Hadoop – Introduction to MapReduce – Features of MapReduce – Introduction to
NoSQL: CAP theorem – MongoDB: RDBMS Vs MongoDB – Mongo DB Database Model
– Data Types and Sharding – Introduction to Hive – Hive Architecture – Hive Query
Language (HQL).

UNIT V PYTHON AND R PROGRAMMING 9

Python Introduction – Data types - Arithmetic - control flow – Functions - args -


Strings – Lists – Tuples – sets – Dictionaries Case study: Using R, Python, Hadoop,
Spark and Reporting tools to understand and Analyze the Real world Data sources in
the following domain- financial, Insurance, Healthcare in Iris, UCI datasets.
Course Outcomes
CO# COs K Level

CO1 Identify Big Data and its Business Implications. K3

CO2 List the components of Hadoop and Hadoop Eco- K4


System

CO3 Access and Process Data on Distributed File System K4

CO4 Manage Job Execution in Hadoop Environment K4

CO5 Develop Big Data Solutions using Hadoop Eco System K4


CO – PO/PSO Mapping

CO#   PO1  PO2  PO3  PO4  PO5  PO6  PO7  PO8  PO9  PO10  PO11  PO12  PSO1  PSO2  PSO3
CO1    2    3    3    3    3    1    1    -    1    2     1     1     2     2     2
CO2    2    3    2    3    3    1    1    -    1    2     1     1     2     2     2
CO3    2    3    2    3    3    1    1    -    1    2     1     1     2     2     2
CO4    2    3    2    3    3    1    1    -    1    2     1     1     2     2     2
CO5    2    3    2    3    3    1    1    -    1    2     1     1     1     1     1
Lecture Plan
UNIT II DESCRIPTIVE ANALYTICS USING STATISTICS 9

Mean, Median and Mode – Standard Deviation and Variance – Probability – Probability
Density Function – Percentiles and Moments – Correlation and Covariance – Conditional
Probability – Bayes’ Theorem – Introduction to Univariate, Bivariate and Multivariate
Analysis – Dimensionality Reduction using Principal Component Analysis (PCA) and
LDA.
Session No.   Topics to be covered                                                        Mode of delivery   Reference
1             Mean, Median and Mode                                                       Chalk & board      2
2             Standard Deviation and Variance                                             Chalk & board      2
3             Probability – Probability Density Function                                  Chalk & board      2
4             Percentiles and Moments                                                     Chalk & board      2
5             Correlation and Covariance – Conditional Probability                        Chalk & board      2
6             Bayes' Theorem                                                              Chalk & board      2
7             Introduction to Univariate, Bivariate and Multivariate Analysis             Chalk & board      2
8, 9          Dimensionality Reduction using Principal Component Analysis (PCA) and LDA   Chalk & board      2

CONTENT BEYOND THE SYLLABUS :


NPTEL/OTHER REFERENCES / WEBSITES : -

Ref-book 2:Kim H. Pries and Robert Dunnigan, "Big Data Analytics: A


Practical Guide for Managers " CRC Press, 2015.

NUMBER OF PERIODS : Planned: 9 Actual:

DATE OF COMPLETION : Planned: Actual:

REASON FOR DEVIATION (IF ANY) :

CORRECTIVE MEASURES :

Signature of the Faculty Signature of HoD


ACTIVITY BASED LEARNING

S NO  TOPICS
1     CROSS WORD PUZZLE (https://puzzel.org/en/crossword/play?p=-MX2J3z9r5p8eT6WUO0v)
2. Descriptive Analytics Using statistics

2.1 Mean Median and Mode:

As beginners in data science, we should know the common central tendency measures, the 3 M's: Mean, Median and Mode. One should understand what these terms mean and when each of them is given priority while analyzing a data set and drawing a conclusion, depending upon the type of data.

Types of Data:

The data available for analysis always have two main categories,
Quantitative and Qualitative.

Quantitative data: It has numerical values such as time, speed, etc.

Qualitative data : Non-numerical values such as color, yes or no, etc. There
are two types of Quantitative data.

1. Discrete:

The Discrete data has values that can not be broken down into
fractions such as numbers on dice, number of students in a class, etc.

2. Continuous:

The Continuous data can be available in fractional values such as the


height of a person, distance, etc.

The data can be further classified into Nominal, Ordinal, Interval, and
Ratio data.
Nominal data: It is categorical data such as gender, religion, etc.

Ordinal data: is a measure of rank or order such as exam grades, position in the
company, etc.

Interval data: It is a measure with meaningful, equal intervals between values, such as temperature: 30 °C is hotter than 15 °C, and water is in liquid form when the temperature is between 0 °C and 100 °C.

Ratio data: When, in addition to setting up inequalities, we can also form quotients, such data is known as ratio data; examples are ratios of height, weight, etc.

The measure of Central Tendency:

Data tend to accumulate around the average value of the total data
under consideration. Measures of central tendency will help us to find the middle,
or the average, of a data set. If most of the data is centrally located and there is
a very small spread, it will form a symmetric bell curve. In such conditions the
values of mean, median and mode are equal.

Mean: It is the average of the values. Consider 3 temperature values 30 °C, 40 °C and 50 °C; the mean is (30+40+50)/3 = 40 °C.

Median: It is the centrally located value of the data set sorted in ascending
order. Consider 11 (ODD) values 1,2,3,7,8,3,2,5,4,15,16. We first sort the values
in ascending order 1,2,2,3,3,4,5,7,8,15,16 then the median is 4 which is located
at the 6th number and will have 5 numbers on either side.
If the data set is having an even number of values then the median
can be found by taking the average of the two middle values. Consider 10
(EVEN) values 1,2,3,7,8,3,2,5,4,15. We first sort the values in ascending order
1,2,2,3,3,4,5,7,8,15 then the median is (3+4)/2=3.5 which is the average of the
two middle values i.e. the values which are located at the 5th and 6th number in
the sequence and will have 4 numbers on either side.

Mode: It is the most frequent value in the data set. We can easily get the mode by counting the frequency of occurrence of each value. Consider a data set with the values 1,5,5,6,8,2,6,6. In this data set, the value 6 occurs three times, more often than any other value, so the mode is 6.

We often test our data by plotting the distribution curve, if most of


the values are centrally located and very few values are off from the center
then we say that the data is having a normal distribution. At that time the
values of mean, median, and mode are almost equal.
However, the picture changes when our data is skewed, for example with a right-skewed data set.

We can say that the mean is being dragged in the direction of the
skew. In this skewed distribution, mode < median < mean. The more skewed
the distribution, the greater the difference between the median and mean, here
we consider median for the conclusion. The best example of the right-skewed
distribution is salaries of employees, where higher-earners provide a false
representation of the typical income if expressed as mean salaries and not the
median salaries.
For left-skewed distribution mean < median < mode. In such a case
also, we emphasize the median value of the distribution.
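These measures are easy to compute with Python's built-in statistics module; the short sketch below reuses the small data sets from this section.

```python
import statistics

temperatures = [30, 40, 50]
print(statistics.mean(temperatures))        # 40, the mean temperature

odd_values = [1, 2, 3, 7, 8, 3, 2, 5, 4, 15, 16]
print(statistics.median(odd_values))        # 4, the middle of the 11 sorted values

even_values = [1, 2, 3, 7, 8, 3, 2, 5, 4, 15]
print(statistics.median(even_values))       # 3.5, the average of the two middle values

mode_values = [1, 5, 5, 6, 8, 2, 6, 6]
print(statistics.mode(mode_values))         # 6, the most frequent value
```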
2.2 Standard Deviation and Variance:

Variance:

Variance is the average of the squared differences from the mean.

To understand what it means, let's first take a look at a data set consisting of a list of 10 salaries.
Notice that most of the values are concentrated around 15,000 and
35,000, but there is an extreme value (an outlier) of 200,000 that pushes up the
mean to 40,500 and dilates the range to 185,000.
Now, going back to the concept introduced earlier, let’s calculate the
variance. We are going to add together the square of the differences of each
point from the mean, and then divide it by the number of values in the set.

Remember that the mean (μ) is 40,500.


The variance is commonly represented as the Greek lowercase letter sigma squared (σ²). One way of finding the variance is using the following equation:

\[ \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 \]

where x represents each term in the set, μ is the mean, and n is the number of terms in the set.
There is also a quicker way to find the variance, using the equivalent shortcut form:

\[ \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}x_i^2 - \mu^2 \]

So now we have the value of the variance, but you might have noticed that it is extremely large! That is because the variance is measured using the values squared.
Although the variance is widely used, there is another statistical
concept that is more intuitive when we are measuring the variability of data
around the mean. That is the Standard Deviation.
Standard Deviation
After identifying the variance, finding the standard deviation is pretty
straightforward. It’s the square root of the variance.
Remember that the symbol of the variance is σ²? The standard deviation is
represented by σ.

In our data set, the standard deviation will be:

Notice that 53,500 feels more connected to the values in our list, but
what does it mean? It means that the salaries in our list are, on average,
$53,500 away from the mean.
The closer the values are from the mean, the smaller the standard
deviation. In our case, the value of the standard deviation was stretched because
we have an outlier.
Now, just to see how the standard deviation changes, let’s eliminate
the outlier. Our salaries’ list now remains with 9 values:

Without the outlier, the value of the standard deviation declined


dramatically. Considering this new set of values, the salaries would be, on
average, $6,285 away from the mean, which in this case is 22,777.
Standard deviation is an excellent way to identify outliers. Data
points that lie more than one standard deviation from the mean can be
considered unusual.
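A short sketch of these calculations in Python; the salary values below are hypothetical stand-ins, since the exact list from the original slides is not reproduced here.

```python
import statistics

# Hypothetical list of 10 salaries with one large outlier
# (illustrative stand-ins, not the exact values from the slides)
salaries = [15000, 18000, 21000, 24000, 26000,
            28000, 30000, 33000, 35000, 200000]

mu = statistics.mean(salaries)
variance = statistics.pvariance(salaries)   # average squared deviation from the mean
std_dev = statistics.pstdev(salaries)       # square root of the variance

print(mu, variance, std_dev)

# Dropping the outlier shrinks the standard deviation dramatically
without_outlier = salaries[:-1]
print(statistics.pstdev(without_outlier))
```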
2.3 Probability:
Probability implies 'likelihood' or 'chance'. When an event is certain
to happen then the probability of occurrence of that event is 1 and when it is
certain that the event cannot happen then the probability of that event is 0.
Hence the value of probability ranges from 0 to 1.
Classical Definition of Probability
As the name suggests the classical approach to defining probability
is the oldest approach. It states that if there are n exhaustive, mutually
exclusive and equally likely cases out of which m cases are favourable to the
happening of event A.
Then the probability of event A is defined by the following probability function:
Formula
The probability of an event = (Number of favourable outcomes) / (Total number of possible outcomes)
P(A) = n(E) / n(S)
Since the favourable outcomes can never exceed the total outcomes, 0 ≤ P(A) ≤ 1.

Here, P(A) means the probability of an event A, n(E) means the number of favourable outcomes of the event and n(S) means the total number of all possible outcomes of the event.

Thus, to calculate the probability we need information on the number of favourable cases and the total number of equally likely cases. This can be explained using the following example.
Example
Problem Statement:
A coin is tossed. What is the probability of getting a head?
Solution:
Total number of equally likely outcomes (n) = 2 (i.e. head or tail)
Number of outcomes favorable to head (m) = 1
P(head)=1/2
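As a quick check, the classical value can be approximated by a relative-frequency simulation (a small illustrative sketch, not part of the original example):

```python
import random

random.seed(0)
trials = 100_000
heads = sum(random.choice(["head", "tail"]) == "head" for _ in range(trials))
print(heads / trials)   # close to the classical probability 1/2
```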
Probability and Statistics:

We use a lot of probability concepts in statistics, and hence in machine learning, since both rely on the same methodologies. In probability, the model is given and we need to predict the data, while in statistics we start with the data and infer the model. We look through the distributions studied in probability theory and search for one that closely matches the distribution of the data we have; then we assume that the underlying function, or model, is the same as the one we examined in probability theory.
2.4 Probability Density Function(PDF):
The Probability Density Function(PDF) defines the probability function
representing the density of a continuous random variable lying between a specific
range of values. In other words, the probability density function produces the
likelihood of values of the continuous random variable. Sometimes it is also called
a probability distribution function or just a probability function. However, some sources use the term loosely for functions over a broader set of cases, sometimes referring to the cumulative distribution function or to the probability mass function (PMF). Strictly speaking, the PDF (probability density function) is defined for continuous random variables, whereas the PMF (probability mass function) is defined for discrete random variables.
In the case of a continuous random variable, the probability taken by
X on some given value x is always 0. In this case, if we find P(X = x), it does not
work. Instead of this, we must calculate the probability of X lying in an interval (a,
b). Now, we have to figure it for P(a< X< b), and we can calculate this using the
formula of PDF. The Probability density function formula is given as,

\[ P(a < X < b) = \int_{a}^{b} f(x)\,dx \]

or

\[ P(a \le X \le b) = \int_{a}^{b} f(x)\,dx \]
This is because, when X is continuous, we can ignore the
endpoints of intervals while finding probabilities of continuous random
variables. That means, for any constants a and b,
P(a ≤ X ≤ b) = P(a < X ≤ b) = P(a ≤ X < b) = P(a < X < b).
Probability Density Function Graph
The probability density function is defined by an integral of the density of the variable over a given range. It is denoted by f(x). This function is non-negative at any point of the graph, and the definite integral of the PDF over the entire space is always equal to one. The graph of a PDF typically resembles a bell curve, with the probability of the outcomes given by the area below the curve; such a graph can be drawn for a continuous random variable x with density function f(x).
Probability Density Function Properties
Let X be a continuous random variable with density function f(x). The probability density function should satisfy the following conditions:
• For a continuous random variable that takes some value between certain limits, say a and b, the probability is calculated by finding the area under its curve and the X-axis within the lower limit (a) and upper limit (b). Thus,

\[ P(a \le X \le b) = \int_{a}^{b} f(x)\,dx \]

• The probability density function is non-negative for all possible values, i.e. f(x) ≥ 0 for all x.
• The area between the density curve and the horizontal X-axis is equal to 1, i.e.

\[ \int_{-\infty}^{\infty} f(x)\,dx = 1 \]

Due to the properties of continuous random variables, the density function curve is continuous over the entire given range, and it is defined over the range of continuous values, that is, the domain of the variable.
Probability Density Function Example
Question:
Let X be a continuous random variable with the PDF given by:

\[ f(x) = \begin{cases} x, & 0 < x < 1 \\ 2 - x, & 1 < x < 2 \\ 0, & x > 2 \end{cases} \]

Find P(0.5 < X < 1.5).
Solution:
Given PDF is:

\[ f(x) = \begin{cases} x, & 0 < x < 1 \\ 2 - x, & 1 < x < 2 \\ 0, & x > 2 \end{cases} \]

\[ P(0.5 < X < 1.5) = \int_{0.5}^{1.5} f(x)\,dx \]

Let us split the integral by taking the intervals as given below:

\[ = \int_{0.5}^{1} f(x)\,dx + \int_{1}^{1.5} f(x)\,dx \]

Integrating the functions, we get:

\[ = \left[\frac{x^2}{2}\right]_{0.5}^{1} + \left[2x - \frac{x^2}{2}\right]_{1}^{1.5} \]

= [(1)²/2 − (0.5)²/2] + {[2(1.5) − (1.5)²/2] − [2(1) − (1)²/2]}
= (1/2 − 1/8) + [(3 − 9/8) − (2 − 1/2)]
= 3/8 + (15/8 − 3/2)
= (3 + 15 − 12)/8
= 6/8
= 3/4
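The same result can be verified numerically, for example with SciPy's quad integrator (a minimal sketch, assuming SciPy is available):

```python
from scipy.integrate import quad

def f(x):
    # piecewise PDF from the example above
    if 0 < x < 1:
        return x
    if 1 <= x < 2:
        return 2 - x
    return 0.0

prob, _ = quad(f, 0.5, 1.5, points=[1.0])   # split at the kink at x = 1
print(prob)   # approximately 0.75, matching the analytical answer 3/4
```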
Applications of Probability Density Function
The probability density function has many applications in different
fields of study such as statistics, science and engineering. Some of the
important applications of the probability density function are listed below:
• In Statistics, it is used to calculate the probabilities associated with the
random variables.
• The probability density function is used in modelling the annual data of
atmospheric NOx temporal concentration
• It is used to model diesel engine combustion.
2.5 Percentiles and Moments:

Percentiles:

In statistics, percentiles are used to understand and interpret


data. The nth percentile of a set of data is the value at which n percent
of the data is below it. In everyday life, percentiles are used to
understand values such as test scores, health indicators, and other
measurements. For example, an 18-year-old male who is six and a half
feet tall is in the 99th percentile for his height. This means that of all the
18-year-old males, 99 percent have a height that is equal to or less than
six and a half feet. An 18-year-old male who is only five and a half feet
tall, on the other hand, is in the 16th percentile for his height, meaning
only 16 percent of males his age are the same height or shorter.
• Percentiles are used to understand and interpret data. They indicate
the values below which a certain percentage of the data in a data set
is found.
• Percentiles can be calculated using the formula n = (P/100) x N,
where P = percentile, N = number of values in a data set (sorted
from smallest to largest), and n = ordinal rank of a given value.
• Percentiles are frequently used to understand test scores and biometric measurements.

What Percentile Means

Percentiles should not be confused with percentages. The


latter is used to express fractions of a whole, while percentiles are the
values below which a certain percentage of the data in a data set is
found.
In practical terms, there is a significant difference between the
two. For example, a student taking a difficult exam might earn a score of 75
percent. This means that he correctly answered three out of every four
questions. A student who scores in the 75th percentile, however, has obtained
a different result. This percentile means that the student earned a higher
score than 75 percent of the other students who took the exam. In other
words, the percentage score reflects how well the student did on the exam
itself; the percentile score reflects how well he did in comparison to other
students.

Percentile Formula:
Percentiles for the values in a given data set can be calculated
using the formula:
n = (P/100) x N
where N = number of values in the data set, P = percentile, and n
= ordinal rank of a given value (with the values in the data set sorted from
smallest to largest). For example, take a class of 20 students that earned the
following scores on their most recent test: 75, 77, 78, 78, 80, 81, 81, 82, 83,
84, 84, 84, 85, 87, 87, 88, 88, 88, 89, 90. These scores can be represented as
a data set with 20 values: {75, 77, 78, 78, 80, 81, 81, 82, 83, 84, 84, 84, 85,
87, 87, 88, 88, 88, 89, 90}.
We can find the score that marks the 20th percentile by plugging
in known values into the formula and solving for n:
n = (20/100) x 20
n=4
The fourth value in the data set is the score 78. This means that
78 marks the 20th percentile; of the students in the class, 20 percent earned
a score of 78 or lower.
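A small Python sketch of the ordinal-rank formula applied to the scores above; rounding a fractional rank up to the next whole number is an assumption for the general case, since this worked example happens to give a whole rank.

```python
import math

scores = [75, 77, 78, 78, 80, 81, 81, 82, 83, 84,
          84, 84, 85, 87, 87, 88, 88, 88, 89, 90]

def percentile_value(data, p):
    """Value at the p-th percentile using the ordinal-rank formula n = (P/100) * N."""
    data = sorted(data)
    rank = (p / 100) * len(data)
    rank = max(1, math.ceil(rank))    # round a fractional rank up to the next whole number
    return data[rank - 1]

print(percentile_value(scores, 20))   # 78, the 4th value in the sorted list
```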
Deciles and Common Percentiles:
Given a data set that has been ordered in increasing magnitude,
the median, first quartile, and third quartile can be used to split the data into
four pieces. The first quartile is the point at which one-fourth of the data lies
below it. The median is located exactly in the middle of the data set, with half
of all the data below it. The third quartile is the place where three-fourths of
the data lies below it.
The median, first quartile, and third quartile can all be stated in
terms of percentiles. Since half of the data is less than the median, and one-
half is equal to 50 percent, the median marks the 50th percentile. One-fourth
is equal to 25 percent, so the first quartile marks the 25th percentile. The
third quartile marks the 75th percentile.
Besides quartiles, a fairly common way to arrange a set of data is
by deciles. Each decile includes 10 percent of the data set. This means that
the first decile is the 10th percentile, the second decile is the 20th percentile,
etc. Deciles provide a way to split a data set into more pieces than quartiles
without splitting the set into 100 pieces as with percentiles.
Applications of Percentiles
Percentile scores have a variety of uses. Anytime that a set of data
needs to be broken into digestible chunks, percentiles are helpful. They are
often used to interpret test scores—such as SAT scores—so that test-takers
can compare their performance to that of other students. For example, a
student might earn a score of 90 percent on an exam. That sounds pretty
impressive; however, it becomes less so when a score of 90 percent
corresponds to the 20th percentile, meaning only 20 percent of the class
earned a score of 90 percent or lower.
Moments:
Moments in mathematical statistics involve a basic
calculation. These calculations can be used to find a probability distribution's
mean, variance, and skewness.
Suppose that we have a set of data with a total
of n discrete points. One important calculation, which is actually several
numbers, is called the sth moment. The sth moment of the data set with
values x1, x2, x3, ... , xn is given by the formula:
\( (x_1^s + x_2^s + x_3^s + \dots + x_n^s)/n \)
Using this formula requires us to be careful with our order of
operations. We need to do the exponents first, add, then divide this sum
by n, the total number of data values.
First Moment:
For the first moment, we set s = 1. The formula for the first moment is thus:
\( (x_1 + x_2 + x_3 + \dots + x_n)/n \)
This is identical to the formula for the sample mean.
The first moment of the values 1, 3, 6, 10 is (1 + 3 + 6 + 10)/4 = 20/4 = 5.
Second Moment:
For the second moment we set s = 2. The formula for the second moment is:
\( (x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2)/n \)
The second moment of the values 1, 3, 6, 10 is (1² + 3² + 6² + 10²)/4 = (1 + 9 + 36 + 100)/4 = 146/4 = 36.5.
Third Moment:
For the third moment we set s = 3. The formula for the third moment is:
\( (x_1^3 + x_2^3 + x_3^3 + \dots + x_n^3)/n \)
The third moment of the values 1, 3, 6, 10 is (1³ + 3³ + 6³ + 10³)/4 = (1 + 27 + 216 + 1000)/4 = 1244/4 = 311.
Higher moments can be calculated in a similar way. Just
replace s in the above formula with the number denoting the desired moment.
Moments About the Mean
A related idea is that of the sth moment about the mean. In this
calculation we perform the following steps:
1. First, calculate the mean of the values.
2. Next, subtract this mean from each value.
3. Then raise each of these differences to the sth power.
4. Now add the numbers from step #3 together.
5. Finally, divide this sum by the number of values we started with.
The formula for the sth moment about the mean m of the values x1, x2, x3, ..., xn is given by:
\( m_s = ((x_1 - m)^s + (x_2 - m)^s + (x_3 - m)^s + \dots + (x_n - m)^s)/n \)
First Moment About the Mean
The first moment about the mean is always equal to zero, no
matter what the data set is that we are working with. This can be seen in the
following:
\( m_1 = ((x_1 - m) + (x_2 - m) + (x_3 - m) + \dots + (x_n - m))/n = ((x_1 + x_2 + x_3 + \dots + x_n) - nm)/n = m - m = 0 \)
Second Moment About the Mean
The second moment about the mean is obtained from the above formula by setting s = 2:
\( m_2 = ((x_1 - m)^2 + (x_2 - m)^2 + (x_3 - m)^2 + \dots + (x_n - m)^2)/n \)
This formula is equivalent to that for the sample variance.
For example, consider the set 1, 3, 6, 10. We have already
calculated the mean of this set to be 5. Subtract this from each of the data
values to obtain differences of:
1 − 5 = −4
3 − 5 = −2
6 − 5 = 1
10 − 5 = 5
We square each of these values and add them together: (−4)² + (−2)² + 1² + 5² = 16 + 4 + 1 + 25 = 46. Finally, divide this number by the number of data points: 46/4 = 11.5.
Applications of Moments
As mentioned above, the first moment is the mean and the second
moment about the mean is the sample variance. Karl Pearson introduced the
use of the third moment about the mean in calculating skewness and the
fourth moment about the mean in the calculation of kurtosis.
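The moment calculations for the example values 1, 3, 6, 10 can be sketched in a few lines of Python:

```python
def raw_moment(data, s):
    """s-th moment: average of each value raised to the power s."""
    return sum(x ** s for x in data) / len(data)

def central_moment(data, s):
    """s-th moment about the mean."""
    m = raw_moment(data, 1)   # the mean
    return sum((x - m) ** s for x in data) / len(data)

values = [1, 3, 6, 10]
print(raw_moment(values, 1))       # 5.0   (first moment, the mean)
print(raw_moment(values, 2))       # 36.5  (second moment)
print(raw_moment(values, 3))       # 311.0 (third moment)
print(central_moment(values, 1))   # 0.0   (first moment about the mean)
print(central_moment(values, 2))   # 11.5  (second moment about the mean, the variance)
```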
2.6 Covariance and Correlation

Let’s say we have two different attributes of something.


Covariance and Correlation are the tools that we have to measure if the
two attributes are related to each other or not.
Covariance:
Covariance measures how two variables vary in tandem around their means. The formula to calculate covariance is:

\[ \mathrm{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}\bigl(x_i - E(X)\bigr)\bigl(y_i - E(Y)\bigr) \]

where x and y are the individual values of X and Y, with i = 1, 2, ..., n, the probability of each value occurring is equal and is equal to (1/n), and E(X) and E(Y) are the means of X and Y.
Correlation:

Correlation also measures how two variables move with respect to each other. A perfect positive correlation means that the correlation coefficient is 1. A perfect negative correlation means that the correlation coefficient is -1. A correlation coefficient of 0 means that the two variables are uncorrelated, i.e. have no linear relationship. The correlation coefficient can be found using the following formula:

\[ \rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y} \]

Both correlation and covariance only measure the linear relationship between data; they will fail to discover any higher-order relationship between the two. Correlation is a special case of covariance in which the data is standardized. Correlation is usually the better measure, because it indicates not only whether there is a relationship but also the extent of that relationship.
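A brief NumPy sketch of both measures, using made-up paired observations (for example, temperature versus ice cream sales):

```python
import numpy as np

# Hypothetical paired observations: temperature (x) and ice cream sales (y)
x = np.array([20, 22, 25, 27, 30, 33, 35])
y = np.array([110, 125, 150, 160, 190, 215, 230])

covariance = np.cov(x, y, bias=True)[0, 1]    # population covariance (divide by n)
correlation = np.corrcoef(x, y)[0, 1]         # Pearson correlation coefficient

print(covariance)    # positive: the two variables increase together
print(correlation)   # close to +1: strong positive linear relationship
```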

2.7 Conditional Probability:


Conditional probability is the probability of one event occurring given that another, related event has already occurred. A convenient way of handling such probabilities is by applying Bayes' theorem, which provides a simple way of calculating conditional probabilities.
Speaking mathematically, the probability of the model given the data is the probability of the data given the model, times the ratio of the independent probability of the model to the independent probability of the data:

P(model | data) = P(data | model) × P(model) / P(data)

Bayes’ theorem is simple but has profound implications. The


degree of belief in a machine learning model can also be thought of as a probability, and machine learning can be thought of as learning models of data. Thus, we can consider multiple models, find the probability of each model given the data, and then choose the model which has the highest probability.
2.8 BAYES’ THEOREM

The conditional probability of event C occurring, given that event A has already occurred, is denoted as P(C|A), which can be found using the formula

\[ P(C \mid A) = \frac{P(A \cap C)}{P(A)} \]

With some minor algebra and substitution of the conditional probability, the following form is obtained:

\[ P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)} \]

where C is the class label, C ∈ {c1, c2, ..., cn}, and A is the set of observed attributes, A = {a1, a2, ..., am}. This is the most common form of Bayes' theorem.

Mathematically, Bayes‘ theorem gives the relationship between


the probabilities of C and A, P(C) and P(A), and the conditional
probabilities of C given A and A given C, namely P(C|A) and P(A|C).
Bayes‘ theorem is significant because quite often P(C | A) is much
more difficult to compute than P(A|C) and P(C) from the training data.
Illustration of Bayes' theorem with an example.

Example I

John flies frequently and likes to upgrade his seat to first class. He has determined that if he checks in for his flight at least two hours early, the probability that he will get an upgrade is 0.75; otherwise, the probability that he will get an upgrade is 0.35. With his busy schedule, he checks in at least two hours before his flight only 40% of the time. Suppose John did not receive an upgrade on his most recent attempt. What is the probability that he did not arrive two hours early?
Let C = {John arrived at least two hours early}, and

A = {John received an upgrade}, then

¬C = {John did not arrive two hours early}, and

¬A = {John did not receive an upgrade}.


John checked in at least two hours early only 40% of the time, or P(C) =
0.4.

Therefore, P(¬C)= 1−P(C) = 0.6.

The probability that John received an upgrade given that he checked in


early is 0.75, or P(A|C)=0.75.

The probability that John received an upgrade given that he did not
arrive two hours early is 0.35, or P(A|¬C)=0.35.

Therefore, P(¬A|¬C)=0.65.

The probability that John received an upgrade, P(A), can be computed using the law of total probability:

P(A) = P(A|C)P(C) + P(A|¬C)P(¬C) = 0.75 × 0.4 + 0.35 × 0.6 = 0.51

Thus, the probability that John did not receive an upgrade is P(¬A) = 1 − 0.51 = 0.49. Using Bayes' theorem, the probability that John did not arrive two hours early given that he did not receive his upgrade is

P(¬C|¬A) = P(¬A|¬C)P(¬C) / P(¬A) = (0.65 × 0.6) / 0.49 ≈ 0.796

Example II
Another example involves computing the probability that a patient carries a
disease based on the result of a lab test. Assume that a patient named Mary
took a lab test for a certain disease and the result came back positive. The
test returns a positive result in 95% of the cases in which the disease is
actually present, and it returns a positive result in 6% of the cases in which
the disease is not present. Furthermore, 1% of the entire population has
this disease. What is the probability that Mary actually has the disease,
given that the test is positive?

Let C = {having the disease} and

A = {testing positive}.

The goal is to solve the probability of having the disease, given that Mary has
a positive test result, P(C|A).
From the problem description, P(C)=0.01, P(¬C)=0.99, P(A|C)=0.95
and P(A|¬C)=0.06.

Bayes' theorem defines P(C|A) = P(A|C)P(C) / P(A). The probability of testing positive, P(A), needs to be computed first:

P(A) = P(A|C)P(C) + P(A|¬C)P(¬C) = 0.95 × 0.01 + 0.06 × 0.99 = 0.0095 + 0.0594 = 0.0689

According to Bayes' theorem, the probability of having the disease, given that Mary has a positive test result, is

P(C|A) = P(A|C)P(C) / P(A) = 0.0095 / 0.0689 ≈ 0.1379

That means that the probability of Mary actually having the disease
given a positive test result is only 13.79%. This result indicates that the lab test
may not be a good one. The likelihood of having the disease was 1% when the
patient walked in the door and only 13.79% when the patient walked out, which
would suggest further tests.
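Both examples can be reproduced with a few lines of Python; the small bayes helper below is our own illustrative function, not something from the text.

```python
def bayes(prior_c, p_a_given_c, p_a_given_not_c):
    """Return P(C|A) from P(C), P(A|C) and P(A|not C) via the total-probability rule."""
    p_a = p_a_given_c * prior_c + p_a_given_not_c * (1 - prior_c)
    return p_a_given_c * prior_c / p_a

# Example I: P(not C | not A), i.e. P(did not arrive early | no upgrade).
# P(no upgrade | did not arrive early) = 0.65, P(no upgrade | arrived early) = 1 - 0.75 = 0.25,
# and P(did not arrive early) = 0.6.
print(bayes(0.6, 0.65, 0.25))    # ~0.796

# Example II: P(disease | positive test)
print(bayes(0.01, 0.95, 0.06))   # ~0.1379
```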

2.9 Univariate, Bivariate and Multivariate statistical analysis.

First we must understand the types of variables:


1. Categorical variables — variables that have a finite number of categories
or distinct groups. Examples: gender, method of payment, horoscope, etc.
2. Numerical variables — variables that consist of numbers. There are two
main numerical variables.
• Discrete variables — variables that can be counted within a
finite time. Examples: the change in your pocket, number of
students in a class, numerical grades, etc.
• Continuous variables — variables that can take infinitely many values, often measured on some kind of scale. Examples: weight, height, temperature, date and time of a payment, etc.
Univariate Analysis:

This type of data consists of only one variable. The analysis of


univariate data is thus the simplest form of analysis since the information deals
with only one quantity that changes. It does not deal with causes or relationships
and the main purpose of the analysis is to describe the data and find patterns

that exist within it. An example of univariate data is height. Suppose that the heights of seven students in a class are recorded; there is only one variable, height, and it does not deal with any cause or relationship.
by drawing conclusions using central tendency measures (mean, median and
mode), dispersion or spread of data (range, minimum, maximum, quartiles,
variance and standard deviation) and by using frequency distribution tables,
histograms, pie charts, frequency polygon and bar charts.
2. Bivariate Analysis:
This type of data involves two different variables. The analysis of
this type of data deals with causes and relationships, and the analysis is done to find out the relationship between the two variables. An example of bivariate data is temperature and ice cream sales in the summer season.

Suppose temperature and ice cream sales are the two variables of a bivariate data set. Here, temperature and sales are directly proportional to each other and thus related,
because as the temperature increases, the sales also increase. Thus bivariate
data analysis involves comparisons, relationships, causes and explanations.
These variables are often plotted on X and Y axis on the graph for better
understanding of data and one of these variables is independent while the other
is dependent.
3. Multivariate Analysis:
When the data involves three or more variables, it is categorized
under multivariate analysis. For example, suppose an advertiser wants to compare the popularity of four advertisements on a website; their click rates could be measured for both men and women, and the relationships between the variables can then be examined.
It is similar to bivariate but contains more than one dependent
variable. The ways to perform analysis on this data depends on the goals to be
achieved. Some of the techniques are regression analysis, path analysis, factor
analysis and multivariate analysis of variance (MANOVA).
2.10 Dimensionality Reduction:
The higher the number of features, the harder it gets to visualize the
training set and then work on it. Sometimes, most of these features are
correlated, and hence redundant. This is where dimensionality reduction
algorithms come into play.
Dimensionality reduction is the process of reducing the number of
random variables under consideration, by obtaining a set of principal variables.
It can be divided into feature selection and feature extraction.
There are two components of dimensionality reduction:
Feature selection: In this, we try to find a subset of the original set of
variables, or features, to get a smaller subset which can be used to model the
problem. It usually involves three ways:
Filter
Wrapper
Embedded
Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional space, i.e. a space with a smaller number of dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include:
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be either linear or non-linear, depending upon the method used. The primary linear method is called Principal Component Analysis, or PCA.
Principal Component Analysis
Principal Component Analysis, or PCA, is a dimensionality-reduction
method that is often used to reduce the dimensionality of large data sets, by
transforming a large set of variables into a smaller one that still contains most
of the information in the large set.
Reducing the number of variables of a data set naturally comes at
the expense of accuracy, but the trick in dimensionality reduction is to trade a
little accuracy for simplicity, because smaller data sets are easier to explore and visualize, and they make analyzing data much easier and faster for machine learning algorithms, without extraneous variables to process.
So to sum up, the idea of PCA is simple — reduce the number of variables of a
data set, while preserving as much information as possible
Step by Step Explanation of PCA
STEP 1: STANDARDIZATION
The aim of this step is to standardize the range of the continuous
initial variables so that each one of them contributes equally to the analysis.
More specifically, the reason why it is critical to perform standardization prior to
PCA, is that the latter is quite sensitive regarding the variances of the initial
variables. That is, if there are large differences between the ranges of initial
variables, those variables with larger ranges will dominate over those with small
ranges (For example, a variable that ranges between 0 and 100 will dominate
over a variable that ranges between 0 and 1), which will lead to biased results.
So, transforming the data to comparable scales can prevent this problem.
Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value of each variable:

\[ z = \frac{x - \mu}{\sigma} \]

Once the standardization is done, all the variables will be transformed to the same scale.
STEP 2: COVARIANCE MATRIX COMPUTATION
The aim of this step is to understand how the variables of the input
data set are varying from the mean with respect to each other, or in other
words, to see if there is any relationship between them. Because sometimes,
variables are highly correlated in such a way that they contain redundant
information. So, in order to identify these correlations, we compute the
covariance matrix.
The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:

\[ \begin{bmatrix} \mathrm{Cov}(x,x) & \mathrm{Cov}(x,y) & \mathrm{Cov}(x,z) \\ \mathrm{Cov}(y,x) & \mathrm{Cov}(y,y) & \mathrm{Cov}(y,z) \\ \mathrm{Cov}(z,x) & \mathrm{Cov}(z,y) & \mathrm{Cov}(z,z) \end{bmatrix} \]

Covariance Matrix for 3-Dimensional Data


Since the covariance of a variable with itself is its variance
(Cov(a,a)=Var(a)), in the main diagonal (Top left to bottom right) we actually
have the variances of each initial variable. And since the covariance is
commutative (Cov(a,b)=Cov(b,a)), the entries of the covariance matrix are
symmetric with respect to the main diagonal, which means that the upper and
the lower triangular portions are equal.
It is actually the sign of the covariance that matters:
• if positive, then the two variables increase or decrease together (correlated);
• if negative, then one increases when the other decreases (inversely correlated).
Now that we know that the covariance matrix is not more than a
table that summarizes the correlations between all the possible pairs of
variables, let’s move to the next step.
STEP 3: COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE
COVARIANCE MATRIX TO IDENTIFY THE PRINCIPAL COMPONENTS
Eigenvectors and eigenvalues are the linear algebra concepts that
we need to compute from the covariance matrix in order to determine
the principal components of the data. Before getting to the explanation of these
concepts, let’s first understand what we mean by principal components.
Principal components are new variables that are constructed as
linear combinations or mixtures of the initial variables. These combinations are
done in such a way that the new variables (i.e., principal components) are
uncorrelated and most of the information within the initial variables is squeezed
or compressed into the first components. So, the idea is 10-dimensional data
gives you 10 principal components, but PCA tries to put maximum possible
information in the first component, then maximum remaining information in the
second and so on, until having something like shown in the scree plot below.

Percentage of Variance (Information) for each PC

Organizing information in principal components this way, will allow


you to reduce dimensionality without losing much information, and this by
discarding the components with low information and considering the remaining
components as your new variables.
An important thing to realize here is that the principal components
are less interpretable and don’t have any real meaning since they are
constructed as linear combinations of the initial variables.
Geometrically speaking, principal components represent the
directions of the data that explain a maximal amount of variance, that is to
say, the lines that capture most information of the data. The relationship
between variance and information here, is that, the larger the variance carried
by a line, the larger the dispersion of the data points along it, and the larger the
dispersion along a line, the more the information it has. To put all this simply,
just think of principal components as new axes that provide the best angle to
see and evaluate the data, so that the differences between the observations are
better visible.
STEP 4: FEATURE VECTOR
As we saw in the previous step, computing the eigenvectors and
ordering them by their eigenvalues in descending order, allow us to find the
principal components in order of significance. In this step, what we do is, to
choose whether to keep all these components or discard those of lesser
significance (of low eigenvalues), and form with the remaining ones a matrix of
vectors that we call Feature vector.
So, the feature vector is simply a matrix that has as columns the
eigenvectors of the components that we decide to keep. This makes it the first
step towards dimensionality reduction, because if we choose to keep
only p eigenvectors (components) out of n, the final data set will have
only p dimensions.
LAST STEP: RECAST THE DATA ALONG THE PRINCIPAL COMPONENTS
AXES
In the previous steps, apart from standardization, you do not make
any changes on the data, you just select the principal components and form the
feature vector, but the input data set remains always in terms of the original
axes (i.e, in terms of the initial variables).
In this step, which is the last one, the aim is to use the feature
vector formed using the eigenvectors of the covariance matrix, to reorient the
data from the original axes to the ones represented by the principal components
(hence the name Principal Components Analysis). This can be done by
multiplying the transpose of the original data set by the transpose of the feature
vector.
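The steps above can be sketched with NumPy as follows; the tiny two-feature data set is made up purely for illustration.

```python
import numpy as np

# Tiny two-feature data set, made up purely for illustration (rows = samples)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# Step 1: standardize each variable (zero mean, unit variance)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)        # eigh: suited to symmetric matrices
order = np.argsort(eig_vals)[::-1]              # sort by decreasing eigenvalue
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Step 4: feature vector = the eigenvectors we decide to keep (here, the first component)
feature_vector = eig_vecs[:, :1]

# Last step: recast the standardized data along the principal component axes
X_pca = X_std @ feature_vector
print(eig_vals)   # variance carried by each component
print(X_pca)      # data expressed along the first principal component
```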

2.11 Linear Discriminant Analysis (LDA)


LDA is a type of linear combination, a mathematical process that uses various data items and applies a function to them in order to separately analyze multiple classes of objects or items.
Linear discriminant analysis can be useful in areas like image recognition and predictive analysis in marketing.
The fundamental idea of linear combinations goes back as far as the 1960s with the Altman Z-scores for bankruptcy and other predictive constructs. LDA helps with predictive modelling for more than two classes, where logistic regression is not sufficient. Linear discriminant analysis takes the mean value for each class and considers the variance in order to make predictions, assuming a Gaussian distribution.
How does Linear Discriminant Analysis (LDA) work?
The general steps for performing a Linear Discriminant Analysis are:
1. Compute the d-dimensional mean vectors for the different classes from the dataset.
2. Compute the scatter matrices (the between-class and the within-class scatter matrix).
3. Compute the eigenvectors and corresponding eigenvalues for the scatter matrices, sort them by decreasing eigenvalue, and choose the k eigenvectors with the largest eigenvalues to form a d × k dimensional matrix W (where every column represents an eigenvector).
4. Use the d × k eigenvector matrix W to transform the samples onto the new subspace.
This can be summarized by the matrix multiplication Y = X × W, where X is an n × d matrix representing the n samples and Y contains the transformed n × k dimensional samples in the new subspace.
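A minimal sketch of LDA in practice, assuming scikit-learn is available and using its bundled Iris data set (three classes, four features):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)               # 150 samples, 4 features, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)                 # project onto 2 discriminant axes

print(X_lda.shape)                              # (150, 2)
print(lda.explained_variance_ratio_)            # share of between-class variance per axis
```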
Video Links
Standard deviation:

https://www.youtube.com/watch?v=MRqtXL2WX2M
Assignments
Q. No.   Question                                              CO Level   K Level
1 A radar system is designed such that the CO2 K4
probability of detecting the presence of an
aircraft in its range is 98%. However if no aircraft
is present in its range it still reports (falsely) that
an aircraft is present with a probability of 5%. At
any time, the probability that an aircraft is
present within the range of the radar is 7%.
a) What is the probability that no aircraft is
present in the range of the radar given that an
aircraft is detected?
b) What is the probability that an aircraft is
present in the range of the radar given that an
aircraft is detected
c) What is the probability that an aircraft is
present in the range of the radar given that no
aircraft is detected?
d) What is the probability that no aircraft is
present in the range of the radar given that no
aircraft is detected?
Part-A Questions and Answers
1. What Is Mean-Variance and Standard Deviation in Statistics?
Variance is the average of the squared differences between each value and the mean μ, where N is the total number of elements or the frequency of distribution. Standard deviation is the square root of the variance. It is a measure of the extent to which data varies from the mean.
2. Why Do We Use Standard Deviation and Variance?
Standard deviation looks at how spread out a group of numbers is
from the mean, by looking at the square root of the variance. The variance
measures the average degree to which each point differs from the mean—the
average of all data points.
3. Why Is Variance Important?
Variance is important for two main reasons: parametric statistical tests are sensitive to variance, and the variances of samples are used to assess whether the populations they come from differ from each other.
5. What is Dimensionality Reduction?
Dimensionality Reduction refers to reducing dimensions or features
so that we can get a more interpretable model, and improves the performance
of the model.
6.What is Principal Component Analysis?
• Principal Component Analysis is a well-known dimension reduction technique.
• It transforms the variables into a new set of variables called as principal
components.
• These principal components are linear combination of original variables and
are orthogonal.
• The first principal component accounts for most of the possible variation of
original data.
• The second principal component does its best to capture the remaining variance in
the data.
• There can be only two principal components for a two-dimensional data set.
7.Write down the steps involved in PCA Algorithm?
The steps involved in PCA Algorithm are as follows-
Step-01: Get data.
Step-02: Compute the mean vector (µ).
Step-03: Subtract mean from the given data.
Step-04: Calculate the covariance matrix.
Step-05: Calculate the eigen vectors and eigen values of the covariance matrix.
Step-06: Choosing components and forming a feature vector.
Step-07: Deriving the new data set.
8. What are the benefits of dimension reduction?
Dimension reduction offers several benefits such as-
• It compresses the data and thus reduces the storage space requirements.
• It reduces the time required for computation since less dimensions require
less computation.
• It eliminates the redundant features.
• It improves the model performance.
9. What is Linear Discriminant Analysis (LDA)?
LDA is a type of Linear combination, a mathematical process
using various data items and applying a function to that site to
separately analyze multiple classes of objects or items.
10. List Commonly used multivariate analysis technique ?
• Factor Analysis
• Cluster Analysis
• Variance Analysis
• Discriminant Analysis
• Multidimensional Scaling
• Principal Component Analysis
• Redundancy Analysis
11. In what ways can univariate analysis be conducted?
•Frequency Distribution Tables
•Histograms
•Frequency Polygons
•Pie Charts
•Bar Charts
12.What is a Conditional Probability?
Sometimes an event or an outcome occurs based on previous
occurrences of events or outcomes, this is known as conditional
probability. We can calculate the conditional probability if we multiply the
probability of the preceding event by the updated probability of the
succeeding, or conditional, event.
13. What are independent events?
Independent events are the ones that take place without getting
influenced by the probability of other events. These are the events whose
probability of happening totally depends upon themselves. Taking the example
of tossing two coins, if a person tosses both the coins at the same time, the
probability of getting heads or tails on one coin does not depend upon the
other. This phenomenon under Bayes' Theorem is regarded as an independent
event.
14. What is Percentile Formula?
P = (n/N) × 100
Where,
n = ordinal rank of the given value or value below the number
N = number of values in the data set
P = percentile
Or
Percentile = (Number of Values Below “x” / Total Number of Values) × 100
15. The scores obtained by 10 students are 38, 47, 49, 58, 60, 65, 70,
79, 80, 92. Using the percentile formula, calculate the percentile for
score 70?
Given:
Scores obtained by students are 38, 47, 49, 58, 60, 65, 70, 79, 80, 92
Number of scores below 70 = 6
Using percentile formula,
Percentile = (Number of Values Below “x” / Total Number of Values) × 100
Percentile of 70
= (6/10) × 100
= 0.6 × 100 = 60
Therefore, the percentile for score 70 = 60
16. What are the 4 moments in statistics?
The shape of any distribution can be described by its various
‘moments’. The first four are:
1) The mean, which indicates the central tendency of a distribution.
2) The second moment is the variance, which indicates the width or deviation.
3) The third moment is the skewness, which indicates any asymmetric ‘leaning’
to either left or right.
4) The fourth moment is the Kurtosis, which indicates the degree of central
‘peakedness’ or, equivalently, the ‘fatness’ of the outer tails.

17. List of Probability and Statistics Symbols

Symbol      Symbol Name      Meaning / definition
cov(X,Y)    covariance       covariance of random variables X and Y
ρX,Y        correlation      correlation of random variables X and Y
Mo          mode             value that occurs most frequently in the population
Md          sample median    half the population is below this value

18. What are the two types of covariance?


Types of Covariance:
• Positive Covariance.
• Negative Covariance.

Part-B Questions
Q. No.   Questions                                             CO Level   K Level
1 Practical Implementation of Principal CO2 K2
Component Analysis (PCA).
2 Let X be a continuous random variable with the CO2 K2
PDF given by:
\( f(x) = \begin{cases} x, & 0 < x < 1 \\ 2 - x, & 1 < x < 2 \\ 0, & x > 2 \end{cases} \)
Find P(0.5 < x < 1.5).
3 Explain the Univariate, Bivariate and Multivariate CO2 K2
statistical analysis.
4 Explain Percentiles and Moments CO2 K2

5 PROBLEMS BASED ON PRINCIPAL COMPONENT CO2 K2


ANALYSIS-
Given data = { 2, 3, 4, 5, 6, 7 ; 1, 5, 3, 6, 7, 8 }.
Compute the principal component using PCA
Algorithm.
6 Practical Implementation of Linear Discriminant CO2 K2
Analysis (LDA).
7 Explain in detail the Bayes Theorem. CO2 K2

8 Consider the two dimensional patterns (2, 1), (3, CO2 K3


5), (4, 3), (5, 6), (6, 7), (7, 8).
Compute the principal component using PCA
Algorithm.
Supportive Online Courses

Sl. No.   Courses                     Platform
1 Big Data Computing Swayam

2 Python for Data Science Swayam


Real Time Applications in Day
to Day life and to Industry

Sl. No. Questions

1
Explain in detail the case study of medical Diagnosis using Bayesian
Theorem.
Mini Project Suggestions
Sl. Questions Platform
No.
1 Implementation of Bayes theorem with the real time R
application. Programming
(Or)
Python
Text & Reference Books

Sl. No.   Book Name & Author                                                    Book
1 EMC Education Services, "Data Science and Big Data Analytics: Text Book
Discovering, Analyzing, Visualizing and Presenting Data", Wiley
publishers, 2015.
2 Anand Rajaraman and Jeffrey David Ullman, "Mining of Massive Text Book
Datasets", Cambridge University Press, 2012.

3 An Introduction to Statistical Learning: with Applications in R Text Book


(Springer Texts in Statistics) Hardcover – 2017

4 Dietmar Jannach and Markus Zanker, "Recommender Systems: Reference


An Introduction", Cambridge University Press, 2010. Book

5 Kim H. Pries and Robert Dunnigan, "Big Data Analytics: A Reference


Practical Guide for Managers " CRC Press, 2015. Book

6 Jimmy Lin and Chris Dyer, "Data-Intensive Text Processing with Reference
MapReduce", Synthesis Lectures on Human Language Book
Technologies, Vol. 3, No. 1, Pages 1-177, Morgan Claypool
publishers, 2010.
Thank you

Disclaimer:

This document is confidential and intended solely for the educational purpose of RMK Group of
Educational Institutions. If you have received this document through email in error, please notify the
system manager. This document contains proprietary information and is intended only to the
respective group / learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender immediately by e-mail if you
have received this document by mistake and delete this document from your system. If you are not
the intended recipient you are notified that disclosing, copying, distributing or taking any action in
reliance on the contents of this information is strictly prohibited.
