20IT503 - Big Data Analytics - Unit 2
This document is confidential and intended solely for the educational purpose of
RMK Group of Educational Institutions. If you have received this document
through email in error, please notify the system manager. This document
contains proprietary information and is intended only to the respective group /
learning community as intended. If you are not the addressee you should not
disseminate, distribute or copy through e-mail. Please notify the sender
immediately by e-mail if you have received this document by mistake and delete
this document from your system. If you are not the intended recipient you are
notified that disclosing, copying, distributing or taking any action in reliance on
the contents of this information is strictly prohibited.
CONTENTS
1. Table of Contents
2. Course Objectives
10. Reference Videos
11. Assignments
12. Part A (Q & A)
13. Part B Questions
CO-PO/PSO Mapping
CO1: 2 3 3 3 3 1 1 - 1 2 1 1 2 2 2
CO2: 2 3 2 3 3 1 1 - 1 2 1 1 2 2 2
CO3: 2 3 2 3 3 1 1 - 1 2 1 1 2 2 2
CO4: 2 3 2 3 3 1 1 - 1 2 1 1 2 2 2
CO5: 2 3 2 3 3 1 1 - 1 2 1 1 1 1 1
Lecture Plan
UNIT II DESCRIPTIVE ANALYTICS USING STATISTICS 9
Mean, Median and Mode – Standard Deviation and Variance – Probability – Probability
Density Function – Percentiles and Moments – Correlation and Covariance – Conditional
Probability – Bayes’ Theorem – Introduction to Univariate, Bivariate and Multivariate
Analysis – Dimensionality Reduction using Principal Component Analysis (PCA) and
LDA.
Session No. | Topics to be covered | Mode of delivery | Reference
1 | Mean, Median and Mode | Chalk & board | 2
2 | Standard Deviation and Variance | Chalk & board | 2
2. Descriptive Analytics Using Statistics
Types of Data:
The data available for analysis falls into two main categories: Quantitative and Qualitative.
Qualitative data: Non-numerical values such as colour, yes or no responses, etc.
Quantitative data: Numerical values. There are two types of quantitative data:
1. Discrete: Discrete data has values that cannot be broken down into fractions, such as the numbers on dice, the number of students in a class, etc.
2. Continuous: Continuous data can take any value within a range, such as height or weight.
Based on the scale of measurement, data can be further classified into Nominal, Ordinal, Interval, and Ratio data.
Nominal data: Categorical data such as gender, religion, etc.
Ordinal data: A measure of rank or order, such as exam grades, position in a company, etc.
Data tend to accumulate around the average value of the total data under consideration. Measures of central tendency help us to find the middle, or the average, of a data set. If most of the data is centrally located and there is a very small spread, it will form a symmetric bell curve. In such conditions the values of mean, median and mode are equal.
Mean: It is the arithmetic average of the data set, obtained by adding all the values and dividing by the number of values.
Median: It is the centrally located value of the data set sorted in ascending
order. Consider 11 (ODD) values 1,2,3,7,8,3,2,5,4,15,16. We first sort the values
in ascending order 1,2,2,3,3,4,5,7,8,15,16 then the median is 4 which is located
at the 6th number and will have 5 numbers on either side.
If the data set has an even number of values, then the median is found by taking the average of the two middle values. Consider 10 (EVEN) values 1,2,3,7,8,3,2,5,4,15. We first sort the values in ascending order: 1,2,2,3,3,4,5,7,8,15. The median is (3+4)/2 = 3.5, which is the average of the two middle values, i.e. the values located at the 5th and 6th positions in the sequence, with 4 numbers on either side.
Mode: It is the most frequent value in the data set. We can easily get the mode by counting the frequency of occurrence of each value. Consider a data set with the values 1,5,5,6,8,2,6,6. In this data set, 6 occurs most often (three times), so the mode is 6.
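These calculations can be checked with a short Python sketch using the standard library's statistics module; the data values are the ones from the worked examples above.

# A minimal sketch verifying the central-tendency examples in the text.
import statistics

odd_values = [1, 2, 3, 7, 8, 3, 2, 5, 4, 15, 16]
even_values = [1, 2, 3, 7, 8, 3, 2, 5, 4, 15]
mode_values = [1, 5, 5, 6, 8, 2, 6, 6]

print(statistics.median(odd_values))   # 4    (middle of the 11 sorted values)
print(statistics.median(even_values))  # 3.5  (average of the two middle values)
print(statistics.mode(mode_values))    # 6    (most frequent value)
print(statistics.mean(mode_values))    # 4.875 (arithmetic mean of the same list)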
We can say that the mean is dragged in the direction of the skew. In a right-skewed distribution, mode < median < mean. The more skewed the distribution, the greater the difference between the median and the mean, so the median is preferred when drawing conclusions. The best example of a right-skewed distribution is the salaries of employees, where high earners give a misleading picture of the typical income if it is expressed as the mean salary rather than the median salary.
For a left-skewed distribution, mean < median < mode. In such a case also, we emphasize the median value of the distribution.
2.2 Standard Deviation and Variance:
Variance:
To understand what variance means, let's first take a look at a data set containing a list of 10 salaries. Notice that most of the values are concentrated around 15,000 to 35,000, but there is an extreme value (an outlier) of 200,000 that pushes the mean up to 40,500 and stretches the range to 185,000.
Now, going back to the concept introduced earlier, let's calculate the variance. We add together the squares of the differences of each point from the mean, and then divide by the number of values in the set:
σ² = Σ(x − μ)² / n
where x represents each term in the set, μ is the mean, and n is the number of terms in the set.
There is also a quicker computational form of the variance:
σ² = (Σx²) / n − μ²
So now we have the value of the variance, but you might have noticed that it is extremely large. That is because the variance is measured in squared units of the original values.
Although the variance is widely used, there is another statistical
concept that is more intuitive when we are measuring the variability of data
around the mean. That is the Standard Deviation.
Standard Deviation
After identifying the variance, finding the standard deviation is pretty
straightforward. It’s the square root of the variance.
Remember that the symbol of the variance is σ²? The standard deviation is
represented by σ.
The standard deviation in our example works out to 53,500, which feels more connected to the values in our list. But what does it mean? It means that the salaries in our list are, on average, $53,500 away from the mean.
The closer the values are to the mean, the smaller the standard deviation. In our case, the value of the standard deviation is stretched because we have an outlier.
Now, just to see how the standard deviation changes, let's eliminate the outlier, leaving 9 values in our salary list.
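A minimal Python sketch of the population variance and standard deviation described above. The salary figures below are illustrative placeholders (the exact list from the original slide is not reproduced here), so the printed numbers will differ from those quoted in the text.

# Population variance and standard deviation, with and without an outlier.
def variance(values):
    mu = sum(values) / len(values)                       # population mean
    return sum((x - mu) ** 2 for x in values) / len(values)

def std_dev(values):
    return variance(values) ** 0.5

salaries_with_outlier = [15000, 18000, 20000, 22000, 25000,
                         28000, 30000, 33000, 35000, 200000]   # placeholder data
salaries_without_outlier = salaries_with_outlier[:-1]

print(std_dev(salaries_with_outlier))      # large, stretched by the outlier
print(std_dev(salaries_without_outlier))   # much smaller once the outlier is removed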
Probability Density Function:
For a continuous random variable X with probability density function f(x), the probability that X takes a value between a and b is
P(a < X < b) = ∫[a to b] f(x) dx
or, equivalently,
P(a ≤ X ≤ b) = ∫[a to b] f(x) dx
This is because, when X is continuous, we can ignore the
endpoints of intervals while finding probabilities of continuous random
variables. That means, for any constants a and b,
P(a ≤ X ≤ b) = P(a < X ≤ b) = P(a ≤ X < b) = P(a < X < b).
Probability Density Function Graph
The probability density function describes how probability is distributed over the values of a continuous random variable; probabilities are obtained by integrating it over a given range. It is denoted by f(x). This function is non-negative at every point, and the definite integral of the PDF over the entire space is always equal to one. The graph of a PDF for a continuous random variable x with density f(x) typically resembles a bell curve, with the probability of the outcomes given by the area below the curve.
Probability Density Function Properties
Let x be a continuous random variable with density function f(x). The probability density function should satisfy the following conditions:
For a continuous random variable that takes some value between certain limits, say a and b, the probability is calculated by finding the area between its curve and the X-axis within the lower limit (a) and upper limit (b). Thus, this probability is given by
P(a ≤ X ≤ b) = ∫[a to b] f(x) dx
• The probability density function is non-negative for all the possible values,
i.e. f(x)≥ 0, for all x.
• The area between the density curve and the horizontal X-axis is equal to 1, i.e.
∫[−∞ to ∞] f(x) dx = 1
Example: Let X be a continuous random variable with the PDF given by
f(x) = x, for 0 < x < 1
f(x) = 2 − x, for 1 < x < 2
f(x) = 0, otherwise
Find P(0.5 < x < 1.5).
Solution:
P(0.5 < X < 1.5) = ∫[0.5 to 1.5] f(x) dx
= ∫[0.5 to 1] x dx + ∫[1 to 1.5] (2 − x) dx
= [x²/2] from 0.5 to 1 + [2x − x²/2] from 1 to 1.5
= (0.5 − 0.125) + (1.875 − 1.5)
= 0.375 + 0.375 = 0.75
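The result can be checked numerically; a small sketch assuming SciPy is available (scipy.integrate.quad performs the definite integration):

# Numerical check of the worked example above.
from scipy.integrate import quad   # assumes SciPy is installed

def f(x):
    # Piecewise PDF from the example: x on (0, 1), 2 - x on (1, 2), 0 elsewhere.
    if 0 < x < 1:
        return x
    if 1 <= x < 2:
        return 2 - x
    return 0.0

p1, _ = quad(f, 0.5, 1.0)      # integral of x from 0.5 to 1
p2, _ = quad(f, 1.0, 1.5)      # integral of 2 - x from 1 to 1.5
print(round(p1 + p2, 4))       # 0.75, matching the hand calculation
total, _ = quad(f, 0.0, 2.0)
print(round(total, 4))         # 1.0: the PDF integrates to 1 over its support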
Percentiles:
Percentile Formula:
Percentiles for the values in a given data set can be calculated
using the formula:
n = (P/100) x N
where N = number of values in the data set, P = percentile, and n
= ordinal rank of a given value (with the values in the data set sorted from
smallest to largest). For example, take a class of 20 students that earned the
following scores on their most recent test: 75, 77, 78, 78, 80, 81, 81, 82, 83,
84, 84, 84, 85, 87, 87, 88, 88, 88, 89, 90. These scores can be represented as
a data set with 20 values: {75, 77, 78, 78, 80, 81, 81, 82, 83, 84, 84, 84, 85,
87, 87, 88, 88, 88, 89, 90}.
We can find the score that marks the 20th percentile by plugging
in known values into the formula and solving for n:
n = (20/100) x 20
n=4
The fourth value in the data set is the score 78. This means that
78 marks the 20th percentile; of the students in the class, 20 percent earned
a score of 78 or lower.
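A minimal sketch of the ordinal-rank lookup described above. It assumes the common convention of rounding the rank up to the next whole number when (P/100) × N is not an integer; the helper name percentile_value is illustrative only.

# Ordinal-rank percentile lookup following the formula n = (P/100) x N.
import math

def percentile_value(data, p):
    values = sorted(data)
    n = (p / 100) * len(values)
    # Round up when n is not a whole number (assumed convention).
    rank = int(n) if n == int(n) else math.ceil(n)
    return values[rank - 1]            # convert the 1-based rank to a 0-based index

scores = [75, 77, 78, 78, 80, 81, 81, 82, 83, 84,
          84, 84, 85, 87, 87, 88, 88, 88, 89, 90]
print(percentile_value(scores, 20))    # 78 -> the 4th value marks the 20th percentile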
Deciles and Common Percentiles:
Given a data set that has been ordered in increasing magnitude, the median, first quartile, and third quartile can be used to split the data into four pieces. The first quartile is the point below which one-fourth of the data lies. The median is located exactly in the middle of the data set, with half of all the data below it. The third quartile is the point below which three-fourths of the data lies.
The median, first quartile, and third quartile can all be stated in
terms of percentiles. Since half of the data is less than the median, and one-
half is equal to 50 percent, the median marks the 50th percentile. One-fourth
is equal to 25 percent, so the first quartile marks the 25th percentile. The
third quartile marks the 75th percentile.
Besides quartiles, a fairly common way to arrange a set of data is
by deciles. Each decile includes 10 percent of the data set. This means that
the first decile is the 10th percentile, the second decile is the 20th percentile,
etc. Deciles provide a way to split a data set into more pieces than quartiles
without splitting the set into 100 pieces as with percentiles.
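For comparison, NumPy can compute quartiles and deciles directly; note that np.percentile interpolates between data points by default, so its results can differ slightly from the ordinal-rank convention used above. A brief sketch with the same test scores:

# Quartiles and deciles with NumPy (interpolated percentiles).
import numpy as np

scores = np.array([75, 77, 78, 78, 80, 81, 81, 82, 83, 84,
                   84, 84, 85, 87, 87, 88, 88, 88, 89, 90])

q1, median, q3 = np.percentile(scores, [25, 50, 75])   # quartiles
deciles = np.percentile(scores, np.arange(10, 100, 10))  # 10th, 20th, ..., 90th percentiles
print(q1, median, q3)
print(deciles)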
Applications of Percentiles
Percentile scores have a variety of uses. Anytime that a set of data
needs to be broken into digestible chunks, percentiles are helpful. They are
often used to interpret test scores—such as SAT scores—so that test-takers
can compare their performance to that of other students. For example, a
student might earn a score of 90 percent on an exam. That sounds pretty
impressive; however, it becomes less so when a score of 90 percent
corresponds to the 20th percentile, meaning only 20 percent of the class
earned a score of 90 percent or lower.
Moments:
Moments in mathematical statistics involve a basic
calculation. These calculations can be used to find a probability distribution's
mean, variance, and skewness.
Suppose that we have a set of data with a total of n discrete points. One important calculation, which actually produces several numbers, is called the s-th moment. The s-th moment of the data set with values x1, x2, x3, ..., xn is given by the formula:
(x1^s + x2^s + x3^s + ... + xn^s)/n
Using this formula requires us to be careful with the order of operations: raise each value to the power s first, then add, then divide this sum by n, the total number of data values.
First Moment:
For the first moment, we set s = 1. The formula for the first moment is thus:
(x1 + x2 + x3 + ... + xn)/n
This is identical to the formula for the sample mean.
The first moment of the values 1, 3, 6, 10 is (1 + 3 + 6 + 10)/4 = 20/4 = 5.
Second Moment:
For the second moment we set s = 2. The formula for the second moment is:
(x1^2 + x2^2 + x3^2 + ... + xn^2)/n
The second moment of the values 1, 3, 6, 10 is (1^2 + 3^2 + 6^2 + 10^2)/4 = (1 + 9 + 36 + 100)/4 = 146/4 = 36.5.
Third Moment:
For the third moment we set s = 3. The formula for the third moment is:
(x1^3 + x2^3 + x3^3 + ... + xn^3)/n
The third moment of the values 1, 3, 6, 10 is (1^3 + 3^3 + 6^3 + 10^3)/4 = (1 + 27 + 216 + 1000)/4 = 1244/4 = 311.
Higher moments can be calculated in a similar way. Just
replace s in the above formula with the number denoting the desired moment.
Moments About the Mean
A related idea is that of the s-th moment about the mean. In this calculation we perform the following steps:
1. First, calculate the mean of the values.
2. Next, subtract this mean from each value.
3. Then raise each of these differences to the s-th power.
4. Now add the numbers from step 3 together.
5. Finally, divide this sum by the number of values we started with.
The formula for the s-th moment about the mean m of the values x1, x2, x3, ..., xn is given by:
m_s = ((x1 - m)^s + (x2 - m)^s + (x3 - m)^s + ... + (xn - m)^s)/n
First Moment About the Mean
The first moment about the mean is always equal to zero, no
matter what the data set is that we are working with. This can be seen in the
following:
m_1 = ((x1 - m) + (x2 - m) + (x3 - m) + ... + (xn - m))/n = ((x1 + x2 + x3 + ... + xn) - nm)/n = m - m = 0.
Second Moment About the Mean
The second moment about the mean is obtained from the above formula by setting s = 2:
m_2 = ((x1 - m)^2 + (x2 - m)^2 + (x3 - m)^2 + ... + (xn - m)^2)/n
This formula is equivalent to that for the sample variance.
For example, consider the set 1, 3, 6, 10. We have already calculated the mean of this set to be 5. Subtract this from each of the data values to obtain the differences:
1 - 5 = -4
3 - 5 = -2
6 - 5 = 1
10 - 5 = 5
We square each of these values and add them together: (-4)^2 + (-2)^2 + 1^2 + 5^2 = 16 + 4 + 1 + 25 = 46. Finally, divide this number by the number of data points: 46/4 = 11.5.
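A minimal sketch that computes raw moments and moments about the mean, checked against the 1, 3, 6, 10 example used above (the function names are illustrative):

# Raw moments and moments about the mean for a small data set.
def raw_moment(values, s):
    return sum(x ** s for x in values) / len(values)

def central_moment(values, s):
    m = raw_moment(values, 1)                     # the mean is the first raw moment
    return sum((x - m) ** s for x in values) / len(values)

data = [1, 3, 6, 10]
print(raw_moment(data, 1))      # 5.0    (mean)
print(raw_moment(data, 2))      # 36.5   (second raw moment)
print(raw_moment(data, 3))      # 311.0  (third raw moment)
print(central_moment(data, 1))  # 0.0    (always zero)
print(central_moment(data, 2))  # 11.5   (second moment about the mean)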
Applications of Moments
As mentioned above, the first moment is the mean and the second
moment about the mean is the sample variance. Karl Pearson introduced the
use of the third moment about the mean in calculating skewness and the
fourth moment about the mean in the calculation of kurtosis.
2.6 Covariance and Correlation
Example I
John flies frequently and likes to upgrade his seat to first class. He has
determined that if he checks in for his flight at least two hours early, the
probability that he will get an upgrade is 0.75; otherwise, the
probability that he will get an upgrade is
0.35. With his busy schedule, he checks in at least two hours before
his flight only
40% of the time. Suppose John did not receive an upgrade on his most
recent attempt. What is the probability that he did not arrive two hours
early?
Let C = {John arrived at least two hours early} and A = {John received an upgrade}. From the problem description, P(C) = 0.4, P(¬C) = 0.6, and P(A|C) = 0.75. The probability that John received an upgrade given that he did not arrive two hours early is 0.35, or P(A|¬C) = 0.35. Therefore, P(¬A|¬C) = 0.65 and P(¬A|C) = 0.25.
By Bayes' theorem, the probability that John did not arrive two hours early, given that he did not receive an upgrade, is
P(¬C|¬A) = P(¬A|¬C)P(¬C) / (P(¬A|¬C)P(¬C) + P(¬A|C)P(C)) = (0.65 × 0.6) / (0.65 × 0.6 + 0.25 × 0.4) = 0.39 / 0.49 ≈ 0.796.
Example II
Another example involves computing the probability that a patient carries a
disease based on the result of a lab test. Assume that a patient named Mary
took a lab test for a certain disease and the result came back positive. The
test returns a positive result in 95% of the cases in which the disease is
actually present, and it returns a positive result in 6% of the cases in which
the disease is not present. Furthermore, 1% of the entire population has
this disease. What is the probability that Mary actually has the disease,
given that the test is positive?
Let C = {having the disease} and A = {testing positive}.
The goal is to solve the probability of having the disease, given that Mary has
a positive test result, P(C|A).
From the problem description, P(C)=0.01, P(¬C)=0.99, P(A|C)=0.95
and P(A|¬C)=0.06.
According to Bayes' theorem, the probability of having the disease, given that Mary has a positive test result, is
P(C|A) = P(A|C)P(C) / (P(A|C)P(C) + P(A|¬C)P(¬C)) = (0.95 × 0.01) / (0.95 × 0.01 + 0.06 × 0.99) = 0.0095 / 0.0689 ≈ 0.1379.
That means that the probability of Mary actually having the disease
given a positive test result is only 13.79%. This result indicates that the lab test
may not be a good one. The likelihood of having the disease was 1% when the
patient walked in the door and only 13.79% when the patient walked out, which
would suggest further tests.
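A minimal sketch of this Bayes' theorem calculation in Python; the function name bayes is illustrative only.

# Bayes' theorem for the lab-test example above.
def bayes(prior, likelihood, false_positive_rate):
    # P(C|A) = P(A|C)P(C) / (P(A|C)P(C) + P(A|not C)P(not C))
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

p_disease_given_positive = bayes(prior=0.01, likelihood=0.95,
                                 false_positive_rate=0.06)
print(round(p_disease_given_positive, 4))   # 0.1379, i.e. about 13.79%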
2.9 Univariate, Bivariate and Multivariate statistical analysis
1. Univariate Analysis:
This type of data consists of only one variable. Univariate analysis is the simplest form of analysis: it does not deal with causes or relationships, and its main purpose is to describe the data and find the patterns that exist within it. An example of univariate data is height. Suppose that the heights of seven students in a class are recorded; there is only one variable, height, and it is not related to any cause or relationship. The description of patterns found in this type of data can be made by drawing conclusions using central tendency measures (mean, median and mode), dispersion or spread of data (range, minimum, maximum, quartiles, variance and standard deviation) and by using frequency distribution tables, histograms, pie charts, frequency polygons and bar charts.
2. Bivariate Analysis:
This type of data involves two different variables. The analysis of this type of data deals with causes and relationships, and the analysis is done to find the relationship between the two variables. An example of bivariate data is temperature and ice cream sales in the summer season.
Suppose the temperature and ice cream sales are the two variables of a bivariate data set. Here, the relationship is that temperature and sales are directly proportional to each other: as the temperature increases, the sales also increase. Thus bivariate data analysis involves comparisons, relationships, causes and explanations. These variables are often plotted on the X and Y axes of a graph for a better understanding of the data, and one of these variables is independent while the other is dependent.
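A minimal sketch of this bivariate idea using NumPy's covariance and correlation routines; the temperature and sales figures are invented purely for illustration.

# Covariance and correlation between two variables (hypothetical data).
import numpy as np

temperature = np.array([26, 28, 30, 32, 34, 36])      # hypothetical degrees Celsius
sales = np.array([210, 240, 300, 330, 390, 420])      # hypothetical units sold

cov_matrix = np.cov(temperature, sales)                # 2 x 2 covariance matrix
corr_matrix = np.corrcoef(temperature, sales)          # 2 x 2 correlation matrix

print(cov_matrix[0, 1])    # covariance between temperature and sales
print(corr_matrix[0, 1])   # correlation close to +1: the variables rise together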
3. Multivariate Analysis:
When the data involves three or more variables, it is categorized as multivariate. For example, suppose an advertiser wants to compare the popularity of four advertisements on a website; the click rates could be measured for both men and women, and the relationships between these variables can then be examined.
It is similar to bivariate but contains more than one dependent
variable. The ways to perform analysis on this data depends on the goals to be
achieved. Some of the techniques are regression analysis, path analysis, factor
analysis and multivariate analysis of variance (MANOVA).
2.10 Dimensionality Reduction:
The higher the number of features, the harder it gets to visualize the
training set and then work on it. Sometimes, most of these features are
correlated, and hence redundant. This is where dimensionality reduction
algorithms come into play.
Dimensionality reduction is the process of reducing the number of
random variables under consideration, by obtaining a set of principal variables.
It can be divided into feature selection and feature extraction.
There are two components of dimensionality reduction:
Feature selection: In this, we try to find a subset of the original set of
variables, or features, to get a smaller subset which can be used to model the
problem. It usually involves three ways:
Filter
Wrapper
Embedded
Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional space, i.e. a space with a smaller number of dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include:
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be either linear or non-linear, depending upon the method used. The prime linear method is Principal Component Analysis, or PCA.
Principal Component Analysis
Principal Component Analysis, or PCA, is a dimensionality-reduction
method that is often used to reduce the dimensionality of large data sets, by
transforming a large set of variables into a smaller one that still contains most
of the information in the large set.
Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity. Smaller data sets are easier to explore and visualize, and they make analyzing data much easier and faster for machine learning algorithms because there are no extraneous variables to process.
So to sum up, the idea of PCA is simple — reduce the number of variables of a
data set, while preserving as much information as possible
Step by Step Explanation of PCA
STEP 1: STANDARDIZATION
The aim of this step is to standardize the range of the continuous
initial variables so that each one of them contributes equally to the analysis.
More specifically, the reason why it is critical to perform standardization prior to
PCA, is that the latter is quite sensitive regarding the variances of the initial
variables. That is, if there are large differences between the ranges of initial
variables, those variables with larger ranges will dominate over those with small
ranges (For example, a variable that ranges between 0 and 100 will dominate
over a variable that ranges between 0 and 1), which will lead to biased results.
So, transforming the data to comparable scales can prevent this problem.
Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value of each variable:
z = (value − mean) / standard deviation
Once the standardization is done, all the variables will be transformed to the same scale.
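A minimal sketch of this standardization (z-score) step with NumPy; the data array is hypothetical.

# Standardize each variable (column) to zero mean and unit standard deviation.
import numpy as np

X = np.array([[170.0, 0.2],
              [160.0, 0.8],
              [180.0, 0.5],
              [175.0, 0.9]])      # hypothetical data: two variables on very different scales

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # subtract mean, divide by std, per column
print(X_std.mean(axis=0))   # approximately 0 for every variable
print(X_std.std(axis=0))    # 1 for every variable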
STEP 2: COVARIANCE MATRIX COMPUTATION
The aim of this step is to understand how the variables of the input
data set are varying from the mean with respect to each other, or in other
words, to see if there is any relationship between them. Because sometimes,
variables are highly correlated in such a way that they contain redundant
information. So, in order to identify these correlations, we compute the
covariance matrix.
The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3 × 3 matrix of this form:
| Cov(x,x)  Cov(x,y)  Cov(x,z) |
| Cov(y,x)  Cov(y,y)  Cov(y,z) |
| Cov(z,x)  Cov(z,y)  Cov(z,z) |
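Continuing the same sketch, the covariance matrix of the standardized data can be computed and the remaining standard PCA steps (eigen-decomposition, sorting by eigenvalue, and projection) carried out; the three-variable data set below is hypothetical.

# Covariance matrix and eigen-decomposition for PCA (hypothetical data).
import numpy as np

X = np.array([[170.0, 65.0, 0.2],
              [160.0, 58.0, 0.8],
              [180.0, 75.0, 0.5],
              [175.0, 70.0, 0.9]])                  # hypothetical 3-variable data set

X_std = (X - X.mean(axis=0)) / X.std(axis=0)        # Step 1: standardization
cov = np.cov(X_std, rowvar=False)                   # Step 2: 3 x 3 covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)              # eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]                   # sort components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

X_pca = X_std @ eigvecs[:, :2]                      # project onto the top 2 components
print(eigvals / eigvals.sum())                      # proportion of variance explained
print(X_pca)                                        # the transformed (reduced) data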
Reference Videos
https://www.youtube.com/watch?v=MRqtXL2WX2M
Assignments
Q. No. | Question | CO | K Level
1. A radar system is designed such that the probability of detecting the presence of an aircraft in its range is 98%. However, if no aircraft is present in its range, it still reports (falsely) that an aircraft is present with a probability of 5%. At any time, the probability that an aircraft is present within the range of the radar is 7%. (CO2, K4)
a) What is the probability that no aircraft is present in the range of the radar, given that an aircraft is detected?
b) What is the probability that an aircraft is present in the range of the radar, given that an aircraft is detected?
c) What is the probability that an aircraft is present in the range of the radar, given that no aircraft is detected?
d) What is the probability that no aircraft is present in the range of the radar, given that no aircraft is detected?
Part-A Questions and Answers
1. What Is Mean, Variance and Standard Deviation in Statistics?
Variance is the sum of the squared differences between each value and the mean, divided by the total number of values: σ² = Σ(x − μ)² / N, where μ is the mean and N is the total number of elements or frequency of the distribution. Standard Deviation is the square root of the variance. It is a measure of the extent to which data varies from the mean.
2. Why Do We Use Standard Deviation and Variance?
Standard deviation looks at how spread out a group of numbers is
from the mean, by looking at the square root of the variance. The variance
measures the average degree to which each point differs from the mean—the
average of all data points.
3. Why Is Variance Important?
Variance is important for two main reasons: parametric statistical tests are sensitive to variance, so it must be considered when using them; and comparing the variances of samples helps assess whether the populations they come from differ from each other.
4. What Is Mean, Variance and Standard Deviation in Statistics?
Variance is the sum of the squared differences between each value and the mean, divided by the total number of values: σ² = Σ(x − μ)² / N, where μ is the mean and N is the total number of elements or frequency of the distribution. Standard Deviation is the square root of the variance. It is a measure of the extent to which data varies from the mean.
5. What is Dimensionality Reduction?
Dimensionality Reduction refers to reducing the number of dimensions or features so that we get a more interpretable model and improved model performance.
6.What is Principal Component Analysis?
• Principal Component Analysis is a well-known dimension reduction technique.
• It transforms the variables into a new set of variables called as principal
components.
• These principal components are linear combination of original variables and
are orthogonal.
• The first principal component accounts for most of the possible variation of
original data.
• The second principal component does its best to capture the remaining variance in the data.
• There can be only two principal components for a two-dimensional data set.
7.Write down the steps involved in PCA Algorithm?
The steps involved in the PCA Algorithm are as follows (a short library-based sketch is given after the steps):
Step-01: Get data.
Step-02: Compute the mean vector (µ).
Step-03: Subtract mean from the given data.
Step-04: Calculate the covariance matrix.
Step-05: Calculate the eigen vectors and eigen values of the covariance matrix.
Step-06: Choosing components and forming a feature vector.
Step-07: Deriving the new data set.
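For comparison, a brief sketch of the same pipeline using scikit-learn, assuming the library is installed; the data array is hypothetical.

# PCA via scikit-learn (assumed installed); compare with the manual steps above.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.array([[170.0, 65.0, 0.2],
              [160.0, 58.0, 0.8],
              [180.0, 75.0, 0.5],
              [175.0, 70.0, 0.9]])          # hypothetical data

X_std = StandardScaler().fit_transform(X)   # mean-centre and scale each variable
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)            # covariance, eigenvectors, projection
print(pca.explained_variance_ratio_)        # variance explained by each component
print(X_pca)                                # the reduced data set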
8. What are the benefits of Dimensionality Reduction?
Dimension reduction offers several benefits such as-
• It compresses the data and thus reduces the storage space requirements.
• It reduces the time required for computation since less dimensions require
less computation.
• It eliminates the redundant features.
• It improves the model performance.
9. What is Linear Discriminant Analysis (LDA)?
LDA is a technique based on linear combinations, a mathematical process that uses various data items and applies a function to that set to separately analyze multiple classes of objects or items.
10. List Commonly used multivariate analysis technique ?
• Factor Analysis
• Cluster Analysis
• Variance Analysis
• Discriminant Analysis
• Multidimensional Scaling
• Principal Component Analysis
• Redundancy Analysis
11. In what ways can univariate analysis be conducted?
•Frequency Distribution Tables
•Histograms
•Frequency Polygons
•Pie Charts
•Bar Charts
12.What is a Conditional Probability?
Sometimes an event or an outcome occurs based on previous occurrences of events or outcomes; this is known as conditional probability. The conditional probability of an event B given an event A is P(B|A) = P(A and B) / P(A); equivalently, the probability of both events occurring is the probability of the preceding event multiplied by the conditional probability of the succeeding event.
13. What are independent events?
Independent events are the ones that take place without getting
influenced by the probability of other events. These are the events whose
probability of happening totally depends upon themselves. Taking the example
of tossing two coins, if a person tosses both the coins at the same time, the
probability of getting heads or tails on one coin does not depend upon the
other. This phenomenon under Bayes' Theorem is regarded as an independent
event.
14. What is Percentile Formula?
P = (n/N) × 100
Where,
n = ordinal rank of the given value or value below the number
N = number of values in the data set
P = percentile
Or
Percentile = (Number of Values Below “x” / Total Number of Values) × 100
15. The scores obtained by 10 students are 38, 47, 49, 58, 60, 65, 70,
79, 80, 92. Using the percentile formula, calculate the percentile for
score 70?
Given:
Scores obtained by students are 38, 47, 49, 58, 60, 65, 70, 79, 80, 92
Number of scores below 70 = 6
Using percentile formula,
Percentile = (Number of Values Below “x” / Total Number of Values) × 100
Percentile of 70
= (6/10) × 100
= 0.6 × 100 = 60
Therefore, the percentile for score 70 = 60
16. What are the 4 moments in statistics?
The shape of any distribution can be described by its various
‘moments’. The first four are:
1) The mean, which indicates the central tendency of a distribution.
2) The second moment is the variance, which indicates the width or deviation.
3) The third moment is the skewness, which indicates any asymmetric ‘leaning’
to either left or right.
4) The fourth moment is the Kurtosis, which indicates the degree of central
‘peakedness’ or, equivalently, the ‘fatness’ of the outer tails.
cov(X,Y): covariance of random variables X and Y
ρX,Y: correlation of random variables X and Y
Part-B Questions
Q. No. | Questions | CO | K Level
1. Practical implementation of Principal Component Analysis (PCA). (CO2, K2)
2. Let X be a continuous random variable with the PDF given by
f(x) = x, for 0 < x < 1
f(x) = 2 − x, for 1 < x < 2
f(x) = 0, otherwise
Find P(0.5 < x < 1.5). (CO2, K2)
3. Explain the Univariate, Bivariate and Multivariate statistical analysis. (CO2, K2)
4. Explain Percentiles and Moments. (CO2, K2)
1. Explain in detail the case study of medical diagnosis using Bayes' Theorem.
Mini Project Suggestions
Sl. No. | Question | Platform
1. Implementation of Bayes' theorem with a real-time application. (Platform: R Programming or Python)
Text & Reference Books
6. Jimmy Lin and Chris Dyer, "Data-Intensive Text Processing with MapReduce", Synthesis Lectures on Human Language Technologies, Vol. 3, No. 1, pp. 1-177, Morgan & Claypool Publishers, 2010. (Reference Book)
Thank you