
Statistics for Data Science

shyam prasad MNS


Statistics for Data Science

Contents
Statistics
Descriptive Statistics vs Inferential Statistics
Central Tendency
Mean
Median
Mode
Inter-quartile Range
Calculating Interquartile Range
Variance
Co-Variance
Standard Deviation
Correlation
Skewness and Kurtosis
Types of Distributions
Hypothesis Testing
Types in Hypothesis Testing
Errors in Hypothesis Testing
T-test
One-sample t-test
Two-sample t-test
Z-test
One-sample z-test
Two-sample z-test (two-tailed z-test)
ANOVA
Understanding ANOVA
Understanding P-Values
Parametric and Non-Parametric Tests
Chi-Square Test
Mann-Whitney U-Test
Recapping with questions and answers


Statistics

Statistics is a branch of applied mathematics concerned with the collection, organization, analysis, interpretation, and presentation of masses of numerical data, often represented by means of graphs.

For example, if we consider one live class to be a sample of the population of all live classes, then the average number of points earned by students in that one live class at the end of the term is an example of a statistic.

The two main branches of statistics are:

Descriptive statistics
- Uses the data to provide descriptions of the population.
- It analyses and summarizes the data in a meaningful way.
- Descriptive statistics is very important for presenting raw data in an effective/meaningful way.

Inferential Statistics
- Makes predictions and inferences about a population.
- Predictions are made from a group of data (a random sample) taken from the population we are interested in.


Central Tendency
Central tendency is a descriptive summary of a dataset through a single value that reflects the centre of the data distribution. Along with the variability of the dataset, central tendency is a branch of descriptive statistics.

The most common measures of central tendency are the mean, the median, and the mode. A measure of central tendency can be calculated for either a finite set of values or a theoretical distribution, such as the normal distribution.
Mean
The mean represents the average value of the dataset. It is calculated as the sum of all the values in the dataset divided by the number of values. In general, it is referred to as the arithmetic mean.

Formula
x̄ = Σfx/N
Where,
• x̄ = the mean value of the set of given data
• f = frequency of the individual data
• N = sum of frequencies

Ex: calculating the scores of three students picked randomly

(10 + 6 + 20)/3 = 36/3 = 12
Here 12 is the mean (average).

Median

The median tells you where the middle of a data set is.

For an odd number of values, the median is the ((n + 1)/2)th term. For an even number of values:

Median = ((n/2)th term + ((n/2) + 1)th term)/2


Similarly, we have a median formula for grouped data. The median formula for grouped data is given as:

Median = Lm + [(n/2 − F)/fm] × i
Where,
• n = the total frequency
• F = the cumulative frequency of the class before the median class
• fm = the frequency of the median class
• i = the class width
• Lm = the lower boundary of the median class

Ex: let's take 5 bus ticket prices as data (12, 23, 34, 45, 56)

• Data should be sorted before finding the median.
Here the median is 34, as it is the middle number.

Mode

The mode is the value that appears most often in a set of data values.

Mode formula (for grouped data) = L + h (fm − f1) / ((fm − f1) + (fm − f2))

Where,
• 'L' is the lower limit of the modal class
• 'h' is the size of the class interval
• 'fm' is the frequency of the modal class
• 'f1' is the frequency of the class preceding the modal class
• 'f2' is the frequency of the class succeeding the modal class

Ex: let's take 5 bus ticket prices as data (12, 34, 34, 45, 56)
Here the mode is 34 because it is the most repeated number.
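The three measures above can be checked with Python's built-in statistics module, using the same figures as in the examples:

```python
import statistics

scores = [10, 6, 20]            # scores of three randomly picked students
tickets = [12, 23, 34, 45, 56]  # sorted bus ticket prices
repeats = [12, 34, 34, 45, 56]  # ticket prices with a repeated value

print(statistics.mean(scores))     # 12
print(statistics.median(tickets))  # 34
print(statistics.mode(repeats))    # 34
```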

Inter-quartile Range

The interquartile range is the difference between the third and the first quartile. Quartiles are the partition values that divide the whole series into four equal parts, so there are 3 quartiles: the first quartile, denoted Q1, is known as the lower quartile; the second quartile is denoted Q2 (the median); and the third quartile, denoted Q3, is known as the upper quartile. Therefore, the interquartile range is equal to the upper quartile minus the lower quartile.

Formula for the interquartile range:
Interquartile range = Upper Quartile – Lower Quartile = Q3 – Q1
where Q1 is the first quartile and Q3 is the third quartile of the series


Calculating Interquartile Range

1. Arrange the given data in ascending (or descending) order.
2. Count the values. If the count is odd, the centre value is the median; if it is even, the median is the average of the two centre values. This is the Q2 value.
3. The median cuts the given values into two equal parts, which are used to find Q1 and Q3.
4. The median of the data values below the overall median represents Q1.
5. The median of the data values above the overall median represents Q3.
6. Finally, subtract Q1 from Q3 to obtain the interquartile range.

Getting to know the IQR graphically (as on a box plot, from top to bottom):

1. Max/High value
2. Upper Quartile / Q3 (75th percentile)
3. Median / Q2 (50th percentile)
4. Lower Quartile / Q1 (25th percentile)
5. Min/Low value

The Inter Quartile Range (IQR) is the span from Q1 up to Q3.
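The step-by-step method above can be sketched in plain Python. The seven-value series below is an illustrative example, not taken from the text:

```python
def iqr(data):
    """Interquartile range via the median-of-halves method described above."""
    xs = sorted(data)  # step 1: arrange the data
    n = len(xs)

    def median(vals):
        m = len(vals)
        mid = m // 2
        # odd count: centre value; even count: average of the two centre values
        return vals[mid] if m % 2 else (vals[mid - 1] + vals[mid]) / 2

    lower = xs[:n // 2]        # values below the overall median -> Q1
    upper = xs[(n + 1) // 2:]  # values above the overall median -> Q3
    return median(upper) - median(lower)

print(iqr([5, 7, 9, 11, 13, 15, 17]))  # Q1 = 7, Q3 = 15, so IQR = 8
```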

Variance:

In statistics, variance is the expectation of the squared deviation of a random


variable from its population mean or sample mean. Variance is a measure of dispersion,
meaning it is a measure of how far a set of numbers is spread out from their average
value.


Formula (population variance):

σ² = Σ(xi − x̄)² / n

Where,
xi = the ith data point
x̄ = the mean of all points
n = the number of data points

Co-Variance:

Covariance gives the relationship between any two variables. The covariance formula helps to assess the relationship between two variables; it is essentially a measure of the variance between two variables.

Covariance formula (sample covariance)

Cov(X, Y) = Σ(xi − x̄)(yi − ȳ) / (N − 1)

Where,
• xi = the values of the X-variable
• yi = the values of the Y-variable
• x̄ = the mean (average) of the X-variable
• ȳ = the mean (average) of the Y-variable
• N = the number of data points
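The sample covariance formula above translates directly into Python. The two short series below are illustrative:

```python
def sample_covariance(x, y):
    """Cov(X, Y) = sum((xi - mean_x)(yi - mean_y)) / (N - 1)"""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

x = [2, 4, 6, 8]
y = [1, 3, 5, 7]
print(sample_covariance(x, y))  # positive: the series move together
```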

Standard deviation

A standard deviation is a measure of how dispersed the data is in relation to the mean.
Low standard deviation means data are clustered around the mean, and high standard
deviation indicates data are more spread out.

The steps in calculating the standard deviation are as follows:

1. For each value, find its distance to the mean


2. For each value, find the square of this distance
3. Find the sum of these squared values
4. Divide the sum by the number of values in the data set
5. Find the square root of this


Let's understand standard deviation with an example.

Say we have five chairs whose heights are measured in mm.

The heights are: 600 mm, 470 mm, 170 mm, 430 mm and 300 mm.

Let's find the Mean, the Variance, and the Standard Deviation.

First, the Mean:

Mean = (600 + 470 + 170 + 430 + 300) / 5
= 1970 / 5
= 394

The mean is 394 mm (average chair height).

Now we calculate each difference from the mean:
206, 76, −224, 36 and −94.

To calculate the Variance, take each difference, square it, and then average the results:

Variance
σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5
= (42436 + 5776 + 50176 + 1296 + 8836) / 5
= 108520 / 5
= 21704

So the Variance is 21,704.

And the Standard Deviation is just the square root of the Variance, so:

Standard Deviation
σ = √21704
= 147.32...
≈ 147 (mm)
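The whole calculation can be reproduced in Python; statistics.pstdev computes the same population standard deviation:

```python
import statistics

heights = [600, 470, 170, 430, 300]  # chair heights in mm

mean = sum(heights) / len(heights)                               # 394.0
variance = sum((h - mean) ** 2 for h in heights) / len(heights)  # 21704.0
std_dev = variance ** 0.5                                        # 147.32...

print(mean, variance, round(std_dev))
print(statistics.pstdev(heights))  # same population standard deviation
```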

Correlation

Correlation is a statistical measure of the degree to which two variables move in relation to each other.
Correlation shows the strength of a relationship between two variables and is expressed numerically by the correlation coefficient. The correlation coefficient's values range between -1.0 and 1.0.
A perfect positive correlation means that the correlation coefficient is exactly 1. This implies that as one security moves, either up or down, the other security moves in lockstep, in the same direction. A perfect negative correlation means that two assets move in opposite directions, while a zero correlation implies no linear relationship at all.

Formula
ρxy = Cov(x, y) / (σx σy)
Where,
ρxy = Pearson product-moment correlation coefficient
Cov(x, y) = covariance of variables x and y
σx = standard deviation of x
σy = standard deviation of y
Types of correlation
1. Positive correlation
The more time you spend on the practice material given by skillovilla, the greater the chances of becoming a Data Scientist (both variables move in the same direction).
2. Negative correlation
The more practice sessions you skip, the lower the chances of becoming a Data Scientist (as one variable increases, the other decreases).
3. Neutral (zero) correlation
The time of day at which you read the practice material given by skillovilla has no bearing on the chances of becoming a Data Scientist (no relationship between the variables).

How correlation is different from causation
If two variables are correlated, it does not imply that one variable causes the changes in the other. Correlation only assesses the relationship between variables, and there may be other factors that lead to the relationship. Causation may be a reason for the correlation, but it is not the only possible explanation.
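As a sketch, the coefficient ρxy = Cov(x, y)/(σxσy) can be computed directly, using population standard deviations; the hours/score data below are illustrative:

```python
def pearson(x, y):
    """Pearson correlation: covariance divided by the product of std devs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

hours = [1, 2, 3, 4, 5]
score = [10, 20, 30, 40, 50]
print(pearson(hours, score))        # ~ 1.0 (perfect positive correlation)
print(pearson(hours, score[::-1]))  # ~ -1.0 (perfect negative correlation)
```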

Skewness and Kurtosis

Skewness

Skewness is a degree of asymmetry observed in a probability distribution: how far a given set of data deviates from the symmetrical normal distribution (bell curve).


1. Positively skewed or right-skewed

A positively skewed distribution has a long tail on the right-hand side. Unlike symmetrically distributed data, where all measures of central tendency (mean, median, and mode) equal each other, in a positively skewed distribution the mean is typically greater than the median.

2. Negatively skewed or left-skewed

In a negatively skewed distribution, the mean of the data is less than the median: a large number of data points are pushed to the right-hand side, leaving a long tail on the left. The skewness value of such a distribution is negative rather than positive or zero.


Kurtosis

Kurtosis is a statistical measure of whether the data is heavy-tailed or light-tailed relative to a normal distribution.

• Positive (excess) kurtosis: heavier tails than the normal distribution
• Neutral kurtosis: tails like the normal distribution (excess kurtosis of 0)
• Negative (excess) kurtosis: lighter tails than the normal distribution
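A minimal sketch of both measures using the usual moment-based definitions (the sample data are illustrative):

```python
def skewness(data):
    """Fisher-Pearson moment coefficient of skewness: m3 / m2^1.5."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n
    m3 = sum((x - m) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(data):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n
    m4 = sum((x - m) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3

right_tailed = [1, 2, 2, 3, 3, 3, 9]  # one large value makes a long right tail
print(skewness(right_tailed) > 0)      # True: positively skewed
print(excess_kurtosis([1, 2, 3, 4, 5]) < 0)  # True: light tails
```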

Types of Distributions

Bernoulli Distribution

A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial. So the random variable X which has a Bernoulli distribution can take value 1 with the probability of success, say p, and the value 0 with the probability of failure, say q = 1 − p.

The probability mass function is given by: p^x (1 − p)^(1−x) where x ∈ {0, 1}.

Uniform Distribution

When you roll a die, the outcomes are 1, 2, 3, 4, 5 and 6. The probabilities of getting these outcomes are equally likely, and that is the basis of a uniform distribution.

Binomial Distribution

Let's say we have tossed a coin. What are the possible outcomes?

There are only two possible outcomes: head denoting success and tail denoting failure. Therefore, the probability of getting a head is p = 0.5, and the probability of failure is easily computed as q = 1 − p = 0.5. The binomial distribution gives the probability of observing a given number of successes in n such independent trials.


Normal Distribution

Normal distribution represents the behaviour of most of the situations in the universe
(That is why it’s called a “normal” distribution. I guess!). The large sum of (small)
random variables often turns out to be normally distributed, contributing to its
widespread application.

Any distribution is known as a normal distribution if it has the following characteristics:

• The mean, median and mode of the distribution coincide.
• The curve of the distribution is bell-shaped and symmetrical about the centre line.
• Exactly half of the values are to the left of the centre and the other half to the right.

Poisson Distribution

Suppose you work at a call centre: how many calls do you get in a day? It can be any number. The total number of calls arriving at a call centre in a day is modelled by a Poisson distribution.

Exponential Distribution

Continuing the example above, the time gap between consecutive calls follows an Exponential distribution.

The exponential distribution is widely used for survival analysis, from the expected life of a machine to the expected life of a human.
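As a sketch (not from the original text), the distributions above can be simulated with Python's standard random module; the parameters below (p = 0.5, rate 2 calls per minute) are illustrative assumptions:

```python
import random

random.seed(0)  # reproducible results

def bernoulli(p):
    """Bernoulli trial: 1 (success) with probability p, else 0 (failure)."""
    return 1 if random.random() < p else 0

def binomial(n, p):
    """Binomial: number of successes in n independent Bernoulli trials."""
    return sum(bernoulli(p) for _ in range(n))

roll = random.randint(1, 6)          # uniform: a fair die roll
calls_gap = random.expovariate(2.0)  # exponential: gap between calls

flips = [bernoulli(0.5) for _ in range(10_000)]
print(sum(flips) / len(flips))  # close to p = 0.5
print(binomial(10, 0.5))        # heads in 10 coin tosses
```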

Hypothesis Testing

Hypothesis testing is a statistical method that is used in making a statistical decision


using experimental data. Hypothesis testing is basically an assumption that we make
about a population parameter. It evaluates two mutually exclusive statements about a
population to determine which statement is best supported by the sample data.

Types in Hypothesis Testing

1. Null Hypothesis

2. Alternative Hypothesis

Let’s get into those

Null hypothesis(H0)

In statistics, the null hypothesis is a general given statement or default position that
there is no relationship between two measured cases or no relationship among groups.


Alternative hypothesis(H1)
The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary
to the null hypothesis

Calculating the test statistic

t = (x̄ − μ) / (s / √n)

Where,
t = test statistic
x̄ = mean of the sample
μ = population mean under the null hypothesis
s = standard deviation of the sample
n = number of samples

Level of significance

It refers to the degree of significance at which we accept or reject the null hypothesis. 100% accuracy is not possible when accepting a hypothesis, so we select a level of significance, usually 5%. This is normally denoted by α (alpha) and is generally 0.05 or 5%, which means your output should be 95% likely to give a similar kind of result in each sample.

P-value

The P value, or calculated probability, is the probability of finding the observed (or more extreme) results when the null hypothesis (H0) of a given study problem is true. If your P-value is less than the chosen significance level, then you reject the null hypothesis, i.e. you accept that your sample supports the alternative hypothesis.

Error in Hypothesis Testing

• Type I error
When we reject the null hypothesis, although that hypothesis was true. Type I error
is denoted by alpha.

• Type II errors

When we accept the null hypothesis but it is false. Type II errors are denoted by beta.

Let's look into hypothesis testing with an example:

Hypothesis on the Vitamin C Factor

1. Is it true that vitamin C has the ability to cure or prevent the common cold, or is it just a myth?

2. Null hypothesis: children who take vitamin C are no less likely to become ill during flu season.

3. Alternative hypothesis: children who take vitamin C are less likely to become ill during flu season.

4. The P-value is 0.20.


5. Conclusion: since the p-value (0.20) is greater than the usual significance level of 0.05, we fail to reject the null hypothesis; the data do not support the alternative hypothesis.

T-test

A t-test allows us to compare the average values of two data sets and determine if they came from the same population. If we were to take a sample of students from class A and another sample of students from class B, we would not expect them to have exactly the same mean and standard deviation.

A t-test is a type of inferential statistic used to determine if there is a significant difference


between the means of two groups, which may be related in certain features.

Calculating a t-test requires three key data values. They include the difference between the
mean values from each data set (called the mean difference), the standard deviation of each
group, and the number of data values of each group.

Types of T-Tests

one-sample t-test
In a one-sample t-test, we compare the average (or mean) of one group against the set
average (or mean). This set average can be any theoretical value (or it can be the population
mean).

t=(x¯−μ0)/sx¯

μ0 = The test value -- the proposed constant for the population mean
x¯ = Sample mean
n = Sample size (i.e., number of observations)
s = Sample standard deviation

sx¯ = Estimated standard error of the mean (s/sqrt(n))

Two-Sample t-test

The two-sample t-test is used to compare the means of two different samples:

t = (x̄1 − x̄2) / √(s²(1/n1 + 1/n2))

Where,
• x̄1 and x̄2 are the means of the two samples
• n1 and n2 are the sample sizes
• s² is an estimator of the common (pooled) variance of the two samples
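The two-sample statistic can be sketched as follows, using a pooled variance estimate; the class scores below are illustrative:

```python
import statistics

def two_sample_t(sample1, sample2):
    """Pooled two-sample t: (mean1 - mean2) / sqrt(s2 * (1/n1 + 1/n2))."""
    n1, n2 = len(sample1), len(sample2)
    x1, x2 = statistics.mean(sample1), statistics.mean(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    # pooled estimator of the common variance of the two samples
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (x1 - x2) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5

class_a = [72, 75, 78, 80, 85]  # sample of students from class A
class_b = [70, 73, 74, 76, 79]  # sample of students from class B
print(two_sample_t(class_a, class_b))
```

The t value would then be compared against the t-distribution with n1 + n2 − 2 degrees of freedom.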


Z-test

Z-test is a statistical test where normal distribution is applied and is basically used for dealing
with problems relating to large samples when n ≥ 30.
• A z-test for a single proportion is used to test a hypothesis about a specific value of the population proportion.
• A z-test for the difference of proportions is used to test the hypothesis that two populations have the same proportion.
• A z-test for a single variance is used to test a hypothesis about a specific value of the population variance.
• A z-test for testing equality of variances is used to test the hypothesis of equality of two population variances when the sample size of each sample is 30 or larger.
z-scores
z-score, or z-statistic, is a number representing how many standard deviations above or
below the mean population the score derived from a z-test is. Essentially, it is a numerical
measurement that describes a value's relationship to the mean of a group of values. If a z-
score is 0, it indicates that the data point's score is identical to the mean score. A z-score of
1.0 would indicate a value that is one standard deviation from the mean. Z-scores may be
positive or negative, with a positive value indicating the score is above the mean and a
negative score indicating it is below the mean.
For a normal population with one sample, the z statistic is:

z = (x̄ − µ) / (σ / √n)

where x̄ is the mean of the sample, µ is the assumed mean, σ is the standard deviation, and n is the number of observations.

One sample z-test

• A one-sample z-test is used to determine whether a particular population parameter, usually the mean, is significantly different from an assumed value

• It helps to estimate the relationship between the mean of the sample and the assumed
mean

• In this case, the standard normal distribution is used to calculate the critical value of the
test

• If the z-value of the sample being tested falls into the criteria for the one-sided test, the
alternative hypothesis will be accepted instead of the null hypothesis


• A one-tailed test would be used when the study has to test whether the population
parameter being tested is either lower than or higher than some hypothesized value

• A one-sample z-test assumes that data are a random sample collected from a normally
distributed population that all have the same mean and same variance

• This hypothesis implies that the data is continuous, and the distribution is symmetric

• Based on the alternative hypothesis set for a study, a one-sided z-test can be either a
left-sided z-test or a right-sided z-test

• For instance, if our H0: µ0 = µ and Ha: µ < µ0, such a test would be a one-sided test or
more precisely, a left-tailed test and there is one rejection area only on the left tail of the
distribution

• However, if H0: µ = µ0 and Ha: µ > µ0, this is also a one-tailed test (right tail), and the
rejection region is present on the right tail of the curve
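The one-sample z-test above can be sketched in Python using only the standard library (the normal CDF is obtained from math.erf); the sample numbers are illustrative:

```python
import math

def one_sample_z(sample_mean, mu0, sigma, n):
    """z = (x̄ - µ) / (σ / sqrt(n)) and the right-tailed p-value P(Z > z)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    normal_cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # Φ(z) via erf
    return z, 1 - normal_cdf

# e.g. a sample of 36 with mean 105 against an assumed mean of 100, σ = 15
z, p = one_sample_z(105, 100, 15, 36)
print(z, p)  # z = 2.0; the p-value is compared against α (say 0.05)
```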

Two sample z-test (two-tailed z-test)

• In the case of two sample z-test, two normally distributed independent samples are
required

• A two-tailed z-test is performed to determine the relationship between the population


parameters of the two samples

• In the case of the two-tailed z-test, the alternative hypothesis is accepted as long as the
population parameter is not equal to the assumed value

• The two-tailed test is appropriate when we have H0: µ = µ0 and Ha: µ ≠ µ0 which may
mean µ > µ0 or µ < µ0

• Thus, in a two-tailed test, there are two rejection regions, one on each tail of the curve

ANOVA

Analysis of variance (ANOVA) is a statistical technique that is used to check if the means
of two or more groups are significantly different from each other. ANOVA checks the
impact of one or more factors by comparing the means of different samples.

The ANOVA test is the initial step in analyzing factors that affect a given data set. Once
the test is finished, an analyst performs additional testing on the methodical factors that
measurably contribute to the data set's inconsistency. The analyst utilizes the ANOVA test
results in an f-test to generate additional data that aligns with the
proposed regression models.
If no real difference exists between the tested groups, which is called the null hypothesis,
the result of the ANOVA's F-ratio statistic will be close to 1. The distribution of all possible
values of the F statistic is the F-distribution. This is actually a group of distribution
functions, with two characteristic numbers, called the numerator degrees of freedom and
the denominator degrees of freedom.


There are two main types of ANOVA: one-way (or unidirectional) and two-way.
One-way or two-way refers to the number of independent variables in your analysis of
variance test. A one-way ANOVA evaluates the impact of a sole factor on a sole response
variable.
F = MST / MSE
Where,
F = ANOVA coefficient
MST = mean sum of squares due to treatment
MSE = mean sum of squares due to error
A two-way ANOVA is an extension of the one-way ANOVA. With a one-way, you have
one independent variable affecting a dependent variable. With a two-way ANOVA, there
are two independents.
Understanding ANOVA

Biologists want to know how different levels of sunlight exposure (no sunlight, low sunlight,
medium sunlight, high sunlight) and watering frequency (daily, weekly) impact the growth
of a certain plant. In this case, two factors are involved (level of sunlight exposure and water
frequency), so they will conduct a two-way ANOVA to see if either factor significantly
impacts plant growth and whether or not the two factors are related to each other.

The results of the ANOVA will tell us whether each individual factor has a significant effect
on plant growth. Using this information, the biologists can better understand which level of
sunlight exposure and/or watering frequency leads to optimal growth.
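A one-way F statistic can be sketched as below on illustrative plant-growth numbers (the two-way case adds a second factor and is omitted here for brevity):

```python
def one_way_anova_f(*groups):
    """F = MST / MSE: treatment mean square over error mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n

    # between-group (treatment) and within-group (error) sums of squares
    ss_treat = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_error = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    mst = ss_treat / (k - 1)  # treatment degrees of freedom: k - 1
    mse = ss_error / (n - k)  # error degrees of freedom: n - k
    return mst / mse

no_sun = [4, 5, 4, 6]      # illustrative growth measurements
low_sun = [7, 8, 6, 7]
high_sun = [10, 12, 11, 9]
print(one_way_anova_f(no_sun, low_sun, high_sun))  # far above 1: means differ
```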

Understanding P-Values

- The p-value, or probability value, tells you how likely it is that your data could have occurred under the null hypothesis.
- In statistics, the p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct.
- The p-value is used all over statistics, from t-tests to regression analysis.
- It is calculated from the likelihood of your test statistic, which is the number computed by a statistical test using your data.
- The p-value is used as an alternative to rejection points to provide the smallest level of significance at which the null hypothesis would be rejected. A smaller p-value means that there is stronger evidence in favour of the alternative hypothesis.
- The p-value is a proportion: if your p-value is 0.05, that means that 5% of the time you would see a test statistic at least as extreme as the one you found if the null hypothesis were true.
- P-values are usually calculated automatically by statistical software.


Parametric and Non-Parametric Tests

- Parametric tests
The basic principle behind parametric tests is that we assume a fixed set of parameters that determine a probabilistic model for the population (such models may also be used in Machine Learning).

- Non-parametric tests
In non-parametric tests we do not assume any parameters for the population; these tests do not depend on the population following a fixed distribution such as the normal distribution.

Chi-Square Test
The chi-square test is a non-parametric test.

This test can be used:

• to find the goodness of fit
• as a test of independence of two variables

Chi-square helps in assessing the goodness of fit between a set of observed frequencies and those expected theoretically: it compares the expected frequencies with the observed frequencies.

If the value of chi-square is high, then the difference between the observed and expected frequencies is greater. If there is no difference between the expected and observed frequencies, then the value of chi-square is equal to zero.

Chi-square is calculated as:

χ² = Σ (Oi − Ei)² / Ei

Where,
Oi = observed value
Ei = expected value
χ² = the chi-squared statistic

Chi-square is also used to test the independence of two variables.
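The goodness-of-fit calculation above is a one-liner in Python; the die-rolling counts below are illustrative:

```python
def chi_square(observed, expected):
    """Chi-squared statistic: sum of (Oi - Ei)^2 / Ei over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# goodness of fit for a die rolled 60 times: a fair die expects 10 of each face
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6
print(chi_square(observed, expected))  # small value: close to expected
print(chi_square(expected, expected))  # identical frequencies give 0
```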

Mann-Whitney U-Test
The Mann-Whitney U-test is a non-parametric test.


This is used to check whether two independent samples were selected from populations having the same distribution.

It is calculated based on the comparison of every observation in the first sample with every observation in the second sample.

The test statistic used here is "U".

The maximum value of "U" is n1*n2 and the minimum value is zero.

Mathematically, U is given by:

U1 = R1 – n1(n1+1)/2

where n1 is the sample size for sample 1, and R1 is the sum of ranks in Sample 1.

U2 = R2 – n2(n2+1)/2

When consulting the significance tables, the smaller of the two values U1 and U2 is used. The sum of the two values is given by,

U1 + U2 = { R1 – n1(n1+1)/2 } + { R2 – n2(n2+1)/2 }

Knowing that R1+R2 = N(N+1)/2 and N=n1+n2, and doing some algebra, we find that the
sum is:

U1 + U2 = n1*n2 .
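The rank-sum formulas above can be sketched as follows; the two small samples are illustrative, and average ranks are used for ties:

```python
def mann_whitney_u(sample1, sample2):
    """U statistics from rank sums: U = R - n(n+1)/2 for each sample."""
    combined = sorted(sample1 + sample2)

    def rank(v):
        # rank of v in the combined data; ties get their average rank
        first = combined.index(v) + 1
        return first + (combined.count(v) - 1) / 2

    r1 = sum(rank(v) for v in sample1)
    r2 = sum(rank(v) for v in sample2)
    n1, n2 = len(sample1), len(sample2)
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = r2 - n2 * (n2 + 1) / 2
    return u1, u2

a = [3, 4, 2, 6]
b = [9, 7, 5, 10]
u1, u2 = mann_whitney_u(a, b)
print(u1, u2)  # note that U1 + U2 == n1 * n2
```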


Recapping with questions and answers


Q1. What is a probability distribution?
A function that gives the probabilities of the possible values of a random variable; it can be visualized with the values on the x-axis and the corresponding probabilities on the y-axis.
Q2. Probability distributions are classified into how many major types
2 types: Continuous and Discrete
Q3. What is Normal distribution
Normal Distribution is a type of Continuous probability distribution
Q4. Properties of Normal distribution
It has a bell-shaped curve which is symmetrical across measures of central tendency.
Mean = median= mode
The total area under the curve is 1.
Q5. Normal distribution is characterized by
Mean and Standard Deviation
Q6. Notation used to represent Normal distribution
N(mu, sigma)
Q7. Properties of standard normal distribution
A Normal distribution with mu= 0 and sigma=1
Q8. How is standard normal distribution represented
N(0,1)
Q9. What are the other applications of scale function?
To make your data scale free and unit free.
Q10. What is Confidence Interval?
A Confidence Interval is a range of values we are fairly sure our true value lies in.
Q11. What is Margin of Error
A small amount of error that is allowed to express the sampling error.
MOE= z*sigma/sqrt(n)
Q12. Formula for standardization
z = (x - mu)/sigma
Q13. What is 1-alpha
Confidence level


Q14. Function to calculate z value in Python


scipy.stats.norm.cdf()
Q15. What we do if population standard deviation, sigma is not known
Resort to sample standard deviation and t distribution
Q16. Characterize the t distribution
The t-distribution is symmetric and bell-shaped, like the normal distribution, but has heavier tails and is characterized by its degrees of freedom.
Q9. Function to calculate student’s t-dist in Python
scipy.stats.t.cdf()
Q17. How to interpret the findings from sample data for population
by calculating confidence interval.
Q18 . What is an outlier? How can outliers be determined in a dataset?
Outliers are data points that vary in a large way when compared to other observations in
the dataset. Depending on the learning process, an outlier can worsen the accuracy of a
model and decrease its efficiency sharply.
Q19. Outliers are determined by using two methods
Standard deviation/z-score, Interquartile range (IQR)
Q20. What is the meaning of six sigma in statistics?
Six sigma is a quality-assurance methodology used widely in statistics to improve
processes and functionality when working with data. A process is considered six
sigma when 99.99966% of its outcomes are defect-free.
Q21. What is the Pareto principle?
The Pareto principle is also called the 80/20 rule, which means that 80 percent of the
results are obtained from 20 percent of the causes in an experiment.
Q22. What are quantitative data and qualitative data?
Quantitative data is also known as numeric data. And Qualitative data is also known as
categorical data.
Q23. What is a bell-curve distribution?
A normal distribution can be called a bell-curve distribution. It gets its name from the
bell curve shape that we get when we visualize the distribution.
Q24. What is skewness?
Skewness measures the lack of symmetry in a data distribution. In skewed data the
mean, median, and mode differ noticeably, and the distribution departs from normality.
Q25. What is kurtosis?
Kurtosis describes the heaviness of the tails of a distribution relative to the
normal distribution. In practice it is read as a measure of how outlier-prone the
data are: high kurtosis indicates heavy tails with many extreme values. To address
this, we either add more data to the dataset or treat the outliers.
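scipy exposes both measures directly; a quick sketch with toy data (note that scipy's kurtosis reports excess kurtosis, i.e. 0 for a normal distribution):

```python
from scipy.stats import skew, kurtosis

symmetric    = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 1, 2, 2, 3, 10, 20]     # long right tail

print(skew(symmetric))                     # 0.0 — perfectly symmetric
print(skew(right_skewed))                  # positive — right-skewed
print(kurtosis(symmetric))                 # negative — lighter tails than normal
```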
Q26. What is correlation?
Correlation measures the relationship between two quantitative variables. Unlike
covariance, correlation is standardized, so it tells us how strong the linear
relationship between the two variables is. Its value ranges from -1 to +1.
Q27. What are left-skewed and right-skewed distributions?
A left-skewed distribution is one where the left tail is longer than the right tail;
here, mean < median < mode. Similarly, a right-skewed distribution is one where the
right tail is longer than the left; here, mean > median > mode.
Q28. What is the impact of outliers in statistics?
Outliers can strongly distort the result of a statistical analysis. For example, the
mean of a dataset that contains outliers can differ substantially from the mean we
would get once the outliers are removed.
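The effect is easy to demonstrate: the mean shifts sharply while the median barely moves, which is one reason the median is preferred for skewed data. Hypothetical salary data:

```python
from statistics import mean, median

salaries = [40, 42, 45, 47, 50]                # hypothetical, in $1000s
with_outlier = salaries + [500]                # one extreme value added

print(mean(salaries), mean(with_outlier))      # 44.8 vs ~120.7 — the mean jumps
print(median(salaries), median(with_outlier))  # 45 vs 46 — the median barely moves
```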
Work it out

Height (in cm)   Frequency   Cumulative frequency
>140                 5                4
>145                 6               11
>150                18               29
>155                11               40
>160                 6               46
>165                 7               52

* Find the median, mean, and mode of the heights.

* A survey of 20 houses in an area by a group of people produced the following
frequency table for the number of family members in a house:

Size of family    1-3   3-5   5-7   7-9   9-11
No. of families    7     8     2     2     1

Find the mean, median, and mode of this data.

* Which among the following cannot be represented graphically?
(a) Median
(b) Mean
(c) Mode
(d) None of the above options

* In case of computation of the mean within grouped data, the assumption is that
frequencies are –
(a) Centered at the lower limit among classes
(b) Centered at the upper limit among classes
(c) Evenly placed across all classes
(d) Centered within class marks among classes
* Which of the following is correct?
a. The probability of a type I error is β.
b. The probability of a type II error is (1 - β).
c. The probability of a type II error is α.
d. The probability of a type I error is (1 - α).
e. none of the above

* The bell- or mound-shaped distribution will have approximately 68% of the data
within what number of standard deviations of the mean?
a. one standard deviation
b. two standard deviations
c. three standard deviations
d. four standard deviations
e. none of the above

* A random sample of 5 mosquitos is taken. The number of mosquitos carrying the
West Nile virus in the sample is an example of which random variable?
a. normal
b. student’s t
c. binomial
d. uniform
e. none of the above

* A political scientist is studying voters in California. It is appropriate for him
to use a mean to describe
a. the age of a typical voter.
b. the party affiliation of a typical voter.
c. the sex of a typical voter.
d. the county of residence of a typical voter.
e. none of the above

* The long-run average of a random variable is
a. the expected value
b. the coefficient of determination
c. the standard deviation
d. the mode
e. none of the above