Statistics for Data Science PDF
DatabaseTown.co
STATISTICS FOR DATA SCIENCE
A. DESCRIPTIVE STATISTICS:
Before discussing descriptive statistics, we first recall the basic concept of data and its types.
Data:
TYPES OF DATA:
DATA (plural); the singular form is datum.
- Discrete Data: this data is countable, e.g. no. of children, whole numbers
- Continuous Data: this data is measurable, e.g. height, width, length
- Binomial Data: variable data with only two options, e.g. good or bad, true or false
- Nominal or Unordered Data: variable data which is in unordered form, e.g. red, green, man
- Ordinal Data: variable data with proper order, e.g. short, medium, long
- Interval: no true zero, e.g. temperature (a reading of zero does not mean an absence of temperature)
- Ratio: absolute zero, e.g. height can be zero
The raw dataset is the basic foundation of data science, and it may be of different kinds, such as Structured Data (tabular structure), Unstructured Data (pictures, recordings, messages, PDF documents and so forth) and Semi-Structured Data.
© DatabaseTown.co
DATA
STRUCTURED DATA
- Formatted, highly organized, easily searchable and understandable by machine language
- e.g. name, address, dates, etc.
- RDBMS, CRM, ERP are suitable for structured data
UNSTRUCTURED DATA
- Unformatted, unorganized, cannot be processed and analyzed by utilizing conventional methods
- e.g. text, audio, video, social media activity, etc.
- Non-relational and NoSQL databases are best for unstructured data
Furthermore, there are two kinds of data, i.e. population data and sample data.
Population Data:
Population data is the collection of all items of interest. Its size is denoted by ‘N’, and the numbers we obtain when using a population are called parameters.
Sample Data:
Sample data is a subset of the population. Its size is denoted by ‘n’, and the numbers we obtain when using a sample are called statistics.
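The difference between a parameter and a statistic can be illustrated with a short sketch. The population here is simulated (1000 made-up heights), and the names `population`, `sample`, `mu` and `x_bar` are illustrative assumptions, not part of the original notes.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical population: N = 1000 simulated heights in centimetres.
population = [random.gauss(170, 10) for _ in range(1000)]
N = len(population)

# A sample is a subset of the population: n = 50 items drawn without replacement.
sample = random.sample(population, 50)
n = len(sample)

mu = statistics.mean(population)   # a parameter (computed from the whole population)
x_bar = statistics.mean(sample)    # a statistic (computed from the sample only)

print(f"N = {N}, population mean (parameter) = {mu:.2f}")
print(f"n = {n}, sample mean (statistic)     = {x_bar:.2f}")
```

The two means are usually close but not identical, which is why a statistic is treated as an estimate of the corresponding parameter.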
i. Bar Chart:
Bar charts are frequently used to display data. In a bar chart, each bar represents a category and the y-axis shows the frequency, as shown in figure
[Figure: bar chart with Series 1, Series 2 and Series 3; the y-axis runs from 0 to 6]
ii. Pie chart:
Pie charts are frequently used to display market share. If we want to see the share of any item as a part of the total, then we use a pie chart, as shown in figure below:
[Figure: pie chart titled "Sales" with slices 1st Qtr, 2nd Qtr, 3rd Qtr and 4th Qtr]
Category Frequency
Black 12
Brown 5
Blond 3
Red 7
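A frequency table like the one above can be converted into relative frequencies, which are the shares that a pie chart would display. The sketch below uses the categories and counts from the table; the variable names are illustrative.

```python
# Frequency table from the text: category -> frequency.
frequencies = {"Black": 12, "Brown": 5, "Blond": 3, "Red": 7}

total = sum(frequencies.values())  # total number of observations

# Relative frequency = frequency / total; each value is one slice of the pie.
relative = {category: count / total for category, count in frequencies.items()}

for category, share in relative.items():
    print(f"{category:<6} {frequencies[category]:>3} {share:6.1%}")
```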
i. Mean
ii. Median
iii. Mode
i. Mean:
It is the most popular measure of central tendency. It is used with both discrete and continuous data. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. Therefore, if we have n values in a data set and they have values x1, x2, ..., xn, the sample mean, usually denoted by $\bar{x}$, is given by
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
If we intend to calculate the population mean instead of the sample mean, then we use the Greek letter µ:
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
ii. Median:
It is the middle score of a dataset that has been arranged in order of magnitude. In order to calculate the median, suppose we have the following dataset:
10 20 30 15 20 30 15 20 30 15 20
First of all, we re-arrange this data in order of magnitude from smallest to largest:
10 15 15 15 20 20 20 20 30 30 30
Therefore, in this case the sixth value, 20, is our median. It is the middle mark, as there are 5 scores before it and 5 scores after it. However, if we have an even number of scores like this one,
10 20 30 15 20 15 20 30 15 20
In this case, after sorting, we take the two middle values, i.e. 20 and 20, and average them to get the median, i.e. 20.
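The mean and median calculations can be sketched in a few lines. The two lists are the datasets from the text; `median_by_hand` is an illustrative helper that mirrors the sort-then-pick-the-middle procedure, with Python's `statistics.median` shown as a cross-check.

```python
import statistics

odd_data = [10, 20, 30, 15, 20, 30, 15, 20, 30, 15, 20]   # 11 scores
even_data = [10, 20, 30, 15, 20, 15, 20, 30, 15, 20]      # 10 scores

def median_by_hand(values):
    """Sort the values, then take the middle one (or average the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: mean of the two middle values

mean_odd = sum(odd_data) / len(odd_data)  # sum of the values divided by their count

print("mean:", mean_odd)
print("median (odd count):", median_by_hand(odd_data))
print("median (even count):", median_by_hand(even_data))
print("library cross-check:", statistics.median(odd_data), statistics.median(even_data))
```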
iii. Mode:
It is the value that occurs most often in our dataset. A dataset can have no mode, one mode or multiple modes. It can be calculated by finding the value with the maximum frequency. For instance,
[Figure: two example distributions, labeled "Mode" and "Two modes"]
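Finding the mode amounts to counting how often each value occurs. The sketch below is one possible implementation; the helper name `modes` is illustrative, and treating a dataset in which every value occurs equally often as having no mode is a convention assumed here.

```python
from collections import Counter

def modes(values):
    """Return all values with the maximum frequency, or [] if there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Assumed convention: if several distinct values all occur equally often,
    # the dataset is reported as having no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(value for value, c in counts.items() if c == highest)

one_mode = modes([10, 20, 30, 15, 20, 30, 15, 20, 30, 15, 20])  # dataset from the text
two_modes = modes([1, 1, 2, 3, 3])                              # made-up example
no_mode = modes([1, 2, 3])                                      # made-up example

print(one_mode, two_modes, no_mode)
```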
Measure of Asymmetry:
Skewness:
It is the measure of asymmetry that shows whether the observations in a dataset are concentrated on one side. Skewness can be calculated by the following formula:
$$\text{Skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{3}}{\left(\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\right)^{3}}$$
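As a minimal sketch, the skewness formula (the mean cubed deviation divided by the cube of the sample standard deviation) can be computed directly. The function name and the three small datasets are made up for illustration.

```python
import math

def sample_skewness(values):
    """Mean cubed deviation divided by the cubed sample standard deviation."""
    n = len(values)
    x_bar = sum(values) / n
    third_moment = sum((x - x_bar) ** 3 for x in values) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))
    return third_moment / s ** 3

right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]     # long tail on the right
left_skewed = [-x for x in right_skewed]     # mirror image: long tail on the left
symmetric = [1, 2, 3, 4, 5]

print("right:", sample_skewness(right_skewed))     # positive value
print("left:", sample_skewness(left_skewed))       # negative value
print("symmetric:", sample_skewness(symmetric))    # zero
```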
There are two types of skewness,
i. Right or Positive Skewness
ii. Left or Negative Skewness
[Figure: two histograms, each with a Frequency axis from 10 to 40]
In right (positive) skewness, typically mean > median; in left (negative) skewness, typically mean < median.
However, if mean = median = mode, then there is no skew and the distribution is symmetrical.
Sample Variance formula:
$$s^{2} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}{n-1}$$
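The sample variance formula translates directly into code. The sketch below reuses the 11-score dataset from the median example; `sample_variance` is an illustrative helper, and `statistics.variance` from the standard library is shown as a cross-check.

```python
import statistics

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

data = [10, 20, 30, 15, 20, 30, 15, 20, 30, 15, 20]
s2 = sample_variance(data)

print("sample variance:", s2)
print("library cross-check:", statistics.variance(data))
```

Dividing by n - 1 rather than n is what distinguishes the sample variance from the population variance.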
Covariance:
- A covariance of 0 means that the two variables have no linear relationship (they are uncorrelated, which does not necessarily mean they are independent).
- A positive covariance means that the two variables move together.
- A negative covariance means that the two variables move in opposite directions.
Correlation:
- A correlation of 0 means that the two variables have no linear relationship (they are uncorrelated, which does not necessarily mean they are independent).
- A correlation of 1 means perfect positive correlation.
- A correlation of -1 means perfect negative correlation.
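Sample Covariance formula:
$$s_{xy} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1}$$
Sample Correlation formula:
$$r = \frac{s_{xy}}{s_x s_y}$$
Population Covariance formula:
$$\sigma_{xy} = \frac{\sum_{i=1}^{N}(x_i-\mu_x)(y_i-\mu_y)}{N}$$
Population Correlation formula:
$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$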
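The sample covariance and correlation can be sketched as follows. The function names and the small paired datasets are made up for illustration; the standard deviations are obtained as the square root of each variable's covariance with itself.

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 6, 8, 10]     # moves together with xs
ys_down = [10, 8, 6, 4, 2]   # moves in the opposite direction

print("covariance (up):", sample_covariance(xs, ys_up))
print("correlation (up):", sample_correlation(xs, ys_up))
print("correlation (down):", sample_correlation(xs, ys_down))
```

Unlike covariance, the correlation is unit-free and always lies between -1 and 1, which makes it easier to compare across pairs of variables.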
B. INFERENTIAL STATISTICS:
Probability distribution:
It is a statistical function that describes all the possible values that a random variable can take within a given range, together with their likelihoods. This range is bounded by the lowest and the highest possible values, but precisely where a possible value is likely to fall on the probability distribution depends on a number of factors such as the distribution's mean, standard deviation, skewness, and kurtosis. A few examples of distributions are:
Normal distribution
Binomial distribution
Student’s T distribution
Uniform distribution
Poisson distribution
It is a common confusion that a distribution is a graph; in fact, it is the rule that helps us determine how the values are positioned in relation to each other.
i. Normal Distribution:
It is also known as the Gaussian distribution or bell curve. It is widely used in regression analysis. Many things closely follow this distribution:
heights of people
size of things produced by machines
errors in measurements
blood pressure
marks on a test
stock market information
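A normal distribution with mean µ and variance σ² is written as
$$X \sim N(\mu, \sigma^{2})$$
Every normal distribution can be standardized using the following formula
$$z = \frac{x-\mu}{\sigma}$$
The standardized variable follows the standard normal distribution:
$$Z \sim N(0, 1)$$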
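Standardization subtracts the mean and divides by the standard deviation. The sketch below applies this to a made-up list of test scores; the helper `standardize` and the data are illustrative, and the scores are treated as a whole population, so the population standard deviation is used.

```python
from statistics import mean, pstdev

def standardize(x, mu, sigma):
    """z = (x - mu) / sigma"""
    return (x - mu) / sigma

scores = [55, 60, 65, 70, 75]   # hypothetical marks on a test
mu = mean(scores)
sigma = pstdev(scores)          # population standard deviation

z_scores = [standardize(x, mu, sigma) for x in scores]

print("z-scores:", [round(z, 3) for z in z_scores])
print("mean of z-scores:", mean(z_scores))
print("standard deviation of z-scores:", pstdev(z_scores))
```

After standardization the values have mean 0 and standard deviation 1, whatever the original units were.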
Central Limit Theorem:
This theorem states that the distribution of sample means approximates a normal distribution as the sample size gets larger (assuming that all samples are the same in size), regardless of the shape of the population distribution. Sample sizes equal to or greater than 30 are usually considered enough for the Central Limit Theorem to hold. The main aspect of this theorem is that the average of the sample means will equal the population mean, while the standard deviation of the sample means will equal the population standard deviation divided by √n. Furthermore, an adequately large sample can predict the characteristics of a population accurately. In the Central Limit Theorem, the sampling distribution of the mean is approximately
$$\bar{X} \sim N\left(\mu, \frac{\sigma^{2}}{n}\right)$$
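A small simulation can make the Central Limit Theorem concrete. The sketch below is only an illustration under stated assumptions: the population is taken to be exponential with rate 1 (strongly skewed, with mean 1 and standard deviation 1), each sample has 30 observations, and 2000 samples are drawn. The function name is hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

def draw_sample_means(sample_size, n_samples):
    """Return the means of n_samples samples of exponential(1) draws."""
    means = []
    for _ in range(n_samples):
        sample = [random.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    return means

sample_size = 30
means = draw_sample_means(sample_size, n_samples=2000)

mean_of_means = statistics.mean(means)
sd_of_means = statistics.stdev(means)
expected_sd = 1 / sample_size ** 0.5   # sigma / sqrt(n), with sigma = 1

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma / sqrt(n) is {expected_sd:.3f})")
```

The sample means cluster around the population mean, and their spread is close to σ/√n rather than to σ itself.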
Estimators and Estimates:
Estimators:
An estimator is a mathematical function of the sample that tells us how to calculate an estimate of a parameter from a sample. The smaller the variance, the more efficient the estimator. Hence, we need to identify what makes a “good” estimator. A few vital criteria for the goodness of an estimator are based on these properties:
- Bias
- Variance
- Mean Square Error
Estimates:
An estimate is the output value that you get from an estimator. There are two types of estimates: point estimates (a single value) and interval estimates, such as the confidence interval.
Confidence Interval:
It is an interval within which we are confident, at a certain percentage level of confidence, that the population parameter will fall.
Margin of Error:
A margin of error tells us by how many percentage points our results may differ from the real population value. It can be calculated in the following two ways:
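One widely used calculation multiplies a critical value by the standard error of the statistic. The sketch below assumes a known population standard deviation and a z critical value from the standard normal distribution; the function names and the numbers (mean 100, σ = 15, n = 36) are illustrative assumptions.

```python
import math
from statistics import NormalDist

def margin_of_error(sigma, n, confidence=0.95):
    """Critical z value times the standard error sigma / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    return z * sigma / math.sqrt(n)

def confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Sample mean plus or minus the margin of error."""
    me = margin_of_error(sigma, n, confidence)
    return (x_bar - me, x_bar + me)

me = margin_of_error(sigma=15, n=36)
low, high = confidence_interval(x_bar=100, sigma=15, n=36)

print(f"margin of error: {me:.3f}")
print(f"95% confidence interval: ({low:.3f}, {high:.3f})")
```

When the population standard deviation is unknown and the sample is small, a Student's T critical value and the sample standard deviation are typically used in place of z and σ.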
Student’s T Distribution:
It is mostly used to estimate population parameters when the sample size is small and/or the population variance is not known. It is particularly useful in cases where we do not have enough information or where acquiring the requisite information is too costly. Compared with the normal distribution, it has fatter tails and a lower peak. The following formula gives the Student’s T statistic for a variable with a normally distributed population:
$$t_{v,\alpha} = \frac{\bar{x}-\mu}{s/\sqrt{n}}$$
where v is the degrees of freedom
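The T statistic can be computed from a sample in a few lines. This is a minimal sketch: the sample values and the hypothesized mean of 12 are made up, the helper name is illustrative, and the degrees of freedom are taken as n - 1, the usual choice for a one-sample statistic.

```python
import math
import statistics

def t_statistic(sample, mu):
    """Return ((x_bar - mu) / (s / sqrt(n)), degrees of freedom)."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)           # sample standard deviation (divides by n - 1)
    t = (x_bar - mu) / (s / math.sqrt(n))
    return t, n - 1

sample = [12, 14, 15, 11, 13]              # hypothetical small sample
t, dof = t_statistic(sample, mu=12)

print(f"t = {t:.4f} with {dof} degrees of freedom")
```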
C. HYPOTHESIS TESTING:
Scientific Method:
The scientific method is a process for gathering data and processing information. It was first sketched by Sir Francis Bacon (1561-1626) to provide logical, rational problem solving across many scientific fields. The main principles of the scientific method are systematic observation, predictability, verifiability and amendment of hypotheses. The basic steps of the scientific method are:
What is hypothesis?
A hypothesis is an assumption based on limited evidence that requires further testing and experimentation. After further testing, a hypothesis can generally be accepted or rejected.
Null Hypothesis (H0):
A null hypothesis is the hypothesis to be tested. It is the hypothesis that the investigator is trying to show to be false. It is the status quo. The concept of the null is similar to someone remaining innocent until there is enough evidence to prove guilt. For instance, someone says that a data engineer's normal salary is Rs.1,25,000/-, but in our opinion he may be wrong, so we carry out statistical testing to try to reject this claim; the claim is called the null hypothesis.
DECISIONS:
After testing, there are two possible decisions, i.e. accept the null hypothesis (more precisely, fail to reject it) or reject the null hypothesis. Accepting the null hypothesis means there is insufficient evidence to support the change or novelty proposed by the alternative. Rejecting the null hypothesis means there is sufficient statistical evidence to show that the null hypothesis is false.
Level of Significance:
It is the probability of rejecting a null hypothesis by the test when it is really true. It is
denoted by α (Alpha).
Confidence Level:
It is the probability that a parameter lies within a specified range of values. It is denoted by c. The level of significance is connected with the confidence level, and the relationship between them is c = 1 – α. The common levels of significance and the corresponding confidence levels are given below:-
Rejection region:
The rejection region is the set of values of the test statistic for which the null hypothesis is rejected.
A one-sided (one-tailed) test is used when the hypotheses are directional, i.e. stated with the signs <, >, ≤ or ≥ (for example, H0: µ ≥ µ0 against H1: µ < µ0). The rejection region for a one-sided (one-tailed) test is shown in figure:
In the left-tailed test, the rejection region is shaded on the left side (as shown in the above figure).
A two-sided (two-tailed) test is used when the null contains the equality sign (=) and the alternative contains the inequality sign (≠). The rejection region for a two-sided (two-tailed) test is shown in figure:-
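The rejection regions can be expressed as simple comparisons against critical values. The sketch below assumes a z test, so the critical values come from the standard normal distribution; the function name and the example z values are illustrative.

```python
from statistics import NormalDist

def in_rejection_region(z, alpha, tail):
    """Check whether a z statistic falls in the rejection region.

    tail is "left", "right" or "two".
    """
    std_normal = NormalDist()
    if tail == "left":
        return z < std_normal.inv_cdf(alpha)             # all of alpha in the left tail
    if tail == "right":
        return z > std_normal.inv_cdf(1 - alpha)         # all of alpha in the right tail
    if tail == "two":
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)  # alpha split across both tails
    raise ValueError("tail must be 'left', 'right' or 'two'")

print(in_rejection_region(-1.8, 0.05, "left"))   # critical value is about -1.645
print(in_rejection_region(-1.8, 0.05, "two"))    # critical values are about -1.96 and 1.96
print(in_rejection_region(2.0, 0.05, "two"))
```

The same statistic can lead to rejection in a one-tailed test but not in a two-tailed test, because the two-tailed test splits α between both tails.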
Statistical Errors:
There are two types of statistical errors:
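i. Type I Error (False Positive): rejecting a null hypothesis that is actually true
ii. Type II Error (False Negative): accepting a null hypothesis that is actually false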
P-value:
The p-value is the smallest level of significance at which the null hypothesis can still be rejected. A smaller p-value means that there is stronger evidence in support of the alternative hypothesis. Usually, the p-value is reported with 3 digits after the decimal point (x.xxx).
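As a minimal sketch, the p-value of a z test can be computed from the standard normal distribution and compared with α. The function name, the z value of 2.3 and the choice α = 0.05 are illustrative assumptions; a T test would use the Student's T distribution instead.

```python
from statistics import NormalDist

def p_value(z, tail="two"):
    """p-value of a z statistic for a left-, right- or two-tailed test."""
    std_normal = NormalDist()
    if tail == "left":
        return std_normal.cdf(z)
    if tail == "right":
        return 1 - std_normal.cdf(z)
    if tail == "two":
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right' or 'two'")

alpha = 0.05
p = p_value(2.3, tail="two")
reject_null = p < alpha   # reject the null hypothesis when p is below alpha

print(f"p-value = {p:.3f}, reject H0 at alpha = {alpha}: {reject_null}")
```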