DS Chapter - 2
Introduction
Steps for processing data
Roles of statistics in Data Science
Data Exploration
Data Cleaning
Data Transformation
Data Visualization
Finding Similarity/Dissimilarity
Model Selection and Evaluation
Hypothesis Testing
Statistical Modeling
Probability Distribution and Estimation
Types of Statistics
Descriptive Statistics
● Provides ways for describing, presenting, summarizing, and organizing the data
Types of Descriptive Statistics
Descriptive Statistics
● Measures of Frequency
● Measures of Central Tendency: Mean, Mode, Median
● Measures of Dispersion: Range, Interquartile Range, Standard Deviation
Measures of Frequency
● Frequency describes how often each value occurs in a data set. For example, in 22, 13, 4, 6, 13, 11, 10 the value 13 occurs twice and every other value occurs once.
● A frequency table for another sample:

Data Value   Frequency
2            3
3            5
4            3
5            6
6            2
7            1
Measures of Central Tendency
Mean
● The most common and effective numeric measure of the center of a set of data.
● It is the sum of all the observations divided by the sample size.
● The types of mean are:
Arithmetic Mean
Harmonic Mean
Geometric Mean
Arithmetic Mean
● It is obtained by adding all the values and then dividing the sum by the total number of values.
● Let x1, x2, x3, …, xN be a set of N values or observations. The arithmetic mean of this set of values is:

x̄ = (x1 + x2 + … + xN) / N
● Suppose the marks obtained by 10 students in a quiz are 8, 3, 7, 6, 9, 10, 5, 7, 8, 5
● We can calculate
(8 + 3 + 7 + 6 + 9 + 10 + 5 + 7 + 8 + 5) / 10 = 6.8
The arithmetic mean can be calculated by using the mean() function from the NumPy library
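For example, a minimal sketch of the NumPy call mentioned above:

import numpy as np

marks = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]
# np.mean() returns the arithmetic mean
print(np.mean(marks))  # 6.8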
Harmonic Mean
● The harmonic mean is the reciprocal of the average of the reciprocals of the terms in a series. The formula to determine the harmonic mean is n / [1/x1 + 1/x2 + 1/x3 + ... + 1/xn].
● Example: x = (6, 3, 1, 5, 2)
● HM = 5 / (1/6 + 1/3 + 1/1 + 1/5 + 1/2) = 5 / 2.2 ≈ 2.27
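A minimal sketch using the harmonic_mean() function from Python's statistics module (listed later in these slides):

from statistics import harmonic_mean

x = [6, 3, 1, 5, 2]
print(harmonic_mean(x))  # ≈ 2.27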
Geometric Mean
● The geometric mean of n positive values is the nth root of their product: GM = (x1 · x2 · … · xn)^(1/n)
Median
● The median is the middle value when the data are arranged in order; for an even number of observations it is the average of the two middle values.
Mode
● The mode is the value that occurs most frequently in the data set.
● Advantages
○ Can be used for categorical values
○ Can be determined for both qualitative and quantitative values
○ Not affected by extreme values
● Disadvantages
○ Not based on all values
○ The mode cannot be clearly defined in the case of a multimodal series
○ Not suitable for further statistical analysis and algebraic calculation
Measures of Dispersion
● Dispersion is the extent to which values in a distribution differ from the average of the distribution.
● Measures of central tendency alone are not sufficient to describe the data.
● Measures of dispersion help us to know the degree of variability in the data and provide a better understanding of the data.
● Measures of dispersion assess the spread of numeric data.
● The measures are:
o Range
o Quantiles
o Quartiles
o Percentiles
o Interquartile range
Range
● The range is the difference between the largest and smallest values in the data set: Range = max − min
Standard Deviation
● It is a measure of how much the data values deviate from the mean value
● σ = √( ∑(x − x̄)² / n )
Variance
● Variance is the average of the squared deviations from the mean: σ² = ∑(x − x̄)² / n
● The standard deviation is the square root of the variance.
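A minimal sketch computing standard deviation and variance with NumPy (np.std and np.var use the population formulas by default, i.e. ddof=0):

import numpy as np

data = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]
print(np.std(data))  # standard deviation
print(np.var(data))  # variance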
Interquartile Range
● The interquartile range is a measure of variation that describes how spread out the data is.
● It is a measure of variability based on splitting the data into quartiles.
● The interquartile range is the difference between the first and third quartiles: IQR = Q3 − Q1.
● Quartiles divide the range of data into four equal parts, demarcated by the three quartiles Q1, Q2, and Q3.
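A minimal sketch using NumPy's percentile function (the 25th and 75th percentiles correspond to Q1 and Q3):

import numpy as np

data = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]
q1, q3 = np.percentile(data, [25, 75])
print(q3 - q1)  # interquartile range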
Python statistical functions:

Method                       Description
statistics.harmonic_mean()   Calculates the harmonic mean (central location) of the given data
statistics.mode()            Calculates the mode (central tendency) of the given numeric or nominal data
Hypothesis Testing
● A sample is considered a representative of the entire universe or population.
● One of the most promising techniques used in data analysis to check whether a stated hypothesis is accepted or rejected (this process is called hypothesis testing).
● Hypothesis testing is mainly used to determine whether there is sufficient evidence in a data sample to conclude that a particular condition holds for an entire population.
● There are two hypotheses:
○ Null Hypothesis
○ Alternative Hypothesis
● The null hypothesis states that there is no difference between groups or no relationship between variables (e.g., private coaching has no effect on students' results).
● The alternative hypothesis states that there is a relationship between the two variables being studied (one variable has an effect on the other).
Steps for Hypothesis Testing
Four basic steps are to be followed for hypothesis testing:
Step 1: State the null and alternative hypothesis
Step 2: Select the appropriate significance level and check the specified test
assumption
Step 3: Analyze the data by computing appropriate statistical tests
Step 4: Interpret the result.
Two conclusions that can be inferred:
1. Reject the null hypothesis when there is enough evidence to support the alternative hypothesis.
2. Fail to reject the null hypothesis when there is not enough evidence to support the alternative hypothesis.
Example of Hypothesis
● For example, suppose a biologist believes that a certain fertilizer will cause
plants to grow more during a one-month period than they normally do, which
is currently 20 inches. To test this, she applies the fertilizer to each of the
plants in her laboratory for one month.
● She then performs a hypothesis test using the following hypotheses:
● H0: μ = 20 inches (the fertilizer will have no effect on the mean plant growth)
● HA: μ > 20 inches (the fertilizer will cause mean plant growth to increase)
● For example, suppose a doctor believes that a new drug is able to reduce
blood pressure in obese patients. To test this, he may measure the blood
pressure of 40 patients before and after using the new drug for one month.
● He then performs a hypothesis test using the following hypotheses:
● H0: μafter = μbefore (the mean blood pressure is the same before and after using
the drug)
● HA: μafter < μbefore (the mean blood pressure is less after using the drug)
Parametric and Non-parametric Hypothesis Tests
In a parametric test, information about the population is completely known and can be used for statistical inference. Choosing the type of parametric test to apply is a decision-making task.
Steps for a parametric test:
Step 1: State the null and alternative hypotheses
Step 2: Choose the level of significance
Step 3: Identify the type of parametric test to be conducted
Step 4: Find the critical value to decide the acceptance/rejection regions
Step 5: Take the sample and compute the test statistic
Step 6: Compare the obtained value with the critical value to decide whether the null hypothesis is accepted or rejected
Core Terms Related to Parametric Tests
The null hypothesis and the alternative hypothesis are mutually exclusive.
1. Acceptance and critical regions:
All sets of possible values can be divided into two mutually exclusive groups:
● Acceptance Region: Values that appear consistent with the null hypothesis.
● Rejection Region: Consists of values unlikely to occur if the null hypothesis is true.
The value(s) that separate the critical region from the acceptance region are called critical values.
One-tailed Test and Two-tailed Test
If the hypothesis is stated with an equals sign only, it is a two-tailed test.
If the hypothesis is stated with a greater-than or less-than sign, it is a one-tailed test.
Case 1: A government school states that the dropout rate of female students between the ages of 12 and 18 years is 28%. (Fig. 3: two-tailed test)
Significance Level
It is denoted by α.
For example, a significance level of 0.03 indicates that a 3% risk is being taken of concluding that a difference exists when there is actually no difference.
Calculated probability (p-value)
It is the calculated probability that, when the null hypothesis is true, the statistical summary will be greater than or equal to the actual observed results.
It is the probability of finding the observed, or more extreme, results when the null hypothesis is true.
Some of the widely used hypothesis testing types are:
Z-test
T-test
Chi-Square
Types of Hypothesis Testing
Z-Test
● This test is used for comparing the mean of a sample to some hypothesized mean
of a given population.
● The method for carrying out the z-test for one sample is:

z = (X̄ − µH0) / (σp / √n)

where µH0 = hypothesized population mean, σp = population standard deviation, n = sample size
Example
● Consider a sample of 500 female students with a mean height of 5.4 feet. The task is to find whether it can reasonably be regarded as a sample from a large population with a mean height of 5.6 feet and a standard deviation of 1.45 feet. Let us consider a 5% level of significance to solve the problem.
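A worked computation for this example, sketched in Python (the 1.96 cutoff assumes the usual two-tailed critical value at the 5% level):

import math

z = (5.4 - 5.6) / (1.45 / math.sqrt(500))
print(z)  # ≈ -3.08; since |z| = 3.08 > 1.96, reject H0 at the 5% level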
T-test
● The one-sample t-test is mainly used for determining whether the mean of a sample is statistically different from a known or hypothesized mean of a given population.
● The test variable needs to be continuous.

t = (X̄ − µH0) / (σs / √n)

where σs = sample standard deviation
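A minimal sketch of a one-sample t-test using SciPy (the sample values here are made up for illustration):

from scipy import stats

sample = [5.1, 5.3, 5.6, 5.4, 5.2, 5.5]
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.6)
print(t_stat, p_value)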
Chi Square test
● The chi-square test compares observed frequencies with expected frequencies (e.g., to test the independence of two categorical variables): χ² = ∑ (O − E)² / E
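A minimal sketch of a chi-square goodness-of-fit test using SciPy (illustrative counts; the observed and expected totals must match):

from scipy import stats

observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]
chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p_value)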
Z-TEST
import math
import numpy as np
from numpy.random import randn
from statsmodels.stats.weightstats import ztest

# Random array of 50 numbers centered at 110; the scale 15/sqrt(50)
# is the standard error of the mean for sd=15 and n=50
mean_1 = 110
sd_1 = 15 / math.sqrt(50)
null_mean = 100
data = sd_1 * randn(50) + mean_1

# print mean and sd
print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))
# ztest returns the test statistic and the p-value
t_stat, p_value = ztest(data, value=null_mean, alternative='larger')
if p_value < 0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")
ANOVA
● ANOVA (Analysis of Variance) is used to compare the means of more than two groups by analyzing the variation between groups and within groups.
Two-sample parametric tests
Paired Sample t-test
● Carried out to compare two population means for two samples in which each observation in one sample can be paired with an observation in the other.
● This test is usually used in the case of before-and-after observations on the same subjects.
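A minimal sketch of a paired-sample t-test using SciPy (illustrative before/after values):

from scipy import stats

before = [140, 150, 138, 145, 155]
after = [132, 145, 135, 140, 148]
t_stat, p_value = stats.ttest_rel(before, after)
print(t_stat, p_value)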
Non-parametric Hypothesis Tests
● Information about the population is unknown, so no assumptions can be made regarding the population.
● It is more suitable for data that can be represented on qualitative scales (nominal or ordinal).
● These tests cover techniques that do not rely on the data belonging to any particular distribution.
● The distribution of data can be skewed, and the population variance can be non-homogeneous.
● One sample non-parametric test
One factor Chi-Square
Binomial
Wilcoxon Signed Rank Test
● Two Independent Sample
Mann-Whitney Test
Kolmogorov–Smirnov Test
● Two Paired Samples
Sign
Chi-Square
Wilcoxon Signed rank
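As one example from the list above, a minimal sketch of the Mann-Whitney test using SciPy (illustrative data):

from scipy import stats

group_a = [3, 4, 2, 6, 2, 5]
group_b = [9, 7, 5, 10, 8, 6]
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
print(u_stat, p_value)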
Estimation of Parameter values
● In statistics, estimation (or inference) refers to the task of drawing conclusions about a population based on the information a sample provides about that population.
● This can be done in two ways:
Point estimate
Interval estimate
● Point estimation considers only a single value of a statistic.
● Because point estimation is based on a single random sample, its value will vary when different random samples are drawn from the same population.
● Few of the standard Point estimation methods are
Maximum Likelihood Estimator
Minimum Variance Unbiased Estimator
Minimum mean squared error
Best Linear Unbiased Estimator
Interval Estimate
It considers two values between which the population parameter is likely to lie.
The two values define the lower and upper bounds of a confidence interval.
Measuring Data Similarity and Dissimilarity
● Similarity measure is a way of measuring how data samples are related or
close to each other.
● Dissimilarity measure is to tell how much the data objects are distinct.
● Similarity measures are expressed as numerical values.
● The value gets higher when the data samples are more alike (zero means low similarity and one means very similar).
● Data structures
The data matrix
The dissimilarity matrix
● Object dissimilarity can be computed for objects described by nominal
attributes, binary attributes, numerical attributes, ordinal attributes.
Proximity measures for Nominal Attributes
● For objects described by nominal attributes, dissimilarity can be computed as d(i, j) = (p − m) / p, where p is the total number of attributes and m is the number of attributes on which the two objects match.
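A minimal sketch of this matching-based dissimilarity (the attribute values are made up for illustration):

import numpy as np

def nominal_dissimilarity(a, b):
    # d = (p - m) / p, with p attributes in total and m matches
    a, b = np.asarray(a), np.asarray(b)
    return (len(a) - np.sum(a == b)) / len(a)

print(nominal_dissimilarity(["red", "S", "cotton"], ["red", "M", "wool"]))  # ≈ 0.67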
Proximity measures for Numeric data
● Minkowski distance:

d(X, Y) = ( |X1 − Y1|^p + |X2 − Y2|^p + … + |Xn − Yn|^p )^(1/p)

● p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.
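A minimal sketch using SciPy's minkowski distance (illustrative points):

from scipy.spatial.distance import minkowski

x = [1, 2, 3]
y = [4, 0, 3]
print(minkowski(x, y, p=1))  # Manhattan distance: 5.0
print(minkowski(x, y, p=2))  # Euclidean distance: ≈ 3.61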
SET A
1. Write a Python program to find the maximum and minimum value of a given
flattened array.
import numpy as np
ar=np.array([[0,1],[2,3]])
print("Original Flattened Array")
print(ar)
print("-----------------")
print("Maximum value of the above flattened array:")
print(np.amax(ar))
print("Minimum value of the above flattened array:")
print(np.amin(ar))
2. Write a Python program to compute the Euclidean distance between two data points in a dataset. [Hint: use the linalg.norm function from NumPy]
import numpy as np
point1 = np.array((1, 2, 3))
point2 = np.array((1, 1, 1))
# calculating Euclidean distance
# using linalg.norm()
dist = np.linalg.norm(point1 - point2)
# printing Euclidean distance
print(dist)
3. Create one dataframe of data values. Find out the mean, range, and IQR for this data.
import pandas as pd
df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
columns=["Apple", "Orange", "Banana", "Pear"],
index=["Basket1", "Basket2", "Basket3", "Basket4",
"Basket5", "Basket6"])
print("\n----------- Calculate Mean -----------\n")
print(df.mean())
print("-----Maximum Value-------")
a=df.max()
print(a)
print("-----Minimum Value-------")
b=df.min()
print(b)
r=a-b
print("-------Range-------")
print(r)
print("-------IQR-------")
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
print(q3 - q1)
4. Write a Python program to find the sum of Manhattan distances between all pairs of given points. Return the sum of distances over all pairs.
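One possible solution (a minimal sketch; the exercise itself does not provide code):

import numpy as np

def sum_manhattan(points):
    # Sum |xi - xj| + |yi - yj| over all unordered pairs (i, j)
    pts = np.asarray(points)
    total = 0
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            total += np.sum(np.abs(pts[i] - pts[j]))
    return total

print(sum_manhattan([(1, 2), (4, 6), (7, 1)]))  # 22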
5. Write a NumPy program to compute the histogram of nums against the
bins.
import numpy as np
import matplotlib.pyplot as plt
nums = np.array([0.5, 0.7, 1.0, 1.2, 1.3, 2.1])
bins = np.array([0, 1, 2, 3])
print("nums: ",nums)
print("bins: ",bins)
print("Result:", np.histogram(nums, bins))
plt.hist(nums, bins=bins)
plt.show()
6. Create a dataframe for students' information such as name, graduation percentage, and age.
#Display the average age of students and the average graduation percentage.
#Also, describe all basic statistics of the data. (Hint: use describe().)
import pandas as pd
import numpy as np
stud_data = {"name": ["Akanksha", "Diya", "Komal", "James", "Emily", "Jonas"],
             "grade": [78, 69, 65, 90, 45, 89],
             "age": [21, 23, 22, 19, 20, 18]}
df = pd.DataFrame(stud_data)
print(df)
print("------average of graduation percentage-------")
mean_grade = df["grade"].mean()
print(mean_grade)
print("------average of graduation age-------")
mean_age = df["age"].mean()
print(mean_age)
print("------Describe basic statistics of data-------")
print(df.describe())
Concept of outlier
● An outlier is an observation that lies an
abnormal distance from other values in a
random sample from a population.
● Outlier detection is the process of finding
data objects with behaviors that are
different from the expectation
● They can be caused by measurement or
execution errors.
● Examples: (1) weight vs. height; (2) fraudulent transactions in credit card data
Examples:
● As a real-world example, the average height of a giraffe is about 16 feet. However, there have been recent discoveries of two giraffes that stand at 9 feet and 8.5 feet, respectively. These two giraffes would be considered outliers compared to the general giraffe population.
● During data analysis, outliers can cause anomalies in the results obtained. This means that they require some special attention and, in some cases, will need to be removed to analyze the data effectively.
Here are two main reasons why giving outliers special attention is a necessary aspect of the data analytics process:
● Example: With small datasets it can be easy to spot outliers manually (for example, in the data set 28, 26, 21, 24, 78 you can see that 78 is the outlier), but when it comes to large datasets or big data, other tools are required.
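For larger data, the outlier in the small example above can also be flagged programmatically. A minimal sketch using the common 1.5 × IQR rule (this particular rule is an assumption; the slides do not specify a method here):

import numpy as np

data = np.array([28, 26, 21, 24, 78])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
# flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(outliers)  # [78]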
There are eight main causes of outliers.
Global Outlier
● A global outlier (also called a point anomaly) is a data object that deviates significantly from the rest of the entire data set.
Contextual Outlier
● Contextual outliers are also known as Conditional
outliers. These types of outliers happen if a data
object deviates from the other data points because of
any specific condition in a given data set.
● As we know, there are two types of attributes of
objects of data: contextual attributes and behavioral
attributes.
● Contextual outlier analysis enables the users to
examine outliers in different contexts and conditions,
which can be useful in various applications.
● For example, a temperature reading of 45 degrees Celsius may behave as an outlier in a rainy season, yet it will behave like a normal data point in the context of a summer season. In the given diagram, a green dot representing the low-temperature value in June is a contextual outlier, since the same value in December is not an outlier.
Collective Outlier
● Collective outliers are groups of data points that collectively deviate significantly from the overall distribution of a dataset.
● Collective outliers may not be outliers when considered individually, but as a group they exhibit unusual behavior.
● Detecting and interpreting collective outliers can be more complex than individual outliers, as the focus is on group behavior rather than individual data points.
Outlier Detection Methods
● Supervised
● Semi Supervised
● Unsupervised
Supervised methods
● Supervised methods model outlier detection as a classification problem: domain experts examine and label a sample of the data as "normal" or "outlier," and a classifier is then trained to recognize outliers.
Unsupervised methods
● In various applications, objects labeled as "normal" or "outlier" are not available.
● Therefore, an unsupervised learning approach has to be used.
● Unsupervised outlier detection methods make an implicit assumption, such as that the normal objects are considerably "clustered."
● An unsupervised outlier detection method expects that normal objects follow a pattern far more frequently than outliers do.
● Normal objects do not have to fall into one group sharing high similarity. Instead, they can form several groups, where each group has distinct features.
Semi-Supervised Methods
● In several applications, although obtaining some labeled instances is possible, the number of such labeled instances is small.
● There are cases where only a small set of the normal and outlier objects is labeled, while most of the data are unlabeled.
● Semi-supervised outlier detection methods were developed to tackle such scenarios.
● Semi-supervised outlier detection methods can be regarded as applications of semi-supervised learning approaches. For example, when some labeled normal objects are available, they can be used together with unlabeled objects that are close by to train a model for normal objects. Objects that do not fit the model of normal objects are then flagged as outliers.
Statistical Method
● Statistical methods assume that the normal data follow a statistical (stochastic) model; objects that fall in low-probability regions of this model are treated as outliers.
Proximity Methods
● They assume that an object is an outlier if its nearest neighbors are far away in feature space, i.e., the proximity of the object to its neighbors deviates significantly from the proximity of most of the other objects to their neighbors in the same data set.
● Proximity-based methods are classified into two types: distance-based methods judge a data point based on the distance(s) to its neighbors, while density-based methods determine the degree of outlierness of each data instance based on its local density.