DS Chapter - 2
Introduction
Steps for processing data
Roles of statistics in Data Science
Data Exploration
Data Cleaning
Data Transformation
Data Visualization
Finding Similarity/Dissimilarity
Model Selection and Evaluation
Hypothesis Testing
Statistical Modeling
Probability Distribution and Estimation
Types of Statistics
Descriptive Statistics
● Provides ways for describing, presenting, summarizing, and organizing the data
Types of Descriptive Statistics
Descriptive Statistics
● Measures of Frequency
● Measures of Central Tendency: Mean, Mode, Median
● Measures of Dispersion: Range, Interquartile Range, Standard Deviation
Measures of Frequency
● Frequency describes how often each value occurs in a data set. For example, in 22, 13, 4, 6, 13, 11, 10 the value 13 occurs twice and every other value occurs once.
● A frequency table for another sample:

Data Value   Frequency
2            3
3            5
4            3
5            6
6            2
7            1
Measures of Central Tendency
Mean
● The most common and effective numeric measure of the center of a set of data.
● It is the sum of all the observations divided by the sample size.
● The types of mean are:
Arithmetic Mean
Harmonic Mean
Geometric Mean
Arithmetic Mean
● It is obtained by adding all the values and then dividing the sum by the total number of values.
● Let x1, x2, x3, …, xN be a set of N values or observations. The arithmetic mean of this set of values is:

x̄ = (x1 + x2 + … + xN) / N
● Suppose the marks obtained by 10 students in a quiz are 8, 3, 7, 6, 9, 10, 5, 7, 8, 5
● We can calculate
(8 + 3 + 7 + 6 + 9 + 10 + 5 + 7 + 8 + 5) / 10 = 6.8
The arithmetic mean can be calculated by using the mean() function from the NumPy library
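For example, a minimal sketch of the NumPy call mentioned above:

import numpy as np

marks = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]
# np.mean() returns the arithmetic mean
print(np.mean(marks))  # 6.8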
Harmonic Mean
● The harmonic mean is the reciprocal of the average of the reciprocals of the terms in a series. The formula to determine the harmonic mean is n / [1/x1 + 1/x2 + 1/x3 + ... + 1/xn].
● Example: x = (6, 3, 1, 5, 2)
● HM = 5 / (1/6 + 1/3 + 1/1 + 1/5 + 1/2) = 5 / 2.2 ≈ 2.27
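A minimal sketch using the harmonic_mean() function from Python's statistics module (listed later in these slides):

from statistics import harmonic_mean

x = [6, 3, 1, 5, 2]
print(harmonic_mean(x))  # ≈ 2.27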
Geometric Mean
● The geometric mean of n positive values is the nth root of their product: GM = (x1 · x2 · … · xn)^(1/n)
Median
● The median is the middle value when the data are arranged in order; for an even number of observations it is the average of the two middle values.
Mode
● The mode is the value that occurs most frequently in the data set.
● Advantages
○ Can be used for categorical values
○ Can be determined for both qualitative and quantitative values
○ Not affected by extreme values
● Disadvantages
○ Not based on all values
○ The mode cannot be clearly defined in the case of a multimodal series
○ Not suitable for further statistical analysis and algebraic calculation
Measures of Dispersion
● Dispersion is the extent to which values in a distribution differ from the average of the distribution.
● Measures of central tendency alone are not sufficient to describe the data.
● Measures of dispersion help us to know the degree of variability in the data and provide a better understanding of the data.
● Measures of dispersion assess the spread of numeric data.
● The measures are:
o Range
o Quantiles
o Quartiles
o Percentiles
o Interquartile range
Range
● The range is the difference between the largest and smallest values in the data set: Range = max − min
Standard Deviation
● It is a measure of how much the data values deviate from the mean value
● σ = √( ∑(x − x̄)² / n )
Variance
● Variance is the average of the squared deviations from the mean: σ² = ∑(x − x̄)² / n
● The standard deviation is the square root of the variance.
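A minimal sketch computing standard deviation and variance with NumPy (np.std and np.var use the population formulas by default, i.e. ddof=0):

import numpy as np

data = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]
print(np.std(data))  # standard deviation
print(np.var(data))  # variance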
Interquartile Range
● The interquartile range is a measure of variation that describes how spread out the data is.
● It is a measure of variability based on splitting the data into quartiles.
● The interquartile range is the difference between the first and third quartiles: IQR = Q3 − Q1.
● Quartiles divide the range of data into four equal parts, demarcated by the three quartiles Q1, Q2, and Q3.
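A minimal sketch using NumPy's percentile function (the 25th and 75th percentiles correspond to Q1 and Q3):

import numpy as np

data = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]
q1, q3 = np.percentile(data, [25, 75])
print(q3 - q1)  # interquartile range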
Python statistical functions:

Method                       Description
statistics.harmonic_mean()   Calculates the harmonic mean (central location) of the given data
statistics.mode()            Calculates the mode (central tendency) of the given numeric or nominal data
Hypothesis Testing
● A sample is considered a representative of the entire universe or population.
● One of the most promising techniques used in data analysis to check whether a stated hypothesis is accepted or rejected (this process is called hypothesis testing).
● Hypothesis testing is mainly used to determine whether there is sufficient evidence in a data sample to conclude that a particular condition holds for an entire population.
● There are two hypotheses:
○ Null Hypothesis
○ Alternative Hypothesis
● The null hypothesis states that there is no difference between groups or no relationship between variables (e.g., private coaching has no effect on students' results).
● The alternative hypothesis states that there is a relationship between the two variables being studied (one variable has an effect on the other).
Steps for Hypothesis Testing
Four basic steps are to be followed for hypothesis testing:
Step 1: State the null and alternative hypothesis
Step 2: Select the appropriate significance level and check the specified test
assumption
Step 3: Analyze the data by computing appropriate statistical tests
Step 4: Interpret the result.
Two conclusions that can be inferred:
1. Reject the null hypothesis when there is enough evidence to support the alternative hypothesis.
2. Fail to reject the null hypothesis when there is not enough evidence to support the alternative hypothesis.
Example of Hypothesis
● For example, suppose a biologist believes that a certain fertilizer will cause
plants to grow more during a one-month period than they normally do, which
is currently 20 inches. To test this, she applies the fertilizer to each of the
plants in her laboratory for one month.
● She then performs a hypothesis test using the following hypotheses:
● H0: μ = 20 inches (the fertilizer will have no effect on the mean plant growth)
● HA: μ > 20 inches (the fertilizer will cause mean plant growth to increase)
● For example, suppose a doctor believes that a new drug is able to reduce
blood pressure in obese patients. To test this, he may measure the blood
pressure of 40 patients before and after using the new drug for one month.
● He then performs a hypothesis test using the following hypotheses:
● H0: μafter = μbefore (the mean blood pressure is the same before and after using
the drug)
● HA: μafter < μbefore (the mean blood pressure is less after using the drug)
Parametric and Non-parametric Hypothesis Tests
In a parametric test, information about the population is completely known and can be used for statistical inference. Choosing the type of parametric test to apply is a decision-making task.
Steps for a parametric test:
Step 1: State the null and alternative hypotheses
Step 2: Choose the level of significance
Step 3: Identify the type of parametric test to be conducted
Step 4: Find the critical value to decide the acceptance/rejection regions
Step 5: Take the sample and compute the test statistic
Step 6: Compare the obtained value with the critical value to decide whether the null hypothesis is accepted or rejected
Core Terms Related to Parametric Tests
The null hypothesis and the alternative hypothesis are mutually exclusive.
1. Acceptance and critical regions:
All sets of possible values can be divided into two mutually exclusive groups:
● Acceptance Region: Values that appear consistent with the null hypothesis.
● Rejection Region: Consists of values unlikely to occur if the null hypothesis is true.
The value(s) that separate the critical region from the acceptance region are called critical values.
One-tailed Test and Two-tailed Test
If the hypothesis is stated with an equals sign only, it is a two-tailed test.
If the hypothesis is stated with a greater-than or less-than sign, it is a one-tailed test.
Case 1: A government school states that the dropout rate of female students between the ages of 12 and 18 years is 28%. (Fig. 3: two-tailed test)
Significance Level
It is denoted by α.
For example, a significance level of 0.03 indicates that a 3% risk is being taken of concluding that a difference exists when there is actually no difference.
Calculated probability (p-value)
It is the calculated probability that, when the null hypothesis is true, the statistical summary will be greater than or equal to the actual observed results.
It is the probability of finding the observed, or more extreme, results when the null hypothesis is true.
Some of the widely used hypothesis testing types are:
Z-test
T-test
Chi-Square
Types of Hypothesis Testing
Z-Test
● This test is used for comparing the mean of a sample to some hypothesized mean
of a given population.
● The method for carrying out the z-test for one sample is:

z = (X̄ − µH0) / (σp / √n)

where µH0 = hypothesized population mean, σp = population standard deviation, n = sample size
Example
● Consider a sample of 500 female students with a mean height of 5.4 feet. The task is to find whether it can reasonably be regarded as a sample from a large population with a mean height of 5.6 feet and a standard deviation of 1.45 feet. Let us consider a 5% level of significance to solve the problem.
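A worked computation for this example, sketched in Python (the 1.96 cutoff assumes the usual two-tailed critical value at the 5% level):

import math

z = (5.4 - 5.6) / (1.45 / math.sqrt(500))
print(z)  # ≈ -3.08; since |z| = 3.08 > 1.96, reject H0 at the 5% level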
T-test
● The one-sample t-test is mainly used for determining whether the mean of a sample is statistically different from a known or hypothesized mean of a given population.
● The test variable needs to be continuous.

t = (X̄ − µH0) / (σs / √n)

where σs = sample standard deviation
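A minimal sketch of a one-sample t-test using SciPy (the sample values here are made up for illustration):

from scipy import stats

sample = [5.1, 5.3, 5.6, 5.4, 5.2, 5.5]
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.6)
print(t_stat, p_value)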
Chi Square test
● The chi-square test compares observed frequencies with expected frequencies (e.g., to test the independence of two categorical variables): χ² = ∑ (O − E)² / E
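A minimal sketch of a chi-square goodness-of-fit test using SciPy (illustrative counts; the observed and expected totals must match):

from scipy import stats

observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]
chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(chi2, p_value)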
Z-TEST
import math
import numpy as np
from numpy.random import randn
from statsmodels.stats.weightstats import ztest

# Random array of 50 numbers centered at 110; the scale 15/sqrt(50)
# is the standard error of the mean for sd=15 and n=50
mean_1 = 110
sd_1 = 15 / math.sqrt(50)
null_mean = 100
data = sd_1 * randn(50) + mean_1

# print mean and sd
print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))
# ztest returns the test statistic and the p-value
t_stat, p_value = ztest(data, value=null_mean, alternative='larger')
if p_value < 0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")
ANOVA
● ANOVA (Analysis of Variance) is used to compare the means of more than two groups by analyzing the variation between groups and within groups.
Two-sample parametric tests
Paired Sample t-test
● Carried out to compare two population means for two samples in which each observation in one sample can be paired with an observation in the other.
● This test is usually used in the case of before-and-after observations on the same subjects.
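A minimal sketch of a paired-sample t-test using SciPy (illustrative before/after values):

from scipy import stats

before = [140, 150, 138, 145, 155]
after = [132, 145, 135, 140, 148]
t_stat, p_value = stats.ttest_rel(before, after)
print(t_stat, p_value)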
Non-parametric Hypothesis Tests
● Information about the population is unknown, so no assumptions can be made regarding the population.
● It is more suitable for data that can be represented on qualitative scales (nominal or ordinal).
● These tests cover techniques that do not rely on the data belonging to any particular distribution.
● The distribution of data can be skewed, and the population variance can be non-homogeneous.
● One sample non-parametric test
One factor Chi-Square
Binomial
Wilcoxon Signed Rank Test
● Two Independent Sample
Mann-Whitney Test
Kolmogorov–Smirnov Test
● Two Paired Samples
Sign
Chi-Square
Wilcoxon Signed rank
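As one example from the list above, a minimal sketch of the Mann-Whitney test using SciPy (illustrative data):

from scipy import stats

group_a = [3, 4, 2, 6, 2, 5]
group_b = [9, 7, 5, 10, 8, 6]
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')
print(u_stat, p_value)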
Estimation of Parameter values
● In statistics, estimation (or inference) refers to the task of drawing conclusions about a population based on the information a sample provides about that population.
● This can be done in two ways:
Point estimate
Interval estimate
● Point estimation considers only a single value of a statistic.
● Because point estimation is based on a single random sample, its value will vary when different random samples are drawn from the same population.
● Few of the standard Point estimation methods are
Maximum Likelihood Estimator
Minimum Variance Unbiased Estimator
Minimum mean squared error
Best Linear Unbiased Estimator
Interval Estimate
It considers two values between which the population parameter is likely to lie.
The two values define the lower and upper bounds of a confidence interval.
Measuring Data Similarity and Dissimilarity
● Similarity measure is a way of measuring how data samples are related or
close to each other.
● Dissimilarity measure is to tell how much the data objects are distinct.
● Similarity measures are expressed as numerical values.
● The value gets higher when the data samples are more alike (zero means low similarity and one means very similar).
● Data structures
The data matrix
The dissimilarity matrix
● Object dissimilarity can be computed for objects described by nominal
attributes, binary attributes, numerical attributes, ordinal attributes.
Proximity measures for Nominal Attributes
● For objects described by nominal attributes, dissimilarity can be computed as d(i, j) = (p − m) / p, where p is the total number of attributes and m is the number of attributes on which the two objects match.
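A minimal sketch of this matching-based dissimilarity (the attribute values are made up for illustration):

import numpy as np

def nominal_dissimilarity(a, b):
    # d = (p - m) / p, with p attributes in total and m matches
    a, b = np.asarray(a), np.asarray(b)
    return (len(a) - np.sum(a == b)) / len(a)

print(nominal_dissimilarity(["red", "S", "cotton"], ["red", "M", "wool"]))  # ≈ 0.67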
Proximity measures for Numeric data
● Minkowski distance:

d(X, Y) = ( |X1 − Y1|^p + |X2 − Y2|^p + … + |Xn − Yn|^p )^(1/p)

● p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.
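A minimal sketch using SciPy's minkowski distance (illustrative points):

from scipy.spatial.distance import minkowski

x = [1, 2, 3]
y = [4, 0, 3]
print(minkowski(x, y, p=1))  # Manhattan distance: 5.0
print(minkowski(x, y, p=2))  # Euclidean distance: ≈ 3.61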
SET A
1. Write a Python program to find the maximum and minimum value of a given
flattened array.
import numpy as np
ar=np.array([[0,1],[2,3]])
print("Original Flattened Array")
print(ar)
print("-----------------")
print("Maximum value of the above flattened array:")
print(np.amax(ar))
print("Minimum value of the above flattened array:")
print(np.amin(ar))
2. Write a Python program to compute the Euclidean distance between two data points in a dataset. [Hint: use the linalg.norm function from NumPy]
import numpy as np
point1 = np.array((1, 2, 3))
point2 = np.array((1, 1, 1))
# calculating Euclidean distance
# using linalg.norm()
dist = np.linalg.norm(point1 - point2)
# printing Euclidean distance
print(dist)
3. Create one dataframe of data values. Find out the mean, range, and IQR for this data.
import pandas as pd
df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
columns=["Apple", "Orange", "Banana", "Pear"],
index=["Basket1", "Basket2", "Basket3", "Basket4",
"Basket5", "Basket6"])
print("\n----------- Calculate Mean -----------\n")
print(df.mean())
print("-----Maximum Value-------")
a=df.max()
print(a)
print("-----Minimum Value-------")
b=df.min()
print(b)
r=a-b
print("-------Range-------")
print(r)
print("-------IQR-------")
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
print(q3 - q1)
4. Write a Python program to find the sum of Manhattan distances between all pairs of given points. Return the sum of distances over all pairs.
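One possible solution (a minimal sketch; the exercise itself does not provide code):

import numpy as np

def sum_manhattan(points):
    # Sum |xi - xj| + |yi - yj| over all unordered pairs (i, j)
    pts = np.asarray(points)
    total = 0
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            total += np.sum(np.abs(pts[i] - pts[j]))
    return total

print(sum_manhattan([(1, 2), (4, 6), (7, 1)]))  # 22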
5. Write a NumPy program to compute the histogram of nums against the
bins.
import numpy as np
import matplotlib.pyplot as plt
nums = np.array([0.5, 0.7, 1.0, 1.2, 1.3, 2.1])
bins = np.array([0, 1, 2, 3])
print("nums: ",nums)
print("bins: ",bins)
print("Result:", np.histogram(nums, bins))
plt.hist(nums, bins=bins)
plt.show()
6. Create a dataframe for students' information such as name, graduation percentage, and age.
#Display the average age of students and the average graduation percentage.
#Also, describe all basic statistics of the data. (Hint: use describe().)
import pandas as pd
import numpy as np
stud_data = {"name": ["Akanksha", "Diya", "Komal", "James", "Emily", "Jonas"],
             "grade": [78, 69, 65, 90, 45, 89],
             "age": [21, 23, 22, 19, 20, 18]}
df = pd.DataFrame(stud_data)
print(df)
print("------average of graduation percentage-------")
mean_grade = df["grade"].mean()
print(mean_grade)
print("------average of graduation age-------")
mean_age = df["age"].mean()
print(mean_age)
print("------Describe basic statistics of data-------")
print(df.describe())
Concept of outlier
● An outlier is an observation that lies an
abnormal distance from other values in a
random sample from a population.
● Outlier detection is the process of finding
data objects with behaviors that are
different from the expectation
● They can be caused by measurement or
execution errors.
● Examples: (1) weight vs. height; (2) fraudulent transactions in credit card data
Examples:
● As a real-world example, the average height of a giraffe is about 16 feet. However, there have been recent discoveries of two giraffes that stand at 9 feet and 8.5 feet, respectively. These two giraffes would be considered outliers compared to the general giraffe population.
● During data analysis, outliers can cause anomalies in the results obtained. This means that they require some special attention and, in some cases, will need to be removed to analyze the data effectively.
Here are two main reasons why giving outliers special attention is a necessary aspect of the data analytics process:
● Example: With small datasets it can be easy to spot outliers manually (for example, in the data set 28, 26, 21, 24, 78 you can see that 78 is the outlier), but when it comes to large datasets or big data, other tools are required.
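For larger data, the outlier in the small example above can also be flagged programmatically. A minimal sketch using the common 1.5 × IQR rule (this particular rule is an assumption; the slides do not specify a method here):

import numpy as np

data = np.array([28, 26, 21, 24, 78])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
# flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(outliers)  # [78]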
There are eight main causes of outliers.
Global Outlier
● A global outlier (also called a point anomaly) is a data object that deviates significantly from the rest of the entire data set.
Contextual Outlier
● Contextual outliers are also known as Conditional
outliers. These types of outliers happen if a data
object deviates from the other data points because of
any specific condition in a given data set.
● As we know, there are two types of attributes of
objects of data: contextual attributes and behavioral
attributes.
● Contextual outlier analysis enables the users to
examine outliers in different contexts and conditions,
which can be useful in various applications.
● For example, a temperature reading of 45 degrees Celsius may behave as an outlier in a rainy season, yet it will behave like a normal data point in the context of a summer season. In the given diagram, a green dot representing the low-temperature value in June is a contextual outlier, since the same value in December is not an outlier.
Collective Outlier
● Collective outliers are groups of data points that collectively deviate significantly from the overall distribution of a dataset.
● Collective outliers may not be outliers when considered individually, but as a group they exhibit unusual behavior.
● Detecting and interpreting collective outliers can be more complex than individual outliers, as the focus is on group behavior rather than individual data points.
Outlier Detection Methods
● Supervised
● Semi Supervised
● Unsupervised
Supervised methods
● Supervised methods model outlier detection as a classification problem: domain experts examine and label a sample of the data as "normal" or "outlier," and a classifier is then trained to recognize outliers.
Unsupervised methods
● In various applications, objects labeled as "normal" or "outlier" are not available.
● Therefore, an unsupervised learning approach has to be used.
● Unsupervised outlier detection methods make an implicit assumption, such as that the normal objects are considerably "clustered."
● An unsupervised outlier detection method expects that normal objects follow a pattern far more frequently than outliers do.
● Normal objects do not have to fall into one group sharing high similarity. Instead, they can form several groups, where each group has distinct features.
Semi-Supervised Methods
● In several applications, although obtaining some labeled instances is possible, the number of such labeled instances is small.
● There are cases where only a small set of the normal and outlier objects is labeled, while most of the data are unlabeled.
● Semi-supervised outlier detection methods were developed to tackle such scenarios.
● Semi-supervised outlier detection methods can be regarded as applications of semi-supervised learning approaches. For example, when some labeled normal objects are available, they can be used together with unlabeled objects that are close by to train a model for normal objects. Objects that do not fit the model of normal objects are then flagged as outliers.
Statistical Method
● Statistical methods assume that the normal data follow a statistical (stochastic) model; objects that fall in low-probability regions of this model are treated as outliers.
Proximity Methods
● They assume that an object is an outlier if its nearest neighbors are far away in feature space, i.e., the proximity of the object to its neighbors deviates significantly from the proximity of most of the other objects to their neighbors in the same data set.
● Proximity-based methods are classified into two types: distance-based methods judge a data point based on the distance(s) to its neighbors, while density-based methods determine the degree of outlierness of each data instance based on its local density.