Statistics For Data Science

This document provides an overview of key concepts in statistics for data science. It defines statistics as the science of collecting, analyzing, and interpreting quantitative data. There are two main types of data: quantitative data that can be measured and qualitative data that describes characteristics. Descriptive statistics summarize and describe data through measures of central tendency like mean, median, and mode, and measures of variability like range and standard deviation. Inferential statistics are used to draw conclusions about populations based on samples through hypothesis testing, which involves stating hypotheses and using statistical tests to reject or fail to reject the null hypothesis.

Uploaded by

Dr. Sanjay Gupta

Statistics for Data Science

Objectives
At the end of this chapter, students will be able to:

• Describe how statistics play a major role in data science.

• Understand the different types of Statistical data.
• Understand the different types of Statistical variables.
What is Statistics?
• Statistics – the science of collecting, analyzing, and interpreting
data in such a way that the conclusions can be objectively
evaluated.
• It also helps us make informed decisions based on the
evidence.
• There are 3 phases:
1. Collecting data
2. Analyzing data
3. Interpreting data
Types of Data
There are two main types of data:
• Quantitative data: This is numerical data that can be measured
and analyzed mathematically.
  • Examples: height, weight, age, and income.
  • Two categories:
    • Discrete data: quantitative data that are counted (1, 2, 3, 4, …).
    • Continuous data: quantitative data that are measured (0.5, 1.678, …).
• Qualitative data: This is non-numerical data that describes
qualities or characteristics.
  • Examples: gender, occupation, and favorite color.
Types of Statistics
• Descriptive Statistics – summarize and describe a characteristic
of a group. They can be used to identify patterns, outliers, and
trends in data.
• Example: batting average for a player, average marks of a class
• Inferential Statistics – used to estimate, infer, or conclude
something about a larger group
• Example: polls
• Sample – subset of the group of data available for
analysis
What is Statistics?
• Population – refers to the entire set.
• Sample - refers to a subset of the population that is selected
for analysis
• Bias – favoring of certain outcomes over others
• Census – collects data from all members of the population
• Parameter – characteristic value of a population
• Statistic – characteristic value of a sample
Organizing Data – Frequency Table
• Frequency is the number of times that a particular result
occurs.
• Ways to organize data:
Simple frequency table

Note: There are many other types of frequency tables depending on information you want
to record.
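A simple frequency table can be sketched in Python as follows. The list `results` holds made-up survey answers (an assumption for illustration, not data from this chapter), and `collections.Counter` tallies how many times each result occurs.

```python
from collections import Counter

# Hypothetical results: favorite colors reported by ten people
results = ["red", "blue", "red", "green", "blue",
           "red", "blue", "red", "green", "red"]

# Frequency = number of times each particular result occurs
frequency = Counter(results)

# Print a simple two-column frequency table, most frequent result first
print(f"{'Result':<8}{'Frequency':>10}")
for value, count in frequency.most_common():
    print(f"{value:<8}{count:>10}")
```

The frequencies always add up to the number of observations, which is a quick check that no result was missed.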
Displaying Data
• Ways to display data:
• Frequency histogram
• Relative frequency Histogram
• Multiple bar graph
• Stacked bar graph
• Line graph
• Pie chart
Descriptive Statistics - Measures of Central Tendency
Central Tendency – the propensity of data to be located or
clustered about some point.

Usually the mean, median, mode, etc. are used to measure the
central tendency.
Descriptive Statistics
Mean
The mean is the same as the average. To find the mean, add all the values and divide by
the total number of values.
Example: for the data {2, 3, 5, 6}, the mean is
$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{2+3+5+6}{4} = 4$
The letter x with a bar over it (x̄) represents the sample mean; µ represents the population mean.
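As a minimal sketch, the same calculation in Python, using the data set from the example. The variable names are illustrative only.

```python
from statistics import mean

data = [2, 3, 5, 6]

# Add all the values and divide by the number of values
manual_mean = sum(data) / len(data)

# The standard library gives the same result
library_mean = mean(data)

print(manual_mean, library_mean)
```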
Mode
The mode is the most frequent value in the set of numbers.
Example: In the data set 52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95, the
most frequent value is 78. The mode = 78.
Example: In the data set 52, 53, 53, 53, 60, 67, 72, 72, 72, 90, both 53 and 72 occur
most often (3 times each), so there are two modes, 53 and 72. We call this set
of data bimodal, meaning it has two modes.
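Both examples can be checked with Python's `statistics` module. This is a sketch that assumes Python 3.8 or later, where `multimode` is available; it returns every value that ties for most frequent.

```python
from statistics import mode, multimode

scores = [52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95]
bimodal_scores = [52, 53, 53, 53, 60, 67, 72, 72, 72, 90]

# One value is most frequent, so there is a single mode
single_mode = mode(scores)

# Two values tie for most frequent, so multimode returns both
both_modes = multimode(bimodal_scores)

print(single_mode)
print(both_modes)
```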
Descriptive Statistics
Median
The median is the middle value of a set of numbers that has been ordered from smallest
to largest. The upper case letter M is used for the median.
Example: A sample of statistics exam scores for 14 students is (in order from smallest to
largest) as follows: 53, 59, 63, 63, 72, 72, 76, 78, 81, 83, 84, 84, 90, 93
Notice that 14 is an even number. The median lies between the 7th and 8th values (the middle
two values), so it is their average:
$M = \frac{76+78}{2} = 77$

Example: A second sample of statistics exam scores for 15 students is (in order from
smallest to largest) as follows: 52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95
Notice that 15 is an odd number. The median is the 8th value (the middle value). The 8th
value is 76, so the median M = 76.
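The even and odd cases can be sketched in Python with `statistics.median`, which averages the two middle values when the count is even and returns the middle value when it is odd. The variable names are illustrative.

```python
from statistics import median

even_scores = [53, 59, 63, 63, 72, 72, 76, 78, 81, 83, 84, 84, 90, 93]  # 14 values
odd_scores = [52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95]  # 15 values

# Even count: average of the 7th and 8th values
median_even = median(even_scores)

# Odd count: the 8th (middle) value
median_odd = median(odd_scores)

print(median_even, median_odd)
```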
Descriptive Statistics - Measures of Variability
Measures of variability describe how the data is spread out. The
most commonly used measures of variability are:
• Range: This is the difference between the largest and smallest
values in a dataset.
• Variance: This is the average of the squared differences from
the mean. It measures how much the data deviates from the
mean.
• Standard Deviation: This is the square root of the variance. It
is a measure of how spread out the data is from the mean.
Variance
A deviation is the difference between a value and the mean and is written as: x - µ
The variance is the average of the squares of the deviations.
Example: {2, 3, 5, 6} is a set of data. The mean is 4. The deviations are:
• 2 - 4 = -2
• 3 - 4 = -1
• 5 - 4 = 1
• 6 - 4 = 2
The deviations squared are:
• (-2)² = 4
• (-1)² = 1
• (1)² = 1
• (2)² = 4
• The average of the squared deviations is
$\text{variance} = \sigma^2 = \frac{4+1+1+4}{4} = 2.5$
Note: Dividing by n, as here, gives the population variance. When the data are a sample, the sample variance divides by n - 1 instead, which gives 10/3 ≈ 3.33 for this example.
Standard Deviation
The standard deviation is a special average of the deviations. It measures how the data is
spread out from its mean.

It is calculated as the square root of the variance of the data. The formula for calculating
standard deviation is:

Standard deviation = sqrt(Variance)

In our example, standard deviation = σ = sqrt(2.5) ≈ 1.58
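The variance and standard deviation of the example data can be reproduced in Python. This sketch follows the worked steps (deviations, squares, average, square root) and then compares them with the `statistics` module. It assumes the data are treated as a whole population (division by n), which is what the worked example does; `statistics.variance` is shown only for contrast, since it divides by n - 1.

```python
from math import sqrt
from statistics import pstdev, pvariance, variance

data = [2, 3, 5, 6]
mu = sum(data) / len(data)                       # mean = 4.0

deviations = [x - mu for x in data]              # -2, -1, 1, 2
squared = [d ** 2 for d in deviations]           # 4, 1, 1, 4

population_variance = sum(squared) / len(data)   # average squared deviation
population_sd = sqrt(population_variance)        # square root of the variance

# Library equivalents: pvariance/pstdev divide by n, variance divides by n - 1
sample_variance = variance(data)

print(population_variance, round(population_sd, 2), round(sample_variance, 2))
```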

Descriptive Statistics- Derived Score
Percentile/Quantile –
• A percentile is a statistical measure that represents a point below which a given
percentage of observations in a dataset falls. In other words, if we say that the 25th
percentile of a dataset is 60, then it means that 25% of the values in the dataset are
below 60, and 75% of the values are above 60.
• Percentiles are often used in statistics to summarize the distribution of a dataset. For
example, the median (50th percentile) is a commonly used percentile that gives a
measure of the central tendency of the data. Other percentiles, such as the 25th and
75th percentiles, can give a measure of the spread of the data around the median.
• To calculate the percentile of a dataset, we first order the values from smallest to
largest, and then identify the value that corresponds to the desired percentile. If the
desired percentile falls between two values in the dataset, we can interpolate to
estimate the exact value.
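The procedure above (order the values, locate the position, interpolate if needed) can be sketched as follows. The helper `percentile` is hypothetical and uses linear interpolation between the two nearest ordered values; statistical packages offer several slightly different conventions, so other tools may return slightly different numbers.

```python
def percentile(values, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    # Fractional position of the percentile within the ordered data
    rank = (p / 100) * (len(ordered) - 1)
    lower = int(rank)                          # index of the value just below
    upper = min(lower + 1, len(ordered) - 1)   # index of the value just above
    fraction = rank - lower
    # Interpolate between the two neighbouring values
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


scores = [52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95]

q1 = percentile(scores, 25)
q2 = percentile(scores, 50)   # the median
q3 = percentile(scores, 75)
print(q1, q2, q3)
```

With this convention the 50th percentile of the 15 exam scores equals 76, the same value as the median.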
Descriptive Statistics- Derived Score
Z-Score
• A z-score (or standard score) is a statistical measure that represents the number of
standard deviations an observation or data point is away from the mean of its
distribution. In other words, a z-score indicates how far a given value is from the mean
in terms of the standard deviation of the data.
• The formula for calculating the z-score of a value x in a dataset with mean μ and
standard deviation σ is:
z = (x - μ) / σ
• A z-score of 0 indicates that the value is equal to the mean of the distribution, while a
z-score of +1 (or -1) indicates that the value is one standard deviation above (or below)
the mean. A z-score of +2 (or -2) indicates that the value is two standard deviations
above (or below) the mean, and so on.
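A minimal sketch of the z-score formula in Python, applied to the small data set {2, 3, 5, 6} with mean 4. It assumes the population standard deviation (division by n) is the intended σ.

```python
from statistics import mean, pstdev

data = [2, 3, 5, 6]
mu = mean(data)        # mean of the data
sigma = pstdev(data)   # population standard deviation, sqrt(2.5)

# z = (x - mu) / sigma for every value
z_scores = [(x - mu) / sigma for x in data]

for x, z in zip(data, z_scores):
    print(f"x = {x}, z = {z:+.2f}")
```

Values below the mean get negative z-scores and values above it get positive ones; the z-scores of a data set always average to zero.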
Descriptive Statistics- Derived Score
• Definition: Standardizing – converting data to z-scores.
• Some empirical rules for normally distributed data:
1. About 68% of data is within one σ of the mean.
2. About 95% of data is within two σ of the mean.
3. About 99.7% of data is within three σ of the mean.
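The empirical rules can be checked numerically. This sketch assumes Python 3.8 or later, where `statistics.NormalDist` is available, and computes the share of a standard normal distribution lying within one, two, and three standard deviations of the mean (roughly 68%, 95%, and 99.7%).

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

within = {}
for k in (1, 2, 3):
    # Probability of falling between -k and +k standard deviations
    within[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} sigma: {within[k]:.1%}")
```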
Descriptive Statistics - Regression & Correlation
• Linear Regression – modeling the data with the line that “best
fits” – usually a “least squares” line or regression line
• Least Squares Line – is the line that minimizes the sum of the
squared errors for a set of data points
• Correlation Coefficient r – is a measure of the strength of the
linear relationship between the 2 random variables x and y.
Note: The closer the correlation is to 1 or -1, the stronger the linear relationship between
the x and y variables.
• A correlation of zero means there is no evidence of a linear
pattern.
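To make the least squares line and the correlation coefficient concrete, here is a sketch that computes both from their standard formulas. The paired values `x` and `y` are hypothetical, chosen only for illustration.

```python
from math import sqrt

# Hypothetical paired observations (not from this chapter)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of products and squares of deviations from the means
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

# Least squares line: the slope and intercept that minimize the squared errors
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Correlation coefficient r: strength of the linear relationship
r = s_xy / sqrt(s_xx * s_yy)

print(f"least squares line: y = {intercept:.2f} + {slope:.2f}x")
print(f"correlation coefficient r = {r:.3f}")
```

For these made-up points the slope is 0.6, the intercept is 2.2, and r is about 0.77, which indicates a fairly strong positive linear relationship.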
Inferential Statistics
• Inferential statistics are used to draw conclusions about a
population by examining the sample
Inferential Statistics- Chain of Reasoning
• Are our inferences valid? The best we can do is to attach a
probability to our inferences.
Inferential Statistics-
• The accuracy of an inference depends on how representative the
sample is of the population.
• Random selection – every member has an equal chance of being selected,
which makes the sample more representative.
• Inferential statistics help researchers test hypotheses, answer
research questions, and derive meaning from the results.
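Random selection can be illustrated with a short simulation. Everything here is an assumption made for the sketch: the population is 10,000 simulated heights, the sample size is 100, and the seed is fixed only so the run is repeatable. `random.sample` draws without replacement, giving every member the same chance of selection.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated heights in centimetres
population = [random.gauss(170, 10) for _ in range(10_000)]

# Simple random sample: every member has an equal chance of being selected
sample = random.sample(population, 100)

parameter = mean(population)   # characteristic value of the population
statistic = mean(sample)       # characteristic value of the sample

print(f"population mean (parameter): {parameter:.1f}")
print(f"sample mean (statistic):     {statistic:.1f}")
```

The sample mean will usually land close to the population mean, but not exactly on it; that gap is why inferences are stated with probabilities.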
Inferential Statistics- Steps
1. State the hypotheses
2. Choose the level of significance
3. Compute the calculated value (test statistic)
4. Obtain the critical value
5. Reject or fail to reject H0.
Inferential Statistics-Hypothesis
• Hypothesis testing uses sample data to evaluate the credibility of a hypothesis
about a population
• Null hypothesis (H0) – states that there are no differences between the means
• Alternative hypothesis – predicts that there are differences
between the groups
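The five steps of a hypothesis test can be sketched with a one-sample, two-sided z-test. This is only one possible test, and it rests on assumptions that are not from this chapter: the reference mean `mu0 = 70`, a known population standard deviation `sigma = 12`, and the helper name `one_sample_z_test` are all hypothetical. The sample is the set of 14 exam scores used earlier for the median.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: population mean == mu0 (sigma assumed known)."""
    # Step 3: calculated value (test statistic)
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    # Step 4: critical value for the chosen level of significance
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 5: reject H0 only if the statistic falls beyond the critical value
    reject = abs(z) > critical
    return z, critical, reject


scores = [53, 59, 63, 63, 72, 72, 76, 78, 81, 83, 84, 84, 90, 93]

# Step 1: H0: mean score = 70; alternative: mean score != 70
# Step 2: level of significance alpha = 0.05
z, critical, reject = one_sample_z_test(scores, mu0=70, sigma=12, alpha=0.05)

print(f"z = {z:.2f}, critical value = {critical:.2f}")
print("reject H0" if reject else "fail to reject H0")
```

Under these assumed values the test statistic is smaller than the critical value, so the result is "fail to reject H0"; that is not the same as proving H0 true.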
Inferential Statistics - Possible outcomes in Hypothesis Testing
Identifying the appropriate statistical test of difference
