Statistics For Data Science 1
Doaa M. Abdel-Aty
- Senior business intelligence at Badr University
Benefits of Enrolling in a Course:
• Understand the fundamentals of statistics.
• Work with different types of data.
• Plot different types of data.
• Calculate the measures of central tendency, asymmetry,
and variability.
• Calculate correlation and covariance.
• Distinguish and work with different types of distribution.
• Estimate confidence intervals.
• Perform hypothesis testing.
• Make data-driven decisions.
• Understand the mechanics of regression analysis.
• Carry out regression analysis.
• Use and understand dummy variables.
• Understand the statistical concepts needed for data science,
including when working with Python.
Understanding Statistics
Statistics is the science of collecting, organizing, presenting, analyzing and interpreting data.
It is one of the most important disciplines for gaining deeper insight into data. Statistical
analysis is used to manipulate, summarize and investigate data so that useful information
can be obtained.
✔ It is a branch of mathematics.
• Market Segmentation
• Recommendation Systems
• Clustering
Two fundamental ideas in the field of statistics are uncertainty and variation.
Statistics In Data Science
Skills Needed
● Data manipulation
● Critical thinking and attention to detail
● Curiosity
● Organization
● Innovation and problem solving
● Communication
● Statistics
Descriptive Statistics
• Uses the data to describe the population, through numerical calculations, graphs,
or tables.
• Helps organize the data and focuses on its characteristics, providing summary parameters.
Inferential Statistics
• Makes inferences and predictions about a population based on a sample of data taken from the
population in question.
• Generalizes from a sample to a larger data set and applies probability to arrive at a conclusion. It allows you
to infer population parameters from sample statistics and to build models on them.
Frequency Table
A frequency table lists a set of values and how often each one appears. Frequency is the number of times a
specific data value occurs in your dataset. These tables help you understand which data values are common
and which are rare. These tables organize your data and are an effective way to present the results to
others. Frequency tables are also known as frequency distributions because they allow you to understand
the distribution of values in your dataset.
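A frequency table can be built in a few lines of Python. The following is a minimal sketch using only the standard library; the list `responses` is hypothetical illustrative data, not data from the course.

```python
from collections import Counter

# Hypothetical survey responses (illustrative data, not from the course).
responses = ["bus", "car", "car", "walk", "bus", "car", "bike", "car"]

# Count how often each distinct value occurs.
frequency = Counter(responses)

# Build rows of (value, frequency, relative frequency), most common first.
total = sum(frequency.values())
table = [(value, count, count / total)
         for value, count in frequency.most_common()]

print(f"{'Value':<8}{'Frequency':>10}{'Relative':>10}")
for value, count, relative in table:
    print(f"{value:<8}{count:>10}{relative:>10.2f}")
```

The relative frequency column (count divided by the total) makes it easy to see which values are common and which are rare.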
Population and Sample
Population
is the collection of all outcomes, responses, measurements, or counts that are of interest.
Sample
is a subset of a population.
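The relationship between a population and a sample can be illustrated with a short sketch. The population below is randomly generated and purely hypothetical (ages between 18 and 65); the seed is fixed only so that the run is repeatable.

```python
import random
from statistics import mean

# Hypothetical population: ages of all 1,000 members of some group.
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.randint(18, 65) for _ in range(1000)]

# A sample is a subset of the population, drawn here without replacement.
sample = rng.sample(population, k=50)

print("Population size:", len(population))
print("Sample size:    ", len(sample))
print("Population mean:", round(mean(population), 2))
print("Sample mean:    ", round(mean(sample), 2))
```

The sample mean will usually be close to, but not exactly equal to, the population mean.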
Variable:
Any measured or counted aspect of the individual that can vary from one person to another
e.g. age, sex, blood pressure, etc.
Raw Data:
Are the values taken by different variables, i.e. the observations, counts, measurements, or responses made on the
individuals.
Information:
Are data that have been transformed (through analysis and interpretation) into a form useful for
drawing conclusions and making decisions.
Variables
A variable is any characteristic, number, or quantity that can be measured
or counted. A variable may also be called a data item. Age, sex, business
income and expenses, country of birth, capital expenditure, class grades, eye
colour and vehicle type are examples of variables. It is called a variable
because the value may vary between data units in a population, and may
change in value over time.
Types of Data
Understanding The Variables Using a Dataset
Descriptive Statistics
Central Tendency of Data
• The arithmetic mean (often just called the “mean”) is the most common
measure of central tendency
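• For a sample of size n:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
where $\bar{X}$ (pronounced x-bar) is the sample mean, n is the sample size, and $X_i$ is the ith observed value.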
• The most common measure of central tendency
• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)
[Figure: two data sets plotted on an axis from 11 to 20. Left: Mean = 13. Right: Mean = 14.]
Outliers
Effect of Outliers
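The effect of an outlier on the mean can be sketched as follows. The two data sets are assumptions chosen to reproduce the means 13 and 14 shown above; the actual plotted values are not given in the text.

```python
# Assumed data sets, chosen to reproduce the means 13 and 14 shown above.
without_outlier = [11, 12, 13, 14, 15]
with_outlier = [11, 12, 13, 14, 20]  # last value replaced by an extreme one


def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


print(arithmetic_mean(without_outlier))  # 13.0
print(arithmetic_mean(with_outlier))     # 14.0
```

Changing a single value from 15 to 20 moves the mean by a full unit, because every value contributes to the sum.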
Central Tendency of Data: 2-Median
[Figure: two data sets plotted on an axis from 11 to 20. Left: Median = 13. Right: Median = 13.]
• Less sensitive than the mean to extreme values
• The location of the median when the values are in numerical order
(smallest to largest):
Median position = $\frac{n+1}{2}$ position in the ordered data
• If the number of values is odd, the median is the middle number
• If the number of values is even, the median is the average of the two
middle numbers
Note that $\frac{n+1}{2}$ is not the value of the median, only the position
of the median in the ordered data.
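The odd and even cases can be sketched in Python. This is a minimal illustration, assuming the usual position rule (n + 1) / 2 in the sorted data; the data sets are hypothetical, and `median_by_position` is an illustrative helper name. The result is compared against `statistics.median` from the standard library.

```python
from statistics import median


def median_by_position(values):
    """Median via the (n + 1) / 2 position rule on the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2  # 1-based position; ends in .5 when n is even
    if n % 2 == 1:
        # Odd n: the position is a whole number, so take that value.
        return ordered[int(position) - 1]
    # Even n: average the two values on either side of the position.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2


print(median_by_position([11, 12, 13, 14, 15]))  # 13
print(median_by_position([11, 12, 13, 14, 20]))  # 13 (unchanged by the outlier)
print(median_by_position([11, 12, 13, 14]))      # 12.5
```

The second call shows why the median is less sensitive than the mean to extreme values: replacing 15 with 20 leaves the middle value untouched.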
Mode
• The value of the observation that appears most frequently.
[Figure: two data sets. Left: Mode = 9. Right: No Mode.]
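A small sketch of finding the mode. The data sets are hypothetical, and treating "every value occurs equally often" as "no mode" is a convention assumed here; `modes` is an illustrative helper name, and it expects a non-empty list.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] when no value stands out."""
    counts = Counter(values)
    highest = max(counts.values())
    # Assumed convention: if all distinct values occur equally often,
    # the data set has no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return [value for value, c in counts.items() if c == highest]


print(modes([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9, 10, 11, 12, 13, 14]))  # [9]
print(modes([0, 1, 2, 3, 4, 5, 6]))                                      # []
print(modes([1, 1, 2, 2, 3]))                                            # [1, 2]
```

Returning a list allows for data sets with more than one mode, as in the last call.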
Measures of Central Tendency: Review Example
▪ In some situations it makes sense to report both the mean and the
median.
Measures of Variation
● Variance
● Standard Deviation
● Range
Example:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Range = ………..
Why The Range Can Be Misleading
▪ Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
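The two ranges above can be reproduced directly. The sketch below uses the two data sets exactly as listed; `value_range` is an illustrative helper name.

```python
# The two data sets listed above; they differ only in the last value.
data_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5]
data_b = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 120]


def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


print(value_range(data_a))  # 4
print(value_range(data_b))  # 119
```

Only the two extreme values enter the calculation, so a single outlier changes the range from 4 to 119 while the other 24 values stay the same.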
VARIANCE
● Variance: the arithmetic mean of the squared
deviations from the mean.
● The variance is non-negative and is zero only if all
observations are the same.
● For populations whose values are near the mean,
the variance will be small.
● For populations whose values are dispersed from
the mean, the population variance will be large.
● The variance overcomes the weakness of the
range by using all the values in the population.
Computing the Variance
• Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
Where
$\bar{X}$ = arithmetic mean
n = sample size
$X_i$ = ith value of the variable X
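The variance calculation can be sketched as follows. This assumes the usual convention that the sample variance divides the sum of squared deviations by n - 1, while the population variance divides by n (the arithmetic mean of the squared deviations). The data set is hypothetical, and the function names are illustrative; results are checked against the standard library's `statistics.variance` and `statistics.pvariance`.

```python
from statistics import pvariance, variance


def sum_of_squared_deviations(values):
    """Sum of (X_i - mean)^2 over all values."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values)


def sample_variance(values):
    """Sample variance: squared deviations divided by n - 1."""
    return sum_of_squared_deviations(values) / (len(values) - 1)


def population_variance(values):
    """Population variance: squared deviations divided by n."""
    return sum_of_squared_deviations(values) / len(values)


data = [11, 12, 13, 14, 15]  # hypothetical data, mean = 13

print(sample_variance(data))      # 2.5
print(population_variance(data))  # 2.0

# Cross-check against the standard library.
print(sample_variance(data) == variance(data))
print(population_variance(data) == pvariance(data))
```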
Standard Deviation
10 12 14 15 17 18 18 24
n = 8, Mean = $\bar{X}$ = 16
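The sample standard deviation for these eight values can be worked out step by step. The sketch below assumes the sample formula (dividing by n - 1) and takes the square root of the sample variance; it is checked against `statistics.stdev`.

```python
import math
from statistics import stdev

data = [10, 12, 14, 15, 17, 18, 18, 24]  # the eight values listed above

n = len(data)
x_bar = sum(data) / n                                # 16.0
squared_deviations = [(x - x_bar) ** 2 for x in data]
sum_sq = sum(squared_deviations)                     # 130.0

sample_var = sum_sq / (n - 1)                        # 130 / 7
s = math.sqrt(sample_var)

print(n, x_bar)      # 8 16.0
print(sum_sq)        # 130.0
print(round(s, 4))   # 4.3095
```

The standard deviation is expressed in the same units as the data, which makes it easier to interpret than the variance.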
Measures of Variation: Comparing Standard Deviations
▪ The more the data are spread out, the greater the range, variance, and standard deviation.
▪ If the values are all the same (no variation), all these measures will be zero.
Pearson Correlation Coefficient
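A minimal sketch of the Pearson correlation coefficient, assuming the standard definition: the sum of the products of the paired deviations, divided by the square root of the product of the two sums of squared deviations. The data and the name `pearson_r` are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of paired deviations from the means.
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the squared deviations.
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 75]

print(round(pearson_r(hours, scores), 3))
```

The coefficient always lies between -1 and 1: values near 1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship.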
Q&A