
Statistics For Data Science 1

some knowledge about statistics
Statistics for Data Science

1
Doaa M. Abdel-Aty
- Senior business intelligence at Badr University
- Data science Consultant / Instructor
- PhD Student of Data science (FGSSR)
- Master of Data Science - Faculty of Statistical Studies and Research (FGSSR) - Cairo University
- "Big Data analysis using Statistical intelligent Techniques", and precisely in "Energy Time series" (2020-2021).
Topics
• Sample and population data
• The fundamentals of descriptive statistics
• Measures of central tendency, asymmetry, and variability
• Practical example: descriptive statistics
• Distributions
• Estimators and estimates
• Confidence intervals: advanced topics
• Practical example: inferential statistics
• Hypothesis testing: introduction
• Introduction to probability
• Conditional probability
• Bayes’ theorem
• Vectors vs. matrices
• Vector magnitude and direction
• Cosine similarity between vectors
• Matrix operations (dot product, multiplication, addition, subtraction, inverse)

Benefits of Enrolling in a Course:
• Understand the fundamentals of statistics.
• Work with different types of data.
• Plot different types of data.
• Calculate the measures of central tendency, asymmetry,
and variability.
• Calculate correlation and covariance.
• Distinguish and work with different types of distribution.
• Estimate confidence intervals.
• Perform hypothesis testing.
• Make data-driven decisions.
• Understand the mechanics of regression analysis.
• Carry out regression analysis.
• Use and understand dummy variables.
• Understand the concepts needed for data science even
with Python.

Understanding Statistics
Statistics is the science of collecting, organizing, presenting, analyzing and interpreting data.
It is one of the most important disciplines for gaining deeper insight into
data. Statistical analysis is used to manipulate, summarize and investigate data so
that useful information can be obtained.

✔ It is a branch of mathematics.

✔ It is the science of dealing with numbers.

✔ It is used for collection, summarization, presentation and analysis of data.
Statistics helps answer questions like...

• What features are the most important?


• How should we design the experiment to develop our product strategy?
• What performance metrics should we measure?
• What is the most common and expected outcome?
• How do we differentiate between noise and valid data?
The General Applications of Statistical Models in Data Science
• Time Series

• Market Segmentation

• Recommendation Systems

• Association Rule Learning

• Clustering

• Dimension Reduction like PCA


Cycle of data
Statistics is required at every single step of the data cycle. That’s why a good statistician can be a good
data scientist as well.
What is Statistics ?
Statistics is the science concerned with developing and studying methods for collecting, analyzing,
interpreting and presenting empirical data. Statistics is a highly interdisciplinary field; research in statistics
finds applicability in virtually all scientific fields, and research questions in the various scientific fields
motivate the development of new statistical methods and theory. In developing methods and studying the
theory that underlies the methods, statisticians draw on a variety of mathematical and computational tools.

Two fundamental ideas in the field of statistics are uncertainty and variation.

Statistics In Data Science

In data science, statistics is at the core of sophisticated machine
learning algorithms, capturing and translating data patterns into
actionable evidence. Data scientists use statistics to gather,
review, analyze, and draw conclusions from data, as well as apply
quantified mathematical models to appropriate variables.

Skills Needed?
● Data manipulation
● Critical thinking and attention to detail
● Curiosity
● Organization
● Innovation and problem solving
● Communication
● Statistics

Descriptive Statistics

• Uses the data to provide descriptions of the population, either through numerical
calculations or graphs or tables.
• Helps organize data and focuses on the characteristics of the data, providing parameters.
Inferential Statistics
• Makes inferences and predictions about a population based on a sample of data taken from the
population in question.

• Generalizes from a sample to a larger data set and applies probability to arrive at a conclusion. It allows you to infer
parameters of the population based on sample statistics and build models on them.
Frequency Table
A frequency table lists a set of values and how often each one appears. Frequency is the number of times a
specific data value occurs in your dataset. These tables help you understand which data values are common
and which are rare. These tables organize your data and are an effective way to present the results to
others. Frequency tables are also known as frequency distributions because they allow you to understand
the distribution of values in your dataset.
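
As a minimal sketch of the idea above, a frequency table can be built in Python with `collections.Counter`. The `eye_colours` list is made-up data used only for illustration; it does not come from the course material.

```python
from collections import Counter

# Hypothetical sample data (an assumption for illustration only).
eye_colours = ["brown", "blue", "brown", "green", "brown", "blue"]

# Count how many times each value occurs in the data set.
frequency_table = Counter(eye_colours)

# Print each value with its frequency and relative frequency,
# ordered from the most common value to the rarest.
total = len(eye_colours)
for value, count in frequency_table.most_common():
    print(f"{value:<6} {count:>2}  {count / total:.0%}")
```

Sorting by frequency makes it easy to see at a glance which values are common and which are rare.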

Population and Sample
Population:
The collection of all outcomes, responses, measurements, or counts that are of interest.
Sample:
A subset of a population.
Variable:
Any measured or counted aspect of the individual that can vary from one person to another,
e.g. age, sex, blood pressure, etc.
Raw Data:
The values taken by different variables, i.e. the observations, counts, measurements, or responses made on the
individuals.
Information:
Data that have been transformed (through analysis and interpretation) into a form useful for
drawing conclusions and making decisions.
Variables
A variable is any characteristic, number, or quantity that can be measured
or counted. A variable may also be called a data item. Age, sex, business
income and expenses, country of birth, capital expenditure, class grades, eye
colour and vehicle type are examples of variables. It is called a variable
because the value may vary between data units in a population, and may
change in value over time.

Types of Data
Understanding The Variables Using a Dataset
Descriptive statistics
Central Tendency of Data

• A single value that tries to describe the whole data set using a central point or central location of the data.
Central Tendency of Data: 1- Mean

• The arithmetic mean (often just called the “mean”) is the most common
measure of central tendency
• For a sample of size n:

\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n} \]

where \bar{X} (pronounced x-bar) is the sample mean, n is the sample size, and X_i is the ith observed value.
• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)

[Figure: two dot plots on a scale from 11 to 20, one with Mean = 13 and one with Mean = 14, showing how an extreme value shifts the mean]
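
The effect of an extreme value on the mean can be sketched in a few lines of Python. The two lists below are assumed values chosen so that the means come out to 13 and 14 as in the figure; the individual data points are not recoverable from the slide itself.

```python
from statistics import mean

# Hypothetical values (assumed for illustration). The second list
# replaces 15 with the more extreme value 20.
without_outlier = [11, 12, 13, 14, 15]
with_outlier = [11, 12, 13, 14, 20]

mean_a = mean(without_outlier)  # 65 / 5
mean_b = mean(with_outlier)     # 70 / 5

print("Mean without the extreme value:", mean_a)
print("Mean with the extreme value:   ", mean_b)
```

Changing a single observation is enough to move the mean, which is why the mean is described as sensitive to outliers.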
Outliers
Effect of Outliers
Central Tendency of Data: 2-Median

• In an ordered array, the median is the “middle” number (50% above, 50% below)

[Figure: two dot plots on a scale from 11 to 20, with Median = 13 in both]

• Less sensitive than the mean to extreme values
• The location of the median when the values are in numerical order
(smallest to largest):

\[ \text{Median position} = \frac{n + 1}{2} \]

• If the number of values is odd, the median is the middle number
• If the number of values is even, the median is the average of the two
middle numbers

Note that (n + 1)/2 is not the value of the median, only the position of the median in the ranked data.
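
The odd and even cases can be sketched as a small Python function. This is an illustrative implementation of the standard rule (the median sits at position (n + 1)/2 of the ranked data); the function name `median_by_position` and the first two lists are assumptions made for the example.

```python
def median_by_position(values):
    """Return the median of a non-empty collection of numbers."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ranked = sorted(values)          # smallest to largest
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value, at position (n + 1) / 2.
        return ranked[middle]
    # Even n: the average of the two middle values.
    return (ranked[middle - 1] + ranked[middle]) / 2


# Hypothetical data: replacing 15 with 20 leaves the median unchanged.
print(median_by_position([11, 12, 13, 14, 15]))
print(median_by_position([11, 12, 13, 14, 20]))
# Even number of values: the two middle values are averaged.
print(median_by_position([10, 12, 14, 15, 17, 18, 18, 24]))
```

Because only the middle of the ranked data matters, an extreme value at either end does not change the result.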


Central Tendency of Data: 3- Mode

• Value that occurs most often


• Not affected by extreme values
• Used for either numerical or categorical data
• There may be no mode
• There may be several modes
[Figure: two dot plots; left panel on a scale from 0 to 14 with Mode = 9, right panel on a scale from 0 to 6 with No Mode]
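
The three situations described above (one mode, several modes, no mode) can be sketched with `collections.Counter`. The helper name `modes` and the sample lists are assumptions for illustration; treating "every value occurs once" as "no mode" is the convention used on the slide, and other conventions exist.

```python
from collections import Counter


def modes(values):
    """Return a list of all modes; an empty list means there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs exactly once, so nothing is "most frequent".
        return []
    return [value for value, count in counts.items() if count == highest]


print(modes([1, 3, 5, 9, 9, 9, 12]))          # one mode
print(modes(["red", "blue", "red", "blue"]))  # several modes, categorical data
print(modes([0, 1, 2, 3, 4, 5, 6]))           # no mode
```

The same function works for numerical and categorical data, since it only counts occurrences.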
Mode
• The value of the observation that appears most frequently.
Mode
Measures of Central Tendency: Review Example

House Prices:
$2,000,000
$  500,000
$  300,000
$  100,000
$  100,000
Sum $3,000,000

▪ Mean: ($ ……….. / ………..) = $ ………..
▪ Median: middle value of ranked data = $ ………..
▪ Mode: most frequent value = $ ………..
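
One way to check the review example is with Python's built-in `statistics` module. This is only a sketch; the variable name `house_prices` is chosen for the example, and the values are the five prices listed above.

```python
from statistics import mean, median, mode

# House prices from the review example.
house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

print(f"Sum:    ${sum(house_prices):,}")
print(f"Mean:   ${mean(house_prices):,}")    # sum divided by 5
print(f"Median: ${median(house_prices):,}")  # middle value of the ranked data
print(f"Mode:   ${mode(house_prices):,}")    # most frequent value
```

Running it shows the single $2,000,000 house pulling the mean well above the median, which is the point the example is making about extreme values.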
Which Measure to Choose?

▪ The mean is generally used, unless extreme values (outliers) exist.

▪ The median is often used, since it is not sensitive to extreme values. For example, median home prices may be reported for a region.

▪ In some situations it makes sense to report both the mean and the median.
Measures of Variation

● Variance
● Standard Deviation
● Range

■ Measures of variation give information on the spread or variability or dispersion of the data values.

[Figure: two distributions with the same center but different variation]
Measures of Variation
Range
▪ Simplest measure of variation
▪ Difference between the largest and the smallest values:

Range = Xlargest – Xsmallest

Example:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Range = ………..
Why The Range Can Be Misleading

▪ Sensitive to outliers

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
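
The two ranges above can be reproduced with a short Python sketch. The helper name `value_range` is an assumption for the example; the data are the two lists shown on the slide (eleven 1s, eight 2s, four 3s, a 4, and a final value of 5 or 120).

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


# The two data sets from the slide; only the last value differs.
data_a = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
data_b = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print("Range A:", value_range(data_a))
print("Range B:", value_range(data_b))
```

Twenty-four of the twenty-five values are identical in both lists, yet the range changes completely, because it depends only on the two most extreme observations.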
VARIANCE
● The variance is the arithmetic mean of the squared
deviations from the mean.
● The variance is non-negative and is zero only if all
observations are the same.
● For populations whose values are near the mean,
the variance will be small.
● For populations whose values are dispersed from
the mean, the population variance will be large.
● The variance overcomes the weakness of the
range by using all the values in the population.
Computing the Variance

Steps in computing the variance:

• Step 1: Find the mean.

• Step 2: Find the difference between each observation and the mean, and square that difference.

• Step 3: Sum all the squared differences found in Step 2.

• Step 4: Divide the sum of the squared differences by the number of items in the population.
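
The four steps translate almost line for line into Python. The sketch below computes the population variance (dividing by the number of items, as in Step 4); the function name and the `values` list are made up for illustration.

```python
def population_variance(values):
    """Population variance, following the four steps listed above."""
    n = len(values)
    mu = sum(values) / n                             # Step 1: the mean
    squared_diffs = [(x - mu) ** 2 for x in values]  # Step 2: squared differences
    total = sum(squared_diffs)                       # Step 3: sum them
    return total / n                                 # Step 4: divide by N


# Hypothetical population (an assumption for illustration only).
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(values))

# If all observations are the same, the variance is zero.
print(population_variance([3, 3, 3]))
```

For real work, `statistics.pvariance` in the standard library performs the same calculation.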
Variance
• Average (approximately) of squared deviations of values
from the mean

• Sample variance:

\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]

Where
\bar{X} = arithmetic mean
n = sample size
X_i = ith value of the variable X
Standard Deviation

• Most commonly used measure of variation


• Shows variation about the mean
• Is the square root of the variance
• Has the same units as the original data

• Sample standard deviation:

\[ S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]
Sample Standard Deviation: Calculation Example

Sample Data (Xi) :

10 12 14 15 17 18 18 24

n = 8, Mean = \bar{X} = 16
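
The calculation for this sample can be worked through in Python. This is a sketch using the sample standard deviation (square root of the sum of squared deviations divided by n - 1); the variable names are chosen for the example.

```python
import math

# Sample data from the calculation example.
sample = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(sample)
x_bar = sum(sample) / n                                 # sample mean
sum_squares = sum((x - x_bar) ** 2 for x in sample)     # squared deviations, summed
sample_variance = sum_squares / (n - 1)                 # divide by n - 1, not n
sample_std = math.sqrt(sample_variance)

print("n =", n)
print("mean =", x_bar)
print("sum of squared deviations =", sum_squares)
print("sample variance =", round(sample_variance, 4))
print("sample standard deviation =", round(sample_std, 4))
```

The standard library's `statistics.stdev(sample)` gives the same result and is the usual choice outside of teaching examples.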
Measures of Variation: Comparing Standard Deviations

Smaller standard deviation

Larger standard deviation


Measures of Variation: Summary Characteristics

▪ The more the data are spread out, the greater the range, variance, and standard deviation.

▪ The more the data are concentrated, the smaller the range, variance, and standard deviation.

▪ If the values are all the same (no variation), all these measures will be zero.

▪ None of these measures are ever negative.


Normal distribution
Shape of a Distribution
• Describes how data are distributed
• Two useful shape related statistics are:
• Skewness
• Measures the extent to which data values are not symmetrical
• Kurtosis
• Describes the peakedness of the curve of the distribution, that
is, how sharply the curve rises approaching the center of the
distribution
Shape of a Distribution (Skewness)

• Measures the extent to which data is not symmetrical

Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean
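
The relationship between skewness and the mean/median ordering can be checked numerically. The sketch below uses the moment coefficient of skewness (third central moment divided by the 1.5 power of the second central moment), which is one common definition among several and is an assumption here, as are the three small data sets.

```python
from statistics import mean, median


def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


# Hypothetical data sets (assumptions for illustration only).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
left_skewed = [21 - x for x in right_skewed]  # mirror image of the above
symmetric = [1, 2, 3, 4, 5]

for name, data in [("left", left_skewed), ("symmetric", symmetric), ("right", right_skewed)]:
    print(f"{name:<10} skew={skewness(data):+.2f} mean={mean(data)} median={median(data)}")
```

A negative coefficient goes with Mean < Median, a coefficient of zero with Mean = Median, and a positive coefficient with Median < Mean for these data.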
Commonly Observed Shapes
Z-Scores

The z-score is often called the standardized value.

It denotes the number of standard deviations a data value x_i is from the mean.
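
In symbols, the usual definition is z_i = (x_i - mean) / standard deviation. A minimal Python sketch follows, reusing the eight-value sample from the standard deviation example; using the sample standard deviation (n - 1 denominator) rather than the population one is an assumption made for this illustration.

```python
from statistics import mean, stdev

# Sample data reused from the standard deviation example.
sample = [10, 12, 14, 15, 17, 18, 18, 24]

x_bar = mean(sample)
s = stdev(sample)  # sample standard deviation (n - 1 denominator)

# z-score: how many standard deviations each value lies from the mean.
z_scores = [(x - x_bar) / s for x in sample]

for x, z in zip(sample, z_scores):
    print(f"x = {x:>2}  z = {z:+.2f}")
```

Values below the mean get negative z-scores, values above it get positive ones, and the z-scores of a data set always sum to zero.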
Outliers
The boxplot (five-number summary) consists of:

1. The median (2nd quartile, 50th percentile).
2. The 1st quartile (25th percentile).
3. The 3rd quartile (75th percentile).
4. The maximum value in a data set.
5. The minimum value in a data set.

IQR = Q3 - Q1
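
The five-number summary and IQR can be sketched with `statistics.quantiles`. Two things below are assumptions rather than part of the slide: the data (the earlier eight-value sample plus a made-up extreme value of 60), and the 1.5 × IQR fence rule, a widely used convention for flagging outliers. Quartile values also depend on the interpolation method; the `"inclusive"` method is used here.

```python
from statistics import quantiles

# Hypothetical data: the earlier sample plus one made-up extreme value (60).
data = sorted([10, 12, 14, 15, 17, 18, 18, 24, 60])

# Quartiles: Q1 (25th), Q2 (median, 50th) and Q3 (75th percentile).
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
five_number_summary = (min(data), q1, q2, q3, max(data))

iqr = q3 - q1

# Common convention (an assumption here): values beyond 1.5 * IQR
# from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Five-number summary:", five_number_summary)
print("IQR:", iqr)
print("Outliers:", outliers)
```

Because the fences are built from quartiles rather than from the mean and standard deviation, the rule itself is not distorted by the extreme values it is trying to detect.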
Statistically Correlated

● Strength of the correlation – Coefficient of Correlation

● Direction of correlation – Sign of the Coefficient

Pearson Correlation Coefficient
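
The Pearson coefficient r is the sum of the products of the paired deviations from the means, divided by the square root of the product of the two sums of squared deviations. The sketch below implements that formula directly; the function name `pearson_r` and the `hours`/`scores` data are hypothetical and used only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of paired deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical paired observations (assumptions for illustration only).
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

r = pearson_r(hours, scores)
print(f"r = {r:.4f}")
print("direction:", "positive" if r > 0 else "negative" if r < 0 else "none")
```

The magnitude of r (between 0 and 1) gives the strength of the linear relationship and its sign gives the direction, matching the two bullets above.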
Q&A

Questions and answers
