Statistics For Data Science


Objectives
At the end of this chapter, students will be able to:

• Describe how statistics plays a major role in data science.
• Understand the different types of statistical data.
• Understand the different types of statistical variables.
What is Statistics?
• Statistics – the science of collecting, analyzing, and interpreting
data in such a way that the conclusions can be objectively
evaluated.
• It also helps us make informed decisions based on
evidence.
• There are 3 phases:
1. Collecting data
2. Analyzing data
3. Interpreting data
Types of Data
There are two main types of data:
• Quantitative data: This is numerical data that can be measured
and analyzed mathematically.
• Examples - height, weight, age, and income.
• Two categories:
Discrete data: quantitative data that are counted (1, 2, 3, 4, …).
Continuous data: quantitative data that are measured (0.5, 1.678, …).
• Qualitative data: This is non-numerical data that describes
qualities or characteristics.
• Examples - gender, occupation, and favorite color.
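The distinction between the data types can be sketched in code. The snippet below is only a rough illustration: the function name `classify` and the type-based rule are assumptions made for this example, and real datasets often need human judgment (for instance, numeric codes that actually stand for categories).

```python
def classify(values):
    """Roughly label a list of values by data type (hypothetical helper)."""
    def is_int(v):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(v, int) and not isinstance(v, bool)

    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if all(is_int(v) for v in values):
        return "quantitative (discrete)"    # counted values
    if all(is_number(v) for v in values):
        return "quantitative (continuous)"  # measured values
    return "qualitative"                    # non-numerical descriptions


print(classify([1, 2, 3, 4]))         # counts
print(classify([0.5, 1.678]))         # measurements
print(classify(["red", "blue"]))      # favorite colors
```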
Types of Statistics
• Descriptive Statistics – summarize and describe a characteristic
of a group. They can be used to identify patterns, outliers, and
trends in data.
• Example: batting average for a player, average marks of a class
• Inferential Statistics – used to estimate, infer, or conclude
something about a larger group
• Example: polls
• Sample – subset of the group of data available for
analysis
What is Statistics?
• Population – refers to the entire set.
• Sample - refers to a subset of the population that is selected
for analysis
• Bias – favoring of certain outcomes over others
• Census – collects data from all members of the population
• Parameter – characteristic value of a population
• Statistic – characteristic value of a sample
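The relationship between these terms can be illustrated with a short sketch. The population below is made-up data (randomly generated ages), and the sample size of 50 is an arbitrary choice for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 1,000 people.
population = [random.randint(18, 80) for _ in range(1000)]

# Parameter: a characteristic value of the whole population.
population_mean = statistics.mean(population)

# Sample: a randomly selected subset of the population.
sample = random.sample(population, 50)

# Statistic: the same characteristic computed on the sample only.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

The two values are usually close but not identical; the gap is the sampling error that inferential statistics tries to quantify.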
Organizing Data – Frequency Table
• Frequency is the number of times that a particular result
occurs.
• Ways to organize data:
Simple frequency table

Note: There are many other types of frequency tables depending on information you want
to record.
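A simple frequency table can be built in a few lines. The survey answers below are invented for illustration; the relative frequency column (count divided by total) is an addition that anticipates the relative frequency histogram.

```python
from collections import Counter

# Hypothetical survey answers (qualitative data).
answers = ["red", "blue", "red", "green", "blue", "red"]

frequency = Counter(answers)      # maps each value to its count
total = sum(frequency.values())   # total number of observations

# Print the table, most frequent value first.
print(f"{'Value':<8}{'Frequency':>10}{'Relative':>10}")
for value, count in frequency.most_common():
    print(f"{value:<8}{count:>10}{count / total:>10.2f}")
```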
Displaying Data
• Ways to display data:
• Frequency histogram
• Relative frequency histogram
• Multiple bar graph
• Stacked bar graph
• Line graph
• Pie chart
Descriptive Statistics - Measures of Central Tendency
Central Tendency – the propensity of data to be located or
clustered about some point.

Usually the mean, median, or mode is used to measure
central tendency.
Descriptive Statistics
Mean
The mean is the same as the average. To find the mean, add all the values and divide by
the total number of values.
Example: for the data {2, 3, 5, 6}, the mean is µ = (1/n) · Σ_{i=1}^{n} x_i = (2 + 3 + 5 + 6) / 4 = 4
The letter x with a bar over it (x̄) represents the sample mean.
Mode
The mode is the most frequent value in the set of numbers.
Example: In the data set 52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95, the
most frequent value is 78. The mode = 78.
Example: In the data set 52, 53, 53, 53, 60, 67, 72, 72, 72, 90, both 53 and 72 occur most
often (3 times each), so there are two modes, 53 and 72. We call this set
of data bimodal, meaning it has two modes.
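The mean and mode examples above can be checked with Python's standard library. This is a minimal sketch; `statistics.multimode` (Python 3.8+) is used because it returns every value tied for most frequent, which handles the bimodal case.

```python
import statistics

# Mean: add all the values and divide by how many there are.
data = [2, 3, 5, 6]
mean = sum(data) / len(data)
print(mean)   # 4.0

# Mode: the most frequent value(s) in the data set.
scores = [52, 53, 53, 53, 60, 67, 72, 72, 72, 90]
modes = statistics.multimode(scores)
print(modes)  # [53, 72], so the data set is bimodal
```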
Descriptive Statistics
Median
The median is the middle value of a set of numbers that has been ordered from smallest
to largest. The upper case letter M is used for the median.
Example: A sample of statistics exam scores for 14 students is (in order from smallest to
largest) as follows: 53, 59, 63, 63, 72, 72, 76, 78, 81, 83, 84, 84, 90, 93
Notice that 14 is an even number. The median is between the 7th and 8th values (the middle
two values): median = (76 + 78) / 2 = 77

Example: A second sample of statistics exam scores for 15 students is (in order from
smallest to largest) as follows: 52, 60, 65, 67, 70, 71, 74, 76, 78, 78, 78, 80, 86, 89, 95
Notice that 15 is an odd number. The median is the 8th value (the middle value). The 8th
value is 76, so the median = 76.
Descriptive Statistics - Measures of Variability
Measures of variability describe how the data is spread out. The
most commonly used measures of variability are:
• Range: This is the difference between the largest and smallest
values in a dataset.
• Variance: This is the average of the squared differences from
the mean. It measures how much the data deviates from the
mean.
• Standard Deviation: This is the square root of the variance. It
is a measure of how spread out the data is from the mean.
Variance
A deviation is the difference between a value and the mean and is written as: x - µ
The variance is the average of the squares of the deviations.
Example: {2, 3, 5, 6} is a set of data. The mean is 4. The deviations are:
• 2 - 4 = -2
• 3 - 4 = -1
• 5 - 4 = 1
• 6 - 4 = 2
The deviations squared are:
• (-2)^2 = 4
• (-1)^2 = 1
• (1)^2 = 1
• (2)^2 = 4
• The average of the deviations squared is: variance = (4 + 1 + 1 + 4) / 4 = 2.5
Note: Dividing by n, as here, gives the population variance. The sample variance divides by
n - 1 instead, which for this data gives 10 / 3 ≈ 3.33.
Standard Deviation
The standard deviation is a special average of the deviations. It measures how the data is
spread out from its mean.

It is calculated as the square root of the variance of the data. The formula for calculating
standard deviation is:

Standard deviation = sqrt(Variance)

In our example, standard deviation = σ = sqrt(2.5) ≈ 1.58
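The range, variance, and standard deviation for the example data {2, 3, 5, 6} can be computed step by step as below. This sketch divides by n, matching the worked example (a population variance); the last line shows, for comparison, the sample variance from `statistics.variance`, which divides by n - 1.

```python
import math
import statistics

data = [2, 3, 5, 6]
n = len(data)
mean = sum(data) / n                                   # 4.0

data_range = max(data) - min(data)                     # largest minus smallest
squared_deviations = [(x - mean) ** 2 for x in data]   # [4.0, 1.0, 1.0, 4.0]
variance = sum(squared_deviations) / n                 # average squared deviation
std_dev = math.sqrt(variance)                          # square root of the variance

print(data_range)          # 4
print(variance)            # 2.5
print(round(std_dev, 2))   # 1.58

# For comparison: the sample variance divides by n - 1 instead of n.
sample_variance = statistics.variance(data)
```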


Descriptive Statistics - Derived Score
Percentile/Quantile
• A percentile is a statistical measure that represents a point below which a given
percentage of observations in a dataset falls. In other words, if we say that the 25th
percentile of a dataset is 60, then it means that 25% of the values in the dataset are
below 60, and 75% of the values are above 60.
• Percentiles are often used in statistics to summarize the distribution of a dataset. For
example, the median (50th percentile) is a commonly used percentile that gives a
measure of the central tendency of the data. Other percentiles, such as the 25th and
75th percentiles, can give a measure of the spread of the data around the median.
• To calculate the percentile of a dataset, we first order the values from smallest to
largest, and then identify the value that corresponds to the desired percentile. If the
desired percentile falls between two values in the dataset, we can interpolate to
estimate the exact value.
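The procedure of ordering the values and interpolating can be sketched as follows. Several percentile conventions exist; this one places the p-th percentile at fractional position (p / 100) · (n - 1) in the sorted data and interpolates linearly, which is an assumption of this example rather than the only valid definition. The function name `percentile` is also chosen here for illustration.

```python
def percentile(values, p):
    """Return the p-th percentile (0 <= p <= 100) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")

    ordered = sorted(values)                # step 1: order smallest to largest
    rank = (p / 100) * (len(ordered) - 1)   # fractional position in the sorted list
    lower = int(rank)                       # index of the value just below
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower                 # how far between the two neighbors

    # step 2: interpolate between the two neighboring values
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


print(percentile([1, 2, 3, 4, 5], 50))  # 3.0 (the median)
print(percentile([1, 2, 3, 4], 25))     # 1.75 (falls between 1 and 2)
```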
Descriptive Statistics - Derived Score
Z-Score
• A z-score (or standard score) is a statistical measure that represents the number of
standard deviations an observation or data point is away from the mean of its
distribution. In other words, a z-score indicates how far a given value is from the mean
in terms of the standard deviation of the data.
• The formula for calculating the z-score of a value x in a dataset with mean μ and
standard deviation σ is:
z = (x - μ) / σ
• A z-score of 0 indicates that the value is equal to the mean of the distribution, while a
z-score of +1 (or -1) indicates that the value is one standard deviation above (or below)
the mean. A z-score of +2 (or -2) indicates that the value is two standard deviations
above (or below) the mean, and so on.
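The z-score formula translates directly into code. The sketch below reuses the data set {2, 3, 5, 6}, with its mean of 4 and standard deviation of sqrt(2.5); the function name `z_score` is chosen for this example.

```python
import math


def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    if std_dev == 0:
        raise ValueError("standard deviation must be non-zero")
    return (x - mean) / std_dev


data = [2, 3, 5, 6]
mu = sum(data) / len(data)                                      # 4.0
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data)) # sqrt(2.5)

z_scores = [z_score(x, mu, sigma) for x in data]
for x, z in zip(data, z_scores):
    print(f"x = {x}: z = {z:+.2f}")
```

Values below the mean get negative z-scores and values above it get positive ones; the z-scores of a whole data set always sum to zero.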
Descriptive Statistics - Derived Score
• Normal Distribution. Definition: Standardizing – converting data to z-scores.
• Some empirical rules (for normally distributed data):
• 1. About 68% of data is within one σ of the mean.
• 2. About 95% of data is within two σ of the mean.
• 3. About 99.7% of data is within three σ of the mean.
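These percentages can be checked against the standard normal distribution. The sketch below uses `statistics.NormalDist` (Python 3.8+); the helper name `share_within` is an assumption for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```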
Descriptive Statistics - Regression & Correlation
• Linear Regression – modeling the data with the line that “best
fits”, usually a “least squares” line or regression line
• Least Squares Line – the line that minimizes the sum of the
squared errors for a set of data points
• Correlation Coefficient r – a measure of the strength of the
linear relationship between the two variables x and y.
Note: The closer the correlation is to 1 or -1, the stronger the linear relationship between
the x and y variables.

• A correlation of zero means there is no evidence of a linear
pattern.
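A least squares line and the correlation coefficient r can be computed from the usual closed-form sums. The sketch below is a minimal version using only the standard library; the data points are invented for illustration, and the function names are chosen for this example.

```python
import math


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared errors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def correlation(xs, ys):
    """Return the Pearson correlation coefficient r between xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data points.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(xs, ys)
r = correlation(xs, ys)
print(f"y = {slope:.2f}x + {intercept:.2f}, r = {r:.3f}")
```

For this data the fitted line is y = 0.60x + 2.20 with r ≈ 0.775, a moderately strong positive linear relationship.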
Inferential Statistics
• Inferential statistics are used to draw conclusions about a
population by examining a sample
Inferential Statistics - Chain of Reasoning
• Are our inferences valid? The best we can do is to calculate
probabilities for our inferences.
Inferential Statistics-
• The accuracy of an inference depends on how well the sample
represents the population.
• Random selection – giving every member an equal chance of being
selected makes the sample more representative.
• Inferential statistics help researchers test hypotheses, answer
research questions, and derive meaning from the results.
Inferential Statistics - Steps
1. State the hypotheses
2. Choose the level of significance
3. Compute the calculated value (test statistic)
4. Obtain the critical value
5. Reject or fail to reject H0.
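The five steps can be walked through with a small example. The slides do not name a specific test, so this sketch assumes a one-sample, two-sided z-test with a known population standard deviation; all numbers (hypothesized mean 50, sample mean 52, σ = 5, n = 25) are hypothetical.

```python
import math
from statistics import NormalDist

# Step 1: state the hypotheses.
#   H0: the population mean equals 50.   H1: it does not equal 50.
mu_0 = 50

# Step 2: choose the level of significance.
alpha = 0.05

# Hypothetical sample summary (assumed values for illustration).
sample_mean = 52
sigma = 5      # population standard deviation, assumed known
n = 25         # sample size

# Step 3: compute the calculated value (the z test statistic).
z = (sample_mean - mu_0) / (sigma / math.sqrt(n))

# Step 4: obtain the critical value for a two-sided test.
critical_value = NormalDist().inv_cdf(1 - alpha / 2)

# Step 5: reject H0 if the statistic is more extreme than the critical value.
reject_h0 = abs(z) > critical_value

print(f"z = {z:.2f}, critical value = {critical_value:.2f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

Here z = 2.00 exceeds the critical value of about 1.96, so H0 is rejected at the 5% level. Other tests (for example, a t-test when σ is unknown) follow the same five steps with a different statistic and critical value.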
Inferential Statistics - Hypothesis
• Hypothesis testing uses sample data to evaluate the credibility of a
hypothesis about a population
• Null hypothesis – predicts no differences between the means of the groups
• Alternative hypothesis – predicts that there are differences
between the groups
Inferential Statistics - Possible outcomes in Hypothesis Testing
Identifying the appropriate statistical test of difference.
