Lecture 4 - Data Science Statistics

The document provides an overview of data science statistics, covering both descriptive and inferential statistics. Descriptive statistics summarize data features, while inferential statistics make predictions about populations based on samples. It also discusses measures of central tendency, dispersion, probability concepts, hypothesis testing, and real-world applications in various case studies.

Uploaded by

yashkamra

Foundations of Data Science (IT3101) (Section B)

Data Science Statistics
• Statistics is a branch of mathematics that deals with the collection,
organization, analysis, interpretation, and presentation of data. It is
used to make inferences about populations based on samples.

• There are two main types of statistics: descriptive and inferential.


Why and What
• Why statistics in data analysis: Statistics is crucial in understanding
patterns, trends, and relationships in data, which aids in decision-
making and problem-solving.
• What distinguishes descriptive from inferential statistics:
Descriptive statistics summarize and present data, while inferential
statistics make predictions and draw conclusions about a population
based on sample data.
Descriptive Statistics
• Descriptive statistics are used to summarize data and describe its main
features. They do not make inferences about populations, but simply
describe the data that is available.
• Definition and purpose of descriptive statistics: Descriptive statistics
summarize and describe the main features of a dataset, providing a clear
overview of the data.
• Examples of descriptive statistics: Mean, Median, Mode, Standard
Deviation, and Variance. These measures help understand the central
tendencies and variability of the data.
• Utilization of descriptive statistics in summarizing and presenting data:
Descriptive statistics provide a quick and easy way to represent data,
enabling better comprehension of the dataset.
Measures of Central Tendency
• Mean: The arithmetic average of a set of values. It provides the
overall "center" of the data.
• Median: The middle value of a dataset when arranged in ascending or
descending order. It is robust to extreme values.
• Mode: The value that occurs most frequently in the dataset. It
represents the most common observation.
• How to calculate each measure and interpret it: The formulas for the
mean, median, and mode, with worked numerical examples, will be
covered in the next class.
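As a small illustration of the three measures, the sketch below uses Python's standard `statistics` module. The `scores` list is hypothetical data invented for this example, with one extreme value included to show why the median is described as robust.

```python
import statistics

# Hypothetical scores, invented for illustration; 300 is a deliberate outlier.
scores = [55, 60, 60, 65, 70, 75, 300]

mean_score = statistics.mean(scores)      # sum of the values divided by their count
median_score = statistics.median(scores)  # middle value after sorting
mode_score = statistics.mode(scores)      # most frequently occurring value

print("mean:", mean_score)
print("median:", median_score)
print("mode:", mode_score)
```

The single outlier pulls the mean well above most of the scores, while the median stays at the middle observation.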
Measures of Dispersion
• Range: The difference between the maximum and minimum values in
the dataset. It indicates the spread of the data.
• Variance: The average of the squared differences between each data
point and the mean. It quantifies the dispersion of data points from
the mean.
• Standard Deviation: The square root of the variance. It represents the
typical distance between data points and the mean.
• Importance of measures of dispersion in understanding data spread:
Measures of dispersion help to understand the spread and variability
of data, providing additional insights beyond the central tendency
measures.
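The dispersion measures above can be sketched with the standard `statistics` module. The `data` list is hypothetical. Note one assumption about terminology: the "average of the squared differences" corresponds to the population variance (dividing by n), whereas the sample variance divides by n - 1, so both are shown.

```python
import statistics

# Hypothetical observations, invented for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)           # maximum minus minimum

population_var = statistics.pvariance(data)  # mean squared deviation, divides by n
population_sd = statistics.pstdev(data)      # square root of the population variance

sample_var = statistics.variance(data)       # divides by n - 1 instead of n
sample_sd = statistics.stdev(data)           # square root of the sample variance

print("range:", data_range)
print("population variance and SD:", population_var, population_sd)
print("sample variance and SD:", sample_var, sample_sd)
```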
Inferential Statistics
• Inferential statistics are used to make inferences about populations based
on samples. They use probability theory to quantify the uncertainty in
those inferences, for example how likely an observed result would be if
only chance were at work.
• Purpose of inferential statistics: Inferential statistics use sample data to
make inferences about a population. It helps generalize findings beyond
the sample.
• Differences between descriptive and inferential statistics: Descriptive
statistics summarize data, while inferential statistics draw conclusions
about a larger population.
• Role of inferential statistics in making predictions and drawing conclusions:
Inferential statistics allow researchers and decision-makers to make
predictions and draw conclusions with a certain level of confidence.
• Some common inferential statistics include:

• Probability Concepts: Probability distributions describe the likelihood
of different outcomes in a random process.
• Hypothesis testing: testing whether a particular hypothesis is
supported by the data
• Confidence intervals: estimating the range of values within which a
population parameter is likely to lie
• Regression analysis: predicting the value of one variable from the
value of another variable
Probability Concepts
• Definition of probability: Probability quantifies the likelihood of an
event occurring. It ranges from 0 (impossible) to 1 (certain).
• Basic rules of probability: The addition rule deals with the probability
of either one or both of two events happening. The multiplication
rule deals with the probability of both events happening.
• Probability distributions: Probability distributions describe the
likelihood of different outcomes in a random process. Examples
include the normal distribution and Poisson distribution.
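The two rules and the two named distributions can be checked numerically. The sketch below is illustrative only: the fair six-sided die, the event names, and the Poisson rate of 3 are assumptions chosen for the example, and `prob` and `poisson_pmf` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import product
from statistics import NormalDist
import math

# Sample space for one roll of a fair six-sided die (hypothetical example).
OUTCOMES = frozenset(range(1, 7))

def prob(event):
    """Probability of an event (a set of outcomes), all outcomes equally likely."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

even = {2, 4, 6}
above_three = {4, 5, 6}

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_either = prob(even) + prob(above_three) - prob(even & above_three)

# Multiplication rule: P(A and B) = P(A) * P(B given A)
p_above_given_even = Fraction(len(even & above_three), len(even))
p_both = prob(even) * p_above_given_even

# For independent events the rule reduces to P(A) * P(B): two sixes in two rolls.
p_two_sixes = prob({6}) * prob({6})
rolls = list(product(OUTCOMES, repeat=2))
p_two_sixes_counted = Fraction(sum(1 for a, b in rolls if a == b == 6), len(rolls))

def poisson_pmf(k, rate):
    """Poisson probability of exactly k events when the mean count is `rate`."""
    return math.exp(-rate) * rate**k / math.factorial(k)

# Standard normal distribution: probability of a value below 1.96.
p_below_1_96 = NormalDist(mu=0, sigma=1).cdf(1.96)

print(p_either, p_both, p_two_sixes)
print(poisson_pmf(2, 3), p_below_1_96)
```

Exact fractions are used for the die so that each rule can be compared against a direct count of outcomes without rounding error.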
Hypothesis Testing
• Definition of hypothesis and null hypothesis: A hypothesis is a
testable statement about a population parameter. The null hypothesis
is a statement of no effect or no difference.
• Steps of hypothesis testing: Formulate the research hypothesis and
null hypothesis, set the significance level (alpha), perform the
statistical test, and draw conclusions based on the p-value.
• Commonly used significance levels (alpha) and p-values: Common
significance levels include 0.05 and 0.01. The p-value is the
probability of obtaining results as extreme or more extreme than the
observed results, assuming the null hypothesis is true.
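The steps above can be traced in a minimal sketch. A two-sided z-test on a sample mean is used here only because it needs nothing beyond the standard library's normal distribution; it assumes the population standard deviation is known. All numbers (null mean 100, standard deviation 15, sample of 36 with mean 106) are hypothetical, and `two_sided_z_test` is an illustrative helper name.

```python
import math
from statistics import NormalDist

def two_sided_z_test(sample_mean, null_mean, sigma, n, alpha=0.05):
    """Return (z, p_value, reject_null) for a two-sided z-test on a mean."""
    standard_error = sigma / math.sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    # Probability of a result at least this extreme if the null hypothesis is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Step 1: null hypothesis "population mean = 100"; research hypothesis "mean != 100".
# Step 2: choose the significance level (alpha).
# Step 3: perform the test on the (hypothetical) sample summary.
z, p_value, reject_at_5_percent = two_sided_z_test(106, 100, sigma=15, n=36, alpha=0.05)
_, _, reject_at_1_percent = two_sided_z_test(106, 100, sigma=15, n=36, alpha=0.01)

# Step 4: draw the conclusion by comparing the p-value with alpha.
print("z =", z, "p-value =", p_value)
print("reject at 0.05:", reject_at_5_percent, "| reject at 0.01:", reject_at_1_percent)
```

With these made-up numbers the same data lead to rejection at alpha = 0.05 but not at alpha = 0.01, which shows why the significance level is fixed before the test is run.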
Types of Hypothesis Tests
• One-sample t-test: Used to compare the mean of a sample to a known
value. It determines if the sample mean is significantly different from the
known value.
• Two-sample t-test: Used to compare the means of two independent
samples. It tests if there is a significant difference between the means of
the two groups.
• Chi-square test: Used to test the independence between two categorical
variables. It determines if there is a relationship between the variables.
• Z-test: Used to determine whether two population means differ when the
variances are known and the sample size is large. Under the null
hypothesis, the z-statistic follows a normal distribution.
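The test statistics behind the t-tests and the chi-square test can be computed by hand, as in the sketch below. Several assumptions apply: all data are hypothetical, the function names are illustrative, and the two-sample statistic uses the Welch form (no equal-variance assumption), which is one common variant and not necessarily the one intended in the lecture. Converting a t-statistic into a p-value needs Student's t distribution, which the Python standard library does not provide, so only the statistics are computed. For a 2x2 table (1 degree of freedom) the chi-square p-value can be obtained from the normal distribution.

```python
import math
import statistics
from statistics import NormalDist

def one_sample_t(sample, known_value):
    """t-statistic comparing a sample mean with a known value."""
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return (statistics.mean(sample) - known_value) / standard_error

def welch_t(group_a, group_b):
    """Two-sample t-statistic in the Welch (unequal variances) form."""
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error

def chi_square_statistic(table):
    """Chi-square statistic for independence in a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical data, invented for illustration.
t_one = one_sample_t([5.1, 4.9, 5.3, 5.2, 5.0], known_value=5.0)
t_two = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
chi2 = chi_square_statistic([[30, 10], [20, 40]])

# With 1 degree of freedom, chi-square is the square of a standard normal variable.
chi2_p_value = 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))

print(t_one, t_two, chi2, chi2_p_value)
```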
Confidence Intervals
• Definition of confidence intervals: A range of values around the
sample statistic within which the population parameter is likely to fall
with a certain level of confidence.
• Calculating confidence intervals for means and proportions:
Confidence intervals are calculated based on the sample data and the
desired level of confidence (e.g., 95% confidence interval).
• Interpreting confidence intervals and their relationship with
hypothesis testing: If the confidence interval includes the null
hypothesis value, the result is not statistically significant; if it does not
include the null hypothesis value, the result is statistically significant.
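A minimal sketch of both calculations follows, assuming samples large enough for the normal approximation (for small samples a critical value from the t distribution is customary instead). The summary numbers are hypothetical, and `mean_ci` and `proportion_ci` are illustrative helper names.

```python
import math
from statistics import NormalDist

def critical_z(confidence):
    """Two-sided critical value, e.g. about 1.96 for 95% confidence."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def mean_ci(sample_mean, sample_sd, n, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    margin = critical_z(confidence) * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    margin = critical_z(confidence) * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical summaries, invented for illustration.
mean_low, mean_high = mean_ci(sample_mean=82, sample_sd=10, n=100)
prop_low, prop_high = proportion_ci(successes=60, n=100)

# Link to hypothesis testing: is the null value 0.5 inside the 95% interval?
null_value = 0.5
significant_at_5_percent = not (prop_low <= null_value <= prop_high)

print("mean CI:", (mean_low, mean_high))
print("proportion CI:", (prop_low, prop_high))
print("0.5 excluded, so significant at 5%:", significant_at_5_percent)
```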
Regression Analysis and Common Errors
• Regression: Predicts the value of a dependent variable based on the value
of one or more independent variables. The regression equation is
determined using least squares estimation.
• Type I and Type II errors in hypothesis testing: Type I error occurs when the
null hypothesis is rejected when it is true. Type II error occurs when the
null hypothesis is not rejected when it is false.
• Interpreting p-values correctly: A p-value provides the probability of
obtaining the observed results, or more extreme results, assuming the null
hypothesis is true. It does not indicate the probability of the null
hypothesis being true or false.
• The importance of sample size in statistical analysis: Larger sample sizes
provide more reliable and precise estimates of population parameters,
reducing the impact of random variation.
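The least squares estimation mentioned in the regression bullet can be written out for the simplest case of one independent variable. This is a sketch under that assumption; the data points are hypothetical and `least_squares` is an illustrative helper name.

```python
def least_squares(xs, ys):
    """Slope and intercept of the line minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = (co-variation of x and y) / (variation of x).
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data, invented for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = least_squares(xs, ys)
predicted_at_6 = intercept + slope * 6  # prediction for a new x value

print("slope:", slope, "intercept:", intercept)
print("prediction at x = 6:", predicted_at_6)
```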
Real-World Case Studies and Scenarios
• Case Study 1: Customer Satisfaction Survey
• Scenario: A company wants to assess customer satisfaction with its
products and services. They conduct a survey among a random
sample of customers.
Solution
Descriptive Statistics: The company calculates the mean satisfaction
score, median satisfaction score, and standard deviation of the
satisfaction scores. They also create a histogram to visualize the
distribution of satisfaction scores.

Inferential Statistics: The company uses inferential statistics to estimate
the overall customer satisfaction for the entire customer base based on
the sample data. They construct a confidence interval to determine the
range of likely satisfaction levels in the population.
Case Study 2: Medical Research
Scenario: A pharmaceutical company is testing a new drug to reduce
cholesterol levels. They conduct a randomized controlled trial with a
treatment group and a control group.
Solution
Descriptive Statistics: The company calculates the mean and standard
deviation of cholesterol levels for both groups. They also draw a box
plot to visualize the distribution of cholesterol levels in each group.

Inferential Statistics: The company uses inferential statistics to
determine if the new drug has a significant effect on reducing
cholesterol levels. They conduct a two-sample t-test to compare the
means of the treatment and control groups and determine if the
difference is statistically significant.
Case Study 3: Economic Forecasting
Scenario: A government agency wants to forecast the country's
economic growth for the next year. They analyze historical economic
data and other relevant indicators.
Solution
• Descriptive Statistics: The agency uses descriptive statistics to
summarize historical economic growth rates, inflation rates, and
other economic indicators. They calculate the mean, median, and
standard deviation of these variables.
• Inferential Statistics: The agency employs inferential statistics to make
predictions about future economic growth. They may use time-series
analysis, regression analysis, or other forecasting methods to estimate
future economic trends.
Case Study 5: Climate Change Analysis
Scenario: A climate research institute is studying the impact of climate
change on global temperatures. They analyze temperature data from
weather stations around the world.
Solution
• Descriptive Statistics: The institute uses descriptive statistics to
summarize the temperature data, calculating the mean, median, and
variability of global temperatures over time.
• Inferential Statistics: The institute employs inferential statistics to
determine if there is a significant trend in global temperatures over
the years. They may use linear regression analysis to identify long-
term temperature trends and estimate future temperature changes.
