Data Science
Role of Statistics in Data Science; Population vs. Sample; Descriptive vs. Inferential
statistics; Probability distributions: Poisson, Normal, Binomial, Uniform; Bayes'
theorem and conditional probability; Descriptive statistics: Measures of central
tendency: Mean, median, mode; Measures of dispersion: Variance, standard deviation;
Inferential statistics: Hypothesis testing: Null and alternative hypotheses, p-values;
Confidence intervals, ANOVA, Chi-square test, T-test; Correlation and Covariance.
Role of statistics in data science:
Statistics plays a fundamental role in data science by providing the tools and
techniques necessary for understanding and extracting insights from data. Here are
some key roles of statistics in data science:
1. Descriptive Statistics: Descriptive statistics summarize and describe features of a
dataset, such as mean, median, mode, standard deviation, and percentiles. These
statistics help in understanding the basic properties of the data.
2. Inferential Statistics: Inferential statistics involve making predictions or inferences
about a population based on a sample of data. Techniques like hypothesis testing,
confidence intervals, and regression analysis are used to draw conclusions about larger
populations from limited data.
3. Probability Theory: Probability theory is essential for understanding uncertainty and
randomness in data. It provides the foundation for statistical modeling and inference,
enabling data scientists to quantify uncertainty and make probabilistic predictions.
4. Statistical Modeling: Statistical modeling involves building mathematical models
that describe the relationships between variables in a dataset. Techniques such as
linear regression, logistic regression, time series analysis, and machine learning
algorithms rely on statistical principles to model complex relationships and make
predictions.
5. Experimental Design: Statistics helps in designing experiments and observational
studies to collect data in a systematic and unbiased manner. Proper experimental
design ensures that the collected data is reliable and can lead to valid conclusions.
6. Data Mining and Pattern Recognition: Statistical methods are used to identify
patterns, trends, and relationships within large datasets. Techniques such as clustering,
classification, and association rule mining help in discovering useful insights from
data.
7. Model Evaluation and Validation: Statistics provides methods for evaluating the
performance of predictive models and assessing their accuracy and reliability.
Techniques like cross-validation, hypothesis testing, and goodness-of-fit tests are used
to validate models and ensure their effectiveness.
8. Bayesian Statistics: Bayesian methods are increasingly used in data science for
modeling complex phenomena and updating beliefs based on new evidence. Bayesian
inference allows data scientists to incorporate prior knowledge and uncertainty into
statistical analysis, leading to more robust and interpretable results.
In essence, statistics provides the theoretical foundation and analytical tools necessary
for extracting meaningful information from data, making data-driven decisions, and
solving real-world problems in various domains. Statistics is extensively applied in
various real-life scenarios within the realm of data science.
1. Business Analytics: Companies use statistics to analyze sales data, customer
demographics, and market trends to make informed decisions about pricing, marketing
strategies, inventory management, and resource allocation.
2. Healthcare Analytics: Healthcare providers employ statistics to analyze patient data,
clinical trials, and medical records to identify patterns, predict disease outbreaks,
assess treatment effectiveness, and improve patient outcomes.
3. Financial Analysis: Financial institutions use statistical models to analyze stock
market data, predict asset prices, assess investment risks, detect fraudulent activities,
and optimize portfolio management strategies.
4. Marketing and Customer Analytics: Marketers use statistics to segment customers
based on demographics, behavior, and preferences, conduct A/B testing for marketing
campaigns, analyze website traffic and user engagement, and personalize marketing
messages for targeted audiences.
5. Predictive Maintenance: Industries such as manufacturing and transportation use
statistical models to predict equipment failures, schedule maintenance activities,
optimize asset performance, and minimize downtime and operational costs.
6. Social Media Analytics: Social media platforms analyze user data, interactions, and
engagement metrics using statistical techniques to personalize content
recommendations, target advertisements, detect trends and sentiment, and improve
user experience.
7. Environmental Science: Environmental scientists use statistics to analyze climate
data, predict natural disasters such as hurricanes and earthquakes, assess
environmental risks, and make policy recommendations for conservation and
sustainability efforts.
8. Supply Chain Management: Logistics companies employ statistics to optimize
supply chain operations, forecast demand for goods and services, minimize inventory
costs, improve transportation routes, and enhance overall efficiency and
responsiveness.
9. Sports Analytics: Sports teams and organizations use statistics to analyze player
performance, assess team strategies, predict game outcomes, optimize player
recruitment and draft selections, and gain a competitive edge in sports competitions.
10. Fraud Detection: Banks, insurance companies, and e-commerce platforms use
statistical models to detect fraudulent transactions, identify suspicious patterns and
anomalies in financial data, and prevent financial losses due to fraudulent activities.
Population vs. Sample:
1. Population:
The population refers to the entire group of individuals, items, or events that we are
interested in studying and about which we want to draw conclusions.
It represents the entire target of our research or analysis.
For example, if we are studying the heights of all adult males in a country, the
population would be all adult males in that country.
2. Sample:
A sample is a subset of the population that is selected for study or analysis.
It is chosen to represent the larger population, with the aim of drawing conclusions or
making inferences about the population based on the characteristics observed in the
sample.
Samples are often used because it is impractical or impossible to study every
individual or item in the entire population.
The process of selecting a sample from a population should ideally be done in a way
that avoids bias and ensures that the sample is representative of the population.
For example, in the scenario of studying the heights of all adult males in a country, we
might select a sample of 1000 adult males from different regions and demographics
within the country to represent the entire population.
Key differences between population and sample:
Size: The population includes all individuals or items of interest, whereas the sample
is a smaller subset of the population.
Representativeness: The population represents the entire group being studied, while
the sample should ideally be representative of the population to ensure that findings
can be generalized.
Practicality: It is often impractical or impossible to study the entire population, so
samples are used for practical reasons.
Inference: Statistical analyses performed on a sample are used to make inferences or
draw conclusions about the population from which the sample was drawn.
In summary, while the population represents the entire group being studied, the
sample is a smaller subset of the population that is selected for analysis. The goal of
sampling is to obtain a representative sample that accurately reflects the
characteristics of the population, allowing for valid inferences to be made about the
population based on the observed sample data.
A population in statistics refers to the entire group of individuals, items, or events that
we are interested in studying and about which we want to draw conclusions. It
encompasses all possible subjects that meet specific criteria and is the complete set
from which a sample is drawn.
Key characteristics of a population include:
1. Comprehensiveness: The population includes all elements that meet the criteria for
inclusion in the study. It represents the entire target of our research or analysis.
2. Defined Parameters: Each individual or item in the population possesses certain
characteristics or parameters that are of interest for the study.
3. Finite or Infinite: A population can be finite, meaning it consists of a fixed number of
elements, or infinite, meaning it continues indefinitely.
4. Homogeneity or Heterogeneity: Populations can vary in terms of the similarity or
diversity of their elements. They can be homogeneous if all elements are similar in
certain characteristics, or heterogeneous if they are diverse.
5. Accessibility: While some populations are easily accessible and well-defined, others
may be difficult to access or define precisely.
6. Stability: The characteristics of a population may change over time due to various
factors such as growth, migration, or other external influences.
Examples of populations include:
All registered voters in a country.
Every smartphone user in a particular city.
All cars manufactured by a specific automobile company in a given year.
All students enrolled in a university program.
Understanding the population of interest is essential for determining the scope and
objectives of a study, selecting appropriate sampling methods, and making valid
inferences about the entire group based on the observed data.
A sample in statistics refers to a subset of the population that is selected for study or
analysis. It is chosen to represent the larger population, with the aim of drawing
conclusions or making inferences about the population based on the characteristics
observed in the sample.
Key characteristics of a sample include:
1. Representativeness: The sample should ideally be representative of the population
from which it is drawn. This means that the characteristics of the sample should
closely resemble those of the population in terms of relevant attributes.
2. Randomness: Random sampling methods are often used to select a sample from the
population, which helps to minimize bias and ensure that each member of the
population has an equal chance of being included in the sample.
3. Size: The size of the sample refers to the number of individuals or items included in
the sample. Larger samples generally provide more precise estimates of population
parameters, but the appropriate sample size depends on factors such as the variability
of the population and the desired level of confidence.
4. Validity: The validity of the sample refers to the extent to which the conclusions
drawn from the sample accurately reflect the characteristics of the population. Validity
can be compromised if the sample is not representative or if there are biases in the
sampling process.
5. Sampling Methods: Various sampling methods can be used to select a sample from a
population, including simple random sampling, stratified sampling, cluster sampling,
and systematic sampling. The choice of sampling method depends on factors such as
the nature of the population and the research objectives.
Examples of samples include:
Survey responses from a randomly selected group of individuals in a population.
Measurements of a random sample of products from a production line to assess
quality.
Test scores from a sample of students in a school to evaluate academic performance.
Blood samples collected from a subset of patients in a clinical trial to study the
effectiveness of a new treatment.
Sampling is an essential aspect of statistical analysis, as it allows researchers to study
populations without needing to collect data from every individual or item in the
population. However, it is important to ensure that the sample is representative and
that appropriate sampling methods are used to minimize bias and maximize the
reliability of the study results.
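As a small illustration of simple random sampling, the sketch below draws a sample of 100 units from a hypothetical population of 10,000 numbered units (the population and sample sizes are made up for demonstration):

```python
import random

random.seed(42)                            # fixed seed for reproducibility
population = list(range(1, 10001))         # hypothetical population of 10,000 unit IDs
sample = random.sample(population, k=100)  # simple random sample, drawn without replacement

print(len(sample), sample[:10])
```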
Descriptive vs. Inferential Statistics
Descriptive and inferential statistics are two branches of statistical analysis that serve
different purposes:
1. Descriptive Statistics:
Descriptive statistics are used to summarize and describe the main features of a
dataset.
They provide simple summaries about the sample or population under study.
Common measures of descriptive statistics include measures of central tendency (e.g.,
mean, median, mode) and measures of dispersion or variability (e.g., range, variance,
standard deviation).
Descriptive statistics help to organize and simplify large amounts of data, making it
easier to understand and interpret.
These statistics are often used to present the basic characteristics of a dataset, identify
patterns, and provide insights into the data without making inferences beyond the
dataset itself.
Example: Calculating the average height of students in a class, or the percentage of
customers who purchased a particular product.
2. Inferential Statistics:
Inferential statistics are used to make inferences or predictions about a population
based on sample data.
They involve generalizing from a sample to a population and drawing conclusions
about the population parameters.
Inferential statistics allow researchers to test hypotheses, assess relationships between
variables, and make predictions about future outcomes.
Techniques such as hypothesis testing, confidence intervals, and regression analysis
are commonly used in inferential statistics.
These statistics help to determine whether observed differences or relationships in the
sample data are statistically significant and can be generalized to the larger
population.
Example: Using a sample of voters to estimate the proportion of the population that
supports a particular political candidate, or testing whether a new drug treatment is
effective based on results from a clinical trial.
In summary, descriptive statistics are used to summarize and describe data, whereas
inferential statistics are used to make inferences or predictions about populations
based on sample data. Both branches of statistics play important roles in analyzing and
interpreting data, with descriptive statistics providing insights into the characteristics
of the data and inferential statistics allowing researchers to make broader conclusions
about populations based on sample data.
Probability Distributions:
Probability distributions describe how the values of a random variable are distributed.
Poisson distribution:
The Poisson probability distribution is used to model the number of events
occurring within a fixed interval of time or space when these events happen with a
known constant mean rate and independently of the time since the last event. It is
particularly useful for modeling rare events.
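As a brief illustrative sketch, Poisson probabilities can be computed with scipy.stats; the rate value below is made up for demonstration:

```python
from scipy.stats import poisson

# Poisson PMF: P(X = k) = (lam**k * e**(-lam)) / k!
lam = 3  # assumed average rate, e.g. 3 events per interval (illustrative value)

print(poisson.pmf(2, mu=lam))  # probability of exactly 2 events in the interval
print(poisson.cdf(5, mu=lam))  # probability of at most 5 events in the interval
```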
Mode
The mode is the number that appears most frequently in the dataset. A dataset can
have one mode, more than one mode, or no mode at all if no number repeats.
In our dataset (23, 29, 20, 32, 25, 23, 27, 28, 23, 30), the number 23 appears three
times, which is more frequent than any other number.
Mode=23
Summary
For the dataset 23, 29, 20, 32, 25, 23, 27, 28, 23, 30:
Mean: 26
Median: 26
Mode: 23
These measures provide different perspectives on the central tendency of the data. The
mean gives the arithmetic average, the median provides the middle value, and the
mode identifies the most frequently occurring value.
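These values can be verified with a minimal Python sketch using the standard library's statistics module:

```python
import statistics

data = [23, 29, 20, 32, 25, 23, 27, 28, 23, 30]

print(statistics.mean(data))    # 26  (arithmetic average)
print(statistics.median(data))  # 26  (average of the two middle values, 25 and 27)
print(statistics.mode(data))    # 23  (appears three times)
```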
Measures of Dispersion: Variance, Standard deviation
Measures of dispersion describe the spread or variability of a dataset. The most
commonly used measures of dispersion are the range, variance, standard deviation,
and interquartile range (IQR). Here's a detailed look at each, using the same dataset as
before:
Example Dataset
Consider the following set of numbers representing the ages of a group of people:
23,29,20,32,25,23,27,28,23,30
Range
The range is the difference between the highest and lowest values in the dataset.
Range = Maximum value − Minimum value
For our dataset:
Maximum value=32
Minimum value=20
Range=32−20=12
Interquartile Range (IQR)
The IQR is the difference between the third quartile (Q3) and the first quartile (Q1).
First, arrange the dataset in ascending order:
20,23,23,23,25,27,28,29,30,32
Next, find Q1 and Q3. Q1 is the median of the first half of the data, and Q3 is the
median of the second half of the data.
For our dataset:
Q1 (the median of 20, 23, 23, 23, 25): 23
Q3 (the median of 27, 28, 29, 30, 32): 29
IQR=Q3−Q1=29−23=6
For the dataset 23, 29, 20, 32, 25, 23, 27, 28, 23, 30:
Range: 12
Variance: 14.44 (sample variance, dividing the sum of squared deviations by n − 1)
Standard Deviation: 3.80
Interquartile Range (IQR): 6
These measures of dispersion provide insights into the variability and spread of the
data, complementing the measures of central tendency.
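A minimal Python sketch to verify these values (using the sample variance and standard deviation, and the median-of-halves quartile convention used above):

```python
import statistics

data = [23, 29, 20, 32, 25, 23, 27, 28, 23, 30]

data_range = max(data) - min(data)     # 12
variance = statistics.variance(data)   # ~14.44 (sample variance, divides by n - 1)
std_dev = statistics.stdev(data)       # ~3.80

# Q1 and Q3 as medians of the lower and upper halves of the sorted data
s = sorted(data)
half = len(s) // 2
q1 = statistics.median(s[:half])       # 23
q3 = statistics.median(s[-half:])      # 29
iqr = q3 - q1                          # 6

print(data_range, round(variance, 2), round(std_dev, 2), iqr)
```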
Hypothesis Testing: Null and Alternative Hypothesis
Hypothesis testing is a fundamental method used in statistics to make inferences about
a population based on sample data. It involves the following steps:
1. Formulating Hypotheses:
Null Hypothesis (H0): This is a statement of no effect or no difference, and it serves
as the default or starting assumption. For example, H0: μ=0 (where μ is the population
mean).
Alternative Hypothesis (Ha or H1): This is a statement that contradicts the null
hypothesis, indicating the presence of an effect or a difference. For example,
Ha: μ ≠ 0.
2. Selecting a Significance Level (α):
The significance level, often denoted by α, is the probability of rejecting
the null hypothesis when it is true. Common choices are 0.05, 0.01, and
0.10.
3. Choosing a Test Statistic:
The test statistic is a standardized value that is calculated from sample
data during a hypothesis test. Examples include the t-statistic, z-statistic,
chi-square statistic, and F-statistic. The choice depends on the nature of
the data and the hypotheses.
4. Determining the Sampling Distribution:
The sampling distribution of the test statistic under the null hypothesis is
determined. This distribution is used to calculate the probability of
observing the test statistic under the null hypothesis.
5. Calculating the Test Statistic:
Using the sample data, compute the test statistic. For instance, if testing the
mean of a population with sample mean x̄, hypothesized mean μ0, sample standard
deviation s, and sample size n, the t-statistic is calculated as:
t = (x̄ − μ0) / (s / √n)
6. Making a Decision:
p-value approach: Calculate the p-value, which is the probability of
observing the test statistic or something more extreme, assuming the
null hypothesis is true. If the p-value is less than or equal to α, reject the
null hypothesis.
Critical value approach: Compare the test statistic to a critical value
from the sampling distribution. If the test statistic falls into the critical
region (beyond the critical value), reject the null hypothesis.
7. Conclusion:
Based on the decision, conclude whether there is sufficient evidence to
reject the null hypothesis in favor of the alternative hypothesis. This
conclusion helps in understanding whether the observed data
significantly deviates from what was expected under the null hypothesis.
Hypothesis testing is widely used in many fields, such as medicine,
psychology, economics, and social sciences, to test theories and make decisions based
on data.
Problem 1: Testing a Population Mean
A company claims that the average weight of its product is 50 grams. A quality
control manager wants to test this claim. A sample of 30 products shows an average
weight of 51 grams with a standard deviation of 2 grams. Test the company's claim at
the 0.05 significance level.
1. Formulate Hypotheses:
H0: μ=50 grams (null hypothesis: the mean weight is 50 grams)
Ha: μ ≠50 grams (alternative hypothesis: the mean weight is not 50 grams)
2. Select Significance Level:
α=0.05
3. Choose a Test Statistic:
Since the population standard deviation is unknown, we use the t-statistic with the
sample standard deviation and n − 1 = 29 degrees of freedom.
4. Determine the Sampling Distribution:
Under H0, the sampling distribution of the sample mean is approximately normal due
to the Central Limit Theorem.
5. Calculate the Test Statistic:
t = (x̄ − μ0) / (s / √n) = (51 − 50) / (2 / √30) ≈ 2.74
6. Determine the Critical Value:
For α=0.05 and a two-tailed test with df=29, the critical value from the t-distribution
table is approximately ±2.045.
7. Make a Decision:
Since 2.74 > 2.045, the test statistic falls in the critical region, so we reject the null
hypothesis.
8. Conclusion:
There is sufficient evidence at the 0.05 significance level to conclude
that the mean weight of the product is not 50 grams.
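A minimal sketch of this calculation in Python with scipy.stats, using only the summary statistics given in the problem:

```python
import math
from scipy.stats import t

n, xbar, s, mu0 = 30, 51, 2, 50              # sample size, sample mean, sample std, claimed mean

t_stat = (xbar - mu0) / (s / math.sqrt(n))   # ≈ 2.74
df = n - 1                                   # 29 degrees of freedom
p_value = 2 * (1 - t.cdf(abs(t_stat), df))   # two-tailed p-value
critical = t.ppf(1 - 0.05 / 2, df)           # ≈ 2.045

print(round(t_stat, 2), round(p_value, 4), round(critical, 3))
# Since |t| > 2.045 (and p < 0.05), H0 is rejected at the 0.05 level
```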
In hypothesis testing, the null and alternative hypotheses form the basis of the
test. They are statements about a population parameter (such as a mean, proportion, or
variance) that we aim to test based on sample data.
Null Hypothesis (H0)
The null hypothesis is a statement that there is no effect, no difference, or no
relationship in the population, and it serves as the default assumption. It is
often a statement of equality (e.g., H0: μ = μ0, where μ is the population mean).
Alternative Hypothesis (Ha or H1)
The alternative hypothesis contradicts the null hypothesis, asserting that there is an
effect, a difference, or a relationship in the population (e.g., Ha: μ ≠ μ0).
ANOVA (Analysis of Variance)
ANOVA is used to test whether the means of three or more groups differ significantly
by comparing the variability between the groups with the variability within the groups.
Types of ANOVA
1. One-Way ANOVA: Tests for differences among means of three or more independent
(unrelated) groups.
2. Two-Way ANOVA: Tests for differences among means when there are two
independent variables, allowing for the study of interaction effects between the
variables.
3. Repeated Measures ANOVA: Used when the same subjects are used for each
treatment (i.e., repeated measurements).
Key Concepts
Null Hypothesis (H0): Assumes that all group means are equal.
Alternative Hypothesis (Ha): Assumes that at least one group mean is different.
F-Statistic: The test statistic used in ANOVA, which is the ratio of the variance
between the groups to the variance within the groups.
P-Value: Used to determine the significance of the results. If the p-value is less than
the significance level (usually 0.05), the null hypothesis is rejected.
One-Way ANOVA
Assumptions
The observations are independent, the data within each group are approximately
normally distributed, and the groups have roughly equal variances (homogeneity of
variance).
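A minimal sketch of a one-way ANOVA with scipy.stats; the three groups of scores below are made-up values for illustration:

```python
from scipy.stats import f_oneway

group1 = [85, 90, 88, 75, 95]  # illustrative scores for group 1
group2 = [80, 85, 79, 90, 77]  # illustrative scores for group 2
group3 = [70, 65, 72, 68, 74]  # illustrative scores for group 3

f_stat, p_value = f_oneway(group1, group2, group3)
print(round(f_stat, 2), round(p_value, 4))
# If p_value < 0.05, reject H0 that all group means are equal
```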
Chi-Square Test
The chi-square statistic measures the discrepancy between observed and expected
frequencies and is calculated as:
χ²c = Σ (O − E)² / E
where
c = Degrees of freedom
O = Observed value
E = Expected value
The degrees of freedom in a statistical calculation represent the number of
variables that can vary in a calculation. The degrees of freedom can be calculated
to ensure that chi-square tests are statistically valid. These tests are frequently used
to compare observed data with data that would be expected to be obtained if a
particular hypothesis were true. The Observed values are those we gather
ourselves. The expected values are the frequencies expected based on the null
hypothesis.
A Chi-Square test is fundamentally a data analysis based on the observations of
a random set of variables. It computes how a model equates to actual observed
data. A Chi-Square statistic test is calculated based on the data, which must be raw,
random, drawn from independent variables, drawn from a wide-ranging sample and
mutually exclusive. In simple terms, two sets of statistical data are compared, for
instance, the observed and expected results of tossing a fair coin. Karl Pearson
introduced this test in 1900 for categorical data analysis and distribution. This test is
also known as Pearson's Chi-Squared Test.
Chi-Squared Tests are most commonly used in hypothesis testing. A hypothesis
is an assumption that any given condition might be true, which can be tested
afterwards. The Chi-Square test estimates the size of inconsistency between the
expected results and the actual results when the size of the sample and the number
of variables in the relationship is mentioned.
These tests use degrees of freedom to determine if a particular null hypothesis
can be rejected based on the total number of observations made in the experiments.
The larger the sample size, the more reliable the result.
There are two main types of Chi-Square tests namely:
Independence
Goodness-of-Fit
Independence
The Chi-Square Test of Independence is an inferential statistical test that examines
whether two categorical variables are likely to be related to each other. This test is
used when we have counts of values for two nominal or categorical variables and is
considered a non-parametric test. A relatively large sample size and independence of
observations are the required criteria for conducting this test.
Example:
In a movie theatre, suppose we made a list of movie genres. Let us consider
this as the first variable. The second variable is whether or not the people who
came to watch those genres of movies have bought snacks at the theatre. Here the
null hypothesis is that the genre of the film and whether people bought snacks or
not are unrelated. If this is true, the movie genres do not impact snack sales.
Goodness-Of-Fit
In statistical hypothesis testing, the Chi-Square Goodness-of-Fit test determines
whether a variable is likely to come from a given distribution or not. We must have
a set of data values and the idea of the distribution of this data. We can use this test
when we have value counts for categorical variables. This test provides a way of
deciding whether the data values are a "good enough" fit for our assumed distribution,
i.e., whether they can be regarded as a representative sample of the entire population.
Example:
Suppose we have bags of balls with five different colours in each bag. The
given condition is that the bag should contain an equal number of balls of each
colour. The idea we would like to test here is that the proportions of the five
colours of balls in each bag are equal.
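A minimal sketch of this goodness-of-fit test with scipy.stats, using made-up counts for the five colours and equal expected frequencies under the null hypothesis:

```python
from scipy.stats import chisquare

observed = [18, 22, 20, 25, 15]  # hypothetical counts of balls of each colour (total 100)
expected = [20, 20, 20, 20, 20]  # H0: all five colours are equally represented

chi2_stat, p_value = chisquare(observed, f_exp=expected)
print(round(chi2_stat, 2), round(p_value, 4))
# If p_value < 0.05, reject H0 that the colours occur in equal proportions
```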
Example
Let's say we want to know if gender has anything to do with political party
preference. We poll 440 voters in a simple random sample to find out which
political party they prefer. The results of the survey are shown in the table below:
To see if gender is linked to political party preference, perform a Chi-Square
test of independence using the steps below.
Step 1: Define the Hypothesis
H0: There is no link between gender and political party preference.
H1: There is a link between gender and political party preference.
Step 2: Calculate the Expected Values
Now we will calculate the expected frequency for each cell as:
Expected value = (Row total × Column total) / Grand total
Similarly, we can calculate the expected value for each of the cells and compare it
with the observed counts.
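The remaining steps (computing the chi-square statistic and comparing it with the critical value or p-value) can be sketched with scipy.stats; the contingency table below uses made-up counts, not the survey's actual figures:

```python
from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows = gender (male, female),
# columns = preferred political party (A, B, C)
observed = [[120,  90,  40],
            [ 80, 100,  10]]

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(round(chi2_stat, 2), round(p_value, 4), dof)
# If p_value < 0.05, reject H0: gender and party preference are associated
```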
Problem: Perform a chi-square test to determine whether there is a significant
association between gender and preferred mode of transportation at a significance
level of 0.05.
T-Test in Statistics
A t-test is a statistical test used to compare the means of two groups. It helps
determine if the differences between the groups are statistically significant. There are
different types of t-tests depending on the study design and data structure:
1. One-Sample T-Test: Compares the mean of a single group to a known value.
2. Independent Two-Sample T-Test: Compares the means of two independent groups.
3. Paired Sample T-Test: Compares the means of two related groups (e.g.,
measurements before and after a treatment on the same subjects).
Key Concepts
Null Hypothesis (H0): Assumes no difference between the groups.
Alternative Hypothesis (Ha): Assumes a difference between the groups.
Test Statistic (t): A standardized value used to determine the probability of observing
the test results under the null hypothesis.
Degrees of Freedom (df): Reflects the number of independent values in the data that
can vary.
p-Value: The probability of obtaining a test statistic as extreme as, or more extreme
than, the observed value under the null hypothesis.
Significance Level (α): The threshold for rejecting the null hypothesis, commonly set
at 0.05.
Summary
The t-test is a versatile tool for comparing means across different scenarios. By
understanding the types of t-tests and their appropriate applications, we can effectively
analyze data and draw meaningful conclusions.
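For instance, an independent two-sample t-test can be run with scipy.stats as follows (the scores for the two groups are made up for illustration):

```python
from scipy.stats import ttest_ind

group_a = [82, 75, 91, 68, 77, 84, 80, 73]  # illustrative scores, group A
group_b = [70, 65, 79, 72, 68, 74, 71, 66]  # illustrative scores, group B

t_stat, p_value = ttest_ind(group_a, group_b)
print(round(t_stat, 3), round(p_value, 4))
# If p_value < 0.05, reject H0 that the two group means are equal
```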
Correlation in Statistics
Correlation is a statistical measure that describes the strength and direction of a
relationship between two variables. It quantifies how changes in one variable are
associated with changes in another variable.
Key Concepts
1. Correlation Coefficient (r):
The correlation coefficient, typically denoted as r, ranges from -1 to 1.
r=1: Perfect positive correlation (as one variable increases, the other increases
proportionally).
r=−1: Perfect negative correlation (as one variable increases, the other decreases
proportionally).
r=0: No correlation (no linear relationship between the variables).
2. Types of Correlation:
Positive Correlation: Both variables increase or decrease together.
Negative Correlation: One variable increases while the other decreases.
No Correlation: No consistent pattern of relationship.
3. Correlation vs. Causation:
Correlation does not imply causation. Two variables may be correlated without one
causing the other.
Pearson Correlation Coefficient
The Pearson correlation coefficient measures the linear relationship between two
continuous variables. It is calculated as:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)² )
Equivalently, r is the covariance of X and Y divided by the product of their standard
deviations.
Covariance in Statistics
Covariance is a measure of the joint variability of two random variables. It indicates
the direction of the linear relationship between the variables. If the greater values of
one variable mainly correspond with the greater values of the other variable and the
lesser values correspond similarly, the covariance is positive. If greater values of one
variable mainly correspond with lesser values of the other, the covariance is negative.
Key Concepts
1. Covariance Calculation: The formula for the covariance between two variables X
and Y is:
cov(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1) (sample covariance; divide by n for a population)
2. Interpretation:
Positive Covariance: Indicates that as one variable increases, the other variable tends
to increase.
Negative Covariance: Indicates that as one variable increases, the other variable
tends to decrease.
Zero Covariance: Indicates no linear relationship between the variables.
3. Units of Covariance: The value of covariance is not standardized and is dependent on
the units of the variables. This can make it difficult to compare the strength of
relationships between different pairs of variables.
Examples:
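For instance (with made-up values for hours studied and exam scores), covariance and the Pearson correlation can be computed with numpy as a quick sketch:

```python
import numpy as np

x = np.array([2, 4, 6, 8, 10])      # hours studied (illustrative values)
y = np.array([55, 60, 68, 75, 82])  # exam scores (illustrative values)

cov_xy = np.cov(x, y, ddof=1)[0, 1]  # sample covariance of X and Y
r = np.corrcoef(x, y)[0, 1]          # Pearson correlation coefficient

print(round(cov_xy, 2), round(r, 3))
# Positive covariance and r close to +1 indicate a strong positive linear relationship
```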
Covariance vs. Correlation
While covariance indicates the direction of the linear relationship between variables, it
is not standardized, making it difficult to interpret the strength of the relationship.
Correlation, on the other hand, standardizes the covariance by dividing it by the
product of the standard deviations of the variables, resulting in a dimensionless value
between -1 and 1. This makes correlation easier to interpret and compare across
different data sets.
Summary
Covariance is a fundamental measure in statistics that helps to understand the
relationship between two variables. It forms the basis for more advanced analyses like
correlation and regression. However, its interpretation is often less straightforward
than correlation due to its dependency on the units of the variables involved.