Advanced Statistics for Data Science and Data Analysis (2)
1) Accounting
a) Use of sampling methods for audits
E.G. : Audit sampling is a technique used by auditors to select a representative
sample of data from a larger population to obtain reasonable assurance. This
approach reduces the cost and time of conducting a full audit on the entire
population.
2) Economics
a) Forecasting the future
E.G. : Economic forecasting is the process of attempting to predict the future condition
of the economy using a combination of widely followed indicators. It involves building
statistical models from several key variables, or
indicators, typically in an attempt to come up with a future gross domestic
product (GDP) growth rate. Primary economic indicators include inflation, interest
rates, industrial production, consumer confidence, worker productivity, retail
sales, and unemployment rates.
3) Marketing
a) Marketing Research
E.G. : When a company launches a product, it must first research the market to gauge
consumer demand before deciding how much to produce.
4) Production
a) Quality control charts
E.G.:
Quality control charts are a type of control chart often used by engineers to assess the
performance of a firm's processes or finished products. If issues are detected,
they can easily be compared to their location on the chart for debugging or error
control. In other words, it provides a heuristic blueprint for maintaining quality
control.
Scale of Measurement :
Because data can be collected from many different studies or experiments, it comes in
various types. Broadly speaking, data has four types, and based on the type of data we
apply different statistical methods.
1) Nominal : it is just the name of things, e.g. names of students, cities, classes,
products, animals, countries, student IDs
2) Ordinal : it has all the properties of nominal data, but the key thing is that it has an order to it. E.g.
a) Freshman, Junior, Senior
b) Cold , warm, hot
c) 1st, 2nd , 3rd
3) Interval : It has all the properties of ordinal data, plus something more: the difference
between two values is meaningful. E.g.
a) The temperature on 1st Jan is 30 degrees and on 2nd Jan is 15 degrees, so we
can say that 2nd Jan was 15 degrees colder than 1st Jan.
4) Ratio : It has all the properties of interval data, plus something more: the ratio of two
values is meaningful. E.g.
a) Suppose the age of the father is 40 and the son's age is 10; taking the ratio
(40/10 = 4), we can say the father is 4 times as old as the son.
Statistical Methods
In general we have two types of methods available, depending on what kind of statistics
we want to apply, i.e.
1) Descriptive statistics
2) Statistical Inference
Descriptive Statistics
Descriptive statistics is used when we want to describe the whole dataset in the form of
1) Tabular summary
2) Graphical summary
3) Numerical Summary
Inferential Statistics
Inferential statistics is used when we want to draw conclusions or inferences from data.
E.G. :
Suppose we have a whole population, and to measure something we take a sample from
the population (the sample should be representative of the population).
How large a sample to take depends on whether the data is homogeneous or
heterogeneous. If the data is homogeneous, a small sample will work, e.g. if a doctor
wants to know your blood group, he takes just a drop of blood for study and conclusion
(here the data is homogeneous).
Heterogeneous means the data has a lot of variation in it. In that case we need to take a
large sample for analysis.
Whatever results we get from the sample, we then generalize to the whole population.
This process, i.e. taking a sample from the population, analyzing it, and projecting the
results back to the population, is called inferential statistics.
1. Measures of Central Tendency:
- Mean: The mean (μ) is calculated by summing all the values (xᵢ) in the dataset and
dividing by the number of observations (n): μ = (Σxᵢ) / n. It represents the average value
of a dataset.
Consider a dataset of exam scores for a class of students: 85, 90, 75, 80, and 95.
First, you sum all the values: 85 + 90 + 75 + 80 + 95 = 425
Next, you divide the sum by the number of observations (in this case, 5):
425 / 5 = 85
The mean represents the typical value or average value of the dataset. It provides a
sense of the central value around which the observations tend to cluster. In this
example, a mean score of 85 suggests that, on average, the students performed well on
the exam.
The mean is widely used in various fields, such as analyzing survey responses,
calculating average sales, or determining the average temperature over a period. It
serves as a useful summary statistic for understanding the overall pattern or level of a
dataset.
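As a quick check, here is a minimal sketch in plain Python that reproduces this calculation:

scores = [85, 90, 75, 80, 95]
mean = sum(scores) / len(scores)   # (85 + 90 + 75 + 80 + 95) / 5
print(mean)                        # 85.0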
- Median: The median is the middle value when the data is sorted in ascending or
descending order. For an odd number of observations, it is the value at position (n+1)/2.
For an even number of observations, it is the average of the values at positions n/2 and
(n/2)+1.
It divides the dataset into two equal halves, with 50% of the values above and 50%
below it. The median is particularly useful when there are extreme values or when the
data is not symmetrically distributed.
Suppose we have a dataset representing the ages of 10 individuals: 21, 24, 26, 28, 30,
32, 35, 36, 40, 50.
To find the median, we need to first order the dataset in ascending or descending order:
21, 24, 26, 28, 30, 32, 35, 36, 40, 50.
Since the dataset has an even number of observations (10), the median will be the
average of the two middle values, which are 30 and 32. Thus, the median is:
Median = (30 + 32) / 2 = 31
In this case, the median age of the group is 31. It represents the value that separates
the dataset into two halves, with five values below 31 and five values above 31. The
median is not affected by extreme values, such as the highest value of 50 in this
example, making it a robust measure of central tendency.
The median is particularly useful in situations where the dataset contains outliers or
extreme values that could heavily influence the mean. By using the median, we can get
a more representative measure that is less sensitive to such outliers.
Overall, the median is a valuable statistic in statistics as it provides a central value that
represents the middle of the dataset, making it a reliable measure in various scenarios.
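A short sketch using Python's built-in statistics module confirms this; for an even number of values, statistics.median averages the two middle values:

import statistics

ages = [21, 24, 26, 28, 30, 32, 35, 36, 40, 50]
print(statistics.median(ages))  # 31.0, the average of 30 and 32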
- Mode: The mode is the value(s) that occur(s) most frequently in the dataset.
It is the observation(s) that occur with the highest frequency. The mode is particularly
useful when dealing with categorical or discrete data, although it can also be applied to
continuous data.
Suppose we surveyed ten people about their favorite color and recorded the following
responses:
Red, Blue, Green, Blue, Yellow, Red, Red, Green, Blue, Blue
To find the mode in this dataset, we need to identify the color(s) that occur(s) most
frequently. By examining the data, we can see that "Blue" appears the most, with a
frequency of 4. The other colors, such as "Red" and "Green," occur less frequently.
Therefore, in this example, the mode is "Blue" because it is the value that appears most
often in the dataset.
It's important to note that a dataset can have multiple modes or no mode at all. If there
are two or more values with the same highest frequency, the dataset is said to be
multimodal. In cases where no value is repeated or all values have the same frequency,
the dataset is considered to have no mode.
The mode is a simple and intuitive measure of central tendency that can provide
insights into the most common or popular observations in a dataset. It is particularly
useful when dealing with categorical data, such as favorite colors, types of cars, or
responses to a survey question with multiple choices.
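A small sketch using Python's collections.Counter finds the mode of the color data above:

from collections import Counter

colors = ["Red", "Blue", "Green", "Blue", "Yellow",
          "Red", "Red", "Green", "Blue", "Blue"]
print(Counter(colors).most_common(1))  # [('Blue', 4)] -> the mode is Blue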
2. Measures of Dispersion:
- Range: The range is calculated as the difference between the maximum (X_max)
and minimum (X_min) values: Range = X_max - X_min.
It provides a simple measure of the spread or variability of the data. The range gives
you an idea of how widely the data values are dispersed.
To calculate the range, you need to find the maximum and minimum values in the
dataset. Suppose the daily high temperatures (in degrees Celsius) for a week are 18, 20,
22, 15, 19, 21, and 17. Then the maximum value is 22 (the highest temperature recorded)
and the minimum value is 15 (the lowest temperature recorded).
Range = 22 - 15
Range = 7
So, in this example, the range of the daily high temperatures for the week is 7 degrees
Celsius. This means that the temperatures varied by a range of 7 degrees over the
week.
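Using the assumed temperature readings from above, a one-line Python sketch gives the range:

temps = [18, 20, 22, 15, 19, 21, 17]  # assumed daily highs in degrees Celsius
print(max(temps) - min(temps))        # 22 - 15 = 7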
- Variance: The variance (σ²) measures the average squared deviation of each data
point from the mean. It is calculated as the sum of squared differences between each
value (xᵢ) and the mean (μ), divided by the number of observations (n): σ² = Σ(xᵢ - μ)² / n.
It quantifies how much the individual data points deviate from the mean (average) of the
dataset. A higher variance indicates a greater amount of variability, while a lower
variance indicates less variability.
where:
- xᵢ represents each individual data point in the dataset.
- μ represents the mean of the dataset.
- n represents the number of observations in the dataset.
- Σ denotes the summation of the squared differences.
To illustrate with an example, let's consider a dataset representing the daily sales (in
thousands of dollars) for a store over a week:
Monday: 4
Tuesday: 3
Wednesday: 6
Thursday: 4
Friday: 5
Saturday: 3
Sunday: 4
First, compute the mean: μ = (4 + 3 + 6 + 4 + 5 + 3 + 4) / 7 ≈ 4.14. The sum of squared
deviations from the mean is Σ(xᵢ - μ)² ≈ 6.86, so the variance is σ² ≈ 6.86 / 7 ≈ 0.98.
So, the variance of the daily sales for the week is approximately 0.98 (in thousands of
dollars squared).
The variance provides a measure of how the sales values deviate from the average
sales. A higher variance indicates more variability, suggesting that the sales values are
spread out over a wider range. Conversely, a lower variance indicates less variability,
suggesting that the sales values are more tightly clustered around the mean.
Variance is a fundamental concept in statistics and is used in various analyses and
calculations, such as hypothesis testing, regression analysis, and decision-making
processes.
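A minimal Python sketch reproduces the population variance calculation above:

sales = [4, 3, 6, 4, 5, 3, 4]                      # daily sales in thousands of dollars
n = len(sales)
mu = sum(sales) / n                                # ~4.14
variance = sum((x - mu) ** 2 for x in sales) / n   # population variance (divide by n)
print(round(variance, 2))                          # 0.98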
- Standard Deviation: The standard deviation (σ) is the square root of the
variance. It quantifies the dispersion around the mean: σ = √(σ²).
Consider a dataset representing the daily sales of a store for the past week: 1000,
1200, 900, 1100, 950, 1050, 1000. The mean is μ = 7200 / 7 ≈ 1028.57, the variance is
σ² = Σ(xᵢ - μ)² / 7 ≈ 8469.4, and the standard deviation is σ = √8469.4 ≈ 92.
In this example, the standard deviation tells us that the daily sales figures deviate from
the mean by roughly 92 units on average. This measure gives us an
understanding of the variability or dispersion of the sales data and helps assess the
consistency or fluctuation in sales performance over the week. A higher standard
deviation would indicate more volatility, while a lower standard deviation would suggest
greater stability.
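The same calculation in Python, taking the square root of the population variance:

import math

sales = [1000, 1200, 900, 1100, 950, 1050, 1000]
n = len(sales)
mu = sum(sales) / n
sigma = math.sqrt(sum((x - mu) ** 2 for x in sales) / n)  # population std dev
print(round(sigma, 2))                                    # 92.03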
3. Frequency Distribution:
- To create a frequency distribution, you group data into intervals (bins) and count the
number of observations falling into each interval.
Frequency distribution in statistics involves organizing data into groups or intervals and
determining the count or frequency of observations falling into each group. It provides a
way to summarize and analyze data by presenting it in a more manageable and
meaningful form.
Suppose you have collected data on the ages of a group of individuals, and the dataset
looks like this:
18, 22, 25, 28, 31, 31, 34, 35, 37, 40, 40, 43, 45, 45, 48, 50, 52, 55, 60, 65
To create a frequency distribution, you need to group the ages into intervals or
categories and count the number of individuals falling into each interval. Here's one way
to do it:
Interval | Frequency
---------|----------
18-25    | 3
26-35    | 5
36-45    | 6
46-55    | 4
56-65    | 2
In this example, the data has been grouped into five intervals: 18-25, 26-35, 36-45,
46-55, and 56-65. The frequency column represents the count of individuals falling into
each interval.
From the frequency distribution, we can observe certain characteristics of the data:
- The majority of individuals (6) fall into the 36-45 age group.
- The 56-65 age group has the fewest individuals (2).
- The counts rise toward the middle age groups and taper off at the oldest ages.
Frequency distributions are useful for understanding the distribution patterns, identifying
outliers, and gaining insights into the dataset. They provide a summarized view of the
data that can aid in data analysis and decision-making processes.
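A small Python sketch counts the observations in each interval and reproduces the table above:

ages = [18, 22, 25, 28, 31, 31, 34, 35, 37, 40,
        40, 43, 45, 45, 48, 50, 52, 55, 60, 65]
bins = [(18, 25), (26, 35), (36, 45), (46, 55), (56, 65)]
for lo, hi in bins:
    count = sum(lo <= age <= hi for age in ages)   # inclusive interval membership
    print(f"{lo}-{hi}: {count}")                   # 3, 5, 6, 4, 2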
- Histograms are used to display the distribution of continuous data. The x-axis
represents intervals or bins, and the y-axis represents the frequency or proportion of
observations in each interval.
Let's say we have a dataset of exam scores for a class of 30 students. The scores
range from 60 to 100, and we want to create a histogram to understand the distribution
of scores.
First, divide the score range into equal-width intervals (bins), say 60-64, 65-69, ...,
95-100, and count how many scores fall into each bin. For example, the bar for the
interval 60-64 might have a height of 2, the bar for 65-69 a height of 4, and so on.
These bars are placed adjacent to each other without
any gaps, as the intervals are continuous.
Interpretation:
The resulting histogram visually displays the distribution of the exam scores. It
provides insights into the shape of the distribution, any potential skewness, the central
tendency (such as the peak or mode), and the spread of scores.
In our example, we can observe how the scores are distributed across different
intervals. The histogram may show a bell-shaped distribution, indicating a symmetric
pattern. Alternatively, it could be skewed to the left or right, suggesting an asymmetrical
pattern.
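As a sketch, the following Python code draws such a histogram with matplotlib; the 30 scores here are randomly generated stand-ins, since the original dataset is not listed:

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(80, 8, 30).clip(60, 100)                 # hypothetical exam scores
plt.hist(scores, bins=range(60, 105, 5), edgecolor="black")  # 5-point-wide bins
plt.xlabel("Exam score")
plt.ylabel("Frequency")
plt.title("Distribution of exam scores")
plt.show()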
- Bar charts are used to display the distribution of categorical data. The x-axis
represents categories, and the y-axis represents the frequency or proportion of
observations in each category. The bars are separated and distinct.
Bar charts are particularly useful for visualizing comparisons between different
categories or groups.
Let's say you conducted a survey to determine the favorite fruits of a group of 100
people. The survey participants were asked to choose among four options: apples,
bananas, oranges, and strawberries. The results of the survey are as follows:
- Apples: 30 people
- Bananas: 45 people
- Oranges: 15 people
- Strawberries: 10 people
To create a bar chart based on this data, you would follow these steps:
1. Identify the categories: In this case, the categories are the four fruits: apples,
bananas, oranges, and strawberries.
2. Determine the frequencies: The frequencies represent the number of people who
chose each fruit. In our example, the frequencies are 30, 45, 15, and 10 for apples,
bananas, oranges, and strawberries, respectively.
3. Choose the axes: The x-axis of the bar chart represents the categories (fruits), and
the y-axis represents the frequency or proportion of observations.
4. Plot the bars: Each category is represented by a rectangular bar. The length of each
bar corresponds to the frequency or proportion of observations in that category.
In our example, the bar chart would have the x-axis labeled with the fruit categories
(apples, bananas, oranges, strawberries), and the y-axis labeled with the frequency of
observations. The bars would be positioned above each category on the x-axis, with
lengths proportional to the corresponding frequencies.
The resulting bar chart would visually represent the popularity of each fruit, allowing for
easy comparison. You could observe that bananas are the most preferred fruit, followed
by apples, oranges, and strawberries.
Bar charts provide a clear and concise way to present categorical data, making it easier
to identify patterns, trends, or differences between groups. They are commonly used in
various fields such as market research, social sciences, and public opinion polls to
display and analyze categorical information.
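A minimal matplotlib sketch draws this bar chart from the survey counts above:

import matplotlib.pyplot as plt

fruits = ["Apples", "Bananas", "Oranges", "Strawberries"]
votes = [30, 45, 15, 10]
plt.bar(fruits, votes)                 # one separated bar per category
plt.xlabel("Fruit")
plt.ylabel("Number of people")
plt.title("Favorite fruits (n = 100)")
plt.show()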
Skewness is typically calculated using the third standardized moment. The formula for
skewness is as follows:
Skewness = Σ(xᵢ - μ)³ / (n * σ³)
Where:
- xᵢ represents each individual data point in the dataset.
- μ is the mean of the dataset.
- σ is the standard deviation of the dataset.
- n is the number of observations in the dataset.
The value of skewness can range from negative infinity to positive infinity. A skewness
value of 0 indicates that the data is perfectly symmetrical.
Suppose we have a dataset representing the daily returns of a particular stock over a
month:
-0.02, -0.01, -0.03, 0.01, 0.05, -0.01, -0.04, -0.02, 0.02, -0.01
1. Calculate the mean (μ) and the standard deviation (σ) of the dataset.
- Mean (μ) = (-0.02 + -0.01 + -0.03 + 0.01 + 0.05 + -0.01 + -0.04 + -0.02 + 0.02 +
-0.01) / 10 = -0.006
- Standard Deviation (σ) = √(((-0.02 - (-0.006))² + (-0.01 - (-0.006))² + ... +
(-0.01 - (-0.006))²) / 10) ≈ 0.025
2. Substitute these into the skewness formula: Skewness = Σ(xᵢ - μ)³ / (n * σ³).
By performing these calculations, you can determine the skewness of the dataset. If the
resulting skewness value is negative, it indicates a left-skewed distribution, meaning the
tail of the distribution is stretched towards the left. If the skewness value is positive, it
indicates a right-skewed distribution, with the tail stretched towards the right. If the
skewness value is close to zero, the distribution is relatively symmetrical.
Note: While skewness provides information about the shape of the distribution, it is
important to interpret it in conjunction with other descriptive statistics and consider the
context of the data to draw meaningful conclusions.
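A short sketch with scipy confirms the skewness of these returns (scipy.stats.skew uses the same population formula as above by default):

from scipy.stats import skew

returns = [-0.02, -0.01, -0.03, 0.01, 0.05,
           -0.01, -0.04, -0.02, 0.02, -0.01]
print(skew(returns))  # ~0.89 -> positive, so the returns are right-skewed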
Kurtosis is typically measured with the fourth standardized moment: Kurtosis = μ₄ / σ⁴,
where μ₄ = Σ(xᵢ - μ)⁴ / n is the fourth central moment. Consider two small datasets of
exam scores, Dataset A: 75, 80, 85, 90, 95 and Dataset B: 65, 70, 85, 90, 95. Their
kurtosis values help us compare the shapes of the two distributions.
For Dataset A:
- Mean (μ) = (75 + 80 + 85 + 90 + 95) / 5 = 85
- Variance (σ²) = [(75-85)² + (80-85)² + (85-85)² + (90-85)² + (95-85)²] / 5 = 50, so σ ≈ 7.07
- Fourth Moment (μ₄) = [(75-85)⁴ + (80-85)⁴ + (85-85)⁴ + (90-85)⁴ + (95-85)⁴] / 5 = 4250
- Kurtosis = 4250 / 50² = 1.70
For Dataset B:
- Mean (μ) = (65 + 70 + 85 + 90 + 95) / 5 = 81
- Variance (σ²) = [(65-81)² + (70-81)² + (85-81)² + (90-81)² + (95-81)²] / 5 = 134, so σ ≈ 11.58
- Fourth Moment (μ₄) = [(65-81)⁴ + (70-81)⁴ + (85-81)⁴ + (90-81)⁴ + (95-81)⁴] / 5 = 25082
- Kurtosis = 25082 / 134² ≈ 1.40
Both values are below 3, the kurtosis of a normal distribution, so both datasets are
platykurtic: their tails are lighter than a normal distribution's. Dataset A's kurtosis (1.70)
is somewhat higher than Dataset B's (1.40), meaning that relative to its own spread,
Dataset A places slightly more weight in its tails.
Kurtosis provides insights into the shape of a distribution and helps identify departures
from normality. Understanding the kurtosis of a dataset is useful in various fields such
as finance, economics, and risk analysis, where the shape of a distribution can have
important implications.
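A sketch with scipy reproduces these kurtosis values; fisher=False reports plain kurtosis, where a normal distribution scores 3:

from scipy.stats import kurtosis

a = [75, 80, 85, 90, 95]
b = [65, 70, 85, 90, 95]
print(kurtosis(a, fisher=False))  # 1.70
print(kurtosis(b, fisher=False))  # ~1.40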
- Percentiles divide the data into hundredths, ranging from the first percentile (P₁) to
the 99th percentile (P₉₉). The position (index) of the pth percentile in the sorted data
is: Index = (p/100) * (n + 1).
It indicates the percentage of values that are equal to or below a given value. For
example, the 75th percentile represents the value below which 75% of the data falls.
Consider the following dataset of 30 exam scores:
82, 76, 88, 92, 79, 85, 91, 78, 84, 86,
90, 95, 80, 83, 89, 87, 94, 81, 77, 93,
75, 96, 98, 73, 82, 88, 87, 90, 82, 84
To find the 75th percentile (P75) for this dataset, we follow these steps:
Step 1: Sort the data in ascending order.
Step 2: Compute the position: Index = (75/100) * (30 + 1) = 23.25.
Step 3: Interpolation:
Since the index falls between the 23rd and 24th positions, we consider the values at
those positions in the sorted data:
Value at the 23rd position = 90
Value at the 24th position = 91
Interpolation formula:
P75 = Value at the 23rd position + (Index - 23) * (Value at the 24th position - Value at
the 23rd position) = 90 + 0.25 * (91 - 90) = 90.25
The 75th percentile (P75) of the dataset is 90.25. This means that 75% of the exam
scores are equal to or below 90.25.
Percentiles are useful for understanding the distribution of data and identifying the
relative position of a particular value within a dataset. They provide valuable insights
into how an individual value compares to the rest of the data.
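With numpy 1.22 or later, the same result can be checked directly; method="weibull" uses the (p/100) * (n + 1) position convention described above:

import numpy as np

scores = [82, 76, 88, 92, 79, 85, 91, 78, 84, 86,
          90, 95, 80, 83, 89, 87, 94, 81, 77, 93,
          75, 96, 98, 73, 82, 88, 87, 90, 82, 84]
print(np.percentile(scores, 75, method="weibull"))  # 90.25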
- Quartiles divide the data into quarters. The first quartile (Q₁) is the 25th percentile,
the second quartile (Q₂) is the median (50th percentile), and the third quartile (Q₃) is
the 75th percentile.
Quartiles are a set of three values that divide a dataset into four equal parts, each
containing 25% of the data. They are a type of percentile and provide insights into the
distribution of data and the spread of values.
To calculate quartiles, the dataset is first arranged in ascending order. The three
quartiles are denoted as Q₁, Q₂, and Q₃, representing the first quartile, second quartile
(also known as the median), and third quartile, respectively.
Consider the following dataset representing the scores of 12 students in a math exam:
78, 82, 65, 92, 88, 72, 90, 81, 75, 86, 94, 80
Step 1: Arrange the data in ascending order:
65, 72, 75, 78, 80, 81, 82, 86, 88, 90, 92, 94
- Q₁: The first quartile (Q₁) corresponds to the 25th percentile. It splits the lowest 25%
of the data from the rest. To find its position, multiply 25% by the total number of
observations (12) and add 1: Q₁ = (25/100) * (12 + 1) = 3.25. Since it falls between the
3rd and 4th values, we can take the average of these two values as the first quartile: Q₁
= (75 + 78) / 2 = 76.5.
- Q₂: The second quartile (Q₂) is the median and represents the 50th percentile. It
divides the dataset into two equal halves. In this case, the median is the average of the
two middle values (the 6th and 7th) since we have an even number of observations:
Q₂ = (81 + 82) / 2 = 81.5.
- Q₃: The third quartile (Q₃) corresponds to the 75th percentile. It splits the highest 25%
of the data from the rest. To find its position, multiply 75% by the total number of
observations (12) and add 1: Q₃ = (75/100) * (12 + 1) = 9.75. Since it falls between the
9th and 10th values, we can take the average of these two values as the third quartile:
Q₃ = (88 + 90) / 2 = 89.
In summary, for the given dataset, the quartiles are Q₁ = 76.5, Q₂ = 81.5, and Q₃ = 89.
These values provide insights into the distribution of the scores, indicating how the data
is spread out across the lower, middle, and upper ranges.
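Standard libraries interpolate between the straddling values instead of averaging them, so a small helper function is needed to reproduce the averaging rule used in these notes exactly:

def quartile(data, p):
    # position rule from the notes: (p/100) * (n + 1)
    s = sorted(data)
    pos = (p / 100) * (len(s) + 1)
    if pos == int(pos):
        return s[int(pos) - 1]          # position lands exactly on a value
    lo, hi = s[int(pos) - 1], s[int(pos)]
    return (lo + hi) / 2                # otherwise average the two straddling values

scores = [78, 82, 65, 92, 88, 72, 90, 81, 75, 86, 94, 80]
print(quartile(scores, 25), quartile(scores, 50), quartile(scores, 75))  # 76.5 81.5 89.0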
- Correlation measures the strength and direction of the linear relationship between
two variables. It is calculated using the covariance divided by the product of the
standard deviations of the two variables.
A negative correlation indicates that as one variable increases, the other variable tends
to decrease. For instance, there might be a negative correlation between the amount of
rainfall and the number of hours spent outdoors. As the amount of rainfall increases, the
number of hours spent outdoors decreases.
It's important to note that correlation does not imply causation. Just because two
variables are correlated does not mean that one variable is causing the change in the
other. Correlation simply measures the strength and direction of the relationship
between variables.
Correlation can be calculated using various methods, but one commonly used measure
is Pearson's correlation coefficient (r). The formula for calculating Pearson's correlation
coefficient is:
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / (√(Σ(xᵢ - x̄)²) * √(Σ(yᵢ - ȳ)²))
where xᵢ and yᵢ are the individual values of the two variables, x̄ and ȳ are the means of
the x and y variables, and Σ represents the sum over all observations.
For example, let's consider a dataset that measures the number of hours studied (x)
and the corresponding exam scores (y) for a group of students: x = 4, 6, 3, 7, 5 and
y = 70, 85, 60, 90, 75, so x̄ = 5 and ȳ = 76.
Σ(xᵢ - x̄)(yᵢ - ȳ) = (-1)(-6) + (1)(9) + (-2)(-16) + (2)(14) + (0)(-1) = 75
√(Σ(xᵢ - x̄)²) = √(((4 - 5)²) + ((6 - 5)²) + ((3 - 5)²) + ((7 - 5)²) + ((5 - 5)²)) = √(10)
√(Σ(yᵢ - ȳ)²) = √(((70 - 76)²) + ((85 - 76)²) + ((60 - 76)²) + ((90 - 76)²) + ((75 - 76)²)) =
√(570)
r = 75 / (√10 * √570) = 75 / √5700 ≈ 0.99, a very strong positive correlation between
hours studied and exam scores.
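A numpy sketch confirms the result:

import numpy as np

hours = [4, 6, 3, 7, 5]
scores = [70, 85, 60, 90, 75]
r = np.corrcoef(hours, scores)[0, 1]  # Pearson's r from the correlation matrix
print(round(r, 3))                    # 0.993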
- Covariance measures the direction and magnitude of the relationship between two
variables. It is calculated as the average of the product of the deviations from the means
of the two variables.
The formula for calculating covariance between two variables X and Y, based on a
dataset of n observations, is as follows:
Cov(X, Y) = Σ(Xᵢ - μₓ)(Yᵢ - μᵧ) / n
Where:
- Xᵢ and Yᵢ represent the individual values of X and Y, respectively.
- μₓ and μᵧ are the means of X and Y, respectively.
- Σ denotes the sum of the terms over all n observations.
The resulting covariance value can be positive, negative, or zero, indicating different
types of relationships between the variables:
- A positive covariance indicates that the two variables tend to increase or decrease
together.
- A negative covariance indicates that as one variable increases, the other tends to
decrease.
- A covariance of zero indicates no linear relationship between the variables.
It's important to note that covariance is affected by the scale of measurement of the
variables. Consequently, interpreting the magnitude of covariance can be challenging.
To address this, the concept of correlation is often used, which standardizes the
covariance to provide a value between -1 and 1, indicating the strength and direction of
the linear relationship more explicitly.
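Reusing the hours-studied example from the correlation section, a numpy sketch computes the population covariance (bias=True divides by n, matching the formula above):

import numpy as np

hours = [4, 6, 3, 7, 5]
scores = [70, 85, 60, 90, 75]
print(np.cov(hours, scores, bias=True)[0, 1])  # 75 / 5 = 15.0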
A random variable is a variable in statistics and probability theory that takes on different values
based on the outcome of a random event. It represents a numerical quantity whose value is
uncertain and depends on the outcome of a random experiment.
In simple terms, a random variable is a way to assign a number to each possible outcome of a
random event or experiment. It provides a mathematical representation of the uncertainty
involved in the event.
Let's consider an example to understand the concept of a random variable.
Suppose we are interested in studying the number of children in a randomly selected family. We
can define a random variable, let's say X, to represent the number of children in a family. The
possible values of X can be 0, 1, 2, 3, and so on.
Now, let's assume we have a dataset of 100 families, and we record the number of children in
each family. The observed values might be as follows:
Family 1: X = 2
Family 2: X = 3
Family 3: X = 1
Family 4: X = 0
Family 5: X = 4
...
In this example, X is a discrete random variable since it can only take on specific whole number
values. Each value of X represents the outcome of a random event, which is the selection of a
family and counting the number of children they have.
With this random variable, we can perform various statistical analyses. For instance, we can
calculate the probability of a family having a certain number of children. We can also calculate
the mean (expected value) and variance of the number of children to understand the average
and spread of the data.
1. Discrete random variables: These variables can only take on a countable number of
distinct values. Examples include the number of heads obtained in multiple coin flips, the
number of cars passing through a toll booth in an hour, or the outcome of rolling a fair six-sided
die.
Let's say we roll the die 100 times and record the outcomes. The observed values might be as
follows:
Roll 1: Y = 3
Roll 2: Y = 6
Roll 3: Y = 2
Roll 4: Y = 4
Roll 5: Y = 1
...
In this example, Y is a discrete random variable since it can only take on specific values that
correspond to the numbers on the die. Each value of Y represents the outcome of a random
event, which is the roll of the die.
1. Probability distribution: We can calculate the probability of each possible outcome. Since the
die is fair, each outcome has an equal chance of occurring. So, the probability of Y being any
specific value (1, 2, 3, 4, 5, or 6) is 1/6.
2. Expected value (mean): We can calculate the expected value of Y, which represents the
average value we would expect to obtain if we repeated the experiment many times. For a fair
die, the expected value is (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.
2. Continuous random variables: These variables can take on any value within a certain
range or interval. They are often associated with measurements and are characterized by a
range of possible values. Examples include the height of a person, the time it takes for a
computer program to run, or the temperature at a specific location.
Suppose we are interested in studying the heights of adults in a population. We can define a
continuous random variable, let's call it H, to represent the height of an adult. The possible
values of H can range from the minimum height to the maximum height observed in the
population.
In this example, H is a continuous random variable since height can take on any value within a
certain range or interval. It is not limited to specific discrete values.
Let's assume we have collected data on the heights of a sample of adults. Here's an example of
some observed values:
Person 1: H = 168 cm
Person 2: H = 175 cm
Person 3: H = 160 cm
Person 4: H = 182 cm
Person 5: H = 170 cm
...
In this example, each value of H represents the outcome of a random event, which is the
measurement of an adult's height. The random variable H can take on an infinite number of
possible values within a certain range.
Now, let's explore how we can analyze this continuous random variable:
1. Probability density function (PDF): Instead of calculating probabilities for specific values, we
use the probability density function to describe the likelihood of the random variable taking on
different values. The PDF provides a probability distribution over the range of possible heights.
2. Cumulative distribution function (CDF): The cumulative distribution function gives the
probability that the random variable is less than or equal to a certain value. It describes the
probability distribution in terms of cumulative probabilities.
3. Expected value (mean): We can calculate the expected value of H, which represents the
average height in the population. The expected value provides insights into the typical height of
adults in the population.
4. Variance and standard deviation: We can calculate the variance and standard deviation of H,
which measure the spread or variability in the heights. They indicate how much the actual
heights tend to deviate from the expected value.
By analyzing the probability distribution, expected value, variance, and other statistical
measures associated with the continuous random variable H, we can understand the distribution
of heights in the population, make comparisons between different groups, and perform various
statistical analyses.
The concept of random variables is fundamental in probability theory and statistics, as they
provide a mathematical framework for analyzing and modeling uncertain events and their
outcomes. They allow us to calculate probabilities, expected values, variances, and other
statistical measures, which are essential for making predictions and drawing conclusions based
on observed data.
A discrete distribution is a probability distribution that arises from a discrete random variable,
which can only take on specific values with gaps or intervals between them. In a discrete
distribution, the probability assigned to each possible value of the random variable is non-
negative and sums up to 1.
Let's dive into the example of coin flips and the associated discrete distribution in
more detail, including all calculations.
Let X be the number of heads obtained in three flips of a fair coin. To understand the
distribution associated with this random variable, we need to calculate the probabilities
for each possible outcome. There are 2³ = 8 equally likely sequences of heads and tails,
which gives the following distribution:
X    | 0     | 1     | 2     | 3
P(X) | 0.125 | 0.375 | 0.375 | 0.125
This distribution follows the binomial distribution, as each flip is an independent Bernoulli trial.
The distribution allows us to determine the likelihood of obtaining a specific number of heads in
three coin flips.
We can also calculate additional statistical measures associated with this discrete distribution.
For example:
- Expected value (mean):
E(X) = (0 * 0.125) + (1 * 0.375) + (2 * 0.375) + (3 * 0.125) = 1.5
- Variance:
Var(X) = [(0 - 1.5)^2 * 0.125] + [(1 - 1.5)^2 * 0.375] + [(2 - 1.5)^2 * 0.375] + [(3 - 1.5)^2 * 0.125] =
0.75
By analyzing this discrete distribution, we can make probabilistic predictions, calculate expected
values, variances, and other statistical measures, which are essential for analyzing and drawing
conclusions based on observed data related to the random variable X.
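A scipy sketch reproduces the whole table and both measures at once:

from scipy.stats import binom

X = binom(n=3, p=0.5)                             # number of heads in three fair coin flips
print([round(X.pmf(k), 3) for k in range(4)])     # [0.125, 0.375, 0.375, 0.125]
print(X.mean(), X.var())                          # 1.5 0.75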
Discrete distributions refer to probability distributions that arise from discrete random variables.
A discrete random variable takes on a countable number of distinct values. In other words, it
can only assume specific values with gaps or intervals between them.
Discrete distributions are defined by the probabilities assigned to each possible value of the
random variable. The probabilities are non-negative and sum up to 1, reflecting the likelihood of
each outcome occurring.
1. Bernoulli distribution: This distribution represents a single trial with two possible outcomes,
often denoted as 0 and 1. It is characterized by a parameter p, which is the probability of the
event being a success (1) and the complementary probability (1-p) of it being a failure (0).
Suppose we have a biased coin that has a probability of 0.3 of landing on heads and a
probability of 0.7 of landing on tails. We are interested in studying the outcome of a single flip of
this coin.
In this example, we can define a Bernoulli random variable, let's call it X, to represent the
outcome of the coin flip. We can assign the value 1 to represent heads and the value 0 to
represent tails.
Now, let's explore the properties and characteristics of the Bernoulli distribution in this context:
In this case, the Bernoulli distribution assigns a probability of 0.3 to the outcome heads (X = 1)
and a probability of 0.7 to the outcome tails (X = 0). The probabilities sum up to 1, as required
for a probability distribution.
4. Expected value (mean):
The expected value of a Bernoulli random variable equals the probability of success:
E(X) = p * 1 + q * 0 = p
In our example, E(X) = 0.3.
5. Variance:
The variance of a Bernoulli random variable is the product of the probability of success
(p) and the probability of failure (q = 1 - p):
Var(X) = p * q
In our example, Var(X) = 0.3 * 0.7 = 0.21.
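A scipy sketch for the biased coin above:

from scipy.stats import bernoulli

X = bernoulli(p=0.3)           # biased coin: P(heads) = 0.3
print(X.pmf(1), X.pmf(0))      # 0.3 0.7
print(X.mean(), X.var())       # 0.3 0.21 (= p and p * q)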
Suppose we are interested in studying the number of heads obtained in five flips of a fair coin.
We can model this situation using the binomial distribution.
In this example, we define a binomial random variable, let's call it X, to represent the number of
heads. The possible values of X range from 0 to 5, as we are flipping the coin five times.
Now, let's explain the properties and characteristics of the binomial distribution in this context:
The probability of obtaining exactly k heads in n flips is given by the binomial
probability mass function:
P(X = k) = C(n, k) * p^k * q^(n-k)
where k is the number of heads (ranging from 0 to 5), C(n, k) represents the number of ways to
choose k heads from n flips (given by the binomial coefficient), p is the probability of success
(0.5 in our case), and q is the probability of failure (0.5 in our case).
For instance, let's calculate the probability of getting exactly 3 heads:
P(X = 3) = C(5, 3) * 0.5³ * 0.5² = 10 * 0.125 * 0.25 = 0.3125
5. Expected value (mean):
E(X) = n * p
In our example, E(X) = 5 * 0.5 = 2.5.
6. Variance:
The variance of a binomial random variable can be calculated as the product of the number of
trials (n), the probability of success (p), and the probability of failure (q):
Var(X) = n * p * q
In our example, Var(X) = 5 * 0.5 * 0.5 = 1.25.
The binomial distribution is commonly used to model the number of successes in a fixed
number of independent Bernoulli trials. It allows us to understand the likelihood of obtaining
different numbers of successes.
3. Poisson distribution: This distribution models the number of events that occur in a fixed
interval of time or space. It is often used to represent rare events where the average number of
occurrences is known. The Poisson distribution is characterized by a single parameter, λ
(lambda), which represents the average rate of occurrence.
Suppose we are running a small café, and on average, we receive 10 customer arrivals per
hour during the lunchtime rush. We are interested in studying the number of customer arrivals in
a given hour using the Poisson distribution.
In this example, we define a Poisson random variable, let's call it X, to represent the number of
customer arrivals in a given hour.
Now, let's explain the properties and characteristics of the Poisson distribution in this context:
The probability of exactly k arrivals in an hour is given by the Poisson probability mass
function:
P(X = k) = (λ^k * e^(-λ)) / k!
where k is the number of customer arrivals (ranging from 0 to infinity), e is the mathematical
constant approximately equal to 2.71828, λ is the average rate (10 in our case), and k! denotes
the factorial of k.
For instance, let's calculate the probability of having exactly 8 customer arrivals in one hour:
P(X = 8) = (10⁸ * e⁻¹⁰) / 8! ≈ 0.1126
E(X) = Var(X) = λ
In our example, the expected value and variance of X are both 10.
The Poisson distribution is commonly used to model the number of events occurring in a fixed
interval of time or space when events happen independently and at a constant average rate.
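A scipy sketch for the café example:

from scipy.stats import poisson

X = poisson(mu=10)             # average of 10 customer arrivals per hour
print(round(X.pmf(8), 4))      # 0.1126 -> P(exactly 8 arrivals)
print(X.mean(), X.var())       # 10.0 10.0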
4. Geometric distribution: This distribution models the number of trials required to achieve the
first success in a sequence of independent Bernoulli trials. It is characterized by a parameter p,
the probability of success in each trial.
Let's explore an example of the geometric distribution with a detailed
explanation.
Suppose we are interested in studying the number of coin tosses needed until we obtain the first
heads. We can model this situation using the geometric distribution.
In this example, we define a geometric random variable, let's call it X, to represent the number
of tosses needed until we get heads. The possible values of X range from 1 to infinity.
Now, let's explain the properties and characteristics of the geometric distribution in this context:
P(X = k) = (1 - p)^(k-1) * p
where k is the number of tosses needed (ranging from 1 to infinity), p is the probability of
success (0.5 in our case), and (1 - p)^(k-1) represents the probability of having k-1 consecutive
tails followed by heads on the k-th toss.
For instance, let's calculate the probability of needing exactly 3 tosses until getting the first
heads:
P(X = 3) = (1 - 0.5)² * 0.5 = 0.25 * 0.5 = 0.125
The expected value and variance of a geometric random variable are:
E(X) = 1 / p
Var(X) = q / p^2
In our example, E(X) = 1 / 0.5 = 2 tosses and Var(X) = 0.5 / 0.25 = 2.
The geometric distribution is commonly used to model the number of trials needed until the first
success occurs in a sequence of independent Bernoulli trials.
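A scipy sketch for the fair-coin example (scipy's geom counts the trial on which the first success occurs, matching the definition above):

from scipy.stats import geom

X = geom(p=0.5)                # tosses of a fair coin until the first heads
print(X.pmf(3))                # 0.125 -> two tails followed by heads
print(X.mean(), X.var())       # 2.0 2.0 (= 1/p and q/p**2)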
5. Hypergeometric distribution: This distribution models the number of successes in a sample
drawn without replacement from a finite population.
Suppose we have a standard deck of 52 playing cards, consisting of four suits (hearts,
diamonds, clubs, spades) with 13 cards each. We are interested in studying the probability of
drawing a certain number of hearts when drawing a specific number of cards without
replacement.
In this example, we define a hypergeometric random variable, let's call it X, to represent the
number of hearts in the drawn cards. The possible values of X range from 0 to the minimum of
the number of hearts in the deck and the number of cards drawn.
Now, let's explain the properties and characteristics of the hypergeometric distribution in this
context:
The probability of drawing exactly k hearts when drawing n cards from a deck of N = 52 cards
containing K = 13 hearts is:
P(X = k) = [C(K, k) * C(N - K, n - k)] / C(N, n)
where C(a, b) represents the binomial coefficient, which is the number of ways to choose b
items from a items, and C(N, n) represents the number of ways to choose n items from N items.
For instance, let's calculate the probability of drawing exactly 2 hearts when drawing 5 cards:
P(X = 2) = [C(13, 2) * C(39, 3)] / C(52, 5) = (78 * 9139) / 2598960 ≈ 0.274
The hypergeometric distribution models situations where the sample is drawn without
replacement and the probability of success changes as items are selected.
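A scipy sketch for the card-drawing example; note scipy's parameter naming, where M is the population size, n the number of success states, and N the number of draws:

from scipy.stats import hypergeom

X = hypergeom(M=52, n=13, N=5)  # 52 cards, 13 hearts, 5 cards drawn
print(round(X.pmf(2), 4))       # 0.2743 -> P(exactly 2 hearts)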
There are several different continuous distributions that are commonly used in
statistics and probability theory
1. Normal Distribution:
Let's dive into the explanation of the normal distribution with an example.
The shape of the normal distribution is symmetric, with the mean as its center point. The
standard deviation determines the spread or dispersion of the distribution. The
probability density function (PDF) of the normal distribution is given by the formula:
f(x) = (1 / (σ * √(2π))) * exp(-(x - μ)² / (2σ²))
where f(x) represents the probability density at a given value x, μ is the mean, σ is the
standard deviation, π is a mathematical constant approximately equal to 3.14159, and
exp() denotes the exponential function.
In this example, let's assume the mean height (μ) of the adult population is 170
centimeters and the standard deviation (σ) is 5 centimeters.
With these parameters, we can use the normal distribution to make various probabilistic
statements about the heights of individuals. For example, about 68% of adults fall within
one standard deviation of the mean (165 to 175 cm), and about 95% fall within two
standard deviations (160 to 180 cm).
Additionally, we can calculate z-scores to determine how far a given height is from the
mean in terms of standard deviations. A positive z-score indicates a height above the
mean, while a negative z-score indicates a height below the mean.
The normal distribution is widely used in various fields, including social sciences, natural
sciences, economics, and engineering, as many natural phenomena tend to follow a
roughly normal distribution. By understanding the characteristics of the normal
distribution and using its properties, we can analyze data, make probabilistic
predictions, and draw conclusions about the likelihood of certain events or observations
occurring.
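A scipy sketch for the height example, including the 68% rule and a z-score:

from scipy.stats import norm

H = norm(loc=170, scale=5)       # heights: mean 170 cm, std dev 5 cm
print(H.cdf(175) - H.cdf(165))   # ~0.683 -> share within one std dev of the mean
print(H.cdf(180))                # ~0.977 -> P(height <= 180 cm)
print((180 - 170) / 5)           # 2.0 -> z-score of a 180 cm individual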
2. Uniform Distribution:
The uniform distribution is characterized by a constant probability density over a
specified interval. It is often represented as a rectangle with equal heights across the
interval. The uniform distribution is used when all outcomes in a range are equally likely,
such as rolling a fair die.
Let's explore an example of the uniform distribution with a detailed explanation.
Suppose you have a bag containing 100 numbered balls, ranging from 1 to 100. You
want to randomly select a ball from the bag, and you believe that each ball is equally
likely to be chosen.
In this example, the uniform distribution can be used to model the probability distribution
of selecting a number from the bag. Each number has an equal probability of being
chosen since all the balls are assumed to be identical and equally likely.
1. Interval:
In this case, the interval represents the range of numbers in the bag, which is from 1 to
100. The interval is inclusive, meaning it includes both the lower and upper bounds.
2. Probability Density Function (PDF):
f(x) = 1 / (b - a)
where f(x) represents the probability density at a given value x, and a and b represent
the lower and upper bounds of the interval, respectively.
In our example, the lower bound (a) is 1 and the upper bound (b) is 100. Therefore, the
PDF of the continuous uniform distribution for this example is:
f(x) = 1 / (100 - 1) = 1/99
Strictly speaking, selecting one of 100 numbered balls is a discrete uniform distribution,
so each ball has a probability of 1/100 of being selected; the density 1/99 applies to the
continuous version, where any real value in the interval [1, 100] is possible.
Using the uniform distribution, we can answer various questions and make probabilistic
statements about selecting a number from the bag:
3. Expected Value:
The expected value (mean) of a uniform distribution is calculated as the average of the
lower and upper bounds. In this case, the expected value of randomly selecting a
number from the bag is (1 + 100) / 2 = 50.5.
The uniform distribution is commonly used when all outcomes within an interval are
equally likely. It provides a simple and intuitive way to model situations where each
outcome has the same probability of occurring.
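A scipy sketch of the ball-drawing example as a discrete uniform distribution (randint's upper bound is exclusive):

from scipy.stats import randint

X = randint(low=1, high=101)   # discrete uniform over the integers 1..100
print(X.pmf(42))               # 0.01 -> each ball has probability 1/100
print(X.mean())                # 50.5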
Here are other distributions that we are not going to cover in detail.
3. Exponential Distribution:
The exponential distribution is often used to model the time between events in a
Poisson process, where events occur at a constant rate independently of time. It is
characterized by a decreasing exponential shape and is typically used to model the
duration until an event happens.
4. Beta Distribution:
The beta distribution is a versatile continuous distribution defined on the interval [0, 1]. It
is commonly used to model random variables that represent proportions or probabilities.
The shape of the distribution is controlled by two shape parameters, often denoted as α
and β.
5. Gamma Distribution:
The gamma distribution is a flexible continuous distribution that is often used to model
positive, skewed data. It is commonly used to model waiting times, lifetimes, and other
positive continuous variables. The gamma distribution has two shape parameters: α and
β.
6. Weibull Distribution:
The Weibull distribution is commonly used to model failure times or lifetimes of systems.
It is characterized by its shape parameter (k) and scale parameter (λ). The Weibull
distribution can take on various shapes, including exponential (when k = 1), decreasing
failure rate, and increasing failure rate.
3. Cluster Sampling: The population is divided into clusters or groups, and then a
subset of clusters is randomly selected. All individuals within the selected clusters are
included in the sample.
Once a sample is obtained, researchers can analyze the collected data and draw
conclusions about the population. This is where the concept of a sampling distribution
comes into play.
1. Central Limit Theorem: For a sufficiently large sample size (typically n > 30), the
sampling distribution of the sample mean is approximately normal, regardless of the
shape of the population distribution. This is known as the central limit theorem.
2. Standard Error: The standard deviation of the sampling distribution is called the
standard error. It measures the variability of the statistic across different samples. The
standard error decreases as the sample size increases.
3. Sampling Distribution of the Mean: For the sampling distribution of the sample mean,
the mean of the sampling distribution is equal to the population mean, and the standard
deviation (standard error) is equal to the population standard deviation divided by the
square root of the sample size.
Sampling techniques and the resulting sampling distributions play a crucial role in
inferential statistics, hypothesis testing, and making generalizations about populations
based on sample data. Proper sampling techniques and understanding the
characteristics of sampling distributions help ensure accurate and reliable research
findings.
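A short numpy simulation illustrates these properties: even for a heavily skewed population, the means of repeated samples cluster in an approximately normal shape around the population mean, with spread close to σ/√n:

import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)   # skewed, far from normal
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]
print(np.mean(sample_means))   # ~2.0, the population mean
print(np.std(sample_means))    # ~0.28, close to sigma/sqrt(n) = 2/sqrt(50)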
Example:
Let's consider an example to understand how confidence intervals work:
Suppose you want to estimate the average height of students in a university. You collect
a random sample of 100 students and measure their heights. The sample mean height
is found to be 165 cm, and the standard deviation of the sample is 5 cm.
To construct a 95% confidence interval for the population mean height, you would follow
these steps:
1. Compute the standard error: SE = s / √n = 5 / √100 = 0.5.
2. Find the critical value for 95% confidence: z = 1.96.
3. Compute the interval: 165 ± 1.96 * 0.5 = 165 ± 0.98, i.e. (164.02 cm, 165.98 cm).
Interpretation:
We can interpret the confidence interval as follows: "We are 95% confident that the true
average height of students in the university lies between 164.02 cm and 165.98 cm."
This means that if we were to repeat the sampling process many times and construct
confidence intervals, approximately 95% of those intervals would capture the true
population mean height.
The width of the confidence interval (1.96 cm in this case, i.e. a margin of error of
±0.98 cm) reflects the level of uncertainty in the estimation. A wider interval indicates
higher uncertainty, while a narrower interval indicates greater precision in the estimate.
Confidence intervals provide a range of plausible values for the population parameter
and allow for making statistical inferences about the population based on the sample
data. They help to quantify the uncertainty associated with the estimation process and
provide a measure of the precision and reliability of the estimates.
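A minimal Python sketch of the interval calculation above:

import math

n, x_bar, s = 100, 165.0, 5.0
se = s / math.sqrt(n)                                  # standard error = 0.5
z = 1.96                                               # critical value for 95% confidence
print(round(x_bar - z * se, 2), round(x_bar + z * se, 2))  # 164.02 165.98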
Draw Conclusions:
Based on the comparison, make a conclusion about the null hypothesis. If the null
hypothesis is rejected, it suggests that there is sufficient evidence to support the
alternative hypothesis. If the null hypothesis is not rejected, it indicates that the
evidence is not strong enough to support the alternative hypothesis.
Example:
Suppose a pharmaceutical company has developed a new drug and wants to determine
if it is effective in reducing cholesterol levels. The null hypothesis (H0) would state that
there is no significant difference in cholesterol levels between the drug-treated group
and the control group, while the alternative hypothesis (Ha) would state that there is a
significant difference.
The company conducts a randomized controlled trial with 100 participants, randomly
assigning 50 to receive the new drug and 50 to receive a placebo. After the treatment
period, cholesterol levels are measured for both groups.
Using a t-test as the analysis method, the company calculates the test statistic (e.g., t-
value) based on the sample data and determines the critical value(s) for the chosen
significance level (e.g., α = 0.05).
Suppose the calculated t-value is 2.36, and the critical value for a two-tailed t-test at α =
0.05 is 2.00.
Comparing the test statistic (2.36) with the critical value (2.00), we find that the test
statistic
falls within the rejection region. Therefore, we reject the null hypothesis and conclude
that there is sufficient evidence to support the alternative hypothesis. This suggests that
the new drug is effective in reducing cholesterol levels compared to the placebo.
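As a sketch, scipy can reproduce the critical value used in this comparison (the exact two-tailed critical value for 98 degrees of freedom is about 1.98, close to the 2.00 quoted above):

from scipy.stats import t

t_stat = 2.36                        # t-value reported for the trial
df = 50 + 50 - 2                     # degrees of freedom for two groups of 50
critical = t.ppf(1 - 0.05 / 2, df)   # two-tailed critical value at alpha = 0.05
print(round(critical, 2))            # 1.98
print(abs(t_stat) > critical)        # True -> reject the null hypothesis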
In statistics, inference with two populations refers to the process of drawing conclusions
or making comparisons about two separate populations based on sample data. This
involves conducting hypothesis tests or constructing confidence intervals to assess the
similarity or difference between population parameters.
For example, suppose we compare the average salaries of employees at Company A and
Company B using a two-sample test. If the analysis yields a p-value of 0.02 at a
significance level of 0.05, we
would reject the null hypothesis and conclude that there is a significant difference in the
average salaries between Company A and Company B.
Inference with two populations allows us to make statistical comparisons and draw
conclusions about differences or similarities between two groups of interest. It helps in
making informed decisions, providing evidence for policy changes, or identifying areas
of improvement based on comparisons between populations.
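As a sketch of such a comparison, the following uses hypothetical salary samples (the figures are invented for illustration) and scipy's two-sample t-test:

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
salaries_a = rng.normal(60_000, 8_000, 40)   # hypothetical Company A salaries
salaries_b = rng.normal(64_000, 8_000, 40)   # hypothetical Company B salaries
t_stat, p_value = ttest_ind(salaries_a, salaries_b)
print(t_stat, p_value)   # reject H0 if p_value is below the 0.05 significance level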