Advanced Statistics for Data Science and Data Analysis

Chapter 1 : Applications of Statistics in various fields 🧾

1) Accounting
a) Use of sampling methods for audits
E.G. : Audit sampling is a technique used by auditors to select a representative
sample of data from a larger population to obtain reasonable assurance. This
approach reduces the cost and time of conducting a full audit on the entire
population.
2) Economics
a) Forecasting the future
E.G. : Economic forecasting is the process of attempting to predict the future condition
of the economy using a combination of widely followed indicators. It involves building
statistical models with several key variables, or indicators, as inputs, typically in an
attempt to estimate a future gross domestic product (GDP) growth rate. Primary
economic indicators include inflation, interest rates, industrial production, consumer
confidence, worker productivity, retail sales, and unemployment rates.
3) Marketing
a) Marketing Research
E.G. : When a company launches a product, it must decide how much to produce, so it
first researches the market to understand consumer needs.
4) Production
a) Quality control charts

E.G. :
Quality control charts are a type of control chart often used by engineers to assess the
performance of a firm's processes or finished products. If issues are detected, they can
be traced back to their location on the chart for debugging or error control. In other
words, the chart provides a heuristic blueprint for maintaining quality control.

Scale of Measurement :
Data can be collected from many different studies or experiments, so it comes in various
types. Broadly, data falls into four scales of measurement, and different statistical
methods apply to different types of data.

1) Nominal : values are simply names or labels, e.g. names of students, cities, classes,
products, animals, countries, or student IDs.
2) Ordinal : it has all the properties of nominal data, but the key addition is that the
values have an order. E.g.
a) Freshman, Junior, Senior
b) Cold, warm, hot
c) 1st, 2nd, 3rd

3) Interval : It has all the properties of ordinal data, plus the interval (difference)
between two values is meaningful. E.g.
a) The temperature on 1st Jan is 30 degrees and on 2nd Jan is 15 degrees, so we
can say that 2nd Jan was 15 degrees colder than 1st Jan.

4) Ratio : It has all the properties of interval data, plus the ratio of two values is
meaningful. E.g.
a) Suppose the father's age is 40 and the son's age is 10. Taking the ratio
(40/10 = 4), we can say the father is 4 times as old as the son.

Qualitative and Quantitative Measure

Statistical Methods
In general we have two types of methods available, depending on the kind of analysis
we want to apply, i.e.
1) Descriptive statistics
2) Inferential statistics

Descriptive Statistics
Descriptive statistics are used when we want to describe the whole dataset in the form of:
1) Tabular summary
2) Graphical summary
3) Numerical Summary

Inferential Statistics
Inferential statistics are used when we want to draw conclusions or inferences from data.

E.G. :
Suppose we have a whole population. To measure something about it, we take a
sample from the population (the sample should be representative of the population).

How large a sample to take depends on whether the data is homogeneous or
heterogeneous. If the data is homogeneous, a small sample is enough, e.g. if a doctor
wants to know your blood group, a single drop of blood suffices for the test and the
conclusion (here the data is homogeneous).

Heterogeneous means the data has a lot of variation in it. In that case we need to take a
larger sample for the analysis.

Whatever results we obtain from the sample are then generalized back to the whole
population. This process of taking a sample from the population, analyzing it, and
applying the results back to the population is called inferential statistics.

Chapter 2 : Descriptive Statistics


1. Measures of Central Tendency:

- Mean: The mean (μ) is calculated by summing all the values (xᵢ) in the dataset and
dividing by the number of observations (n): μ = (Σxᵢ) / n. It represents the average value
of a dataset.

Here's an example to illustrate the concept of mean:

Consider a dataset of exam scores for a class of students: 85, 90, 75, 80, and 95.

To find the mean, you add up all the values:

85 + 90 + 75 + 80 + 95 = 425

Next, you divide the sum by the number of observations (in this case, 5):

425 / 5 = 85

Therefore, the mean of the exam scores in this dataset is 85.

The mean represents the typical value or average value of the dataset. It provides a
sense of the central value around which the observations tend to cluster. In this
example, a mean score of 85 suggests that, on average, the students performed well on
the exam.

The mean is widely used in various fields, such as analyzing survey responses,
calculating average sales, or determining the average temperature over a period. It
serves as a useful summary statistic for understanding the overall pattern or level of a
dataset.
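As a quick illustration, here is a minimal Python sketch (standard library only) that reproduces the calculation above; the variable names are just illustrative.

scores = [85, 90, 75, 80, 95]
mean = sum(scores) / len(scores)   # (85 + 90 + 75 + 80 + 95) / 5
print(mean)                        # 85.0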

- Median: The median is the middle value when the data is sorted in ascending or
descending order. For an odd number of observations, it is the value at position (n+1)/2.
For an even number of observations, it is the average of the values at positions n/2 and
(n/2)+1.

It divides the dataset into two equal halves, with 50% of the values above and 50%
below it. The median is particularly useful when there are extreme values or when the
data is not symmetrically distributed.

To understand the concept of median, let's consider an example:

Suppose we have a dataset representing the ages of 10 individuals: 21, 24, 26, 28, 30,
32, 35, 36, 40, 50.

To find the median, we need to first order the dataset in ascending or descending order:

21, 24, 26, 28, 30, 32, 35, 36, 40, 50.

Since the dataset has an even number of observations (10), the median will be the
average of the two middle values, which are 30 and 32. Thus, the median is:

(30 + 32) / 2 = 31.

In this case, the median age of the group is 31. It represents the value that separates
the dataset into two halves, with five values below 31 and five values above 31. The
median is not affected by extreme values, such as the highest value of 50 in this
example, making it a robust measure of central tendency.

The median is particularly useful in situations where the dataset contains outliers or
extreme values that could heavily influence the mean. By using the median, we can get
a more representative measure that is less sensitive to such outliers.

Overall, the median is a valuable statistic in statistics as it provides a central value that
represents the middle of the dataset, making it a reliable measure in various scenarios.
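The same median calculation can be done in Python with the standard library's statistics module; the ages list mirrors the example above, and statistics.median averages the two middle values for an even-sized dataset.

import statistics

ages = [21, 24, 26, 28, 30, 32, 35, 36, 40, 50]
print(statistics.median(ages))   # 31.0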
- Mode: The mode is the value(s) that occur(s) most frequently in the dataset.

It is the observation(s) that occur with the highest frequency. The mode is particularly
useful when dealing with categorical or discrete data, although it can also be applied to
continuous data.

Let's consider an example to understand the concept of mode:

Suppose we have a dataset representing the favorite colors of a group of people:

Red, Blue, Green, Blue, Yellow, Red, Red, Green, Blue, Blue

To find the mode in this dataset, we need to identify the color(s) that occur(s) most
frequently. By examining the data, we can see that "Blue" appears the most, with a
frequency of 4. The other colors, such as "Red" and "Green," occur less frequently.

Therefore, in this example, the mode is "Blue" because it is the value that appears most
often in the dataset.

It's important to note that a dataset can have multiple modes or no mode at all. If there
are two or more values with the same highest frequency, the dataset is said to be
multimodal. In cases where no value is repeated or all values have the same frequency,
the dataset is considered to have no mode.

The mode is a simple and intuitive measure of central tendency that can provide
insights into the most common or popular observations in a dataset. It is particularly
useful when dealing with categorical data, such as favorite colors, types of cars, or
responses to a survey question with multiple choices.
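The same idea in Python: statistics.mode returns the most frequent value, and statistics.multimode returns every value tied for the highest frequency, which covers the multimodal case mentioned above.

import statistics

colors = ["Red", "Blue", "Green", "Blue", "Yellow",
          "Red", "Red", "Green", "Blue", "Blue"]
print(statistics.mode(colors))       # 'Blue' (single most frequent value)
print(statistics.multimode(colors))  # ['Blue'] (all values tied for the top count)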

2. Measures of Dispersion:

- Range: The range is calculated as the difference between the maximum (X_max)
and minimum (X_min) values: Range = X_max - X_min.

It provides a simple measure of the spread or variability of the data. The range gives
you an idea of how widely the data values are dispersed.

Here's an example to illustrate the concept of range:


Consider a dataset representing the daily high temperatures (in degrees Celsius) in a
city for a week:

15, 17, 19, 20, 22, 18, 16

To calculate the range, you need to find the maximum and minimum values in the
dataset. In this case, the maximum value is 22 (the highest temperature recorded) and
the minimum value is 15 (the lowest temperature recorded).

Range = Maximum value - Minimum value

Range = 22 - 15

Range = 7

So, in this example, the range of the daily high temperatures for the week is 7 degrees
Celsius. This means that the temperatures varied by a range of 7 degrees over the
week.

The range is a straightforward measure of dispersion and provides a quick


understanding of the spread of data. However, it can be influenced heavily by extreme
values or outliers, which may not fully represent the overall variability of the dataset.
Therefore, it is often used in conjunction with other measures of dispersion, such as
variance or standard deviation, for a more comprehensive analysis of the spread of
data.
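A one-line Python check of the range calculation, using the temperature list from the example.

temps = [15, 17, 19, 20, 22, 18, 16]
print(max(temps) - min(temps))   # 22 - 15 = 7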

- Variance: The variance (σ²) measures the average squared deviation of each data
point from the mean. It is calculated as the sum of squared differences between each
value (xᵢ) and the mean (μ), divided by the number of observations (n): σ² = Σ(xᵢ - μ)² / n.

It quantifies how much the individual data points deviate from the mean (average) of the
dataset. A higher variance indicates a greater amount of variability, while a lower
variance indicates less variability.

Mathematically, variance is calculated by taking the average of the squared differences


between each data point and the mean. The formula for variance is as follows:

Variance (σ²) = Σ(xᵢ - μ)² / n

where:
- xᵢ represents each individual data point in the dataset.
- μ represents the mean of the dataset.
- n represents the number of observations in the dataset.
- Σ denotes the summation of the squared differences.

To illustrate with an example, let's consider a dataset representing the daily sales (in
thousands of dollars) for a store over a week:

Monday: 4
Tuesday: 3
Wednesday: 6
Thursday: 4
Friday: 5
Saturday: 3
Sunday: 4

Step 1: Calculate the mean:


Mean (μ) = (4 + 3 + 6 + 4 + 5 + 3 + 4) / 7 = 29 / 7 ≈ 4.14

Step 2: Calculate the squared differences from the mean:


Squared differences:
(4 - 4.14)² ≈ 0.0196
(3 - 4.14)² ≈ 1.2996
(6 - 4.14)² ≈ 3.4596
(4 - 4.14)² ≈ 0.0196
(5 - 4.14)² ≈ 0.7396
(3 - 4.14)² ≈ 1.2996
(4 - 4.14)² ≈ 0.0196

Step 3: Calculate the variance:


Variance (σ²) = (0.0196 + 1.2996 + 3.4596 + 0.0196 + 0.7396 + 1.2996 +
0.0196) / 7 ≈ 0.98

So, the variance of the daily sales for the week is approximately 0.98 (in thousands of
dollars squared).

The variance provides a measure of how the sales values deviate from the average
sales. A higher variance indicates more variability, suggesting that the sales values are
spread out over a wider range. Conversely, a lower variance indicates less variability,
suggesting that the sales values are more tightly clustered around the mean.
Variance is a fundamental concept in statistics and is used in various analyses and
calculations, such as hypothesis testing, regression analysis, and decision-making
processes.
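Here is a short Python sketch of the population variance formula above, applied to the daily sales example; it avoids the rounding of the hand calculation, so the result comes out as ≈ 0.98.

sales = [4, 3, 6, 4, 5, 3, 4]   # daily sales in thousands of dollars
mu = sum(sales) / len(sales)    # mean ≈ 4.14
var = sum((x - mu) ** 2 for x in sales) / len(sales)   # population variance
print(round(var, 2))            # 0.98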

- Standard Deviation: The standard deviation (σ) is the square root of the
variance. It quantifies the dispersion around the mean: σ = √(σ²).

Standard deviation is a measure of the dispersion or spread of a dataset. It quantifies


how much the individual data points deviate from the mean, providing information about
the variability or uncertainty within the dataset. A higher standard deviation indicates
greater dispersion, while a lower standard deviation indicates less dispersion.

Here's an explanation of standard deviation with an example:

Consider a dataset representing the daily sales of a store for the past week: 1000,
1200, 900, 1100, 950, 1050, 1000.

1. Calculate the mean:


To calculate the mean, sum up all the values in the dataset and divide it by the
number of observations:
(1000 + 1200 + 900 + 1100 + 950 + 1050 + 1000) / 7 = 7200 / 7 ≈ 1028.57

2. Calculate the deviations:


Find the deviation of each value from the mean by subtracting the mean from each
data point:
1000 - 1028.57 = -28.57
1200 - 1028.57 = 171.43
900 - 1028.57 = -128.57
1100 - 1028.57 = 71.43
950 - 1028.57 = -78.57
1050 - 1028.57 = 21.43
1000 - 1028.57 = -28.57

3. Calculate the squared deviations:


Square each deviation to eliminate negative signs and emphasize differences:
(-28.57)² ≈ 816.24
(171.43)² ≈ 29388.24
(-128.57)² ≈ 16530.24
(71.43)² ≈ 5102.24
(-78.57)² ≈ 6173.24
(21.43)² ≈ 459.24
(-28.57)² ≈ 816.24

4. Calculate the variance:


Find the average of the squared deviations. Divide the sum of squared deviations by
the number of observations:
(816.24 + 29388.24 + 16530.24 + 5102.24 + 6173.24 + 459.24 + 816.24) / 7 ≈
8469.38

5. Calculate the standard deviation:


Take the square root of the variance to obtain the standard deviation:
√8469.38 ≈ 92.03

The standard deviation of the dataset is approximately 92.03.

In this example, the standard deviation tells us that the daily sales figures deviate from
the mean by roughly 92 units on average. This measure gives us an
understanding of the variability or dispersion of the sales data and helps assess the
consistency or fluctuation in sales performance over the week. A higher standard
deviation would indicate more volatility, while a lower standard deviation would suggest
greater stability.
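The same five steps can be reproduced in Python with the statistics module's population functions (pvariance and pstdev divide by n, matching the calculation above).

import statistics

sales = [1000, 1200, 900, 1100, 950, 1050, 1000]
print(round(statistics.pvariance(sales), 2))   # ≈ 8469.39
print(round(statistics.pstdev(sales), 2))      # ≈ 92.03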

3. Frequency Distribution:
- To create a frequency distribution, you group data into intervals (bins) and count the
number of observations falling into each interval.

Frequency distribution in statistics involves organizing data into groups or intervals and
determining the count or frequency of observations falling into each group. It provides a
way to summarize and analyze data by presenting it in a more manageable and
meaningful form.

Let's consider an example to understand frequency distribution:

Suppose you have collected data on the ages of a group of individuals, and the dataset
looks like this:
18, 22, 25, 28, 31, 31, 34, 35, 37, 40, 40, 43, 45, 45, 48, 50, 52, 55, 60, 65
To create a frequency distribution, you need to group the ages into intervals or
categories and count the number of individuals falling into each interval. Here's one way
to do it:

Interval | Frequency
---------------------
18-25    |     3
26-35    |     5
36-45    |     6
46-55    |     4
56-65    |     2

In this example, the data has been grouped into five intervals: 18-25, 26-35, 36-45, 46-
55, and 56-65. The frequency column represents the count of individuals falling into
each interval.

From the frequency distribution, we can observe certain characteristics of the data:
- The majority of individuals (6) fall into the 36-45 age group.
- The 56-65 age group has the fewest individuals (2).
- The middle age groups (26-45) together contain more than half of the individuals.

Frequency distributions can be represented in various forms, such as tables or charts.


Histograms, for example, provide a visual representation of frequency distributions for
continuous data, where the intervals are represented on the x-axis, and the frequency is
shown on the y-axis as the height of the bars.

Frequency distributions are useful for understanding the distribution patterns, identifying
outliers, and gaining insights into the dataset. They provide a summarized view of the
data that can aid in data analysis and decision-making processes.
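A small Python sketch that rebuilds the frequency table above; the interval boundaries are the ones chosen in the example.

ages = [18, 22, 25, 28, 31, 31, 34, 35, 37, 40,
        40, 43, 45, 45, 48, 50, 52, 55, 60, 65]
bins = [(18, 25), (26, 35), (36, 45), (46, 55), (56, 65)]

for low, high in bins:
    count = sum(low <= age <= high for age in ages)   # observations in this interval
    print(f"{low}-{high}: {count}")
# 18-25: 3, 26-35: 5, 36-45: 6, 46-55: 4, 56-65: 2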

4. Histograms and Bar Charts:

- Histograms are used to display the distribution of continuous data. The x-axis
represents intervals or bins, and the y-axis represents the frequency or proportion of
observations in each interval.

A histogram is a graphical representation of the distribution of a dataset. It provides a


visual display of the frequencies or proportions of observations falling into different
intervals or bins. Histograms are commonly used in statistics to understand the shape,
central tendency, and variability of continuous data.

Here's an example to help explain histograms:

Let's say we have a dataset of exam scores for a class of 30 students. The scores
range from 60 to 100, and we want to create a histogram to understand the distribution
of scores.

1. Determine the Intervals (Bins):


Firstly, we need to determine the number and width of intervals (bins) to divide the
range of scores. A common rule is to have around 5-15 bins, but we can adjust it based
on the dataset and desired level of detail. Let's choose 8 intervals for this example.

2. Define the Interval Width:


To determine the interval width, we calculate the range of scores divided by the
number of intervals. In this case, the range is 100 - 60 = 40. So, the interval width is 40 /
8 = 5.

3. Create the Intervals:


Start with the minimum value (60) and create the intervals by adding the interval width
repeatedly. In this example, the intervals would be: 60-64, 65-69, 70-74, 75-79, 80-84,
85-89, 90-94, and 95-100 (the last interval is widened slightly so the maximum score of
100 is included).

4. Count the Frequencies:


Now, count the number of scores that fall into each interval. For example, let's say we
have the following frequencies: 2, 4, 5, 6, 7, 3, 2, 1. These frequencies represent how
many students scored within each interval.

5. Plotting the Histogram:


On the x-axis, we represent the intervals (bins) from step 3. On the y-axis, we
represent the frequency of scores in each interval. Each interval will have a
corresponding bar whose height represents the frequency.

For example, the bar for the interval 60-64 will have a height of 2, the bar for 65-69
will have a height of 4, and so on. These bars are placed adjacent to each other without
any gaps, as the intervals are continuous.

6. Interpretation:
The resulting histogram visually displays the distribution of the exam scores. It
provides insights into the shape of the distribution, any potential skewness, the central
tendency (such as the peak or mode), and the spread of scores.

In our example, we can observe how the scores are distributed across different
intervals. The histogram may show a bell-shaped distribution, indicating a symmetric
pattern. Alternatively, it could be skewed to the left or right, suggesting an asymmetrical
pattern.

By constructing a histogram, we can gain a better understanding of the distribution of


the dataset, identify any patterns or outliers, and make data-driven decisions or
interpretations based on the visual representation of the data.

- Bar charts are used to display the distribution of categorical data. The x-axis
represents categories, and the y-axis represents the frequency or proportion of
observations in each category. The bars are separated and distinct.

Bar charts are particularly useful for visualizing comparisons between different
categories or groups.

Here's an example to illustrate the use of a bar chart:

Let's say you conducted a survey to determine the favorite fruits of a group of 100
people. The survey participants were asked to choose among four options: apples,
bananas, oranges, and strawberries. The results of the survey are as follows:

- Apples: 30 people
- Bananas: 45 people
- Oranges: 15 people
- Strawberries: 10 people

To create a bar chart based on this data, you would follow these steps:

1. Identify the categories: In this case, the categories are the four fruits: apples,
bananas, oranges, and strawberries.

2. Determine the frequencies: The frequencies represent the number of people who
chose each fruit. In our example, the frequencies are 30, 45, 15, and 10 for apples,
bananas, oranges, and strawberries, respectively.
3. Choose the axes: The x-axis of the bar chart represents the categories (fruits), and
the y-axis represents the frequency or proportion of observations.

4. Plot the bars: Each category is represented by a rectangular bar. The length of each
bar corresponds to the frequency or proportion of observations in that category.

In our example, the bar chart would have the x-axis labeled with the fruit categories
(apples, bananas, oranges, strawberries), and the y-axis labeled with the frequency of
observations. The bars would be positioned above each category on the x-axis, with
lengths proportional to the corresponding frequencies.

The resulting bar chart would visually represent the popularity of each fruit, allowing for
easy comparison. You could observe that bananas are the most preferred fruit, followed
by apples, oranges, and strawberries.

Bar charts provide a clear and concise way to present categorical data, making it easier
to identify patterns, trends, or differences between groups. They are commonly used in
various fields such as market research, social sciences, and public opinion polls to
display and analyze categorical information.
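A minimal matplotlib sketch of this bar chart, using the survey counts from the example.

import matplotlib.pyplot as plt

fruits = ["Apples", "Bananas", "Oranges", "Strawberries"]
counts = [30, 45, 15, 10]   # number of people who chose each fruit

plt.bar(fruits, counts)
plt.xlabel("Favorite fruit")
plt.ylabel("Number of people")
plt.title("Favorite fruits of 100 survey respondents")
plt.show()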

5. Measures of Skewness and Kurtosis:

- Skewness measures the asymmetry of a distribution. It is calculated using the third


moment about the mean (μ) and the standard deviation (σ).

Skewness is a measure of the asymmetry of a probability distribution. It indicates


whether the data is skewed to the left (negative skewness) or to the right (positive
skewness) relative to the mean.

Skewness is typically calculated using the third standardized moment. The formula for
skewness is as follows:

Skewness = Σ(xᵢ - μ)³ / (n * σ³)

Where:
- xᵢ represents each individual data point in the dataset.
- μ is the mean of the dataset.
- σ is the standard deviation of the dataset.
- n is the number of observations in the dataset.
The value of skewness can range from negative infinity to positive infinity. A skewness
value of 0 indicates that the data is perfectly symmetrical.

Now, let's consider an example to understand skewness:

Suppose we have a dataset representing the daily returns of a particular stock over a
month:

-0.02, -0.01, -0.03, 0.01, 0.05, -0.01, -0.04, -0.02, 0.02, -0.01

To calculate the skewness, we need to follow these steps:

1. Calculate the mean (μ) and the standard deviation (σ) of the dataset.
- Mean (μ) = (-0.02 + -0.01 + -0.03 + 0.01 + 0.05 + -0.01 + -0.04 + -0.02 + 0.02 +
-0.01) / 10 = -0.006
- Standard Deviation (σ) = √(((-0.02 - (-0.006))² + (-0.01 - (-0.006))² + ... +
(-0.01 - (-0.006))²) / 10)

2. Calculate the numerator of the skewness formula: Σ((xᵢ - μ)³).


- ((-0.02 - (-0.006))³ + (-0.01 - (-0.006))³ + ... + (-0.01 - (-0.006))³)

3. Calculate the denominator of the skewness formula: n * σ³.


- 10 * σ³

4. Divide the numerator by the denominator to obtain the skewness value.

By performing these calculations, you can determine the skewness of the dataset. If the
resulting skewness value is negative, it indicates a left-skewed distribution, meaning the
tail of the distribution is stretched towards the left. If the skewness value is positive, it
indicates a right-skewed distribution, with the tail stretched towards the right. If the
skewness value is close to zero, the distribution is relatively symmetrical.

Note: While skewness provides information about the shape of the distribution, it is
important to interpret it in conjunction with other descriptive statistics and consider the
context of the data to draw meaningful conclusions.
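The snippet below is a direct Python translation of the skewness formula above, applied to the stock-return data; it uses population statistics (dividing by n) to match the formula.

returns = [-0.02, -0.01, -0.03, 0.01, 0.05, -0.01, -0.04, -0.02, 0.02, -0.01]
n = len(returns)
mu = sum(returns) / n                                     # -0.006
sigma = (sum((x - mu) ** 2 for x in returns) / n) ** 0.5  # population standard deviation
skewness = sum((x - mu) ** 3 for x in returns) / (n * sigma ** 3)
print(round(skewness, 2))   # ≈ 0.89: the one large positive return (0.05) stretches the right tail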

- Kurtosis measures the degree of peakedness or flatness of a distribution. It is


calculated using the fourth moment about the mean (μ) and the standard deviation (σ).
It quantifies the degree to which a distribution deviates from a normal distribution in
terms of the concentration of values in the tails.

Kurtosis is typically calculated by comparing the distribution's fourth moment to the


fourth power of its standard deviation. A normal distribution has a kurtosis of 3, so a
kurtosis above 3 (positive excess kurtosis) indicates heavier or more concentrated tails
than a normal distribution, while a kurtosis below 3 (negative excess kurtosis) indicates
lighter or less concentrated tails.

Here's an example to help illustrate the concept of kurtosis:

Consider two datasets of exam scores:

Dataset A: 75, 80, 85, 90, 95


Dataset B: 65, 70, 85, 90, 95

The two datasets have different means and spreads, and they also exhibit different
kurtosis values, which can help us understand their shapes.

For Dataset A:
- Mean (μ) = (75 + 80 + 85 + 90 + 95) / 5 = 85
- Variance (σ²) = [(75-85)² + (80-85)² + (85-85)² + (90-85)² + (95-85)²] / 5 = 50, so σ ≈ 7.07
- Fourth Moment (μ₄) = [(75-85)⁴ + (80-85)⁴ + (85-85)⁴ + (90-85)⁴ + (95-85)⁴] / 5 = 4250

For Dataset B:
- Mean (μ) = (65 + 70 + 85 + 90 + 95) / 5 = 81
- Variance (σ²) = [(65-81)² + (70-81)² + (85-81)² + (90-81)² + (95-81)²] / 5 = 134, so σ ≈ 11.58
- Fourth Moment (μ₄) = [(65-81)⁴ + (70-81)⁴ + (85-81)⁴ + (90-81)⁴ + (95-81)⁴] / 5 = 25082

Now, let's calculate the kurtosis for each dataset:

Kurtosis for Dataset A:

- Kurtosis = μ₄ / σ⁴ = 4250 / 50² = 4250 / 2500 = 1.70

Kurtosis for Dataset B:

- Kurtosis = μ₄ / σ⁴ = 25082 / 134² ≈ 25082 / 17956 ≈ 1.40

Both kurtosis values are below 3, the kurtosis of a normal distribution, so both of these
small datasets have lighter tails than a normal distribution. Comparing the two, Dataset
A has the higher kurtosis (≈ 1.70 versus ≈ 1.40), meaning that, relative to its own
spread, Dataset A places slightly more weight in its tails than Dataset B does.

Kurtosis provides insights into the shape of a distribution and helps identify departures
from normality. Understanding the kurtosis of a dataset is useful in various fields such
as finance, economics, and risk analysis, where the shape of a distribution can have
important implications.
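Here is a short Python sketch that reproduces the two kurtosis values above, using population moments (dividing by n throughout).

def kurtosis(data):
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n   # population variance (sigma squared)
    m4 = sum((x - mu) ** 4 for x in data) / n    # fourth moment about the mean
    return m4 / var ** 2

dataset_a = [75, 80, 85, 90, 95]
dataset_b = [65, 70, 85, 90, 95]
print(round(kurtosis(dataset_a), 2))   # 1.7
print(round(kurtosis(dataset_b), 2))   # ≈ 1.4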

6. Percentiles and Quartiles:

- Percentiles divide the data into hundredths, ranging from the first percentile (P₁) to
the 99th percentile (P₉₉). The position of the pth percentile in the sorted data is given
by: L = (p/100) * (n + 1).

The pth percentile is the value below which approximately p% of the data falls. For
example, the 75th percentile represents the value below which 75% of the data falls.

Here's an example to illustrate percentiles:

Consider a dataset of exam scores for a class of 30 students:

82, 76, 88, 92, 79, 85, 91, 78, 84, 86,
90, 95, 80, 83, 89, 87, 94, 81, 77, 93,
75, 96, 98, 73, 82, 88, 87, 90, 82, 84

To find the 75th percentile (P75) for this dataset, we follow these steps:

Step 1: Sort the data in ascending order:


73, 75, 76, 77, 78, 79, 80, 81, 82, 82,
82, 83, 84, 84, 85, 86, 87, 87, 88, 88,
89, 90, 90, 91, 92, 93, 94, 95, 96, 98

Step 2: Calculate the index corresponding to the 75th percentile:


Index = (75/100) * (n + 1) = (75/100) * (30 + 1) = 23.25
Since the index is not a whole number, we need to interpolate to find the corresponding
value.

Step 3: Interpolation:
To interpolate, we consider the values at the 23rd and 24th positions:
Value at the 23rd position = 90
Value at the 24th position = 91

Interpolation formula:
P75 = Value at the 23rd position + (Index - 23) * (Value at the 24th position - Value at
the 23rd position)

P75 = 90 + (23.25 - 23) * (91 - 90)


P75 = 90 + 0.25 * 1
P75 = 90.25

The 75th percentile (P75) of the dataset is 90.25. This means that 75% of the exam
scores are equal to or below 90.25.

Percentiles are useful for understanding the distribution of data and identifying the
relative position of a particular value within a dataset. They provide valuable insights
into how an individual value compares to the rest of the data.
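A small Python function implementing the (n + 1) position rule and linear interpolation used above; note that numpy's default percentile convention is slightly different, so this sketch follows the text instead.

def percentile(data, p):
    """pth percentile using the (n + 1) position rule with linear interpolation."""
    values = sorted(data)
    pos = (p / 100) * (len(values) + 1)   # e.g. (75/100) * 31 = 23.25
    if pos <= 1:
        return values[0]
    if pos >= len(values):
        return values[-1]
    lower = int(pos)                      # whole part: the 23rd value
    frac = pos - lower                    # fractional part: 0.25
    return values[lower - 1] + frac * (values[lower] - values[lower - 1])

scores = [82, 76, 88, 92, 79, 85, 91, 78, 84, 86,
          90, 95, 80, 83, 89, 87, 94, 81, 77, 93,
          75, 96, 98, 73, 82, 88, 87, 90, 82, 84]
print(percentile(scores, 75))   # 90.25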

- Quartiles divide the data into quarters. The first quartile (Q₁) is the 25th percentile,
the second quartile (Q₂) is the median (50th percentile), and the third quartile (Q₃) is
the 75th percentile.

Quartiles are a set of three values that divide a dataset into four equal parts, each
containing 25% of the data. They are a type of percentile and provide insights into the
distribution of data and the spread of values.

To calculate quartiles, the dataset is first arranged in ascending order. The three
quartiles are denoted as Q₁, Q₂, and Q₃, representing the first quartile, second quartile
(also known as the median), and third quartile, respectively.

Here's an example to illustrate quartiles:

Consider the following dataset representing the scores of 12 students in a math exam:

78, 82, 65, 92, 88, 72, 90, 81, 75, 86, 94, 80
Step 1: Arrange the data in ascending order:

65, 72, 75, 78, 80, 81, 82, 86, 88, 90, 92, 94

Step 2: Calculate the positions of the quartiles:

- Q₁: The first quartile (Q₁) corresponds to the 25th percentile. It splits the lowest 25%
of the data from the rest. To find its position, multiply 25% by the total number of
observations plus one: position of Q₁ = (25/100) * (12 + 1) = 3.25. Since this falls between the
3rd and 4th values (75 and 78), we interpolate: Q₁ = 75 + 0.25 * (78 - 75) = 75.75.

- Q₂: The second quartile (Q₂) is the median and represents the 50th percentile. It
divides the dataset into two equal halves. With an even number of observations, the
median is the average of the two middle values (the 6th and 7th): Q₂ = (81 + 82) / 2 =
81.5.

- Q₃: The third quartile (Q₃) corresponds to the 75th percentile. It splits the highest 25%
of the data from the rest. Its position is (75/100) * (12 + 1) = 9.75. Since this falls
between the 9th and 10th values (88 and 90), we interpolate: Q₃ = 88 + 0.75 * (90 - 88)
= 89.5.

In summary, for the given dataset, the quartiles are Q₁ = 75.75, Q₂ = 81.5, and Q₃ = 89.5.
These values provide insights into the distribution of the scores, indicating how the data
is spread out across the lower, middle, and upper ranges.
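The same interpolation, written out directly for the quartiles of the 12 exam scores (a sketch mirroring the hand calculation above).

scores = sorted([78, 82, 65, 92, 88, 72, 90, 81, 75, 86, 94, 80])
n = len(scores)   # 12

def value_at(position):
    """Value at a fractional 1-based position, with linear interpolation."""
    lower = int(position)
    frac = position - lower
    return scores[lower - 1] + frac * (scores[lower] - scores[lower - 1])

q1 = value_at(0.25 * (n + 1))   # position 3.25 -> 75.75
q2 = value_at(0.50 * (n + 1))   # position 6.5  -> 81.5
q3 = value_at(0.75 * (n + 1))   # position 9.75 -> 89.5
print(q1, q2, q3)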

7. Correlation and Covariance:

- Correlation measures the strength and direction of the linear relationship between
two variables. It is calculated using the covariance divided by the product of the
standard deviations of the two variables.

Correlation is a statistical measure that quantifies the relationship between two


variables. It helps to determine how changes in one variable are associated with
changes in another variable. Correlation is often expressed as a correlation coefficient,
which ranges from -1 to +1.
A positive correlation indicates that as one variable increases, the other variable tends
to increase as well. For example, there might be a positive correlation between the
number of hours studied and exam scores. As the number of hours studied increases,
the exam scores tend to increase as well.

A negative correlation indicates that as one variable increases, the other variable tends
to decrease. For instance, there might be a negative correlation between the amount of
rainfall and the number of hours spent outdoors. As the amount of rainfall increases, the
number of hours spent outdoors decreases.

A correlation coefficient of 0 suggests no linear relationship between the variables. In


other words, the variables are not associated with each other in a predictable manner.

It's important to note that correlation does not imply causation. Just because two
variables are correlated does not mean that one variable is causing the change in the
other. Correlation simply measures the strength and direction of the relationship
between variables.

Correlation can be calculated using various methods, but one commonly used measure
is Pearson's correlation coefficient (r). The formula for calculating Pearson's correlation
coefficient is:

r = Σ((xᵢ - x̄) * (yᵢ - ȳ)) / (√(Σ(xᵢ - x̄)²) * √(Σ(yᵢ - ȳ)²))

where xᵢ and yᵢ are the individual values of the two variables, x̄ and ȳ are the means of
the x and y variables, and Σ represents the sum over all observations.

For example, let's consider a dataset that measures the number of hours studied (x)
and the corresponding exam scores (y) for a group of students:

Hours studied (x): 4, 6, 3, 7, 5


Exam scores (y): 70, 85, 60, 90, 75

Using the formula, we can calculate the correlation coefficient:

First, calculate the means:


ȳ = (70 + 85 + 60 + 90 + 75) / 5 = 76
x̄ = (4 + 6 + 3 + 7 + 5) / 5 = 5

Then, calculate the sums:


Σ((xᵢ - x̄) * (yᵢ - ȳ)) = ((4 - 5) * (70 - 76)) + ((6 - 5) * (85 - 76)) + ((3 - 5) * (60 - 76)) + ((7 -
5) * (90 - 76)) + ((5 - 5) * (75 - 76)) = 6 + 9 + 32 + 28 + 0 = 75

√(Σ(xᵢ - x̄)²) = √(((4 - 5)²) + ((6 - 5)²) + ((3 - 5)²) + ((7 - 5)²) + ((5 - 5)²)) = √(10)

√(Σ(yᵢ - ȳ)²) = √(((70 - 76)²) + ((85 - 76)²) + ((60 - 76)²) + ((90 - 76)²) + ((75 - 76)²)) =
√(570)

Finally, calculate the correlation coefficient:


r = 75 / (√(10) * √(570)) ≈ 0.99

The correlation coefficient (r) in this example is approximately 0.99, indicating a very
strong positive correlation between the number of hours studied and exam scores:
students who studied more hours tended to score higher.
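A direct Python translation of Pearson's formula for this dataset (population form; numpy.corrcoef would give the same value).

from math import sqrt

x = [4, 6, 3, 7, 5]        # hours studied
y = [70, 85, 60, 90, 75]   # exam scores
x_bar = sum(x) / len(x)    # 5
y_bar = sum(y) / len(y)    # 76

num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))   # 75
den = sqrt(sum((xi - x_bar) ** 2 for xi in x)) * sqrt(sum((yi - y_bar) ** 2 for yi in y))
print(round(num / den, 3))   # ≈ 0.993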

- Covariance measures the direction and magnitude of the relationship between two
variables. It is calculated as the average of the product of the deviations from the means
of the two variables.

Covariance is a statistical measure that quantifies the relationship between two


variables. It indicates how changes in one variable are associated with changes in
another variable. Specifically, covariance measures the direction and magnitude of the
linear relationship between two variables.

The formula for calculating covariance between two variables X and Y, based on a
dataset of n observations, is as follows:

Cov(X, Y) = Σ[(Xᵢ - μₓ)(Yᵢ - μᵧ)] / n

Where:
- Xᵢ and Yᵢ represent the individual values of X and Y, respectively.
- μₓ and μᵧ are the means of X and Y, respectively.
- Σ denotes the sum of the terms over all n observations.

The resulting covariance value can be positive, negative, or zero, indicating different
types of relationships between the variables:

- Positive Covariance: A positive covariance indicates a direct or positive relationship


between the variables. This means that as one variable increases, the other tends to
increase as well. For example, consider a dataset that examines the relationship
between hours studied (X) and exam scores (Y) for a group of students. If students who
study more hours tend to achieve higher scores, the covariance between X and Y would
be positive.

- Negative Covariance: A negative covariance indicates an inverse or negative


relationship between the variables. This means that as one variable increases, the other
tends to decrease. For instance, let's consider a dataset that examines the relationship
between outdoor temperature (X) and hot beverage sales (Y) over a period of time. As the
temperature rises, hot beverage sales tend to decrease. In this case, the covariance
between X and Y would be negative.

- Zero Covariance: A covariance of zero indicates no linear relationship between the


variables. This means that changes in one variable are not associated with changes in
the other. However, it does not necessarily imply that there is no relationship between
the variables, as they could have a non-linear or non-monotonic relationship. For
example, if we analyze the relationship between an adult's age (X) and their shoe size
(Y), the covariance would likely be close to zero since, among adults, age and shoe size
are not linearly related.

It's important to note that covariance is affected by the scale of measurement of the
variables. Consequently, interpreting the magnitude of covariance can be challenging.
To address this, the concept of correlation is often used, which standardizes the
covariance to provide a value between -1 and 1, indicating the strength and direction of
the linear relationship more explicitly.
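A short sketch of the covariance formula above, reusing the hours-studied and exam-score data from the correlation example; dividing by n gives the population covariance.

x = [4, 6, 3, 7, 5]        # hours studied
y = [70, 85, 60, 90, 75]   # exam scores
mx = sum(x) / len(x)
my = sum(y) / len(y)

cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)
print(cov)   # 15.0 -> positive: more hours studied go together with higher scores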

Descriptive statistics provide a variety of measures and techniques to summarize and


analyze data, allowing for a comprehensive understanding of its characteristics and
distribution.
Chapter 3 : Random Variable

A random variable is a variable in statistics and probability theory that takes on different values
based on the outcome of a random event. It represents a numerical quantity whose value is
uncertain and depends on the outcome of a random experiment.

In simple terms, a random variable is a way to assign a number to each possible outcome of a
random event or experiment. It provides a mathematical representation of the uncertainty
involved in the event.
Let's consider an example to understand the concept of a random variable.

Suppose we are interested in studying the number of children in a randomly selected family. We
can define a random variable, let's say X, to represent the number of children in a family. The
possible values of X can be 0, 1, 2, 3, and so on.

Now, let's assume we have a dataset of 100 families, and we record the number of children in
each family. The observed values might be as follows:

Family 1: X = 2
Family 2: X = 3
Family 3: X = 1
Family 4: X = 0
Family 5: X = 4
...

In this example, X is a discrete random variable since it can only take on specific whole number
values. Each value of X represents the outcome of a random event, which is the selection of a
family and counting the number of children they have.

With this random variable, we can perform various statistical analyses. For instance, we can
calculate the probability of a family having a certain number of children. We can also calculate
the mean (expected value) and variance of the number of children to understand the average
and spread of the data.

Random variables can be classified into two main types:


1) discrete random variables
2) continuous random variables

1. Discrete random variables: These variables can only take on a countable number of
distinct values. Examples include the number of heads obtained in multiple coin flips, the
number of cars passing through a toll booth in an hour, or the outcome of rolling a fair six-sided
die.

Let's consider an example of a discrete random variable to illustrate its concept.


Suppose we are interested in studying the outcomes of rolling a fair six-sided die. We can
define a random variable, let's call it Y, to represent the outcome of each roll. The possible
values of Y are the numbers 1, 2, 3, 4, 5, and 6.

Let's say we roll the die 100 times and record the outcomes. The observed values might be as
follows:

Roll 1: Y = 3
Roll 2: Y = 6
Roll 3: Y = 2
Roll 4: Y = 4
Roll 5: Y = 1
...

In this example, Y is a discrete random variable since it can only take on specific values that
correspond to the numbers on the die. Each value of Y represents the outcome of a random
event, which is the roll of the die.

Now, let's see how we can analyze this random variable:

1. Probability distribution: We can calculate the probability of each possible outcome. Since the
die is fair, each outcome has an equal chance of occurring. So, the probability of Y being any
specific value (1, 2, 3, 4, 5, or 6) is 1/6.

2. Expected value (mean): We can calculate the expected value of Y, which represents the
average value we would expect to obtain if we repeated the experiment many times. For a fair
die, the expected value is (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.

3. Variance: We can calculate the variance of Y, which measures the spread or


variability of the random variable. For the fair die, the variance is [(1-3.5)^2 + (2-
3.5)^2 + (3-3.5)^2 + (4-3.5)^2 + (5-3.5)^2 + (6-3.5)^2]/6 ≈ 2.92.
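A quick Monte Carlo check of these numbers in Python: simulating many rolls of a fair die should give a sample mean near 3.5 and a population variance near 2.92.

import random

random.seed(0)   # reproducible illustration
rolls = [random.randint(1, 6) for _ in range(100_000)]

mean = sum(rolls) / len(rolls)
var = sum((r - mean) ** 2 for r in rolls) / len(rolls)
print(round(mean, 2), round(var, 2))   # close to 3.5 and 2.92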

2. Continuous random variables: These variables can take on any value within a certain
range or interval. They are often associated with measurements and are characterized by a
range of possible values. Examples include the height of a person, the time it takes for a
computer program to run, or the temperature at a specific location.

Let's consider a detailed example of a continuous random variable with a proper


explanation.

Example: Heights of Adults

Suppose we are interested in studying the heights of adults in a population. We can define a
continuous random variable, let's call it H, to represent the height of an adult. The possible
values of H can range from the minimum height to the maximum height observed in the
population.

In this example, H is a continuous random variable since height can take on any value within a
certain range or interval. It is not limited to specific discrete values.

Let's assume we have collected data on the heights of a sample of adults. Here's an example of
some observed values:

Person 1: H = 168 cm
Person 2: H = 175 cm
Person 3: H = 160 cm
Person 4: H = 182 cm
Person 5: H = 170 cm
...

In this example, each value of H represents the outcome of a random event, which is the
measurement of an adult's height. The random variable H can take on an infinite number of
possible values within a certain range.

Now, let's explore how we can analyze this continuous random variable:

1. Probability density function (PDF): Instead of calculating probabilities for specific values, we
use the probability density function to describe the likelihood of the random variable taking on
different values. The PDF provides a probability distribution over the range of possible heights.

2. Cumulative distribution function (CDF): The cumulative distribution function gives the
probability that the random variable is less than or equal to a certain value. It describes the
probability distribution in terms of cumulative probabilities.

3. Expected value (mean): We can calculate the expected value of H, which represents the
average height in the population. The expected value provides insights into the typical height of
adults in the population.

4. Variance and standard deviation: We can calculate the variance and standard deviation of H,
which measure the spread or variability in the heights. They indicate how much the actual
heights tend to deviate from the expected value.

By analyzing the probability distribution, expected value, variance, and other statistical
measures associated with the continuous random variable H, we can understand the distribution
of heights in the population, make comparisons between different groups, and perform various
statistical analyses.
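If we are willing to assume adult heights are roughly normally distributed (a common modelling assumption, not something stated above), Python's statistics.NormalDist illustrates the PDF and CDF ideas; the mean of 170 cm and standard deviation of 10 cm are hypothetical.

from statistics import NormalDist

height = NormalDist(mu=170, sigma=10)      # hypothetical population of adult heights (cm)

print(height.pdf(175))                     # density at 175 cm (not itself a probability)
print(height.cdf(180))                     # P(H <= 180 cm)
print(height.cdf(180) - height.cdf(160))   # P(160 <= H <= 180) ≈ 0.683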
The concept of random variables is fundamental in probability theory and statistics, as they
provide a mathematical framework for analyzing and modeling uncertain events and their
outcomes. They allow us to calculate probabilities, expected values, variances, and other
statistical measures, which are essential for making predictions and drawing conclusions based
on observed data.

Chapter 4: Discrete Distributions

A discrete distribution is a probability distribution that arises from a discrete random variable,
which can only take on specific values with gaps or intervals between them. In a discrete
distribution, the probability assigned to each possible value of the random variable is non-
negative and sums up to 1.

Let's dive into the example of coin flips and the associated discrete distribution in
more detail, including all calculations.

Example: Coin Flips


Suppose we have a fair coin and we are interested in studying the number of heads obtained in
three flips. We define a discrete random variable, X, to represent the number of heads. The
possible values of X are 0, 1, 2, and 3.

To understand the distribution associated with this random variable, we need to calculate the
probabilities for each possible outcome.

1. Probability of 0 heads (X = 0):


In three flips, the probability of getting tails in each flip is 0.5. Since each flip is independent, the
probability of getting no heads in three flips is:
P(X = 0) = (0.5) * (0.5) * (0.5) = 0.125

2. Probability of 1 head (X = 1):


There are three ways to obtain exactly one head in three flips: HTT, THT, and TTH. Each
combination has a probability of (0.5) * (0.5) * (0.5) = 0.125. Since there are three combinations,
the probability of getting exactly one head is:
P(X = 1) = 3 * 0.125 = 0.375

3. Probability of 2 heads (X = 2):


Similarly, there are three ways to obtain exactly two heads in three flips: HHT, HTH, and THH.
Each combination has a probability of (0.5) * (0.5) * (0.5) = 0.125. Therefore, the probability of
getting exactly two heads is:
P(X = 2) = 3 * 0.125 = 0.375

4. Probability of 3 heads (X = 3):


In three flips, there is only one way to obtain three heads: HHH. The probability of getting three
heads is:
P(X = 3) = (0.5) * (0.5) * (0.5) = 0.125

The probabilities assigned to each outcome add up to 1, as required for a probability


distribution.

Now, let's summarize the probability distribution of X:

X    |   0   |   1   |   2   |   3
-----------------------------------
P(X) | 0.125 | 0.375 | 0.375 | 0.125

This distribution follows the binomial distribution, as each flip is an independent Bernoulli trial.
The distribution allows us to determine the likelihood of obtaining a specific number of heads in
three coin flips.

We can also calculate additional statistical measures associated with this discrete distribution.
For example:
- Expected value (mean):
E(X) = (0 * 0.125) + (1 * 0.375) + (2 * 0.375) + (3 * 0.125) = 1.5

- Variance:
Var(X) = [(0 - 1.5)^2 * 0.125] + [(1 - 1.5)^2 * 0.375] + [(2 - 1.5)^2 * 0.375] + [(3 - 1.5)^2 * 0.125] =
0.75

By analyzing this discrete distribution, we can make probabilistic predictions, calculate expected
values, variances, and other statistical measures, which are essential for analyzing and drawing
conclusions based on observed data related to the random variable X.
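The distribution above can be reproduced exactly in Python with math.comb, which counts the combinations (for example, the three orderings with exactly one head).

from math import comb

n, p = 3, 0.5
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
print(pmf)   # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}

expected = sum(k * prob for k, prob in pmf.items())                     # 1.5
variance = sum((k - expected) ** 2 * prob for k, prob in pmf.items())   # 0.75
print(expected, variance)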

Discrete distributions refer to probability distributions that arise from discrete random variables.
A discrete random variable takes on a countable number of distinct values. In other words, it
can only assume specific values with gaps or intervals between them.

Discrete distributions are defined by the probabilities assigned to each possible value of the
random variable. The probabilities are non-negative and sum up to 1, reflecting the likelihood of
each outcome occurring.

Common examples of discrete distributions include:

1. Bernoulli distribution: This distribution represents a single trial with two possible outcomes,
often denoted as 0 and 1. It is characterized by a parameter p, which is the probability of the
event being a success (1) and the complementary probability (1-p) of it being a failure (0).

Let's consider an example of the Bernoulli distribution with an explanation.

Example: Flipping a Biased Coin

Suppose we have a biased coin that has a probability of 0.3 of landing on heads and a
probability of 0.7 of landing on tails. We are interested in studying the outcome of a single flip of
this coin.
In this example, we can define a Bernoulli random variable, let's call it X, to represent the
outcome of the coin flip. We can assign the value 1 to represent heads and the value 0 to
represent tails.

Now, let's explore the properties and characteristics of the Bernoulli distribution in this context:

1. Probability of success (p):


The Bernoulli distribution is characterized by a parameter p, which represents the probability of
success. In our example, success refers to the coin landing on heads. So, p = 0.3.

2. Probability of failure (q):


The probability of failure, denoted as q, is simply the complement of p. In our example, q = 1 - p
= 0.7.

3. Probability mass function (PMF):


The probability mass function of the Bernoulli distribution gives the probability of each possible
outcome. For the biased coin flip, the PMF of X can be expressed as:

P(X = 1) = p = 0.3 (probability of heads)


P(X = 0) = q = 0.7 (probability of tails)

In this case, the Bernoulli distribution assigns a probability of 0.3 to the outcome heads (X = 1)
and a probability of 0.7 to the outcome tails (X = 0). The probabilities sum up to 1, as required
for a probability distribution.

4. Expected value (mean):


The expected value of a Bernoulli random variable can be calculated as the product of the
probability of success (p) and the value assigned to success (1), plus the product of the
probability of failure (q) and the value assigned to failure (0):

E(X) = p * 1 + q * 0 = p

In our example, the expected value of X is E(X) = 0.3.

5. Variance:
The variance of a Bernoulli random variable is the product of the probability of success
(p) and the probability of failure (q):

Var(X) = p * q = p * (1 - p)

In our example, the variance of X is Var(X) = 0.3 * 0.7 = 0.21.


The Bernoulli distribution is often used to model binary outcomes or events with only two
possible outcomes. It is a fundamental distribution in probability theory and serves as the
building block for other distributions, such as the binomial distribution.
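A tiny Python sketch of this Bernoulli setup: the exact mean and variance, plus a simulation of many biased flips to confirm them.

import random

p = 0.3                       # probability of heads
print(p, p * (1 - p))         # exact E(X) = 0.3 and Var(X) = 0.21

random.seed(1)                # reproducible illustration
flips = [1 if random.random() < p else 0 for _ in range(100_000)]
print(sum(flips) / len(flips))   # sample mean, close to 0.3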

2. Binomial distribution: This distribution describes the number of successes in a fixed


number of independent Bernoulli trials. It is characterized by two parameters: n, the number of
trials, and p, the probability of success in each trial.

Let's explore an example of the binomial distribution with a proper explanation.

Example: Multiple Flips of a Fair Coin

Suppose we are interested in studying the number of heads obtained in five flips of a fair coin.
We can model this situation using the binomial distribution.

In this example, we define a binomial random variable, let's call it X, to represent the number of
heads. The possible values of X range from 0 to 5, as we are flipping the coin five times.

Now, let's explain the properties and characteristics of the binomial distribution in this context:

1. Number of trials (n):


The binomial distribution is characterized by the number of trials, denoted as n. In our example,
n = 5 since we are flipping the coin five times.

2. Probability of success (p):


The probability of success, denoted as p, represents the probability of obtaining a head in a
single coin flip. Since we have a fair coin, p = 0.5.

3. Probability of failure (q):


The probability of failure, denoted as q, is simply the complement of p. In our case, q = 1 - p = 1
- 0.5 = 0.5.

4. Probability mass function (PMF):


The probability mass function of the binomial distribution gives the probability of each possible
outcome. For our example, the PMF of X can be expressed as:

P(X = k) = C(n, k) * p^k * q^(n-k)

where k is the number of heads (ranging from 0 to 5), C(n, k) represents the number of ways to
choose k heads from n flips (given by the binomial coefficient), p is the probability of success
(0.5 in our case), and q is the probability of failure (0.5 in our case).
For instance, let's calculate the probability of getting exactly 3 heads:

P(X = 3) = C(5, 3) * (0.5)^3 * (0.5)^(5-3)

The binomial coefficient C(5, 3) can be calculated as:

C(5, 3) = 5! / (3! * (5-3)!) = 10

Substituting these values into the equation gives P(X = 3) = 10 * 0.125 * 0.25 = 0.3125.

5. Expected value (mean):


The expected value of a binomial random variable can be calculated as the product of the
number of trials (n) and the probability of success (p):

E(X) = n * p

In our example, the expected value of X is E(X) = 5 * 0.5 = 2.5.

6. Variance:
The variance of a binomial random variable can be calculated as the product of the number of
trials (n), the probability of success (p), and the probability of failure (q):

Var(X) = n * p * q

In our example, the variance of X is Var(X) = 5 * 0.5 * 0.5 = 1.25.

The binomial distribution is commonly used to model the number of successes in a fixed
number of independent Bernoulli trials. It allows us to understand the likelihood of obtaining
different numbers of successes.

3. Poisson distribution: This distribution models the number of events that occur in a fixed
interval of time or space. It is often used to represent rare events where the average number of
occurrences is known. The Poisson distribution is characterized by a single parameter, λ
(lambda), which represents the average rate of occurrence.

Let's explore an example of the Poisson distribution with a detailed explanation.

Example: Number of Customer Arrivals

Suppose we are running a small café, and on average, we receive 10 customer arrivals per
hour during the lunchtime rush. We are interested in studying the number of customer arrivals in
a given hour using the Poisson distribution.
In this example, we define a Poisson random variable, let's call it X, to represent the number of
customer arrivals in a given hour.

Now, let's explain the properties and characteristics of the Poisson distribution in this context:

1. Average rate (λ):


The Poisson distribution is characterized by an average rate parameter, denoted as λ (lambda).
In our example, λ = 10, as we receive an average of 10 customer arrivals per hour during the
lunchtime rush.

2. Probability mass function (PMF):


The probability mass function of the Poisson distribution gives the probability of each possible
outcome. For the number of customer arrivals, the PMF of X can be expressed as:

P(X = k) = (e^(-λ) * λ^k) / k!

where k is the number of customer arrivals (ranging from 0 to infinity), e is the mathematical
constant approximately equal to 2.71828, λ is the average rate (10 in our case), and k! denotes
the factorial of k.

For instance, let's calculate the probability of having exactly 8 customer arrivals in one hour:

P(X = 8) = (e^(-10) * 10^8) / 8!

By substituting the values into the equation, we get P(X = 8) ≈ 0.113.

3. Expected value (mean) and Variance:


The expected value (mean) and variance of a Poisson random variable are both equal to the
average rate (λ).

E(X) = Var(X) = λ

In our example, the expected value and variance of X are both 10.

The Poisson distribution is commonly used to model the number of events occurring in a fixed
interval of time or space when events happen independently and at a constant average rate.
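The Poisson probability above can be evaluated directly in Python; the loop also shows the most likely arrival counts around the rate λ = 10.

from math import exp, factorial

lam = 10   # average arrivals per hour

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

print(round(poisson_pmf(8, lam), 4))   # P(X = 8) ≈ 0.1126

for k in range(5, 16):                 # probabilities near the mean
    print(k, round(poisson_pmf(k, lam), 3))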

4. Geometric distribution: This distribution models the number of trials required to achieve the
first success in a sequence of independent Bernoulli trials. It is characterized by a parameter p,
the probability of success in each trial.
Let's explore an example of the geometric distribution with a detailed
explanation.

Example: Tossing a Coin until Getting Heads

Suppose we are interested in studying the number of coin tosses needed until we obtain the first
heads. We can model this situation using the geometric distribution.

In this example, we define a geometric random variable, let's call it X, to represent the number
of tosses needed until we get heads. The possible values of X range from 1 to infinity.

Now, let's explain the properties and characteristics of the geometric distribution in this context:

1. Probability of success (p):


The geometric distribution is characterized by the probability of success, denoted as p. In our
example, p represents the probability of getting heads in a single coin toss. Since we have a fair
coin, p = 0.5.

2. Probability of failure (q):


The probability of failure, denoted as q, is simply the complement of p. In our case, q = 1 - p =
0.5.

3. Probability mass function (PMF):


The probability mass function of the geometric distribution gives the probability of each possible
outcome. For the number of tosses needed until getting the first heads, the PMF of X can be
expressed as:

P(X = k) = (1 - p)^(k-1) * p

where k is the number of tosses needed (ranging from 1 to infinity), p is the probability of
success (0.5 in our case), and (1 - p)^(k-1) is the probability of getting k-1 consecutive
tails before the first heads occurs on the k-th toss.

For instance, let's calculate the probability of needing exactly 3 tosses until getting the first
heads:

P(X = 3) = (1 - 0.5)^(3-1) * 0.5

By substituting the values into the equation, we get P(X = 3) = 0.25 * 0.5 = 0.125.

4. Expected value (mean) and Variance:


The expected value (mean) of a geometric random variable can be calculated as the inverse of
the probability of success:
E(X) = 1 / p

In our example, the expected value of X is E(X) = 1 / 0.5 = 2.

The variance of a geometric random variable can be calculated as:

Var(X) = q / p^2

In our example, the variance of X is Var(X) = 0.5 / (0.5)^2 = 2.

The geometric distribution is commonly used to model the number of trials needed until the first
success occurs in a sequence of independent Bernoulli trials.
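A similar standard-library sketch (with an illustrative helper name geometric_pmf) ties the coin-toss numbers together:

```python
def geometric_pmf(k, p):
    """P(X = k): first success on the k-th independent Bernoulli trial."""
    return (1 - p) ** (k - 1) * p

p = 0.5  # probability of heads on a fair coin

print(geometric_pmf(3, p))   # P(X = 3) = 0.125
print(1 / p)                 # E(X) = 2.0
print((1 - p) / p**2)        # Var(X) = 2.0
```

It prints 0.125 for P(X = 3) and 2.0 for both the mean and the variance, as derived above.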

5. Hypergeometric distribution: This distribution describes the probability of drawing a certain number of successes from a finite population without replacement. It is characterized by parameters N, the population size, K, the number of successes in the population, and n, the number of draws without replacement.

Let's explore an example of the hypergeometric distribution with a detailed explanation.

Example: Drawing Cards from a Deck

Suppose we have a standard deck of 52 playing cards, consisting of four suits (hearts,
diamonds, clubs, spades) with 13 cards each. We are interested in studying the probability of
drawing a certain number of hearts when drawing a specific number of cards without
replacement.

In this example, we define a hypergeometric random variable, let's call it X, to represent the
number of hearts in the drawn cards. The possible values of X range from 0 to the minimum of
the number of hearts in the deck and the number of cards drawn.

Now, let's explain the properties and characteristics of the hypergeometric distribution in this
context:

1. Population size (N):


The hypergeometric distribution is characterized by the population size, denoted as N, which
represents the total number of items in the population. In our example, N = 52, as we have 52
cards in the deck.

2. Number of successes in the population (K):


The number of successes in the population, denoted as K, represents the number of items in
the population that satisfy a certain criterion. In our example, K = 13, as there are 13 hearts in
the deck.

3. Sample size (n):


The sample size, denoted as n, represents the number of items drawn from the population
without replacement. In our example, n can vary depending on how many cards we choose to
draw.

4. Number of successes in the sample (k):


The number of successes in the sample, denoted as k, represents the number of items in the
sample that satisfy the criterion. In our example, k can range from 0 to the minimum of the
number of hearts in the deck and the number of cards drawn.

5. Probability mass function (PMF):


The probability mass function of the hypergeometric distribution gives the probability of each
possible outcome. For the number of hearts drawn, the PMF of X can be expressed as:

P(X = k) = (C(K, k) * C(N-K, n-k)) / C(N, n)

where C(a, b) represents the binomial coefficient, which is the number of ways to choose b
items from a items, and C(N, n) represents the number of ways to choose n items from N items.

For instance, let's calculate the probability of drawing exactly 2 hearts when drawing 5 cards:

P(X = 2) = (C(13, 2) * C(52-13, 5-2)) / C(52, 5) = (78 * 9139) / 2598960 ≈ 0.274

So there is roughly a 27% chance of drawing exactly 2 hearts in a 5-card hand.

The hypergeometric distribution models situations where the sample is drawn without
replacement and the probability of success changes as items are selected.
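The card-drawing probability can be verified with Python's math.comb (available since Python 3.8); hypergeom_pmf below is an illustrative helper, not a library function.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k) successes when drawing n items from N containing K successes, without replacement."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Exactly 2 hearts in a 5-card draw from a standard 52-card deck
print(hypergeom_pmf(k=2, N=52, K=13, n=5))   # ~0.274
```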

Chapter 5: Continuous Distribution

A continuous distribution is a probability distribution that describes the likelihood of obtaining a continuous random variable, which can take on any value within a given interval. Unlike discrete distributions, which deal with discrete or countable outcomes, continuous distributions deal with measurements that can take on any value within a range.
In continuous distributions, probabilities are defined over intervals rather than specific
values. The probability density function (PDF) represents the probability of the random
variable falling within a particular range of values.

There are several different continuous distributions that are commonly used in statistics and probability theory:

1. Normal Distribution (Gaussian Distribution):


The normal distribution is characterized by its bell-shaped curve. It is symmetric around
the mean and is defined by two parameters: the mean (μ) and the standard deviation
(σ). Many real-world measurements, such as heights, weights, and IQ scores, tend to
follow a normal distribution.

Let's dive into the explanation of the normal distribution with an example.

The shape of the normal distribution is symmetric, with the mean as its center point. The
standard deviation determines the spread or dispersion of the distribution. The
probability density function (PDF) of the normal distribution is given by the formula:

f(x) = (1 / (σ * √(2π))) * exp(-(x - μ)^2 / (2σ^2))

where f(x) represents the probability density at a given value x, μ is the mean, σ is the
standard deviation, π is a mathematical constant approximately equal to 3.14159, and
exp() denotes the exponential function.

Example: Heights of Adults

Suppose we collect data on the heights of a large number of adults in a population. We find that the distribution of heights closely follows a normal distribution.

In this example, let's assume the mean height (μ) of the adult population is 170
centimeters and the standard deviation (σ) is 5 centimeters.

With these parameters, we can use the normal distribution to make various probabilistic
statements about the heights of individuals:

1. Probability of a Range of Heights:


We can calculate the probability of finding an individual with a height within a specific
range. For instance, we can determine the probability of randomly selecting an adult
whose height is between 165 and 175 centimeters. By calculating the area under the
normal curve between these two values, we can obtain the probability.

2. Percentiles and Z-scores:


We can calculate percentiles to determine the height below which a certain percentage
of individuals fall. For example, we can find the height that corresponds to the 90th
percentile, which represents the height below which 90% of the adult population falls.

Additionally, we can calculate z-scores to determine how far a given height is from the
mean in terms of standard deviations. A positive z-score indicates a height above the
mean, while a negative z-score indicates a height below the mean.

3. Outliers and Unusual Heights:


With the normal distribution, we can identify outliers or heights that are considered
unusual or extreme. Generally, heights that fall several standard deviations away from
the mean (e.g., more than 2 or 3 standard deviations) are considered unusual or
outliers.

The normal distribution is widely used in various fields, including social sciences, natural
sciences, economics, and engineering, as many natural phenomena tend to follow a
roughly normal distribution. By understanding the characteristics of the normal
distribution and using its properties, we can analyze data, make probabilistic
predictions, and draw conclusions about the likelihood of certain events or observations
occurring.
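As a rough sketch of how such statements can be computed, the standard library's statistics.NormalDist (Python 3.8+) covers the height example directly; the numbers simply restate the assumed mean of 170 cm and standard deviation of 5 cm.

```python
from statistics import NormalDist

heights = NormalDist(mu=170, sigma=5)   # adult heights: mean 170 cm, sd 5 cm

# 1. Probability of a height between 165 cm and 175 cm
p_range = heights.cdf(175) - heights.cdf(165)
print(round(p_range, 4))                 # ~0.6827 (about 68%)

# 2. 90th percentile of heights, and the z-score of a 180 cm person
print(round(heights.inv_cdf(0.90), 2))   # ~176.41 cm
print((180 - 170) / 5)                   # z = 2.0, i.e. 2 sd above the mean

# 3. A rough outlier rule: flag heights more than 3 sd from the mean
print(abs(180 - 170) / 5 > 3)            # False: 180 cm is unusual but not extreme
```

The roughly 68% figure for one standard deviation around the mean matches the well-known 68-95-99.7 rule for normal distributions.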

2. Uniform Distribution:
The uniform distribution is characterized by a constant probability density over a
specified interval. It is often represented as a rectangle with equal heights across the
interval. The uniform distribution is used when all outcomes in a range are equally likely,
such as rolling a fair die.
Let's explore an example of the uniform distribution with a detailed explanation.

Example: Randomly Selecting a Number

Suppose you have a bag containing 100 numbered balls, ranging from 1 to 100. You
want to randomly select a ball from the bag, and you believe that each ball is equally
likely to be chosen.
In this example, the uniform distribution can be used to model the probability distribution
of selecting a number from the bag. Each number has an equal probability of being
chosen since all the balls are assumed to be identical and equally likely.

Here's an explanation of the uniform distribution in this context:

1. Interval:
In this case, the interval represents the range of numbers in the bag, which is from 1 to
100. The interval is inclusive, meaning it includes both the lower and upper bounds.

2. Probability density function (PDF):


The PDF of the uniform distribution is constant over the interval and is given by the
formula:

f(x) = 1 / (b - a)

where f(x) represents the probability density at a given value x, and a and b represent
the lower and upper bounds of the interval, respectively.

In our example, the lower bound (a) is 1 and the upper bound (b) is 100. Note, however, that the balls are countable, so strictly speaking this example follows the discrete uniform distribution: with 100 equally likely balls, each ball has a probability of

P(X = k) = 1/100

of being selected. The continuous density f(x) = 1 / (b - a) is the analogous formula when x can take any value in the interval [a, b].

Using the uniform distribution, we can answer various questions and make probabilistic
statements about selecting a number from the bag:

1. Probability of Selecting a Specific Number:


The probability of randomly selecting a specific number, such as 42, is 1/100, since there
is only one such ball among the 100.

2. Probability of Selecting a Range of Numbers:
The probability of selecting a number within a specific range can be calculated by
summing the probabilities of each individual number within the range. For example, the
probability of selecting a number between 30 and 50 (inclusive) is (50 - 30 + 1) * (1/100) = 21/100 = 0.21.

3. Expected Value:
The expected value (mean) of a uniform distribution is calculated as the average of the
lower and upper bounds. In this case, the expected value of randomly selecting a
number from the bag is (1 + 100) / 2 = 50.5.
The uniform distribution is commonly used when all outcomes within an interval are
equally likely. It provides a simple and intuitive way to model situations where each
outcome has the same probability of occurring.
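A small standard-library sketch of the ball-drawing example, treating it as a discrete uniform distribution over the numbers 1 to 100 as discussed above (the variable names are illustrative):

```python
import random
from statistics import mean

balls = range(1, 101)            # balls numbered 1 to 100
p_each = 1 / len(balls)          # 1/100 for every ball

# Probability of drawing a number between 30 and 50 (inclusive)
print((50 - 30 + 1) * p_each)    # 0.21

# Theoretical expected value vs. a quick simulation
print((1 + 100) / 2)             # 50.5
draws = [random.choice(balls) for _ in range(100_000)]
print(round(mean(draws), 1))     # close to 50.5
```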

Here are other distributions that we will not cover in detail:

3. Exponential Distribution:
The exponential distribution is often used to model the time between events in a
Poisson process, where events occur at a constant rate independently of time. It is
characterized by a decreasing exponential shape and is typically used to model the
duration until an event happens.

4. Beta Distribution:
The beta distribution is a versatile continuous distribution defined on the interval [0, 1]. It
is commonly used to model random variables that represent proportions or probabilities.
The shape of the distribution is controlled by two shape parameters, often denoted as α
and β.

5. Gamma Distribution:
The gamma distribution is a flexible continuous distribution that is often used to model
positive, skewed data. It is commonly used to model waiting times, lifetimes, and other
positive continuous variables. The gamma distribution has two shape parameters: α and
β.

6. Weibull Distribution:
The Weibull distribution is commonly used to model failure times or lifetimes of systems.
It is characterized by its shape parameter (k) and scale parameter (λ). The Weibull
distribution can take on various shapes, including exponential (when k = 1), decreasing
failure rate, and increasing failure rate.

Chapter 6: Sampling and Sampling Distribution

A sampling technique refers to the method or approach used to select a subset of individuals or items from a larger population for the purpose of conducting a study or gathering data. Sampling is a crucial aspect of research as it allows researchers to draw inferences about the population based on the characteristics observed in the sample.

There are several sampling techniques commonly used in research:


1. Simple Random Sampling: Every individual or item in the population has an equal chance of being selected for the sample. This can be done using random number generators.

2. Stratified Sampling: The population is divided into homogeneous subgroups called strata, and then a random sample is selected from each stratum. This technique ensures representation from each subgroup in the sample.

3. Cluster Sampling: The population is divided into clusters or groups, and then a
subset of clusters is randomly selected. All individuals within the selected clusters are
included in the sample.

4. Systematic Sampling: Individuals or items are selected from the population at regular intervals using a predetermined pattern. For example, selecting every 10th person from a list.
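As an illustration, here is a hedged Python sketch of two of these techniques, simple random sampling and systematic sampling, applied to a made-up population of 1,000 member IDs:

```python
import random

population = list(range(1, 1001))   # a hypothetical population of 1,000 member IDs

# Simple random sampling: every member has an equal chance of selection
simple_sample = random.sample(population, k=50)

# Systematic sampling: pick every 10th member after a random starting point
start = random.randrange(10)
systematic_sample = population[start::10]

print(len(simple_sample), len(systematic_sample))   # 50 and 100
```

Stratified and cluster sampling follow the same idea, with random.sample applied within each stratum or to the list of clusters.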

Once a sample is obtained, researchers can analyze the collected data and draw
conclusions about the population. This is where the concept of a sampling distribution
comes into play.

A sampling distribution refers to the distribution of a statistic (such as the mean or proportion) calculated from multiple samples of the same size, drawn from the same population. The sampling distribution allows us to understand the variability and properties of the statistic and make inferences about the population.

Key points about sampling distributions include:

1. Central Limit Theorem: For a sufficiently large sample size (typically n > 30), the sampling distribution of the sample mean tends to follow an approximately normal distribution, regardless of the shape of the population distribution. This is known as the central limit theorem.

2. Standard Error: The standard deviation of the sampling distribution is called the
standard error. It measures the variability of the statistic across different samples. The
standard error decreases as the sample size increases.
3. Sampling Distribution of the Mean: For the sampling distribution of the sample mean,
the mean of the sampling distribution is equal to the population mean, and the standard
deviation (standard error) is equal to the population standard deviation divided by the
square root of the sample size.
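These points can be illustrated with a short simulation, a sketch assuming only the Python standard library: draw many samples from a skewed (exponential) population and inspect the distribution of the sample means.

```python
import random
from statistics import mean, stdev

random.seed(42)
pop_mean, n, num_samples = 5.0, 40, 2000   # exponential population with mean 5

# Draw many samples of size n and record each sample mean
sample_means = [
    mean(random.expovariate(1 / pop_mean) for _ in range(n))
    for _ in range(num_samples)
]

print(round(mean(sample_means), 2))    # close to the population mean, 5.0
print(round(stdev(sample_means), 2))   # close to the standard error 5/sqrt(40) ~ 0.79
# A histogram of sample_means would look roughly normal despite the skewed population.
```

Even though the exponential population is strongly skewed, the sample means cluster symmetrically around the population mean, with a spread close to the theoretical standard error.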

Sampling techniques and the resulting sampling distributions play a crucial role in
inferential statistics, hypothesis testing, and making generalizations about populations
based on sample data. Proper sampling techniques and understanding the
characteristics of sampling distributions help ensure accurate and reliable research
findings.

Chapter 7: Understanding Confidence Interval

In statistics, a confidence interval is a range of values that provides an estimate of a population parameter, such as a mean or proportion. It is constructed from sample data and quantifies the uncertainty or variability in estimating the true population parameter.

A confidence interval is associated with a level of confidence, typically expressed as a percentage. The level of confidence represents the probability that the interval contains the true population parameter. For example, a 95% confidence interval means that if we were to repeat the sampling process and construct confidence intervals many times, approximately 95% of those intervals would contain the true population parameter.

Example:
Let's consider an example to understand how confidence intervals work:

Suppose you want to estimate the average height of students in a university. You collect
a random sample of 100 students and measure their heights. The sample mean height
is found to be 165 cm, and the standard deviation of the sample is 5 cm.

To construct a 95% confidence interval for the population mean height, you would follow
these steps:

1. Determine the Level of Confidence:


Choose the desired level of confidence. In this case, it is 95%.

2. Calculate the Standard Error:


Compute the standard error, which measures the variability in the estimate. The
standard error of the mean is calculated as the sample standard deviation divided by
the square root of the sample size:

Standard Error = 5 / √100 = 5 / 10 = 0.5 cm

3. Determine the Critical Value:


Based on the chosen level of confidence, find the corresponding critical value from the
appropriate statistical table or using statistical software. For a 95% confidence level, the
critical value for a two-tailed test is approximately 1.96.

4. Calculate the Confidence Interval:


Using the formula for constructing a confidence interval for the mean, we have:

Confidence Interval = Sample Mean ± (Critical Value × Standard Error)


Confidence Interval = 165 ± (1.96 × 0.5) = 165 ± 0.98

The resulting confidence interval is (164.02, 165.98).
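The same interval can be reproduced in a few lines of Python (standard library only); the 1.96 critical value is obtained from the standard normal inverse CDF rather than a table.

```python
from math import sqrt
from statistics import NormalDist

sample_mean, sample_sd, n = 165, 5, 100
confidence = 0.95

z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # ~1.96 for 95%
se = sample_sd / sqrt(n)                             # standard error = 0.5 cm
margin = z * se

print(round(sample_mean - margin, 2), round(sample_mean + margin, 2))  # ~164.02 165.98
```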

Interpretation:
We can interpret the confidence interval as follows: "We are 95% confident that the true
average height of students in the university lies between 164.02 cm and 165.98 cm."
This means that if we were to repeat the sampling process many times and construct
confidence intervals, approximately 95% of those intervals would capture the true
population mean height.

The half-width of the confidence interval, known as the margin of error (0.98 cm in this case, giving a total interval width of 1.96 cm), reflects the level of uncertainty in the estimation. A wider interval indicates higher uncertainty, while a narrower interval indicates greater precision in the estimate.

Confidence intervals provide a range of plausible values for the population parameter
and allow for making statistical inferences about the population based on the sample
data. They help to quantify the uncertainty associated with the estimation process and
provide a measure of the precision and reliability of the estimates.

Chapter 8: Understanding Hypothesis Testing

Hypothesis testing is a statistical method used to make inferences and draw conclusions about a population based on sample data. It involves setting up two competing hypotheses, the null hypothesis (H0) and the alternative hypothesis (Ha), and examining the evidence from the sample to determine which hypothesis is supported.

Here's an explanation of hypothesis testing:

1. Define the Null Hypothesis (H0) and Alternative Hypothesis (Ha):


The null hypothesis represents the default or initial assumption about the population
parameter being tested. It states that there is no significant difference or relationship
between variables or that the population parameter takes a specific value. The
alternative hypothesis, on the other hand, represents the claim or assertion we want to
test. It suggests that there is a significant difference or relationship between variables or
that the population parameter takes a different value.

2. Choose the Significance Level (α):


The significance level, denoted by α, is the probability of rejecting the null hypothesis
when it is actually true. It determines the threshold for considering evidence strong
enough to reject the null hypothesis. Commonly used significance levels are 0.05 (5%)
and 0.01 (1%).
3. Collect and Analyze Sample Data:
Collect a sample from the population of interest and analyze the data using appropriate
statistical techniques. The choice of analysis depends on the research question and
type of data, such as t-tests, chi-square tests, or regression analysis.

4. Calculate the Test Statistic:


Calculate a test statistic that measures the discrepancy between the observed sample
data and what would be expected under the null hypothesis. The test statistic varies
depending on the hypothesis being tested and the analysis method used.

5. Determine the Rejection Region:


Determine the rejection region, also known as the critical region, which defines the
values of the test statistic that would lead to rejecting the null hypothesis. The critical
region is determined based on the significance level and the distribution of the test
statistic under the null hypothesis.

6. Compare the Test Statistic with the Critical Value:


Compare the calculated test statistic with the critical value(s) associated with the
chosen significance level. If the test statistic falls within the rejection region, it provides
evidence to reject the null hypothesis in favor of the alternative hypothesis. If the test
statistic does not fall within the rejection region, there is insufficient evidence to reject
the null hypothesis.

7. Draw Conclusions:
Based on the comparison, make a conclusion about the null hypothesis. If the null
hypothesis is rejected, it suggests that there is sufficient evidence to support the
alternative hypothesis. If the null hypothesis is not rejected, it indicates that the
evidence is not strong enough to support the alternative hypothesis.

Example:

Let's consider an example to illustrate hypothesis testing:

Suppose a pharmaceutical company has developed a new drug and wants to determine
if it is effective in reducing cholesterol levels. The null hypothesis (H0) would state that
there is no significant difference in cholesterol levels between the drug-treated group
and the control group, while the alternative hypothesis (Ha) would state that there is a
significant difference.
The company conducts a randomized controlled trial with 100 participants, randomly
assigning 50 to receive the new drug and 50 to receive a placebo. After the treatment
period, cholesterol levels are measured for both groups.

Using a t-test as the analysis method, the company calculates the test statistic (e.g., t-
value) based on the sample data and determines the critical value(s) for the chosen
significance level (e.g., α = 0.05).

Suppose the calculated t-value is 2.36, and the critical value for a two-tailed t-test at α =
0.05 is 2.00.

Comparing the test statistic (2.36) with the critical value (2.00), we find that the test statistic falls within the rejection region. Therefore, we reject the null hypothesis and conclude
that there is sufficient evidence to support the alternative hypothesis. This suggests that
the new drug is effective in reducing cholesterol levels compared to the placebo.
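As a hedged sketch of how such a test might be run in practice, the snippet below uses NumPy and SciPy (assumed to be installed) on simulated cholesterol values that merely stand in for the trial data; the group means and spreads are made-up numbers chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated cholesterol levels (mg/dL) standing in for the trial measurements:
# we assume the drug group averages somewhat lower than the placebo group.
drug = rng.normal(loc=190, scale=20, size=50)
placebo = rng.normal(loc=200, scale=20, size=50)

# Two-sample t-test of H0: equal means vs. Ha: different means
result = stats.ttest_ind(drug, placebo)
print(result.statistic, result.pvalue)

alpha = 0.05
if result.pvalue < alpha:
    print("Reject H0: evidence of a difference in mean cholesterol")
else:
    print("Fail to reject H0: insufficient evidence of a difference")
```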

Hypothesis testing allows researchers to make data-driven decisions, draw conclusions about population parameters, and determine the significance of research findings. It
provides a framework for testing hypotheses and assessing the statistical evidence,
helping to guide decision-making in various fields of study.

Chapter 9: Inference with Two Populations

In statistics, inference with two populations refers to the process of drawing conclusions
or making comparisons about two separate populations based on sample data. This
involves conducting hypothesis tests or constructing confidence intervals to assess the
similarity or difference between population parameters.

Here's an explanation of inference with two populations with an example:


Example:
Suppose a researcher wants to compare the average salaries of employees in two
different companies, Company A and Company B. The goal is to determine if there is a
significant difference in the population means.

1. Define the Research Question:


The research question is whether the average salary in Company A is different from the
average salary in Company B.

2. Collect Sample Data:


Random samples are collected from both populations. Let's say we have a sample of
100 employees from Company A and a sample of 120 employees from Company B.
The salaries of these individuals are recorded.

3. Define the Hypotheses:


The null hypothesis (H0) states that there is no significant difference in the average
salaries between the two populations, while the alternative hypothesis (Ha) suggests
that there is a significant difference.

H0: μA = μB (The population mean salary of Company A is equal to the population mean salary of Company B)
Ha: μA ≠ μB (The population mean salary of Company A is not equal to the population mean salary of Company B)

4. Choose the Significance Level:


Select a significance level (α) to determine the threshold for considering evidence
strong enough to reject the null hypothesis. Commonly used significance levels are 0.05
(5%) and 0.01 (1%).

5. Conduct the Analysis:


Perform a statistical analysis to compare the means of the two samples. Depending on
the characteristics of the data and assumptions, various methods can be used, such as
the independent samples t-test or the Mann-Whitney U test.

6. Calculate the Test Statistic and P-value:


Compute the appropriate test statistic (e.g., t-value) and determine the corresponding p-
value. The test statistic measures the difference between the sample means, while the
p-value quantifies the evidence against the null hypothesis.
7. Interpret the Results:
If the p-value is less than the chosen significance level (α), we reject the null hypothesis
and conclude that there is a significant difference in the average salaries between the
two populations. If the p-value is greater than α, we fail to reject the null hypothesis and
conclude that there is insufficient evidence to suggest a significant difference.

For example, if the analysis yields a p-value of 0.02 at a significance level of 0.05, we
would reject the null hypothesis and conclude that there is a significant difference in the
average salaries between Company A and Company B.
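A minimal sketch of this comparison, computed directly from hypothetical summary statistics with the standard library (the salary figures below are assumptions for illustration, not data from the example):

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical summary statistics for the two companies (illustration only)
mean_a, sd_a, n_a = 62_000, 8_000, 100   # Company A
mean_b, sd_b, n_b = 59_500, 9_000, 120   # Company B

# Welch-style test statistic for H0: mu_A = mu_B
se = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
z = (mean_a - mean_b) / se

# Two-tailed p-value; with samples this large the normal approximation is reasonable
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 2), round(p_value, 4))    # roughly z ~ 2.18, p ~ 0.029 for these numbers

alpha = 0.05
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

With samples of 100 and 120 the normal approximation to the t distribution is acceptable; for small samples, a t-based p-value (for example via scipy.stats.ttest_ind on the raw salaries) would be preferred.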

Inference with two populations allows us to make statistical comparisons and draw
conclusions about differences or similarities between two groups of interest. It helps in
making informed decisions, providing evidence for policy changes, or identifying areas
of improvement based on comparisons between populations.
