UNIT II Notes
A sample space is the set of all possible outcomes of a random experiment. The sample space is denoted by the symbol "S". Subsets of the sample space, i.e. particular collections of possible outcomes, are called events. A sample space can contain many outcomes, depending on the experiment. If it has a finite number of outcomes, it is called a discrete or finite sample space.
The sample space of a random experiment is written inside curly brackets "{ }". There is a difference between the sample space and an event. For example, when rolling a die, the sample space S is {1, 2, 3, 4, 5, 6}, while an event can be {1, 3, 5}, which represents the set of odd numbers, or {2, 4, 6}, which represents the set of even numbers. The outcome of the experiment is random, and the sample space acts as the universal set for the events of that experiment.
Some examples include:
Tossing a Coin
When flipping a coin, there are two possible outcomes: head and tail.
When flipping two coins, the number of possible outcomes is four. Let H1 and T1 be the head and tail of the first coin, and H2 and T2 be the head and tail of the second coin, respectively. The sample space can be written as
Sample Space, S = { (H1, H2), (H1, T2), (T1, H2), (T1, T2) }
In general, if you have "n" coins, then the possible number of outcomes will be 2^n. For example, when three coins are tossed (n = 3), there are 2^3 = 8 outcomes:
Sample space S = { HHH, HHT, HTH, HTT, THH, THT, TTH, TTT }
A Die is Thrown
When a single die is thrown, it has 6 outcomes since it has 6 faces. Therefore, the sample space is given as
S = { 1, 2, 3, 4, 5, 6}
When two dice are thrown together, we will get 36 pairs of possible outcomes. Each face of the first die can fall
with all the six faces of the second die. As there are 6 x 6 possible pairs, it becomes 36 outcomes. The 36
outcome pairs are written as:
{(1,1)(1,2)(1,3)(1,4)(1,5)(1,6)(2,1)(2,2)(2,3)(2,4)(2,5)(2,6)(3,1)(3,2)(3,3)(3,4)(3,5)(3,6)(4,1)(4,2)(4,3)(4,4)(4,5)
(4,6)(5,1)(5,2)(5,3)(5,4)(5,5)(5,6)(6,1)(6,2)(6,3)(6,4)(6,5)(6,6)}
If three dice are thrown, there are 6^3 = 216 possible outcomes, since each of the three dice can show any of its 6 faces.
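As a quick illustrative check of this counting argument, the outcomes can be enumerated in Python with the standard library:

```python
# Enumerate the sample space for two dice and confirm the 6 x 6 = 36 count.
from itertools import product

faces = [1, 2, 3, 4, 5, 6]
two_dice = list(product(faces, repeat=2))    # all ordered pairs (first die, second die)
print(len(two_dice))                         # 36

three_dice = list(product(faces, repeat=3))  # all ordered triples
print(len(three_dice))                       # 216
```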
A probability event can be defined as a set of outcomes of an experiment. In other words, an event in probability
is the subset of the respective sample space. So, what is sample space?
The entire possible set of outcomes of a random experiment is the sample space or the individual space of that
experiment. The likelihood of occurrence of an event is known as probability. The probability of occurrence of
any event lies between 0 and 1.
The sample space for the tossing of three coins simultaneously is given by:
S = { HHH, HHT, HTH, HTT, THH, THT, TTH, TTT }
Suppose we want only the outcomes which have at least two heads; then the set of all such possibilities can be given as:
E = { HHH, HHT, HTH, THH }
There could be a lot of events associated with a given sample space. For any event to occur, the outcome of the
experiment must be an element of the set of event E.
The ratio of the number of favourable outcomes to the total number of outcomes is defined as the probability of occurrence of an event. So, the probability that an event E will occur is given as:
P(E) = n(E) / n(S)
where n(E) is the number of outcomes favourable to E and n(S) is the total number of outcomes in the sample space. Events in probability can be classified into several types, including:
Simple Events
Compound Events
Exhaustive Events
Complementary Events
If the probability of occurrence of an event is 0, such an event is called an impossible event and if the
probability of occurrence of an event is 1, it is called a sure event. In other words, the empty set ϕ is an
impossible event and the sample space S is a sure event.
Simple Events
Any event consisting of a single point of the sample space is known as a simple event in probability. For example, if S = {56, 78, 96, 54, 89} and E = {78}, then E is a simple event.
Compound Events
Contrary to a simple event, if an event consists of more than one point of the sample space, it is called a compound event. Considering the same example again, if S = {56, 78, 96, 54, 89}, E1 = {56, 54} and E2 = {78, 56, 89}, then E1 and E2 represent two compound events.
Independent Events and Dependent Events
If the occurrence of an event is completely unaffected by the occurrence of any other event, such events are known as independent events in probability, and events which are affected by other events are known as dependent events.
Mutually Exclusive Events
If the occurrence of one event excludes the occurrence of another event, such events are mutually exclusive events, i.e. the two events have no common point. For example, if S = {1, 2, 3, 4, 5, 6} and E1, E2 are two events such that E1 consists of numbers less than 3 and E2 consists of numbers greater than 4, then E1 = {1, 2} and E2 = {5, 6}, which have no element in common.
Exhaustive Events
A set of events is called exhaustive if all the events together consume the entire sample space.
Complementary Events
For any event E1 there exists another event E1′ which contains the remaining elements of the sample space S, i.e.
E1′ = S − E1
If a die is rolled, the sample space S is given as S = {1, 2, 3, 4, 5, 6}. If event E1 represents all the outcomes greater than 4, then E1 = {5, 6} and E1′ = {1, 2, 3, 4}.
Similarly, the complements of E1, E2, E3, ..., En are represented as E1′, E2′, E3′, ..., En′.
If two events E1 and E2 are associated with OR, it means that either E1 occurs, or E2 occurs, or both occur. The union symbol (∪) is used to represent OR in probability.
If we have exhaustive events E1, E2, E3, ..., En associated with sample space S then,
E1 ∪ E2 ∪ E3 ∪ ... ∪ En = S
If two events E1 and E2 are associated with AND then it means the intersection of elements which is common to
both the events. The intersection symbol (∩) is used to represent AND in probability.
It represents the difference between the two events. Event "E1 but not E2" consists of all the outcomes which are present in E1 but not in E2. Thus, the event E1 but not E2 is represented as
E1 but not E2 = E1 − E2
Measures of probability
Probability is a fundamental concept in statistics and mathematics that quantifies the likelihood of an event
occurring. Here are some key measures and concepts related to probability:
Basic Probability
1. Probability of an Event (P(A)) : The probability of an event A is defined as the ratio of the number of favourable outcomes to the total number of possible outcomes in the sample space. It ranges from 0 to 1, where 0 means the event cannot occur, and 1 means the event is certain.
2. Complementary Probability : The probability of the complement of an event A, denoted as A', is 1 - P(A). It represents the probability that event A does not occur.
1. Conditional Probability : The probability of an event A occurring given that another event B has already occurred is denoted by P(A|B). It is calculated using:
P(A|B) = P(A ∩ B) / P(B), provided P(B) > 0
2. Joint Probability : The probability that both events A and B occur is denoted by P(A ∩ B).
3. Marginal Probability : The probability of a single event occurring, without consideration of other events, is known as marginal probability. For example, P(A) is the marginal probability of event A.
4. Bayes' Theorem : A way to update the probability of an event based on new information. It is expressed as:
P(A|B) = P(B|A) P(A) / P(B)
5. Law of Total Probability : This law provides a way to calculate P(B) when B can occur in several mutually exclusive ways. For example, if A1, A2, ..., An are mutually exclusive and exhaustive events, then:
P(B) = P(B|A1) P(A1) + P(B|A2) P(A2) + ... + P(B|An) P(An)
6. Expectation (Expected Value) : The average value of a random variable, considering all possible values it can take, weighted by their probabilities. For a discrete random variable X:
E[X] = Σ x · P(X = x)
7. Variance and Standard Deviation : Measures of the spread or dispersion of a random variable's possible values around the expected value. Variance is given by:
Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2
and the standard deviation is its square root (a small numeric sketch follows this list).
8. Probability Distribution : Describes how probabilities are distributed over the values of the random variable. Common distributions include the binomial, Poisson, and normal distributions.
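As noted in item 7, the following small numeric sketch applies the expectation and variance definitions to a fair six-sided die:

```python
# Expectation and variance of a fair six-sided die, computed directly from
# E[X] = sum(x * p(x)) and Var(X) = E[(X - E[X])^2].
values = [1, 2, 3, 4, 5, 6]
p = 1 / 6                                                     # each face is equally likely

expectation = sum(x * p for x in values)                      # 3.5
variance = sum((x - expectation) ** 2 * p for x in values)    # about 2.9167
std_dev = variance ** 0.5

print(expectation, variance, std_dev)
```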
Probability Distribution
A probability distribution gives the probabilities of the possible outcomes of a random experiment. It is defined on the underlying sample space, the set of possible outcomes of the experiment, which could be a set of real numbers, a set of vectors, or a set of other entities. It is a part of probability and statistics.
There are two types of probability distribution, discrete and continuous, which are used for different purposes and for different types of data-generation processes.
Since the normal distribution describes many natural phenomena so well, it has become a standard of reference for many probability problems. Some examples are:
Tossing a coin
A distribution is called a discrete probability distribution when the set of possible outcomes is discrete in nature. For example, if a die is rolled, all the possible outcomes are discrete, and a probability mass is assigned to each of them. The function describing such a distribution is known as the probability mass function.
A binomial distribution describes the number of successes in n repeated trials, where each trial either produces the outcome of interest or does not. The formula for the binomial distribution is:
P(X = x) = nCx p^x (1 - p)^(n - x), x = 0, 1, ..., n
where p is the probability of success in a single trial and nCx is the number of ways of choosing x successes out of n trials. As we already know, the binomial distribution gives the probabilities of the different possible numbers of successes. In real life, the concept is used for the following (a short computational sketch follows this list):
To find the number of used and unused materials while manufacturing a product.
To take a survey of positive and negative feedback from the people for anything.
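The binomial formula above can be checked numerically. The sketch below assumes `scipy` is installed and uses made-up numbers (10 coin tosses, probability of exactly 6 heads), evaluating the formula both directly and with `scipy.stats`:

```python
# Evaluate P(X = x) = nCx * p^x * (1-p)^(n-x) by hand and with scipy.stats.
from math import comb
from scipy.stats import binom

n, p = 10, 0.5            # e.g. 10 coin tosses with P(head) = 0.5
x = 6                     # probability of exactly 6 heads

manual = comb(n, x) * p**x * (1 - p)**(n - x)
library = binom.pmf(x, n, p)

print(manual, library)    # both about 0.2051
```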
The Poisson probability distribution is a discrete probability distribution that gives the probability of a given number of events happening in a fixed interval of time or space, when these events occur with a known constant rate and independently of the time since the last event. It is named after the French mathematician Siméon Denis Poisson. The Poisson distribution can also be applied to the number of events occurring in other specified intervals such as distance, area or volume. Some real-life examples are the number of phone calls received by a call centre per hour, or the number of misprints on a page of a book.
A function which is used to define the distribution of probability is called a probability distribution function. Its form depends on the type of random variable; for continuous random variables it is expressed in terms of a probability density function.
In the case of a continuous (for example, Normal) distribution, the cumulative distribution function of a real-valued random variable X is the function given by:
FX(x) = P(X ≤ x)
where P denotes the probability that the random variable X takes a value less than or equal to x.
For a closed interval [a, b], the cumulative probability can be defined as:
P(a < X ≤ b) = FX(b) - FX(a)
If we express the cumulative probability function as the integral of its probability density function fX, then:
FX(x) = ∫ fX(t) dt, integrated from -∞ to x
In the case of a random variable taking a value at most b, the cumulative probability function is:
FX(b) = P(X ≤ b)
In the case of the binomial distribution, as we know, a discrete random variable takes each of its possible values with some probability. Such a distribution is also called a probability mass distribution, and the function associated with it is called a probability mass function.
The probability mass function is defined for scalar or multivariate random variables whose domain is discrete. Let us discuss its formula. Let X be a discrete random variable defined on a sample space S and taking values in a set A,
X : S → A
Then the probability mass function fX : A → [0,1] for X can be defined as:
fX(x) = P(X = x) for each x in A
The Poisson distribution is a discrete probability distribution, which means the variable can only take specific values from a given list of numbers, possibly infinite. A Poisson distribution measures how many times an event is likely to occur within a given period of time. In other words, it is the probability distribution that results from a Poisson experiment. A Poisson experiment is a statistical experiment that classifies each occurrence into one of two categories, such as success or failure. The Poisson distribution is a limiting case of the binomial distribution.
A Poisson random variable "x" gives the number of successes in the experiment. The distribution is used for events that do not arise from a fixed number of trials, and it applies under the following conditions:
the number of trials n tends to infinity, the probability of success p tends to zero, and np = λ is finite.
The Poisson probability is then:
P(X = x) = e^(-λ) λ^x / x!, x = 0, 1, 2, ...
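A small sketch of the Poisson formula, assuming `scipy` is available and using an illustrative rate of λ = 4 events per interval:

```python
# Evaluate the Poisson probability P(X = x) = e^(-lam) * lam^x / x!
# directly and with scipy.stats; lam = 4 is an assumed example rate.
from math import exp, factorial
from scipy.stats import poisson

lam = 4        # average number of events per interval (assumed)
x = 2          # probability of exactly 2 events

manual = exp(-lam) * lam**x / factorial(x)
library = poisson.pmf(x, mu=lam)

print(manual, library)    # both about 0.1465
```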
Hypothesis testing is a statistical method used to make inferences or draw conclusions about a population based
on sample data. Here’s a basic overview of the process:
1. Formulate Hypotheses :
- Null Hypothesis (H₀) : This is a statement of no effect or no difference. It's the hypothesis that you seek to
test and potentially reject.
- Alternative Hypothesis (H₁ or Ha) : This is a statement that indicates the presence of an effect or a
difference. It is what you are trying to find evidence for.
Example: H₀: a new teaching method has no effect on mean test scores; H₁: the new teaching method changes mean test scores.
2. Choose a Significance Level (α) :
- This is the probability of rejecting the null hypothesis when it is actually true (Type I error). Common significance levels are 0.05, 0.01, or 0.10.
3. Select the Appropriate Test :
- Depending on the type of data and hypothesis, you choose a statistical test (e.g., t-test, chi-square test, ANOVA). The choice depends on factors like sample size, data type, and whether variances are equal.
4. Calculate the Test Statistic :
- Gather your sample data and compute the test statistic using the chosen test. The test statistic is a standardized value that is used to determine how far your sample statistic is from the null hypothesis.
5. Determine the p-value :
- The p-value measures the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. It helps in deciding whether to reject the null hypothesis.
6. Make a Decision :
- If p-value ≤ α : Reject the null hypothesis. There is sufficient evidence to support the alternative
hypothesis.
- If p-value > α : Fail to reject the null hypothesis. There is not enough evidence to support the alternative
hypothesis.
7. Draw a Conclusion :
- Based on the decision, summarize the findings in the context of the research question.
Types of Errors
- Type I Error (False Positive): Rejecting the null hypothesis when it is actually true.
- Type II Error (False Negative): Failing to reject the null hypothesis when it is actually false.
Example
Imagine you want to test if a new teaching method is more effective than the traditional method. You collect test
scores from students taught by both methods.
1. Hypotheses :
- H₀ : The mean score of students taught by the new method is equal to the mean score of students taught by
the traditional method.
- H₁ : The mean score of students taught by the new method is different from the mean score of students
taught by the traditional method.
2. Significance Level : Choose α = 0.05.
3. Test : Use a t-test for independent samples if the scores are normally distributed.
4. Calculate Test Statistic : Compute the t-value from the sample data.
5. Determine the p-value : Find the p-value corresponding to the computed t-value and its degrees of freedom.
6. Decision :
- If p-value ≤ 0.05, reject H₀. There is evidence that the new method affects test scores.
- If p-value > 0.05, do not reject H₀. There is not enough evidence to conclude that the new method affects test
scores.
T-TEST
A t-test is a statistical test used to determine if there is a significant difference between the means of two groups.
It's commonly used when you have small sample sizes and are comparing means.
1. Types of t-tests:
- Independent samples t-test: Compares the means of two different groups (e.g., test scores of two different
classes).
- Paired samples t-test: Compares means from the same group at different times (e.g., test scores before and
after an intervention).
- One-sample t-test: Compares the mean of a single group to a known value or population mean.
2. Key Concepts:
- Test Statistic (t-value): Measures the size of the difference relative to the variation in the sample data.
- Degrees of Freedom (df): A parameter that adjusts based on sample size and type of t-test.
- P-value: The probability of observing the data assuming the null hypothesis is true. A p-value below a
certain threshold (typically 0.05) suggests that the observed difference is statistically significant.
3. Formula:
For an independent-samples t-test with equal variances, the test statistic is
t = (x̄1 - x̄2) / (s_p sqrt(1/n1 + 1/n2))
where x̄1 and x̄2 are the sample means, n1 and n2 the sample sizes, and s_p the pooled standard deviation.
4. Interpreting Results:
- Compare the calculated t-value to a critical value from the t-distribution table based on your chosen
significance level (α) and degrees of freedom.
- Alternatively, compare the p-value to your significance level to determine if the result is significant.
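A hedged sketch of how the independent-samples t-test described above could be run with `scipy`; the score lists are invented purely to illustrate the workflow:

```python
# Independent-samples t-test on two invented sets of exam scores.
from scipy import stats

new_method = [78, 85, 90, 88, 76, 95, 89, 84]        # hypothetical scores
traditional = [72, 80, 79, 85, 70, 88, 75, 77]       # hypothetical scores

t_stat, p_value = stats.ttest_ind(new_method, traditional)  # assumes equal variances by default

alpha = 0.05
if p_value <= alpha:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: reject H0")
else:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: fail to reject H0")
```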
ANOVA
Analysis of Variance (ANOVA) is a statistical method used to determine if there are significant differences
between the means of three or more groups. It's an extension of the t-test, which is limited to comparing just two
groups. ANOVA helps to understand whether any observed differences among group means are statistically
significant.
Types of ANOVA
1. One-Way ANOVA:
- Purpose: Tests for differences in means across three or more groups based on one factor (independent
variable).
2. Two-Way ANOVA:
- Purpose: Tests for differences in means across groups based on two factors and can also examine the
interaction between these factors.
- Example: Comparing test scores based on both teaching methods and student gender.
3. Repeated Measures ANOVA:
- Purpose: Used when the same subjects are measured multiple times under different conditions.
- Example: Measuring the effect of different diets on the same group of people over several months.
Key Concepts
- Null Hypothesis (H₀): Assumes that all group means are equal.
- Alternative Hypothesis (H₁): Assumes that at least one group mean is different from the others.
- F-Statistic: A ratio of variances. It compares the variance between group means to the variance within groups. The formula is:
F = MS_between / MS_within = (SS_between / df_between) / (SS_within / df_within)
- P-Value: The probability of observing the data (or something more extreme) assuming the null hypothesis is
true. A p-value less than the chosen significance level (typically 0.05) suggests significant differences between
group means.
ANOVA Steps
1. Calculate Group Means and Overall Mean:
- Compute the mean of each group and the grand mean (overall mean of all data points).
2. Calculate Sums of Squares:
- Compute the between-group sum of squares (SSB) and the within-group sum of squares (SSW).
3. Compute Mean Squares and the F-Statistic:
- Divide each sum of squares by its degrees of freedom to obtain MSB and MSW, then take F = MSB / MSW.
4. Determine Significance:
- Compare the F-statistic to a critical value from the F-distribution table or use the p-value to assess significance.
5. Post-Hoc Tests (if needed):
- If ANOVA indicates significant differences, perform post-hoc tests (e.g., Tukey's HSD) to determine which specific groups differ from each other.
Interpreting Results
- If the F-statistic is large and/or the p-value is small: There are significant differences among group means.
- If the F-statistic is small and/or the p-value is large: There is no significant difference among group means.
Example
Suppose you want to test if three different diets have different effects on weight loss. You collect data from three
groups of participants, each following a different diet. After performing a one-way ANOVA, you might find that
the F-statistic is 5.2 and the p-value is 0.01. Since the p-value is less than 0.05, you reject the null hypothesis
and conclude that there are significant differences in weight loss among the three diets.
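A short sketch of how this one-way ANOVA could be carried out with `scipy`; the weight-loss numbers below are invented to mirror the three-diet example:

```python
# One-way ANOVA on three invented groups of weight-loss measurements (kg).
from scipy import stats

diet_a = [2.1, 3.4, 2.8, 3.9, 2.5]
diet_b = [4.5, 5.1, 4.8, 5.6, 4.9]
diet_c = [3.0, 3.8, 3.5, 4.1, 3.2]

f_stat, p_value = stats.f_oneway(diet_a, diet_b, diet_c)

print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: at least one diet differs in mean weight loss")
else:
    print("Fail to reject H0")
```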
Chi-square test
The chi-square test is a statistical test used to determine whether there is a significant association between
categorical variables. There are a few different types of chi-square tests, but the two most common are the chi-
square test of independence and the chi-square goodness-of-fit test.
Chi-Square Test of Independence
How It Works :
1. Construct a Contingency Table : Create a table that displays the frequency of occurrences for each
combination of categories for the two variables.
2. Calculate Expected Frequencies : Based on the assumption that the variables are independent, calculate the
expected frequency for each cell in the table.
3. Compute the Chi-Square Statistic : For each cell, take the squared difference between the observed (O) and expected (E) frequency divided by the expected frequency, and sum over all cells: χ² = Σ (O - E)² / E.
4. Compare to Critical Value : Compare the chi-square statistic to a critical value from the chi-square distribution table with the appropriate degrees of freedom to determine statistical significance.
Degrees of Freedom : For a contingency table with r rows and c columns, the degrees of freedom are:
df = (r - 1)(c - 1)
Chi-Square Goodness-of-Fit Test
How It Works :
1. Specify Expected Distribution : Determine the theoretical distribution you expect (e.g., uniform distribution,
specific proportions).
2. Calculate Expected Frequencies : Based on the expected distribution, calculate the expected frequency for
each category.
3. Compute the Chi-Square Statistic : As before, compute χ² = Σ (O - E)² / E over all categories.
4. Compare to Critical Value : Use the chi-square distribution table to find the critical value and compare it to the chi-square statistic to assess significance.
Degrees of Freedom : For the goodness-of-fit test with k categories, the degrees of freedom are:
df = k - 1
Assumptions
1. Categorical Data : Both variables should be categorical (counts of observations in categories).
2. Independence : The observations should be independent of one another.
3. Sample Size : Generally, expected frequencies should be at least 5 to use the chi-square test reliably. If this assumption is violated, alternative tests like Fisher's Exact Test may be used.
Example
Imagine you want to test if there's an association between gender (male, female) and preference for a type of
product (A, B, C). You collect data and create a contingency table. You would then calculate the expected
frequencies, compute the chi-square statistic, and compare it to the critical value for the appropriate degrees of
freedom to determine if the association is statistically significant.
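A brief sketch of this gender-by-product test of independence, assuming `scipy` is available; the observed counts are invented for illustration:

```python
# Chi-square test of independence on an invented 2 x 3 contingency table.
from scipy.stats import chi2_contingency

#                product A  product B  product C
observed = [[30,        25,        45],    # male
            [35,        40,        25]]    # female

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")
print("Expected frequencies:\n", expected)
```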
Correlation Analysis
Correlation analysis is a statistical technique used to determine the relationship between two or more variables.
It measures how one variable changes in relation to another, which can be useful for understanding and
predicting patterns in data. Here’s a rundown of the key aspects:
1. Types of Correlation
- Positive Correlation : When one variable increases, the other variable also increases. For example, height and
weight often have a positive correlation.
- Negative Correlation : When one variable increases, the other variable decreases. For example, the amount
of time spent on social media might negatively correlate with academic performance.
- No Correlation : No discernible pattern in how the variables change relative to each other.
2. Correlation Coefficients
- Pearson’s Correlation Coefficient (r) : Measures the strength and direction of the linear relationship between two continuous variables. Values range from -1 to 1, where:
- r = 1 indicates a perfect positive linear relationship,
- r = -1 indicates a perfect negative linear relationship, and
- r = 0 indicates no linear relationship.
- Spearman’s Rank Correlation Coefficient (ρ or rs) : Measures the strength and direction of the monotonic
relationship between two variables. It is used when the data are ordinal or not normally distributed.
- Kendall’s Tau (τ) : Measures the strength and direction of the association between two variables. It’s less
sensitive to outliers compared to Pearson’s r and is used for ordinal data.
3. Calculating Correlation
- Pearson’s r is calculated from the paired observations as r = Σ(xᵢ - x̄)(yᵢ - ȳ) / sqrt( Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)² ).
- Spearman’s ρ is calculated by ranking the data points and then using a formula similar to Pearson’s but applied to the ranks.
- Kendall’s τ involves comparing the number of concordant and discordant pairs of observations.
4. Interpreting Correlation
- Strength : The closer the correlation coefficient is to 1 or -1, the stronger the relationship. Values closer to 0
indicate a weaker relationship.
- Direction : Positive values indicate a direct relationship, while negative values indicate an inverse
relationship.
- Significance : Statistical tests (e.g., hypothesis tests) can be used to determine if the observed correlation is
statistically significant or if it might have occurred by chance.
5. Limitations
- Correlation does not imply causation : A high correlation between two variables does not mean that one
causes the other. There could be other underlying factors or variables involved.
- Sensitivity to outliers : Outliers can disproportionately affect correlation coefficients, particularly Pearson’s r.
- Non-linearity : Pearson’s r only measures linear relationships. Non-linear relationships might not be captured
effectively.
6. Applications
- Business : Analyzing customer data to identify trends and correlations between spending and demographic
variables.
- Healthcare : Studying the relationship between lifestyle factors and health outcomes.
Example
Suppose you have data on the number of hours studied and exam scores. You could use Pearson’s correlation
coefficient to quantify how strongly hours studied and exam scores are related. A high positive correlation
would suggest that more study hours are associated with higher exam scores.
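A small sketch of how these coefficients could be computed with `scipy`, using invented hours-studied and exam-score data:

```python
# Pearson and Spearman correlation on invented study-hours vs. exam-score data.
from scipy.stats import pearsonr, spearmanr

hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 74, 79, 85]

r, p_r = pearsonr(hours, scores)
rho, p_rho = spearmanr(hours, scores)

print(f"Pearson r = {r:.3f} (p = {p_r:.4f})")
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.4f})")
```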
The simple correlation coefficient, often referred to as Pearson's correlation coefficient, is a statistical measure that quantifies the strength and direction of the linear relationship between two variables. It is denoted by r.
1. Formula :
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / sqrt( Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)² )
2. Range :
r always lies between -1 and +1; values near the extremes indicate a strong linear relationship, while values near 0 indicate a weak one.
3. Interpretation :
A positive r means the variables tend to increase together, a negative r means one tends to decrease as the other increases, and r close to 0 means there is little or no linear relationship.
4. Assumptions :
- The relationship between the variables should be linear, and both variables should be continuous.
- The data should be approximately normally distributed, though Pearson's correlation is fairly robust to deviations from normality.
5. Applications :
- Quantifying linear relationships in fields such as business, healthcare, and research, as in the study-hours and exam-score example above.
Interpretation
In statistics, interpretation refers to the process of making sense of statistical results and understanding what
they mean in the context of the data and the research question. Here are some key aspects of interpretation in
statistics:
1. Understanding Results : This involves looking at statistical outputs such as means, medians, standard
deviations, correlations, regression coefficients, p-values, and confidence intervals. You need to know what
these numbers represent and how they relate to your research question.
2. Contextualizing Findings : Statistical results should be interpreted within the context of the study. This
means considering the research design, the variables involved, and the data collection methods. For example, a
correlation coefficient tells you the strength and direction of a relationship between two variables, but it doesn’t
imply causation without additional context.
3. Assessing Statistical Significance : p-values and confidence intervals indicate whether an observed effect is likely to have arisen by chance; they should be interpreted together with the effect size and the study context.
4. Evaluating Assumptions : Many statistical tests have underlying assumptions (e.g., normality, homogeneity of variances). Checking whether your data meet these assumptions is crucial for valid interpretation.
5. Considering Effect Sizes : Effect sizes measure the magnitude of a relationship or difference, not just
whether it is statistically significant. For example, in regression, effect size could be measured by R-squared,
which indicates how well the model explains the variability in the outcome.
6. Generalizing Findings : When interpreting results, consider the extent to which they can be generalized
beyond the sample used in the study. This involves thinking about sample size, sampling method, and any
potential biases.
7. Drawing Conclusions : Based on the results and context, you draw conclusions about your research question.
This might involve making recommendations, noting limitations, and suggesting areas for further research.
8.Communicating Results : Finally, effectively communicating your findings is important. This involves
translating statistical jargon into understandable language for your audience, whether they are other researchers,
policymakers, or the general public.
Scatter Plot
A scatter plot is a type of data visualization commonly used in statistics to display the relationship between two
quantitative variables. Each point on the scatter plot represents an observation in the dataset, with its position
determined by the values of the two variables.
1. Axes : The plot consists of a horizontal axis (x-axis) and a vertical axis (y-axis). Each axis represents one of
the two variables being analyzed.
2. Points : Each point on the plot corresponds to a pair of values, one from each variable. The position of the
point indicates the values of these variables.
3. Trend : By examining the distribution of points, you can often discern patterns or relationships between the
variables. For example, if the points tend to rise together, there might be a positive correlation. If one variable
increases while the other decreases, there might be a negative correlation.
4. Outliers : Scatter plots can help identify outliers, or data points that deviate significantly from the overall
pattern.
Uses of Scatter Plots:
- Identifying Relationships : Scatter plots are useful for seeing if there is a relationship or correlation between
two variables. You might observe linear relationships, curves, or no discernible pattern.
- Assessing Strength and Direction : The scatter plot helps assess the strength (how closely the points follow a
trend) and direction (positive or negative) of the relationship.
- Detecting Outliers : Outliers can be easily spotted as points that fall far away from the general cluster of data.
Example:
Imagine you're studying the relationship between hours studied and test scores. You would plot hours studied on
the x-axis and test scores on the y-axis. Each student’s data would be represented by a point on the scatter plot.
If the points tend to form an upward trend, it suggests that more study hours are associated with higher test
scores.
Creating a Scatter Plot:
1. Set Up the Axes : Assign one variable to the x-axis and the other to the y-axis.
2. Plot Points : For each observation, plot a point where the x-coordinate is the value of the first variable and the y-coordinate is the value of the second variable.
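A minimal sketch of such a scatter plot using `matplotlib`, with invented data for hours studied and test scores:

```python
# Scatter plot of hours studied (x-axis) vs. test scores (y-axis).
import matplotlib.pyplot as plt

hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 74, 79, 85]

plt.scatter(hours, scores)
plt.xlabel("Hours studied")      # x-axis: first variable
plt.ylabel("Test score")         # y-axis: second variable
plt.title("Hours studied vs. test score")
plt.show()
```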
Linear Regression
Linear regression is a powerful tool used in statistics and machine learning for modeling the relationship between variables. Let's delve deeper into its components and application:
1. The Model
Simple Linear Regression
Model :
y = β0 + β1 x + ε
Objective : Find the coefficients β0 and β1 that minimize the difference between the observed values and the values predicted by the model.
Multiple Linear Regression
Model :
y = β0 + β1 x1 + β2 x2 + ... + βp xp + ε
Objective : Find the best-fitting plane (or hyperplane) in a multidimensional space that predicts y from x1, x2, ..., xp.
2. Assumptions
1. Linearity : The relationship between the dependent and independent variables is linear.
2. Independence : The residuals should be independent of one another.
3. Homoscedasticity : The residuals should have constant variance at all levels of the independent variables.
4. Normality : The residuals should be approximately normally distributed (important for hypothesis testing and confidence intervals).
3. Estimation of Coefficients
The coefficients are estimated using the Ordinary Least Squares (OLS) method, which minimizes the sum of squared residuals (differences between observed and predicted values). In matrix form, the OLS estimator is:
β̂ = (X^T X)^(-1) X^T y
where X is the matrix of predictor values and y is the vector of observed responses.
4. Evaluation Metrics
1. R-squared (R²) : Measures how well the independent variables explain the variation in the dependent variable.
2. Mean Squared Error (MSE) : Average of the squared differences between observed and predicted values.
3. Root Mean Squared Error (RMSE) : Square root of MSE, providing a measure of the average magnitude of the errors.
- Formula : MSE = (1/n) Σ (y_i - ŷ_i)^2 and RMSE = sqrt(MSE), where ŷ_i is the predicted value for observation i.
6. Practical Considerations
- Feature Scaling : Independent variables may need to be scaled (e.g., standardization) for better performance,
especially in multiple regression.
- Multicollinearity : When independent variables are highly correlated, it can affect the stability of coefficient
estimates.
- Model Validation : Use techniques like cross-validation to assess the model’s performance and avoid
overfitting.
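As a brief sketch of the workflow above, a simple linear regression can be fitted with `scikit-learn` and evaluated with R-squared, MSE and RMSE; the data points below are invented for illustration:

```python
# Fit a simple linear regression (OLS) and report the usual evaluation metrics.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X = np.array([[1], [2], [3], [4], [5], [6]])    # single independent variable
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1, 12.2])  # dependent variable

model = LinearRegression().fit(X, y)            # OLS fit
y_pred = model.predict(X)

print("slope (beta1):", model.coef_[0])
print("intercept (beta0):", model.intercept_)
print("R^2:", r2_score(y, y_pred))
mse = mean_squared_error(y, y_pred)
print("MSE:", mse, "RMSE:", np.sqrt(mse))
```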
Polynomial regression
Polynomial regression is a type of regression analysis used to model the relationship between a dependent variable and one or more independent variables using a polynomial function. It's an extension of linear regression where the relationship between the variables is modeled as an nth-degree polynomial.
Key Concepts
1. Polynomial Model: The relationship is modeled as y = β0 + β1 x + β2 x^2 + ... + βn x^n + ε.
2. Degree of the Polynomial: The degree of the polynomial determines the flexibility of the model. For example, a degree-1 polynomial is a straight line, degree 2 is a parabola, and degree 3 is a cubic curve.
3. Model Fitting: In polynomial regression, you fit the polynomial function to the data by estimating the
coefficients using methods such as least squares.
4. Overfitting: Higher-degree polynomials can fit the training data very well, but they might also fit the noise in
the data, leading to poor generalization to new data. This is known as overfitting.
5. Underfitting: On the other hand, a polynomial of too low a degree might not capture the underlying trend of
the data, leading to underfitting.
Steps in Polynomial Regression
1. Select the Degree: Choose the degree n of the polynomial based on domain knowledge, exploratory data analysis, or cross-validation.
2. Transform the Features: For a given degree n, transform the input features x into polynomial features up to
degree n.
3. Fit the Model: Use a regression method (like least squares) to find the coefficients of the polynomial that best
fits the transformed data.
4. Evaluate the Model: Assess the model’s performance using metrics such as Mean Squared Error (MSE) and
ensure that it generalizes well to unseen data.
Example
Suppose you have data points (x, y) and you want to model the relationship between x and y. For a quadratic polynomial, you would fit a model of the form:
y = β0 + β1 x + β2 x^2
where β0, β1 and β2 are determined by minimizing the difference between the predicted values and the actual values of y.
Many data science libraries provide functionality for polynomial regression. For example:
- Python: You can use libraries like `scikit-learn` to perform polynomial regression by using
`PolynomialFeatures` to generate polynomial features and then applying a linear regression model.
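A short sketch of this `PolynomialFeatures` plus `LinearRegression` approach for a quadratic fit, using invented data that roughly follows y = 1 + x^2:

```python
# Quadratic polynomial regression via a scikit-learn pipeline.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([1.0, 1.8, 5.1, 10.2, 17.0, 26.1])   # roughly y = 1 + x^2 (invented)

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

print(model.predict([[6]]))    # prediction for x = 6, expected near 37
```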
Logistic Regression
Logistic regression is a type of regression analysis used for binary classification problems. It models the
probability that a given input belongs to a particular class. When dealing with just one independent variable, the
logistic regression model is relatively simple and easy to interpret.
For a logistic regression model with one variable, the objective is to predict a binary outcome, y , where y can
be 0 or 1. The model predicts the probability that y = 1 given the input x.
Model Equation
The probability that y = 1 given input x is modeled with the logistic (sigmoid) function:
P(y = 1 | x) = 1 / (1 + e^-(β0 + β1 x))
Fitting the Model
To train a logistic regression model, you need to find the optimal parameters that best fit the data.
This is typically done using maximum likelihood estimation.
1. Cost Function: The cost function for logistic regression is the log loss or cross-entropy loss, defined as:
J(β0, β1) = -(1/m) Σ [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]
where m is the number of training examples and ŷ_i = P(y_i = 1 | x_i) is the predicted probability.
2. Gradient Descent: To minimize the cost function and find the optimal parameters, you can use gradient descent. The update rules for the parameters are:
β_j := β_j - α ∂J/∂β_j, for j = 0, 1
where α is the learning rate.
3. Convergence: Gradient descent continues updating the parameters until the cost function converges to a
minimum value.
Making Predictions
To predict the class for a new input x, you compute the probability P(y = 1 | x) from the model equation; if this probability is at least 0.5, predict class 1, otherwise predict class 0.
Example
Let's say you have data on whether students passed or failed an exam based on the number of hours studied. You
can fit a logistic regression model to predict the probability of passing the exam based on hours studied.
1. Collect Data: Gather data with features (hours studied) and labels (pass/fail).
2. Fit the Model: Estimate β0 and β1 from the data, typically by maximum likelihood.
3. Predict: For a new value of hours studied, compute the probability of passing from the fitted model. If this probability is greater than 0.5, predict that the student will pass.
- Python: Libraries like `scikit-learn` provide built-in functions to perform logistic regression. For example, you
can use `LogisticRegression` from `sklearn.linear_model`.
- R: In R, you can use the `glm` function with the `family = binomial` argument to perform logistic regression.
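A minimal sketch of this hours-studied example with `LogisticRegression` from `scikit-learn`; the pass/fail data are invented for illustration:

```python
# Logistic regression on invented pass/fail data (1 = pass, 0 = fail).
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)

p_pass = model.predict_proba([[4.5]])[0, 1]   # P(pass | 4.5 hours studied)
print(f"P(pass) = {p_pass:.2f}, predicted class = {int(p_pass > 0.5)}")
```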
Multiple Linear Regression
Linear regression with multiple variables, often called multiple linear regression, extends simple linear
regression to accommodate multiple predictors. It models the relationship between a dependent variable and
several independent variables using a linear function.
Model Equation
y = β0 + β1 x1 + β2 x2 + ... + βp xp + ε
where y is the dependent variable, x1, ..., xp are the independent variables, β0, ..., βp are the coefficients, and ε is the error term.
Objective
The goal is to estimate the coefficients β0, β1, ..., βp such that the difference between the predicted values and the actual values is minimized. This is usually done by minimizing the sum of squared residuals (errors).
Cost Function
The cost function for multiple linear regression is the Mean Squared Error (MSE), which is given by:
MSE = (1/n) Σ (y_i - ŷ_i)^2
Fitting Methods
1. Ordinary Least Squares (OLS): This method minimizes the sum of squared residuals to find the best coefficients. It's the most common method for fitting linear regression models.
2. Gradient Descent: An iterative optimization algorithm that adjusts the coefficients to minimize the cost
function. This is useful when dealing with very large datasets or complex models.
Steps in Building the Model
1. Data Preparation:
- Collect the data, handle missing values, and encode any categorical predictors before fitting.
2. Fit the Model:
- Use a fitting method to estimate the coefficients. In Python, libraries like `scikit-learn` offer the `LinearRegression` class which can perform this task.
3. Evaluate the Model:
- Use metrics such as R-squared, Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) to assess the performance of the model.
- Perform residual analysis to check if the assumptions of linear regression are met.
4. Make Predictions:
- Apply the fitted equation to new observations to predict the dependent variable.
5. Interpret the Coefficients:
- Evaluate which predictors have the most influence on the dependent variable.
Example
Suppose you want to predict a house price based on features like square footage, number of bedrooms, and age of the house. Your model might look like:
Price = β0 + β1 (Square Footage) + β2 (Number of Bedrooms) + β3 (Age) + ε
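A brief sketch of this house-price model with `scikit-learn`; the dataset below is a tiny invented example (square footage, bedrooms, age):

```python
# Multiple linear regression: predict price from square footage, bedrooms, and age.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([
    [1500, 3, 10],
    [2000, 4, 5],
    [1200, 2, 30],
    [1800, 3, 15],
    [2500, 4, 2],
    [1000, 2, 40],
])
y = np.array([300000, 420000, 220000, 350000, 520000, 180000])

model = LinearRegression().fit(X, y)

print("Coefficients (sqft, bedrooms, age):", model.coef_)
print("Intercept:", model.intercept_)
print("Predicted price for a 1600 sqft, 3-bed, 12-year-old house:",
      model.predict([[1600, 3, 12]])[0])
```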