Unit-IV of Data Science
Linear Regression
Assumptions of Linear Regression
• Linear Relationship
• Multivariate Normality
• No or Little Multicollinearity
• No or Little Autocorrelation
• Homoscedasticity
Linear Regression is a linear approach to modeling the relationship between a
dependent variable and one or more independent variables; the case with a single
independent variable is called simple linear regression. An independent variable is a
variable that is controlled or varied in a scientific experiment to test its effect on the
dependent variable. A dependent variable is the variable being measured in the
experiment.
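As a concrete illustration, here is a minimal sketch of fitting a simple linear regression with NumPy. The data are synthetic, and the true coefficients (intercept 1, slope 2) are assumptions made only for this example.

```python
# Minimal sketch: fit y = b0 + b1*x by least squares on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)            # independent variable
y = 1.0 + 2.0 * x + rng.normal(0, 1, 50)   # dependent variable with noise

# Closed-form least-squares estimates of slope and intercept.
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0 = y.mean() - b1 * x.mean()
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")  # close to 1 and 2
```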
These values in the population are called parameters. Parameters are the fixed but unknown
characteristics of the entire population, such as the population mean and median. Sample
statistics describe the characteristics of the fraction of the population that is taken as the
sample. Once a sample is drawn, statistics such as the sample mean and median are known,
because they can be computed directly from the sample data.
• Instead, the company might select a sample of the population. A sample is a smaller
group of members of a population selected to represent the population. In order to use
statistics to learn things about the population, the sample must be random. A random
sample is one in which every member of a population has an equal chance of being
selected. The most commonly used sample is a simple random sample. It requires that
every possible sample of the selected size has an equal chance of being used.
• A parameter is a characteristic of a population. A statistic is a characteristic of a
sample. Inferential statistics enables you to make an educated guess about a
population parameter based on a statistic computed from a sample randomly drawn
from that population, as the sketch below illustrates.
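The following sketch shows the parameter/statistic distinction on a synthetic population (the population and its parameters are assumptions for the demo): a simple random sample, drawn without replacement, yields a sample mean that estimates the population mean.

```python
# Sketch: estimate a population parameter (the mean) from a simple
# random sample drawn without replacement. Population is synthetic.
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=100, scale=15, size=100_000)  # parameter: mu near 100

sample = rng.choice(population, size=200, replace=False)  # simple random sample
print(f"population mean (parameter): {population.mean():.2f}")
print(f"sample mean (statistic):     {sample.mean():.2f}")
```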
Estimate, Estimator
What is an estimator?
In machine learning, an estimator is an equation for picking the "best," or most likely
accurate, data model based upon observations in reality. Not to be confused with
estimation in general, the estimator is the formula that evaluates a given quantity (the
estimand) and generates an estimate. This estimate can then be fed into a downstream
system, such as a deep learning classifier, to determine what action to take.
Uses of Estimators
• By quantifying guesses, estimators are how machine learning in
theory is implemented in practice. Without the ability to estimate the
parameters of a dataset (such as the layers in a neural network or the
bandwidth in a kernel), there would be no way for an AI system to
“learn.”
• A simple example of estimators and estimation in practice is the so-
called "German Tank Problem" from World War Two. The Allies had
no way to know for sure how many tanks the Germans were building
every month. Using the serial numbers of captured or destroyed tanks
as data, Allied statisticians created an estimator rule for the total
number of tanks (the estimand). This equation estimated the number of
tanks from the largest observed sequential serial number, applying
minimum-variance analysis to generate the most likely estimate of how
many new tanks Germany was building. A sketch of this estimator
appears below.
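One standard estimator for this problem (under a uniform serial-number model, an assumption here) is the minimum-variance unbiased rule: with k observed serials whose maximum is m, the estimated total is m(1 + 1/k) - 1. The serial numbers below are made up for illustration.

```python
# German Tank Problem sketch: minimum-variance unbiased estimator
# N_hat = m * (1 + 1/k) - 1, where m is the largest observed serial
# number and k is the number of observations.
def estimate_tank_count(serials):
    k = len(serials)
    m = max(serials)
    return m * (1 + 1 / k) - 1

captured = [38, 112, 17, 94, 131]      # hypothetical serial numbers
print(estimate_tank_count(captured))   # 156.2
```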
Types of Estimators
Estimators come in two broad categories: point and interval. Point estimators
generate a single-value result, such as a standard deviation, that can be plugged into
a deep learning algorithm's classifier functions. Interval estimators generate a
range of likely values, such as a confidence interval, for analysis (see the sketch below).
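A small sketch of the two categories on synthetic data (the population parameters are assumptions): the sample mean as a point estimate, and a t-based 95% confidence interval for the same parameter as an interval estimate.

```python
# Point estimate (sample mean) vs. interval estimate (95% CI).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=50, scale=5, size=30)   # synthetic sample

point = data.mean()                            # point estimate of the mean
sem = stats.sem(data)                          # standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(data) - 1, loc=point, scale=sem)
print(f"point estimate: {point:.2f}")
print(f"95% interval estimate: ({lo:.2f}, {hi:.2f})")
```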
In addition, each estimator rule can be tailored to generate different types of
estimates:
• Biased - an estimate that systematically overestimates or underestimates the
parameter.
• Efficient - an estimate with the smallest possible variance; the smallest-variance
estimator is referred to as the "best" estimator.
• Invariant - a less flexible estimate that isn't easily changed by data
transformations.
• Shrinkage - a raw estimate that's combined with other information to create a
composite estimate.
• Sufficient - an estimate that uses all the information about the parameter that
the sample contains.
• Unbiased - an estimate whose expected value matches the parameter, neither
underestimating nor overestimating it on average.
Properties of Good Estimators
A distinction is made between an estimate and an estimator.
The numerical value of the sample mean is said to be an
estimate of the population mean. On the other hand, the
statistical measure used, that is, the method of estimation,
is referred to as an estimator. A good estimator, as common
sense dictates, is close to the parameter being estimated. Its
quality is to be evaluated in terms of the following properties:
• Unbiasedness
• Efficiency
• Consistency
• Sufficiency
1. Unbiasedness.
An estimator is said to be unbiased if its expected value is identical with the population
parameter being estimated. That is, if θ̂ is an unbiased estimator of θ, then we must have
E(θ̂) = θ. Many estimators are "asymptotically unbiased" in the sense that the bias reduces to a
practically insignificant value (zero) when n becomes sufficiently large. The estimator S² (the
sample variance computed with divisor n) is an example; a simulation sketch follows below.
It should be noted that bias in estimation is not necessarily undesirable. It may turn out to be
an asset in some situations.
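A simulation sketch on synthetic normal data (the population variance of 4 is an assumption): the variance estimator with divisor n is biased low by the factor (n-1)/n, which vanishes as n grows, while the divisor n-1 version is unbiased.

```python
# Bias in variance estimation: divisor n underestimates sigma^2 on
# average; divisor n-1 is unbiased. Averaging over many trials
# approximates the expected value of each estimator.
import numpy as np

rng = np.random.default_rng(3)
sigma2, n, trials = 4.0, 10, 100_000

samples = rng.normal(0, np.sqrt(sigma2), size=(trials, n))
with_n  = samples.var(axis=1, ddof=0).mean()   # divisor n   -> about 3.6
with_n1 = samples.var(axis=1, ddof=1).mean()   # divisor n-1 -> about 4.0
print(f"divisor n:   {with_n:.3f}  (true sigma^2 = {sigma2})")
print(f"divisor n-1: {with_n1:.3f}")
```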
2. Consistency.
If an estimator, say θ̂, approaches the parameter θ closer and closer as the sample size n
increases, θ̂ is said to be a consistent estimator of θ. Stated somewhat more rigorously, the
estimator θ̂ is said to be a consistent estimator of θ if, as n approaches infinity, the probability
approaches 1 that θ̂ will differ from the parameter θ by no more than an arbitrarily small
constant.
The sample mean is an unbiased estimator of µ no matter what form the population
distribution assumes, while the sample median is an unbiased estimate of µ only if the
population distribution is symmetrical. The sample mean is better than the sample median as
an estimate of µ in terms of both unbiasedness and consistency. A simulation sketch of
consistency follows below.
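A simulation sketch of consistency on synthetic data (µ = 10 and the normal population are assumptions): as n grows, the sample mean settles ever closer to µ.

```python
# Consistency: the sample mean converges to the population mean mu
# as the sample size n increases.
import numpy as np

rng = np.random.default_rng(4)
mu = 10.0
for n in (10, 100, 10_000, 1_000_000):
    sample = rng.normal(mu, 5, size=n)
    print(f"n = {n:>9,}: sample mean = {sample.mean():.4f}")
```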
3. Efficiency.
The concept of efficiency refers to the sampling variability of an estimator. If two
competing estimators are both unbiased, the one with the smaller variance (for a given
sample size) is said to be relatively more efficient. Stated in somewhat different
language, an estimator θ̂₁ is said to be more efficient than another estimator θ̂₂ for θ if
the variance of the first is less than the variance of the second. The smaller the variance
of the estimator, the more concentrated is the distribution of the estimator around the
parameter being estimated and, therefore, the better this estimator is. A simulation
sketch follows below.
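A sketch comparing two unbiased estimators of the mean of a normal population (the population and sample size are assumptions): the sample mean has smaller sampling variance than the sample median, so it is the more efficient of the two; for large n the variance ratio approaches π/2 ≈ 1.57.

```python
# Relative efficiency: for normal data, both the sample mean and the
# sample median are unbiased for mu, but the mean varies less across
# repeated samples, i.e., it is more efficient.
import numpy as np

rng = np.random.default_rng(5)
n, trials = 100, 50_000
samples = rng.normal(0, 1, size=(trials, n))

var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()
print(f"Var(sample mean):   {var_mean:.5f}")
print(f"Var(sample median): {var_median:.5f}")
print(f"ratio: {var_median / var_mean:.2f}")   # roughly pi/2, about 1.57
```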
4. Sufficiency.
An estimator is said to be sufficient if it conveys as much information as possible about
the parameter from what is contained in the sample. The significance of sufficiency lies in the
fact that, if a sufficient estimator exists, it is absolutely unnecessary to consider any
other estimator; a sufficient estimator ensures that all the information a sample
can furnish with respect to the estimation of a parameter is being utilized.
Many methods have been devised for estimating parameters that may provide
estimators satisfying these properties. The two most important methods are the method
of least squares and the method of maximum likelihood; a maximum-likelihood sketch
follows below.
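As a small illustration of the maximum-likelihood method, the sketch below numerically maximizes the normal log-likelihood on synthetic data and compares the result with the closed-form MLEs (the sample mean, and the variance with divisor n). The data and starting values are assumptions.

```python
# Maximum likelihood for a normal sample: minimize the negative
# log-likelihood over (mu, sigma) and compare with closed forms.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
data = rng.normal(loc=3.0, scale=2.0, size=500)

def neg_log_likelihood(params):
    mu, sigma = params
    return np.sum(np.log(sigma * np.sqrt(2 * np.pi))
                  + (data - mu) ** 2 / (2 * sigma ** 2))

res = minimize(neg_log_likelihood, x0=[0.0, 1.0],
               bounds=[(None, None), (1e-6, None)])
mu_hat, sigma_hat = res.x
print(f"numeric MLE:     mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
print(f"closed-form MLE: mu = {data.mean():.3f}, sigma = {data.std(ddof=0):.3f}")
```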
Estimate and Estimators
Let X be a random variable having distribution f_X(x; θ),
where θ is an unknown parameter. A random sample X₁,
X₂, …, Xₙ of size n is taken on X.