Unit-IV of Data Science

It's a data science notes ppt

Uploaded by

dimplejangid1808

Unit-IV : Statistics

Dr. Amit Kumar Chaturvedi


Assistant Prof., CSE(MCA) Dept.,
Engineering College, Ajmer
Basic Terminology in Statistics
Statistics is a form of mathematical analysis that uses quantified models
and representations for a given set of experimental data or real-life
studies. Its main advantage is that it presents information in an easily
interpretable way. To work confidently with statistics, we should be
familiar with certain terms and concepts. They are:
• Understand the Type of Analytics
• Probability
• Central Tendency
• Variability
• Relationship Between Variables
• Probability Distribution
• Hypothesis Testing and Statistical Significance
• Regression
Understand the Type of Analytics

• Descriptive Analytics tells us what happened in the past and
helps a business understand how it is performing by providing
context to help stakeholders interpret information.
• Diagnostic Analytics takes descriptive data a step further and
helps you understand why something happened in the past.
• Predictive Analytics predicts what is most likely to happen in
the future and provides companies with actionable insights
based on the information.
• Prescriptive Analytics provides recommendations regarding
actions that will take advantage of the predictions and guide
the possible actions toward a solution.
Probability
Probability is the measure of the likelihood that an event will occur in a Random Experiment.
Complement: P(A) + P(A′) = 1
Intersection: P(A∩B) = P(A|B)P(B); for independent events this reduces to P(A)P(B)
Union: P(A∪B) = P(A) + P(B) − P(A∩B)

Fig. Intersection and Union.


Conditional Probability: P(A|B) is a measure of the probability of one event occurring with some relationship to one or
more other events. P(A|B)=P(A∩B)/P(B), when P(B)>0.
Independent Events: Two events are independent if the occurrence of one does not affect the probability of occurrence
of the other. P(A∩B)=P(A)P(B) where P(A) != 0 and P(B) != 0 , P(A|B)=P(A), P(B|A)=P(B)
Mutually Exclusive Events: Two events are mutually exclusive if they cannot both occur at the same time. P(A∩B)=0 and
P(A∪B)=P(A)+P(B).
Bayes’ Theorem describes the probability of an event based on prior knowledge of conditions that might be related to
the event: P(A|B) = P(B|A)P(A)/P(B), when P(B) > 0.
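The notes contain no code, so here is an illustrative Python sketch of Bayes' Theorem. The screening-test scenario and its numbers (1% prevalence, 99% sensitivity, 5% false-positive rate) are invented for the example, not taken from the slides:

```python
# Sketch of Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B).
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Posterior P(A|B) from the likelihood, prior, and evidence."""
    return p_b_given_a * p_a / p_b

p_a = 0.01              # prior: P(disease)           (made-up number)
p_b_given_a = 0.99      # likelihood: P(positive | disease)
p_b_given_not_a = 0.05  # false-positive rate: P(positive | no disease)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|A')P(A')
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = bayes(p_b_given_a, p_a, p_b)
print(posterior)   # ≈ 0.1667 — far below the 0.99 test sensitivity
```

The point of the example: even a very accurate test yields a modest posterior when the prior P(A) is small.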
Central Tendency
• Mean: The average of the dataset.
• Median: The middle value of an ordered dataset.
• Mode: The most frequent value in the dataset. If
the data have multiple values that occurred the
most frequently, we have a multimodal
distribution.
• Skewness: A measure of symmetry.
• Kurtosis: A measure of whether the data are
heavy-tailed or light-tailed relative to a normal
distribution
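The central-tendency measures above can be computed with Python's standard `statistics` module; the small dataset is made up for illustration:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # average of the dataset
median = statistics.median(data)  # middle value of the ordered dataset
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)

# statistics.multimode returns every mode of a multimodal dataset
print(statistics.multimode([1, 1, 2, 2, 3]))
```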
Variability
• Range: The difference between the highest and lowest value in the dataset.
• Percentiles, Quartiles and Interquartile Range (IQR)
• Percentiles — A measure that indicates the value below which a given
percentage of observations in a group of observations falls.
• Quartiles — Values that divide the ordered data points into four more or
less equal parts, or quarters.
• Interquartile Range (IQR)— A measure of statistical dispersion and
variability based on dividing a data set into quartiles. IQR = Q3 − Q1

Fig. Percentiles, Quartiles and Interquartile Range (IQR).

Fig. Population and Sample Variance and Standard Deviation:
population variance σ² = Σ(x − µ)²/N; sample variance s² = Σ(x − x̄)²/(n − 1);
the standard deviation is the square root of the variance.
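A standard-library sketch of the spread measures above; the data are made up. Note that `statistics.quantiles` uses the "exclusive" interpolation method by default, so quartile values may differ slightly from hand-computed ones:

```python
import statistics

data = [4, 7, 9, 11, 12, 20]

value_range = max(data) - min(data)          # range = highest - lowest

# With n=4, statistics.quantiles returns the three quartiles Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                                # interquartile range

sample_var = statistics.variance(data)       # divides by n - 1
population_var = statistics.pvariance(data)  # divides by n
sample_std = statistics.stdev(data)          # sqrt of sample variance

print(value_range, iqr, sample_var, population_var)
```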
Relationship Between Variables
• Causality: Relationship between two events where one event is affected by the other.
• Covariance: A quantitative measure of the joint variability between two or more
variables.
• Correlation: Measure the relationship between two variables and ranges from -1 to 1,
the normalized version of covariance.

Fig. Covariance and Correlation.


Probability Distributions
Probability Distribution Functions
Probability Mass Function (PMF): A function that gives the probability that
a discrete random variable is exactly equal to some value.
Probability Density Function (PDF): A function for continuous data where the
value at any given sample can be interpreted as providing a relative likelihood that
the value of the random variable would equal that sample.
Cumulative Distribution Function (CDF): A function that gives the probability that a
random variable is less than or equal to a certain value.

Fig. Comparison between PMF, PDF, and CDF.


Continuous Probability Distribution
• Uniform Distribution: Also called a rectangular distribution, is a probability
distribution where all outcomes are equally likely.
• Normal/Gaussian Distribution: The curve of the distribution is bell-shaped and
symmetrical and is related to the Central Limit Theorem that the sampling
distribution of the sample means approaches a normal distribution as the
sample size gets larger.
• Exponential Distribution: A probability distribution of the time
between the events in a Poisson point process.
• Chi-Square Distribution: The distribution of the sum of squared
standard normal deviates.
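The Central Limit Theorem mentioned above can be sketched with a small simulation: sample means drawn from a (non-normal) uniform distribution cluster around the population mean 0.5. The sample sizes, repetition count, and seed are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(42)

def sample_mean(sample_size: int) -> float:
    """Mean of one sample drawn from Uniform(0, 1)."""
    return statistics.mean(random.random() for _ in range(sample_size))

# The sampling distribution of the mean concentrates around 0.5,
# with standard deviation roughly sqrt(1/12)/sqrt(n) ≈ 0.029 for n = 100.
means = [sample_mean(100) for _ in range(1000)]
print(statistics.mean(means), statistics.stdev(means))
```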
Discrete Probability Distribution
Bernoulli Distribution: The distribution of a random variable
which takes a single trial and only 2 possible outcomes,
namely 1(success) with probability p, and 0(failure) with
probability (1-p).
Binomial Distribution: The distribution of the number of
successes in a sequence of n independent experiments, and
each with only 2 possible outcomes, namely 1(success) with
probability p, and 0(failure) with probability (1-p).
Poisson Distribution: The distribution that expresses the
probability of a given number of events k occurring in a fixed
interval of time if these events occur with a known constant
average rate λ and independently of the time.
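The discrete PMFs described above can be written out from their standard formulas using only the `math` module; the parameter values passed in at the bottom are arbitrary examples:

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p); n = 1 gives the Bernoulli case."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * math.exp(-lam) / math.factorial(k)

print(binomial_pmf(1, 1, 0.3))   # Bernoulli: P(success) = p = 0.3
print(binomial_pmf(2, 10, 0.5))  # 2 successes in 10 fair trials
print(poisson_pmf(0, 2.0))       # no events when the rate is 2
```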
Hypothesis Testing and Statistical Significance

Null and Alternative Hypothesis


Null Hypothesis: A general statement that there is no relationship between two
measured phenomena or no association among groups. Alternative Hypothesis: The
statement contrary to the null hypothesis.
In statistical hypothesis testing, a type I error is the rejection of a true null hypothesis,
while a type II error is the non-rejection of a false null hypothesis.
Interpretation
P-value: The probability of the test statistic being at least as extreme as the one
observed given that the null hypothesis is true. When p-value > α, we fail to reject
the null hypothesis, while p-value ≤ α, we reject the null hypothesis, and we can
conclude that we have a significant result.
Critical Value: A point on the scale of the test statistic beyond which we reject the
null hypothesis and is derived from the level of significance α of the test. It
depends upon a test statistic, which is specific to the type of test, and the
significance level, α, which defines the sensitivity of the test.
Significance Level and Rejection Region: The rejection region is actually
dependent on the significance level. The significance level is denoted by α and is
the probability of rejecting the null hypothesis if it is true.
Z-Test
A Z-test is any statistical test for which the distribution of the test statistic under
the null hypothesis can be approximated by a normal distribution and tests the
mean of a distribution in which we already know the population variance.
Therefore, many statistical tests can be conveniently performed as approximate Z-
tests if the sample size is large or the population variance is known.
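A sketch of a one-sample, two-sided Z-test: the sample, the hypothesized mean, and the "known" population standard deviation below are all invented for illustration. The normal CDF is built from `math.erf`, so no third-party library is needed:

```python
import math
import statistics

sample = [52, 49, 55, 51, 53, 50, 54, 48, 56, 52]
mu0 = 50.0    # hypothesized population mean under H0 (made up)
sigma = 3.0   # assumed known population standard deviation (made up)

n = len(sample)
# Z statistic: how many standard errors the sample mean is from mu0
z = (statistics.mean(sample) - mu0) / (sigma / math.sqrt(n))

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

p_value = 2 * (1 - normal_cdf(abs(z)))   # two-sided p-value
print(z, p_value)   # reject H0 at alpha = 0.05 if p_value <= 0.05
```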
T-Test
A T-test is the statistical test used when the population variance is unknown and the
sample size is small (n < 30).
Paired sample means that we collect data twice from the same group, person, item,
or thing. Independent sample implies that the two samples must have come from
two completely different populations.
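The independent-samples case can be sketched by computing the pooled (equal-variance) t statistic. The two samples are made up; the p-value would require the t distribution, which the standard library lacks, so only the statistic is shown:

```python
import math
import statistics

a = [12.1, 13.4, 11.8, 12.9, 13.0]   # made-up group A measurements
b = [11.2, 11.9, 12.0, 11.5, 11.7]   # made-up group B measurements

na, nb = len(a), len(b)

# Pooled variance: the two sample variances weighted by their degrees of freedom
sp2 = ((na - 1) * statistics.variance(a)
       + (nb - 1) * statistics.variance(b)) / (na + nb - 2)

# t statistic: difference in sample means over its standard error
t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
print(t)   # compare against the t critical value with na + nb - 2 = 8 df
```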
ANOVA (Analysis of Variance)
ANOVA is the way to find out if experimental results are significant. One-way
ANOVA compares the means of two or more independent groups using only one
independent variable. Two-way ANOVA is the extension of one-way ANOVA using
two independent variables to calculate the main effect and interaction effect.
Chi-Square Test
The Chi-Square Test checks how well the observed counts of a discrete set of data
points match the counts expected under a model. The Goodness of Fit Test determines
whether a sample of one categorical variable fits a hypothesized distribution. The
Chi-Square Test for Independence compares two categorical variables to see if there
is a relationship between them.
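The goodness-of-fit statistic is the sum of (observed − expected)² / expected over the categories. A fair six-sided die with invented roll counts serves as the example; the p-value would need the chi-square distribution, so only the statistic is shown:

```python
observed = [8, 12, 9, 11, 10, 10]   # made-up counts from 60 die rolls
expected = [60 / 6] * 6             # fair-die expectation: 10 per face

# Chi-square goodness-of-fit statistic
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)   # compare to the critical value with 6 - 1 = 5 degrees of freedom
```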
Regression

Linear Regression
Assumptions of Linear Regression
• Linear Relationship
• Multivariate Normality
• No or Little Multicollinearity
• No or Little Autocorrelation
• Homoscedasticity
Linear Regression is a linear approach to modeling the relationship between a
dependent variable and one independent variable. An independent variable is a
variable that is controlled in a scientific experiment to test the effects on the
dependent variable. A dependent variable is a variable being measured in a
scientific experiment.

Linear Regression Formula: y = β0 + β1x + ε, where β0 is the intercept, β1 the slope, and ε the error term.


Multiple Linear Regression is a linear approach to modeling the relationship
between a dependent variable and two or more independent variables.
Steps for Running the Linear Regression
Step 1: Understand the model description, causality, and directionality
Step 2: Check the data, categorical data, missing data, and outliers
Outlier is a data point that differs significantly from other observations. We can use the standard deviation
method and interquartile range (IQR) method.
Dummy variable takes only the value 0 or 1 to indicate the effect for categorical variables.
Step 3: Simple Analysis — Check the effect comparing between dependent variable to independent variable and
independent variable to independent variable
Use scatter plots to check the correlation
Multicollinearity occurs when two or more independent variables are highly correlated. We can use the Variance
Inflation Factor (VIF) to measure it: if VIF > 5, the variables are highly correlated, and if VIF > 10, there is certainly
multicollinearity among the variables.
Interaction Term implies a change in the slope from one value to another value.
Step 4: Multiple Linear Regression — Check the model and the correct variables
Step 5: Residual Analysis
Check normal distribution and normality for the residuals.
Homoscedasticity describes a situation in which the error term is the same across all values of the independent
variables and means that the residuals are equal across the regression line.
Step 6: Interpretation of Regression Output
R-Squared is a statistical measure of fit that indicates how much variation of a dependent variable is explained by
the independent variables. A higher R-Squared value represents smaller differences between the observed data
and fitted values.
P-value
Regression Equation
Populations, Samples, Parameters, and Statistics
The field of inferential statistics enables you to make educated guesses about the numerical
characteristics of large groups. The logic of sampling gives you a way to test conclusions about
such groups using only a small portion of its members.
A population is a group of phenomena that have something in common. The term often refers
to a group of people, as in the following examples:
• All registered voters in Crawford County
• All members of the International Machinists Union
• All Americans who played golf at least once in the past year

But populations can refer to things as well as people:


• All widgets produced last Tuesday by the Acme Widget Company
• All daily maximum temperatures in July for major U.S. cities
• All basal ganglia cells from a particular rhesus monkey

These values in the population are called parameters. Parameters are the usually unknown
characteristics of the entire population, such as the population mean and median. Sample
statistics describe the characteristics of the fraction of the population that is taken as the
sample; once the sample is drawn, the sample mean and median are fixed and known.
• In practice, we select a sample of the population instead. A sample is a smaller
group of members of a population selected to represent the population. In order to use
statistics to learn things about the population, the sample must be random. A random
sample is one in which every member of a population has an equal chance of being
selected. The most commonly used sample is a simple random sample. It requires that
every possible sample of the selected size has an equal chance of being used.
• A parameter is a characteristic of a population. A statistic is a characteristic of a
sample. Inferential statistics enables you to make an educated guess about a
population parameter based on a statistic computed from a sample randomly drawn
from that population
Estimate, Estimator
What is an estimator?
In machine learning, an estimator is an equation for
picking the “best,” or most likely accurate, data
model based upon observations in reality. Not to be
confused with estimation in general, the estimator
is the formula that evaluates a given quantity (the
estimand) and generates an estimate. This estimate
is then inserted into the deep learning classifier
system to determine what action to take.
Uses of Estimators
• By quantifying guesses, estimators are how machine learning in
theory is implemented in practice. Without the ability to estimate the
parameters of a dataset (such as the layers in a neural network or the
bandwidth in a kernel), there would be no way for an AI system to
“learn.”
• A simple example of estimators and estimation in practice is the so-
called “German Tank Problem” from World War Two. The Allies had
no way to know for sure how many tanks the Germans were building
every month. By counting the serial numbers of captured or
destroyed tanks (the estimand), Allied statisticians created an
estimator rule. This equation calculated the maximum possible
number of tanks based upon the sequential serial numbers, and applied
minimum-variance analysis to generate the most likely estimate for
how many new tanks Germany was building.
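The German Tank Problem estimator described above has a well-known closed form: the minimum-variance unbiased estimate is N̂ = m + m/k − 1, where m is the largest serial number observed and k is the number of observations. The serial numbers below are invented; the annotation syntax needs Python 3.9+:

```python
def estimate_total(serials: list[int]) -> float:
    """Minimum-variance unbiased estimate of the population maximum
    (total number of tanks) from a sample of observed serial numbers."""
    k = len(serials)   # how many tanks were captured or destroyed
    m = max(serials)   # largest serial number seen
    return m + m / k - 1

captured = [19, 40, 42, 60]      # hypothetical captured-tank serial numbers
print(estimate_total(captured))  # estimated total production
```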
Types of Estimators
Estimators come in two broad categories—point and interval. Point equations
generate single value results, such as standard deviation, that can be plugged into
a deep learning algorithm’s classifier functions. Interval equations generate a
range of likely values, such as a confidence interval, for analysis.
In addition, each estimator rule can be tailored to generate different types of
estimates:
• Biased - Either an overestimate or an underestimate.
• Efficient - Has the smallest possible variance; the smallest-variance estimator is
referred to as the “best” estimator.
• Invariant: Less flexible estimates that aren’t easily changed by data
transformations.
• Shrinkage: An unprocessed estimate that’s combined with other variables to
create complex estimates.
• Sufficient: Estimating the total population’s parameter from a limited dataset.
• Unbiased: An exact-match estimate value that neither underestimates nor
overestimates.
Properties of Good Estimators
A distinction is made between an estimate and an estimator.
The numerical value of the sample mean is said to be an
estimate of the population mean figure. On the other hand,
the statistical measure used, that is, the method of estimation
is referred to as an estimator. A good estimator, as common
sense dictates, is close to the parameter being estimated. Its
quality is to be evaluated in terms of the following properties:
• Unbiasedness
• Efficient
• Consistent
• Sufficient
1. Unbiasedness.
An estimator is said to be unbiased if its expected value is identical with the population
parameter being estimated. That is, if θ̂ is an unbiased estimator of θ, then we must have
E(θ̂) = θ. Many estimators are “asymptotically unbiased” in the sense that the bias reduces to a
practically insignificant value (zero) when n becomes sufficiently large. The estimator s² is an
example.
It should be noted that bias in estimation is not necessarily undesirable. It may turn out to be
an asset in some situations.
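Unbiasedness can be checked by simulation: the sample variance with divisor n − 1 is unbiased for the population variance, while the divisor-n version systematically underestimates it by the factor (n − 1)/n. The population (Normal with variance 4), sample size, and seed are arbitrary choices:

```python
import random
import statistics

random.seed(7)
population_var = 4.0   # true variance of Normal(mu=0, sigma=2)

n = 5
biased, unbiased = [], []
for _ in range(20000):
    sample = [random.gauss(0, 2) for _ in range(n)]
    unbiased.append(statistics.variance(sample))   # divides by n - 1
    biased.append(statistics.pvariance(sample))    # divides by n

print(statistics.mean(unbiased))  # close to 4.0
print(statistics.mean(biased))    # close to 4.0 * (n - 1) / n = 3.2
```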

2. Consistency.
If an estimator, say θ̂, approaches the parameter θ closer and closer as the sample size n
increases, θ̂ is said to be a consistent estimator of θ. Stated somewhat more rigorously, the
estimator θ̂ is said to be a consistent estimator of θ if, as n approaches infinity, the probability
approaches 1 that θ̂ will differ from the parameter θ by no more than an arbitrary constant.

The sample mean is an unbiased estimator of µ no matter what form the population
distribution assumes, while the sample median is an unbiased estimate of µ only if the
population distribution is symmetrical. The sample mean is better than the sample median as
an estimate of µ in terms of both unbiasedness and consistency.
3. Efficiency.
The concept of efficiency refers to the sampling variability of an estimator. If two
competing estimators are both unbiased, the one with the smaller variance (for a given
sample size) is said to be relatively more efficient. Stated in a somewhat different
language, an estimator θ is said to be more efficient than another estimator θ 2 for θ if
the variance of the first is less than the variance of the second. The smaller the variance
of the estimator, the more concentrated is the distribution of the estimator around the
parameter being estimated and, therefore, the better this estimator is.

4. Sufficiency.
An estimator is said to be sufficient if it conveys as much information as possible about
the parameter that is contained in the sample. The significance of sufficiency lies in the
fact that if a sufficient estimator exists, it is absolutely unnecessary to consider any
other estimator; a sufficient estimator ensures that all the information a sample
can furnish with respect to the estimation of a parameter is being utilized.
Many methods have been devised for estimating parameters that may provide
estimators satisfying these properties. The two important methods are the least square
method and the method of maximum likelihood.
Estimate and Estimators
Let X be a random variable having distribution fX(x; θ),
where θ is an unknown parameter, and let X1, X2, …, Xn be
a random sample of size n taken on X.
The problem of point estimation is to pick a statistic
g(X1, X2, …, Xn) that best estimates the parameter θ.
Once observed, the numerical value g(x1, x2, …, xn) is
called an estimate, and the statistic g(X1, X2, …, Xn) is
called an estimator.
What are Point Estimators?
Point estimators are defined as the functions that are used to find an approximate
value of a population parameter from random samples of the population. They
take the help of the sample data of a population to determine a point estimate or a
statistic that serves as the best estimate of an unknown parameter of a population.
Sample Mean and Sample Variance
Two important statistics:
Let X1, X2, X3, …, Xn be a random sample, then:
The sample mean is denoted by x̄, and the sample variance is
denoted by s².
Here x̄ and s² are called sample statistics.
The population parameters are denoted by:
σ² = population variance, and µ = population mean

Fig. Population and Sample Mean: µ = Σx/N (population), x̄ = Σx/n (sample).

Fig. Population and Sample Variance: σ² = Σ(x − µ)²/N (population), s² = Σ(x − x̄)²/(n − 1) (sample).
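The contrast between population parameters and sample statistics can be sketched with a made-up finite population and a simple random sample drawn from it; the population (integers 1..100), sample size, and seed are arbitrary:

```python
import random
import statistics

random.seed(1)
population = list(range(1, 101))           # made-up population: 1..100

mu = statistics.mean(population)           # population mean µ
sigma2 = statistics.pvariance(population)  # population variance σ² (divide by N)

sample = random.sample(population, 10)     # simple random sample, n = 10
x_bar = statistics.mean(sample)            # sample mean x̄ (a statistic)
s2 = statistics.variance(sample)           # sample variance s² (divide by n - 1)

print(mu, sigma2)
print(x_bar, s2)
```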
Measures of Spread of Data
In the world of data science, some of the most important
decisions regarding analyses are made while performing
exploratory data analysis on data-sets. While
understanding the concepts of Mean, Median and Mode
help analysts get started with the basic structure of the
data set, these are just the measures of central
tendency and don’t provide an overview of the entire
data set. Understanding Range, Interquartile Range
(IQR), Standard Deviation and Variance help us to
understand how spread out our data are from one
another.
When we discuss measures of spread, we are considering
numeric values that are associated with how far our points
are from one another.
Common measures of spread include:
• Range
• Interquartile Range (IQR)
• Standard Deviation
• Variance
It is easiest to understand the spread of our data visually and
the most common visual for quantitative data is the
Histogram.
