Unit 2

Uploaded by

msmakkar.chief19

Course: Data Science

Course Code: ITAC008

Chapter 2: Statistics of Data Science
Chapter Index
S.No.  Reference  Particulars
1                 Learning Objectives
2      Topic 1    Measures of Central Tendency
3      Topic 2    Probability Theory
4      Topic 3    Statistical Inference
5      Topic 4    Sampling Theory
6      Topic 5    Hypothesis Testing
7      Topic 6    Regression Analysis
8                 Let’s Sum Up
Learning Objectives
• Describe the concept of probability theory
• Explain the meaning and types of statistical inference
• Discuss the importance of sampling theory
• Elucidate the meaning and importance of hypothesis testing
• Describe the concept and types of regression analysis
1. Measures of Central Tendency

• Arithmetic Mean: The mean of a variable represents its average value. For data given with frequencies, it can be calculated using the formula below:

\[ \bar{X} = \frac{\sum_{i} f_i x_i}{\sum_{i} f_i} \]

where \bar{X} represents the sample mean, x_i represents the ith observation of the variable and f_i represents the frequency of the ith observation. The mean is a computed (hypothetical) value of the variable. It may or may not exist in the dataset.
• Median: The median is called the positional average of a variable. When we arrange the observations of a variable in ascending or descending order, the middle value of the series of observations is called the median. The median divides the observations into two equal halves: half of the observations of the variable are lower than the median value and the other half are higher than the median value. Quartiles, deciles and percentiles are extensions of the median.
• Mode: The mode of a variable is the observation with the highest frequency or the highest concentration of frequencies.
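The three measures can be illustrated with a small sketch using Python's standard `statistics` module. The score and frequency values are hypothetical, and `grouped_mean` is an illustrative helper name, not something from the slides.

```python
from statistics import median, multimode

def grouped_mean(values, freqs):
    """Frequency-weighted arithmetic mean: sum(f_i * x_i) / sum(f_i)."""
    total = sum(freqs)
    if total == 0:
        raise ValueError("frequencies must sum to a positive number")
    return sum(f * x for x, f in zip(values, freqs)) / total

# Hypothetical data: exam scores and how many students obtained each score.
scores = [50, 60, 70, 80]
counts = [2, 3, 4, 1]

# Expand the frequency table into raw observations for median and mode.
observations = [x for x, f in zip(scores, counts) for _ in range(f)]

mean_score = grouped_mean(scores, counts)   # 640 / 10
median_score = median(observations)         # average of the two middle values
modes = multimode(observations)             # list of most frequent values

print(mean_score, median_score, modes)
```

Here the mean (64.0) and the median (65.0) are values that do not occur in the data, whereas the mode (70) always does.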
2. Probability Theory

• Probability theory is a branch of mathematics that is concerned with chance or probability. It expresses its concepts in the form of axioms, which are formalised in terms of a probability space. The set of all possible outcomes is called the sample space, and any subset of the sample space is called an event. The probability space assigns to each event a probability, which may take any value between 0 and 1.
• Probability theory involves the use of discrete and continuous random variables and probability distributions. The distributions provide mathematical abstractions of non-deterministic or uncertain processes or measured quantities, which may occur as a single occurrence or over time.
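As a minimal sketch of these terms, assume the experiment is rolling two fair dice (an example chosen here, not taken from the slides). The sample space is the set of all 36 outcomes, an event is any subset of it, and with equally likely outcomes the probability of an event is the share of outcomes it contains.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of rolling two fair six-sided dice.
sample_space = list(product(range(1, 7), repeat=2))

def probability(event):
    """P(event) for equally likely outcomes: |event| / |sample space|."""
    favourable = [o for o in sample_space if event(o)]
    return Fraction(len(favourable), len(sample_space))

p_sum_seven = probability(lambda o: o[0] + o[1] == 7)    # 6 of 36 outcomes
p_certain = probability(lambda o: True)                   # whole sample space
p_impossible = probability(lambda o: o[0] + o[1] == 13)   # empty event

print(p_sum_seven, p_certain, p_impossible)
```

Every probability computed this way lies between 0 (the empty event) and 1 (the whole sample space).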
Continuous Probability Distributions

• A continuous random variable is a random variable having an infinite and uncountable range. If the random variable is continuous, its probability distribution is called a continuous probability distribution.
• A continuous distribution describes how probability is spread over the possible values of a continuous random variable.
• A continuous probability distribution can be described using an equation called the Probability Density Function (PDF). The area under the PDF curve over an interval gives the probability that the random variable falls within that interval.
• The probability that a continuous random variable takes any single exact value is zero.
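The link between area and probability can be checked numerically. The sketch below assumes a standard normal variable and uses `statistics.NormalDist` (Python 3.8+); `area_under_pdf` is an illustrative helper that applies the trapezoidal rule.

```python
from statistics import NormalDist

def area_under_pdf(dist, a, b, steps=10_000):
    """Approximate P(a <= X <= b) with the trapezoidal rule on dist.pdf."""
    width = (b - a) / steps
    total = 0.5 * (dist.pdf(a) + dist.pdf(b))
    for i in range(1, steps):
        total += dist.pdf(a + i * width)
    return total * width

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Area under the PDF between -1 and 1, computed two ways.
p_numeric = area_under_pdf(standard_normal, -1.0, 1.0)
p_exact = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)

# An interval of zero width has zero area, so a single value has probability 0.
p_single_point = area_under_pdf(standard_normal, 0.5, 0.5, steps=1)

print(round(p_numeric, 4), round(p_exact, 4), p_single_point)
```

Both approaches give roughly 0.68 for the interval from -1 to 1, while the single point 0.5 has probability 0 even though the density there is positive.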
Discrete Probability Distributions

• A discrete random variable takes its values from a countable set of outcomes of a random event. Usually, a discrete random variable is denoted as X and its probability distribution is denoted as P(X).
• Some of the most common discrete probability distributions used in statistics include the binomial distribution, geometric distribution, hypergeometric distribution, multinomial distribution, negative binomial distribution, and Poisson distribution.
• Discrete probability distributions can be described using frequency distribution tables, graphs or charts.
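A discrete distribution can be written out as a table of P(X = k). The sketch below assumes a binomial variable counting heads in four fair coin tosses (hypothetical parameters) and uses the standard formula C(n, k) p^k (1 - p)^(n - k).

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 4, 0.5  # hypothetical: four fair coin tosses, X = number of heads
table = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

for k, prob in table.items():
    print(f"P(X = {k}) = {prob:.4f}")

total_probability = sum(table.values())
```

The probabilities over all possible values of X add up to 1, which is what makes the table a probability distribution.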
Classical Probability Distributions

There are four types of Classical Probability Distributions:
• Bernoulli Distribution: A Bernoulli distribution has only one trial and only two possible outcomes, namely 1 (success) and 0 (failure).
• Uniform Distribution: In a uniform distribution, there may be any number of outcomes, and every outcome is equally likely.
• Binomial Distribution: A binomial distribution describes the number of successes in a fixed number of trials, wherein only two outcomes are possible in every trial, the probability of success is the same in every trial, and the trials' results are independent of each other.
• Normal Distribution: A normal distribution has a bell-shaped, symmetrical curve. This distribution occurs naturally in many situations.
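The four distributions can be explored by simulation with the standard `random` module. This is only a sketch: the seed, the parameters (n = 10, p = 0.3) and the helper names `bernoulli` and `binomial` are assumptions made for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

def bernoulli(p):
    """One trial: 1 (success) with probability p, else 0 (failure)."""
    return 1 if rng.random() < p else 0

def binomial(n, p):
    """Number of successes in n independent Bernoulli(p) trials."""
    return sum(bernoulli(p) for _ in range(n))

n, p, draws = 10, 0.3, 10_000

binomial_draws = [binomial(n, p) for _ in range(draws)]
binomial_mean = sum(binomial_draws) / draws      # should be near n * p

die_rolls = [rng.randint(1, 6) for _ in range(draws)]    # discrete uniform

normal_draws = [rng.gauss(0.0, 1.0) for _ in range(draws)]
normal_mean = sum(normal_draws) / draws          # should be near 0

print(binomial_mean, sorted(set(die_rolls)), normal_mean)
```

Building the binomial draw as a sum of Bernoulli trials makes the relationship between the two distributions explicit.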
3. Statistical Inference

• Statistical inference refers to the process by which inferences about a population are made on the basis of certain statistics calculated from a sample of data drawn from that population. In other words, statistical inference refers to the use of probability theory to make inferences about a population from the sample data.
• Assume that we want to estimate the average life expectancy of males living in Tamil Nadu, India, or the percentage of the public that is satisfied with the work done by the current government. To know the actual results, we would need data from each person in the population, which we cannot obtain. Therefore, we obtain the data from a part of the population called a sample. The data obtained from the sample are analyzed to draw inferences about the population.
In inferential statistics, the experimenter tries to achieve three goals as follows:
• Parameter estimation: Parameters are the unknown constants in a probability distribution that determine the properties of the distribution. They are estimated from the sample data.
• Data prediction: After the parameters have been estimated for a particular distribution, they can be used to predict future data.
• Model comparison: After the data has been predicted for an entire population, the experimenter selects, from two or more models, the one model which best explains the observed data.
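Parameter estimation, the first of these goals, can be sketched as follows. The sample values are hypothetical, and the interval uses a normal approximation; for a sample this small a t-based interval would usually be preferred, so treat the numbers as illustrative only.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Point estimate and normal-approximation interval for a population mean."""
    n = len(sample)
    estimate = mean(sample)
    standard_error = stdev(sample) / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    margin = z * standard_error
    return estimate, (estimate - margin, estimate + margin)

# Hypothetical sample of ages at death in years; not real survey data.
sample = [68, 72, 75, 70, 66, 74, 71, 69, 73, 72]

estimate, (low, high) = mean_confidence_interval(sample)
print(estimate, round(low, 2), round(high, 2))
```

The sample mean is the point estimate of the unknown population mean, and the interval expresses the uncertainty that comes from observing only a sample.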
4. Sampling Theory

The practice of collecting samples and analyzing them to derive some useful
information is called sampling theory. Some important concepts related to sampling
theory are as follows:
• Data: Data refers to the entire set of observations that have been collected.
• Population: An entire group of subjects or objects that are to be studied and
analysed is called population.
• Sample: A sample is a portion or sub-collection of elements that are examined in
order to estimate the characteristics of a population.
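• Parameter: A parameter refers to a characteristic of the population, such as the population mean. It is usually unknown and is estimated from the corresponding characteristic of a sample, which is called a statistic.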
• Statistics: It is a branch of mathematics that deals with planning and conducting
experiments, obtaining data, and organising, summarising, presenting, analysing,
interpreting and drawing conclusions based on data.
Sampling Frame
• A sampling frame is the complete list of all the items (everyone and everything) from which the sample is drawn. At first, it would appear that a sampling frame is the same as the population. However, the population is general, whereas the sampling frame is specific.
• For example, we may define the population as all the Indian Americans living in Texas, USA, whereas an actual list of the Indian Americans living in Texas, USA would be considered the sampling frame. The two can differ, because not all the Indian Americans living in Texas, USA will necessarily appear on the list so provided.
• In statistical research, the experimenters require a list of items in order to draw a sample from it. It must be ensured that the sampling frame is adequate for the needs of the experimenter.
Sampling Methods
• In statistics, there are various sampling methods. Sampling methods are divided into two categories, namely probability sampling and non-probability sampling.
• Probability sampling is the one wherein every sampling unit has a known probability of being selected. We can therefore determine the probability that each sample will be selected. In addition, we can also determine which sampling units belong to which sample.
• In non-probability sampling, the probability of a unit being selected is not known.
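Simple random sampling is one common form of probability sampling. The sketch below assumes a hypothetical sampling frame of 100 unit identifiers and draws 10 of them without replacement, so every unit has the same known inclusion probability of n / N.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units; each has inclusion probability n / len(frame)."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: a list of 100 unit identifiers.
sampling_frame = [f"unit-{i:03d}" for i in range(1, 101)]

sample = simple_random_sample(sampling_frame, n=10, seed=7)
inclusion_probability = len(sample) / len(sampling_frame)   # 10 / 100

print(sample)
print(inclusion_probability)
```

The seed is fixed only to make the sketch reproducible; in practice the draw would not be seeded with a constant.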
Sampling Errors
• Errors that are involved in sampling are shown in the following figure:
[Figure: errors involved in sampling (not reproduced)]
5. Hypothesis Testing
• A hypothesis is a statement or a proposed explanation about one or more populations. A hypothesis statement is usually associated with the population parameters. A hypothesis can be tested using a research method.
• In hypothesis testing, there are two types of hypotheses, namely the null hypothesis and the alternate hypothesis. The null hypothesis (H0) is the hypothesis to be tested. The alternate hypothesis (HA) is the hypothesis that is accepted if the sample data lead to rejection of H0.
• Hypothesis testing, also called significance testing, is a method which is used to test a hypothesis regarding the population parameters using the data collected from a sample. Alternatively, we can say that hypothesis testing is a method of evaluating samples to learn about the characteristics of a given population.
Four Steps to Hypothesis Testing

• The process of hypothesis testing consists of four steps as follows:

Step 1: Identify the hypothesis to be tested.
Step 2: Set the criterion upon which the hypothesis would be tested.
Step 3: Select a random sample from the population and measure the sample mean (compute the test statistic).
Step 4: Make a decision: compare the observed value of the sample to what we expect to observe if the claim we are testing is true.
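The four steps can be traced in a one-sample z-test. This is a sketch under stated assumptions: the data are hypothetical, the population standard deviation is assumed to be known (sigma = 5), the test is two-sided, and the criterion is a significance level of 0.05.

```python
from math import sqrt
from statistics import NormalDist, mean

def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: population mean == mu0, with known sigma."""
    n = len(sample)
    # Step 3: compute the test statistic from the sample mean.
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Step 4: compare with what is expected if H0 is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha

# Step 1: H0: mean = 50, HA: mean != 50 (hypothetical claim).
# Step 2: criterion alpha = 0.05.
sample = [52, 55, 49, 58, 54, 53, 57, 51, 56, 55]

z, p_value, reject_h0 = one_sample_z_test(sample, mu0=50, sigma=5)
print(round(z, 3), round(p_value, 4), reject_h0)
```

With these numbers the sample mean is 54, the p-value falls below 0.05, and H0 is rejected in favour of HA.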
Analysis of Variance (ANOVA)

• The independent-sample t-test can be applied to situations where there are only two independent samples. In other words, we can use independent-sample t-tests for comparing the means of two populations (such as males and females). When we have more than two independent samples, the t-test is inappropriate. The Analysis of Variance (ANOVA) has an advantage over the t-test when the researcher wants to compare the means of a larger number of populations (i.e., three or more).
• ANOVA is a parametric test that is used to study the difference among the means of more than two groups in a dataset. It helps in explaining how much of the variation in the dataset lies between the groups and how much lies within them.
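The split of variation into between-group and within-group parts can be sketched as a one-way ANOVA F statistic. The three groups are hypothetical, and the sketch stops at the F statistic; turning it into a p-value would need the F distribution, which the Python standard library does not provide.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical measurements for three independent groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

f_statistic = one_way_anova_f(groups)
print(f_statistic)
```

A large F statistic means the group means differ by more than the spread within the groups would suggest.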
6. Regression Analysis

• Regression analysis is a statistical method that is used to model a relationship between two or more variables of interest.
• Regression analysis is usually used to model a relationship between a
response variable (dependent variable) and one or more predictor
(independent) variables.
• There are various types of regression. However, the basic function of
these regression models is to examine the influence of one or more
independent variables on a dependent variable.
• Regression analysis helps in identifying which variables have an impact
on a variable of interest.
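The simplest case, one predictor and one response, can be sketched with ordinary least squares. The data (hours studied and exam scores) are hypothetical, and `fit_line` is an illustrative helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data: hours studied (predictor) and exam score (response).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_line(hours, scores)
predicted = intercept + slope * 6   # predicted score for 6 hours of study

print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

The slope estimates how much the response changes for a one-unit change in the predictor, which is the "influence" the bullets above refer to.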
Types of Regression Techniques

Seven important types of regression techniques include:
• Linear Regression
• Logistic Regression
• Polynomial Regression
• Ordinal Regression
• Ridge Regression
• Principal Components Regression (PCR)
• Partial Least Squares (PLS) Regression
Let’s Sum Up

• Probability theory is a branch of mathematics that is concerned with chance or probability.
• Probability theory involves the use of discrete and continuous random variables and probability distributions.
• Statistical inference describes the use of probability theory to make inferences about a population from the sample data.
• The practice of collecting samples and analyzing them to derive some useful information is called sampling theory.
• Hypothesis testing is a method of evaluating samples to learn about the characteristics of a given population.
• Regression analysis is a statistical method which is used to model a relationship between two or more variables of interest.
THANK YOU
