
Basic Stat

Statistics is the study of collecting, analyzing, and interpreting data. It involves organizing, presenting, and drawing conclusions from data. There are two main types of statistics - descriptive statistics, which summarize data through measures like mean, median and mode, and inferential statistics, which make inferences about populations based on samples using hypothesis testing and regression analysis. Descriptive statistics are used to quantitatively describe known data through graphs, charts and tables, while measuring central tendency and dispersion. Measures of central tendency indicate a single central point of data like the mean, median or mode. Measures of dispersion describe how data is spread out, through ranges, variances and standard deviations.

Uploaded by

jatin

STATISTICS
• Statistics is a branch of mathematics that deals with numerical data and its analysis.

• Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.

• Statistics draws conclusions from data and uses them to forecast possibilities.

• Statistics deals with collecting, classifying, arranging, and presenting numerical data.
DATA TYPES
• Descriptive Statistics
➢ Measures of Central Tendency: Mean, Median, Mode
➢ Measures of Dispersion: Range, Standard Deviation, Variance, Absolute Deviation
• Inferential Statistics
➢ Hypothesis Testing: Z Test, F Test, T Test
➢ Regression Analysis: Linear Regression
DESCRIPTIVE STATISTICS
• Used to quantitatively describe the attributes of known data.

• Provide summaries of either the sample or the population.

• Graphs, charts, and tables are used to represent descriptive statistics.

• The measures of descriptive statistics are as follows:

➢ Measures of Central Tendency

➢ Measures of Dispersion
MEASURE OF CENTRAL TENDENCY

• These measures are used to describe data with respect to a single central point.

• Mean, median, and mode are the three measures of central tendency.


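MEAN
• The mean is the average of the given numbers, calculated by dividing the sum of the numbers by how many numbers there are.
• Formula-
$\text{Mean} = \dfrac{\text{sum of all observations}}{\text{number of observations}} = \dfrac{\sum_{i=1}^{n} x_i}{n}$

• Example-
Percentage obtained = 88, 91, 99, 95, 85, 90

Average Percentage = $\dfrac{88 + 91 + 99 + 95 + 85 + 90}{6} = \dfrac{548}{6} \approx 91.33$

Mean = Average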
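
The mean calculation can be checked with a short Python sketch. The variable names are illustrative and not part of the original slides.

```python
# Percentages from the example above.
percentages = [88, 91, 99, 95, 85, 90]

# Mean = sum of the observations divided by the number of observations.
mean = sum(percentages) / len(percentages)

print(round(mean, 2))
```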
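MEDIAN
• It is defined as the middle value of a given set of numbers or data arranged in order.
• Formula-
Median = $\left(\dfrac{n+1}{2}\right)^{\text{th}}$ term, when the number of observations $n$ is odd. When $n$ is even, the median is the average of the two middle terms.
• Example-
List of numbers: 4, 10, 7, 15, 2
➢ First arrange them in ascending order.

Numbers: 2, 4, 7, 10, 15
Median = $\left(\dfrac{5+1}{2}\right)^{\text{th}}$ term
Median = 3rd value

➢ Median is 7.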
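
A minimal sketch of the same steps in Python, assuming the usual convention that an even-sized list takes the average of its two middle terms. The function name `median` is chosen here for illustration.

```python
def median(values):
    """Return the middle value of the data after sorting it."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th term (1-based) sits at index n // 2.
        return ordered[middle]
    # Even n: average of the two middle terms.
    return (ordered[middle - 1] + ordered[middle]) / 2


numbers = [4, 10, 7, 15, 2]
print(median(numbers))  # 7
```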
MODE
• The mode is the value that occurs most frequently in a given set.
• Example: In the data set 2, 4, 5, 5, 6, 7, the mode is 5, as it appears twice while every other value appears once.

Bimodal, Trimodal & Multimodal (More than one mode)

• When there are two modes in a data set, the set is called bimodal.
For example, the modes of Set A = {2,2,2,3,4,4,5,5,5} are 2 and 5, because both 2 and 5 are repeated three times in the given set.

• When there are three modes in a data set, the set is called trimodal.
For example, the modes of Set A = {2,2,2,3,4,4,5,5,5,7,8,8,8} are 2, 5 and 8.

• When there are four or more modes in a data set, the set is called multimodal.
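
The three example sets can be classified with Python's standard `statistics.multimode`. This is a sketch: the helper `describe_modes` and the label "unimodal" for a single mode are assumptions added for illustration.

```python
from statistics import multimode


def describe_modes(values):
    """Return the list of modes and a label for how many modes there are."""
    modes = multimode(values)  # every value tied for the highest count
    labels = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return modes, labels.get(len(modes), "multimodal")


print(describe_modes([2, 4, 5, 5, 6, 7]))
print(describe_modes([2, 2, 2, 3, 4, 4, 5, 5, 5]))
print(describe_modes([2, 2, 2, 3, 4, 4, 5, 5, 5, 7, 8, 8, 8]))
```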
MEASURE OF DISPERSION
• Measures of dispersion help to describe the variability in data.
• It is a statistical term that can be used to describe the extent to which data is scattered.
• The five most commonly used measures of dispersion are range, variance, standard deviation, mean deviation, and quartile deviation.
ABSOLUTE MEASURES OF DISPERSION
• They are used when the dispersion of data within a single experiment has to be determined.
• These measures usually express the variation in a data set in terms of the average of the deviations of the observations.
• The most commonly used absolute measures of dispersion are listed below-
➢ Range: Given a data set, the range can be defined as the difference between the maximum value and the minimum value.
➢ Variance: The average squared deviation from the mean of the given data set is known as the variance. This measure of dispersion checks the spread of the data about the mean.
➢ Standard Deviation: The square root of the variance gives the standard deviation. Thus, the standard deviation also measures the variation of the data about the mean.
➢ Mean Deviation: It gives the average of the data's absolute deviations about a central point. This central point could be the mean, median, or mode.
➢ Quartile Deviation: It can be defined as half of the difference between the third quartile and the first quartile in a given data set.
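
The five absolute measures can be sketched with the standard `statistics` module. Two assumptions are made here: the variance divides by n (the "average squared deviation" described above), and the mean deviation is taken about the mean. The quartiles use Python's default method, and other tools may place them slightly differently.

```python
from statistics import mean, pstdev, pvariance, quantiles

data = [2, 4, 7, 10, 15]  # numbers reused from the median example

data_range = max(data) - min(data)
variance = pvariance(data)        # average squared deviation from the mean
std_dev = pstdev(data)            # square root of the variance
center = mean(data)               # central point chosen for the mean deviation
mean_deviation = sum(abs(x - center) for x in data) / len(data)
q1, _, q3 = quantiles(data, n=4)  # first and third quartiles
quartile_deviation = (q3 - q1) / 2

print("Range:", data_range)
print("Variance:", variance)
print("Standard deviation:", std_dev)
print("Mean deviation:", mean_deviation)
print("Quartile deviation:", quartile_deviation)
```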
RELATIVE MEASURES OF DISPERSION
• If separate data sets have different units and need to be compared, then relative measures of dispersion are used.
• The measures are expressed in the form of ratios and percentages, thus making them unitless.
• Some of the relative measures of dispersion are given below:
➢ Coefficient of Range: It is the ratio of the difference between the highest and lowest value in a data set to the sum of the highest and lowest value.
➢ Coefficient of Variation: It is the ratio of the standard deviation to the mean of the data set. It is expressed in the form of a percentage.
➢ Coefficient of Mean Deviation: This can be defined as the ratio of the mean deviation to the value of the central point from which it is calculated.
➢ Coefficient of Quartile Deviation: It is the ratio of the difference between the third quartile and the first quartile to the sum of the third and first quartiles.
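
A short sketch of the four coefficients. It assumes the mean as the central point and the population standard deviation. The variable names are illustrative, and the quartiles again follow Python's default method.

```python
from statistics import mean, pstdev, quantiles

data = [2, 4, 7, 10, 15]  # numbers reused from the median example
highest, lowest = max(data), min(data)
center = mean(data)
mean_deviation = sum(abs(x - center) for x in data) / len(data)
q1, _, q3 = quantiles(data, n=4)

coefficient_of_range = (highest - lowest) / (highest + lowest)
coefficient_of_variation = pstdev(data) / center * 100  # as a percentage
coefficient_of_mean_deviation = mean_deviation / center
coefficient_of_quartile_deviation = (q3 - q1) / (q3 + q1)

print(coefficient_of_range)
print(coefficient_of_variation)
print(coefficient_of_mean_deviation)
print(coefficient_of_quartile_deviation)
```

All four results are plain ratios, so they stay the same if every value in `data` is converted to another unit.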
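Measures of Dispersion Formula
• Range = maximum value − minimum value
• Variance: $\sigma^2 = \dfrac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$
• Standard Deviation: $\sigma = \sqrt{\sigma^2}$
• Mean Deviation = $\dfrac{\sum_{i=1}^{n} |x_i - c|}{n}$, where $c$ is the chosen central point (mean, median, or mode)
• Quartile Deviation = $\dfrac{Q_3 - Q_1}{2}$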
GRAPHICAL REPRESENTATION
• A graph is a kind of chart in which data are plotted as variables against coordinate axes.
• A graphical representation is a visual representation of data and statistics-based results using graphs, plots, and charts.
• This kind of representation is more effective for understanding and comparing data than a tabular form.
• Rules of Graphical Representation of Data-
➢ Suitable Title: The title of the graph should be appropriate and indicate the subject of the presentation.
➢ Measurement Unit: The measurement unit in the graph should be mentioned.
➢ Proper Scale: A proper scale needs to be chosen to represent the data accurately.
➢ Index: For better understanding, provide an index of the colors, shades, lines, and designs used in the graph.
➢ Data Sources: The source of the data should be included, wherever necessary, at the bottom of the graph.
➢ Simple: The construction of the graph should be easily understood.
➢ Neat: The graph should be visually neat in terms of size and font so that the data can be read accurately.
TYPES OF GRAPHICAL REPRESENTATION
Data Representation Description
Bar Graph
• A bar graph represents a group of data with rectangular bars whose lengths are proportional to the values.

• The bars can be plotted either vertically or horizontally.

Pie Chart
• The pie chart is a type of graph in which a circle is divided into sectors, where each sector represents a proportion of the whole.

• Two main formulas used in pie charts are:

o To calculate the percentage of the given data, we use the formula: (Frequency ÷ Total Frequency) × 100

o To convert the data into degrees, we use the formula: (Given Data ÷ Total Value of Data) × 360°
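
The two pie chart formulas can be applied as in the sketch below. The category names and frequencies are hypothetical values invented for illustration.

```python
# Hypothetical category frequencies, invented for illustration.
frequencies = {"A": 10, "B": 20, "C": 30, "D": 40}
total = sum(frequencies.values())

# (Frequency / Total Frequency) x 100
percentages = {name: freq / total * 100 for name, freq in frequencies.items()}
# (Given Data / Total Value of Data) x 360 degrees
angles = {name: freq / total * 360 for name, freq in frequencies.items()}

for name in frequencies:
    print(f"{name}: {percentages[name]:.1f}% -> {angles[name]:.1f} degrees")
```

The percentages add up to 100 and the sector angles add up to 360°, which is a quick check that no category was left out.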
Line graph

The line graph represents data as a series of points connected by straight line segments. These points are called markers.

Pictograph

A pictograph shows data in the form of pictures. Each pictorial symbol stands for a word, object, or phrase and can represent a different number of units.

Histogram

The histogram is a type of graph that consists of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval.
Frequency Distribution

• The frequency distribution table in statistics lists the data values in ascending order along with their corresponding frequencies.

• The frequency of the data is often represented by f.

Stem and Leaf Plot

The stem and leaf plot is a way to represent quantitative data according to frequency ranges or frequency distribution. It is a graph that shows numerical data arranged in order. Each data value is broken into a stem and a leaf.

Scatter Plot

A scatter plot is a graphical representation that uses Cartesian coordinates to display the values of two variables. The plot shows the relationship between the two variables.
SKEWNESS & KURTOSIS
Skewness
• A measure of asymmetry in the distribution.
• Mathematically it is given by $E\left[\left(\dfrac{x-\mu}{\sigma}\right)^3\right]$
• Negative skewness implies that the mass of the distribution is concentrated on the right.

Kurtosis
• A measure of peakedness of the distribution.
• Mathematically it is given by $E\left[\left(\dfrac{x-\mu}{\sigma}\right)^4\right] - 3$
• For a symmetric distribution, negative kurtosis implies a wider peak and thinner tails.
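
Both quantities can be estimated from data by averaging powers of the standardized values. This sketch assumes population moments (dividing by n). The function name and the two small data sets are illustrative only.

```python
from statistics import mean, pstdev


def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) from standardized moments."""
    mu = mean(values)
    sigma = pstdev(values)
    n = len(values)
    standardized = [(x - mu) / sigma for x in values]
    skewness = sum(z ** 3 for z in standardized) / n
    kurtosis = sum(z ** 4 for z in standardized) / n - 3
    return skewness, kurtosis


# Symmetric data: skewness is zero.
print(skewness_and_kurtosis([1, 2, 3, 4, 5]))
# Mass concentrated on the right with a long left tail: negative skewness.
print(skewness_and_kurtosis([1, 8, 9, 10, 10]))
```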
BOX PLOT
• A box and whisker plot is a method of summarizing a set of data that is measured using an interval scale.
• Box plots are widely used for data analysis.
• A box plot is a chart that shows data from a five-number summary (minimum, first quartile, median, third quartile, and maximum), which includes one of the measures of central tendency, the median.
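
The five-number summary behind a box plot can be computed as in the sketch below. It reuses the percentages from the mean example. The quartiles follow the default method of `statistics.quantiles`, which is one of several conventions in use.

```python
from statistics import median, quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)


percentages = [88, 91, 99, 95, 85, 90]
summary = five_number_summary(percentages)
print(summary)
```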
RANDOM VARIABLE

• A random variable is a variable that can take on many values.

• A random variable can be defined as a variable whose value depends on the numerical outcome of a random phenomenon.
• It is also known as a stochastic variable.
• Random variables are real-valued, as they are required to be measurable.

Discrete Random Variables:
Random variables that can assume a countable number of values. If a random variable can only take a finite number of distinct values, it must be discrete.

Ex: the number of defective light bulbs in a box, the number of children in a family.

Continuous Random Variables:
Random variables that can assume any value corresponding to any of the points contained in one or more intervals. They are usually measurements. Things like heights, weights, and time are continuous random variables.

Ex: the time it takes to complete a race, or the length of time between arrivals at a hospital clinic.
PROBABILITY
• Probability is a concept used in math and science to measure the likelihood of occurrence of an event.
• For example, when a coin is tossed, there is a probability of getting a head or a tail.
• Probability can be defined as the ratio of the number of favorable outcomes to the total number of outcomes of an event.

• Probability is of three types-


1. Theoretical or Classical Probability
2. Experimental Probability
3. Axiomatic Probability
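THEORETICAL OR CLASSICAL PROBABILITY
• Theoretical probability is found by reasoning about the possible outcomes of an event, assuming they are equally likely, rather than by running an experiment.

• P(Event) = $\dfrac{\text{number of favorable outcomes}}{\text{total number of possible outcomes}}$

• For example, when we toss a coin, we get a head or a tail.

P(Head) = $\dfrac{1}{2}$

P(Tail) = $\dfrac{1}{2}$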
EXPERIMENTAL PROBABILITY
• Experimental probability is the number of times an event actually occurs relative to the number of times the experiment is repeated.

• P(Event) = $\dfrac{\text{number of times the event occurs}}{\text{total number of trials}}$

• For example, if a coin is tossed 8 times, and heads occurs 3 times, then

P(Head) = $\dfrac{3}{8}$

P(Tail) = $\dfrac{5}{8}$
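
The coin example, together with a simulated experiment, can be sketched as follows. The seed value and the number of simulated tosses are arbitrary choices made for illustration. With many tosses of a fair coin, the experimental probability tends to approach the theoretical value of 1/2.

```python
import random
from fractions import Fraction


def experimental_probability(times_event_occurred, total_trials):
    """Number of times the event occurred divided by the number of trials."""
    return Fraction(times_event_occurred, total_trials)


# The example above: 8 tosses, 3 heads.
p_head = experimental_probability(3, 8)
p_tail = experimental_probability(8 - 3, 8)
print(p_head, p_tail)

# A simulated experiment with a fair coin.
rng = random.Random(0)  # fixed seed so the run is repeatable
tosses = [rng.choice("HT") for _ in range(10_000)]
simulated_p_head = tosses.count("H") / len(tosses)
print(simulated_p_head)
```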
AXIOMATIC PROBABILITY
• Axiomatic probability is one more way to describe the outcomes of an event.

• There are three rules, or axioms, which apply to all types of probability.

• These rules were defined by Kolmogorov and are called Kolmogorov's axioms.

• The three axioms are as follows:

➢ For any event, the probability is greater than or equal to 0.

➢ The probability of the sample space, the set of all possible outcomes, is 1.

➢ If A and B are two mutually exclusive events (two events that cannot occur at the same time), then the probability of A or B occurring is the probability of A plus the probability of B.
DISCRETE PROBABILITY DISTRIBUTION
• A discrete probability distribution is a type of probability distribution that shows all possible values of a discrete random variable along with the associated probabilities.

• In other words, a discrete probability distribution gives the likelihood of occurrence of each possible value of a discrete random variable.

• Geometric distributions, binomial distributions, and Bernoulli distributions are some commonly used discrete probability distributions.

• Discrete and continuous probability distributions are the two types of probability distributions; they describe discrete and continuous random variables respectively.

• A probability distribution can be defined as a function that describes all possible values of a random variable as well as the associated probabilities.
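
As an illustration, the sketch below tabulates a binomial distribution, one of the discrete distributions named above. The choice of 3 tosses of a fair coin and the function name `binomial_pmf` are assumptions made for this example.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, p):
    """Map each possible number of successes k to P(X = k) for n trials."""
    return {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}


# Number of heads in 3 tosses of a fair coin.
pmf = binomial_pmf(3, Fraction(1, 2))
for k, probability in pmf.items():
    print(k, probability)

# The probabilities of all possible values add up to 1.
print(sum(pmf.values()))
```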
CONTINUOUS PROBABILITY DISTRIBUTION
• A continuous random variable can be defined as a variable that can take on infinitely many values within an interval.

• As the probability that a continuous random variable takes on any exact value is 0, we cannot use the probability mass function (pmf) to describe such a distribution.

• We use the probability density function (pdf) in place of the pmf.

• The formulas for the probability distribution of a continuous random variable are given below:
➢ Probability Distribution Function (the cumulative distribution function): F(x) = P (X ≤ x)

➢ Probability Density Function: f(x) = d/dx (F(x))


NORMAL PROBABILITY DISTRIBUTION

• A normal distribution is a type of continuous probability distribution.

• The mean and the variance are the two parameters required to describe such a distribution.

• If X is a random variable that follows a normal distribution, it is denoted as $X \sim N(\mu, \sigma^2)$.

• The probability distribution formulas are given below:

➢ Probability Distribution Function: $F(x) = P(X \le x) = \dfrac{1}{\sigma\sqrt{2\pi}} \displaystyle\int_{-\infty}^{x} e^{-\frac{(t-\mu)^2}{2\sigma^2}}\,dt$
➢ Probability Density Function: $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
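
The density formula can be compared against Python's standard `statistics.NormalDist`. The mean of 80 and standard deviation of 10 are illustrative values. Note that `NormalDist` takes the standard deviation as its second argument, not the variance.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


mu, sigma = 80, 10  # illustrative parameters
dist = NormalDist(mu, sigma)

x = 88
print(normal_pdf(x, mu, sigma), dist.pdf(x))  # the two densities agree
print(dist.cdf(x))                            # F(x) = P(X <= x)
```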
EXPECTED VALUE
• The expected value of a random variable X is denoted by E(X).

• It is also known as the mean, the average, or the first moment. For a discrete random variable, the expected value is the sum of the products of each possible outcome with its probability: $E(X) = \sum_i x_i \, P(x_i)$.

• The expected value is a generalization of the weighted average.
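
For a discrete random variable, the sum of outcome times probability can be sketched as follows. The fair six-sided die is an illustrative example that is not in the original slides.

```python
from fractions import Fraction


def expected_value(distribution):
    """Sum of each outcome times its probability.

    `distribution` maps each possible outcome to its probability.
    """
    return sum(outcome * probability for outcome, probability in distribution.items())


# A fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expected_value(die))  # 7/2
```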
SAMPLING FUNNEL
SAMPLING VARIATION
• Sample variance is used to measure the spread of the data points in a given data set around the mean.

• Sample variance is computed as the sum of the squared differences of the data points from the sample mean, divided by n − 1: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

• It is an absolute measure of dispersion and is used to check the deviation of data points with respect to the data's average.
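
A minimal sketch of the sample variance, assuming the usual n − 1 divisor, checked against the standard library's `statistics.variance`. The data reuses the percentages from the mean example.

```python
from statistics import variance


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    sample_mean = sum(values) / n
    return sum((x - sample_mean) ** 2 for x in values) / (n - 1)


percentages = [88, 91, 99, 95, 85, 90]
s_squared = sample_variance(percentages)
print(s_squared, variance(percentages))  # both approaches agree
```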
CENTRAL LIMIT THEOREM
• The central limit theorem states that, irrespective of a random variable's distribution, if large enough samples are drawn from the population, then the sampling distribution of the mean for that random variable will approximate a normal distribution.

• As a rule of thumb, the approximation is considered adequate for sample sizes greater than or equal to 30.

• In other words, as more large samples are taken, the graph of the sample means starts looking like a normal distribution.
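
The theorem can be illustrated with a small simulation. This sketch draws samples of size 30 from an exponential distribution, which is strongly skewed, and looks at the distribution of the sample means. The choice of distribution, the seed, and the number of samples are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import stdev

rng = random.Random(42)  # fixed seed so the run is repeatable
SAMPLE_SIZE = 30
NUM_SAMPLES = 5_000

# Each sample mean is the average of 30 draws from an exponential
# distribution whose mean and standard deviation are both 1.
sample_means = []
for _ in range(NUM_SAMPLES):
    sample = [rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    sample_means.append(sum(sample) / SAMPLE_SIZE)

mean_of_means = sum(sample_means) / NUM_SAMPLES
spread_of_means = stdev(sample_means)

print(mean_of_means)                          # close to the population mean, 1
print(spread_of_means, 1 / sqrt(SAMPLE_SIZE)) # close to sigma / sqrt(n)
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve, even though the individual draws are far from normal.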
CONFIDENCE INTERVAL
• A confidence interval gives a range of values within which the true value of the parameter is expected to lie, at a stated confidence level.

• The confidence level (in percentage) is selected by the investigator.

• The higher the confidence level, the wider the confidence interval (less precise).
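
The trade-off between confidence level and width can be seen in a short sketch. It assumes a normal-based interval for a mean with a known population standard deviation, a formula the slides do not state. The figures (mean 88, standard deviation 10, n = 36) are borrowed from the z test example in this document.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sigma, n, confidence):
    """Normal-based interval for a mean when sigma is known (an assumption)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


low_95, high_95 = mean_confidence_interval(88, 10, 36, 0.95)
low_99, high_99 = mean_confidence_interval(88, 10, 36, 0.99)

print(low_95, high_95)
print(low_99, high_99)  # wider than the 95% interval
```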
HYPOTHESIS TESTING
• Hypothesis testing can be defined as a statistical tool that is used to identify whether the results of an experiment are meaningful or not.

• It involves setting up a null hypothesis and an alternative hypothesis. These two hypotheses are always mutually exclusive.

• This means that if the null hypothesis is true, then the alternative hypothesis is false, and vice versa.

• An example of hypothesis testing is setting up a test to check whether a new medicine treats a disease more effectively.
HYPOTHESIS TESTING Z TEST
• A z test is a method of hypothesis testing that is used for a large sample size (n ≥ 30).

• It is used to determine whether there is a difference between the population mean and the sample mean when the population standard deviation is known.

• It can also be used to compare the means of two samples. In both cases, the decision is based on the z test statistic.

• The formulas are given as follows:

➢ One sample: $z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

➢ Two samples: $z = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$
EXAMPLE OF HYPOTHESIS Z TEST
Example: The average score on a test is 80 with a standard deviation of 10. With a new teaching curriculum introduced, it is believed that this score will change. On randomly testing the scores of 36 students, the mean was found to be 88. With a 0.05 significance level, is there any evidence to support this claim?

Solution: This is an example of two-tailed hypothesis testing. The z test will be used.
$H_0$: μ = 80, $H_1$: μ ≠ 80
$\bar{x}$ = 88, μ = 80, n = 36, σ = 10.
α / 2 = 0.05 / 2 = 0.025
The critical value using the normal distribution table is 1.96
$z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
$z = \dfrac{88 - 80}{10 / \sqrt{36}} = 4.8$
As 4.8 > 1.96, the null hypothesis is rejected.
Answer: There is a difference in the scores after the new curriculum was introduced.
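
The z test example can be reproduced with a short sketch, using n = 36 as in the solution. The function name is illustrative. The critical value is taken from the standard normal distribution through `statistics.NormalDist` rather than from a printed table.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, population_mean, sigma, n):
    """z = (sample mean - population mean) / (sigma / sqrt(n))."""
    return (sample_mean - population_mean) / (sigma / sqrt(n))


alpha = 0.05
z = one_sample_z(sample_mean=88, population_mean=80, sigma=10, n=36)

# Two-tailed test: alpha is split between both tails.
critical_value = NormalDist().inv_cdf(1 - alpha / 2)
reject_null = abs(z) > critical_value

print(z, critical_value, reject_null)
```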
HYPOTHESIS TESTING T TEST
• The t test is another method of hypothesis testing that is used for a small sample size (n < 30).

• It is also used to compare the sample mean and the population mean. However, the population standard deviation is not known.

• Instead, the sample standard deviation is known.

• The means of two samples can also be compared using the t test.

• One sample: $t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$
• Two samples: $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
EXAMPLES OF HYPOTHESIS T TEST
Example 1: The average weight of a dumbbell in a gym is 90lbs. However, a physical trainer believes that the average weight might be higher. A random sample of 5 dumbbells has an average weight of 110lbs and a standard deviation of 18lbs. Using hypothesis testing, check if the physical trainer's claim can be supported at a 95% confidence level.

Solution: As the sample size is less than 30, the t test is used.
$H_0$: μ = 90, $H_1$: μ > 90
$\bar{x}$ = 110, μ = 90, n = 5, s = 18.
α = 0.05

Using the t-distribution table with n − 1 = 4 degrees of freedom, the critical value is 2.132

$t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}} = \dfrac{110 - 90}{18 / \sqrt{5}}$
t = 2.484

As 2.484 > 2.132, the null hypothesis is rejected.

Answer: The average weight of the dumbbells may be greater than 90lbs.
Example 2: The average score of a class is 90. However, a teacher believes that the average score might be lower. The scores of 6 students were randomly measured. The mean was 82 with a standard deviation of 18. With a 0.05 significance level, use hypothesis testing to check if this claim is true.

Solution: The t test will be used.

$H_0$: μ = 90, $H_1$: μ < 90
$\bar{x}$ = 82, μ = 90, n = 6, s = 18
The critical value from the t table, with n − 1 = 5 degrees of freedom, is -2.015
$t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$

$t = \dfrac{82 - 90}{18 / \sqrt{6}}$

t = -1.088

As -1.088 > -2.015, we fail to reject the null hypothesis.

Answer: There is not enough evidence to support the claim.
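
Both t test examples can be recomputed with the sketch below. Example 2 uses the sample mean of 82 given in its problem statement. The critical values are copied from the examples, since Python's standard library has no t distribution. The function name is illustrative.

```python
from math import sqrt


def one_sample_t(sample_mean, population_mean, s, n):
    """t = (sample mean - population mean) / (s / sqrt(n))."""
    return (sample_mean - population_mean) / (s / sqrt(n))


# Example 1: right-tailed test, critical value 2.132 from the t table.
t1 = one_sample_t(sample_mean=110, population_mean=90, s=18, n=5)
reject_1 = t1 > 2.132

# Example 2: left-tailed test, critical value -2.015 from the t table.
t2 = one_sample_t(sample_mean=82, population_mean=90, s=18, n=6)
reject_2 = t2 < -2.015

print(t1, reject_1)
print(t2, reject_2)
```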


ANOVA TEST
• The ANOVA test, in its simplest form, is used to check whether the means of three or more populations are equal or not. The ANOVA test applies when there are more than two independent groups.

• The ANOVA test compares the variability within the groups with the variability among the groups. The ANOVA test statistic is given by the F test.

• The ANOVA test can be defined as a type of test used in hypothesis testing to compare whether the means of more than two groups are equal or not.

• This test is used to check if the null hypothesis can be rejected or not.
EXAMPLE OF ANOVA F TEST
Example 1: The average of grade point averages (GPAs) of college courses in a specific major is a measure of the difficulty of the major. An educator wishes to conduct a study to find out whether the difficulty levels of different majors are the same. For such a study, a random sample of major GPAs of 11 graduating seniors at a large university is selected for each of the four majors: mathematics, English, education, and biology. The data are given in table. Test, at the 5% level of significance, whether the data contain sufficient evidence to conclude that there are differences among the average major GPAs of these four majors.

Solution:
• Step 1: The test of hypothesis is
$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
vs $H_a$: not all four population means are equal, at α = 0.05
• Step 2: The test statistic is F = MST/MSE with (since n = 44 and K = 4) degrees of freedom
$df_1$ = K − 1 = 4 − 1 = 3 and $df_2$ = n − K = 44 − 4 = 40

• Step 3: If we index the population of mathematics majors by 1, English majors by 2, education majors by 3, and biology majors by 4, then the sample sizes, sample means, and sample variances of the four samples in table are summarized (after rounding for simplicity) by:
The average of all 44 observations is $\bar{x}$ = 3.15. We compute

$MST = \dfrac{\sum_{i} n_i (\bar{x}_i - \bar{x})^2}{K - 1} = 0.585$

and $MSE = \dfrac{\sum_{i} (n_i - 1) s_i^2}{n - K} = 0.181$

so that $F = \dfrac{MST}{MSE} = \dfrac{0.585}{0.181} = 3.232$
• Step 4: The test is right-tailed. The single critical value is $F_{\alpha} = F_{0.05} = 2.84$. Thus the rejection region is [2.84, ∞).

• Step 5: Since F = 3.232 > 2.84, we reject $H_0$. The data provide sufficient evidence, at the 5% level of significance, to conclude that the averages of major GPAs for the four majors considered are not all equal.
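
The F statistic can be computed from raw group data as in the sketch below. The GPA table is not reproduced in this document, so the three small groups are hypothetical numbers invented for illustration. The last lines only recheck the ratio of the MST and MSE values reported in Step 3.

```python
from statistics import mean, variance


def one_way_anova_f(groups):
    """Return F = MST / MSE for a list of groups of observations."""
    k = len(groups)
    n = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n
    # Variability among the group means.
    mst = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups) / (k - 1)
    # Variability within the groups (pooled sample variances).
    mse = sum((len(g) - 1) * variance(g) for g in groups) / (n - k)
    return mst / mse


# Hypothetical data, not the GPA data from the example.
hypothetical_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_hypothetical = one_way_anova_f(hypothetical_groups)
print(f_hypothetical)

# Ratio of the values reported in Step 3 of the example.
f_reported = 0.585 / 0.181
print(round(f_reported, 3), f_reported > 2.84)
```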
CHI SQUARE
• The chi-squared test checks the difference between the observed values and the expected values.

• Chi-square checks the relationship between two categorical variables, and is calculated from the observed frequencies and the expected frequencies.

• The chi-square test gives a p-value that helps decide whether any association between the variables is statistically significant.
EXAMPLES OF CHI SQUARE
Example 1: Calculate the Chi-square value for the following data of incidences of water-borne diseases in
three tropical regions.
Solution:
Setting up the following table:
Example 2: As per the survey on cars owned by each family in the locality the data has been arranged in
the following table.
Solution:
Setting up the following table:

Therefore, $\chi^2 = \sum \dfrac{(O_i - E_i)^2}{E_i} = 0.837$

Answer: Chi Square = 0.837
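
The chi-square formula can be sketched in a few lines. The tables for the two examples are not reproduced in this document, so the observed and expected frequencies below are hypothetical values invented for illustration.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical frequencies, invented for illustration.
observed = [18, 22, 20]
expected = [20, 20, 20]

chi_square = chi_square_statistic(observed, expected)
print(chi_square)
```

A value near 0 means the observed frequencies are close to the expected ones; larger values indicate a bigger discrepancy.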
