( ) – optional: include it or not, as you prefer; it can also serve as a script
Highlights – examples to help the reporter understand, but it is fine not to include them in the
ppt
STATISTICAL TREATMENT
The term “statistical treatment” is a catch-all term which means to apply
any statistical method to your data.
This process is important for businesses because it allows them to take
customer feedback and turn it into actionable insights.
( Treatments are divided into two groups: descriptive statistics, which summarize your data as a
graph or summary statistic, and inferential statistics, which make predictions and test
hypotheses about your data. )
Descriptive Statistics
Descriptive statistics are used to describe the overall characteristics of a dataset.
The term ‘descriptive statistics’ can be used to describe both individual
quantitative observations (also known as ‘summary statistics’) as well as the
overall process of obtaining insights from these data.
Describe the features of populations and/or samples.
Organize and present data in a purely factual way.
Present final results visually, using tables, charts, or graphs.
Draw conclusions based on known data.
Use measures like central tendency, distribution, and variance.
Types of Descriptive Statistics
Distribution
shows us the frequency of different outcomes (or data points) in a
population or sample. (We can show it as numbers in a list or table, or we
can represent it graphically)
Examples
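A frequency distribution can be sketched in a few lines of Python; the survey responses below are made-up values for illustration only:

```python
from collections import Counter

# Made-up survey responses, for illustration only
observations = ["brown", "black", "blonde", "brown", "red",
                "black", "brown", "blonde", "brown", "black"]

# Count how often each outcome appears in the sample
distribution = Counter(observations)
for value, count in distribution.most_common():
    print(f"{value}: {count}")
```

Printed as a list, this is the same information a bar chart of the distribution would show graphically.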
Central Tendency
measurements that look at the typical central values within a dataset.
a general term used to describe a variety of central measurements.
might include central measurements from different quartiles of a larger
dataset.
Common measures of central tendency include:
The mean: The average value of all the data points.
The median: The central or middle value in the dataset.
The mode: The value that appears most often in the dataset.
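The three measures above can be computed directly with Python's standard statistics module; the exam scores below are made-up values for illustration:

```python
import statistics

# Made-up exam scores, for illustration only
scores = [70, 85, 85, 90, 75, 85, 60, 80]

print("Mean:", statistics.mean(scores))      # average of all data points
print("Median:", statistics.median(scores))  # middle value when sorted
print("Mode:", statistics.mode(scores))      # most frequent value
```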
Variability
Also known as dispersion.
describes how values are distributed or spread out.
Identifying variability relies on understanding the central tendency
measurements of a dataset. (However, like central tendency, variability is
not just one measure. It is a term used to describe a range of
measurements.)
Common measures of variability include:
Standard deviation: This shows us the amount of variation or dispersion. Low
standard deviation implies that most values are close to the mean. High standard
deviation suggests that the values are more broadly spread out.
Minimum and maximum values: These are the highest and lowest values in a
dataset or quartile. In an example dataset whose values run from 13 to 130, the
minimum and maximum values are 13 and 130 respectively.
Range: This measures the size of the distribution of values. It can be easily
determined by subtracting the smallest value from the largest. So, in the same
example dataset, the range is 117 (130 minus 13).
Kurtosis: This measures whether or not the tails of a given distribution contain
extreme values (also known as outliers). If a tail lacks outliers, we can say that it
has low kurtosis. If a dataset has a lot of outliers, we can say it has high kurtosis.
Skewness: This is a measure of a dataset’s symmetry. If you were to plot a bell-
curve and the right-hand tail was longer and fatter, we would call this positive
skewness. If the left-hand tail is longer and fatter, we call this negative skewness.
This is visible in the following image.
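The simpler measures of variability can also be checked numerically; a minimal sketch using made-up values that run from 13 to 130:

```python
import statistics

# Made-up values, for illustration only
values = [13, 45, 60, 72, 88, 95, 110, 130]

std_dev = statistics.stdev(values)           # sample standard deviation
minimum, maximum = min(values), max(values)  # lowest and highest values
value_range = maximum - minimum              # range = largest minus smallest

print(f"Std dev: {std_dev:.2f}")
print(f"Min: {minimum}, Max: {maximum}, Range: {value_range}")
```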
Inferential Statistics
Inferential statistics focus on making generalizations about a larger population
based on a representative sample of that population.
Because inferential statistics focus on making predictions (rather than stating facts),
their results are usually expressed in the form of a probability.
Use samples to make generalizations about larger populations.
Help us to make estimates and predict future outcomes.
Present final results in the form of probabilities.
Draw conclusions that go beyond the available data.
Use techniques like hypothesis testing, confidence intervals, and regression and
correlation analysis.
(Random sampling is very important for carrying out inferential techniques, but it is not
always straightforward)
Random sample
1. Define a population
This simply means determining the pool from which you will draw your sample. As we
explained earlier, a population can be anything—it isn’t limited to people. So it could be
a population of objects, cities, cats, pugs, or anything else from which we can derive
measurements!
2. Decide your sample size
The bigger your sample size, the more representative it will be of the overall population.
However, drawing large samples can be time-consuming, difficult, and expensive. Indeed, this is
why we draw samples in the first place: it is rarely feasible to draw data from an entire
population. Your sample size should therefore be large enough to give you confidence
in your results, but not so large that gathering the data becomes impractical; a sample
that is too small risks being unrepresentative (which is just shorthand for inaccurate).
This is where using descriptive statistics can help, as they allow us to strike a balance
between size and accuracy.
3. Randomly select a sample
Once you’ve determined the sample size, you can draw a random selection. You might
do this using a random number generator, assigning each value a number and selecting
the numbers at random. Or you could do it using a range of similar techniques or
algorithms (we won’t go into detail here, as this is a topic in its own right, but you get the
idea).
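Step 3 can be sketched with Python's random module; the population of 1,000 numbered members below is hypothetical:

```python
import random

# Hypothetical population: 1,000 members, each assigned a number
population = list(range(1, 1001))

random.seed(42)  # fixed seed so the example is reproducible
sample = random.sample(population, k=50)  # 50 members, drawn without replacement

print(f"Sample size: {len(sample)}")
print(f"All members unique: {len(set(sample)) == len(sample)}")
```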
4. Analyze the data sample
Once you have a random sample, you can use it to infer information about the larger
population. It’s important to note that while a random sample is representative of a
population, it will never be 100% accurate. For instance, the mean (or average) of a
sample will rarely match the mean of the full population, but it will give you a good idea
of it. For this reason, it’s important to incorporate your error margin in any analysis
(which we cover in a moment). This is why, as explained earlier, any result from
inferential techniques is in the form of a probability.
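The point that a sample mean approximates, but rarely equals, the population mean can be demonstrated with simulated data (all numbers below are made up):

```python
import random
import statistics

random.seed(0)
# Simulated population of 10,000 values (normal, mean 100, sd 15)
population = [random.gauss(100, 15) for _ in range(10_000)]
population_mean = statistics.mean(population)

# A random sample's mean lands close to, but not exactly on, the true mean
sample = random.sample(population, k=200)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
```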
However, presuming we’ve obtained a random sample, there are many inferential
techniques for analyzing and obtaining insights from those data. The list is long, but
some techniques worthy of note include:
Hypothesis testing
Confidence intervals
Regression and correlation analysis
Hypothesis testing
involves checking that your samples repeat the results of your hypothesis (or
proposed explanation).
aim is to rule out the possibility that a given result has occurred by chance.
A topical example of this is the clinical trials for the COVID-19 vaccines. Since it’s
impossible to carry out trials on an entire population, we carry out numerous trials on
several random, representative samples instead.
The hypothesis test, in this case, might ask something like: ‘Does the vaccine reduce
severe illness caused by COVID-19?’ By collecting data from different sample groups, we
can infer whether the vaccine will be effective. If all samples show similar results, and we
know that they are representative and random, we can generalize that the vaccine will have
the same effect on the population at large. On the flip side, if one sample shows higher
or lower efficacy than the others, we must investigate why this might be. For instance,
maybe there was a mistake in the sampling process, or perhaps the vaccine was
delivered differently to that group. In fact, in one COVID-19 vaccine trial, a dosing error
meant that one group actually showed higher efficacy than the other groups, which shows
how important hypothesis testing can be. If that outlier group had simply been written off,
the more effective dosing regimen would have been missed!
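As a rough sketch of the idea (not the actual trial methodology), a two-proportion z-test checks whether a difference between two sample groups is larger than chance alone would explain; all trial numbers below are invented:

```python
import math

# Invented trial numbers, for illustration only
vaccine_cases, vaccine_n = 10, 10_000  # severe cases in the vaccine group
placebo_cases, placebo_n = 60, 10_000  # severe cases in the placebo group

p1 = vaccine_cases / vaccine_n
p2 = placebo_cases / placebo_n
p_pooled = (vaccine_cases + placebo_cases) / (vaccine_n + placebo_n)

# z-statistic for the difference between the two proportions
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / vaccine_n + 1 / placebo_n))
z = (p1 - p2) / se

print(f"z = {z:.2f}")  # |z| > 1.96 rejects "no effect" at the 5% level
```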
Confidence interval
used to estimate certain parameters for a measurement of a population (such as
the mean) based on sample data.
Rather than providing a single mean value, the confidence interval provides a
range of values.
This is often given as a percentage.
For example, let’s say you’ve measured the tails of 40 randomly selected cats.
You get a mean length of 17.5cm. You also know the standard deviation of tail
lengths is 2cm. Using a standard formula, we can then state a range of values within
which the mean tail length of the full population of cats is likely to fall, at a 95%
confidence level.
Essentially, this tells us that we are 95% certain that the population mean
(which we cannot know without measuring the full population) falls within the
given range. This technique is very helpful for measuring the degree of
accuracy within a sampling method.
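The cat-tail example can be worked through with the usual normal-approximation formula (CI = mean ± 1.96 × sd / √n); this is a sketch, not the only way to compute an interval:

```python
import math

# Numbers from the cat-tail example above
sample_mean = 17.5  # cm
sample_sd = 2.0     # cm
n = 40
z = 1.96            # z-score for a 95% confidence level

# 95% CI = mean +/- z * (sd / sqrt(n))
margin = z * sample_sd / math.sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin

print(f"95% CI: {lower:.2f}cm to {upper:.2f}cm")
```

So we would report that the population mean tail length falls between roughly 16.88cm and 18.12cm, with 95% confidence.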
Regression Analysis
aims to determine how one dependent (or output) variable is impacted by one or
more independent (or input) variables.
It’s often used for hypothesis testing and predictive analytics.
For example, to predict future sales of sunscreen (an output variable), you might
compare last year’s sales against weather data (both input variables) to see
how much sales increased on sunny days.
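A minimal version of this idea is ordinary least squares with one input variable; the sunshine and sales figures below are invented:

```python
# Invented data: hours of sunshine (input) vs sunscreen units sold (output)
sun_hours = [2, 4, 6, 8, 10]
sales = [20, 40, 60, 80, 100]

n = len(sun_hours)
mean_x = sum(sun_hours) / n
mean_y = sum(sales) / n

# Ordinary least squares: slope = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sun_hours, sales)) \
    / sum((x - mean_x) ** 2 for x in sun_hours)
intercept = mean_y - slope * mean_x

# Use the fitted line to predict sales for a 12-hour sunny day
predicted = slope * 12 + intercept
print(f"sales = {slope:.1f} * hours + {intercept:.1f}; 12 hours -> {predicted:.0f}")
```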
Correlation Analysis
measures the degree of association between two or more datasets.
correlation does not imply cause and effect.
For instance, ice cream sales and sunburn are both likely to be higher on sunny days—
we can say that they are correlated. But it would be incorrect to say that ice cream
causes sunburn!
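The degree of association can be quantified with the Pearson correlation coefficient r (a value between -1 and 1); the daily counts below are invented:

```python
import math

# Invented daily counts: ice cream sales vs sunburn cases
ice_cream = [10, 20, 30, 40, 50]
sunburns = [1, 3, 4, 6, 9]

n = len(ice_cream)
mean_x = sum(ice_cream) / n
mean_y = sum(sunburns) / n

# Pearson's r: covariance divided by the product of the spreads
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(ice_cream, sunburns))
r = cov / math.sqrt(sum((x - mean_x) ** 2 for x in ice_cream)
                    * sum((y - mean_y) ** 2 for y in sunburns))

print(f"r = {r:.2f}")  # strongly correlated, but that does not prove causation
```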