Statistics
Statistics is a mathematical science that includes methods for collecting, organizing, analyzing
and visualizing data in such a way that meaningful conclusions can be drawn.
Statistics is also a field of study that summarizes data, interprets it, and supports decisions
based on the data.
Statistics is composed of two broad categories:
1. Descriptive Statistics
2. Inferential Statistics
1. Descriptive Statistics
Descriptive statistics describes the characteristics or properties of the data. It helps to summarize
the data in a meaningful way and allows important patterns to emerge from the data. Data
summarization techniques are used to identify the properties of the data and are helpful in
understanding its distribution. Descriptive statistics does not involve generalizing beyond the data
at hand.
1.1 Two types of descriptive statistics
1. Measures of central tendency (mean, median, mode)
2. Measures of data spread or dispersion (range, quartiles, variance, and standard deviation)
Measures of spread are the ways of summarizing a group of data by describing how scores are
spread out. To describe this spread, a number of statistics are available to us, including the range,
quartiles, absolute deviation, variance and standard deviation.
• The degree to which numerical data tend to spread is called the dispersion, or variance, of the data.
The common measures of data dispersion are the range, quartiles, outliers, and boxplots.
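As a concrete illustration (not part of the original notes), the measures above can be computed with Python's built-in statistics module on a small made-up sample:

```python
import statistics

data = [4, 8, 6, 5, 3, 8, 7, 6, 8, 5]  # hypothetical scores

# Measures of central tendency
mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

# Measures of spread (dispersion)
data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
variance = statistics.variance(data)          # sample variance
std_dev = statistics.stdev(data)              # sample standard deviation

print(mean, median, mode, data_range, variance, std_dev)
```

For this sample the mean and median are both 6, while the mode is 8, showing that the three measures of central tendency need not agree.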
2. Inferential Statistics
Inferential statistics is generally used when the user needs to draw a conclusion about a whole
population, and this is done using the various types of statistical tests available. It is a technique
used to understand trends and draw conclusions about a large population by taking and analyzing a
sample from it. Descriptive statistics, on the other hand, is concerned only with the smaller data set
at hand and does not usually involve large populations. Using the variables in the sample and the
relationships between them, we can make generalizations and predict relationships within the whole
population, regardless of how large it is.
With inferential statistics, data is taken from samples and generalizations are made about a
population. Inferential statistics use statistical models to compare sample data to other samples or
to previous research.
1. Estimating parameters:
This means taking a statistic from the sample data (for example, the sample mean) and using it to
infer a population parameter (i.e., the population mean). There may be sampling variations because
of chance fluctuations, variations in sampling techniques, and other sampling errors, and estimates
of population characteristics may be influenced by such factors. Therefore, the important point in
estimation is the extent to which our estimate is close to the true value.
Characteristics of a Good Estimator: A good statistical estimator should have the following
characteristics: (i) unbiased, (ii) consistent, (iii) accurate.
i) Unbiased
An unbiased estimator is one for which, if we were to obtain an infinite number of random samples of
a certain size, the mean of the statistic would be equal to the parameter. The sample mean x̄ is
an unbiased estimate of the population mean μ because, if we look at all possible random samples of
size N from the population, the mean of those sample means would be equal to μ.
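A small simulation can illustrate this (the population values below are made up): averaging the means of many random samples recovers the population mean μ.

```python
import random

random.seed(0)  # for reproducibility
population_mean = 50  # hypothetical μ
sample_size = 10

# Draw many random samples and record each sample mean.
sample_means = []
for _ in range(20000):
    sample = [random.gauss(population_mean, 10) for _ in range(sample_size)]
    sample_means.append(sum(sample) / sample_size)

# The mean of the sample means is very close to μ.
mean_of_means = sum(sample_means) / len(sample_means)
print(round(mean_of_means, 2))
```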
ii) Consistent
A consistent estimator is one for which, as the sample size increases, the probability that the
estimate has a value close to the parameter also increases. Because the sample mean is a consistent
estimator, a sample mean based on 20 scores has a greater probability of being close to μ than does a
sample mean based on only 5 scores.
iii) Accuracy
The sample mean is an unbiased and consistent estimator of the population mean μ. But we should not
overlook the fact that an estimate is just a rough or approximate calculation. In any given sample it
is unlikely that x̄ will be exactly equal to μ. Whether or not x̄ is a good estimate of μ depends upon
the representativeness of the sample, the sample size, and the variability of scores in the
population.
2. Hypothesis tests:
This is where sample data can be used to answer research questions. For example, we might be
interested in knowing whether a new cancer drug is effective, or whether eating breakfast helps
children perform better in school.
Inferential statistics is closely tied to the logic of hypothesis testing. We hypothesize that a
particular value characterizes the population of observations; the question is whether that
hypothesis is reasonable given the evidence from the sample. For this reason, hypothesis testing is
sometimes referred to as a statistical decision-making process, of the kind we carry out in
day-to-day situations.
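As a minimal sketch of this logic, here is a two-sided one-sample z-test, one of the simplest hypothesis tests; it assumes the population standard deviation is known, and the sample values, μ0, and σ below are made up for illustration.

```python
import math

def z_test(sample, mu0, sigma):
    """Test H0: the population mean equals mu0, assuming known sigma."""
    n = len(sample)
    x_bar = sum(sample) / n
    z = (x_bar - mu0) / (sigma / math.sqrt(n))
    # Two-sided p-value from the standard normal CDF (via math.erf).
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

sample = [52, 55, 49, 58, 51, 54, 53, 56, 50, 57]  # hypothetical scores
z, p = z_test(sample, mu0=50, sigma=5)
print(round(z, 3), round(p, 4))
# A small p-value (e.g. below 0.05) is evidence against H0.
```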
Descriptive Statistics                                         | Inferential Statistics
Concerned with describing the target population                | Makes inferences from the sample and generalizes them to the population
Organizes, analyzes, and presents the data in a meaningful way | Compares, tests, and predicts future outcomes
The analyzed results are in the form of graphs, charts, etc.   | The analyzed results are probability scores
Describes data that are already known                          | Tries to draw conclusions about the population beyond the available data
Tools: measures of central tendency and measures of spread     | Tools: hypothesis tests, analysis of variance, etc.
Random Variables
A random variable, X, is a variable whose possible values are numerical outcomes of a random
phenomenon. There are two types of random variables, discrete and continuous.
A discrete random variable is one which may take on only a countable number of distinct values such
as 0, 1, 2, 3, 4, ... Discrete random variables are usually counts. If a random variable can take only a
finite number of distinct values, then it must be discrete. Examples of discrete random variables
include the number of children in a family, the Friday night attendance at a cinema, the number of
patients in a doctor's surgery, the number of defective light bulbs in a box of ten, and the length
(in characters) of a tweet.
The probability distribution of a discrete random variable is a list of probabilities associated with
each of its possible values. It is also sometimes called the probability function or the probability
mass function.
Suppose a random variable X may take k different values, with the probability that X = xi defined to
be P(X = xi) = pi. The probabilities pi must satisfy the following:
1: 0 ≤ pi ≤ 1 for each i;
2: p1 + p2 + ... + pk = 1.
Example
Suppose a variable X can take the values 1, 2, 3, or 4. The probabilities associated with each outcome
are described by the following table:
Outcome      1    2    3    4
Probability  0.1  0.3  0.4  0.2
The probability that X is equal to 2 or 3 is the sum of the two probabilities:
P(X = 2 or X = 3) = P(X = 2) + P(X = 3) = 0.3 + 0.4 = 0.7. Similarly, the probability that X is
greater than 1 is equal to 1 - P(X = 1) = 1 - 0.1 = 0.9, by the complement rule.
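These calculations can be sketched in Python by storing the table as a dictionary mapping outcomes to probabilities:

```python
# The table above as a probability mass function.
pmf = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.2}

# Sanity check: probabilities are non-negative and sum to 1.
assert all(p >= 0 for p in pmf.values())
assert abs(sum(pmf.values()) - 1.0) < 1e-9

p_2_or_3 = pmf[2] + pmf[3]  # P(X = 2 or X = 3) = 0.3 + 0.4
p_greater_1 = 1 - pmf[1]    # P(X > 1) = 1 - 0.1, by the complement rule
print(p_2_or_3, p_greater_1)
```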
A continuous random variable is one which takes an infinite number of possible values. Continuous
random variables are usually measurements. Examples include height, weight, the amount of sugar
in an orange, the time required to run a mile.
A continuous random variable is not defined at specific values. Instead, it is defined over an interval
of values, and is represented by the area under a curve (known as an integral). The probability of
observing any single value is equal to 0, since the number of values which may be assumed by the
random variable is infinite.
Suppose a random variable X may take all values over an interval of real numbers. Then the
probability that X is in the set of outcomes A, P(A), is defined to be the area above A and under a
curve. The curve, which represents a function p(x), must satisfy the following:
1: The curve has no negative values (p(x) ≥ 0 for all x);
2: The total area under the curve is equal to 1.
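The standard requirements are that the curve never dips below zero (p(x) ≥ 0 for all x) and that the total area under it equals 1. A numerical sketch, using the hypothetical density p(x) = 2x on [0, 1] and zero elsewhere:

```python
def p(x):
    """Hypothetical density: p(x) = 2x on [0, 1], zero elsewhere."""
    return 2 * x if 0 <= x <= 1 else 0.0

def area(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

total = area(p, 0, 1)   # total area under the curve: approximately 1
prob = area(p, 0.5, 1)  # P(0.5 <= X <= 1): the area above [0.5, 1]
print(round(total, 4), round(prob, 4))
```

Note that P(X = 0.5) itself is 0, since the area above a single point is zero; only intervals carry probability.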
All random variables (discrete and continuous) have a cumulative distribution function. It is a
function giving the probability that the random variable X is less than or equal to x, for every
value x. For a discrete random variable, the cumulative distribution function is found by summing up
the probabilities of all outcomes less than or equal to x.
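Using the pmf from the earlier example, the discrete cumulative distribution function can be sketched as a running sum of probabilities:

```python
pmf = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.2}  # pmf from the earlier example

def cdf(x):
    """F(x) = P(X <= x): sum the probabilities of outcomes not exceeding x."""
    return sum(p for value, p in pmf.items() if value <= x)

# F is a non-decreasing step function that rises from 0 to 1.
print(cdf(0), cdf(1), cdf(2), cdf(3), cdf(4))
```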