Lec 1
Associate Professor
Department of Statistics, DU
Introduction to Statistics
Statistics is the science of collecting, organizing, classifying, summarizing, analyzing, and interpreting numerical
information.
Example: Measuring average rainfall in Dhaka over 10 years to predict future weather patterns.
Why Study Statistics
• Data are everywhere.
• Statistical techniques are used to make many decisions that affect our lives or our personal
welfare.
• No matter what line of work you select, you will find yourself faced with decisions where an
understanding of data analysis is helpful.
✓ Sociolinguistics: A phonological variable might span the different realizations of a vowel. In some
words, like pen, some speakers pronounce the vowel so that pen rhymes with pin (A), while other
speakers say pen (B). The likelihood that a speaker will say (A) may also be influenced by age,
socioeconomic status, gender, current peer group, etc.
✓ Medical researchers study the cure rates for diseases using different drugs and different forms of
treatment. For example, what is the effect of treating a certain type of knee injury surgically or with
physical therapy?
Terminologies
Population: In statistics the term “population” refers to the collection of all items/entities, of
whatever kind, that we are interested in studying. The number of individuals in the population is
called the population size. It is usually denoted by 𝑁. For example, populations may include (1) all
students of DU, (2) all registered voters in BD, (3) all students in a school for special children,
(4) all sentences of a novel (say, Iris Murdoch’s The Bell, Penguin edition, 1962), etc.
✓ In studying a population, we focus on one or more characteristics or properties of the units in the
population. We call such characteristics variables.
Example: To estimate average IQ of DU students, you test 400 students instead of all 30,000.
• Sample: The term “sample” refers to a portion of the population that is selected from the
population with the view to representing the population. The number of individuals in the
sample is called sample size. Sample size is usually denoted by 𝑛.
✓ Populations are usually too large or inaccessible to measure completely, so looking at a sample
of the population of interest is the only feasible way to estimate it.
✓ For instance, (1) it would not be practical to measure the average life span of the entire mosquito
population in Dhaka. We would need to measure the life span of a random sample of mosquitoes
and use these data to estimate the average life span of the entire population. (2) Suppose we are
interested in the IQ of DU students. Instead of examining all 30,000 students, one may select a
sample of just 400 students and record (measure) the IQ of each sampled student.
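The idea of estimating a population mean from a sample can be sketched in Python. This is only an illustration: the population below is simulated (IQ scores drawn from a normal distribution with mean 100 and standard deviation 15, which is an assumption and not real DU data), and the names `population` and `sample` are hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

N = 30_000  # population size (all students)
n = 400     # sample size

# Hypothetical population: simulated IQ scores, NOT real DU data.
population = [random.gauss(100, 15) for _ in range(N)]

# Simple random sample of n students, drawn without replacement.
sample = random.sample(population, n)

# In practice the population mean is unknown; it is computed here
# only because the population is simulated.
population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f}")
```

With a random sample of 400, the sample mean typically lands close to the population mean, which is why measuring all 30,000 students is unnecessary.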
Variable and its type
• Variable: A characteristic whose value varies from person to person, object to object or from
phenomenon to phenomenon. For example: age, gender etc.
• Qualitative Variable: A variable whose numerical measurement is not possible. Examples:
gender (male=1, female=2); the rating of words by informants on a scale of pleasantness
ranging from 1 to 5 (1 = very unpleasant, 2 = unpleasant, 3 = neither unpleasant nor
pleasant, 4 = pleasant, 5 = very pleasant); working memory profiles of children with
attention deficit hyperactivity disorder (ADHD): Verbal STM (short-term memory),
Visuo-spatial STM, Verbal WM (working memory), Visuo-spatial WM; etc.
Types of Variables
• Quantitative variable: A variable whose resulting observations are numeric and thus
possess a natural ordering. Example: family size, yearly rainfall (mm) in Dhaka etc.
✓ a) Discrete variable: A discrete variable can assume only isolated values. Example: household size,
number of accidents on a certain highway, the number of sentences remembered correctly by
informants ten minutes after first hearing them.
✓ b) Continuous variable: Theoretically, a continuous variable can assume any value within its
domain. Example: Weight of children, the length of pauses (milliseconds) in a conversation.
Scale of Measurement/ Level of measurement
• Measurement is the process of assigning numbers to characteristics or variables
according to scientific rules.
• Variables can be measured under four levels or scales of measurement. The measurement
scales are:
1. Nominal scale
2. Ordinal scale
3. Interval scale
4. Ratio scale
Level of Measurement (cont…)
Nominal Scale
• For the nominal level of measurement, observations of a qualitative variable can only be
classified and counted; numbers are assigned to the variable values for identification only.
• For example: consider the variable Working memory profiles of children with ADHD which
can be categorized as Verbal STM (1), Visuo-spatial STM (2), Verbal WM (3) and Visuo-spatial
WM (4).
• However, arithmetic on these numbers is not permissible. To explain, 1 + 2 does not
equal 3; that is, Verbal STM + Visuo-spatial STM does not yield Verbal WM.
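A small Python sketch of what is meaningful at the nominal level: counting how often each category occurs and finding the most frequent one. The category codes are those used above; the ten observations are made up for illustration.

```python
from collections import Counter

# Codes are labels only; they carry no order and no magnitude.
labels = {
    1: "Verbal STM",
    2: "Visuo-spatial STM",
    3: "Verbal WM",
    4: "Visuo-spatial WM",
}

# Hypothetical profile codes for ten children (illustrative data).
observed = [1, 3, 2, 1, 4, 1, 2, 3, 1, 2]

# Classifying and counting are the only meaningful operations.
counts = Counter(labels[code] for code in observed)
most_common_label, most_common_count = counts.most_common(1)[0]

for label, count in counts.items():
    print(f"{label}: {count}")
print(f"mode: {most_common_label} ({most_common_count} children)")

# Averaging the codes would run without error, but the result
# (here 2.0) would not correspond to any category property.
```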
Level of Measurement (cont…)
Ordinal Scale
• The next higher level of data is the ordinal level. In this level, observations of a qualitative variable
are represented by sets of labels that have relative importance. Hence, observations can be ranked
or ordered so that numbers are assigned to the variable values for identification as well as for
ranking.
• Ordinal scales are often used for measures of satisfaction, happiness, and so on. For example,
consider a survey question such as "How likely are you to recommend our services
to your friends?" with possible responses (1) very likely (2) likely (3) neutral (4) unlikely (5) very
unlikely.
• Here, we don't really know what the difference is between very unlikely and unlikely, or whether it is the
same amount of likeliness (or unlikeliness) as between likely and very likely. We just know that likely
is more than neutral and unlikely is more than very unlikely. It's all in the order. To explain, the gap
between codes 1 and 2 need not equal the gap between codes 4 and 5, even though 2 - 1 = 5 - 4 numerically.
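At the ordinal level, ranking the responses and reporting a middle category are meaningful, while averaging the codes is not. A minimal Python sketch, using the survey codes above and made-up responses:

```python
import statistics

# Survey codes from the example: order matters, spacing does not.
labels = {
    1: "very likely",
    2: "likely",
    3: "neutral",
    4: "unlikely",
    5: "very unlikely",
}

# Hypothetical responses from nine customers (illustrative data).
responses = [1, 2, 2, 3, 5, 4, 2, 1, 3]

# Ranking is meaningful because the codes are ordered.
ranked = sorted(responses)

# median_low always returns an observed value, so the result
# is a real category rather than a point between two codes.
median_code = statistics.median_low(ranked)

print("ranked codes:", ranked)
print("median response:", labels[median_code])
```

`statistics.median_low` is chosen here so that the summary is always one of the actual response categories.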
Level of Measurement (cont…)
Interval Scale
• The interval level of measurement is the next highest level. It includes all the characteristics of
the ordinal level, but in addition, the difference between values is a constant size.
• Under this level, numbers are assigned to the variable values in such a way that the
measurement scale is divided into equal units and the zero value on the scale is
not an absolute zero. For example: temperature (°F), IQ score, dress size etc.
• Suppose the temperatures on three consecutive days in Dhaka are 82°F, 87°F, and 68°F. These temperatures can be easily
ranked, but we can also determine the difference between temperatures.
• This is possible because 1°F represents a constant unit of measurement. Equal differences between two temperatures
are the same, regardless of their position on the scale. That is, the difference between 10°F and 15°F is 5 degrees, and the difference
between 50°F and 55°F is also 5 degrees.
• It is also important to note that 0 is just a point on the scale. It does not represent the absence of the condition. Zero
degrees does not represent the absence of heat, just that it is cold! In fact 0 degrees Fahrenheit is about -18 degrees on
the Celsius scale.
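The interval-scale properties can be checked numerically with the standard Fahrenheit-to-Celsius conversion, C = (F - 32) × 5/9. The Python sketch below is only illustrative; the function name `fahrenheit_to_celsius` and the 40°F/80°F comparison are choices made here, not part of the lecture.

```python
def fahrenheit_to_celsius(f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

# Zero on the Fahrenheit scale is just a point, not "no heat".
zero_f_in_c = fahrenheit_to_celsius(0)
print(f"0 F = {zero_f_in_c:.1f} C")

# Equal differences stay equal after changing the scale.
diff_low = fahrenheit_to_celsius(15) - fahrenheit_to_celsius(10)
diff_high = fahrenheit_to_celsius(55) - fahrenheit_to_celsius(50)
print(f"10 F to 15 F = {diff_low:.2f} C, 50 F to 55 F = {diff_high:.2f} C")

# Ratios are NOT preserved: 80 F looks like "twice" 40 F,
# but the same two temperatures in Celsius have a different ratio.
ratio_f = 80 / 40
ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)
print(f"ratio in F = {ratio_f:.1f}, ratio in C = {ratio_c:.1f}")
```

Because the ratio changes with the unit, statements such as "twice as hot" are not meaningful on an interval scale.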
Level of Measurement (cont…)
Ratio Scale
• Practically all quantitative data is recorded on the ratio level of measurement. The ratio level is
the “highest” level of measurement. It has all the characteristics of the interval level, but in
addition, the 0 point is meaningful and the ratio between two numbers is meaningful.
• Under this level, numbers are assigned to the variable values in such a way that the
measurement scale is broken down on a scale of equal units and the zero value on the scale is
absolutely zero. For example: weight, pulse rate, etc.
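On a ratio scale, the ratio between two values does not depend on the unit of measurement. A short Python sketch with two made-up weights; the conversion factor is approximate and used only for illustration.

```python
KG_TO_LB = 2.20462  # approximate kilograms-to-pounds factor

# Hypothetical weights of two children, in kilograms.
weight_a_kg = 30.0
weight_b_kg = 60.0

# The same weights expressed in pounds.
weight_a_lb = weight_a_kg * KG_TO_LB
weight_b_lb = weight_b_kg * KG_TO_LB

# Because zero means "no weight" in both units, ratios agree.
ratio_kg = weight_b_kg / weight_a_kg
ratio_lb = weight_b_lb / weight_a_lb

print(f"ratio in kg = {ratio_kg:.2f}")
print(f"ratio in lb = {ratio_lb:.2f}")
```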
Scales of Measurement
Scale      Properties                       Example
Nominal    Categories only (no order)       Hair color (1=Black, 2=Brown)
Ordinal    Ordered categories               Likert scale (1=Strongly Agree)
Interval   Equal intervals, no true zero    Temperature (°C)
Ratio      True zero, all math operations   Weight (kg)
Comparative Study of scales of measurement
Variables
• Qualitative: hair color, gender, economic status
• Quantitative: family size, yearly rainfall (mm), IQ score
Exercise
Hints: Ordinal = ordered but unequal intervals. Ratio = a true zero exists, so 200 ms is twice as long as 100 ms.
(ii) the rating of words by informants, on a scale of pleasantness, ranging from 1 to 5 (1 = very unpleasant, 2
= unpleasant, 3 = neither unpleasant nor pleasant, 4 = pleasant, 5 = very pleasant);
(iii) the presence or absence of a finite verb in each clause in a particular text;
(iv) the degree of grammaticality of sentences, on a scale from 0 (absolutely ungrammatical) to 4 (entirely
grammatical);
(v) the number of sentences remembered correctly by informants ten minutes after first hearing them,
(vi) Working memory profiles of children with attention deficit hyperactivity disorder→ADHD (Verbal
STM→short-term memory, Visuo-spatial STM, Verbal WM→working memory, Visuo-spatial WM)
• Classify each of the variables as qualitative/discrete/continuous. Also identify the level of measurement
(nominal/ordinal/interval/ratio). Give reasons to support your answers.
Data versus Information
• Data are the foundation of the field of statistics and can be defined as the values assigned
to specific observations or measurements.
• The main examples of data are weights, prices, costs, numbers of items sold, employee
names, product names, addresses, tax codes, registration marks etc.
• The word “data” is the plural of “datum.” Data are the raw material that can be processed by any
computing machine. When data are processed, organized, structured or presented in a
given context so as to make them useful, they are called information.
• Information is data that are transformed into useful facts that can be used for a specific
purpose, such as making a decision.
Descriptive and Inferential Statistics
• Two branches of Statistics: (1) Descriptive statistics and (2) Inferential statistics.
• Descriptive statistics: The methods of statistics that help to describe, show or summarize
data in a meaningful way (patterns might emerge from the data). However, they do not allow
us to make conclusions beyond the data we have analyzed (cannot conclude regarding the
population).
✓ The main focus of descriptive statistics is to summarize and display data.
✓ The elements of descriptive statistics include tables, graphs and numerical summary tools.
• Inferential statistics: This method uses the theory of probability to make inferences about the
population, based on a random sample.
✓ It covers a large variety of techniques that allow us to make actual claims about a population
based on sample data.
Descriptive: Summarizes data (e.g., average rainfall in April). Ex: Histogram of student heights.
Inferential: Draws conclusions (e.g., predicting election results from a sample). Ex: Using sample data to
estimate DU’s average IQ.
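• Parameter: Population characteristic (e.g., mean IQ of all DU students, $\mu$).
• Statistic: Sample characteristic, that is, a function of the sample observations (e.g., mean IQ of the sampled students, $\bar{X}$).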
• Estimator: A statistic which is used to estimate the value of the unknown parameter is known as an
estimator of that parameter. For example: $\bar{X}$ is an estimator of $\mu$.
✓ An estimator is a random variable which takes different values from sample to sample.
• Estimate: An estimate is a numerical value of the estimator obtained from a particular sample.
✓ Suppose there are 20 students in a population and we want to investigate the mean pause
length (in milliseconds) of all students by taking a sample of size 10. Let $X_i$ be the pause length
(in milliseconds) of the $i$th student. Then
(i) the parameter $\mu = \frac{\sum_{i=1}^{20} X_i}{20}$ = mean pause length of all 20 students (which is unknown to us. Why?);
(ii) the functions sample mean $\bar{X} = \frac{\sum_{i=1}^{10} X_i}{10}$, sample variance $S^2 = \frac{1}{9}\sum_{i=1}^{10} (X_i - \bar{X})^2$, etc. are statistics;
(iii) the statistic $\bar{X} = \frac{\sum_{i=1}^{10} X_i}{10}$ is an estimator for $\mu$.
Suppose we have selected a sample of 10 students and their pause lengths (in
milliseconds) are 564.6, 562.6, 565.4, 564.4, 561.5, 561.9, 565.6, 563.3, 559.3, and 562.7.
(iv) Then $\bar{X} = 563.1$ is an estimate of $\mu$.
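The sample mean and sample variance for the ten pause lengths listed above can be checked with a short Python sketch. The variable names (`pause_ms`, `x_bar`, `s_squared`) are chosen here for illustration; the variance uses the divisor n - 1 = 9.

```python
import statistics

# Pause lengths (milliseconds) of the 10 sampled students.
pause_ms = [564.6, 562.6, 565.4, 564.4, 561.5,
            561.9, 565.6, 563.3, 559.3, 562.7]

n = len(pause_ms)

# Sample mean: sum of the observations divided by n.
x_bar = sum(pause_ms) / n

# Sample variance: squared deviations from the mean, divided by n - 1.
s_squared = sum((x - x_bar) ** 2 for x in pause_ms) / (n - 1)

print(f"n = {n}")
print(f"sample mean = {x_bar:.1f} ms")
print(f"sample variance = {s_squared:.2f} ms^2")

# Cross-check against the standard library, which also divides by n - 1.
assert abs(s_squared - statistics.variance(pause_ms)) < 1e-9
```

A different sample of 10 students would give a different value of the sample mean, which is why the estimator is a random variable while each computed value is an estimate.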
Exercise
• (1) 500 students from the University of Dhaka were selected to estimate the proportion of
students with (any) drug addiction. The sample proportion was 0.12. In this context, identify
the (i) population (ii) sample (iii) parameter (iv) statistic (v) estimator and (vi) estimate.
• (2) 100 sentences of a novel (say, Iris Murdoch’s The Bell, Penguin edition, 1962) were
selected to estimate the mean length of sentences of that novel. The sample mean was 8.7.
In this context, identify the (i) population (ii) sample (iii) parameter (iv) statistic (v) estimator
and (vi) estimate.
Hint for (2): Estimate = 8.7.