Lec 1

The document is a course outline for BSLP 2407 Statistics, taught by M. Ershadul Haque, covering the definition, importance, and applications of statistics in various fields such as linguistics and medicine. It explains key terminologies including population, sample, variables, and levels of measurement, as well as the distinction between descriptive and inferential statistics. The document emphasizes the role of statistics in decision-making and data analysis across different disciplines.

Uploaded by

Hassan Khan

COURSE TITLE: Statistics

COURSE CODE: BSLP 2407

Instructor: M. Ershadul Haque

Associate Professor

Department of Statistics, DU
Introduction to Statistics

• It is difficult to define statistics in a few words.

✓ Its dimension, scope, function, use and importance are constantly changing over time.

• The science of statistics is essentially a branch of applied mathematics and may be
regarded as mathematics applied to observational data (according to Fisher).

✓ The science of learning from data.

✓ It involves collecting, organizing, classifying, summarizing, analyzing, and interpreting numerical
information.
What is Statistics?

The science of collecting, organizing, classifying, summarizing, analyzing, and interpreting numerical
information.

Example: Measuring average rainfall in Dhaka over 10 years to predict future weather patterns.
Why Study Statistics
• Data are everywhere.

• Statistical techniques are used to make many decisions that affect our lives or our personal
welfare.

• No matter what line of work you select, you will find yourself faced with decisions where an
understanding of data analysis is helpful.

Statistics is studied because:

Data are everywhere.

It helps make decisions (e.g., medical treatments).

It analyzes trends (e.g., vowel frequencies in linguistics).

Researchers use it to compare solutions (e.g., surgical vs. therapy outcomes).


Why Study Statistics (cont…)
• Linguistic research involves measuring and observing data. Therefore, statistical analysis
can be an important tool in linguistic investigation.
✓ Phonetics: Phoneticians have a long tradition of quantifying their observations of pronunciation
and hearing. For example, we may be interested in comparing the F1 frequency (Hz) of male and
female speakers for five different vowels.

✓ Sociolinguistics: A phonological variable might span the different realizations of a vowel. In some
words, like pen, some speakers pronounce pen so that it rhymes with pin (A), while other speakers say pen (B).
Perhaps, also, the likelihood that a speaker will say (A) is influenced by age, socioeconomic status,
gender, current peer group, etc.

✓ Medical researchers study the cure rates for diseases using different drugs and different forms of
treatment. For example, what is the effect of treating a certain type of knee injury surgically or with
physical therapy?
Terminologies

Population: In statistics the term “population” refers to the collection of all items/entities, of
whatever kind, that we are interested in studying. The number of individuals in the population is
called the population size. It is usually denoted by 𝑁. For example, populations may include (1) all
students of DU, (2) all registered voters in BD, (3) all students in a school of special children,
(4) all sentences of a novel (say, Iris Murdoch’s The Bell, Penguin edition, 1962), etc.
✓ In studying a population, we focus on one or more characteristics or properties of the units in the
population. We call such characteristics variables.

Population vs. Sample

Population: All items of interest (e.g., all DU students).

Sample: A subset of the population (e.g., 400 DU students surveyed).

Example: To estimate average IQ of DU students, you test 400 students instead of all 30,000.
Terminologies

• Sample: The term “sample” refers to a portion of the population that is selected from the
population with the view to representing the population. The number of individuals in the
sample is called sample size. Sample size is usually denoted by 𝑛.
✓ Populations are usually too large or inaccessible to measure completely, so looking at a sample
of the population of interest is the only feasible way to estimate it.

✓ For instance, (1) it would not be practical to measure the average life span of the entire mosquito
population in Dhaka. We would need to measure the life span of a random sample of mosquitoes
and use these data to estimate the average life span of the entire population. (2) Suppose we are
interested in the IQ of the DU students. Instead of examining all 30000 students, one may select a
sample of just 400 students and record (measure) the IQ of each sampled student.
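
The IQ example can be sketched in Python. This is only an illustration under stated assumptions: the population is simulated, and the scores are hypothetical values drawn from a normal distribution with mean 100 and standard deviation 15. Only the sizes 30000 and 400 come from the example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

N = 30000  # population size (all DU students in the example)
n = 400    # sample size

# Hypothetical IQ scores; in practice the full population is not observed.
population = [random.gauss(100, 15) for _ in range(N)]

# Simple random sample without replacement.
sample = random.sample(population, n)

population_mean = statistics.mean(population)  # usually unknown in practice
sample_mean = statistics.mean(sample)          # what we can actually compute

print(f"population mean = {population_mean:.2f}")
print(f"sample mean     = {sample_mean:.2f}")
```

The sample mean will typically be close to, but not exactly equal to, the population mean.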
Variable and its type

• Variable: A characteristic whose value varies from person to person, object to object or from
phenomenon to phenomenon. For example: age, gender, etc.

• Qualitative Variable: A variable whose numerical measurement is not possible, such as
gender (male=1, female=2); the rating of words by informants on a scale of pleasantness
ranging from 1 to 5 (1 = very unpleasant, 2 = unpleasant, 3 = neither unpleasant nor
pleasant, 4 = pleasant, 5 = very pleasant); working memory profiles of children with
attention deficit hyperactivity disorder (ADHD): Verbal STM (short-term memory), Visuo-spatial
STM, Verbal WM (working memory), Visuo-spatial WM; etc.
Types of Variables

Type          Definition                 Example
Qualitative   Non-numeric (categories)   Gender (Male/Female)
Quantitative  Numeric                    Age, Weight
Discrete      Whole numbers only         Number of accidents
Continuous    Any value in a range       Height (e.g., 165.5 cm)
Variable and its type

• Quantitative variable: A variable whose resulting observations are numeric and thus
possesses a natural ordering. Example: family size, yearly rainfall (mm) in Dhaka, etc.

• A quantitative variable can be either (a) Discrete or (b) Continuous.

✓ a) Discrete variable: A discrete variable can assume only isolated values. Example: Household size,
number of accidents on a certain highway, the number of sentences remembered correctly by
informants ten minutes after first hearing them.

✓ b) Continuous variable: Theoretically, a continuous variable can assume any value within its
domain. Example: Weight of children, the length of pauses (milliseconds) in a conversation.
Scale of Measurement/ Level of measurement
• Measurement is a process of assigning numbers to characteristics or variables
according to scientific rules.

• Data can be classified according to levels of measurement.


• The level of measurement of the data dictates the calculations that can be done to
summarize and present the data.
• It will also determine the statistical tests that should be performed.

• Variables can be measured under four levels or scales of measurement. The measurement
scales are:

1. Nominal scale
2. Ordinal scale
3. Interval scale
4. Ratio scale
Level of Measurement (cont…)

Nominal Scale

• For the nominal level of measurement, observations of a qualitative variable can only be
classified and counted. Numbers are assigned to the variable values for identification only.

• There is no particular order to the labels.

• For example: consider the variable Working memory profiles of children with ADHD which
can be categorized as Verbal STM (1), Visuo-spatial STM (2), Verbal WM (3) and Visuo-spatial
WM (4).

• However, arithmetic on these numbers is not permissible. To explain, 1 + 2 does not
equal 3 here; that is, Verbal STM + Visuo-spatial STM does not yield Verbal WM.
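
Counting is the one operation that does make sense for nominal data. A minimal Python sketch follows, assuming a hypothetical list of profile codes for ten children. The codes 1 to 4 follow the labelling in the example, but the observations themselves are invented for illustration.

```python
from collections import Counter

# Code-to-label mapping from the example; the numbers are identifiers only.
LABELS = {
    1: "Verbal STM",
    2: "Visuo-spatial STM",
    3: "Verbal WM",
    4: "Visuo-spatial WM",
}

# Hypothetical observations for 10 children (illustrative only).
codes = [1, 3, 2, 1, 4, 1, 2, 3, 1, 2]

# Classify and count: the only meaningful summary at the nominal level.
counts = Counter(codes)
for code, label in LABELS.items():
    print(f"{label}: {counts[code]}")

# The most frequent category (the mode) is a valid nominal summary.
mode_code, mode_count = counts.most_common(1)[0]
print("most frequent profile:", LABELS[mode_code])
```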
Level of Measurement (cont…)
Ordinal Scale

• The next higher level of data is the ordinal level. In this level, observations of a qualitative variable
are represented by sets of labels that have relative importance. Hence, observations can be ranked
or ordered so that numbers are assigned to the variable values for identification as well as for
ranking.

• Ordinal scales are often used for measures of satisfaction, happiness, and so on. For example:
Have you ever taken one of those surveys, like this? "How likely are you to recommend our services
to your friends?" Possible responses are (1) very likely (2) likely (3) neutral (4) unlikely (5) very
unlikely.

• Here, we don't really know what the difference is between very unlikely and unlikely, or if it's the
same amount of likeliness (or unlikeliness) as between likely and very likely. We just know that likely
is more than neutral and unlikely is more than very unlikely. It's all in the order. To explain, the gap
between codes 1 and 2 need not represent the same amount as the gap between codes 4 and 5, even
though 2 - 1 and 5 - 4 are both 1.
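
One way to see that only the order matters is to recode the responses with different spacing. The sketch below is illustrative: the responses and the alternative coding `recode` are hypothetical. The median category survives the recoding, while the mean changes with the arbitrary spacing.

```python
import statistics

# Hypothetical responses coded 1 (very likely) to 5 (very unlikely).
responses = [1, 2, 2, 3, 5, 4, 2, 1, 3]

# An equally valid ordinal coding: same order, different spacing.
recode = {1: 1, 2: 2, 3: 3, 4: 7, 5: 20}
recoded = [recode[r] for r in responses]

# The median identifies the same category under both codings.
median_original = statistics.median_low(responses)
median_recoded = statistics.median_low(recoded)
print("median category:", median_original, "->", median_recoded)

# The mean depends on the spacing, so it is not a safe ordinal summary.
mean_original = statistics.mean(responses)
mean_recoded = statistics.mean(recoded)
print("means:", round(mean_original, 2), round(mean_recoded, 2))
```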
Level of Measurement (cont…)

Interval Scale
• The interval level of measurement is the next highest level. It includes all the characteristics of
the ordinal level, but in addition, the difference between values is a constant size.
• Under this level, numbers are assigned to the variable values in such a way that the
measurement scale is broken down on a scale of equal units and the zero value on the scale is
not absolutely zero. For example: temperature (°F), IQ score, dress size, etc.
• Suppose the temperatures on three consecutive days in Dhaka are 82°F, 87°F, and 68°F. These temperatures can be easily
ranked, but we can also determine the difference between temperatures.
• This is possible because 1°F represents a constant unit of measurement. Equal differences between two temperatures
are the same, regardless of their position on the scale. That is, the difference between 10°F and 15°F is 5 degrees, and the
difference between 50°F and 55°F is also 5 degrees.
• It is also important to note that 0 is just a point on the scale. It does not represent the absence of the condition. Zero
degrees Fahrenheit does not represent the absence of heat, just that it is cold! In fact 0 degrees Fahrenheit is about -18 degrees on
the Celsius scale.
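
These properties can be checked with the standard Fahrenheit to Celsius conversion. The sketch below is illustrative, and the temperatures 40°F and 80°F are arbitrary. Equal differences remain equal after converting units, but ratios do not, which is why "twice as hot" is not meaningful on an interval scale.

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

# Equal differences stay equal after a change of units.
diff_low = fahrenheit_to_celsius(15) - fahrenheit_to_celsius(10)
diff_high = fahrenheit_to_celsius(55) - fahrenheit_to_celsius(50)
print("differences in Celsius:", round(diff_low, 2), round(diff_high, 2))

# Ratios change with the units, so they carry no meaning.
ratio_f = 80 / 40
ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)
print("ratio in Fahrenheit:", ratio_f, "ratio in Celsius:", round(ratio_c, 2))

# Zero is just a point on the scale, not the absence of heat.
zero_f_in_c = fahrenheit_to_celsius(0)
print("0 degrees Fahrenheit in Celsius:", round(zero_f_in_c, 1))
```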
Level of Measurement (cont…)

Ratio Scale
• Practically all quantitative data is recorded on the ratio level of measurement. The ratio level is
the “highest” level of measurement. It has all the characteristics of the interval level, but in
addition, the 0 point is meaningful and the ratio between two numbers is meaningful.
• Under this level, numbers are assigned to the variable values in such a way that the
measurement scale is broken down on a scale of equal units and the zero value on the scale is
absolutely zero. For example: weight, pulse rate, etc.
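
By contrast, ratios on a ratio scale do not depend on the unit. A small sketch with hypothetical weights follows. The conversion factor is approximate, and the names `KG_TO_LB` and `kg_to_lb` are introduced here for illustration.

```python
KG_TO_LB = 2.20462  # approximate pounds per kilogram

def kg_to_lb(kg):
    """Convert kilograms to pounds (approximate)."""
    return kg * KG_TO_LB

# Hypothetical weights, illustrative only.
weight_a_kg = 80.0
weight_b_kg = 40.0

# The ratio is the same in either unit: "twice as heavy" is meaningful.
ratio_kg = weight_a_kg / weight_b_kg
ratio_lb = kg_to_lb(weight_a_kg) / kg_to_lb(weight_b_kg)
print(ratio_kg, ratio_lb)

# Zero means "no weight" in every unit: the zero is absolute.
print(kg_to_lb(0.0))
```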

Scales of Measurement
Scale     Properties                        Example
Nominal   Categories only (no order)        Hair color (1=Black, 2=Brown)
Ordinal   Ordered categories                Likert scale (1=Strongly Agree)
Interval  Equal intervals, no true zero     Temperature (°C)
Ratio     True zero, all math operations    Weight (kg)
Comparative Study of scales of measurement

Scale of Measurement  Mathematical Operations                                              Examples
Nominal               Counting                                                             Gender, Religion
Ordinal               Counting & Ranking                                                   Economic Status
Interval              Counting, Ranking, Addition & Subtraction                            Temperature, IQ score
Ratio                 Counting, Ranking, Addition, Subtraction, Multiplication & Division  Age, Family size
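
The comparison can also be written as a small lookup, which makes the nesting of the scales explicit: each level permits everything the previous level does, plus more. This is a sketch only. The names `PERMITTED_OPS` and `is_permitted` are hypothetical, not part of any library.

```python
# Operations permitted at each level, following the comparison table.
PERMITTED_OPS = {
    "nominal": {"count"},
    "ordinal": {"count", "rank"},
    "interval": {"count", "rank", "add", "subtract"},
    "ratio": {"count", "rank", "add", "subtract", "multiply", "divide"},
}

def is_permitted(scale, operation):
    """Return True if `operation` is meaningful for data on `scale`."""
    return operation in PERMITTED_OPS[scale.lower()]

print(is_permitted("Ordinal", "rank"))     # ranking economic status
print(is_permitted("Interval", "divide"))  # ratio of two temperatures
print(is_permitted("Ratio", "divide"))     # ratio of two ages
```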
Classification of variable by level of measurement

Variables
  Qualitative (hair color, gender, Economic Status)
    Nominal: hair color, gender
    Ordinal: Economic Status
  Quantitative (family size, yearly rainfall (mm), IQ score)
    Interval: IQ score
    Ratio: family size
Continuous = any value in a range (decimals possible), e.g., length of pauses (156.3 ms)
Discrete = countable whole numbers
Nominal = categories only
Ordinal = ordered but unequal intervals
Interval = equal intervals, no true zero
Ratio = true zero exists (can say "twice as much"), e.g., 200 ms is twice as long as 100 ms

Exercise

• (1) Some of the variables are as follows:

(i) The length of pauses in a sample of conversation, measured in milliseconds;

(ii) the rating of words by informants, on a scale of pleasantness, ranging from 1 to 5 (1 = very unpleasant, 2
= unpleasant, 3 = neither unpleasant nor pleasant, 4 = pleasant, 5 = very pleasant);

(iii) the presence or absence of a finite verb in each clause in a particular text;

(iv) the degree of grammaticality of sentences, on a scale from 0 (absolutely ungrammatical) to 4 (entirely
grammatical);

(v) the number of sentences remembered correctly by informants ten minutes after first hearing them;

(vi) working memory profiles of children with attention deficit hyperactivity disorder (ADHD): Verbal
STM (short-term memory), Visuo-spatial STM, Verbal WM (working memory), Visuo-spatial WM.

• Classify each of the variables as qualitative/discrete/continuous. Also identify the level of measurement
(nominal/ordinal/interval/ratio). Give reasons to support your answers.
Data versus Information

• Data are the foundation of the field of statistics and can be defined as the values assigned
to specific observations or measurements.

• The main examples of data are weights, prices, costs, numbers of items sold, employee
names, product names, addresses, tax codes, registration marks etc.

• The word “data” is the plural of “datum.” Data are the raw material that can be processed by any
computing machine. When data are processed, organized, structured or presented in a
given context so as to make them useful, they are called information.

• Information is data that are transformed into useful facts that can be used for a specific
purpose, such as making a decision.
Descriptive and Inferential Statistics
• Two branches of Statistics: (1) Descriptive statistics and (2) Inferential statistics.

• Descriptive statistics: The methods of statistics that help to describe, show or summarize
data in a meaningful way (patterns might emerge from the data). However, they do not allow
us to make conclusions beyond the data we have analyzed (cannot conclude regarding the
population).
✓ The main focus of descriptive statistics is to summarize and display data.

✓ The elements of descriptive statistics include tables, graphs and numerical summary tools.

• Inferential statistics: This method uses theory of probability to make inferences about the
population, based on a random sample.
✓ It covers a large variety of techniques that allow us to make actual claims about a population
based on sample data.

Descriptive: Summarizes data (e.g., average rainfall in April). Ex: Histogram of student heights.

Inferential: Draws conclusions (e.g., predicting election results from a sample). Ex: Using sample data to
estimate DU’s average IQ.
Parameter: Population characteristic (e.g., mean IQ of all DU students, $\mu$).

Statistic: Sample characteristic (e.g., mean IQ of 400 students, $\bar{X}$).

Terminologies

• Parameter: Any characteristic of the population about which inferences are to be made is
called a parameter. Suppose we have a population of size $N$, and let $X_1, X_2, \ldots, X_N$ be the
observations corresponding to the variable of interest. Then the population mean $\mu$ is
defined as $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$. The population mean $\mu$ is a parameter.

• Statistic: Any characteristic of a sample is usually known as a statistic. Suppose we have a
sample of size $n$ from a population of size $N$, and let $X_1, X_2, \ldots, X_n$ be the sampled observations
corresponding to the variable of interest. Then the sample mean, denoted by $\bar{X}$, is defined as
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. The sample mean ($\bar{X}$) is a statistic.
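
The difference between a parameter and a statistic can be illustrated with a small simulation. Everything below is a sketch under stated assumptions: the population of 20 values is hypothetical (drawn uniformly between 555 and 570), and the sample size of 10 is arbitrary. The population mean is a single fixed number, while the sample mean changes from one sample to the next.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

N = 20  # population size (hypothetical)
n = 10  # sample size (hypothetical)

# Hypothetical population values, illustrative only.
population = [round(random.uniform(555, 570), 1) for _ in range(N)]

# Parameter: one fixed value for this population.
mu = statistics.mean(population)

# Statistic: recomputed for each sample, so it varies between samples.
sample_means = []
for _ in range(5):
    sample = random.sample(population, n)
    sample_means.append(statistics.mean(sample))

print(f"population mean (parameter) = {mu:.2f}")
print("sample means (statistic):", [f"{m:.2f}" for m in sample_means])
```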
Estimator: Formula (e.g., sample mean $\bar{X}$).

Estimate: Calculated value (e.g., $\bar{X} = 563.1$ ms from sample data).

Terminologies

• Estimator: A statistic which is used to estimate the value of the unknown parameter is known as an
estimator of that parameter. For example: 𝑋ത is an estimator of 𝜇.

✓ An estimator is a random variable which takes different values from sample to sample.

• Estimate: An estimate is a numerical value of the estimator obtained from a particular sample.

✓ Suppose there are 20 students in a population and we want to investigate the mean pause
length (in milliseconds) of all students by taking a sample of size 10. Let $X_i$ be the pause length
(in milliseconds) of the $i$th student. Then (i) the parameter, $\mu = \frac{\sum_{i=1}^{20} X_i}{20}$ = mean pause length of all 20
students (which is unknown to us. Why?). (ii) The functions: sample mean, $\bar{X} = \frac{\sum_{i=1}^{10} X_i}{10}$; sample
variance, $S^2 = \frac{1}{9}\sum_{i=1}^{10} (X_i - \bar{X})^2$, etc. are statistics. (iii) The statistic $\bar{X} = \frac{\sum_{i=1}^{10} X_i}{10}$ is an estimator for
$\mu$. Suppose we have selected a sample of 10 students and their pause lengths (in
milliseconds) are 564.6, 562.6, 565.4, 564.4, 561.5, 561.9, 565.6, 563.3, 559.3, and 562.7. (iv) Then
$\bar{X} \approx 563.1$ is an estimate of $\mu$.
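
The estimate can be reproduced directly from the ten sampled pause lengths. The sketch below applies the sample mean and the sample variance (divisor 9) to the data in the example. The variable names are illustrative.

```python
import statistics

# Pause lengths (ms) of the 10 sampled students, from the example.
pauses = [564.6, 562.6, 565.4, 564.4, 561.5, 561.9, 565.6, 563.3, 559.3, 562.7]

n = len(pauses)

# Estimator (formula) applied to this sample gives the estimate (value).
x_bar = sum(pauses) / n

# Sample variance with divisor n - 1 = 9.
s_squared = sum((x - x_bar) ** 2 for x in pauses) / (n - 1)

print(f"sample mean     = {x_bar:.2f} ms")
print(f"sample variance = {s_squared:.3f} ms^2")

# Cross-check against the standard library implementations.
assert abs(x_bar - statistics.mean(pauses)) < 1e-9
assert abs(s_squared - statistics.variance(pauses)) < 1e-9
```

The sample mean is 563.13 ms, which rounds to 563.1 ms.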
Exercise

• (1) 500 students from the University of Dhaka were selected to estimate the proportion of
students with (any) drug addiction. The sample proportion was 0.12. In this context, identify
the (i) population (ii) sample (iii) parameter (iv) statistic (v) estimator and (vi) estimate.

• (2) 100 sentences of a novel (say, Iris Murdoch’s The Bell, Penguin edition, 1962) were
selected to estimate the mean length of sentences of that novel. The sample mean was 8.7.
In this context, identify the (i) population (ii) sample (iii) parameter (iv) statistic (v) estimator
and (vi) estimate.
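
For exercise (1), the sample proportion can be sketched as a function. The function is the formula and the returned number is the value for one particular sample. The count of 60 is an assumption implied by 0.12 × 500 and is not stated in the exercise. The name `sample_proportion` is illustrative.

```python
def sample_proportion(successes, n):
    """Proportion of sampled units that have the characteristic."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return successes / n

n = 500     # sample size from the exercise
count = 60  # assumed count, implied by 0.12 * 500

# Applying the formula to this particular sample gives one number.
p_hat = sample_proportion(count, n)
print(p_hat)
```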
Variable Classification

Variable                                Type                      Level of Measurement  Reason
(i) Length of pauses in milliseconds    Quantitative, Continuous  Ratio                 Measured in exact numbers with a true zero (0 ms = no pause)
(ii) Pleasantness rating (1-5 scale)    Qualitative               Ordinal               Ordered categories but unequal intervals between ratings
(iii) Presence/absence of finite verb   Qualitative               Nominal               Binary classification (yes/no) with no order
(iv) Grammaticality (0-4 scale)         Qualitative               Ordinal               Ordered categories but intervals between levels aren't equal
(v) Number of sentences remembered      Quantitative, Discrete    Ratio                 Whole numbers only (0,1,2,...) with true zero
(vi) ADHD working memory profiles       Qualitative               Nominal               Categories (Verbal STM, etc.) with no inherent order

Drug addiction study:

Population: All DU students.

Sample: 500 students.

Parameter: True proportion of addicted students (unknown).

Statistic: Sample proportion (0.12).

Estimator: Sample proportion formula.

Estimate: 0.12.

Sentence length study:

Population: All sentences in The Bell.

Sample: 100 sentences.

Parameter: True mean length (unknown).

Statistic: Sample mean (8.7).

Estimator: Sample mean formula.

Estimate: 8.7.

Thank You