NOTES STATISTICS Final
Week 1 – Lecture 1
MEASUREMENT
Measurement is the task of gathering information that characterises or represents a social phenomenon; for example:
- People’s opinion on gun control
- An individual’s income
- The severity of unemployment in a country
- The typical wages offered in a given industry
Before we can measure, we must determine the unit of analysis: the type of thing which we are collecting information about;
common units of analysis are:
- People (the most common)
- Organisations
- Countries
- Schools
- Industries
- Families
➔ NOMINAL SCALES
Nominal comes from the Latin word for “name” or “label”, and in fact the strategy behind nominal scales is to assign labels
to different cases; for example:
- Measuring gender: male, female, …
- Country of residence: US, Japan, Italy, …
- Religion: Protestant, Catholic, Atheist, …
One problem arises if a person, when asked to report her religion, replies that she is both
Protestant and Jewish; what do you do? There are two options:
1) You design a better survey that can cope with this, like adding a category for “Jewish/Protestant” or “multiple
religions”
2) Or, you destroy information by forcing the respondent to choose
➔ ORDINAL SCALES
Ordinal scales are similar to nominal, but, in addition to putting people in groups, those groups are ordered; for example:
- Lower class, middle class, upper class
- Elementary school, middle school, high school
Ordinal scales do not specify the distance between categories: in the case of department rankings, we do not know how big
the difference is between ranks 1 and 2, or 20 and 21 (these differences may be small or large); that is why ordinal scales are
distinct from nominal: nominal values cannot be meaningfully ordered (you can test = on nominal values, but > and < are meaningless)
NB nominal and ordinal variables are called qualitative variables because we are measuring attributes
➔ INTERVAL SCALES
Interval measures (also called ratio scales) are:
A) Homogeneous
B) Ordered
C) Measured in comparable units
For example, we can mention the budget of a university, the number of children in a household, a worker’s annual income.
Data are always an approximation, and statistics is just a way to summarise the results of scientific experiments within the
scientific method. Many now think that the data are the experiment, but in the Galilean approach the data are just an
approximation, and depend on the way you chose to collect them and on the tools used for the measurement.
Overall, poor measurement results in incorrect conclusions, regardless of how perfect the statistical analysis you carry out
may be
Week 1 – Lecture 2
VALIDITY
Validity is the degree to which a measurement captures what it is intended to capture; if validity is very poor, measurements become
meaningless, but we have to remember that validity is not an “all or nothing” thing.
For example, does measuring a person’s wealth from their hourly wage have validity problems? Yes, because retired
people might be wealthy but have no income or wages; a more valid measure would be the total value of all their assets.
RELIABILITY
Reliability is the extent to which a measure produces consistent results; if reliability is poor, measures are meaningless.
For example, measuring overall happiness in life with the question “how happy are you right now?” has potential reliability
issues, since mood varies a lot from moment to moment and answers may not reflect true overall happiness in life (we are
constantly influenced by the weather, the day of the week, the time, and many other circumstances). The way to solve this is to
find less time-sensitive measures.
There are two main methods of collecting data:
1) Sample survey: since you cannot ask the whole population the survey question (especially at the same time), one
approach is to sample people from a given population and interview them. Here, the issue is how to find a representative
sample of the population
2) Experiment: it consists in comparing the responses of subjects under different conditions, with subjects assigned to the
conditions; the great advantage of this method is that you can control the conditions of your subjects, and also
purposely alter them
Randomisation is the mechanism for achieving reliable data by reducing potential bias; in a simple random sample,
each possible sample of size n (the sample size) has the same chance of being selected. Basically, you cannot control which units are
eventually selected, even if it is important to mention that not all samples are truly random. The simple random sample is an
example of a probability sampling method, because we can specify the probability that any particular sample will be selected.
To implement random sampling, we can use random number tables or statistical software that generates random numbers;
nevertheless, the sampling frame, i.e. a listing of all subjects in a population, must exist to implement simple random sampling (a minimal sketch follows the list below)
- Other probability sampling methods include systematic, stratified, and cluster random sampling.
- For nonprobability sampling, we cannot specify probabilities for the possible samples; inferences based on them may be
highly unreliable. Example: volunteer samples, such as polls on the Internet, are often severely biased (but sometimes
volunteer samples are all we can get, as in most medical studies)
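A minimal sketch in Python (the names in the frame are invented; random.sample draws without replacement, so every subset of size n is equally likely):

    import random

    # Hypothetical sampling frame: the listing of all subjects in the population
    frame = ["Anna", "Ben", "Carla", "Dario", "Elena", "Fatima", "Gino", "Hana"]

    random.seed(42)                   # fixed only to make the example reproducible
    sample = random.sample(frame, 3)  # a simple random sample of size n = 3
    print(sample)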
The sampling error of a statistic equals the error that occurs when we use a sample statistic to predict the value of a population
parameter. As a matter of fact, being unable to interview an entire population, we need a scaled-down version of it with the same
features: a representative sample, so that the conclusion we reach will always be an estimate; its precision includes the
error, which depends on how you select the sample. Randomisation protects against bias, with sampling error tending to fluctuate
around 0 with predictable size; there are methods that let us predict its magnitude (the margin of error), e.g., in estimating a
percentage, no more than about ±3% when n is about 1000.
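As a rough check of the ±3% figure, under the usual 95% approximation for a proportion (margin ≈ 1.96 · sqrt(p(1 − p)/n), largest at p = 0.5), a quick Python calculation gives about 0.031:

    import math

    n = 1000
    p = 0.5                                    # worst case for the margin
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    print(round(margin, 3))                    # 0.031, i.e. about +/- 3%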
Other factors besides sampling error can cause results to vary from sample to sample:
- Sampling bias (e.g., not using probability sampling)
- Response bias (e.g., poorly worded questions and misunderstandings; as the NY Times poll on the gasoline tax shows, the results
of surveys can depend greatly on question wording)
- Non-response bias (e.g., undercoverage and missing data: the people who do not answer the questions are typically not random
but form a specific group; this is what happens with social media and elderly people, who generally avoid them, so
that a given demographic group is ultimately missing)
We end up with sets of measurements on groups of cases; data are often organised in spreadsheets:
- Rows = all measurements on each case
- Columns = sets of measurements, or “variables”
Another thing we can do is list variables, i.e. list the values of a variable for all cases (by looking at the raw data); this
makes sense only for very small samples and for certain kinds of variables. To perform this, we have the list
command in STATA (a sketch follows the points below).
o Advantages: it is easy and gives a rich description of the dataset (you can see every case)
o Disadvantages: it is not workable for large datasets and if data involves complex coding you might not be able
to interpret it visually
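The same idea as STATA’s list, sketched in Python (the tiny dataset below is invented): each row is a case, each key a variable, and listing simply prints every case.

    # Invented dataset: rows are cases, keys are variables
    data = [
        {"id": 1, "age": 34, "income": 41000},
        {"id": 2, "age": 27, "income": 29500},
        {"id": 3, "age": 58, "income": 61200},
    ]

    for case in data:   # print all measurements on each case
        print(case)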
FREQUENCY LISTS
Frequency lists are tables that show how many cases take on a particular value; they are the simplest descriptive tool, also
called “frequencies” or “frequency distributions”. For example, in the case of a congressional vote you might count the number
of “Yes” or “No” votes with the STATA command tabulate (tab); see the sketch after the points below.
o Advantages: frequency lists are useful for large datasets and provide for a rich description of data
o Disadvantages: unlike a list, you can’t see which case is which or compare with other variables. They are best for
nominal and some ordinal variables, while they are not so useful if all values are unique (rank orderings, many
continuous variables, …), especially if you do not envision bins.
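A frequency list in the spirit of tabulate can be sketched with Python’s collections.Counter (the votes below are invented):

    from collections import Counter

    votes = ["Yes", "No", "Yes", "Yes", "No", "Yes", "No"]
    freq = Counter(votes)             # how many cases take on each value
    for value, count in freq.items():
        print(value, count)           # Yes 4 / No 3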
VISUAL REPRESENTATIONS
Bar charts are essentially a visual representation of a frequency list: the height of each bar represents the number of cases. They are used
for nominal and some ordinal variables only because, again, rank orderings and continuous measures do not work.
As regards graphing continuous measures, the issue is that continuous variables have an infinite number
of possible unique values; in a bar chart, you would have many bars of height 1 (and what would you do with zeros?). One
possible solution is to use grouped data (bins), so that sets of similar values are lumped together into constant
intervals; nevertheless, information is destroyed in the process. The result is a histogram, in which the height of each bar represents
the number of cases within a given range of values. If you have, for example, people grouped by age, you might have 5-year
intervals with the corresponding bars, but you might also group people within a 1-year interval or a 50-year interval:
- A small interval means more bars in the histogram —> greater detail, but you might have difficulties in interpreting the
data
- A wide interval means fewer bars in the histogram —> greater simplification of the data
NB Histograms look very different depending on how wide you set the intervals: you should try different intervals and not over-
interpret a crude histogram.
If bins are not equally wide, you cannot compare raw frequencies; instead you use densities, computed as the ratio
between the frequency of a bin and its width (as sketched below).
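A sketch of the frequency-density computation with numpy (ages and bin edges invented; note how equal frequencies give different densities once bin widths differ):

    import numpy as np

    ages = np.array([3, 7, 12, 15, 18, 22, 24, 25, 31, 40, 55, 62])
    edges = np.array([0, 18, 30, 65])   # unequal bins: widths 18, 12, 35

    counts, _ = np.histogram(ages, bins=edges)
    density = counts / np.diff(edges)   # frequency density = frequency / width
    print(counts)                       # [4 4 4]  same frequency in each bin
    print(density)                      # [0.22 0.33 0.11] (approximately)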
MEASURES OF CENTRAL TENDENCY
Often, it is important to assess the typical values of a variable; for example, how much money the typical family earns, or the age
of the typical person in the dataset. The solution is to conduct calculations to determine which values are typical,
but this is not as easy as it sounds; moreover, there is no unique way of measuring what the centre is, so different
results are possible.
Lastly, the typical value that results does nothing to represent the variability in the sample.
MODE
The mode is the value representing the largest number of cases (modal value).
This measure is useful for nominal and ordinal values, while it is only useful for
continuous variables if you have grouped data into a histogram (otherwise, all
values may very likely be unique); nevertheless, the mode is not very helpful (or
even misleading) in certain circumstances (if there are many peaks, or a single
unusual one, or if the variable is distributed quite evenly). A sketch follows.
When a variable is grouped into classes, the mode is the central (average) value of the class associated with the highest frequency
density d_i = n_i / l_i (the class frequency n_i divided by the class width l_i)
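A minimal sketch of the mode with Python’s statistics module (the answers are invented); multimode is useful when there are several peaks:

    from statistics import mode, multimode

    religions = ["Catholic", "Protestant", "Catholic", "Atheist", "Catholic"]
    print(mode(religions))       # Catholic, the value with the most cases
    print(multimode(religions))  # every modal value, in case of many peaks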
Week 2 – Lecture 1
MEDIAN
The median of a variable is the modality that occupies the central position in the ordered distribution of the variable (= the value of
the middle case, since an equal number of cases fall higher and lower). It can be computed for ordinal and
continuous variables, but cannot be calculated for nominal variables, because they do not naturally possess an order; it is more
informative than the mode.
o Advantages: it is not influenced by unusual peaks (outliers) + it is useful even in very even distributions
o Disadvantages: it is not useful for data spread in two distinct “clumps”
- If the number of statistical units n is odd, there is only one central position, P = (n + 1) / 2.
- If the number of statistical units n is even, there are two middle positions: n / 2 and n / 2 + 1. If the units in
these two positions have the same modality, that modality is the median; if they have different modalities, the median is
indeterminate (if the variable is ordinal) or the average of the two modalities (if the variable is quantitative).
The calculation of the median for grouped data follows this procedure:
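The procedure itself is not reproduced in these notes; a standard interpolation rule for grouped data, given here as an assumption about what was intended, locates the class containing position n/2 and interpolates linearly within it:

    def grouped_median(classes):
        """classes: ordered list of (lower bound, upper bound, frequency).
        Median = L + ((n/2 - F) / f) * width, where L, f and width belong to
        the median class and F is the cumulative frequency before it."""
        n = sum(f for _, _, f in classes)
        cumulative = 0
        for lower, upper, f in classes:
            if cumulative + f >= n / 2:
                return lower + ((n / 2 - cumulative) / f) * (upper - lower)
            cumulative += f

    # Invented example: 0-20 (5 cases), 20-40 (10 cases), 40-60 (5 cases)
    print(grouped_median([(0, 20, 5), (20, 40, 10), (40, 60, 5)]))  # 30.0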
QUANTILES
Quantile is the general term, but there exist percentiles, quartiles, deciles, …; using quantiles means dividing cases up into a fixed
number of equal “chunks”:
- 100 chunks = percentiles
- 10 chunks = deciles
- 5 chunks = quintiles
- 4 chunks = quartiles
Identifying the quartile of a case is a powerful way of describing where the case falls relative to others; in this example, a person with 200 CDs
is in the top quartile, meaning that 75% of people have fewer. We must not forget that quantiles are relative, so a person of average
height in the US would be in the bottom quartile of a dataset of basketball players, for example.
Moreover, the upper and lower bounds of quantiles are useful reference points that describe your data:
– The border between the 2nd and 3rd quartiles is the median, the middle of your data
– The border of the top quartile (178 CDs) gives you a sense of how many are owned by people toward the upper end of the
distribution
Sometimes people report the “interquartile range”: the range of values that contains the middle 50% of cases.
The closest value to the one you have calculated is the corresponding mark in the ordered sequence.
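A sketch with Python’s statistics.quantiles (the CD counts are invented; note that this function interpolates between marks rather than taking the nearest one):

    from statistics import quantiles

    cds = [0, 5, 12, 20, 33, 47, 60, 85, 120, 178, 200, 250]
    q1, q2, q3 = quantiles(cds, n=4)  # the three cut points between quartiles
    print(q1, q2, q3)                 # q2 is the median; above q3 = top quartile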
Quartiles are portrayed graphically by box plots; the example shown here was weekly TV
watching for n = 60 from the student survey data file, with 3 outliers.
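A box plot of this kind can be sketched with matplotlib (the hours below are invented, not the actual survey data):

    import matplotlib.pyplot as plt

    hours = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 20, 25, 30]  # 3 high outliers
    plt.boxplot(hours)   # box spans the quartiles; extreme points are flagged
    plt.ylabel("Weekly TV watching (hours)")
    plt.show()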
MEAN – AVERAGE
It is the most well-known way of assessing the middle of a distribution; it is calculated by adding the values of all cases, then dividing
by the total number of cases. Its advantages are that it is applicable to continuous measures and it is not overly influenced
by any single peak; as a disadvantage, it can be influenced by extreme values (outliers)
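The calculation in Python, with an invented income list that also illustrates the outlier problem (the median barely moves while the mean jumps):

    from statistics import mean, median

    incomes = [28000, 31000, 35000, 40000, 45000]
    print(mean(incomes))       # 35800

    incomes.append(2_000_000)  # one extreme value (an outlier)
    print(mean(incomes))       # about 363167: pulled up by the outlier
    print(median(incomes))     # 37500: barely affected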