Unit III Dev Notes
Introduction to Single variable: Distribution Variables - Numerical Summaries of Level and Spread -
Scaling and Standardising – Inequality - Smoothing Time Series.
How many households have no access to a car? What is a typical household income in Britain?
Which country in Europe has the longest working hours? To answer these kinds of questions we
need to collect information from a large number of people, and we need to ensure that the people
questioned are broadly representative of the population we are interested in. Conducting large-scale
surveys is a time-consuming and costly business. However, increasingly information or data from
survey research in the social sciences are available free of charge to researchers and students. The
development of the worldwide web and the ubiquity and power of computers makes accessing
these types of data quick and easy.
The aim is to explore data. We can use the 'Statistical Package for the Social Sciences' (SPSS)
package to start analysing data and answering the questions posed above.
Preliminaries
Two organizing concepts have become the basis of the language of data analysis: cases and
variables. The cases are the basic units of analysis, the things about which information is collected.
A variable is any characteristic or feature of the cases about which information is recorded; the word variable expresses the fact that this feature varies across different cases.
We will look at some useful techniques for displaying information about the values of single
variables, and will also introduce the differences between interval level and ordinal level variables.
The General Household Survey (GHS) is a multipurpose survey carried out by the Social Survey Division of the Office for National
Statistics (ONS). The main aim of the survey is to collect data on a range of core topics, covering
household, family and individual information. Government departments and other organizations
use this information for planning, policy and monitoring purposes, and to present a picture of
households, family and people in Great Britain.
Column 5 indicates the social class of the individual, based on their occupation.
Bar charts and pie charts can be an effective medium of communication if they are well drawn.
Histograms
Charts that are somewhat similar to bar charts can be used to display interval level variables
grouped into categories and these are called histograms. They are constructed in exactly the same
way as bar charts except that the ordering of the categories is fixed, and care has to be taken to
show exactly how the data were grouped.
II) NUMERICAL SUMMARIES OF LEVEL AND SPREAD
In this, how simple descriptive statistics can be used to provide numerical summaries of level and
spread.
Summaries of level
The level expresses where on the scale of numbers found in the dataset the distribution is
concentrated. In the previous example, it expresses where, on a scale running from 1 hour per week
to 100 hours per week, the distribution's centre point lies. To summarize these values, one number
must be found to express the typical hours worked by men, for example. The problem is: how do
we define 'typical'? There are many possible answers. The value half-way between the extremes
might be chosen, or the single most common number of hours worked, or a summary of the
middle portion of the distribution.
Residuals
A residual can be defined as the difference between an observed data value and the typical, or
average, value.
For example if we had chosen 40 hours a week as the typical level of men's working hours, using
data from the General Household Survey in 2005, then a man who was recorded in the survey as
working 45 hours a week would have a residual of 5 hours. Another way of expressing this is to say
that the residual is the observed data value minus the predicted value and in this case 45-40 = 5. In
this example the process of calculating residuals is a way of recasting the hours worked by each man
in terms of distances from typical male working hours in the sample. Any data value such as a
measurement of hours worked or income earned can be thought of as being composed of two
components: a fitted part and a residual part. This can be expressed as an equation:
data = fitted value + residual
The median
The value of the case at the middle of an ordered distribution would seem to have an intuitive
claim to typicality. Finding such a number is easy when there are very few cases. In the example of
hours worked by a small random sample of 15 men (figure 2.3A), the value of 48 hours per week fits
the bill. There are six men who work fewer hours and seven men who work more hours while two
men work exactly 48 hours per week. Similarly, in the female data, the value of the middle case is 37
hours. The data value that meets this criterion is called the median.
The median is easy to find when, as here, there are an odd number of data points. When
the number of data points is even, it is an interval, not one case, which splits the distribution into
two. The value of the median is conventionally taken to be half-way between the two middle
cases. Thus
the median in a dataset with fifty data points would be half-way between the values of the 25th and
26th data points.
Put formally, with N data points, the median M is the value at depth (N + 1)/2. It is not the value at
depth N/2. With twenty data points, for example, the tenth case has nine points which lie below it
and ten above.
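A minimal Python sketch of this depth rule follows; the hours figures are invented, arranged to match the description of the small sample of 15 men above.
def median_by_depth(values):
    ordered = sorted(values)
    n = len(ordered)
    depth = (n + 1) / 2                      # the median sits at depth (N + 1)/2
    lower = ordered[int(depth) - 1]          # value at that depth, counting from the bottom
    upper = ordered[-int(depth)]             # the same depth counted from the top
    return (lower + upper) / 2               # half-way between the two middle cases when N is even

hours = [38, 40, 42, 44, 45, 46, 48, 48, 50, 52, 55, 58, 60, 65, 70]   # invented sample of 15 men
print(median_by_depth(hours))                # 48.0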
The arithmetic mean
Another commonly used measure of the centre of a distribution is the arithmetic mean. To
calculate it, first all of the values are summed, and then the total is divided by the number of data
points. In more mathematical terms:
Ȳ = ΣYi / N   (where Ȳ, 'Y-bar', denotes the mean)
The symbol Y is conventionally used to refer to an actual variable. The subscript i is an index to tell
us which case is being referred to. So, in this case, Yi refers to all the values of the hours
variable. The Greek letter Σ, pronounced 'sigma', is the mathematician's way of saying 'the sum of'.
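In code, the same calculation is a single line; a short sketch using the invented sample from above:
hours = [38, 40, 42, 44, 45, 46, 48, 48, 50, 52, 55, 58, 60, 65, 70]   # same invented sample
y_bar = sum(hours) / len(hours)              # the sum of the Yi divided by N
print(round(y_bar, 2))                       # 50.73 for this sample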
Summaries of spread
The second feature of a distribution visible in a histogram is the degree of variation or spread in the
variable.
The midspread
The range of the middle 50 per cent of the distribution is a commonly used measure of spread
because it concentrates on the middle cases. It is quite stable from sample to sample. The points
which divide the distribution into quarters are called the quartiles (or sometimes 'hinges' or
'fourths'). The lower quartile is usually denoted QL and the upper quartile QU. (The middle quartile
is of course the median.) The distance between QL and QU is called the midspread (sometimes the
'interquartile range'), or dQ for short.
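Exact quartile conventions differ from textbook to textbook; the sketch below uses the 'hinge' depth rule of exploratory data analysis, under which the hinge depth is (floor of the median depth + 1)/2, counted in from each end.
import math

def value_at_depth(ordered, depth):
    # Average the two values straddling a half-integer depth (counting from 1).
    lower = ordered[math.floor(depth) - 1]
    upper = ordered[math.ceil(depth) - 1]
    return (lower + upper) / 2

def midspread(values):
    ordered = sorted(values)
    median_depth = (len(ordered) + 1) / 2
    hinge_depth = (math.floor(median_depth) + 1) / 2      # depth of the quartiles ('hinges')
    q_lower = value_at_depth(ordered, hinge_depth)
    q_upper = value_at_depth(ordered[::-1], hinge_depth)  # same depth counted from the top
    return q_upper - q_lower

hours = [38, 40, 42, 44, 45, 46, 48, 48, 50, 52, 55, 58, 60, 65, 70]   # invented sample
print(midspread(hours))                                   # 12.0 on these figures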
The standard deviation
The standard deviation essentially calculates a typical value of the distances of the data values from
the mean. It is conventionally denoted s, and defined as:
s = √[ Σ(Yi − Ȳ)² / N ]
The deviations from the mean are squared, summed and divided by the sample size, and then the
square root is taken to return to the original units. The order in which the calculations are
performed is very important. As always, calculations within brackets are performed first, then
multiplication and division, then addition (including summation) and subtraction. Without the
square root, the measure is called the variance, s2.
The layout for a worksheet to calculate the standard deviation of the hours worked by this small
sample of men is shown in figure 2.4.
The original data values are written in the first column, and the sum and mean calculated at the
bottom. The residuals are calculated and displayed in column 2, and their squared values are
placed in column 3. The sum of these squared values is shown at the foot of column 3, and from it
the standard deviation is calculated.
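A short Python sketch of the same worksheet layout, using the invented sample from earlier; as in the definition above, the sum of squares is divided by the sample size N.
import math

hours = [38, 40, 42, 44, 45, 46, 48, 48, 50, 52, 55, 58, 60, 65, 70]   # column 1: invented data
y_bar = sum(hours) / len(hours)                   # mean shown at the foot of column 1
residuals = [y - y_bar for y in hours]            # column 2: residuals from the mean
squares = [r ** 2 for r in residuals]             # column 3: squared residuals
variance = sum(squares) / len(hours)              # sum of squares divided by the sample size
s = math.sqrt(variance)                           # square root returns to the original units
print(round(s, 2))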
We must recognize that errors of all sorts creep into the very best data sources. The values found at
the extremes of a distribution are more likely to have suffered error than the values at the centre.
The word 'data' must be treated with caution. Literally translated, it means 'things that are
given'.
There are often problems with using official statistics, especially those which are the by-products
of some administrative process like, for example, reporting deaths to the Registrar-General or
police forces recording reported crimes. Data analysts have to learn to be critical of the
measures available to them, but in a constructive manner. As well as asking 'Are there any errors
in this measure?' we also have to ask 'Is there anything better available?' and, if not, 'How can I
improve what I've got?'
Improvements can often be made to the material at hand without resorting to the expense of
collecting new data.
We must feel entirely free to rework the numbers in a variety of ways to achieve the following
goals:
• to make them more amenable to analysis;
• to promote comparability;
• to focus attention on differences.
Consider various manipulations that can be applied to the data to achieve the above goals:
The change made to the data by adding or subtracting a constant is fairly trivial. Only
the level is affected; spread, shape and outliers remain unaltered. The reason for
doing it is usually to force the eye to make a division above and below a particular point.
A negative sign would be attached to all those incomes which were below the median in
the example above. However, we sometimes add or subtract a constant to bring the
data within a particular range.
The overall shape of the distributions in figures 3.1A and 3.1B is the same. The data points are all
in the same order, and the relative distances between them have not been altered apart from the
effects of rounding. The whole distribution has simply been scaled by a constant factor.
In SPSS it is very straightforward to multiply or divide a set of data by a constant value. For example,
using syntax, the command to create the variable drday ‘Average daily alcohol consumption’ from
the variable drating ‘Average weekly alcohol consumption’ is as follows:
COMPUTE DRDAY = DRATING/7.
Alternatively, to create a new variable ‘NEWVAR’ by multiplying an existing variable ‘OLDVAR’ by
seven the syntax would be:
COMPUTE NEWVAR = OLDVAR*7.
The ‘Compute’ command can also be used to add or subtract a
constant, for example:
COMPUTE NEWVAR = OLDVAR + 100.
COMPUTE NEWVAR = OLDVAR - 60.
The value of multiplying or dividing by a constant is often to promote comparability between
datasets where the absolute scale values are different. For example, one way to compare the cost
of a loaf of bread in Britain and the United States is to express the British price in dollars.
Percentages are the result of dividing frequencies by one particular constant — the total number
of cases.
A variable which has been standardized in this way (by subtracting a measure of level and dividing by
a measure of spread) is forced to have a mean or median of 0 and a standard deviation or midspread of 1.
Two different uses of variable standardization are found in social science literature. The first is in
building causal models, where it is convenient to be able to compare the effect that two different
variables have on a third on the same scale. The second use is more immediately intelligible:
standardized variables are useful in the process of building complex measures based on more than
one indicator.
In order to illustrate this, we will use some data drawn from the National Child Development Study
(NCDS). This is a longitudinal survey of all children born in a single week of 1958. There is a great
deal of information about children’s education in this survey. Information was sought from the
children’s schools about their performance at state examinations, but the researchers also decided
to administer their own tests of attainment.
Rather than attempt to assess knowledge and abilities across the whole range of school subjects,
the researchers narrowed their concern down to verbal and mathematical abilities. Each child was
given a reading comprehension test which was constructed by the National Foundation for
Educational Research for use in the study, and a test of mathematics devised at the University of
Manchester. The two tests were administered at the child’s school and had very different
methods of scoring. As a result they differed in both level and spread.
As can be seen from the descriptive statistics in figure 3.4, the sixteen-year-olds in the National Child
Development Study apparently found the mathematics test rather more difficult than the reading
comprehension test. The reading comprehension was scored out of a total of 35 and sixteen-year-
olds gained a mean score of 25.37, whereas the mathematics test was scored out of a possible
maximum of 31, but the 16-year-olds only gained a mean score of 12.75.
The first two columns of figure 3.5 show the scores obtained on the reading and mathematics test by
fifteen respondents in this study. There is nothing inherently interesting or intelligible about the raw
numbers. The first score of 31 for the reading test can only be assessed in comparison with what
other children obtained. Both tests can be thought of as indicators of the child’s general attainment
at school. It might be useful to try to turn them into a single measure of that construct.
In order to create such a summary measure of attainment at age 16, we want to add the two scores
together. But this cannot be done as they stand, because as we saw before, the scales of
measurement of these two tests are different. If this is not immediately obvious try the following
thought experiment. A 16-year-old who is average at reading but terrible at mathematics will
perhaps score 25.4 (i.e. the mean score) on the reading comprehension test and 0 on the
mathematics test. If these were summed the total is 25.4. However, a 16-year-old who is average
at mathematics but can’t read is likely to score 12.7 (i.e. the mean score) on the maths score and 0
on the reading comprehension. If these are summed the total would only be 12.7. If the two tests
can be forced to take the same scale, then they can be summed.
This is achieved by standardizing each score. One common way of standardizing is to first subtract
the mean from each data value, and then divide the result by the standard deviation. This process
is summarized by the following formula, where the original variable ‘Y’ becomes the standardized
variable ‘Z’:
Z = (Y − Ȳ) / s
where Ȳ is the mean of Y and s is its standard deviation.
For example, the first respondent's reading score of 31 standardizes to roughly 0.80. The same
individual's mathematics score becomes (17 − 12.75)/7, or 0.61. This first respondent is
therefore above average in both reading and maths. To summarize, we can add these two together
and arrive at a score of 1.41 for attainment in general.
Similar calculations for the whole batch are shown in columns 3 and 4 of figure 3.5. We can see that
the sixth person in this extract of data is above average in reading but slightly below average (by a
quarter of a standard deviation) in mathematics. It should also be noted that any individual scoring
close to the mean for both their reading comprehension and their mathematics test will have a total
score close to zero. For example, the tenth case in figure 3.5 has a total score of -0.02.
The final column of figure 3.5 now gives a set of summary scores of school attainment, created by
standardizing two component scores and summing them, so attainment in reading and maths
have effectively been given equal weight.
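A minimal Python sketch of the two-step recipe, using the first respondent's scores: the means come from figure 3.4, but the standard deviations of 7 are assumptions made here purely for illustration (the text only shows the mathematics score being divided by 7).
def standardize(score, mean, sd):
    return (score - mean) / sd                    # subtract the mean, divide by the standard deviation

# Means as quoted in figure 3.4; the standard deviations of 7 are illustrative assumptions.
z_reading = standardize(31, 25.37, 7)
z_maths = standardize(17, 12.75, 7)
print(round(z_reading + z_maths, 2))              # roughly 1.41, the first respondent's summary score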
It is very straightforward to create standardized variables using SPSS: by using the Descriptives
command, the SPSS package will automatically save a standardized version of any variable.
First select the menus Analyze, then Descriptive Statistics, then Descriptives.
The next stage is to select the variables that you wish to standardize, in this case N2928 and N2930,
and check the box next to ‘Save standardized values as variables.’ The SPSS package will then
automatically save new standardized variables with the prefix Z. In this example, two new variables
ZN2928 and ZN2930 are created.
Standardizing the variables was a necessary, but not a sufficient condition for creating a simple
summary score. It is also important to have confidence that the components are both valid
indicators of the underlying construct of interest.
However, many distributions do have a characteristic shape — a lump in the middle and tails
straggling out at both ends. How convenient it would be if there was an easy way to define a more
complex shape like this and to know what proportion of the distribution would lie above and below
different levels.
One such shape, investigated in the early nineteenth century by the German mathematician and
astronomer, Gauss, and therefore referred to as the Gaussian distribution, is commonly used. It is
possible to define a symmetrical, bell-shaped curve which looks like those in figure 3.8, and which
contains fixed proportions of the distribution at different distances from the centre. The two curves
in figure 3.8 look different — (a) has a smaller spread than (b) — but in fact they only differ by a
scaling factor.
Any Gaussian distribution has a very useful property: it can be defined uniquely by its mean and
standard deviation. Given these two pieces of information, the exact shape of the curve can be
reconstructed, and the proportion of the area under the curve falling between various points can
be calculated.
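For example, the proportion of a Gaussian distribution lying within a given distance of the mean can be computed directly from those two numbers; a small sketch using Python's standard library, with the standard Gaussian (mean 0, standard deviation 1) chosen purely as an illustration:
from statistics import NormalDist

dist = NormalDist(mu=0, sigma=1)                  # any mean and standard deviation fix the curve
within_one_sd = dist.cdf(1) - dist.cdf(-1)        # area between one sd below and one sd above the mean
print(round(within_one_sd, 3))                    # about 0.683 of the whole distribution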
This bell-shaped curve is often called ‘the normal distribution’. Its discovery was associated with
the observation of errors of measurement. If sufficient repeated measurements were made of the
same object, it was discovered that most of them centred around one value (assumed to be the true
measurement), quite a few were fairly near the centre, and measurements fairly wide of the mark
were unusual but did occur. The distribution of these errors of measurement often approximated to
the bell-shape in figure 3.8.
One approach would be to treat the distribution of incomes for each sex in each year as a separate
distribution, and express each of the quartiles relative to the median. The result of doing this is
given in figure 3.14.
IV) INEQUALITY
Prosperity and inequality:
There are a number of reasons why we might want to reduce inequality in society. For example, as
Layard (2005) argues, if we accept that extra income has a bigger impact on increasing the happiness
of the poor than the rich, this means that if some money is transferred from the rich to the poor this
will increase the happiness of the poor more than it diminishes the happiness of the rich. This in turn
suggests that the overall happiness rating of a country will go up if income is distributed more
equally. Of course, as Layard acknowledges, the problem with this argument is that it only works if it
is possible to reduce inequality without raising taxes to such an extent that individuals no longer have
an incentive to strive to make money, so that total income falls as a result of policies aimed at
redistribution. It is clearly important to understand the principal ways of measuring
inequality if we are to monitor the consequences of changing levels of inequality in society. This
chapter will focus on how we can measure inequality in such a way as to make it possible to
compare levels of inequality in different societies and to look at changes in levels of inequality over
time.
Income and Wealth:
Considered at the most abstract level, income and wealth are two different ways of looking at the
same thing. Both concepts try to capture ways in which members of society have different access to
the goods and services that are valued in that society. Wealth is measured simply in pounds, and is a
snapshot of the stock of such valued goods that any person owns, regardless of whether this is
growing or declining. Income is measured in pounds per given period, and gives a moving picture,
telling us about the flow of revenue over time.
For the sake of simplicity, we restrict our focus to the distribution of income. We will look in detail at
the problems of measuring income and then consider some of the distinctive techniques for
describing and summarizing inequality that have evolved in the literature on economic inequality.
There are four major methodological problems encountered when studying the distribution of
income:
1. How should income be defined?
2. What should be the unit of measurement?
3. What should be the time period considered?
4. What sources of data are available?
Definition of income
To say that income is a flow of revenue is fine in theory, but we have to choose between two
approaches to making this operational. One is to follow accounting and tax practices, and make a
clear distinction between income and additions to wealth. With this approach, capital gains in a
given period, even though they might be used in the same way as income, would be excluded from
the definition. This is the approach of the Inland Revenue, which has separate taxes for income and
capital gains. In this context a capital gain is defined as the profit obtained by selling an asset that
has increased in value since it was obtained. However, interestingly, in most cases this definition (for
the purposes of taxation) does not include any profit made when you sell your main home.
The second approach is to treat income as the value of goods and services consumed in a given
period plus net changes in personal wealth during that period. This approach involves constantly
monitoring the value of assets even when they do not come to the market. That is a very hard task.
So, although the second approach is theoretically superior, it is not very practical and the first is
usually adopted.
The definition of income usually only includes money spent on goods and services that are
consumed privately. But many things of great value to different people are organized at a collective
level: health services, education, libraries, parks, museums, even nuclear warheads. The benefits
which accrue from these are not spread evenly across all members of society. If education were not
provided free, only families with children would need to use their money income to buy schooling.
Sources of income are often grouped into three types:
• earned income, from either employment or self-employment;
• unearned income, which accrues from the ownership of investments, property, rents and so on;
• transfer income, that is benefits and pensions transferred on the basis of entitlement, not on the
basis of work or ownership, mainly by the government but occasionally by individuals.
Notice that the key term in the most widely used summary measures of inequality involves a weighted
sum of the data values, where the weight is the unit's rank order in the income distribution.
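One widely used summary with exactly this rank-weighted form is the Gini coefficient; a minimal Python sketch follows, where the income figures are invented and this is only one of several equivalent ways of writing the formula.
def gini(incomes):
    ordered = sorted(incomes)                          # poorest first, so rank order = position
    n = len(ordered)
    weighted = sum(rank * y for rank, y in enumerate(ordered, start=1))
    return (2 * weighted) / (n * sum(ordered)) - (n + 1) / n

print(round(gini([5000, 12000, 18000, 25000, 90000]), 2))   # about 0.49 for these invented incomes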
V) SMOOTHING TIME SERIES
The aim is to introduce a method for presenting time series data that brings out the underlying
major trends and removes any fluctuations that are simply an artefact of the ways the data have
been collected.
The total number of crimes recorded every year from 1965 to 1994, as shown in figure 5.1, is an
example of a time series.
Other examples might be the monthly Retail Price Index over a period of ten years, the monthly
unemployment rate or the quarterly balance of payment figures during the last Conservative
government. These examples all have the same structure: a well-defined quantity is recorded at
successive equally spaced time points over a specific period. But problems can occur when any
one of these features is not met - for example if the recording interval is not equally spaced.
The aim of smoothing
Figure 5.2 was constructed by joining points together with straight lines. Only the points contain
real information of course. The lines merely help the reader to see the points. The result has a
somewhat jagged appearance. The sharp edges do not occur because very sudden changes really
occur in numbers of recorded crimes. They are an artefact of the method of constructing the plot,
and it is justifiable to want to remove them. According to Tukey (1977, p. 205), the value of
smoothing is ‘the clearer view of the general, once it is unencumbered by detail’. The aim of
smoothing is to remove any upward or downward movement in the series that is not part of a
sustained trend.
Sharp variations in a time series can occur for many reasons. Part of the variation across time may
be error. For example, it could be sampling error.
In engineering terms we want to recover the signal from a message by filtering out the noise. The
process of smoothing a time series produces just such a decomposition of the data: what we might
understand in engineering as the signal and the noise appears in data analysis as the smooth and the rough.
Opinion polls
Opinion polls represent only a small fraction of all the social research that is conducted in Britain,
but they have become the public face of social research because they are so heavily reported.
Predicting who is going to win an election makes good newspaper copy. The newspaper industry was
therefore among the first to make use of the development of scientific surveys for measuring
opinion. In all general elections in Britain since the Second World War, polls have been conducted to
estimate the state of the parties at the time, and the number of such polls continues to grow. By-
elections and local elections are now also the subject of such investigations.
Opinion polls in Britain have historically almost always been conducted on quota samples. In such
a sample, the researcher specifies what type of people he or she wants in the sample, within broad
categories (quotas), and it is then left up to the interviewer to find such people to interview. In a
national quota sample, fifty constituencies might be selected at random, and then quotas set within
each constituency on age, sex and employment status. Interviewers would then have to find so
many women, so many unemployed and so many young people, etc. In the better quota samples,
such quotas are interlocked: the interviewer is told how many young housewives, how many male
unemployed and so on to interview. The idea is that when all these quotas are added together, the
researcher will be sure that the national profile on age, sex and employment status will have been
faithfully reproduced.
For example, since 2002, MORI’s ‘headline’ voting intention figure has been calculated by excluding
all those who are not ‘absolutely certain to vote’. This is measured by asking respondents to rate
their certainty to vote on a scale from 1 to 10, where 1 means absolutely certain not to vote and
‘10’ means absolutely certain to vote, and only those rating their likelihood of voting at ‘10’ are
included. Figure 5.5 shows MORI’s data on trends in voting intention leading up to the 2005 General
Election, held on 5 May.
Techniques
To smooth a time series we replace each data value by a smoothed value that is determined by
the value itself and its neighbours. The smoothed value should be close to each of the values which
determine it except those which seem atypical. We therefore want some form of resistant numerical
summary — some local typical value.
This involves two decisions: which neighbouring points are to be considered local and which changes
are atypical? The answers to these questions must depend in part on the particular problem, but this
chapter presents some multipurpose procedures which give generally satisfactory results. These
procedures answer the two questions as follows: take one point either side as local and treat as
real an upward or downward change of direction which is sustained for at least two successive
points.
Summaries of Three
The simplest such resistant average is to replace each data value by the median of three values: the
value itself, and the two values immediately adjacent in time. Consider, again, the percentage of
respondents who intended to vote Labour (column 2 of data in figure 5.5). To smooth this column,
we take the monthly figures in groups of three, and replace the value of the middle month by the
median of all three months.
In March, April and May 2003, the median is 43 per cent, so April’s value is unchanged. In April, May
and June 2003, the median is 41 per cent, so the value for May is altered to 41 as shown in figure
5.7. The process is repeated down the entire column of figures.
Since, for the purpose of this exercise, we are supposing that the January 2003 and April 2005 rates
are unknown, we simply copy on the first and last values, 41 and 37, for February 2003 and March
2005. More sophisticated rules for smoothing these end values are available, but discussion of them
is postponed for the present.
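A minimal Python sketch of this running median of three, with the end values copied on as just described; the short series is invented, standing in for the Labour column of figure 5.5, and repeating the procedure until nothing changes gives the '3R' smooth referred to below.
from statistics import median

def smooth3(series):
    smoothed = list(series)                        # first and last values are simply copied on
    for t in range(1, len(series) - 1):
        smoothed[t] = median(series[t - 1:t + 2])  # median of the value and its two neighbours
    return smoothed

labour = [41, 43, 43, 41, 38, 41, 40, 37]          # invented monthly percentages, not the MORI figures
print(smooth3(labour))                             # [41, 43, 43, 41, 41, 40, 40, 37]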
One other possible method of smoothing would be to use means rather than medians. The result of
using the mean of each triple instead of the median is shown in columns 4 and 5 of figure 5.8.
Hanning
Although smoothing by repeated medians of three is adequate for most purposes and successfully
dealt with seemingly atypical values, the results still have a somewhat jagged appearance. One way
to smooth off the corners would be to use running means of three on the 3R smooth. However, we
can do better than taking simple means of three. This would give equal weight, one-third, to each
value. As the data have already been smoothed, it would seem sensible to give more weight to the
middle value.
A procedure called hanning does this: given any three consecutive data values, the adjacent values
are each given weight one-quarter, whereas the middle value, the value being smoothed, is given
weight one-half. This is achieved in the following way: first calculate the mean of the two
adjacent values — the skip mean — thus skipping the middle value; then calculate the mean of the
value to be smoothed and the skip mean. It is easy to show that these two steps combine to give
the required result.
In practice, we first form a column of skip means alongside the values to be smoothed and then form
a column of the required smoothed values.
This procedure is depicted above for the first three values of the repeated median smooth, shown in
full in figure 5.10.
Thus 43 is the value to be smoothed, the skip mean 42 is the mean of 41 and 43 and the smoothed
value 42.5 is the mean of 43 and 42. A new element of notation has been introduced into figure
5.10: the column of hanned data values is sometimes labelled ‘H’.
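A minimal Python sketch of hanning, written exactly as the two-step recipe above; the input is the invented median-smoothed series from the earlier sketch.
def hann(series):
    smoothed = list(series)                              # end values copied on unchanged
    for t in range(1, len(series) - 1):
        skip_mean = (series[t - 1] + series[t + 1]) / 2  # mean of the two adjacent values
        smoothed[t] = (series[t] + skip_mean) / 2        # quarter, half, quarter weighting
    return smoothed

print(hann([41, 43, 43, 41, 41, 40, 40, 37]))            # 43 with neighbours 41 and 43 becomes 42.5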
The results are plotted in figure 5.11 and this also displays the percentage of individuals saying they
would vote Conservative and Liberal Democrat over the same period. Hanning has produced a
smoother result than repeated medians alone. Whether the extra computational effort is
worthwhile depends on the final purpose of the analysis. Repeated medians are usually sufficient
for exploratory purposes but, if the results are to be presented to a wider audience, the more
pleasing appearance that can be achieved by hanning may well repay the extra effort.
Residuals
Having smoothed time series data, much can be gained by examining the residuals between the
original data and the smoothed values, here called the rough. Residuals can tell us about the
general level of variability of data over and above that accounted for by the fit provided by the
smoothed line. We can judge atypical behaviour against this variability, as measured, for example,
by the midspread of the residuals.
Ideally we want residuals to be small, centred around zero and patternless, and, if possible,
symmetrical in shape with a smooth and bell-shaped appearance. These properties will indicate
that the residuals represent little more than negligible random error and that we are not distorting
the main patterns in the data by removing them. Displaying residuals as a histogram will reveal
their typical magnitude and the shape of their distribution. Figure 5.12 shows the histogram of the
residuals from the repeated median and hanning smooth (the 3RH for short). This shows that the
residuals are small in relation to the original data, fairly symmetrical, centred on zero and devoid of
outliers.
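Continuing the earlier sketches (and reusing the smooth3 and hann functions defined there), the rough is obtained by subtracting the smooth from the original data:
data = [41, 43, 43, 41, 38, 41, 40, 37]               # the invented series used above
fit = hann(smooth3(data))                              # a 3RH-style smooth built from the earlier sketches
rough = [round(d - f, 2) for d, f in zip(data, fit)]   # rough = data minus smooth
print(rough)                                           # ideally small, centred on zero and patternless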
Refinements
There are a number of refinements designed to produce even better smooths. We can only give
cursory attention to these here but more details are given in books by Tukey (1977) and Velleman
and Hoaglin (1981). Before discussing how the first and last values in a time series might also be
smoothed, it is helpful to introduce a convenient special notation: y1, y2, ..., yN, or yt in general; yt
refers to the value of the quantity y recorded at time t. It is conventional to code t from 1 to N, the
total period of observation. For example, in figure 5.10 the months February 2003 to March 2005
would be coded from 1 to 26.
Endpoint smoothing
Instead of copying on y1, we first create a new value to represent y at time 0, January 2003. This will
give us a value on either side of y1, so that it can be smoothed. This value is found by extrapolating
the smoothed values for times 2 and 3, which we will call z2 and z3, and this is shown graphically in
figure 5.13.
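One common version of this rule (following Tukey) takes the extrapolated value at time 0 to be 3z2 − 2z3 and then replaces the end value by the median of that extrapolated value, the original end value and z2; a small sketch with invented numbers:
from statistics import median

def smooth_endpoint(y1, z2, z3):
    extrapolated = 3 * z2 - 2 * z3        # straight line through (2, z2) and (3, z3), read off at time 0
    return median([extrapolated, y1, z2]) # then smooth the end value as a median of three, as before

print(smooth_endpoint(41, 42.5, 42.5))    # 42.5 with these invented values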