Unit 3 Univariate Analysis
UNIT-III
UNIVARIATE ANALYSIS
3.1 Introduction
In this first part, methods are introduced to display the essential features of one variable at a
time; we defer until the next part the consideration of relationships between variables.
Two organizing concepts have become the basis of the language of data analysis: cases and
variables. The cases are the basic units of analysis, the things about which information is
collected. There can be a wide variety of such things. Researchers interested in the
consequences of long-term unemployment may treat unemployed individuals as their units of
analysis, and collect information directly from them.
Researchers usually proceed by collecting information on certain features of all cases in their
sample; these features are the variables. In a survey of individuals, their income, sex and age
are some of the variables that might be recorded.
One official survey which is a modern inheritor of the nineteenth-century survey tradition is
the General Household Survey (GHS). It is a multipurpose survey carried out by the social
survey division of the Office for National Statistics (ONS).
The main aim of the survey is to collect data on a range of core topics, covering household,
family and individual information. Government departments and other organizations use this
information for planning, policy and monitoring purposes, and to present a picture of
households, family and people in Great Britain.
The GHS has been conducted continuously in Britain since 1971, except for breaks between
1997 and 1998 (when the survey was reviewed) and again between 1999 and 2000 when the
survey was re-developed. It is a large survey, with a sample size of approximately 8,600
households containing over 20,000 individuals, and is a major source for social scientists.
You can read more information about the GHS in the appendix to this chapter on the
accompanying website. The sample is designed so that all households in the country have an
equal probability of being selected.
The numbers in the different columns of figure 1.1 do not all have the same properties; some
of them merely differentiate categories (as in sex or drinking behaviour) whereas some of
them actually refer to a precise amount of something (like years of age, or units of alcohol
drunk each week). They represent different levels of measurement. When numbers are used
to represent categories that have no inherent order, this is called a nominal scale. When
numbers are used to convey full arithmetic properties, this is called an interval scale. Figure
1.1 shows a specimen case-by-variable data matrix.
One final point should be made before we start considering techniques for actually getting our
eyes and hands on these variables. The human brain is easily confused by an excess of detail.
Numbers with many digits are hard to read, and important features, such as their order of
magnitude, may be obscured. Some of the digits in a dataset vary, while others do not. In the
following case:
134
121
167
only the first digit is the same in every value; the remaining digits vary from case to case.
When such detail is unwanted, there are two ways of losing digits.
The first method is known as rounding. Digits from zero to four are rounded down, and six
to nine are rounded up. The digit five causes a problem; it can be arbitrarily rounded up or
down according to a fixed rule, or it could be rounded up after an odd digit and down after an
even digit.
A second method of losing digits is simply cutting off or 'truncating' the ones that we do
not want. Thus, when cutting, all the numbers from 899.0 to 899.9 become 899.
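As a rough illustration (not part of the original text), the following Python sketch contrasts the two methods. Python's built-in round() happens to follow the 'round five towards the even digit' rule described above, while the truncate() helper defined here is a hypothetical name used only for this example.

```python
import math

def truncate(value: float, decimals: int = 0) -> float:
    """Cut off unwanted digits without rounding (e.g. 899.0 to 899.9 all become 899)."""
    factor = 10 ** decimals
    return math.trunc(value * factor) / factor

# Rounding: digits 0-4 go down, 6-9 go up; a trailing 5 goes to the even digit.
print(round(899.4))     # 899
print(round(899.6))     # 900
print(round(898.5))     # 898  (five after an even digit rounds down)
print(round(899.5))     # 900  (five after an odd digit rounds up)

# Truncating: the unwanted digits are simply cut off.
print(truncate(899.9))  # 899.0
```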
Blocks or lines of data are very hard to make sense of. We need an easier way of visualizing
how any variable is distributed across our cases. One simple device is the bar chart, a visual
display in which bars are drawn to represent each category of a variable such that the length
of the bar is proportional to the number of cases in the category.
A pie chart can also be used to display the same information. It is largely a matter of taste
whether data from a categorical variable are displayed in a bar chart or a pie chart. In general,
pie charts are to be preferred when there are only a few categories and when the sizes of the
categories are very different.
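As an illustrative sketch (not from the text), the following Python code uses matplotlib to draw both displays for a small categorical variable; the category labels and counts are invented for demonstration.

```python
import matplotlib.pyplot as plt

# Invented counts for a categorical variable such as drinking behaviour
categories = ["Non-drinker", "Moderate", "Heavy"]
counts = [120, 340, 60]

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: the length of each bar is proportional to the number of cases
ax_bar.bar(categories, counts)
ax_bar.set_ylabel("Number of cases")
ax_bar.set_title("Bar chart")

# Pie chart: each slice is proportional to the category's share of all cases
ax_pie.pie(counts, labels=categories, autopct="%1.0f%%")
ax_pie.set_title("Pie chart")

plt.tight_layout()
plt.show()
```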
In modern society there is considerable interest in the length of time people spend at work.
The measurement of hours that people work is important when analysing a variety of
economic and social phenomena. The number of hours worked is a measure of labour input
that can be used to derive key measures of productivity and labour costs. The patterns of
hours worked and comparisons of the hours worked by different groups within society give
important evidence for studying and understanding lifestyles, the labour market and social
changes.
In this chapter we will focus on the topic of working hours to demonstrate how simple
descriptive statistics can be used to provide numerical summaries of level and spread.
The chapter will begin by examining data on working hours in Britain taken from the
General Household Survey discussed in the previous chapter. These data are used to
illustrate measures of level such as the mean and the median and measures of spread or
variability such as the standard deviation and the midspread.
The histograms of the working hours distributions of men and women in the 2005 General
Household Survey are shown in figures 2.1 and 2.2.
We can compare these two distributions in terms of the four features introduced in the
previous chapter, namely level, spread, shape and outliers.
Summaries of level
The level expresses where on the scale of numbers found in the dataset the distribution is
concentrated. In the previous example, it expresses where the distribution's centre point lies
on a scale running from 1 hour per week to 100 hours per week.
Residuals
A residual can be defined as the difference between a data point and the observed typical,
or average, value. For example if we had chosen 40 hours a week as the typical level of
men's working hours, using data from the General Household Survey in 2005, then a man
who was recorded in the survey as working 45 hours a week would have a residual of 5 hours.
Another way of expressing this is to say that the residual is the observed data value minus the
predicted value; in this case 45 - 40 = 5. Any data value, such as a measurement of hours
worked or income earned, can be thought of as being composed of two components: a fitted
part and a residual part. This can be expressed as an equation:

$$\text{data} = \text{fit} + \text{residual}$$
The median
The value of the case at the middle of an ordered distribution would seem to have an intuitive
claim to typicality. Finding such a number is easy when there are very few cases.
In the example of hours worked by a small random sample of 15 men (figure 2.3A), the value
of 48 hours per week fits the bill. There are six men who work fewer hours and seven men
who work more hours, while two men work exactly 48 hours per week. Similarly, in the
female data, the value of the middle case is 37 hours. The data value that meets this criterion
is called the median: the value of the case that has equal numbers of data points above and
below it. The median hours worked by women in this very small sample is 11 hours less than
the median for men.
The median is easy to find when, as here, there are an odd number of data points. When the
number of data points is even, it is an interval, not one case, which splits the distribution into
two. Thus the median in a dataset with fifty data points would be half-way between the values
of the 25th and 26th data points. Put formally, with N data points, the median M is the value at
depth (N + 1)/2. It is not the value at depth N/2.
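A minimal Python sketch of the depth rule, using a small invented batch of working hours (these 15 values are illustrative, not the figure 2.3A data):

```python
def median(values):
    """Value at depth (N + 1)/2 in the ordered data.

    When N is even the depth is fractional, so the median is half-way
    between the two data points either side of that depth.
    """
    ordered = sorted(values)
    n = len(ordered)
    depth = (n + 1) / 2                   # e.g. 15 cases -> depth 8
    lower = ordered[int(depth) - 1]       # depth counted from the bottom
    upper = ordered[-int(depth)]          # depth counted from the top
    return (lower + upper) / 2            # the same point when the depth is whole

hours = [30, 35, 37, 40, 42, 44, 46, 48, 48, 50, 52, 55, 60, 65, 70]
print(median(hours))   # 48.0 - the value at depth 8 of these 15 ordered values
```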
Another commonly used measure of the centre of a distribution is the arithmetic mean.
Indeed, it is so commonly used that it has even become known as the average. To calculate it,
first all of the values are summed, and then the total is divided by the number of data points.
In more mathematical terms:

$$\bar{Y} = \frac{\sum_{i=1}^{N} Y_i}{N}$$

We have come across N before. The symbol Y is conventionally used to refer to an actual
variable. The subscript i is an index to tell us which case is being referred to. So, in this case,
Y_i refers to all the values of the hours variable. The Greek letter Σ, pronounced 'sigma', is the
mathematician's way of saying 'the sum of'.
One important consequence is that the mean is more affected by unusual data values
than the median. The mean of the male working hours in the above dataset is 51 hours, not
very different from the median in this case. The mean and the median will tend to be the same
in symmetrical datasets.
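The mean's sensitivity to unusual values can be seen with a quick sketch (the numbers are invented, not the survey data):

```python
from statistics import mean, median

hours = [38, 40, 41, 42, 44, 45, 46]        # a fairly symmetrical batch
print(mean(hours), median(hours))           # roughly 42.3 and 42 - very close

hours_with_outlier = hours + [90]           # one extreme value is added
print(mean(hours_with_outlier))             # 48.25 - pulled up noticeably
print(median(hours_with_outlier))           # 43.0  - barely moves
```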
Summaries of spread
The second feature of a distribution visible in a histogram is the degree of variation or spread
in the variable. The two histograms of male and female working hours shown in figures 2.1
and 2.2 allow visual inspection of the extent to which the data values are relatively spread out
or clustered together.
The word 'relatively' is important: a single distribution can look very tightly clustered
simply because the scale has not been carefully chosen. In figures 2.1 and 2.2 the scale along
the horizontal axis ranges from 0 to 100 in both cases, which makes it possible directly to
compare the two distributions.
One possible measure of spread might be the distance between the two extreme values (the
range). Or we might work out what was the most likely difference between any two cases drawn at random from the
dataset. There are also two very commonly used measures which follow on from the logic of
using the median or mean as the summary of the level.
The midspread
The range of the middle 50 per cent of the distribution is a commonly used measure of spread
because it concentrates on the middle cases. It is quite stable from sample to sample. The
points which divide the distribution into quarters are called the quartiles (or sometimes
'hinges' or 'fourths'). The lower quartile is usually denoted Q_L and the upper quartile Q_U.
(The middle quartile is of course the median.) The distance between Q_L and Q_U is called the
midspread (sometimes the 'interquartile range'), or the dQ for short.
Just as the median cuts the whole distribution in two, the upper and lower quartiles cut each
half of the distribution in two. So, to find the depth at which they fall, we take the depth of the
median (cutting the fractional half off if there is one), add one and divide by two. There are
15 cases in the small datasets of men's working hours and women's working hours in figure
2.3. The median is at depth 8, so the quartiles are at depth 4.5. Counting in from either
end of the male distribution, we see that for men Q_L is 42.5 hours and Q_U is 55 hours. The
distance between them, the dQ, is therefore 12.5 hours.
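Following the same depth logic, here is a Python sketch of the quartiles and the midspread; the 15 values are again invented rather than the figure 2.3 data, so the printed quartiles differ from those quoted above.

```python
import math

def quartiles(values):
    """Lower and upper quartiles ('hinges') found by the depth rule."""
    ordered = sorted(values)
    n = len(ordered)
    median_depth = math.floor((n + 1) / 2)      # cut off the fractional half
    quartile_depth = (median_depth + 1) / 2     # e.g. depth 8 -> depth 4.5

    def value_at(depth, data):
        low = data[math.floor(depth) - 1]
        high = data[math.ceil(depth) - 1]
        return (low + high) / 2                 # half-way when the depth ends in .5

    q_lower = value_at(quartile_depth, ordered)
    q_upper = value_at(quartile_depth, ordered[::-1])   # count in from the top
    return q_lower, q_upper

hours = [30, 35, 37, 40, 42, 44, 46, 48, 48, 50, 52, 55, 60, 65, 70]
ql, qu = quartiles(hours)
print(ql, qu, qu - ql)   # lower quartile, upper quartile and the midspread dQ
```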
The arithmetic mean, you will recall, minimizes squared residuals. There is a measure of
spread which can be calculated from these squared distances from the mean. The standard
deviation essentially calculates a typical value of these distances from the mean. It is
conventionally denoted s, and defined as:

$$s = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N - 1}}$$
The deviations from the mean are squared, summed and divided by the sample size (well, N -
1 actually, for technical reasons), and then the square root is taken to return to the original
units. The order in which the calculations are performed is very important. As always,
calculations within brackets are performed first, then multiplication and division, then
addition (including summation) and subtraction. Without the square root, the measure is
called the variance (s²).
The original data values are written in the first column, and the sum and mean calculated at
the bottom. The residuals are calculated and displayed in column 2, and their squared values
are placed in column 3. The sum of these squared values is shown at the foot of column 3, and
from it the standard deviation is calculated.
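A small Python sketch of this column-by-column calculation, using five invented data values:

```python
import math

values = [38, 42, 45, 50, 55]                    # column 1: invented data values

mean = sum(values) / len(values)                 # 46.0
residuals = [v - mean for v in values]           # column 2: deviations from the mean
squared = [r ** 2 for r in residuals]            # column 3: squared deviations

sum_of_squares = sum(squared)                    # foot of column 3: 178.0
variance = sum_of_squares / (len(values) - 1)    # divide by N - 1
std_dev = math.sqrt(variance)                    # square root returns the original units

print(round(std_dev, 2))                         # about 6.67
```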
The word 'data' must be treated with caution. Literally translated, it means 'things that are
given'. However, this literal meaning is misleading. The numbers that present themselves
to us are not given naturally in that form. Any particular batch of numbers has been fashioned
by human hand. The numbers did not drop from the sky ready made. They usually
reflect aspects of the social process which created them. Data, in short, are produced, not
given.
Data analysts have to learn to be critical of the measures available to them, but in a
constructive manner. This chapter considers various manipulations that can be applied to the
data to achieve the above goals. We start by recalling how a constant may be added to or
subtracted from each data point, and then look at the effect of multiplying or dividing by a
constant. Then we consider a powerful standardizing technique which makes the level and
spread of any distribution identical.
Instead of adding a constant, we could change each data point by multiplying or dividing it by
a constant. A common example of this is the re-expression of one currency in terms of
another. For example, in order to convert pounds to US dollars, the pounds are multiplied by
the current exchange rate. Multiplying or dividing each of the values has a more powerful
effect than adding or subtracting.
The result of multiplying or dividing by a constant is to scale the entire variable by a factor,
evenly stretching or shrinking the axis like a piece of elastic. To illustrate this, let us see what
happens if data from the General Household Survey on the weekly alcohol consumption of
men who classify themselves as moderate or heavy drinkers are divided by seven to give the
average daily alcohol consumption.
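A short sketch with invented numbers showing the difference: adding a constant shifts the level but leaves the spread untouched, while dividing by a constant scales both.

```python
from statistics import mean, stdev

units_per_week = [14, 21, 28, 35, 56]            # invented weekly alcohol units

shifted = [x + 10 for x in units_per_week]       # add a constant to every value
daily = [x / 7 for x in units_per_week]          # divide every value by seven

print(mean(units_per_week), stdev(units_per_week))   # original level and spread
print(mean(shifted), stdev(shifted))                 # level rises by 10, spread unchanged
print(mean(daily), stdev(daily))                     # level and spread both divided by 7
```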
Standardized variables
In this section we will look at how these two ideas may be combined to produce a very
powerful tool which can render any variable into a form where it can be compared with any
other. The result is called a standardized variable.
To standardize a variable, a typical value is first subtracted from each data point, and then
each point is divided by a measure of spread. It is not crucial which numerical summaries of
level and spread are picked. The mean and standard deviation could be used, or the median
and midspread:

$$Z = \frac{Y - \bar{Y}}{s} \qquad \text{or} \qquad Z = \frac{Y - \text{median}}{d_Q}$$
A variable which has been standardized in this way is forced to have a mean or median of 0
and a standard deviation or midspread of 1.
Standardization
Two different uses of variable standardization are found in social science literature. The first
is in building causal models, where it is convenient to be able to compare the effect that two
different variables have on a third on the same scale. But there is a second use which is more
immediately intelligible: standardized variables are useful in the process of building complex
measures based on more than one indicator.
In order to illustrate this, we will use some data drawn from the National Child Development
Study (NCDS). This is a longitudinal survey of all children born in a single week of 1958.
There is a great deal of information about children's education in this survey. Information was
sought from the children's schools about their performance at state examinations, but the
researchers also decided to administer their own tests of attainment.
The first two columns of figure 3.5 show the scores obtained on the reading and mathematics
test by fifteen respondents in this study. There is nothing inherently interesting or intelligible
about the raw numbers. The first score of 31 for the reading test can only be assessed in
comparison with what other children obtained.
As can be seen from the descriptive statistics in figure 3.4, the sixteen-year-olds in the
National Child Development Study apparently found the mathematics test rather more
difficult than the reading comprehension test. The reading comprehension was scored out of a
total of 35 and sixteen-year-olds gained a mean score of 25.37, whereas the mathematics test
was scored out of a possible maximum of 31, but the 16-year-olds only gained a mean score
of 12.75.
This is achieved by standardizing each score. One common way of standardizing is to first
subtract the mean from each data value, and then divide the result by the standard deviation.
This process is summarized by the following formula, where the original variable 'Y' becomes
the standardized variable 'Z':

$$Z = \frac{Y - \bar{Y}}{s}$$
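As an illustrative sketch, the following Python snippet standardizes one reading score and one mathematics score so that they can be compared. The means (25.37 and 12.75) are those quoted above; the standard deviations and the mathematics score of 15 are assumed purely for the example.

```python
def standardize(score, mean, sd):
    """Z = (Y - mean) / standard deviation."""
    return (score - mean) / sd

READING_MEAN, READING_SD = 25.37, 7.0    # mean from the text; sd assumed
MATHS_MEAN, MATHS_SD = 12.75, 7.0        # mean from the text; sd assumed

z_reading = standardize(31, READING_MEAN, READING_SD)   # the reading score of 31
z_maths = standardize(15, MATHS_MEAN, MATHS_SD)         # an assumed maths score

# Both are now on a common scale (mean 0, standard deviation 1), so the two
# performances can be compared directly despite the different raw scales.
print(round(z_reading, 2), round(z_maths, 2))
```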
One such shape, investigated in the early nineteenth century by the German mathematician
and astronomer, Gauss, and therefore referred to as the Gaussian distribution, is commonly
used. It is possible to define a symmetrical, bell-shaped curve which looks like those in figure
3.8, and which contains fixed proportions of the distribution at different distances from the
centre. The two curves in figure 3.8 look different - (a) has a smaller spread than (b) - but in
fact they only differ by a scaling factor.
Any Gaussian distribution has a very useful property: it can be defined uniquely by its
mean and standard deviation. Given these two pieces of information, the exact shape of the
curve can be reconstructed, and the proportion of the area under the curve falling between
various points can be calculated (see figure 3.9).
The Gaussian distribution is defined theoretically; you can think of it as being based on an
infinitely large number of cases. For this reason it is perfectly smooth, and has infinitely long
tails with infinitely small proportions of the distribution falling under them. The theoretical
definition of the curve is given by an equation. This bell-shaped curve is often called 'the
normal distribution'. Its discovery was associated with the observation of errors of
measurement. The distribution of these errors of measurement often approximated to the
bell-shape in figure 3.8.
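For reference, the equation of the Gaussian (normal) curve with mean $\mu$ and standard deviation $\sigma$ is:

$$f(y) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(y-\mu)^2}{2\sigma^2}}$$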
3.5 Inequality
How true is the old proverb that the rich get richer and the poor get poorer?
What evidence can we use to look at how the gap between the richest and the poorest
in society is shifting over time?
Does the way that we measure inequality impact on our conclusions?
Those who are interested in income inequality have traditionally used techniques of
displaying income distributions and summarizing their degree of spread.
Over the past three decades or so, the British economy has grown and this has made a real
difference to people's lives. Household disposable income per head, adjusted for inflation,
increased more than one and a third times between 1971 and 2003 so that for every £100 a
household had to spend in 1971, by 2003 they had to spend £234 (Summerfield and Gill,
2005). During the 1970s and early 1980s, growth in household income was somewhat erratic,
and in some years there were small year-on-year falls, such as in 1974, 1976, 1977,
1981 and 1982. However, since then there has been growth each year, with the exception of
1996 when there was a very small fall. Data from the British Social Attitudes Survey (Park et
al., 2004) show that whereas in 1983, 24 per cent of people said they were living comfortably
and 25 per cent said they found it difficult or very difficult to cope, by the early 1990s, 40 per
cent said they were comfortable while 16 per cent said they were finding it hard to cope.
This chapter will focus on how we can measure inequality in such a way as to make it
possible to compare levels of inequality in different societies and to look at changes in
levels of inequality over time.
Considered at the most abstract level, income and wealth are two different ways of looking at
the same thing. Both concepts try to capture ways in which members of society have different
access to the goods and services that are valued in that society. Wealth is measured simply in
pounds, and is a snapshot of the stock of such valued goods that any person owns, regardless
of whether this is growing or declining. Income is measured in pounds per given period, and
gives a moving picture, telling us about the flow of revenue over time.
There are four major methodological problems encountered when studying the distribution of
income:
Definition of income
To say that income is a flow of revenue is fine in theory, but we have to choose between two
approaches to making this operational. One is to follow accounting and tax practices, and
make a clear distinction between income and additions to wealth.
The first approach is that of the Inland Revenue, which has separate taxes for income and capital
gains. In this context a capital gain is defined as the profit obtained by selling an asset that has
increased in value since it was obtained. However, interestingly, in most cases this definition
(for the purposes of taxation) does not include any profit made when you sell your main
home.
The second approach is to treat income as the value of goods and services consumed in a
given period plus net changes in personal wealth during that period.
The definition of income usually only includes money spent on goods and services that are
consumed privately. But many things of great value to different people are organized at a
collective level: health services, education, libraries, parks, museums, even nuclear warheads.
Income can come from three broad sources:
earned income from employment or self-employment;
unearned income which accrues from ownership of investments, property, rent and so
on;
transfer income, that is benefits and pensions transferred on the basis of entitlement,
not on the basis of work or ownership, mainly by the government but occasionally by
individuals (e.g. alimony).
The first two sources are sometimes added together and referred to as original income.
In many of the reports that examine inequality the basic unit of analysis used is the
household, and not the family or the individual. This chapter will also use the household
as the main unit of analysis. This is one person, or a group of persons, who have the
accommodation as their only or main residence and (for a group) share the living
accommodation, that is a living or sitting room, or share meals together or have common
housekeeping. Up until 1999-2000, the definition was based on the pre-1981 Census
definition. This required a group of persons to share eating and budgeting arrangements as
well as shared living accommodation in order to be considered as a household. The effect
of the change was fairly small, but not negligible.
While most income is paid to individuals, the benefits of that income are generally shared
across broader units. Spending on many items, particularly on food, housing, fuel and
electricity, is largely joint spending by the members of the household. While there are
many individuals who receive little or no income of their own, many of these will live
with other people who do receive an income. This makes the household a good unit to
study for those who are interested in inequality. This approach means that total household
income is understood as representing the (potential) standard of living of each of its
members.
An equivalence scale assigns a value to each individual based on their age and
circumstances. The values for each household member are then summed to give the total
equivalence number for that household. This number is then divided into the disposable
income for that household to give equivalized disposable income. For example, a
household with a married couple and one child aged six would have an equivalence
number of 1.0 + 0.21 = 1.21 (these figures are found in the appendix to this chapter). In
the example above the household's disposable income (for Household B) is £40,000, and
so its equivalized disposable income is £33,058 (i.e. £40,000/1.21). The equivalence
number for Household A is 0.61 for a single head of household and therefore the
equivalized disposable income would be £49,180.
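A minimal Python sketch of this equivalization arithmetic. The scale values (1.0 for a couple, 0.21 for a child aged six, 0.61 for a single head of household) are taken from the example above; the £30,000 figure for Household A is an assumed disposable income chosen so that the result matches the £49,180 quoted in the text.

```python
# Simplified equivalence scale values quoted in the example above
SCALE = {
    "couple": 1.0,         # married or cohabiting couple (head plus partner)
    "single_head": 0.61,   # single head of household
    "child_age_6": 0.21,   # one child aged six
}

def equivalized_income(disposable_income, members):
    """Divide disposable income by the household's total equivalence number."""
    equivalence_number = sum(SCALE[m] for m in members)
    return disposable_income / equivalence_number

# Household B: a couple with one child aged six and £40,000 disposable income
print(round(equivalized_income(40_000, ["couple", "child_age_6"])))   # 33058

# Household A: a single head of household (assumed £30,000 disposable income)
print(round(equivalized_income(30_000, ["single_head"])))             # 49180
```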
It is difficult to decide what the appropriate period should be for the assessment of
income. It is usually important to distinguish inequalities between the same people over
the course of their life-cycle and inequalities between different people. If a short period,
like a week, is chosen, two people may appear to have identical incomes even though they
are on very different lifetime career paths; conversely, two individuals who appear to have
very different incomes may be on identical lifetime career paths but just be at different
ages.
The solution might be to take a longer period - ideally a lifetime, perhaps. However, either
guesses will have to be made about what people will earn in the future, or definitive
statements will only be possible about the degree of inequality pertaining in society many
decades previously. For most purposes, one tax year is used as the period over which
information about income is collected.
There are now many large-scale, regular surveys of individuals and households in Britain
that include questions about income from employment and benefits. For example, to
mention just three, there are the General Household Survey discussed in the appendix to
chapter 1, the British Household Panel Study and the Labour Force Survey. However,
there are two main sources of information about income that Government departments use
to produce regular annual publications on income inequalities. The Department for Work
and Pensions uses the Family Resources Survey (FRS) to produce its publication
'Households Below Average Income', while the Office for National Statistics uses the
'Expenditure and Food Survey' (EFS) to produce a series on the redistribution of income
published annually as 'The effects of taxes and benefits on household incomes'.
Figure 4.1 illustrates one method for summarizing data on the income received by
households. It displays the gross income of different deciles of the distribution. For
example, figure 4.1 shows that in 2003-4 the poorest ten per cent of households had a
gross income of less than £124 per week, while the richest ten per cent of households had
a gross income of over £1,092 per week. The median gross income is £445 per week.
An alternative technique for examining the distribution of incomes is to adopt the quantile
shares approach. This is illustrated in figure 4.2, which is a modified version of a table
produced as part of the annual report from the Office for National Statistics 'The effects of
taxes and benefits on household income'. The income of all units falling in a particular
quantile group - for example, all those with income above the top decile - is summed and
expressed as a proportion of the total income received by everyone.
Original income is defined as the income in cash of all members of the household before
the deduction of taxes or the addition of any state benefits. It therefore includes income
from employment and self-employment as well as investment income, occupational
pensions and annuities.
Gross income is then calculated by adding cash benefits and tax credits to original
income. Cash benefits and tax credits include contributory benefits such as retirement
pension, incapacity benefit and statutory maternity pay and non-contributory benefits such
as income support, child benefit, housing benefit and working families tax credit.
Income tax, Council tax and National Insurance contributions are then deducted to give
disposable income. The final stage is to deduct indirect taxes to give post-tax income.
Neither quantiles nor quantile shares lend themselves to an appealing way of presenting
the distribution of income in a graphical form. This is usually achieved by making use of
cumulative distributions. The income distribution is displayed by plotting cumulative
income shares against the cumulative percentage of the population.
The cumulative distribution is obtained by counting in from one end only. Income
distributions are traditionally cumulated from the lowest to the highest incomes. To see
how this is done, consider the worksheet in figure 4.3. The bottom 5 per cent receive 0.47
per cent of the total original income, and the next 5 per cent receive 0.51 per cent. Summing
these, we can say that the bottom 10 per cent receive 0.98 per cent of the total
original income. We work our way up through the incomes in this fashion.
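A small Python sketch of this cumulation. The first two shares (0.47 and 0.51 per cent) are those quoted above; the remaining values are invented to complete the illustration and do not reproduce figure 4.3.

```python
from itertools import accumulate

# Share of total original income received by successive 5 per cent groups,
# ordered from the lowest incomes upwards (per cent of total income)
shares = [0.47, 0.51, 0.90, 1.30, 1.80, 2.40]

cumulative = list(accumulate(shares))
for group, cum_share in enumerate(cumulative, start=1):
    print(f"bottom {5 * group:2d}% of households receive {cum_share:.2f}% of income")
```

Plotting these cumulative income shares against the cumulative percentage of the population gives the Lorenz curve described next.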
The cumulative percentage of the population is then plotted against the cumulative share
of total income. The resulting graphical display is known as a Lorenz curve.
Lorenz curve
It was first introduced in 1905 and has been repeatedly used for visual communication of
income and wealth inequality. The Lorenz curve for pre-tax income in 2003-4 in the UK is
shown in figure 4.4.
Lorenz curves have visual appeal because they portray how near total equality or total
inequality a particular distribution falls. If everyone in society had the same income, then
the share received by each decile group, for example, would be 10 per cent, and the
Lorenz curve would be completely straight, described by the diagonal line in figure 4.4. If,
on the other hand, one person received all the income and no one else got anything, the
curve would be the L-shape described by the two axes. The nearer the empirical line
comes to the diagonal, the more equally distributed income in society is.
In order to trace trends in income inequality over time, or in order to make comparisons
across nations, or in order to compare inequalities in income with inequalities in wealth or
housing or health, however, a single numerical summary is desirable.
Scale independence
We have already come across two measures of the spread of a distribution - the standard
deviation and the midspread. Unfortunately, if money incomes change because they are
expressed in yen rather than pounds, or, less obviously, if they increase simply to keep
pace with inflation, the standard deviation and midspread of the distribution will also
change. We want a measure of inequality that is insensitive to such scaling factors.
A measure that summarizes what is happening across all the distribution is the Gini
coefficient. An intuitive explanation of the Gini coefficient can be given by looking back
at figure 4.4. The Gini coefficient expresses the ratio between the area between the Lorenz
curve and the line of total equality and the total area in the triangle formed between the
perfect equality and perfect inequality lines. It therefore varies between 0 (on the line of
perfect equality) and 1 (on the L-shaped line of perfect inequality), although it is
sometimes multiplied by 100 to express the coefficient in percentage form.
The Lorenz curve of original income in figure 4.4 represents a Gini coefficient of 52 per cent (0.52).
There is a measure of spread which is the average absolute difference between the value of
every individual compared with every other individual. The Gini coefficient is this amount
divided by twice the mean. As you might expect, a measure which requires you to look at
every possible pair of incomes is tremendously laborious to calculate, although relatively
straightforward when a computer takes the strain. Because income distributions are so
often presented in grouped form, the intuitive definition based on the Lorenz curve is
usually sufficient. A rough guide to the numerical value of the Gini coefficient can always
be obtained by plotting the Lorenz curve on to squared paper and counting the proportion
of squares that fall in the shaded area.
A formal mathematical treatment of the Gini coefficient exists (Cowell, 1977); the formula is presented here:

$$G = \frac{2\sum_{i=1}^{N} i\,Y_i}{N^2\,\bar{Y}} - \frac{N + 1}{N}$$

where the incomes Y_i are arranged in ascending order and i is the rank of each unit.
Notice that the key term ($\sum_i i\,Y_i$) involves a weighted sum of the data values, where the
weight is the unit's rank order in the income distribution. Figure 4.5 shows the trends in
inequality as measured by the Gini coefficient of original, gross, disposable and post-tax
income from 1981 to 2003-4.
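The two routes to the Gini coefficient described above can be sketched in Python as follows; the five weekly incomes are invented, and the second function implements the rank-weighted formula given earlier.

```python
from itertools import combinations

def gini_pairwise(incomes):
    """Average absolute difference between every pair of incomes, divided by twice the mean."""
    n = len(incomes)
    mean = sum(incomes) / n
    total_diff = sum(abs(a - b) for a, b in combinations(incomes, 2))
    mean_abs_diff = 2 * total_diff / (n * n)     # average over all ordered pairs
    return mean_abs_diff / (2 * mean)

def gini_ranked(incomes):
    """G = 2 * sum(i * Y_i) / (N^2 * mean) - (N + 1) / N, incomes in ascending order."""
    y = sorted(incomes)
    n = len(y)
    mean = sum(y) / n
    weighted_sum = sum(rank * value for rank, value in enumerate(y, start=1))
    return 2 * weighted_sum / (n * n * mean) - (n + 1) / n

incomes = [120, 250, 300, 420, 910]              # invented weekly incomes
print(round(gini_pairwise(incomes), 3))          # 0.35
print(round(gini_ranked(incomes), 3))            # 0.35 - the two methods agree
```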
Time series
The aim of this chapter is to introduce a method for presenting time series data that brings
out the underlying major trends and removes any fluctuations that are simply an artefact of
the ways the data have been collected. In addition the chapter will highlight the
importance of examining how statistics such as the number of recorded crimes are
produced. Governments, and other producers of statistics, frequently revise the way that
statistics are calculated and this can lead to apparent change (or stability) over time that is
no more than a reflection of the changing way in which a statistic is derived.
The total number of crimes recorded every year from 1965 to 1994, as shown in figure
5.1, is an example of a time series. Other examples might be the monthly Retail Price
Index over a period of ten years, the monthly unemployment rate or the quarterly balance
of payment figures during the last Conservative government. These examples all have the
same structure: a well-defined quantity is recorded at successive equally spaced time
points over a specific period. But problems can occur when any one of these features is
not met - for example if the recording interval is not equally spaced.
Smoothing
Time series such as that shown in the second column of figure 5.1 are displayed by
plotting them against time, as shown in figure 5.2. When such trend lines are smoothed,
the jagged edges are sawn off. A smoothed version of the total numbers of recorded
crimes over the thirty years from the mid 1960s to the mid 1990s is displayed in figure
5.3.
Figure 5.2 was constructed by joining points together with straight lines. Only the points
contain real information of course. The lines merely help the reader to see the points. The
result has a somewhat jagged appearance. The sharp edges do not occur because very
sudden changes really occur in numbers of recorded crimes. They are an artefact of the
method of constructing the plot, and it is justifiable to want to remove them. According to
Tukey (1977, p. 205), the value of smoothing is 'the clearer view of the general, once it is
unencumbered by detail'. The aim of smoothing is to remove any upward or downward
movement in the series that is not part of a sustained trend.
Sharp variations in a time series can occur for many reasons. Part of the variation across
time may be error. For example, it could be sampling error. The opinion-poll data used
later in this chapter were collected in monthly sample surveys, each of which aimed to
interview a cross-section of the general public, but each of which will have deviated from
the parent population to some extent. Similarly, repeated measures may each contain a
degree of measurement error. In such situations, smoothing aims to remove the error
component and reveal the underlying true trend.
In engineering terms we want to recover the signal from a message by filtering out the
noise. The process of smoothing time series produces just such a decomposition of the
data: what an engineer would describe as signal plus noise, the data analyst describes as
the smooth plus the rough.
Opinion polls
Opinion polls aim to capture individuals' political allegiances and voting
intentions. The purpose of the following discussion is to highlight that, just as crime
statistics do not simply reflect the numbers of crimes committed, opinion polls do not
provide a direct window onto individuals' voting intentions. The data reported by polling
companies are a product of the methodologies used, in just the same way that information
about the number of crimes committed can be described as 'constructed'.
Opinion polls represent only a small fraction of all the social research that is conducted in
Britain, but they have become the public face of social research because they are so
heavily reported. Predicting who is going to win an election makes good newspaper copy.
The newspaper industry was therefore among the first to make use of the development of
scientific surveys for measuring opinion.
Opinion polls in Britain have historically almost always been conducted on quota samples.
In such a sample, the researcher specifies what type of people he or she wants in the
sample, within broad categories (quotas), and it is then left up to the interviewer to find
such people to interview. In a national quota sample, fifty constituencies might be selected
at random, and then quotas set within each constituency on age, sex and employment
status. In the better quota samples, such quotas are interlocked: the interviewer is told
how many young housewives, how many male unemployed and so on to interview. The
idea is that when all these quotas are added together, the researcher will be sure that the
national profile on age, sex and employment status will have been faithfully reproduced.
Technique
Figure 5.6 shows the percentage of those who said they were certain to vote, and who
intended to vote Labour, plotted over time without being smoothed. The curve is jagged
because the values of raw time series data at adjacent points can be very different. On a
smooth curve, both the values and the slopes at neighbouring time points are close
together.
To smooth a time series we replace each data value by a smoothed value that is
determined by the value itself and its neighbours. The smoothed value should be close to
each of the values which determine it except those which seem atypical.
Summaries of three
The simplest such resistant average is to replace each data value by the median of three
values: the value itself, and the two values immediately adjacent in time. Consider, again,
the percentage of respondents who intended to vote Labour (column 2 of data in figure
5.5). To smooth this column, we take the monthly figures in groups of three, and replace
the value of the middle month by the median of all three months.
In March, April and May 2003, the median is 43 per cent, so April's value is unchanged.
In April, May and June 2003, the median is 41 per cent, so the value for May is altered to
41 as shown in figure 5.7. The process is repeated down the entire column of figures.
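A short Python sketch of running medians of three; the series is an invented sequence of monthly percentages rather than the figure 5.5 data.

```python
from statistics import median

def smooth_median3(series):
    """Replace each value by the median of itself and its two neighbours.

    The first and last values have only one neighbour, so here they are
    simply copied on unchanged (endpoint smoothing is discussed later).
    """
    smoothed = list(series)
    for t in range(1, len(series) - 1):
        smoothed[t] = median(series[t - 1:t + 2])
    return smoothed

labour_share = [43, 43, 41, 44, 40, 39, 38, 40, 37]    # invented monthly percentages
print(smooth_median3(labour_share))
```

Repeating the pass until no value changes gives the repeated median smooth referred to below as '3R'.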
The data, the smoothed values and the residuals are shown in the first three columns of
data in figure 5.8. Notice the large residuals for the somewhat atypical results in August,
September and October 2004. The effect of median smoothing is usually to exchange the
jagged peaks for flat lines.
One other possible method of smoothing would be to use means rather than medians. As
with the median smoothing, the residuals in the seemingly atypical months are large, but
the sharp contrast between the typical and atypical months has been lost.
2. List the times and data in two adjacent columns, rescaling and relocating to minimize
writing and computational effort. Judicious cutting or rounding can reduce work
considerably.
3. Record the median of three consecutive data values alongside the middle value. With a
little practice, this can be done quickly and with very little effort.
4. Pass through the data, recording medians of three as many times as required.
Hanning
In any three consecutive data values, the adjacent values are each given weight one-quarter,
whereas the middle value, the value being smoothed, is given weight one-half. This is
achieved in the following way: first calculate the mean of the two adjacent values - the
skip mean - thus skipping the middle value; then calculate the mean of the value to be
smoothed and the skip mean. It is easy to show that these two steps combine to give the
required result: the mean of the middle value and the skip mean is one-half of the middle
value plus one-half of the skip mean, and one-half of the skip mean is one-quarter of each
neighbouring value.
In practice, we first form a column of skip means alongside the values to be smoothed and
then form a column of the required smoothed values.
This procedure is depicted above for the first three values of the repeated median smooth,
shown in full in figure 5.10.
Thus 43 is the value to be smoothed, the skip mean 42 is the mean of 41 and 43, and the
smoothed value 42.5 is the mean of 43 and 42. A new element of notation has been
introduced into figure 5.10: the column of hanned data values is sometimes labelled 'H'.
We can now summarize the smoothing recipe used in this figure as '3RH'.
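A matching Python sketch of hanning via the skip mean; the input series is an invented example standing in for the output of the repeated median smooth.

```python
def hann(series):
    """Weights of 1/4, 1/2, 1/4: the mean of each value and its skip mean."""
    smoothed = list(series)
    for t in range(1, len(series) - 1):
        skip_mean = (series[t - 1] + series[t + 1]) / 2   # mean of the two neighbours
        smoothed[t] = (series[t] + skip_mean) / 2         # mean of the value and the skip mean
    return smoothed

after_3r = [43, 43, 42, 41, 40, 39, 39, 38, 37]           # invented repeated-median output
print(hann(after_3r))                                      # the '3RH' smooth of this series
```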
Hanning has produced a smoother result than repeated medians alone. Whether the extra
computational effort is worthwhile depends on the final purpose of the analysis. Repeated
medians are usually sufficient for exploratory purposes but, if the results are to be
presented to a wider audience, the more pleasing appearance that can be achieved by
hanning may well repay the extra effort.
Figure 5.11 now tells a much clearer story. The proportion of people reporting that they
would vote Labour declined from early 2003 to a low point around July 2004, but then
revived somewhat.
Residuals
Having smoothed time series data, much can be gained by examining the residuals
between the original data and the smoothed values, here called the rough. Residuals can
tell us about the general level of variability of data over and above that accounted for by
the fit provided by the smoothed line. We can judge atypical behaviour against this
variability, as measured, for example, by the midspread of the residuals.
Ideally we want residuals to be small, centred around zero and patternless, and, if possible,
symmetrical in shape with a smooth and bell-shaped appearance. Displaying residuals as a
histogram will reveal their typical magnitude and the shape of their distribution. Figure
5.12 shows the histogram of the residuals from the repeated median and hanning smooth
(the 3RH for short). This shows that the residuals are small in relation to the original data,
fairly symmetrical, centred on zero and devoid of outliers.
Refinements
There are a number of refinements designed to produce even better smooths. Before
discussing how the first and last values in a time series might also be smoothed it is
helpful to introduce a convenient special notation: y_1, y_2, ..., y_N, or y_t in general, where y_t refers
to the value of the quantity, y, recorded at time t. It is conventional to code t from 1 to N,
the total period of observation. For example, in figure 5.10 the months February 2003 to
March 2005 would be coded from 1 to 26.
Endpoint smoothing
So far we have been content to copy on the initial and final values for February 2003 and
March 2005 (y_1 and y_N), but we can do better. Instead of copying on y_1, we first create a
new value to represent y at time 0, January 2003. This will give us a value on either side
of y_1 so that it can be smoothed. This value is found by extrapolating the smoothed values
for times 2 and 3, which we will call z_2 and z_3, and this is shown graphically in figure 5.13.
To compute this new value without recourse to graph paper, the following formula can be
used:

$$y_0 = 3z_2 - 2z_3$$

For example, a hypothetical value for January 2003 is given by (3 × 42.5) - (2 × 42.5), or
42.5 (data derived from figure 5.10). To provide a smooth endpoint value, we replace y_1
by z_1, the median of y_0, y_1 and z_2. In this case, the median of 42.5, 42.5 and 42.5 is
simply 42.5, so this becomes the new, smoothed endpoint value.
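A minimal sketch of this endpoint rule, assuming y holds the raw series and z the smoothed series (both invented here); y[0] corresponds to y_1 and z[1] to z_2 in the notation above.

```python
from statistics import median

def smooth_first_value(y, z):
    """Extrapolate a value for 'time 0', then take the median of three at time 1."""
    y0 = 3 * z[1] - 2 * z[2]           # extrapolated value for time 0
    return median([y0, y[0], z[1]])    # the new smoothed endpoint z_1

y = [43.0, 41.0, 43.0, 44.0]           # invented raw values
z = [43.0, 42.5, 42.5, 43.0]           # invented smoothed values
print(smooth_first_value(y, z))        # 42.5
```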
Sometimes time series exhibit an obvious change in level and it may be sensible to analyse
the two parts separately, producing two roughs and two smooths. In such cases, the two
sections often exhibit markedly different levels of variability. Breaks in time series also
occur when the method for collecting data changes.