Unit III Dev Notes
Introduction to Single variable: Distribution Variables - Numerical Summaries of Level and Spread -
Scaling and Standardising – Inequality - Smoothing Time Series.
How many households have no access to a car? What is a typical household income in Britain?
Which country in Europe has the longest working hours? To answer these kinds of questions we
need to collect information from a large number of people, and we need to ensure that the people
questioned are broadly representative of the population we are interested in. Conducting large-scale
surveys is a time-consuming and costly business. However, increasingly information or data from
survey research in the social sciences are available free of charge to researchers and students. The
development of the worldwide web and the ubiquity and power of computers makes accessing
these types of data quick and easy.
The aim is to explore data. We can use the 'Statistical Package for the Social Sciences' (SPSS)
package to start analysing data and answering the questions posed above.
Preliminaries
Two organizing concepts have become the basis of the language of data analysis: cases and
variables. The cases are the basic units of analysis, the things about which information is collected.
A variable is any characteristic or feature of the cases about which information is recorded; the word variable expresses the fact that this feature varies across different cases.
We will look at some useful techniques for displaying information about the values of single
variables, and will also introduce the differences between interval level and ordinal level variables.
The General Household Survey (GHS) is a multipurpose survey carried out by the Social Survey Division of the Office for National
Statistics (ONS). The main aim of the survey is to collect data on a range of core topics, covering
household, family and individual information. Government departments and other organizations
use this information for planning, policy and monitoring purposes, and to present a picture of
households, family and people in Great Britain.
Column 5 indicates the social class of the individual, based on their occupation.
Bar charts and pie charts can be an effective medium of communication if they are well drawn.
Histograms
Charts that are somewhat similar to bar charts can be used to display interval level variables
grouped into categories and these are called histograms. They are constructed in exactly the same
way as bar charts except that the ordering of the categories is fixed, and care has to be taken to
show exactly how the data were grouped.
II) NUMERICAL SUMMARIES OF LEVEL AND SPREAD
In this, how simple descriptive statistics can be used to provide numerical summaries of level and
spread.
Summaries of level
The level expresses where on the scale of numbers found in the dataset the distribution is
concentrated. In the previous example, it expresses where, on a scale running from 1 hour per week
to 100 hours per week, the distribution's centre point lies. To summarize these values, one number
must be found to express the typical hours worked by men, for example. The problem is: how do
we define 'typical'? There are many possible answers. The value half-way between the extremes
might be chosen, or the single most common number of hours worked, or a summary of the
middle portion of the distribution.
Residuals
A residual can be defined as the difference between an observed data value and the typical, or
average, value.
For example if we had chosen 40 hours a week as the typical level of men's working hours, using
data from the General Household Survey in 2005, then a man who was recorded in the survey as
working 45 hours a week would have a residual of 5 hours. Another way of expressing this is to say
that the residual is the observed data value minus the predicted value and in this case 45-40 = 5. In
this example the process of calculating residuals is a way of recasting the hours worked by each man
in terms of distances from typical male working hours in the sample. Any data value such as a
measurement of hours worked or income earned can be thought of as being composed of two
components: a fitted part and a residual part. This can be expressed as an equation:
data = fitted value + residual
The median
The value of the case at the middle of an ordered distribution would seem to have an intuitive
claim to typicality. Finding such a number is easy when there are very few cases. In the example of
hours worked by a small random sample of 15 men (figure 2.3A), the value of 48 hours per week fits
the bill. There are six men who work fewer hours and seven men who work more hours while two
men work exactly 48 hours per week. Similarly, in the female data, the value of the middle case is 37
hours. The data value that meets this criterion is called the median.
The median is easy to find when, as here, there are an odd number of data points. When
the number of data points is even, it is an interval, not one case, which splits the distribution into
two. The value of the median is conventionally taken to be half-way between the two middle
cases. Thus
the median in a dataset with fifty data points would be half-way between the values of the 25th and
26th data points.
Put formally, with N data points, the median M is the value at depth (N + 1)/2. It is not the value at
depth N/2. With twenty data points, for example, the tenth case has nine points which lie below it
and ten above.
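A minimal Python sketch of this depth rule follows; the hours figures are invented, arranged to match the description of the small sample of 15 men above.
def median_by_depth(values):
    ordered = sorted(values)
    n = len(ordered)
    depth = (n + 1) / 2                      # the median sits at depth (N + 1)/2
    lower = ordered[int(depth) - 1]          # value at that depth, counting from the bottom
    upper = ordered[-int(depth)]             # the same depth counted from the top
    return (lower + upper) / 2               # half-way between the two middle cases when N is even

hours = [38, 40, 42, 44, 45, 46, 48, 48, 50, 52, 55, 58, 60, 65, 70]   # invented sample of 15 men
print(median_by_depth(hours))                # 48.0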
The arithmetic mean
Another commonly used measure of the centre of a distribution is the arithmetic mean. To
calculate it, first all of the values are summed, and then the total is divided by the number of data
points. In more mathematical terms:
Ȳ = ΣYi / N   (where Ȳ, 'Y-bar', denotes the mean)
The symbol Y is conventionally used to refer to an actual variable. The subscript i is an index to tell
us which case is being referred to. So, in this case, Yi refers to all the values of the hours
variable. The Greek letter Σ, pronounced 'sigma', is the mathematician's way of saying 'the sum of'.
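In code, the same calculation is a single line; a short sketch using the invented sample from above:
hours = [38, 40, 42, 44, 45, 46, 48, 48, 50, 52, 55, 58, 60, 65, 70]   # same invented sample
y_bar = sum(hours) / len(hours)              # the sum of the Yi divided by N
print(round(y_bar, 2))                       # 50.73 for this sample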
Summaries of spread
The second feature of a distribution visible in a histogram is the degree of variation or spread in the
variable.
The midspread
The range of the middle 50 per cent of the distribution is a commonly used measure of spread
because it concentrates on the middle cases. It is quite stable from sample to sample. The points
which divide the distribution into quarters are called the quartiles (or sometimes 'hinges' or
'fourths'). The lower quartile is usually denoted QL and the upper quartile QU. (The middle quartile
is of course the median.) The distance between QL and QU is called the midspread (sometimes the
'interquartile range'), or dQ for short.
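Exact quartile conventions differ from textbook to textbook; the sketch below uses the 'hinge' depth rule of exploratory data analysis, under which the hinge depth is (floor of the median depth + 1)/2, counted in from each end.
import math

def value_at_depth(ordered, depth):
    # Average the two values straddling a half-integer depth (counting from 1).
    lower = ordered[math.floor(depth) - 1]
    upper = ordered[math.ceil(depth) - 1]
    return (lower + upper) / 2

def midspread(values):
    ordered = sorted(values)
    median_depth = (len(ordered) + 1) / 2
    hinge_depth = (math.floor(median_depth) + 1) / 2      # depth of the quartiles ('hinges')
    q_lower = value_at_depth(ordered, hinge_depth)
    q_upper = value_at_depth(ordered[::-1], hinge_depth)  # same depth counted from the top
    return q_upper - q_lower

hours = [38, 40, 42, 44, 45, 46, 48, 48, 50, 52, 55, 58, 60, 65, 70]   # invented sample
print(midspread(hours))                                   # 12.0 on these figures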
The standard deviation
The standard deviation essentially calculates a typical value of the distances of the data values from
the mean. It is conventionally denoted s, and defined as:
s = √[ Σ(Yi − Ȳ)² / N ]
The deviations from the mean are squared, summed and divided by the sample size, and then the
square root is taken to return to the original units. The order in which the calculations are
performed is very important. As always, calculations within brackets are performed first, then
multiplication and division, then addition (including summation) and subtraction. Without the
square root, the measure is called the variance, s2.
The layout for a worksheet to calculate the standard deviation of the hours worked by this small
sample of men is shown in figure 2.4.
The original data values are written in the first column, and the sum and mean calculated at the
bottom. The residuals are calculated and displayed in column 2, and their squared values are
placed in column 3. The sum of these squared values is shown at the foot of column 3, and from it
the standard deviation is calculated.
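A short Python sketch of the same worksheet layout, using the invented sample from earlier; as in the definition above, the sum of squares is divided by the sample size N.
import math

hours = [38, 40, 42, 44, 45, 46, 48, 48, 50, 52, 55, 58, 60, 65, 70]   # column 1: invented data
y_bar = sum(hours) / len(hours)                   # mean shown at the foot of column 1
residuals = [y - y_bar for y in hours]            # column 2: residuals from the mean
squares = [r ** 2 for r in residuals]             # column 3: squared residuals
variance = sum(squares) / len(hours)              # sum of squares divided by the sample size
s = math.sqrt(variance)                           # square root returns to the original units
print(round(s, 2))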
We must recognize that errors of all sorts creep into the very best data sources. The values found at
the extremes of a distribution are more likely to have suffered error than the values at the centre.
The word 'data' must be treated with caution. Literally translated, it means 'things that are
given'.
There are often problems with using official statistics, especially those which are the by-products
of some administrative process like, for example, reporting deaths to the Registrar-General or
police forces recording reported crimes. Data analysts have to learn to be critical of the
measures available to them, but in a constructive manner. As well as asking 'Are there any errors
in this measure?' we also have to ask 'Is there anything better available?' and, if not, 'How can I
improve what I've got?'
Improvements can often be made to the material at hand without resorting to the expense of
collecting new data.
We must feel entirely free to rework the numbers in a variety of ways to achieve the following
goals:
• to make them more amenable to analysis;
• to promote comparability;
• to focus attention on differences.
Consider various manipulations that can be applied to the data to achieve the above goals:
The change made to the data by adding or subtracting a constant is fairly trivial. Only
the level is affected; spread, shape and outliers remain unaltered. The reason for
doing it is usually to force the eye to make a division above and below a particular point.
A negative sign would be attached to all those incomes which were below the median in
the example above. However, we sometimes add or subtract a constant to bring the
data within a particular range.
The overall shape of the distributions in figures 3.1A and 3.1B is the same. The data points are all
in the same order, and the relative distances between them have not been altered apart from the
effects of rounding. The whole distribution has simply been scaled by a constant factor.
In SPSS it is very straightforward to multiply or divide a set of data by a constant value. For example,
using syntax, the command to create the variable drday ‘Average daily alcohol consumption’ from
the variable drating ‘Average weekly alcohol consumption’ is as follows:
COMPUTE DRDAY = DRATING/7.
Alternatively, to create a new variable ‘NEWVAR’ by multiplying an existing variable ‘OLDVAR’ by
seven the syntax would be:
COMPUTE NEWVAR = OLDVAR*7.
The ‘Compute’ command can also be used to add or subtract a
constant, for example:
COMPUTE NEWVAR = OLDVAR + 100.
COMPUTE NEWVAR = OLDVAR - 60.
The value of multiplying or dividing by a constant is often to promote comparability between
datasets where the absolute scale values are different. For example, one way to compare the cost
of a loaf of bread in Britain and the United States is to express the British price in dollars.
Percentages are the result of dividing frequencies by one particular constant — the total number
of cases.
A variable which has been standardized in this way (by subtracting a measure of level and dividing by
a measure of spread) is forced to have a mean or median of 0 and a standard deviation or midspread of 1.
Two different uses of variable standardization are found in social science literature. The first is in
building causal models, where it is convenient to be able to compare the effect that two different
variables have on a third on the same scale. The second use is more immediately intelligible:
standardized variables are useful in the process of building complex measures based on more than
one indicator.
In order to illustrate this, we will use some data drawn from the National Child Development Study
(NCDS). This is a longitudinal survey of all children born in a single week of 1958. There is a great
deal of information about children’s education in this survey. Information was sought from the
children’s schools about their performance at state examinations, but the researchers also decided
to administer their own tests of attainment.
Rather than attempt to assess knowledge and abilities across the whole range of school subjects,
the researchers narrowed their concern down to verbal and mathematical abilities. Each child was
given a reading comprehension test which was constructed by the National Foundation for
Educational Research for use in the study, and a test of mathematics devised at the University of
Manchester. The two tests were administered at the child’s school and had very different
methods of scoring. As a result they differed in both level and spread.
As can be seen from the descriptive statistics in figure 3.4, the sixteen-year-olds in the National Child
Development Study apparently found the mathematics test rather more difficult than the reading
comprehension test. The reading comprehension was scored out of a total of 35 and sixteen-year-
olds gained a mean score of 25.37, whereas the mathematics test was scored out of a possible
maximum of 31, but the 16-year-olds only gained a mean score of 12.75.
The first two columns of figure 3.5 show the scores obtained on the reading and mathematics test by
fifteen respondents in this study. There is nothing inherently interesting or intelligible about the raw
numbers. The first score of 31 for the reading test can only be assessed in comparison with what
other children obtained. Both tests can be thought of as indicators of the child’s general attainment
at school. It might be useful to try to turn them into a single measure of that construct.
In order to create such a summary measure of attainment at age 16, we want to add the two scores
together. But this cannot be done as they stand, because as we saw before, the scales of
measurement of these two tests are different. If this is not immediately obvious try the following
thought experiment. A 16-year-old who is average at reading but terrible at mathematics will
perhaps score 25.4 (i.e. the mean score) on the reading comprehension test and 0 on the
mathematics test. If these were summed the total is 25.4. However, a 16-year-old who is average
at mathematics but can’t read is likely to score 12.7 (i.e. the mean score) on the maths score and 0
on the reading comprehension. If these are summed the total would only be 12.7. If the two tests
can be forced to take the same scale, then they can be summed.
This is achieved by standardizing each score. One common way of standardizing is to first subtract
the mean from each data value, and then divide the result by the standard deviation. This process
is summarized by the following formula, where the original variable ‘Y’ becomes the standardized
variable ‘Z’:
Z = (Y − Ȳ) / s
where Ȳ is the mean of Y and s is its standard deviation.
For example, the first respondent's reading score of 31 standardizes to roughly 0.80. The same
individual's mathematics score becomes (17 − 12.75)/7, or 0.61. This first respondent is
therefore above average in both reading and maths. To summarize, we can add these two together
and arrive at a score of 1.41 for attainment in general.
Similar calculations for the whole batch are shown in columns 3 and 4 of figure 3.5. We can see that
the sixth person in this extract of data is above average in reading but slightly below average (by a
quarter of a standard deviation) in mathematics. It should also be noted that any individual scoring
close to the mean for both their reading comprehension and their mathematics test will have a total
score close to zero. For example, the tenth case in figure 3.5 has a total score of -0.02.
The final column of figure 3.5 now gives a set of summary scores of school attainment, created by
standardizing two component scores and summing them, so attainment in reading and maths
have effectively been given equal weight.
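A minimal Python sketch of the two-step recipe, using the first respondent's scores: the means come from figure 3.4, but the standard deviations of 7 are assumptions made here purely for illustration (the text only shows the mathematics score being divided by 7).
def standardize(score, mean, sd):
    return (score - mean) / sd                    # subtract the mean, divide by the standard deviation

# Means as quoted in figure 3.4; the standard deviations of 7 are illustrative assumptions.
z_reading = standardize(31, 25.37, 7)
z_maths = standardize(17, 12.75, 7)
print(round(z_reading + z_maths, 2))              # roughly 1.41, the first respondent's summary score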
It is very straightforward to create standardized variables using SPSS: by using the Descriptives
command, the SPSS package will automatically save a standardized version of any variable.
First select the menus Analyze, then Descriptive Statistics, then Descriptives.
The next stage is to select the variables that you wish to standardize, in this case N2928 and N2930,
and check the box next to ‘Save standardized values as variables.’ The SPSS package will then
automatically save new standardized variables with the prefix Z. In this example, two new variables
ZN2928 and ZN2930 are created.
Standardizing the variables was a necessary, but not a sufficient condition for creating a simple
summary score. It is also important to have confidence that the components are both valid
indicators of the underlying construct of interest.
However, many distributions do have a characteristic shape — a lump in the middle and tails
straggling out at both ends. How convenient it would be if there was an easy way to define a more
complex shape like this and to know what proportion of the distribution would lie above and below
different levels.
One such shape, investigated in the early nineteenth century by the German mathematician and
astronomer, Gauss, and therefore referred to as the Gaussian distribution, is commonly used. It is
possible to define a symmetrical, bell-shaped curve which looks like those in figure 3.8, and which
contains fixed proportions of the distribution at different distances from the centre. The two curves
in figure 3.8 look different — (a) has a smaller spread than (b) — but in fact they only differ by a
scaling factor.
Any Gaussian distribution has a very useful property: it can be defined uniquely by its mean and
standard deviation. Given these two pieces of information, the exact shape of the curve can be
reconstructed, and the proportion of the area under the curve falling between various points can
be calculated.
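For example, the proportion of a Gaussian distribution lying within a given distance of the mean can be computed directly from those two numbers; a small sketch using Python's standard library, with the standard Gaussian (mean 0, standard deviation 1) chosen purely as an illustration:
from statistics import NormalDist

dist = NormalDist(mu=0, sigma=1)                  # any mean and standard deviation fix the curve
within_one_sd = dist.cdf(1) - dist.cdf(-1)        # area between one sd below and one sd above the mean
print(round(within_one_sd, 3))                    # about 0.683 of the whole distribution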
This bell-shaped curve is often called ‘the normal distribution’. Its discovery was associated with
the observation of errors of measurement. If sufficient repeated measurements were made of the
same object, it was discovered that most of them centred around one value (assumed to be the true
measurement), quite a few were fairly near the centre, and measurements fairly wide of the mark
were unusual but did occur. The distribution of these errors of measurement often approximated to
the bell-shape in figure 3.8.
One approach would be to treat the distribution of incomes for each sex in each year as a separate
distribution, and express each of the quartiles relative to the median. The result of doing this is
given in figure 3.14.
IV) INEQUALITY
Prosperity and inequality:
There are a number of reasons why we might want to reduce inequality in society. For example, as
Layard (2005) argues, if we accept that extra income has a bigger impact on increasing the happiness
of the poor than the rich, this means that if some money is transferred from the rich to the poor this
will increase the happiness of the poor more than it diminishes the happiness of the rich. This in turn
suggests that the overall happiness rating of a country will go up if income is distributed more
equally. Of course, as Layard acknowledges, the problem with this argument is that it only works if it
is possible to reduce inequality without raising taxes to such an extent that individuals no longer have
an incentive to strive to make money, so that total income falls as a result of policies aimed at
redistribution. It is clearly important to understand the principal ways of measuring
inequality if we are to monitor the consequences of changing levels of inequality in society. This
chapter will focus on how we can measure inequality in such a way as to make it possible to
compare levels of inequality in different societies and to look at changes in levels of inequality over
time.
Income and Wealth:
Considered at the most abstract level, income and wealth are two different ways of looking at the
same thing. Both concepts try to capture ways in which members of society have different access to
the goods and services that are valued in that society. Wealth is measured simply in pounds, and is a
snapshot of the stock of such valued goods that any person owns, regardless of whether this is
growing or declining. Income is measured in pounds per given period, and gives a moving picture,
telling us about the flow of revenue over time.
For the sake of simplicity, we restrict our focus to the distribution of income. We will look in detail at
the problems of measuring income and then consider some of the distinctive techniques for
describing and summarizing inequality that have evolved in the literature on economic inequality.
There are four major methodological problems encountered when studying the distribution of
income:
1. How should income be defined?
2. What should be the unit of measurement?
3. What should be the time period considered?
4. What sources of data are available?
Definition of income
To say that income is a flow of revenue is fine in theory, but we have to choose between two
approaches to making this operational. One is to follow accounting and tax practices, and make a
clear distinction between income and additions to wealth. With this approach, capital gains in a
given period, even though they might be used in the same way as income, would be excluded from
the definition. This is the approach of the Inland Revenue, which has separate taxes for income and
capital gains. In this context a capital gain is defined as the profit obtained by selling an asset that
has increased in value since it was obtained. However, interestingly, in most cases this definition (for
the purposes of taxation) does not include any profit made when you sell your main home.
The second approach is to treat income as the value of goods and services consumed in a given
period plus net changes in personal wealth during that period. This approach involves constantly
monitoring the value of assets even when they do not come to the market. That is a very hard task.
So, although the second approach is theoretically superior, it is not very practical and the first is
usually adopted.
The definition of income usually only includes money spent on goods and services that are
consumed privately. But many things of great value to different people are organized at a collective
level: health services, education, libraries, parks, museums, even nuclear warheads. The benefits
which accrue from these are not spread evenly across all members of society. If education were not
provided free, only families with children would need to use their money income to buy schooling.
Sources of income are often grouped into three types:
• earned income, from either employment or self-employment;
• unearned income, which accrues from the ownership of investments, property, rents and so on;
• transfer income, that is benefits and pensions transferred on the basis of entitlement, not on the
basis of work or ownership, mainly by the government but occasionally by individuals.
Notice that the key term in the most widely used summary measures of inequality involves a weighted
sum of the data values, where the weight is the unit's rank order in the income distribution.
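One widely used summary with exactly this rank-weighted form is the Gini coefficient; a minimal Python sketch follows, where the income figures are invented and this is only one of several equivalent ways of writing the formula.
def gini(incomes):
    ordered = sorted(incomes)                          # poorest first, so rank order = position
    n = len(ordered)
    weighted = sum(rank * y for rank, y in enumerate(ordered, start=1))
    return (2 * weighted) / (n * sum(ordered)) - (n + 1) / n

print(round(gini([5000, 12000, 18000, 25000, 90000]), 2))   # about 0.49 for these invented incomes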
V) SMOOTHING TIME SERIES
The aim is to introduce a method for presenting time series data that brings out the underlying
major trends and removes any fluctuations that are simply an artefact of the ways the data have
been collected.
The total number of crimes recorded every year from 1965 to 1994, as shown in figure 5.1, is an
example of a time series.
Other examples might be the monthly Retail Price Index over a period of ten years, the monthly
unemployment rate or the quarterly balance of payment figures during the last Conservative
government. These examples all have the same structure: a well-defined quantity is recorded at
successive equally spaced time points over a specific period. But problems can occur when any
one of these features is not met - for example if the recording interval is not equally spaced.
The aim of smoothing
Figure 5.2 was constructed by joining points together with straight lines. Only the points contain
real information of course. The lines merely help the reader to see the points. The result has a
somewhat jagged appearance. The sharp edges do not occur because very sudden changes really
occur in numbers of recorded crimes. They are an artefact of the method of constructing the plot,
and it is justifiable to want to remove them. According to Tukey (1977, p. 205), the value of
smoothing is ‘the clearer view of the general, once it is unencumbered by detail’. The aim of
smoothing is to remove any upward or downward movement in the series that is not part of a
sustained trend.
Sharp variations in a time series can occur for many reasons. Part of the variation across time may
be error. For example, it could be sampling error.
In engineering terms we want to recover the signal from a message by filtering out the noise. The
process of smoothing a time series produces just such a decomposition of the data: what we might
understand in engineering as the signal and the noise appears in data analysis as the smooth and the rough.
Opinion polls
Opinion polls represent only a small fraction of all the social research that is conducted in Britain,
but they have become the public face of social research because they are so heavily reported.
Predicting who is going to win an election makes good newspaper copy. The newspaper industry was
therefore among the first to make use of the development of scientific surveys for measuring
opinion. In all general elections in Britain since the Second World War, polls have been conducted to
estimate the state of the parties at the time, and the number of such polls continues to grow. By-
elections and local elections are now also the subject of such investigations.
Opinion polls in Britain have historically almost always been conducted on quota samples. In such
a sample, the researcher specifies what type of people he or she wants in the sample, within broad
categories (quotas), and it is then left up to the interviewer to find such people to interview. In a
national quota sample, fifty constituencies might be selected at random, and then quotas set within
each constituency on age, sex and employment status. Interviewers would then have to find so
many women, so many unemployed and so many young people, etc. In the better quota samples,
such quotas are interlocked: the interviewer is told how many young housewives, how many male
unemployed and so on to interview. The idea is that when all these quotas are added together, the
researcher will be sure that the national profile on age, sex and employment status will have been
faithfully reproduced.
For example, since 2002, MORI’s ‘headline’ voting intention figure has been calculated by excluding
all those who are not ‘absolutely certain to vote’. This is measured by asking respondents to rate
their certainty to vote on a scale from 1 to 10, where 1 means absolutely certain not to vote and
‘10’ means absolutely certain to vote, and only those rating their likelihood of voting at ‘10’ are
included. Figure 5.5 shows MORI’s data on trends in voting intention leading up to the 2005 General
Election, held on 5 May.
Techniques
To smooth a time series we replace each data value by a smoothed value that is determined by
the value itself and its neighbours. The smoothed value should be close to each of the values which
determine it except those which seem atypical. We therefore want some form of resistant numerical
summary — some local typical value.
This involves two decisions: which neighbouring points are to be considered local and which changes
are atypical? The answers to these questions must depend in part on the particular problem, but this
chapter presents some multipurpose procedures which give generally satisfactory results. These
procedures answer the two questions as follows: take one point either side as local and treat as
real an upward or downward change of direction which is sustained for at least two successive
points.
Summaries of Three
The simplest such resistant average is to replace each data value by the median of three values: the
value itself, and the two values immediately adjacent in time. Consider, again, the percentage of
respondents who intended to vote Labour (column 2 of data in figure 5.5). To smooth this column,
we take the monthly figures in groups of three, and replace the value of the middle month by the
median of all three months.
In March, April and May 2003, the median is 43 per cent, so April’s value is unchanged. In April, May
and June 2003, the median is 41 per cent, so the value for May is altered to 41 as shown in figure
5.7. The process is repeated down the entire column of figures.
Since, for the purpose of this exercise, we are supposing that the January 2003 and April 2005 rates
are unknown, we simply copy on the first and last values, 41 and 37, for February 2003 and March
2005. More sophisticated rules for smoothing these end values are available, but discussion of them
is postponed for the present.
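A minimal Python sketch of this running median of three, with the end values copied on as just described; the short series is invented, standing in for the Labour column of figure 5.5, and repeating the procedure until nothing changes gives the '3R' smooth referred to below.
from statistics import median

def smooth3(series):
    smoothed = list(series)                        # first and last values are simply copied on
    for t in range(1, len(series) - 1):
        smoothed[t] = median(series[t - 1:t + 2])  # median of the value and its two neighbours
    return smoothed

labour = [41, 43, 43, 41, 38, 41, 40, 37]          # invented monthly percentages, not the MORI figures
print(smooth3(labour))                             # [41, 43, 43, 41, 41, 40, 40, 37]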
One other possible method of smoothing would be to use means rather than medians. The result of
using the mean of each triple instead of the median is shown in columns 4 and 5 of figure 5.8.
Hanning
Although smoothing by repeated medians of three is adequate for most purposes and successfully
dealt with seemingly atypical values, the results still have a somewhat jagged appearance. One way
to smooth off the corners would be to use running means of three on the 3R smooth. However, we
can do better than taking simple means of three. This would give equal weight, one-third, to each
value. As the data have already been smoothed, it would seem sensible to give more weight to the
middle value.
A procedure called hanning does this: given any three consecutive data values, the adjacent values
are each given weight one-quarter, whereas the middle value, the value being smoothed, is given
weight one-half. This is achieved in the following way: first calculate the mean of the two
adjacent values — the skip mean — thus skipping the middle value; then calculate the mean of the
value to be smoothed and the skip mean. It is easy to show that these two steps combine to give
the required result.
In practice, we first form a column of skip means alongside the values to be smoothed and then form
a column of the required smoothed values.
This procedure is depicted above for the first three values of the repeated median smooth, shown in
full in figure 5.10.
Thus 43 is the value to be smoothed, the skip mean 42 is the mean of 41 and 43 and the smoothed
value 42.5 is the mean of 43 and 42. A new element of notation has been introduced into figure
5.10: the column of hanned data values is sometimes labelled ‘H’.
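A minimal Python sketch of hanning, written exactly as the two-step recipe above; the input is the invented median-smoothed series from the earlier sketch.
def hann(series):
    smoothed = list(series)                              # end values copied on unchanged
    for t in range(1, len(series) - 1):
        skip_mean = (series[t - 1] + series[t + 1]) / 2  # mean of the two adjacent values
        smoothed[t] = (series[t] + skip_mean) / 2        # quarter, half, quarter weighting
    return smoothed

print(hann([41, 43, 43, 41, 41, 40, 40, 37]))            # 43 with neighbours 41 and 43 becomes 42.5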
The results are plotted in figure 5.11 and this also displays the percentage of individuals saying they
would vote Conservative and Liberal Democrat over the same period. Hanning has produced a
smoother result than repeated medians alone. Whether the extra computational effort is
worthwhile depends on the final purpose of the analysis. Repeated medians are usually sufficient
for exploratory purposes but, if the results are to be presented to a wider audience, the more
pleasing appearance that can be achieved by hanning may well repay the extra effort.
Residuals
Having smoothed time series data, much can be gained by examining the residuals between the
original data and the smoothed values, here called the rough. Residuals can tell us about the
general level of variability of data over and above that accounted for by the fit provided by the
smoothed line. We can judge atypical behaviour against this variability, as measured, for example,
by the midspread of the residuals.
Ideally we want residuals to be small, centred around zero and patternless, and, if possible,
symmetrical in shape with a smooth and bell-shaped appearance. These properties will indicate
that the residuals represent little more than negligible random error and that we are not distorting
the main patterns in the data by removing them. Displaying residuals as a histogram will reveal
their typical magnitude and the shape of their distribution. Figure 5.12 shows the histogram of the
residuals from the repeated median and hanning smooth (the 3RH for short). This shows that the
residuals are small in relation to the original data, fairly symmetrical, centred on zero and devoid of
outliers.
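Continuing the earlier sketches (and reusing the smooth3 and hann functions defined there), the rough is obtained by subtracting the smooth from the original data:
data = [41, 43, 43, 41, 38, 41, 40, 37]               # the invented series used above
fit = hann(smooth3(data))                              # a 3RH-style smooth built from the earlier sketches
rough = [round(d - f, 2) for d, f in zip(data, fit)]   # rough = data minus smooth
print(rough)                                           # ideally small, centred on zero and patternless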
Refinements
There are a number of refinements designed to produce even better smooths. We can only give
cursory attention to these here but more details are given in books by Tukey (1977) and Velleman
and Hoaglin (1981). Before discussing how the first and last values in a time series might also be
smoothed, it is helpful to introduce a convenient special notation: y1, y2, ..., yN, or yt in general; yt
refers to the value of the quantity y recorded at time t. It is conventional to code t from 1 to N, the
total period of observation. For example, in figure 5.10 the months February 2003 to March 2005
would be coded from 1 to 26.
Endpoint smoothing
Instead of copying on y1, we first create a new value to represent y at time 0, January 2003. This will
give us a value on either side of y1, so that it can be smoothed. This value is found by extrapolating
the smoothed values for times 2 and 3, which we will call z2 and z3, and this is shown graphically in
figure 5.13.
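One common version of this rule (following Tukey) takes the extrapolated value at time 0 to be 3z2 − 2z3 and then replaces the end value by the median of that extrapolated value, the original end value and z2; a small sketch with invented numbers:
from statistics import median

def smooth_endpoint(y1, z2, z3):
    extrapolated = 3 * z2 - 2 * z3        # straight line through (2, z2) and (3, z3), read off at time 0
    return median([extrapolated, y1, z2]) # then smooth the end value as a median of three, as before

print(smooth_endpoint(41, 42.5, 42.5))    # 42.5 with these invented values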