
Statistics Applied to Social Sciences I

Topic 1. Introduction to statistical methods in social sciences.

1.1 Origins and historical evolution of Statistics.


Origins: The origin of Statistics dates back to the beginning of history. In ancient
Egyptian monuments (3050 BC) interesting documents were found that show that they
kept track of population movements and continually took censuses.

Ancient civilizations (Babylonians, Egyptians, Chinese, Greeks, Romans) collected data on
population, agricultural production and income, which they used to collect taxes. Such an
amount of information had to be summarized in numerical values for easy interpretation
and use in political decision making. Statistics is therefore a tool to describe the
population.

The beginning of Statistics as a science can be traced back to the 17th century, when John
Graunt published a study on mortality in different parishes of London in 1662. From the data
collected on the population it becomes necessary, for example, to obtain life expectancy in
order to calculate pensions. The influence of Probability Theory begins. Several figures strive to
improve these statistics to serve the State. Achenwall coined the word statistics in 1760;
it has its roots in the word “statesman”, which comes from the Latin term “status”,
meaning “things of the state”. In addition to describing, the aim is now to
discover analogies and statistical regularities.

The history of the Calculus of Probabilities is related to the evolution of
games of chance. The most important contribution to the synthesis of both
disciplines is due to Adolphe Quetelet (Belgian astronomer and
mathematician, 1796-1874), who applied statistics to social issues, trying to
estimate average social characteristics of the members of a community
(he is regarded as the father of Quantitative Sociology).

19th century: Statistics begins to evolve as a science that provides new
tools for the development of other sciences. It becomes a science in its own right.

At the beginning of the 20th century, the work of the statistician consists of gathering and
tabulating data, then processing and interpreting the information they provide so that
it can be used to make predictions or decisions. That is to say, the two aspects that had
existed until that moment, Descriptive Statistics and Probability Theory, merge.
Inferential Statistics arises.

The second half of the 20th century is the modern era of Statistics, in which the
appearance of high-power computers revolutionizes its methodology and opens enormous
possibilities for the construction of more complex models.

Hipolito Hernandez Perez
Department: Mathematics, Statistics and Operations Research

In the 21st century, Statistics continues to develop. Currently, the interest of
researchers is focused on improving existing models and studying new models that
adapt to the needs of society and the development of other sciences.

Currently, Statistics is probably one of the most used and studied disciplines in all
areas of knowledge.

• In Medicine, in the study of diseases and their treatments...
• In Biology, in genetic studies, species classification, environmental studies...
• In Economics, to evaluate the acceptance of a product before marketing it, to
measure the evolution of prices, to study consumer habits, to make forecasts
about the behavior of certain stocks on the stock market...
• In Political Science, to carry out studies of voter preferences before a vote
through polls that help candidates decide their strategies...
• In Psychology, to develop tests and quantify various aspects of human behavior
(for example, the tests that are applied to candidates for a position in a
company)...
• In Sociology, to study the opinions of social groups on different topics, to
characterize various populations by measuring the relationships between variables
and making predictions about them...

1.2 Statistics as a science.


We live in a world full of figures in which every day the media addresses us with
figures on various topics: unemployment, divorce, birth rates, diseases, public
spending, minimum wage, traffic accidents, population growth rates, tourism, political
tendencies, etc.

Thanks to statistics we can collect, organize and present data relating to any experiment or
phenomenon. Statistics is fundamental in the Social Sciences because it allows us to
describe, analyze, predict and model social reality.

Statistics is a set of methods to collect, classify, represent and summarize data,
as well as to make scientific inferences (draw consequences) from them.

We can divide it into two large blocks:

• Descriptive Statistics: set of methods necessary for the collection,
classification, representation and summary of the data provided by an
experiment. It has no inferential value.
• Inferential Statistics: set of methods that allow us, from the results of a
sample, to obtain valid conclusions for a population.


Historically, Statistics began as a descriptive discipline. It was necessary, above all, to
accumulate information, examine it critically, prepare it, analyze it and synthesize it.
Later, once analogies had been verified, statistical regularities had been discovered,
a certain number of standard distributions had been recognized, and some quite
general forms of structural dependence had been observed, Statistics became explanatory, thanks, in
particular, to the contribution of the Calculus of Probabilities.

Probability Theory is the formal mathematical instrument that Statistics uses
to handle information subject to uncertainty.

In line with its origins, the terms population and statistical unit are used to refer to
groups and individuals.

The population is the set of beings or objects about which we wish to obtain
information.

A statistical unit, individual or element is each member of the population.

Depending on the number of individuals from whom information is collected, we speak of
a census or a sample. A census consists of recording certain characteristics of the
ENTIRE population, while a sample is the subset of individuals in the population
that are observed in order to obtain information about the total population to which they belong.

Generally two phases can be distinguished in carrying out any scientific experiment or
study. A first, which consists of the observation and analysis of the events that occur
(collection of information, data collections) and a second, of interpretation and
drawing conclusions. Descriptive Statistics is the first tool for managing data and
provides methods to summarize and organize them. In the second phase of a study, we
are usually interested in making inferences about the population from the sample
data; for that we use Inferential Statistics, which requires the calculation of
probabilities.

Statistics is the science that allows us to SYSTEMATIZE, COLLECT (sampling), ORDER
AND PRESENT data referring to a phenomenon that presents variability, for its
methodical study (Descriptive Statistics), in order to DEDUCE THE LAWS that govern
these phenomena (Probability) and thus be able to make forecasts about them, MAKE
DECISIONS or obtain conclusions (Inferential Statistics).

For example, with Statistics we could address the following problems.

- Study if in a certain group there is salary discrimination due to the sex of the
employee.

- Determine the profile of workers in terms of economic and social conditions in
different communities.


- Study the consumption of people in a given area in terms of clothing, food, leisure
and housing.

- Determine the time that workers in different companies in the country dedicate to work
and family.

- Study the sociodemographic profile of the students of a University or the monthly mobile
phone spending of the students of a University, and whether this has any relationship with their
age or other characteristics.

1.3 Statistics and social sciences.


Statistics plays a very important role in the development of social sciences. We must
not forget that its origin is closely linked to the interest in quantifying the social
aspects of states. Therefore, many applications of Statistics can be found in numerous
areas of the Social Sciences. Some of them are:

• Public administration. Public administrations must know data about the
inhabitants in order to plan the development of services, infrastructure,
political actions, etc.
• Sociology. In general, Sociology studies social relationships and their dynamics
over time. The phenomena that Sociology deals with present individual
properties of the subjects that make up social groups.
The appropriate tool for understanding this type of phenomena is Statistics.
The description of social institutions, their organization and interrelationships,
the analysis and comparison of the structure of social systems are all fields
where the help of Statistics is called for.
• Demography. The development of Statistics has been linked to demographic studies of
the population. The object of this discipline is the static and dynamic description of the
population.
• Social research. Nowadays, knowledge of “public opinion” is very important.
Politicians, businesspeople, public administrators, etc. need to know what
people think in order to make decisions.
• Economics. Since Economics handles numerical data, the use of statistical
methods in this discipline is natural. Some examples are calculations of index
numbers (such as the Consumer Price Index) and Time Series studies (values
that change over time). On the other hand, many economic theories use
statistical models to describe economic phenomena. These types of ideas have
opened a specialization of Statistics called Econometrics.

• Psychology and Education. Psychological and educational studies have also
given rise to some important statistical techniques, such as factor
analysis. Behavioral studies, attitude profiles, vocational orientation tests,
selection of workers in companies, etc. are some examples where
Statistics is used in psychology and education.
• Humanities. Although the world of the Humanities may seem at first glance an
unusual terrain for the application of Statistics, in recent times new methodologies
have been initiated in history, geography and literature that use Statistics among
their methods.
• Legal Sciences. Some applications of Statistics can also be found in the world
of law. Specifically, in Criminology, statistical tools can be used in the study of
crime prevention. Some applications have also been developed to judge the
reliability of witnesses.

1.4 Phases of a statistical investigation.


When a statistical study is carried out, it is necessary to plan the different phases that said
study entails. Without a doubt, adequate preparation of the research, prior to field work,
will facilitate the task of the subsequent study. We are going to divide a statistical study
into several phases.

• Determination of the research objective. Although it may seem obvious, we
must keep in mind that, before beginning a statistical investigation, it is
necessary to define its objectives.
• Data collection. The data collection phase must obey a perfectly established
logistical and organizational plan. This procedure will be conditioned by the
available resources. It is also necessary to consider a time frame for data
collection. It is advisable to set restrictions on data collection, for example,
precision of measurements, time limits, number of observations, etc. It may be
useful to take a small pilot sample first if you lack experience with this type of
data.
• Counting and systematization of data. Individual observations must undergo
a process of organization, counting and systematization so that they become a
reliable source. In this phase, possible errors in data collection will be detected.

• Data analysis and research evaluation. In this phase, the statistical technical
analysis of the observations collected is carried out. For this, the entire set of
procedures provided by Statistics is available. The examination of the results
will lead to the establishment of conclusions and possible actions.


Topic 2. Data analysis.

2.1. Variables and observations.


Each trait or characteristic of the elements of a population is called a character,
statistical variable or, simply, variable.

2.1.1. Classification of variables.


Depending on the type of values that the variables take, we distinguish different types of
variables. Variables can essentially be of two types: qualitative and quantitative.

Qualitative variables (or attributes) are those that do not appear in numerical form, but as
categories or attributes. That is, their values are categories, and these values differ by a
quality, not by a quantity. For example: political party for which an individual voted; region in
which he or she lives; sex; marital status; brand of car driven, etc. Note that algebraic
operations cannot be performed on these variables.

Within the qualitative variables we can distinguish between:


- Ordinal variables: their values can be ordered. For example: social class (low, middle,
upper); opinion on a political proposal (strongly against, rather against, indifferent,
rather in favor, strongly in favor); and the degree of satisfaction in dealing with healthcare
personnel (very satisfied, satisfied, slightly satisfied).
- Nominal variables: their values cannot be ordered. For example: sex (man, woman); smoker
(yes, no); and marital status (single, married, separated, widowed).

Quantitative variables are those that can be expressed numerically, such as weight,
number of goals in a soccer match, temperature, annual income, grade on an exam,
number of years of education, kilometers of distance between work and residence... Within
the quantitative variables we can distinguish between:
- Discrete variables: they take values in the set of integers. For example: number of
siblings (1, 2, 3, ..., but it can never be 3.45); number of coins that a person carries
in his or her pocket (0, 1, 2, ...); age of a person in completed years (1, 2, 3, ...).
- Continuous variables: they take any value within a real interval. For example: the
speed of a vehicle (80.3 km/h, 94.57 km/h...); height of people (1.65 m, 1.58 m...);
weight of people (55.3 kg, 68.2 kg...).


Sometimes discrete variables are treated as continuous variables (since they take on many
possible values), for example the salary of a person in a certain population. And vice versa,
some continuous variables are treated as discrete, for example the age of a person when we
are only interested in specifying up to the years.

Statistical variables can also be classified into:


- One-dimensional variables: They only collect information about one characteristic (for
example: age of the students in a class).
- Two-dimensional variables: they collect, at the same time and on the same individual,
information about two characteristics of the population, which may or may not be related, (for
example: age and height of the students in a class).
- Multidimensional variables: they collect, at the same time and on the same individual, information
on three or more characteristics of the population, which may or may not be related (for example:
age, height and weight of the students in a class).

2.2. One-dimensional frequency distributions.

2.2.1. Frequencies.
Given a statistical variable, which we denote by $X$, on which we take $n$ observations,
let $x_1, x_2, \dots, x_k$ be the distinct values (or modalities) that the variable $X$
presents. We assume that these values are ordered from lowest to highest
(unless it is a nominal variable, which admits no order). We then define the
following frequencies:

- The absolute frequency of a value (or modality) $x_i$ is the number of observations that have
that value (or modality). We denote it by $n_i$. Since the total number
of observations is $n$, it holds that $n_1 + n_2 + \dots + n_k = n$.
- The relative frequency of a value (or modality) $x_i$ is the proportion of observations that
have that value (or modality). We denote it by $f_i$. That is:

$f_i = \frac{n_i}{n}$

The sum of all the relative frequencies must be equal to 1, that is,
$f_1 + f_2 + \dots + f_k = 1$. Sometimes the relative frequency is expressed as a
percentage, $100 \cdot f_i$.

- The cumulative absolute frequency of a value $x_i$ is the number of observations that are
less than or equal to that value (or modality). If we denote it by $N_i$, we have

$N_i = n_1 + n_2 + \dots + n_i$

- The cumulative relative frequency of a value $x_i$ is the proportion of observations that
are less than or equal to $x_i$. If we denote it by $F_i$, it holds that

$F_i = f_1 + f_2 + \dots + f_i = \frac{N_i}{n}$


Cumulative frequencies do not make sense for nominal qualitative variables, since it is not
possible to establish a valid order in them.

NOTE: Some books, such as Vélez Ibarrola and others, use a different notation to refer to the
different frequencies:

Frequency: Notation of this course Notation of Vélez Ibarrola


absolute
relative
cumulative absolute - -
Cumulative relative
In other words, when the expression F appears in that book, it refers to what we
denote by n, and vice versa.

Notation. In mathematics, to avoid writing out long sums or leaving them abbreviated with dots,
a compact expression is used to denote them. This expression uses the uppercase Greek letter
sigma, $\Sigma$, and is read “summation”. In this way the following two expressions are equivalent:

$n_1 + n_2 + \dots + n_k = \sum_{i=1}^{k} n_i$

2.2.2. Frequency tables.


The values (or modalities) of a statistical variable, along with their absolute and/or relative
frequencies, are presented in the form of a table (called a frequency table):

Values    Absolute       Relative       Cumulative absolute    Cumulative relative
          frequencies    frequencies    frequencies            frequencies
$x_1$     $n_1$          $f_1$          $N_1$                  $F_1$
$x_2$     $n_2$          $f_2$          $N_2$                  $F_2$
⋮         ⋮              ⋮              ⋮                      ⋮
$x_k$     $n_k$          $f_k$          $N_k = n$              $F_k = 1$
Total     $n$            1

Suppose the variable “marital status” that we consider takes the values: single, married, separated
and widowed. We have measured this variable on 20 individuals and the following values have been
obtained: {single, single, widower, married, married, separated, widower, single, married, separated,
married, married, separated, single, widower, single, single, single, married, single}

As you can see, there are 8 single individuals, 6 married, 3 separated and 3 widowed. These
values are the absolute frequencies, from which we can build the following table:

Variable value    $n_i$    $f_i$
Single            8        0.40
Married           6        0.30
Separated         3        0.15
Widowed           3        0.15
Total             20       1.00
Note that since it is a nominal variable (in which there is no order of the different modalities
that the variable takes) the accumulated frequencies do not make sense.
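
The counts in this table can be checked with a few lines of code. The following is a minimal sketch in Python (the choice of language and the variable names are assumptions made for illustration, not part of the course material); it only computes absolute and relative frequencies, since cumulative ones are not meaningful for a nominal variable.

```python
from collections import Counter

# The 20 observations of "marital status" listed in the example
data = ("single single widower married married separated widower single "
        "married separated married married separated single widower single "
        "single single married single").split()

counts = Counter(data)        # absolute frequency of each modality
n = len(data)                 # total number of observations

# One row per modality: name, absolute frequency, relative frequency
for status, n_i in counts.most_common():
    print(f"{status:<10} {n_i:>3} {n_i / n:.2f}")
```
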

An example of a discrete statistical variable is the following. Suppose that we have measured
the variable number of children for a set of 15 families and obtained the results {1,
2, 1, 3, 2, 2, 4, 1, 1, 1, 0, 0, 2, 0, 5}. The frequency table would be:

Value $x_i$    $n_i$    $f_i$     $N_i$    $F_i$
0              3        0.2       3        0.2
1              5        0.3333    8        0.5333
2              4        0.2667    12       0.8
3              1        0.0667    13       0.8667
4              1        0.0667    14       0.9333
5              1        0.0667    15       1
Total          15       1
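
As an illustration, the four kinds of frequency in this table can be computed directly from the raw data. This is a minimal Python sketch under the definitions given earlier; the list names `n_i`, `f_i`, `N_i` and `F_i` are chosen here to mirror the usual notation and are an assumption of this example.

```python
from collections import Counter
from itertools import accumulate

children = [1, 2, 1, 3, 2, 2, 4, 1, 1, 1, 0, 0, 2, 0, 5]
n = len(children)                      # total number of observations

counts = Counter(children)
values = sorted(counts)                # distinct values, lowest to highest
n_i = [counts[x] for x in values]      # absolute frequencies
N_i = list(accumulate(n_i))            # cumulative absolute frequencies
f_i = [c / n for c in n_i]             # relative frequencies
F_i = [c / n for c in N_i]             # cumulative relative frequencies

for row in zip(values, n_i, f_i, N_i, F_i):
    print("{:>2} {:>3} {:.4f} {:>3} {:.4f}".format(*row))
```

Dividing the cumulative counts by the total, rather than adding up rounded relative frequencies, avoids accumulating rounding error in the last column.
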

The analysis of the frequency distribution of continuous quantitative variables is more
complex than in the previous cases. The reason is that the categories are no longer given
naturally by the variable but must be chosen. Therefore, the first step in constructing the
frequency distribution table is to divide the set of possible values of the variable into
classes or intervals that do not overlap. The difference between the upper and
lower ends of an interval is called the interval width, and the midpoint of each interval is
called the class mark, usually denoted by $x_i$. It can also be useful to group data
into intervals when dealing with a discrete variable with many different values.

Once the classes have been chosen, the frequency distribution is built in the same way as in the
previous cases.

Example: We have the following grades obtained in an exam by 20 different students
(evaluated between 0 and 10 points): {0.5, 1.9, 2.3, 2.5, 3.2, 3.7, 3.9, 4.1, 4.3, 4.9, 5.3, 5.5, 5.8,
6.5, 6.8, 7.2, 8.1, 8.5, 8.8, 9.3}

The first step is to create the classes or intervals: [0,1), [1,2), …, [9,10].

The frequency distribution is:

Interval    Class mark ($x_i$)    $n_i$    $f_i$    $N_i$    $F_i$
[0,1)       0.5                   1        1/20     1        1/20
[1,2)       1.5                   1        1/20     2        2/20
[2,3)       2.5                   2        2/20     4        4/20
[3,4)       3.5                   3        3/20     7        7/20
[4,5)       4.5                   3        3/20     10       10/20
[5,6)       5.5                   3        3/20     13       13/20
[6,7)       6.5                   2        2/20     15       15/20
[7,8)       7.5                   1        1/20     16       16/20
[8,9)       8.5                   3        3/20     19       19/20
[9,10]      9.5                   1        1/20     20       20/20
Total                             20       1
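
The grouping step can be sketched in code as follows. This is a minimal Python illustration, assuming half-open intervals with the last one closed on the right, as in the table; the helper name `class_index` is hypothetical.

```python
grades = [0.5, 1.9, 2.3, 2.5, 3.2, 3.7, 3.9, 4.1, 4.3, 4.9,
          5.3, 5.5, 5.8, 6.5, 6.8, 7.2, 8.1, 8.5, 8.8, 9.3]
edges = list(range(11))    # 0, 1, ..., 10 give the intervals [0,1), ..., [9,10]

def class_index(x, edges):
    """Index of the interval containing x; the last interval is closed on the right."""
    if x == edges[-1]:
        return len(edges) - 2
    for i in range(len(edges) - 1):
        if edges[i] <= x < edges[i + 1]:
            return i
    raise ValueError(f"{x} lies outside the range of the classes")

# Absolute frequency of each class
counts = [0] * (len(edges) - 1)
for g in grades:
    counts[class_index(g, edges)] += 1

# Class mark: midpoint of each interval
marks = [(edges[i] + edges[i + 1]) / 2 for i in range(len(counts))]
```
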

2.3. Graphical representations of frequency distributions.


We have seen that the frequency distribution summarizes the data we have so that it can
be analyzed in a simpler way and provides us with more information.

Even more illuminating is the use of graphs and diagrams since at a glance we can
realize the characteristics of our sample. The graphical representation of a frequency
distribution depends on the type of variables we are considering.

2.3.1 Graphs for qualitative variables.


The basic principle of the representation of qualitative variables is proportionality between
areas and frequencies. The most important representations are pie charts, bar
charts and pictograms.

Sector diagrams (pie charts).
A circle (360º) is divided into as many circular sectors as the qualitative variable has
values, assigning to each sector a central angle proportional to the absolute ($n_i$) or
relative ($f_i$) frequency, so that the area of each sector is proportional to the frequency
it represents. The amplitude (in degrees) of the circular sector
corresponding to each modality is therefore $f_i \cdot 360$.
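
As a small worked illustration, the central angles for the marital status example of section 2.2.2 can be computed as below. This is a sketch in Python; the dictionary of counts is copied from that example and the variable names are assumptions.

```python
# Absolute frequencies from the marital status example
counts = {"single": 8, "married": 6, "separated": 3, "widowed": 3}
n = sum(counts.values())

# Central angle of each sector: relative frequency times 360 degrees
angles = {status: n_i * 360 / n for status, n_i in counts.items()}

for status, angle in angles.items():
    print(f"{status:<10} {angle:6.1f} degrees")
```

The angles add up to 360 degrees because the relative frequencies add up to 1.
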

Bar chart.
It consists of constructing as many rectangles as the qualitative variable has values, all of
them with a base of equal width. The height is taken equal to the absolute ($n_i$) or relative ($f_i$)
frequency, depending on the frequency distribution that we want to represent, thus obtaining
rectangles with areas proportional to the frequencies that we want to represent.


Note: Some authors call this graph a Rectangle Diagram when it is used to represent
qualitative variables.

Pareto chart
It is a bar diagram but in which the values of the variable appear ordered from highest to
lowest frequency and at the top of the graph a line is drawn that represents the
accumulated frequency.
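
The data behind a Pareto chart can be prepared as in the following Python sketch. The modalities and counts are hypothetical, invented only for illustration, and expressing the cumulative line as a percentage is an assumption of this example.

```python
from itertools import accumulate

# Hypothetical counts of complaint types, for illustration only
counts = {"delay": 12, "price": 30, "quality": 5, "service": 3}

# Order the modalities from highest to lowest frequency (the bars)
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
total = sum(counts.values())

# Cumulative percentage after each bar (the line drawn above the bars)
cumulative = [100 * c / total for c in accumulate(n_i for _, n_i in ordered)]

for (name, n_i), pct in zip(ordered, cumulative):
    print(f"{name:<8} {n_i:>3} {pct:6.1f}%")
```
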

Pictograms.
Qualitative variables also allow a very visual representation through drawings, icons,
symbols, maps, etc. These graphs are generally called pictograms. To make them, you
must take into account that for each modality the size (magnification pictograms) or
the number of repetitions (repeat pictograms) must be proportional to the absolute ($n_i$)
or relative ($f_i$) frequency.


2.3.2. Graphs for discrete quantitative variables.

Bar chart.
It consists of raising, for each value of the variable, a rectangle whose height is its absolute
frequency ($n_i$) or relative frequency ($f_i$), depending on the frequency distribution that we
want to represent.

Example: Suppose that we have measured the variable number of children for a set of 15 families
and the following data have been obtained: {1, 2, 1, 3, 2, 2, 4, 1, 1, 1, 0, 0, 2, 0, 5}.

Cumulative bar chart.
If what we want to represent are the cumulative frequencies, we proceed as in the
previous case, raising above each value of the variable a rectangle whose height is its cumulative
absolute frequency ($N_i$) or cumulative relative frequency ($F_i$).


Frequency polygon.
It is built on the bar diagram by joining the upper midpoints of each bar.

Cumulative curve, distribution curve or cumulative frequency diagram. The
consideration of cumulative frequencies allows us to introduce a new way of
representing discrete quantitative variables. This graph is shaped like a staircase, with
jumps that occur at each of the values of the variable and have a magnitude equal to
the relative frequency of that value. In other words, the curve F(x) gives, at every point x,
the proportion of observations less than or equal to x.
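
The staircase function can be evaluated at any point with a short sketch like the one below. It is written in Python and follows the definition of cumulative relative frequency given in section 2.2 (observations less than or equal to the value); the function name `F` and the use of the number-of-children data are choices made for this illustration.

```python
from bisect import bisect_right

# Number of children in the 15 families of the earlier example, sorted
children = sorted([1, 2, 1, 3, 2, 2, 4, 1, 1, 1, 0, 0, 2, 0, 5])

def F(x, data=children):
    """Proportion of observations less than or equal to x (data must be sorted)."""
    return bisect_right(data, x) / len(data)

for x in (-1, 0, 0.5, 1, 2, 5):
    print(x, round(F(x), 4))
```

Between two consecutive observed values the function stays flat, and at each observed value it jumps by that value's relative frequency, which is what gives the graph its staircase shape.
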


2.3.3. Graphs of continuous variables (or discrete variables grouped into intervals).

Histogram.
The histogram is the most used graph when we work with continuous quantitative
variables. Frequencies are represented by areas. It is also sometimes used for discrete
variables that take many different values and have been grouped into intervals.
Unlike the bar diagram, the rectangles are drawn contiguous to reflect the idea of
continuity.

We can describe the construction of the histogram as follows:

1. The range of possible values of the variable is determined, based on the minimum and
maximum values observed in the data.
2. The range is divided into k class intervals, $[e_i, e_{i+1})$ for $i = 1, \dots, k$, each formed
by the values of the variable that fall in that interval. The amplitude (width) of each class is the
difference between the upper end and the lower end of the interval.
3. The class mark $x_i$ is calculated, which is the midpoint of the class interval:

$x_i = \frac{e_i + e_{i+1}}{2}$

4. The absolute frequency $n_i$ of each class interval is calculated by counting the
number of observations that fall within it.
5. The histogram rectangles are drawn so that the base coincides with the class interval
and the area is proportional to the frequency of the interval. So the height ($h_i$) must
be equal to or proportional to:

$h_i = \frac{n_i}{e_{i+1} - e_i}$
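
The height rule in step 5 matters mainly when the classes have different widths. The following Python sketch applies it to the 20 exam grades used earlier, with class limits that are hypothetical and chosen only to illustrate unequal widths; `lo` and `hi` stand for the lower and upper end of each class.

```python
grades = [0.5, 1.9, 2.3, 2.5, 3.2, 3.7, 3.9, 4.1, 4.3, 4.9,
          5.3, 5.5, 5.8, 6.5, 6.8, 7.2, 8.1, 8.5, 8.8, 9.3]

# Hypothetical class limits of unequal width: [0,4), [4,6), [6,10]
edges = [0, 4, 6, 10]

heights = []
for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
    last = i == len(edges) - 2          # the last class is closed on the right
    n_i = sum(1 for g in grades if lo <= g < hi or (last and g == hi))
    heights.append(n_i / (hi - lo))     # height = frequency / class width

print(heights)
```

With these limits the middle class is the tallest rectangle even though it holds fewer observations than its neighbours, because frequency is represented by area, not by height.
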

There are some additional practical observations worth highlighting:

a) The range must include the minimum and maximum values observed in the
data. It may be convenient to extend it beyond these limits, up and down, so
that the interval limits do not take up many decimal places.
b) Although it is not a strict requirement, it is advisable that all class intervals have equal
width. In this way, the height of the rectangle is proportional to the frequency, and we can
represent the absolute frequency ($n_i$) or relative frequency ($f_i$) on the Y axis.
c) The construction of class intervals produces the effect of discretizing a continuous
variable. The class mark can be considered a value that represents the entire interval, in
particular for calculation purposes.
d) There is no predetermined rule for how many classes we should choose, except
that every value must belong to one and only one class. Too many classes can
cause irregularities in the representation, because some classes end up with very
low frequencies by chance. Conversely, too few classes can lead to a greater
loss of information. Some authors suggest around $\sqrt{n}$ classes; others (Sturges'
formula) suggest $1 + \log n / \log 2$ classes.
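
Both rules of thumb are easy to evaluate. The sketch below is in Python; rounding the result up to the next integer is an assumption of this example, since the text does not say how to round.

```python
import math

def sqrt_rule(n):
    """About sqrt(n) classes, rounded up (the rounding is an assumption)."""
    return math.ceil(math.sqrt(n))

def sturges_rule(n):
    """1 + log(n) / log(2) classes, rounded up (the rounding is an assumption)."""
    return math.ceil(1 + math.log(n) / math.log(2))

for n in (20, 100, 1000):
    print(n, sqrt_rule(n), sturges_rule(n))
```

For small samples the two rules give similar answers; as n grows, the square-root rule suggests many more classes than Sturges' formula.
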


NOTE. Some books, such as Vélez Ibarrola and others, also use the name histogram when a
discrete variable is represented. We will only call a histogram the graph in which data
grouped into intervals are represented, that is, the one in which the
rectangles are adjacent to each other.

Example: We have the following grades obtained in an exam by 20 different students


(evaluated between 0 and 10 points): {0.5, 1.9, 2.3, 2.5, 3.2, 3.7, 3.9, 4.1, 4.3, 4.9, 5.3, 5.5, 5.8,
6.5, 6.8, 7.2, 8.1, 8.5, 8.8, 9.3}

The shape of the histogram represents important properties of the statistical variable
to which it refers. First, note that the shape is the same whether the heights reflect
absolute frequencies or relative frequencies. On the other hand, the
appearance of the histogram is affected by the choice of the point where the first class
begins and by the width of the classes (SPSS does not allow drawing histograms with
classes of different widths). The same data can give
rise to different histograms if the interval choices are different.


Exercise 1. Make a histogram with the data from the students' grades,
grouping them into the intervals [0,5), [5,7), [7,9) and [9,10], which
correspond to the qualitative grades of fail, pass, notable and outstanding.

Cumulative curve, distribution curve or cumulative frequency polygon. An
additional graph to represent a distribution of values grouped into class intervals is
the cumulative frequency polygon, which is constructed as follows:

1. The cumulative relative frequencies are calculated for each class interval.
2. At the upper end of each interval, $e_{i+1}$, a perpendicular of height equal to the
cumulative relative frequency of the interval, $F_i$, is raised.
3. The ends of the perpendiculars are joined by straight segments, that is, the points
$(e_{i+1}, F_i)$ are joined. The first point is $(e_1, 0)$, the lower end of the first interval.
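
The vertices of the polygon can be listed with a few lines of code. This Python sketch uses the unit-width classes and absolute frequencies from the grades table in section 2.2.2; the variable names are assumptions, and no plotting library is used.

```python
from itertools import accumulate

# Unit-width classes and absolute frequencies from the grades table
edges = list(range(11))                    # 0, 1, ..., 10
counts = [1, 1, 2, 3, 3, 3, 2, 1, 3, 1]
n = sum(counts)

# First vertex: lower end of the first class at height 0; then one vertex
# per class at (upper end, cumulative relative frequency of the class)
points = [(edges[0], 0.0)] + [
    (upper, N_i / n) for upper, N_i in zip(edges[1:], accumulate(counts))
]

for x, y in points:
    print(x, y)
```

Joining consecutive vertices with straight segments amounts to assuming that observations are spread evenly inside each class.
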

Exercise 2. Make the cumulative frequency polygon of the exam grade
data grouped in the previous intervals.

Stem and leaf diagram.


They are an alternative to the histogram that allows a global graphic representation to be made
while preserving the individual values. It is constructed as follows:

1. Begin by rounding the data to two, or at most three significant figures.

Hipolito Hernandez Perez


Department: Mathematics, Statistics and Operations Research Page 16
Statistics Applied to Social Sciences I

2. Each observation is divided into two parts: the stem, which is made up of all the
digits in the observation except the rightmost digit, and the leaf, which consists
of the final digit.
3. The stems are written in a vertical column, starting with the smallest and
continuing in increasing order. All stems must be placed consecutively even if
there is no data for a specific stem.
4. A vertical line is drawn to the right of the column of stems.
5. Each leaf is written in a row to the right of its stem, starting with the smallest and
continuing in increasing order.
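As a rough sketch of this construction, the following Python function builds the diagram for non-negative integers with a stem width of 10. The function name `stem_and_leaf` and the sample scores are assumptions made for the illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Build a stem and leaf diagram for non-negative integers (stem width 10)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # leaf = last digit, stem = remaining digits
        leaves[stem].append(leaf)
    lines = []
    # Every stem between the smallest and the largest is listed, even if empty
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}")
    return lines

# Illustrative scores (an assumption, already rounded to integers)
scores = [62, 64, 69, 71, 72, 72, 85, 103]
for line in stem_and_leaf(scores):
    print(line)
```

Because the observations are sorted before they are split, the leaves of each stem come out in increasing order, as step 5 requires.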

The following example is the Stem and Leaf Diagram corresponding to the data from the spatial
orientation test:

Stem-and-Leaf Plot: Spatial Orientation

Frequency   Stem &  Leaf

   21.00      0 .  566788888899999999999
   62.00      1 .  00000000011122222333333333334444444555555556667777777888899999
   24.00      2 .  011112223333344457999999
   19.00      3 .  0011122455567778889
   14.00      4 .  01235666678999
    8.00      5 .  02234556
    3.00      6 .  249
    9.00      7 .  122346789
   16.00      8 .  0003445556666678
    5.00      9 .  01138
   11.00     10 .  11222446999
    5.00     11 .  03447
    2.00     12 .  15
     .00     13 .
    1.00     14 .  2

Stem width: 10.00
Each leaf:  1 case(s)

In addition to the fact that its appearance is similar to the histogram (but with horizontal orientation), it
allows us to know the values of the observations. So, for example, we know that between 60 and 70 there
are three values with scores 62, 64 and 69.

Exercise 3. Make a stem and leaf diagram with the grades of the 20 students.

Population pyramids.
In the social sciences, particularly in the field of Demography, diagrams called
population pyramids are used. These graphs are particular histograms that tell us how
the population is distributed by age and sex.

Box diagram.
We will see in the next section what box plots consist of.

2.4. Numerical representations of data distributions.


In the previous sections we saw techniques that allow a visual description of the
distribution of a variable using tables and graphs. Regardless of the type of variable,
we can summarize the distribution of a variable both using tables and graphs.

For quantitative variables we can also summarize the information in a simpler and more precise
way using numerical values that give us an idea of the location or center of the data (measures
of position), of the concentration of the data around the center (measures of
dispersion) and of other features of the distribution, such as asymmetry or kurtosis (pointedness).

If these numerical values are calculated for the population we call them parameters. Example: the
average height of individuals in a country.

If these numerical values are calculated for the sample we call them statistics. Example: the
average height of the students in this class.

2.4.1. Measures of centralization or central tendency.


They report the average values of the data series. The main measures of central
position are those that we will define below.

Arithmetic mean.
The arithmetic mean, or simply the mean, is the measure most used to represent the central tendency of a
distribution of numerical data. It is equal to the quotient between the sum of all the data and
the total number of observations:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Exercise 4. The salaries of the 5 employees of a company are: 1200, 1425, 1600, 1350
and 1100. Calculate their mean.

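If the data are grouped in a frequency table where the values
$x_1, x_2, \ldots, x_k$ are repeated $n_1, n_2, \ldots, n_k$ times, we can use the following expressions to
calculate the mean:

$$\bar{x} = \frac{\sum_{i=1}^{k} x_i \cdot n_i}{n} = \sum_{i=1}^{k} x_i \cdot f_i$$

where $f_i = n_i / n$ is the relative frequency of $x_i$.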
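A minimal Python sketch of both expressions follows. The small frequency table is invented for the illustration, and the variable names are assumptions.

```python
# Illustrative frequency table (an assumption, not data from the notes)
values = [1, 2, 3, 4]  # distinct values x_i
counts = [2, 3, 4, 1]  # absolute frequencies n_i

n = sum(counts)  # total number of observations

# First expression: sum of x_i * n_i divided by n
mean_abs = sum(x * c for x, c in zip(values, counts)) / n

# Second expression: sum of x_i * f_i, with f_i = n_i / n
rel = [c / n for c in counts]
mean_rel = sum(x * f for x, f in zip(values, rel))

print(n, mean_abs, mean_rel)
```

Both routes give the same mean up to floating-point rounding; the second one is convenient when only relative frequencies are published.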

Exercise 5. The following table shows the ages of 40 students. Calculate its
average.

Age                   18   19   20   21   22   27   40
No. of observations   10   15    7    2    3    2    1

The mean or arithmetic mean is one of the most well-known and used measures of
centralization. Its popularity is undoubtedly motivated by the simplicity of its calculation
and its properties. However, it is not without drawbacks. Let's comment on some of its
characteristics:

• All the values of the variable take part in the calculation of the mean, and the result is always a number
between the minimum and maximum of them. However, it does not have to coincide
with one of the values of the variable, and this can make the result difficult to interpret.
For example, the average number of children of a woman is 0.8821, which can only be
understood abstractly, as the number of children of an “average” or “ideal” woman who
represents the entire population and does not correspond, evidently, to any real
woman.
• The mean is very sensitive to the influence of a few extreme observations. For
example, if the ages of the students in a class are between 18 and 22 years,
except for one person who is 60 years old, this last observation increases the
mean significantly. Extreme values may be exceptional observations, but
they may also be due to error, and they greatly distort the mean and other numerical
descriptions of the data.
The $p\%$ trimmed arithmetic mean sorts the values of the variable and eliminates the $k$
largest and the $k$ smallest before calculating the mean of the $n - 2k$ remaining ones, where $k$ is
such that $k = p\%$ of $n$. The value of $p$ is usually taken between 1% and 10%.

$$\bar{x}_T = \frac{\sum_{i=k+1}^{n-k} x_{(i)}}{n - 2k}$$

The $p\%$ winsorized arithmetic mean, after sorting the values of the variable,
replaces the $k$ smallest by the one immediately after them, $x_{(k+1)}$, and the $k$ largest by the
one immediately preceding them, $x_{(n-k)}$, where $k$ is such that $k = p\%$ of $n$:

$$\bar{x}_W = \frac{(k+1)\,x_{(k+1)} + x_{(k+2)} + \cdots + x_{(n-k-1)} + (k+1)\,x_{(n-k)}}{n}$$

Here $x_{(i)}$ denotes the observation in position $i$ of the ordered list.

Exercise 6. The following table shows the ages of 40 students. Calculate its
trimmed mean at 5% and its winsorized mean at 5%.

Age                   18   19   20   21   22   27   40
No. of observations   10   15    7    2    3    2    1

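• If already prepared (tabulated) data are used, it may be the case that the mean cannot be
calculated. For example, in the following table the last class, “5 or more”, is open-ended, so the
exact values it contains are unknown:

$x_i$    0    1    2    3    4    5 or more
$n_i$   23   45   27    7    3    3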

• The mean has the property that if we shift all the data by an amount $c$, the mean
is shifted by that same amount $c$. Likewise, if we multiply the data by a
quantity $d$, the mean is multiplied by the same quantity.
• Sometimes when we have a continuous variable grouped into intervals, instead of
using all observations to calculate the mean, class marks are used. Since there are
fewer different values, its calculation is simpler. If we have the data within a
statistical program, this is not done since for a computer to perform the calculation
of an average is a quick task.
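The sensitivity of the mean to an extreme value, and the way the trimmed and winsorized means described above reduce it, can be sketched in Python. The function names and the ten ages are assumptions made for the illustration; `k` is the number of observations affected at each end and must satisfy 2k < n.

```python
def trimmed_mean(data, k):
    """Mean after dropping the k smallest and the k largest observations."""
    x = sorted(data)
    kept = x[k:len(x) - k]
    return sum(kept) / len(kept)

def winsorized_mean(data, k):
    """Mean after replacing the k smallest values by x_(k+1)
    and the k largest values by x_(n-k)."""
    x = sorted(data)
    n = len(x)
    w = [x[k]] * k + x[k:n - k] + [x[n - k - 1]] * k
    return sum(w) / n

# Illustrative ages (an assumption): nine students aged 18 to 22 and one aged 60
ages = [18, 19, 19, 20, 20, 21, 21, 22, 22, 60]

print(sum(ages) / len(ages))     # ordinary mean, pulled up by the value 60
print(trimmed_mean(ages, 1))     # 10% trimmed mean (k = 1)
print(winsorized_mean(ages, 1))  # 10% winsorized mean (k = 1)
```

With these values the ordinary mean is 24.2, above every age except the extreme one, while both robust versions give 20.5.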

Geometric mean.
It is sometimes useful for data whose behavior is exponential.

$$\bar{x}_G = \sqrt[n]{x_1^{n_1} \cdot x_2^{n_2} \cdots x_k^{n_k}}$$
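A short Python sketch shows why the geometric mean suits multiplicative data. The two growth factors are invented for the illustration: a quantity that doubles in one period and is multiplied by 8 in the next.

```python
import math

# Illustrative growth factors (an assumption): x2 in one period, x8 in the next
factors = [2, 8]

# n-th root of the product of the n factors
geometric = math.prod(factors) ** (1 / len(factors))
arithmetic = sum(factors) / len(factors)

print(geometric)   # average factor per period
print(arithmetic)  # overstates the average growth
```

Applying the geometric mean twice reproduces the total growth (4 · 4 = 16 = 2 · 8), whereas the arithmetic mean would give 5 · 5 = 25.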

Exercise 7. The Consumer Price Index (CPI) rose 10% in 2010 and 20% in 2011.
What was the average increase during those two years?

Median.
The median of a series of numerical values is the value that occupies the central place of the series
once it has been sorted in increasing order, that is, from lowest to highest. It is
denoted by $M$ (sometimes also by $Me$).

To calculate the median we can proceed as follows:

1. The observations are ordered according to their value, from lowest to highest, and
counted. Let $n$ be the number of observations.

2. If $n$ is odd, the median is the observation that occupies the center of the ordered
list, that is, the one in position $\frac{n+1}{2}$ counting from the
first on the list.
3. If $n$ is even, the median is the arithmetic mean of the two observations that
occupy the central places, that is, positions $\frac{n}{2}$ and $\frac{n}{2}+1$.
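The procedure can be written as a small Python function. The sample lists are invented for the illustration; note that Python lists are indexed from 0, so position p in the steps above corresponds to index p - 1.

```python
def median(data):
    """Median following the three steps above."""
    x = sorted(data)  # step 1: order the observations
    n = len(x)
    mid = n // 2
    if n % 2 == 1:
        return x[mid]  # step 2: position (n + 1) / 2, counting from 1
    return (x[mid - 1] + x[mid]) / 2  # step 3: positions n/2 and n/2 + 1

print(median([3, 1, 7, 5, 9]))  # odd number of observations
print(median([3, 1, 7, 5]))     # even number of observations
```

The standard library function `statistics.median` follows the same rule and can be used instead.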

Exercise 8. Calculate the median of the data 18, 18, 19, 20, 22, 23 and 25. Calculate the
median of the data 18, 18, 19, 20, 22 and 23.

When the data are ordered in a frequency table we can use this order to calculate the
median.

Exercise 9. The following table shows the ages of 40 students. Calculate its
median.

Age                   18   19   20   21   22   27   40
No. of observations   10   15    7    2    3    2    1

When the data correspond to a continuous variable that is grouped into intervals, we
will see later how the median is calculated.

The median depends on all observations through their order, but not through their value. This makes
it less sensitive than the mean to extreme observations. On the other hand, algebraic
calculations involving the median are usually complicated, so its use is more limited
than that of the arithmetic mean.

Mode.

The mode of a series of observations is the value that is repeated the most. It is denoted by $Mo$
(sometimes also by $M_o$). A data distribution may have more than one mode.

When the variable is continuous and the data have been grouped into intervals, we will talk
about the modal interval instead of the mode.

Note: Some books (such as Velez Ibarrola and others) calculate a value representative of
the mode when the data are grouped into intervals. We are not going to do it.

Some properties of the mode are:

• The mode is a measure of centralization that is simple to calculate and easy to interpret.
• In the calculation of the mode, all observations are involved through their frequency
and not through their value.
• The mode is very sensitive to fluctuations in the observations. There may even be
several modes.
• The mode lends itself poorly to algebraic calculations.
• It is possible to calculate the mode for qualitative variables.
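A brief Python sketch illustrates two of these properties: a distribution may have several modes, and the mode also works for qualitative data. The function name `modes` and the sample data are assumptions.

```python
from collections import Counter

def modes(data):
    """Return every value that reaches the highest frequency, in sorted order."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([18, 19, 19, 20, 20, 22]))          # two modes
print(modes(["pass", "fail", "pass", "pass"]))  # qualitative variable
```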

2.4.2. Measures of position.

Quartiles.
The quartiles of a numerical data series are three values, denoted by $Q_1$, $Q_2$ and $Q_3$, that
are defined as follows:

• $Q_1$ is the value that, once the series is sorted in increasing order, leaves below it,
including the value $Q_1$ itself, 25% of the observations and above it the remaining 75%.
• $Q_2$ is the value that, once the series is sorted in increasing order, leaves below it, including
the value $Q_2$ itself, 50% of the observations and above it the remaining 50%. Therefore it
coincides with the median.
• $Q_3$ is the value that, once the series is sorted in increasing order, leaves below it, including
the value $Q_3$ itself, 75% of the observations and above it the remaining 25%.

In other words, we can define quartiles as those values that divide the data set into
four equal parts.

Deciles.

The deciles of a series of numerical data are nine values, denoted $D_k$ for $k = 1, \ldots, 9$, $D_k$ being
the value that, once the series is sorted in increasing order, leaves below it, including
$D_k$ itself, $10k\%$ of the observations and above it the remaining $(100 - 10k)\%$.

Percentiles.
Analogously to the deciles, we denote the percentiles by $P_k$ for $k = 1, \ldots, 99$, $P_k$ being
the value that, once the series is sorted in increasing order, leaves below it, including $P_k$
itself, $k\%$ of the observations and above it the remaining $(100 - k)\%$.

As can be deduced from the definitions, $M = Q_2 = D_5 = P_{50}$. Other equivalences
also hold, such as $Q_1 = P_{25}$ and $Q_3 = P_{75}$.

Since quartiles and deciles can be put in terms of percentiles, we will explain how the
latter are calculated.
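For a discrete frequency table, one possible reading of the definition is to take the smallest value whose cumulative frequency reaches k% of the observations. The sketch below assumes that convention, which the notes do not fix explicitly; the function name and the small table are also assumptions.

```python
def percentile(values, counts, k):
    """Smallest value x such that at least k% of the observations are <= x.

    This is one possible convention (an assumption); textbooks and
    statistical software differ in the details.
    """
    n = sum(counts)
    cumulative = 0
    for x, c in sorted(zip(values, counts)):
        cumulative += c
        if 100 * cumulative >= k * n:  # integer comparison avoids rounding issues
            return x
    raise ValueError("k must be between 0 and 100")

# Illustrative frequency table (an assumption): 20 observations
values = [1, 2, 3, 4, 5]
counts = [4, 6, 5, 3, 2]

print(percentile(values, counts, 25))  # first quartile
print(percentile(values, counts, 75))  # third quartile
print(percentile(values, counts, 90))  # ninth decile
```

When k% of n is a whole number, some texts average the value found with the next one, as the rule for the median with even n does. With this table that rule would give a median of 2.5 instead of the 2 returned here, so the convention in use should always be stated.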

Exercise 10. The following table shows the ages of 40 students. Calculate the quartiles
and the 57th percentile.

Age                   18   19   20   21   22   27   40
No. of observations   10   15    7    2    3    2    1

If we have a continuous variable grouped into intervals, we will assume that the
observations are distributed uniformly throughout each interval. In this way the
percentile $P_k$ is the inverse image of the value $k/100$ on the distribution curve (or
cumulative frequency polygon).
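Under the uniformity assumption, taking the inverse image on the cumulative frequency polygon amounts to linear interpolation inside the interval that contains the percentile. A minimal Python sketch, with invented interval limits and counts and a hypothetical function name:

```python
def grouped_percentile(bounds, counts, k):
    """Percentile P_k by linear interpolation on the cumulative frequency polygon.

    bounds: interval limits L_0 < L_1 < ... (one more value than counts)
    counts: absolute frequency of each interval
    """
    n = sum(counts)
    target = k / 100 * n  # number of observations that must lie below P_k
    cumulative = 0
    for lower, upper, c in zip(bounds, bounds[1:], counts):
        if c > 0 and cumulative + c >= target:
            # Uniform spread inside the interval: move along a straight segment
            return lower + (target - cumulative) / c * (upper - lower)
        cumulative += c
    return bounds[-1]

# Illustrative grouped data (an assumption): 40 observations
bounds = [0, 10, 20, 30, 40]
counts = [5, 10, 20, 5]

print(grouped_percentile(bounds, counts, 25))
print(grouped_percentile(bounds, counts, 50))
print(grouped_percentile(bounds, counts, 75))
```

For the median, for example, 20 observations must lie below it; 15 are accumulated before the interval [20, 30), so the median sits a quarter of the way into that interval, at 22.5.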

Exercise 11. The following table shows the grades of 20 students. Calculate the quartiles
and the 57th percentile.

Grade                 [0,5)   [5,7)   [7,9)   [9,10]
No. of observations    10       5       4       1

Note: Some books (such as Velez Ibarrola and others) classify quartiles, deciles and
percentiles as measures of dispersion; we will call them measures of position.

2.4.3. Measures of dispersion.


Measures of central tendency give us an idea of the central value around which the data
are distributed, but they do not tell us how far the observations are from those central
values. For this we will use the dispersion measures. Let's look at some measures of
dispersion.

Range.
A simple way to see the dispersion of a variable is to give the largest and smallest values it takes.

We define the range as the difference between the largest and the smallest values:

$$R = x_{\max} - x_{\min}$$

Perhaps the most notable feature of this dispersion measure is that it is very sensitive to
extreme observations.

Interquartile, semi-interquartile, interdecile and semi-interdecile ranges.
The interquartile range is the difference between the upper quartile $Q_3$ and the lower quartile $Q_1$:

$$R_Q = Q_3 - Q_1$$

The semi-interquartile range:

$$R_{SQ} = \frac{R_Q}{2}$$

The interdecile range:

$$R_D = D_9 - D_1$$

And finally the semi-interdecile range:

$$R_{SD} = \frac{R_D}{2}$$

The interval between $D_1$ and $D_9$ contains 80% of the observations.

These dispersion measures are quite robust to extreme observations.
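The contrast between the range and the quantile-based ranges can be checked with a short Python sketch. The data are invented, and the percentile rule used (smallest observation with at least k% of the data at or below it) is only one possible convention, assumed here for simplicity.

```python
def percentile(data, k):
    """Smallest observation with at least k% of the data at or below it
    (one possible convention, assumed for this sketch)."""
    x = sorted(data)
    pos = (k * len(x) + 99) // 100  # ceiling of k * n / 100, as a 1-based position
    return x[max(pos, 1) - 1]

def spread(data):
    """Range, interquartile range and interdecile range of the data."""
    return {
        "range": max(data) - min(data),
        "interquartile": percentile(data, 75) - percentile(data, 25),
        "interdecile": percentile(data, 90) - percentile(data, 10),
    }

# Illustrative data (an assumption): the integers 1..20, then the same
# list with its largest value replaced by an extreme observation
data = list(range(1, 21))
with_outlier = data[:-1] + [100]

print(spread(data))
print(spread(with_outlier))
```

Only the range reacts to the extreme value; the interquartile and interdecile ranges are unchanged. The semi-ranges are simply half of the corresponding values.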

Exercise 12. The following table shows the ages of 40 students. Calculate the range
and the interquartile, semi-interquartile, interdecile and semi-interdecile ranges.
What would happen if instead of a 40-year observation there was a 60-year
observation?

Age                   18   19   20   21   22   27   40
No. of observations   10   15    7    2    3    2    1

Exercise 13. The following table shows the grades of 20 students. Calculate the
range and the interquartile, semi-interquartile, interdecile and semi-interdecile
ranges.

Grade                 [0,5)   [5,7)   [7,9)   [9,10]
No. of observations    10       5       4       1

Variance and standard deviation.


With centralization measures we have a general view around which values the data is
distributed, but we do not know how far the data is from those central values. We have already
seen some measurements that give us an idea of how “dispersed” the data is with respect to
the central positions. These measures take into account two specific values due to the position
they occupy in the data. Now we will see two measures that take into account all the values to
calculate their dispersion.

The variance of a data distribution is the sum of the squared distances between each observation
and the mean, divided by the total number of observations, that is:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

The variance is written as a square to emphasize that it cannot be negative and that it is
expressed in units of the variable squared, so if the variable is measured in meters, the
variance is measured in square meters. To avoid this, the square root of the variance is
usually used as a measure of dispersion.

We define the standard deviation of a variable $x$ as:

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$
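Both definitions translate directly into Python. The data below are invented for the illustration. Note that the definition used in these notes divides by n, which corresponds to `statistics.pvariance` and `statistics.pstdev` in the standard library, not to `statistics.variance`, which divides by n - 1.

```python
import math

def variance(data):
    """Variance with divisor n, as defined above."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n

# Illustrative data (an assumption), with mean 5
data = [2, 4, 4, 4, 5, 5, 7, 9]

var = variance(data)
sd = math.sqrt(var)
print(var, sd)
```

Here the squared deviations add up to 32, so the variance is 32 / 8 = 4 and the standard deviation is 2, in the same units as the data.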

Exercise 14. Calculate the variance and standard deviation of data 18, 18, 19, 20, 22 and
23.

There is an alternative expression to calculate the variance that involves a little less calculation,
although you have to be very careful in using enough decimal figures because you can end up
incurring a large error due to rounding.

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2}{n} - \bar{x}^2$$
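This alternative expression follows from expanding the square in the definition of the variance and using the fact that the sum of the observations divided by n is the mean. The derivation, written with $s^2$ for the variance and $\bar{x}$ for the mean, is sketched below.

```latex
\begin{aligned}
s^2 &= \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2
     = \frac{1}{n}\sum_{i=1}^{n} \left( x_i^2 - 2\,\bar{x}\,x_i + \bar{x}^2 \right) \\
    &= \frac{\sum_{i=1}^{n} x_i^2}{n} - 2\,\bar{x}\,\frac{\sum_{i=1}^{n} x_i}{n} + \bar{x}^2
     = \frac{\sum_{i=1}^{n} x_i^2}{n} - 2\,\bar{x}^2 + \bar{x}^2
     = \frac{\sum_{i=1}^{n} x_i^2}{n} - \bar{x}^2 .
\end{aligned}
```

The rounding problem mentioned above arises because the two terms of the final expression can be large and nearly equal, so their difference loses significant figures.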

Exercise 15. Perform the previous exercise using this new expression for the
variance.

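If the data are in a frequency table, the variance calculation can be simplified by taking
into account that each value $x_i$ is repeated $n_i$ times, for $i = 1, \ldots, k$, so the following
expression for calculating the variance can be used:

$$s^2 = \frac{\sum_{i=1}^{k} n_i (x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{k} n_i x_i^2}{n} - \bar{x}^2$$

or, substituting $\frac{n_i}{n}$ by $f_i$:

$$s^2 = \sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \sum_{i=1}^{k} f_i x_i^2 - \bar{x}^2$$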

For continuous variables grouped into class intervals, if we do not have the
enumeration of all the data we could use the class marks to calculate the variance (and
standard deviation).

Exercise 16. The following table shows the ages of 40 students. Calculate the
variance and standard deviation.

Age                   18   19   20   21   22   27   40
No. of observations   10   15    7    2    3    2    1

Exercise 17. Calculate the variance and standard deviation of the data in the following
table.

Grade                 [0,5)   [5,7)   [7,9)   [9,10]
No. of observations    10       5       4       1

Some properties of the variance and standard deviation are the following:

• The variance (and standard deviation) measures the dispersion with respect to the arithmetic mean.
Therefore it should only be used when this is chosen as a centralization measure.

• The variance (and standard deviation) is always non-negative and takes a value of
zero only when all values of the variable are equal.
• The variance is measured in units of the variable squared, while the standard
deviation in the same units as the variable.
• If the same amount $c$ is added to (or subtracted from) all the data, the variance
and standard deviation remain the same. If instead all the data are multiplied
by the same amount $d$, the variance is multiplied by $d^2$ and the standard deviation by
$|d|$.
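The last property can be checked numerically with the standard library functions `statistics.pvariance` and `statistics.pstdev`, which use the divisor n. The data and the constants `c` and `d` are invented for the illustration.

```python
import statistics

# Illustrative data and constants (assumptions)
data = [2, 4, 4, 4, 5, 5, 7, 9]
c, d = 10, -3

shifted = [x + c for x in data]  # change of origin
scaled = [x * d for x in data]   # change of scale

print(statistics.pvariance(data), statistics.pstdev(data))
print(statistics.pvariance(shifted), statistics.pstdev(shifted))
print(statistics.pvariance(scaled), statistics.pstdev(scaled))
```

The shift leaves both measures unchanged, while the scaling by d = -3 multiplies the variance by 9 and the standard deviation by 3, the absolute value of d.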

Coefficient of variation.
The variance and standard deviation depend on the unit of measurement used to measure the
variable. This is a serious drawback when you want to compare the dispersion of two
populations measured with different scales or when you want to compare different variables.
To have an invariant measurement with respect to the unit of measurement used, the
coefficient of variation is available.

The quotient between the standard deviation and the absolute value of the mean is called the coefficient of
variation. It cannot be calculated when the mean is 0, and it does not make much sense to calculate it
when the data distribution has negative values. It is denoted by $CV$ and its expression is

$$CV = \frac{s}{|\bar{x}|}$$

It is usually expressed as a percentage, multiplying the previous value by 100.

Exercise 18. Calculate the Coefficient of variation of the data from the two previous
exercises.

Some properties of the coefficient of variation are:

• The coefficient of variation is a unitless number that is usually expressed as a
percentage.
• The coefficient of variation is a measure of dispersion that is invariant with respect to a
change in scale; however, it is not invariant with respect to a change in origin.
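Both properties can be illustrated with a short Python sketch. The heights are invented values, and `coefficient_of_variation` is a hypothetical helper built on `statistics.pstdev` (divisor n).

```python
import statistics

def coefficient_of_variation(data):
    """Standard deviation (divisor n) divided by the absolute value of the mean."""
    mean = sum(data) / len(data)
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined when the mean is 0")
    return statistics.pstdev(data) / abs(mean)

# Illustrative heights (an assumption) in centimeters
heights_cm = [150, 160, 170, 180, 190]
heights_m = [h / 100 for h in heights_cm]      # change of scale
heights_plus = [h + 100 for h in heights_cm]   # change of origin

print(coefficient_of_variation(heights_cm))
print(coefficient_of_variation(heights_m))
print(coefficient_of_variation(heights_plus))
```

Changing the unit from centimeters to meters leaves the coefficient unchanged, whereas adding 100 to every value keeps the standard deviation but raises the mean, so the coefficient becomes smaller.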

Mean deviation.
To calculate the variance we use the sum of the squared distances of each observation to the
mean, that is, $(x_i - \bar{x})^2$. What would happen if instead of the squared distances we only
considered the distances, that is, $|x_i - \bar{x}|$, where the bars denote the absolute value?
What happens is that we obtain a new measure of dispersion called the mean
deviation from the arithmetic mean, which we denote by $D_{\bar{x}}$:

$$D_{\bar{x}} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$$

When we calculate this average of distances with respect to the median, we refer to the
mean deviation from the median:

$$D_{M} = \frac{\sum_{i=1}^{n} |x_i - M|}{n}$$
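A minimal Python sketch of both mean deviations follows. The data are invented for the illustration and the helper name `mean_deviation` is an assumption.

```python
import statistics

def mean_deviation(data, center):
    """Average absolute distance of the observations to a given center."""
    return sum(abs(x - center) for x in data) / len(data)

# Illustrative data (an assumption): mean 4, median 3
data = [1, 2, 3, 4, 10]

d_mean = mean_deviation(data, sum(data) / len(data))
d_median = mean_deviation(data, statistics.median(data))
print(d_mean, d_median)
```

The deviation from the median is never larger than the deviation from the mean, because the median is the value that minimizes the sum of absolute distances to the data.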

When the data are in the form of a frequency table, the expressions of the mean deviation,
both with respect to the arithmetic mean and the median, are completely analogous to those
obtained for the variance, simply replacing the square of the deviation by
the corresponding absolute value.

Exercise 19. The following table shows the ages of 40 students. Calculate $D_{\bar{x}}$ and $D_{M}$.

Age                   18   19   20   21   22   27   40
No. of observations   10   15    7    2    3    2    1

2.4.4. Measures of shape.

In addition to the measures of centralization and dispersion, other numerical values can be
computed that measure the shape of a distribution of values of a statistical variable. When
talking about the shape of a distribution, reference is made to two characteristics that are
visually appreciated when examining a bar chart, histogram, or stem and leaf diagram. One of them
is whether or not the distribution is symmetric with respect to a hypothetical axis that divides
it into two parts; this axis of symmetry has to do with the centralization measures that have
been studied. The other, related to dispersion, aims to distinguish whether the dispersion is due to the
fact that there are very frequent values close to the mean or, on the contrary, because there
are values that are very far from the mean but are infrequent. Despite having the same
dispersion, in the first case the bar chart or histogram will give the
impression of a flatter graph, and in the second case it will show a very pronounced
peak around the mean.

Asymmetry coefficient.
There are several measures of asymmetry; perhaps the most representative is the Fisher
asymmetry coefficient, which we will denote by $g_1$:

$$g_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n \cdot s^3}$$

• If $g_1 < 0$, we say that there is asymmetry to the left, or negative asymmetry.
• If $g_1 \approx 0$, we say that the distribution is symmetric or almost symmetric.
• If $g_1 > 0$, we say that the distribution is right-skewed, or positively skewed.

Generally, distributions that collect data from nature are slightly asymmetric to the
right. For example, the heights of Spaniards, the IQ of a population, etc.

The Fisher asymmetry coefficient has the properties that it is a dimensionless
coefficient and is invariant to changes in origin and scale.

Sometimes the Pearson asymmetry coefficient is used as an alternative indicator of
asymmetry, defined as:

$$\text{Pearson asymmetry coefficient} = \frac{\bar{x} - Mo}{s}$$

It has the advantage that its calculation is simple, but it does not take into account the values of the variable or
its frequencies. In particular it cannot be calculated when there are several modes.

A third possibility for skewness is to look at the position of the quartiles.

$$\text{Interquartile asymmetry coefficient} = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}$$

It has a positive or negative sign depending on whether the distribution extends further to the
right or left. The box plot is a graphic representation that allows you to see when this coefficient
is positive, negative or equal to zero.

Kurtosis (pointing) coefficient.
As has been said, the kurtosis coefficient of a distribution distinguishes distributions
with the same variance based on the degree of concentration they show around the
mean. The dispersion may be due to many observations close to the mean or to a few
very far from the mean. To distinguish between one case and the other we use the terms $(x_i - \bar{x})^4$;
in this way the few observations that are very far away have more weight than the many
very close ones. We define the Fisher kurtosis coefficient by:

$$g_2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n \cdot s^4}$$

Since the kurtosis coefficient is always greater than 0 (as the quantities are raised
to an even power), the comparison value is 3, which is the value taken by a “normal”
data distribution.

• If $g_2 < 3$, we say that the data distribution is flattened, or platykurtic.
• If $g_2 \approx 3$, we say that the data distribution is as pointed as the normal, or
mesokurtic.
• If $g_2 > 3$, we say that the data distribution is pointed, or leptokurtic.

As with the Fisher asymmetry coefficient, this coefficient is dimensionless and
invariant to changes in origin and scale.

If the data are grouped in a frequency table, as we have done for other measures such as the mean
and variance, we can use the following expressions for the asymmetry and kurtosis
coefficients, since each value $x_i$ is repeated $n_i$ times:

$$g_1 = \frac{\sum_{i=1}^{k} n_i (x_i - \bar{x})^3}{n \cdot s^3} \qquad g_2 = \frac{\sum_{i=1}^{k} n_i (x_i - \bar{x})^4}{n \cdot s^4}$$
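The grouped expressions can be sketched in Python as follows. The two frequency tables are invented for the illustration, and `fisher_coefficients` is a hypothetical helper that returns the asymmetry and kurtosis coefficients together.

```python
def fisher_coefficients(values, counts):
    """Fisher asymmetry and kurtosis coefficients from a frequency table."""
    n = sum(counts)
    mean = sum(x * c for x, c in zip(values, counts)) / n

    def moment(p):
        # Central moment of order p: sum of n_i * (x_i - mean)^p, divided by n
        return sum(c * (x - mean) ** p for x, c in zip(values, counts)) / n

    s = moment(2) ** 0.5  # standard deviation (divisor n)
    return moment(3) / s ** 3, moment(4) / s ** 4

# Illustrative tables (assumptions): one symmetric, one with a long right tail
symmetric = ([0, 1, 2, 3, 4], [1, 4, 6, 4, 1])
right_tail = ([0, 1, 2, 3, 4], [6, 4, 3, 2, 1])

print(fisher_coefficients(*symmetric))
print(fisher_coefficients(*right_tail))
```

For the symmetric table the asymmetry coefficient is 0 and the kurtosis coefficient is 2.5, below the reference value 3, so that distribution is slightly platykurtic. The second table gives a positive asymmetry coefficient, as expected for a tail that extends to the right.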

Exercise 20. Calculate the Fisher, Pearson and interquartile asymmetry
coefficients and the Fisher kurtosis coefficient for the following data.

Age                   18   19   20   21   22   27   40
No. of observations   10   15    7    2    3    2    1

2.4.5. Five-number summary and box plot.

We have waited until the different statistical measures were introduced to present a new graphical
representation called the box plot. The box plot is based on five of these measures.

The five-number summary of a series of values is made up of the minimum value, the first quartile,
the median, the third quartile and the maximum value.

From these five numbers, a very simple graph can be made to represent the variable,
highlighting, in particular, the dispersion of its values. These diagrams are called box
plots and are constructed as follows:

• An axis is drawn with the scale of the variable, with a vertical orientation.
• A rectangle or box is drawn with one side parallel to the axis, whose length
extends from the first quartile, $Q_1$, up to the third quartile, $Q_3$.
• At the point that corresponds to the median, $M$, the rectangle is divided by a line
perpendicular to the axis.
• Starting from the midpoints of the two sides of the rectangle that are perpendicular to the axis,
two lines are drawn parallel to the axis, each one reaching the last observation that does not
exceed a distance of $1.5\,(Q_3 - Q_1)$ from the box. These lines end with a line perpendicular
to the axis and are usually called whiskers.
• Observations that fall outside the reach of the whiskers are represented by points
and are considered “atypical” observations or outliers, which should be
inspected in more detail. There are also “extreme outliers”,
which are those cases lying further than $3\,(Q_3 - Q_1)$ from the quartiles; they are
usually represented by asterisks or stars.
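The quantities needed to draw a box plot can be computed with a short Python sketch. The data are invented, and the quartiles are obtained with `statistics.quantiles` using `method="inclusive"`; this choice is an assumption, since statistical packages such as SPSS may use a different quartile convention and therefore slightly different fences.

```python
import statistics

def box_plot_summary(data):
    """Five-number summary, whisker ends and outliers for a box plot."""
    x = sorted(data)
    q1, q2, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # limits for the whiskers
    inside = [v for v in x if low <= v <= high]
    return {
        "five_numbers": (x[0], q1, q2, q3, x[-1]),
        "whiskers": (inside[0], inside[-1]),  # last observations within the limits
        "outliers": [v for v in x if v < low or v > high],
        "extreme": [v for v in x if v < q1 - 3 * iqr or v > q3 + 3 * iqr],
    }

# Illustrative data (an assumption) with one very large value
data = [2, 4, 5, 6, 7, 8, 9, 11, 30]
summary = box_plot_summary(data)
print(summary)
```

Here the quartiles are 5, 7 and 9, so the whiskers may reach from -1 to 15 and end at the observations 2 and 11. The value 30 lies more than three interquartile ranges above the third quartile, so it is flagged both as an outlier and as an extreme outlier.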

Example: For the example that we have already used about the weight in kilograms of 80 people, the
following table and the following box diagram have been obtained with the help of SPSS. What
conclusions can we draw from them?

It is common for a (continuous) variable to be compared grouped by the modalities taken
by another variable (generally qualitative). For example, below is a graph of the continuous
variable sulfate concentration that has been grouped by another qualitative variable,
which is the province where the data were taken. This also allows us to make comparisons
by groups.

Here, each province is represented by a box showing sulfate levels.

Differences can be observed regarding the situation of the median and the dispersion. For
example, it is observed that the central sulfate values in the different provinces are similar but
(for example) the data from the province of Valencia are more dispersed than those from the
province of Castellón. Extreme cases (if any) are represented by special symbols next to which
the case number appears. In this example we observe 2 extreme data (one in Alicante and
another in Valencia). It also gives us an idea of the symmetry of the data, for example an off-
center median (within the rectangle) would indicate an asymmetry. In this example, we see an
asymmetry in the Valencia data.
