The Quality of Institutions: A Genetic Programming Approach*
ABSTRACT
The new institutional economics has studied the determinants of the quality of the institutions.
Traditionally, the majority of the empirical literature has adopted a parametric and linear
relationships between variables that are forced and misleading. This paper analyses the
Specifically, we employ a Genetic Program (GP) to study the functional relation between the
quality of institutions and a set of historical, economic, geographical, religious and social
variables. In addition, we compare the results obtained with those from a parametric
perspective (Ordinary Least Squares regression). We conclude that, at least for our application,
the parametric perspective adopted in previous papers about institutional quality could be
accurate.
*
A previous version of this paper was presented at the Annual Conference of the International Society for
New Institutional Economics (Barcelona, September 2005), and later it was published as a FUNCAS
Working Paper (FUNCAS, 2006).
1-. INTRODUCTION
program of research that has propelled the return of institutions into the agenda of
mainstream economics. The Coasean notion of transaction costs (Coase, 1937, 1960)
and the Northian notion of institutions (North, 1990) established the foundations for the
theoretical framework of the NIE. Political rules, informal norms and enforcement
mechanisms constitute the “rules of the game” of a society, and these rules establish an
incentive structure that affects the level of transaction costs and the efficiency of the
economy.
some academic debates and controversies, but has already allowed significant advances
and Shirley, 2005). The progress of the NIE is generated via a “guerrilla action” (Coase,
1999) that stems from several social sciences, and that was propelled by the award of
the Nobel Prize to Ronald Coase in 1991 and to Douglass North in 1993. Since then, the
NIE has experienced a growing process in which its analytical abilities have been
propose an extension of the NIE, this program continues requiring efforts in the
theoretical and applied work. In fact, problems of definition (for example, Greif's
measurement are present, and we need small pieces of work that expand the stock of
knowledge on institutions and economy. In this sense, empirical work is the best way to
The contribution of institutions in determining income levels around the world
has been one of the main programs of empirical research that has been developed in the
last decade (Knack and Keefer, 1997; Hall and Jones, 1999; Acemoglu, Johnson and
Robinson, 2001; Rodrik, Subramanian and Trebbi, 2004). There is now widespread
agreement among economists studying economic growth that institutional quality holds
the key to prevailing patterns of prosperity around the world (Rodrik, 2004). In this
way, economics understands the relevance of analysing the quality of institutions and its
determinants.
(1999) or Islam and Montenegro (2002). Traditionally, this program of research, which
analyzes the effect of a set of variables on the quality of institutions, has adopted a
and the unknown parameters are later estimated using some optimization procedure such as
ordinary least squares (OLS). The theoretical validity of the model is easily analysed
considering the signs of the coefficients, the statistical significance of the parameters
estimated and some fit criterion such as the R-Square. However, assuming a parametric
bias in the results, a loss of predictive ability and an absence of generalization of the
Nowadays, the great advances made in the field of Computer Science allow us to
develop, improve and apply powerful and sophisticated techniques for the estimation
selection and survival (Holland, 1975; Koza, 1992; Mitchell, 2001). The method has
(Beenstock and Szpiro, 2002), finance (Álvarez-Díaz and Álvarez, 2003, 2005) and
increasing and intense spread of GP is mainly due to its advantages. First, GP does not
impose any initial restriction on the functional form underlying the data. Moreover,
unlike other methods based on Computer Science, GP also explicitly offers a
However, as opposed to these advantages, these techniques usually have the difficulty
crucial in an empirical application and should always be done in order to verify and
corroborate the adequacy of the parametric results. In our specific application we use a
genetic program called DARWIN (Álvarez et al., 2001) to carry out this verification
and, additionally, to model which factors explain institutional quality in different
countries. To this end, we compare the GP results with those obtained from the
traditional parametric point of view and analyse their differences and similarities.
brief explanation of the methods used in our study. In Section 3, the data are described
and the results obtained for each method are presented. Finally, in Section 4, we draw
our conclusions.
2-. GENETIC PROGRAMMING
series of procedures inspired by biology and, more precisely, by genetics and the
theory of the evolution of species. Starting from a random set of possible solutions
and applying operators based on natural selection concepts, such as survival
of the fittest individuals and genetic heritage, these computing procedures allow finding
algorithms which present the following common elements: initial population of possible
solutions to the problem, selection process using some fit criterion, and use of crossover
and random mutation to generate new solutions (Mitchell, 2001). In this paper we have
used a kind of genetic algorithm, called genetic programming (Koza, 1992; Álvarez et
al., 2001), as a tool to model the relationship between the quality of institutions and a
set of historical, economic, geographical, religious and social variables. The evolution
stages. In the first stage, the genetic program creates a random initial population of
economic, geographical, religious and social variables X = {X 1i, X 2i,..., Xki }. These
Sj : (( A ⊗ B ) ⊗ (C ⊗ D )) ∀1≤ j ≤ N
where A, B, C, and D are the arguments (operand genes), the symbol ⊗ represents the
mathematical operators (operator genes) and the subscript j refers to each one of the N
equations belonging to the initial population. These arguments can be real numbers
of the variable). The mathematical operators ( ⊗ ) used are addition (+),
subtraction (-), multiplication (·) and division (/), the latter being ‘protected’ to prevent
logarithm or the trigonometric ones) but at the expense of increasing the complexity in
mathematical expressions that are built simply with these four arithmetical operators
evolution process starts selecting those equations that fit best to the problem. For this
purpose, the R-Square has been adopted as fitness criterion. This performance measure
is defined as:
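$$R^2_j = 1 - \frac{\sum_{i=1}^{M} \left( IG_i - \widehat{IG}_i \right)^2}{\sum_{i=1}^{M} \left( IG_i - \mathrm{mean}(\widehat{IG}_i) \right)^2} \qquad \forall\, 1 \le j \le N$$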
where R2 j is the R-Square obtained by equation j, IGi is the observed value, IGˆ i is the
predicted value, and M is the total number of observations in the sub-sample employed
to train the genetic program. Later on, all equations of the initial population are
value of R2 j is very low are rejected, while those with a high value are more likely to
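The first two stages described above (a random initial population and fitness evaluation by R-Square) can be sketched as follows. This is only an illustrative sketch and not the DARWIN implementation: the nested-tuple representation, the range of the random constants, the 50/50 choice between constants and variables, and the value returned by the protected division are all assumptions. The denominator uses the mean of the observed values, which is the usual definition of R-Square and is likewise an assumption here.

```python
import random

def protected_div(a, b):
    # Assumption: return 1.0 when the denominator is (almost) zero. The text
    # only says that division is 'protected', not how.
    return a / b if abs(b) > 1e-9 else 1.0

# Operator genes: the four arithmetic operators named in the text.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": protected_div,
}

def random_equation(n_vars, rng):
    """Build one equation of the form ((A op B) op (C op D)) as nested tuples."""
    def leaf():
        # Operand gene: a real constant or an explanatory variable (assumed 50/50).
        if rng.random() < 0.5:
            return ("const", rng.uniform(-10.0, 10.0))
        return ("var", rng.randrange(n_vars))
    def op():
        return rng.choice(list(OPS))
    return (op(), (op(), leaf(), leaf()), (op(), leaf(), leaf()))

def evaluate(node, x):
    """Evaluate an equation for one observation x (a list of variable values)."""
    tag = node[0]
    if tag == "const":
        return node[1]
    if tag == "var":
        return x[node[1]]
    return OPS[tag](evaluate(node[1], x), evaluate(node[2], x))

def r_square(equation, X, y):
    """R-Square of an equation on observations X with observed values y."""
    preds = [evaluate(equation, x) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def rank_population(population, X, y):
    """Sort equations from fittest (highest R-Square) to least fit."""
    return sorted(population, key=lambda eq: r_square(eq, X, y), reverse=True)
```

A selection step would then keep the top-ranked equations from `rank_population` as the base for the next generation.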
The equations that survived after the selection process are used to create the
the so-called genetic operators will be applied: cloning, crossover and mutation. With
the cloning operator, the fittest equations are replicated in the next generation. With the
crossover operator, pairs of equations with high values of R2 j are selected and
exchange part of their arguments and mathematical operators. Finally, mutation
equations. The first top ranked individuals are exempted from mutation, so that their
information is not lost. Let us consider, for example, that the following equations
S1 : ( A + B) / C
S 2 : (D ⋅ E ) − G
variables). Let us suppose that both expressions will survive the selection process and
so they become the base equations for the next generation. The crossover operator
means the random selection of a block of operators and arguments in each equation and
their later exchange. For instance, let us suppose that the block (A+B) in expression S1
S3 : G / C
S 4 : (D ⋅ E ) − ( A + B )
As one can observe, the new equations inherit certain features from their parents.
Now let us suppose that the expression S1 is selected again and the mutation operator is
S 5 : ( A ⋅ B) / C
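The crossover and mutation examples above can be reproduced with a short sketch. The tuple representation and the helper names (`crossover`, `mutate_operator`, `show`) are hypothetical. In a genetic program the exchanged blocks and the mutated gene are chosen at random; here they are fixed by hand so that the result matches S3, S4 and S5.

```python
def get_subtree(tree, path):
    """Follow a path of tuple indices down to a block of the equation."""
    for idx in path:
        tree = tree[idx]
    return tree

def replace_subtree(tree, path, new_block):
    """Return a copy of tree in which the block at path is replaced by new_block."""
    if not path:
        return new_block
    idx = path[0]
    return tree[:idx] + (replace_subtree(tree[idx], path[1:], new_block),) + tree[idx + 1:]

def crossover(parent1, path1, parent2, path2):
    """Exchange the block at path1 in parent1 with the block at path2 in parent2."""
    block1 = get_subtree(parent1, path1)
    block2 = get_subtree(parent2, path2)
    return replace_subtree(parent1, path1, block2), replace_subtree(parent2, path2, block1)

def mutate_operator(tree, path, new_op):
    """Replace the operator gene of the block at path, keeping its arguments."""
    node = get_subtree(tree, path)
    return replace_subtree(tree, path, (new_op,) + node[1:])

def show(tree):
    """Render an equation as a fully parenthesised string."""
    if isinstance(tree, str):
        return tree
    op, left, right = tree
    return f"({show(left)} {op} {show(right)})"

# Each equation is (operator, left argument, right argument).
S1 = ("/", ("+", "A", "B"), "C")   # (A + B) / C
S2 = ("-", ("*", "D", "E"), "G")   # (D * E) - G

# Crossover: the block (A + B) of S1 is exchanged with the argument G of S2.
S3, S4 = crossover(S1, (1,), S2, (2,))

# Mutation: the operator + in S1 is replaced by *.
S5 = mutate_operator(S1, (1,), "*")

for name, eq in [("S3", S3), ("S4", S4), ("S5", S5)]:
    print(name, ":", show(eq))
```

Because the helpers return new tuples rather than modifying their inputs, the parent equations S1 and S2 remain available for cloning into the next generation.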
In short, the new population created from the initial population of equations is
crossed (such as S3 and S4 ). From this moment, the process will repeat the selection and
determined by the user, the iteration procedure ceases and an optimal mapping
population.
This paper analyses the functional relation between the quality of institutions
and a set of historical, geographical, economic, religious and social variables. In Table
variable, we construct a general index of institutional quality (IG) adding the six
Mastruzzi (2003); in this sense, we follow the index aggregation process proposed by
Easterly and Levine (2003). On the other hand, the explanatory variables include the
Protestant, Others), the geographical (latitude) and economic (GNP) condition (La
Porta et al., 1999). In this way, the database constructed for our study contains complete
In order to detect the possible existence of overfitting, the total sample was
of 90 observations randomly chosen and it was reserved exclusively for obtaining the
models. On the other hand, the Out-of-Sample contains the rest of observations. Its
main function is to verify the validity and consistency of the obtained models and, in
the R-Square obtained in In-Sample and Out-of-Sample are similar and relatively high.
If this condition is verified, it would demonstrate the ability of the constructed models
estimate our model for different years of the dependent variable. Table 2 depicts the
results obtained using OLS regression and GP. At first glance, we should highlight the
temporal consistency of the results for both methods. Although different years are
considered, the R-Square, the explanatory variables finally chosen and
their effects on the quality index show no temporal divergences for either
method.
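The in-sample/out-of-sample check described above can be illustrated with a minimal sketch. The data below are synthetic, the total of 130 observations is a hypothetical figure (only the 90 in-sample observations come from the text), and a one-regressor OLS fit stands in for the models actually estimated in the paper.

```python
import random

def fit_simple_ols(x, y):
    """Closed-form OLS for y = a + b*x with a single regressor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_square(y, preds):
    """Usual R-Square: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def split_and_score(x, y, n_in_sample, rng):
    """Fit on a random In-Sample and report R-Square In-Sample and Out-of-Sample."""
    idx = list(range(len(x)))
    rng.shuffle(idx)
    train, test = idx[:n_in_sample], idx[n_in_sample:]
    a, b = fit_simple_ols([x[i] for i in train], [y[i] for i in train])
    def score(ids):
        return r_square([y[i] for i in ids], [a + b * x[i] for i in ids])
    return score(train), score(test)

# Synthetic data (assumption): 130 observations of a noisy linear relation.
rng = random.Random(0)
x = [rng.uniform(0.0, 10.0) for _ in range(130)]
y = [2.0 * xi - 5.0 + rng.gauss(0.0, 3.0) for xi in x]

r2_in, r2_out = split_and_score(x, y, 90, rng)
print(f"In-Sample R-Square: {r2_in:.4f}  Out-of-Sample R-Square: {r2_out:.4f}")
```

Similar and relatively high values in the two sub-samples suggest that the model has captured a general pattern. An Out-of-Sample value far below the In-Sample one would point to overfitting.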
Before analysing the OLS results, we should mention that the standard
backwards stepwise procedure with a 10% level of significance was used to select
the final variables in the OLS model. In Table 2 we can observe that the R-Square
shows a relatively high value (around 0.70) and that there is only a small divergence
between the in-sample and the out-of-sample periods. This indicates that the OLS
model does not suffer from a lack of generalisation. It seems that the method has
discovered the general pattern existing in the data rather than memorising specific
features of the individual observations (the overfitting problem). As we can observe for the
different years, the relevant variables to explain institutional quality are GNP,
LATIT, FRENCH and SOCI (and, marginally and only once, CAT). The signs of the estimated
coefficients on GNP and LATIT are positive, while those on SOCI and FRENCH are negative.
In general terms, these results are coherent with those obtained by La Porta et al. (1999),
when they conclude that countries that are poor, close to the equator, ethnolinguistically
Up to this point we have introduced the results assuming the common linear and
perspective can give rise to problems of misspecification, biasing the results and,
therefore, distorting our conclusions. For example, are the selected variables the
most important for explaining institutional quality? We could also question whether the
effects of the selected variables are real or spurious, arising from the assumption of a specific and
rigid functional form. In order to validate and investigate the possible existence of a bias in
Table 2 also provides specific information about the GP results and, certainly,
we can find certain similarities with OLS. First of all, among all possible arithmetic
equations, the GP approach has obtained a very similar functional form to the OLS
simple linear relation would be a valid approach to link the General Institutional Quality
Index and the explanatory variables. Secondly, as in the OLS case, the R-Square is
relatively high and stable across the In-Sample and the Out-of-Sample. Lastly,
some of the variables that survive the evolutionary process coincide with the variables selected
by OLS. For instance, GNP and LATIT appear in all equations offered by the genetic
program for the different years, and their positive effects corroborate those obtained by
OLS. However, other variables survived the evolutionary process
but were not selected by the OLS backwards stepwise procedure, such as ENG,
ETHF and, sometimes, SOCI. In this case, ENG shows a positive sign, and ETHF and
SOCI have a negative effect. These results obtained via a GP approach are coherent
with the conclusions of the traditional literature on the quality of institutions in the
perspective) does not cause a loss of out-of-sample predictive ability, compared with
a flexible technique such as GP. Moreover, there is a certain similarity in the
variables considered most relevant to explain the regressand. Therefore, we can
conclude that, at least for our application, the parametric perspective adopted in
previous papers about institutional quality could be accurate.
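As a worked illustration of how close the GP functional form is to a linear one, the GP equation reported in Table 2 for 2002 can be expanded algebraically. This is a direct expansion of the reported equation and adds no new estimate:

$$IG = -16.15 + (2 + LATIT) \cdot GNP - ETHF + ENG = -16.15 + 2 \cdot GNP + LATIT \cdot GNP - ETHF + ENG$$

The expanded form is additive in GNP, ETHF and ENG. The only departure from a purely linear specification is the interaction term LATIT · GNP, which makes the marginal effect of GNP equal to 2 + LATIT. For comparison, the OLS equation for the same year reports a coefficient of 2.045 on GNP.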
4-. CONCLUSIONS
The general procedure to model the quality of institutions has been based almost
exclusively on a linear and parametric point of view. Therefore, a priori and rigid
functional forms are discretionally imposed by the researcher rather than observed in the
which adopts a non-parametric approach. Our paper has tried to initiate this avenue, and
in fact, it constitutes the first case in which the new institutional economics employs a
genetic programming approach. In particular, the main focus of this paper has been to
validate the parametric structure traditionally used in the literature (La Porta et al.,
1999). In this sense, we have opened a new frontier which demands future efforts of
research.
Our results have revealed that the parametric perspective, which has been
considered an accurate analytic approach. We have to point out that, among all possible
arithmetic equations, the GP approach has obtained a very similar functional form to the
OLS regression for all our regressions. Moreover, our comparison seems to corroborate
the results obtained by the parametric perspective (OLS) in terms of the variables that
were finally selected (GNP and LATIT); moreover, their effects on the regressand
coincide. There are some divergences in the variables selected as the most relevant by
the different methods (for example, FRENCH is considered relevant using the OLS
backwards stepwise procedure, while ETHF and ENG are considered relevant by the
GP); however, in all cases the effects on the regressand are in accordance with previous
Analysing the fit criterion, the R-Square shows a similar value for both methods.
Moreover, for both cases, there exists a small divergence between the R-Square in the
in-sample and the out-of-sample period. Therefore, we can confirm the absence of
A final comment about GP is in order. A genetic program can be very
useful to model and validate parametric results (analysing the surviving variables and
their signs, for example). However, we should not forget that the field of Genetic
Algorithms, and of evolutionary computing in general, is relatively new and many of its
problems are still under study (Mitchell, 2001). More research needs to be done in order
to improve and perfect the procedure (for example, our genetic program requires new
REFERENCES
Álvarez A., A. Orfila and J. Tintoré (2001). “DARWIN- an evolutionary program for
models using genetic algorithms”, Journal of Economic Dynamics & Control, 26,
pp. 811-835.
Caballero, G. (2001). “La Nueva Economía Institucional”, Sistema, N. 156, pp. 59-86.
Caballero, G. (2002). “El programa de la Nueva Economía Institucional: lo macro, lo
Coase, R. H. (1960). “The Problem of Social Cost”, Journal of Law and Economics, V.
3, N. 1, pp. 1-44.
Coase, R. H. (1999). “The task of the society”, ISNIE Newsletter, V. 2, N. 2, pp. 1-6.
Easterly, W. and R. Levine (2003). “Tropics, Germs, and Crops: How endowments
3-40.
Hall, R. and C. I. Jones (1999). “Why do some countries produce so much more output
per worker than others?”, Quarterly Journal of Economics, 114, pp. 83-116.
Holland J. H. (1975). Adaptation in natural and artificial systems, Ann Arbor, The
pp. 1251-1288.
La Porta, R.; Lopez de Silanes, F.; Shleifer, A. and R. Vishny (1999). “The quality of
222-279.
Ménard, C. and M. Shirley (2005). Handbook of New Institutional Economics, Springer
Ed.
Press
University Press.
Rodrik, D., A. Subramanian and F. Trebbi (2004). “Institutions Rule: The primacy of
Szpiro, G. G. (1997). “Forecasting chaotic time series with genetic algorithm”, Physical
functional form for chaotic time series evolution using genetic algorithm”,
TABLE 1: Variables included in the Analysis
ENG: English Common Law. Identifies the Legal Origin of the English Common Law.
GNP: Logarithm of GNP per capita (expressed in current US dollars for the period 1970-1995).
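TABLE 2: OLS and GP Results

2002
OLS: IG = −15.862 − 2.884·SOCI − 1.805·FRENCH + 10.084·LATIT + 2.045·GNP
     Values in parentheses (in coefficient order): (0.00) (0.00) (0.01) (0.00) (0.00)
     R-Square: In-Sample 0.7236; Out-of-Sample 0.7314; Total 0.7276
     Survival variables (sign): SOCI (−), FRENCH (−), LATIT (+), GNP (+)
GP:  IG = −16.15 + (2 + LATIT)·GNP − ETHF + ENG
     R-Square: In-Sample 0.7175; Out-of-Sample 0.764; Total 0.7325
     Survival variables (sign): ETHF (−), ENG (+), LATIT (+), GNP (+)

2000
OLS: IG = −14.301 − 3.472·SOCI − 1.815·FRENCH + 10.496·LATIT + 1.877·GNP
     Values in parentheses (in coefficient order): (0.00) (0.006) (0.012) (0.00) (0.00)
     R-Square: In-Sample 0.7086; Out-of-Sample 0.6978; Total 0.7086
     Survival variables (sign): SOCI (−), FRENCH (−), LATIT (+), GNP (+)
GP:  IG = −14.88 + (2 + LATIT)·GNP + ENG + 3.78 / (SOCI − 3.3)
     R-Square: In-Sample 0.7015; Out-of-Sample 0.7356; Total 0.7125
     Survival variables (sign): SOCI (−), ENG (+), LATIT (+), GNP (+)

1998
OLS: IG = −13.426 − 3.618·SOCI − 2.739·FRENCH + 0.025·CAT + 10.912·LATIT + 1.69·GNP
     Values in parentheses (in coefficient order): (0.00) (0.006) (0.002) (0.02) (0.00) (0.00)
     R-Square: In-Sample 0.7015; Out-of-Sample 0.7572; Total 0.7165
     Survival variables (sign): SOCI (−), FRENCH (−), CAT (+), LATIT (+), GNP (+)
GP:  IG = −15.72 + (2 + LATIT)·GNP − ETHF − SOCI + ENG
     R-Square: In-Sample 0.6824; Out-of-Sample 0.7675; Total 0.7048
     Survival variables (sign): SOCI (−), ENG (+), LATIT (+), GNP (+), ETHF (−)

1996
OLS: IG = −14.040 − 2.558·SOCI − 1.739·FRENCH + 7.806·LATIT + 1.871·GNP
     R-Square: In-Sample 0.6935; Out-of-Sample 0.7517; Total 0.7104
     Survival variables (sign): SOCI (−), FRENCH (−), LATIT (+), GNP (+)
GP:  IG = −14.62 + (1.74 + LATIT)·GNP − ETHF + 2·ENG
     R-Square: In-Sample 0.6944; Out-of-Sample 0.7449; Total 0.7093
     Survival variables (sign): ENG (+), LATIT (+), GNP (+), ETHF (−)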