The Quality of Institutions: A Genetic Programming Approach

Marcos Álvarez-Díaz (Columbia University)


Gonzalo Caballero Miguez (University of Vigo)


The new institutional economics has studied the determinants of the quality of the institutions.

Traditionally, the majority of the empirical literature has adopted a parametric and linear

approach. Such specifications impose ad hoc functional structures, sometimes introducing

relationships between variables that are forced and misleading. This paper analyses the

determinants of the quality of institutions using a non-parametric and non-linear approach.

Specifically, we employ a Genetic Program (GP) to study the functional relation between the

quality of institutions and a set of historical, economic, geographical, religious and social

variables. We also compare the results with those obtained from a parametric

perspective (Ordinary Least Squares regression). We conclude that, at least for our application,

the parametric perspective adopted in previous papers about institutional quality could be

accurate.


Keywords: Quality of Institutions, Institutional Determinants, Non-Parametric

Perspective, Genetic Programming.

JEL: O10, O50, C14.

A previous version of this paper was presented at the Annual Conference of the International Society for
New Institutional Economics (Barcelona, September 2005), and later it was published as a FUNCAS
Working Paper (FUNCAS, 2006).


In recent decades, the new institutional economics (NIE) has constituted a

program of research that has propelled the return of institutions into the agenda of

mainstream economics. The Coasean notion of transaction costs (Coase, 1937, 1960)

and the Northian notion of institutions (North, 1990) established the foundations for the

theoretical framework of the NIE. Political rules, informal norms and enforcement

mechanisms constitute the “rules of the game” of a society and these rules establish an

incentive structure that affects the level of transaction costs and the efficiency in the


The NIE is a young program that is still in a stage of development and is subject to

some academic debates and controversies, but it has already allowed significant advances

in different areas such as economic history, economics of organization, law and

economics, policy analysis and development economics (Williamson, 2000; Ménard

and Shirley, 2005). The progress of the NIE is generated via a “guerrilla action” (Coase,

1999) that stems from several social sciences, and that was propelled by the award of

the Nobel Prize to Ronald Coase in 1991 and to Douglass North in 1993. Since then, the

NIE has experienced a process of growth in which its analytical abilities have been

recognized (Caballero, 2001, 2002). Nevertheless, although North (2005) has already

proposed an extension of the NIE, this program still requires further effort in

theoretical and applied work. In fact, problems of definition (for example, Greif's

notion of institutions as an equilibrium versus North's view of institutions as rules), methodology and

measurement are present, and we need small pieces of work that expand the stock of

knowledge on institutions and the economy. In this sense, empirical work is the best way to

improve this knowledge.

The contribution of institutions in determining income levels around the world

has been one of the main programs of empirical research that has been developed in the

last decade (Knack and Keefer, 1997; Hall and Jones, 1999; Acemoglu, Johnson and

Robinson, 2001; Rodrik, Subramanian and Trebbi, 2004). There is now widespread

agreement among economists studying economic growth that institutional quality holds

the key to prevailing patterns of prosperity around the world (Rodrik, 2004). In this

way, economics understands the relevance of analysing the quality of institutions and its


The study of the quality of institutions includes works such as La Porta et al.

(1999) or Islam and Montenegro (2002). Traditionally, this program of research, which

analyzes the effect of a set of variables on the quality of institutions, has adopted a

parametric perspective; therefore a specific functional form (usually linear) is assumed

and the unknown parameters are later estimated using some optimization procedure such as

ordinary least squares (OLS). The theoretical validity of the model is easily analysed

considering the signs of the coefficients, the statistical significance of the parameters

estimated and some fit criterion such as the R-Square. However, assuming a parametric

point of view might cause misspecification problems and, in consequence, lead to

biased results, a loss of predictive ability and a failure of the model to generalize

to new observations.

Nowadays, the great advances made in the field of Computer Science allow us to

develop, improve and apply powerful and sophisticated techniques for the estimation

and prediction of different phenomena. One of these techniques, called Genetic

Programming (GP), is inspired by genetics and by the Darwinian theories of natural

selection and survival (Holland, 1975; Koza, 1992; Mitchell, 2001). The method has

already been used satisfactorily in different scientific areas, including economics

(Beenstock and Szpiro, 2002), finance (Álvarez-Díaz and Álvarez, 2003, 2005) and

environmental economics (Álvarez-Díaz and Domínguez-Torreiro, 2005). This

increasing and intense spread of GP is mainly due to its advantages. Firstly, it does not

impose any initial restriction on the functional form underlying the data. Moreover,

unlike other methods based on Computer Science, GP also explicitly offers a

mathematical equation which allows a simple ad hoc interpretation of the results.

However, as opposed to these advantages, the technique is usually computationally

intensive, and the construction of confidence intervals and

hypothesis tests is not trivial.

In this work we intend to check for the existence of a bias caused by employing a

parametric perspective; that is, we try to detect possible misspecification problems

associated with the traditional parametric models. In our opinion, this check is

crucial in an empirical application and should always be carried out in order to

corroborate the adequacy of the parametric results. In our specific application we use a

genetic program called DARWIN (Álvarez et al., 2001) to carry out this verification

and, additionally, to model which factors explain institutional quality in different

countries. To this end, we compare the GP results with those obtained from the

traditional parametric point of view and analyse their differences and similarities.

The article is presented as follows. After this introduction, Section 2 presents a

brief explanation of the methods used in our study. In Section 3, the data are described

and the results obtained for each method are presented. Finally, in Section 4, we draw

our conclusions.


Genetic Algorithms, originally developed by Holland (1975), encompass a whole

series of procedures inspired by biology and, to be more precise, by genetics and the

theory of evolution of species. Starting from a random set of possible solutions

and applying operators based on natural selection concepts such as survival

of the fittest individuals and genetic heritage, these computing procedures allow us to find

an optimal approximation to the solution of a certain problem.

In the specialized literature there is no commonly accepted definition of

genetic algorithms which allows us to distinguish them from other computational

evolutionary methods. However, there exist many programs considered as genetic

algorithms which present the following common elements: initial population of possible

solutions to the problem, selection process using some fit criterion, and use of crossover

and random mutation to generate new solutions (Mitchell, 2001). In this paper we have

used a kind of genetic algorithm, called genetic programming (Koza, 1992; Álvarez et

al., 2001), as a tool to model the relationship between the quality of institutions and a

set of historical, economic, geographical, religious and social variables. The evolution

process developed by the genetic program can be explained by means of a series of

stages. At a first stage, the genetic program creates a random initial population of

N mathematical equations capable of accurately representing the relationship

between the dependent variable IG_i (institutional quality index) and the historical,

economic, geographical, religious and social variables X = {X_1i, X_2i, ..., X_ki}. These

mathematical equations are created by means of a random combination of operators and

arguments in the following way:

Sj : (( A ⊗ B ) ⊗ (C ⊗ D )) ∀1≤ j ≤ N

where A, B, C, and D are the arguments (operand genes), the symbol ⊗ represents the

mathematical operators (operator genes) and the subscript j refers to each one of the N

equations belonging to the initial population. These arguments can be real numbers

included in a certain interval (the equation coefficients) or independent variables (delays

of the variable). In addition, the mathematical operators (⊗) used are addition (+),

subtraction (-), multiplication (·) and division (/), the latter being ‘protected’ to prevent

zero divisors. It is also possible to include other mathematical operators (such as

logarithm or the trigonometric ones) but at the expense of increasing the complexity in

the functional optimisation process. Moreover, previous studies on genetic

programming have demonstrated that it is possible to describe complex dynamics with

mathematical expressions that are built simply with these four arithmetical operators

(Szpiro, 1997; Yadavalli et al., 1999; Álvarez et al., 2001).
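
The first stage described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the DARWIN implementation: the tuple encoding of equations, the coefficient interval `coef_range`, the 50/50 choice between variables and coefficients, and the value returned by the protected division are all assumptions made for the example.

```python
import random

def protected_div(a, b):
    # 'Protected' division: avoid zero divisors.
    # Returning 1.0 in that case is an assumption of this sketch.
    return a / b if abs(b) > 1e-9 else 1.0

# The four arithmetic operator genes named in the text.
OPERATORS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": protected_div,
}

def random_argument(n_vars, rng, coef_range=(-10.0, 10.0)):
    """Operand gene: a variable reference ('x', index) or a real coefficient."""
    if rng.random() < 0.5:
        return ("x", rng.randrange(n_vars))
    return rng.uniform(*coef_range)

def random_equation(n_vars, rng):
    """Build one equation of the form ((A op B) op (C op D))."""
    a, b, c, d = (random_argument(n_vars, rng) for _ in range(4))
    op1, op2, op3 = (rng.choice(list(OPERATORS)) for _ in range(3))
    return (op2, (op1, a, b), (op3, c, d))

def evaluate(node, x):
    """Evaluate an equation tree for one observation x (list of variable values)."""
    if isinstance(node, tuple):
        if node[0] == "x":
            return x[node[1]]
        op, left, right = node
        return OPERATORS[op](evaluate(left, x), evaluate(right, x))
    return node  # a real-valued coefficient

rng = random.Random(0)
population = [random_equation(n_vars=3, rng=rng) for _ in range(5)]
values = [evaluate(eq, [1.0, 2.0, 0.5]) for eq in population]
```

Each element of `population` plays the role of one equation S_j, and `evaluate` returns its prediction for a single observation.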

At a second stage, after determining the initial population of candidates, the

evolution process starts selecting those equations that fit best to the problem. For this

purpose, the R-Square has been adopted as fitness criterion. This performance measure

is defined as:

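$$R^2_j = 1 - \frac{\sum_{i=1}^{M} \left( IG_i - \widehat{IG}_i \right)^2}{\sum_{i=1}^{M} \left( IG_i - \mathrm{mean}(\widehat{IG}_i) \right)^2} \qquad \forall\, 1 \le j \le N$$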

where $R^2_j$ is the R-Square obtained by equation j, $IG_i$ is the observed value, $\widehat{IG}_i$ is the

predicted value, and M is the total number of observations in the sub-sample employed

to train the genetic program. Next, all equations of the initial population are

ranked in decreasing order according to the value of $R^2_j$. Those equations whose

value of $R^2_j$ is very low are rejected, while those with a high value are more likely to

survive and form the basis for the next generation of equations.
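
The fitness evaluation and ranking can be illustrated with the following sketch. It follows the R-Square formula as printed above, which centres the denominator on the mean of the predicted values (the conventional definition uses the mean of the observed values). The function names, the `keep_fraction` parameter and the deterministic cut-off are assumptions of the example; the text only states that poorly fitting equations are rejected and well fitting ones are more likely to survive.

```python
def r_square(observed, predicted):
    """R-Square as printed in the text:
    1 - sum((IG_i - pred_i)^2) / sum((IG_i - mean(pred))^2)."""
    m = len(observed)
    centre = sum(predicted) / m
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - centre) ** 2 for o in observed)
    if sst == 0:
        # Degenerate case; ranking such a candidate last is an assumption.
        return float("-inf")
    return 1.0 - sse / sst

def select_survivors(candidates, observed, keep_fraction=0.5):
    """Rank candidates by decreasing R-Square and keep the top share.

    candidates maps a (hypothetical) equation name to its list of predictions.
    """
    ranked = sorted(
        candidates,
        key=lambda name: r_square(observed, candidates[name]),
        reverse=True,
    )
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]

# Toy values, invented purely for illustration.
observed = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "good": [1.1, 1.9, 3.2, 3.8],
    "flat": [2.5, 2.5, 2.5, 2.5],
    "bad": [4.0, 3.0, 2.0, 1.0],
}
survivors = select_survivors(candidates, observed)
```

With these toy values only the candidate labelled `good` survives; setting `keep_fraction=1.0` simply returns the full ranking.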

The equations that survive the selection process are used to create the

equations of a new generation of solutions (i.e., the reproduction process). To do so,

the so-called genetic operators are applied: cloning, crossover and mutation. With

the cloning operator, the fittest equations are replicated in the next generation. With the

crossover operator, pairs of equations with high values of $R^2_j$ are selected and they

exchange part of their arguments and of their mathematical operators. Finally, mutation

means that an operator or argument is randomly replaced in a small number of

equations. The top-ranked individuals are exempted from mutation, so that their

information is not lost. Let us consider, for example, that the following equations

belong to the initial population:

S1 : ( A + B) / C

S 2 : (D ⋅ E ) − G

where A, B, C, D, E and G are the equation arguments (coefficients and independent

variables). Let us suppose that both expressions will survive the selection process and

so they become the base equations for the next generation. The crossover operator

means the random selection of a block of operators and arguments in each equation and

their later exchange. For instance, let us suppose that the block (A+B) in expression S1

and the argument G in expression S 2 have been selected. By means of an exchange of

blocks two new equations appear as follows:

S3 : G / C

S 4 : (D ⋅ E ) − ( A + B )

As one can observe, the new equations inherit certain features from their parents.

Now let us suppose that the expression S1 is selected again and the mutation operator is

applied. So, the following equation can be obtained from S1 :

S 5 : ( A ⋅ B) / C

where the mutation was the random alteration of a mathematical operator.
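
The worked example with S1 to S5 can be reproduced with a short sketch in which equations are encoded as nested tuples. The encoding and the helper names are assumptions made for illustration; in an actual genetic program the exchanged blocks and the mutated gene are chosen at random, whereas here they are fixed so that the output matches the example in the text.

```python
def replace_subtree(tree, target, replacement):
    """Return a copy of tree with the first subtree equal to target replaced."""
    if tree == target:
        return replacement
    if isinstance(tree, tuple):
        op, left, right = tree
        new_left = replace_subtree(left, target, replacement)
        if new_left != left:
            return (op, new_left, right)
        return (op, left, replace_subtree(right, target, replacement))
    return tree

def crossover(parent1, block1, parent2, block2):
    """Exchange block1 (from parent1) with block2 (from parent2)."""
    child1 = replace_subtree(parent1, block1, block2)
    child2 = replace_subtree(parent2, block2, block1)
    return child1, child2

def mutate_operator(tree, block, new_op):
    """Replace the operator of the given block by new_op."""
    _, left, right = block
    return replace_subtree(tree, block, (new_op, left, right))

def to_text(tree):
    """Render a tuple-encoded equation as a readable string."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return f"({to_text(left)} {op} {to_text(right)})"
    return tree

s1 = ("/", ("+", "A", "B"), "C")   # S1: (A + B) / C
s2 = ("-", ("*", "D", "E"), "G")   # S2: (D * E) - G

s3, s4 = crossover(s1, ("+", "A", "B"), s2, "G")
s5 = mutate_operator(s1, ("+", "A", "B"), "*")

print(to_text(s3))  # (G / C)
print(to_text(s4))  # ((D * E) - (A + B))
print(to_text(s5))  # ((A * B) / C)
```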

In short, the new population created from the initial population of equations is

composed of cloned equations (such as S2), mutated expressions (such as S5) and

crossed expressions (such as S3 and S4). From this moment, the process will repeat the selection and

reproduction stages in an iterative way. After a given number of generations,

determined by the user, the iteration procedure ceases and an optimal mapping

IGˆ = F ( X 1 , X 2 ,..., X k ) is given by the strongest mathematical equation in the



This paper analyses the functional relation between the quality of institutions

and a set of historical, geographical, economic, religious and social variables. Table

1 gives a brief description of the variables employed. Regarding the dependent

variable, we construct a general index of institutional quality (IG) by adding the six

particular indicators of governance that were elaborated by Kaufmann, Kraay and

Mastruzzi (2003); in this sense, we follow the index aggregation process proposed by

Easterly and Levine (2003). On the other hand, the explanatory variables include

ethnolinguistic fractionalization, legal tradition (English Common Law,

Socialist/Communist Law, French Commercial Code), religion (Roman Catholic,

Protestant, Others), and geographical (latitude) and economic (GNP) conditions (La

Porta et al., 1999). In this way, the database constructed for our study contains complete

information about 117 countries.

In order to detect the possible existence of overfitting, the total sample was

divided into two sub-samples: In-Sample and Out-of-Sample. The In-Sample is composed

of 90 randomly chosen observations and was reserved exclusively for obtaining the

models. The Out-of-Sample contains the remaining observations. Its

main function is to verify the validity and consistency of the obtained models and,

consequently, to detect possible overfitting problems. It is therefore necessary that

the R-Square values obtained In-Sample and Out-of-Sample be similar and relatively high.

If this condition holds, it indicates the ability of the constructed models

to generalize to new observations and, therefore, the absence of overfitting problems.
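
A minimal sketch of this validation scheme follows, assuming the 117 countries are simply indexed from 0 to 116. The seed, the function names and the two thresholds in `generalises` are hypothetical; the text only requires the two R-Square values to be "similar and relatively high" and gives no numerical cut-offs.

```python
import random

def split_sample(n_total, n_in_sample, seed=0):
    """Randomly split observation indices into In-Sample and Out-of-Sample."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return sorted(indices[:n_in_sample]), sorted(indices[n_in_sample:])

def generalises(r2_in, r2_out, min_r2=0.6, max_gap=0.1):
    """Check that both R-Square values are high and close to each other.

    min_r2 and max_gap are illustrative thresholds, not values from the paper.
    """
    return min(r2_in, r2_out) >= min_r2 and abs(r2_in - r2_out) <= max_gap

# 117 countries, 90 reserved for estimation, the rest for validation.
in_sample, out_of_sample = split_sample(117, 90)

# R-Square values of the first OLS model reported in Table 2.
ok = generalises(0.7236, 0.7314)
```

The Out-of-Sample indices are never used when fitting, so a large drop in R-Square on them would signal overfitting.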

In order to study the institutional determinants and their temporal dynamics, we

estimate our model for different years of the dependent variable. Table 2 depicts the

results obtained using OLS regression and GP. At first glance, we should highlight the

temporal consistency in the results for both methods. In spite of considering different

years, the results in terms of R-Square, the explanatory variables finally chosen and

their effects on the quality index do not show temporal divergences considering both


Before analysing the OLS results, we should mention that the standard

backward stepwise procedure with a 10% level of significance was used to select

the final variables in the OLS model. In Table 2 we can observe that the R-Square

shows a relatively high value (around 0.70) and that there is only a small divergence

between the in-sample and the out-of-sample period. This indicates that

OLS does not suffer from a lack of generalisation. It seems that the method has

discovered the general pattern existing in the data rather than memorising some specific

features of the individual observations (the overfitting problem). As we can observe for the

different years, the relevant variables to explain institutional quality are GNP,

LATIT, FRENCH and SOCI (and marginally, once, CAT). The signs of the estimated

coefficients seem to be in accordance with a priori expectations. Specifically, the

coefficients on GNP and LATIT are positive, while those on SOCI and FRENCH are negative.

In general terms, these results are coherent with those obtained by La Porta et al. (1999),

who conclude that countries that are poor, close to the equator, ethnolinguistically

heterogeneous, or use French or Socialist laws exhibit inferior government performance.

Up to this point we have introduced the results assuming the common linear and

parametric perspective. Nevertheless, as was mentioned in the introduction, this

perspective can give rise to problems of misspecification, biasing the results and,

therefore, distorting our conclusions. For example, are the selected variables the

most important to explain institutional quality? We could also question whether the

effect of the selected variables is real or spurious, arising from assuming a specific and

rigid functional form. In order to validate our results and investigate the possible existence of a bias,

we compare them with those obtained employing a GP.

Table 2 also provides specific information about the GP results and

we can find certain similarities with OLS. First of all, among all possible arithmetic

equations, the GP approach has obtained a functional form very similar to that of the OLS

regression. The functional structures found by GP are quasi-linear; therefore, a

simple linear relation would be a valid approach to link the General Institutional Quality

Index and the explanatory variables. Secondly, as in the OLS case, the R-Square is

relatively high and stable when In-Sample and Out-of-Sample are considered. Lastly,

some variables that survive the evolutionary process coincide with the variables selected

in OLS. For instance, GNP and LATIT appear in all equations offered by the genetic

program for the different years, and their positive effects corroborate those obtained by

OLS. However, there are other variables which have survived the evolutionary process

but were not selected by the OLS backward stepwise procedure, such as ENG,

ETHF and, sometimes, SOCI. In this case, ENG shows a positive sign, and ETHF and

SOCI have a negative effect. These results obtained via a GP approach are coherent

with the conclusions of the traditional literature on the quality of institutions in the

sense of La Porta et al. (1999).
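
To make the term "quasi-linear" concrete, the first GP equation reported in Table 2 can be expanded. The following is only an algebraic rearrangement of that reported equation, not an additional result:

$$
\begin{aligned}
IG &= -16.15 + (2 + LATIT)\cdot GNP - ETHF + ENG \\
   &= -16.15 + 2\cdot GNP + LATIT\cdot GNP - ETHF + ENG, \\
\frac{\partial IG}{\partial GNP} &= 2 + LATIT, \qquad
\frac{\partial IG}{\partial LATIT} = GNP .
\end{aligned}
$$

The only departure from a purely linear form is the interaction term LATIT · GNP. Since LATIT is defined as an absolute value in Table 1, the marginal effect of GNP is at least 2, which is close to the OLS coefficient of 2.045 for the same year; the effect of LATIT is positive whenever GNP (a logarithm of GNP per capita) is positive.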

In summary, we can affirm that adopting a rigid functional form (parametric

perspective) does not provoke a loss of out-of-sample predictive ability, compared with

a flexible technique such as the GP. Moreover, there exists a certain similarity in the

variables considered as the most relevant to explain the regressand. Therefore, we can

conclude that, at least for our application, the parametric perspective adopted in

previous papers about institutional quality could be accurate.


The general procedure to model the quality of institutions has been based almost

exclusively on a linear and parametric point of view. Therefore, a priori and rigid

functional forms are imposed at the researcher's discretion rather than observed in the

data. This leads to a possible misspecification problem and, in consequence, a bias in

the results. In order to validate the results obtained in a parametric framework, it is

relevant to open an avenue of research on the determinants of the institutional quality

which adopts a non-parametric approach. Our paper has tried to initiate this avenue, and

in fact, it constitutes the first case in which the new institutional economics employs a

genetic programming approach. In particular, the main focus of this paper has been to

validate the parametric structure traditionally used in the literature (La Porta et al.,

1999). In this sense, we have opened a new frontier which demands future efforts of


Our results have revealed that the parametric perspective, which has been

commonly adopted to study the determinants of the institutional quality, could be

considered an accurate analytic approach. We have to point out that, among all possible

arithmetic equations, the GP approach has obtained a very similar functional form to the

OLS regression for all our regressions. Moreover, our comparison seems to corroborate

the results obtained by the parametric perspective (OLS) in terms of the variables that

were finally selected (GNP and LATIT); moreover, their effects on the regressand

coincide. There exist some divergences in the variables selected as the most relevant by

the different methods (for example, FRENCH is considered relevant using the OLS

backward stepwise procedure, while ETHF and ENG are considered relevant by the

GP). However, in all cases the effects on the regressand are in accordance with previous

results obtained by the parametric approach (La Porta et al., 1999).

Analysing the fit criterion, the R-Square shows a similar value for both methods.

Moreover, for both cases, there exists a small divergence between the R-Square in the

in-sample and the out-of-sample period. Therefore, we can confirm the absence of

overfitting problems using both OLS and GP.

A final comment must be made about GP. A genetic program can be very

useful to model and validate parametric results (analysing the surviving variables and

their signs, for example). However, we should not forget that the field of Genetic

Algorithms, and of evolutionary computing in general, is relatively new and many of its

problems are still under study (Mitchell, 2001). More research needs to be done in order

to improve and perfect the procedure (for example, our genetic program requires new

technical developments and improvements to incorporate a higher number of variables

and reduce the computational time).


TABLE 1: Variables included in the Analysis

Variable | Description
IG       | General Institutional Quality Index: Sum of the six quality indicators by Kaufmann et al. (2003): Voice and Accountability, Political Stability, Government Effectiveness, Regulatory Quality, Rule of Law and Control of Corruption.
ETHF     | Ethnolinguistic Fractionalization: Average value of indices of ethnolinguistic fractionalization.
ENG      | English Common Law: Identifies the Legal Origin of the English Common Law.
SOCI     | Socialist/Communist Law: Identifies the Legal Origin of the Socialist/Communist Law.
FRENCH   | French Commercial Code: Identifies the Legal Origin of the French Commercial Law.
PROT     | Protestant Religion: Identifies the percentage of the population of each country that is Protestant.
CAT      | Catholic Religion: Identifies the percentage of the population of each country that is Catholic.
         | Other Religion: Identifies the percentage of the population of each country that belongs to other religions (non-Catholic and non-Protestant).
LATIT    | Latitude: Absolute value of the latitude of the country.
GNP      | Logarithm of GNP per capita (expressed in current US dollars for the period

Table 2: OLS and GP Results

MODEL | Equation | In-Sample | Out-of-Sample | Total | Variables (Sign)
OLS | IG = −15.862 − 2.884·SOCI − 1.805·FRENCH + 10.084·LATIT + 2.045·GNP | 0.7236 | 0.7314 | 0.7276 | FRENCH (−)
    | (0.00) (0.00) (0.01) (0.00) (0.00)
GP  | IG = −16.15 + (2 + LATIT)·GNP − ETHF + ENG | 0.7175 | 0.764 | 0.7325 | ENG (+)
OLS | IG = −14.301 − 3.472·SOCI − 1.815·FRENCH + 10.496·LATIT + 1.877·GNP | 0.7086 | 0.6978 | 0.7086 | FRENCH (−), LATIT (+), SOCI (−)
    | (0.00) (0.006) (0.012) (0.00) (0.00)
GP  | IG = −14.88 + (2 + LATIT)·GNP + ENG + 3.78/(SOCI − 3.3) | 0.7015 | 0.7356 | 0.7125 | ENG (+), LATIT (+)
OLS | IG = −13.426 − 3.618·SOCI − 2.739·FRENCH + 0.025·CAT + 10.912·LATIT + 1.69·GNP | 0.7015 | 0.7572 | 0.7165 | FRENCH (−)
    | (0.00) (0.006) (0.002) (0.02) (0.00) (0.00)
GP  | IG = −15.72 + (2 + LATIT)·GNP − ETHF − SOCI + ENG | 0.6824 | 0.7675 | 0.7048 | ENG (+)
OLS | IG = −14.040 − 2.558·SOCI − 1.739·FRENCH + 7.806·LATIT + 1.871·GNP | 0.6935 | 0.7517 | 0.7104 | FRENCH (−)
GP  | IG = −14.62 + (1.74 + LATIT)·GNP − ETHF + 2·ENG | 0.6944 | 0.7449 | 0.7093 | LATIT (+)


