Big data analytics: a demographer's perspective
Guillaume Wunsch
Demography, UCLouvain, Louvain-la-Neuve, Belgium
Résumé
Big data analytics: a demographer's point of view. In recent years, several demographers have stressed the need to take big data into account in demographic studies. Some favour inductive approaches, since statistical algorithms could make it possible to discover new relationships in the data. This article examines some of the methods, both old and new, that have been developed for detecting relationships and associations in the data. It ends with a discussion of how big data and big data analytics can contribute to improving the explanatory power of models in the social sciences, and in demography in particular.
Abstract
In the past few years, several demographers have pointed out the need to consider big
data in population studies. Some are in favour of data-driven approaches, as statistical
algorithms could discover novel patterns in the data. This paper examines some of the
methods, both old and new, that have been developed for detecting patterns and
associations in the data. It concludes with a discussion on how big data and big data
analytics can contribute to improving the explanatory power of models in the social
sciences and in demography in particular.
Mots clés
abduction, exploratory data analysis, deduction, big data, induction, big data analytics methods
Keywords
abduction, big data, big data analytics, deduction, exploratory data analysis, induction
Corresponding Author:
Guillaume Wunsch, Demography, University of Louvain (UCLouvain), Louvain-la-Neuve, Belgium
Email: [email protected]
Introduction
In the past few years, several demographers have pointed out the need to consider big
data in population studies. For example, Stephanie Bohon (2018) has argued that demographers have long collected and analysed big data but in a small way, focusing only on a subset of the data. She considers that demographers should target big deep data, i.e. population-generalizable data such as censuses, rather than data created for purposes other than research, such as social media data. Bohon is in favour of data-driven approaches that look at the data holistically with a view to detecting possible patterns.
Bohon’s proposal is in fact rooted in the inductive tradition of research.
In the recent past, individual-level anonymized data from censuses and registers have
become increasingly available and demographers have taken advantage of this situation.
A review of the literature has shown that, in the field of big data, demographers analyse
huge amounts of data at the level of the individual, coming mainly from censuses and
from various administrative registers (Wunsch et al., 2024). In addition, more and more
national institutes are now linking data sources together at the individual level, namely
census with registers, census with census, and registers with registers. For demographers,
one can truly speak of a (big) data revolution. Of course, in all these sources some
persons remain uncovered and thus undocumented. For example, population registers
usually include only the legal residents of the country.
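To make this kind of individual-level linkage concrete, the following minimal sketch merges an invented census extract with an invented register extract, assuming both sources share a pseudonymized person identifier; the data, the identifier and the variable names are purely illustrative.

```python
# Minimal sketch of individual-level record linkage between a census extract
# and a register extract; identifiers and variables are invented.
import pandas as pd

census = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "birth_year": [1950, 1984, 1991, 2002],
    "education": ["low", "high", "medium", "high"],
})
register = pd.DataFrame({
    "person_id": [2, 3, 4, 5],
    "births_registered": [2, 0, 1, 3],
})

# An inner join keeps only the persons found in both sources; persons covered
# by one source only remain undocumented, as noted in the text.
linked = census.merge(register, on="person_id", how="inner")
print(linked)
```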
The availability of big microdata has considerably expanded the scope of demographic research in all fields of study. For instance, multilevel analyses can now be
conducted on a large scale and much more individual information is available for feeding
agent-based models. However, the same review of the literature cited above has also
shown that demographers are not very involved in causal inference and often have
recourse to single-equation models that cannot spell out the structure of relationships
among the variables. Lately, the author of the present text has co-authored several
articles on structural causal modelling in a hypothetico-deductive perspective. However, this approach requires strong background knowledge and theory, which are often lacking. As an alternative, could data-driven approaches based on big data analytics
improve the explanatory power of models in the social sciences? Could induction come
to the rescue of deduction?
Following Doug Laney (2001), big data are characterised by three Vs: their volume, their variety and their velocity. Volume relates to the huge amount of data produced by public and private sources; variety to the various sources and formats of the data (such as text, sensor data, satellite imagery, etc.); and velocity to data creation in real time. Big data
analytics is the process of extracting information from big data, such as discovering
patterns and correlations in the data. It corresponds to an exploratory analysis of the data
and, as such, is nothing new. Indeed, for decades social scientists have used various
forms of exploratory multivariate analysis in the search for non-random patterns or
structures in the data. As pointed out by Brian Everitt (1978), one may not know in
advance what the structural characteristics of one’s multivariate data are. In this case,
one should rely on exploratory techniques rather than on confirmatory ones. To give but
one example, some fifty years ago Michel Loriaux (1971) recommended taking a data-
driven approach in demography, as few tried and true theories were available in this
field. He proposed segmentation analysis for the exploration of the data, a stepwise
application of the one-way analysis of variance model. Segmentation analysis is actually
a special case of survival trees that are based on the principle of recursive partitioning
algorithms. The latter could be used as an alternative to parametric regression in the
social sciences (Robette, 2022).
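As a minimal illustration of recursive partitioning, the sketch below fits a small classification tree with scikit-learn on invented survey-like data and prints the resulting splits; the variables, cut-offs and tree settings are arbitrary assumptions, not the segmentation procedure proposed by Loriaux (1971).

```python
# Minimal sketch of recursive partitioning (a classification tree) as an
# alternative to parametric regression; data and variable names are invented.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 1000
age = rng.integers(18, 80, size=n)      # hypothetical respondent age
educ = rng.integers(0, 3, size=n)       # 0 = low, 1 = medium, 2 = high education
urban = rng.integers(0, 2, size=n)      # 1 = urban residence
# Invented outcome: the event probability depends non-linearly on the covariates
p = 0.2 + 0.3 * (age > 45) + 0.2 * (educ == 0) - 0.1 * urban
y = rng.random(n) < p

X = np.column_stack([age, educ, urban])
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50, random_state=0)
tree.fit(X, y)

# The printed rules show the successive binary splits of the sample,
# i.e. the 'segmentation' of the data into more homogeneous groups.
print(export_text(tree, feature_names=["age", "educ", "urban"]))
```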
This paper examines some of the methods, both old and new, that have been developed for detecting patterns and associations in the data. The focus is firmly on the exploratory analysis of data. Recent methods are based on machine learning (ML) and artificial intelligence (AI). No attempt is made, however, to cover the full field of big data analytics, which is becoming larger every year. The choice of a specific method or algorithm depends upon the goal of the study. In particular, one should consider whether the objective is interpretability or predictive accuracy (see Bi et al., 2019). The paper
concludes with a discussion on how big data and big data analytics can contribute to
improving the explanatory power of models in the social sciences and in demography in
particular.
account for a large proportion of the variance in the data, one can project the observations on the plane defined by these two components, showing the possible structure in the data and pointing out any outliers. Other classic visualization techniques rely on non-metric multidimensional scaling, for ordinal variables, or on non-linear mapping applied to distance matrices. These and other traditional multivariate data visualization methods
are described, for instance, in Everitt (1978). For more recent approaches, such as Lexis
fields or composite lattice plots, see the special collection of the journal Demographic
Research on data visualization finalized on 20 April 2021.
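As a minimal sketch of the classic projection described above, the following code computes the first two principal components of an invented data set and plots the observations in the plane they define; the data and the amount of induced correlation are purely illustrative.

```python
# Minimal sketch: project observations on the plane of the first two
# principal components; the data set is synthetic and purely illustrative.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))            # 200 units, 6 variables
X[:, 1] += 0.8 * X[:, 0]                 # induce some correlation
X[:, 2] -= 0.5 * X[:, 0]

pca = PCA(n_components=2)
scores = pca.fit_transform(StandardScaler().fit_transform(X))
print("share of variance explained:", pca.explained_variance_ratio_.sum())

# Scatter plot of the observations on the first two components,
# which can reveal clusters or outliers.
plt.scatter(scores[:, 0], scores[:, 1], s=10)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.show()
```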
Dimensionality reduction
Some classic methods are still used, such as principal components and factor analysis,
for dimensionality-reduction purposes. However, the analysis of big data can require
new techniques, traditional methods often being ill suited for analysing complex large-
scale data. A more recent approach in machine learning is feature subset selection that
searches the space of persons’ feature (or attribute or variable) subsets for the optimal
subset. The method is based on the relevance and redundancy of the features for the
problem at hand. Relevance and redundancy are evaluated respectively by entropy and
similarity measures (GeeksforGeeks, 2021). Another method is t-distributed stochastic
neighbour embedding, a non-linear dimensionality reduction algorithm that seeks to find
patterns in the data by identifying clusters based on similarity of data points (Schochastics, 2017). It can be used for visualizing the data in a two- or three-dimensional space.
One should also point out the regularization approach in machine learning that reduces
the dimensionality of the training set in order to avoid overfitting. For example, Lasso
regression minimizes the complexity of the model by a penalty function limiting the sum
of the absolute values of the model coefficients. In Ridge regression, the penalty function
is equivalent to the square of the magnitude of the coefficients. Another technique is
Elastic-Net regression that improves on both Ridge and Lasso by using a modified
penalty function based on the combination of the penalties of the Lasso and Ridge
methods (see e.g. Nirisha, 2021). Dimensionality reduction can however lead to a high
bias error.
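To make the regularization idea concrete, the sketch below fits Lasso, Ridge and Elastic-Net regressions with scikit-learn on invented data in which only a few predictors are truly relevant; the penalty strengths (alpha, l1_ratio) are arbitrary choices for the example.

```python
# Minimal sketch of regularized regression: Lasso (L1 penalty), Ridge (L2
# penalty) and Elastic-Net (a combination of both), fitted on synthetic data
# in which only the first five predictors matter.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(2)
n, p = 300, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]       # only 5 true coefficients
y = X @ beta + rng.normal(scale=1.0, size=n)

models = {
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
    "elastic-net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X, y)
    kept = np.sum(np.abs(model.coef_) > 1e-6)
    print(f"{name}: {kept} non-zero coefficients out of {p}")
```

Lasso typically sets most of the irrelevant coefficients exactly to zero, whereas Ridge only shrinks them; this is why the choice of penalty matters when the aim is to reduce the number of features.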
Cluster analysis
Clustering is of particular interest for demographers. It groups together low-level data
and can uncover hidden similarities between members of a group or feature differences
between groups. Outliers can be detected as falling outside the clusters. Cluster analysis
aims at putting units (e.g. individuals) into groups or clusters, minimizing the intra-group
differences among units and maximizing the inter-group differences. For example,
Duchêne and Thiltgès (1993) have clustered the 43 regional sub-divisions (‘arrondissements’) of Belgium into 4 groups with common characteristics of mortality over age 15 in order to examine regional disparities in adult mortality. Clustering could be especially useful in the analysis of big data, such as census microdata, where the very large number of individuals could be grouped into a much smaller number of units that are more convenient to analyse and in which individuals share common characteristics.
Several distance measures are available for the purpose of clustering, the most familiar one being the Euclidean distance. Other distance measures can be preferred in some applications. A well-known one is the Mahalanobis distance, but recent algorithms also have recourse to the Manhattan distance or the Minkowski distance, among others, depending on the data-mining problem (Tsai et al., 2015). Algorithms usually fall
into one of three main categories of clustering: density-based clustering, partitioning
clustering, and hierarchical clustering, though others exist such as grid-based algorithms.
For their pros and cons, see Wang (2017). Though not a new approach, cluster analysis
has been developed considerably in recent years. Some techniques now rely on fuzzy
clustering, such as the fuzzy c-means (FCM) algorithm, which generates fuzzy partitions.
A major problem is how to reduce the possible complexity of the data in big data
clustering, the data being structured (potentially available in tabular form) or unstructured (such as blogs or images), and often provided by various sources.
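As a minimal sketch of the partitioning and hierarchical approaches mentioned above, the following code clusters invented two-variable data with k-means (Euclidean distance) and with agglomerative clustering based on the Manhattan distance; the data, the number of clusters and the linkage criterion are assumptions made for the example.

```python
# Minimal sketch: partitioning (k-means) and hierarchical (agglomerative)
# clustering of synthetic individual-level data; data are invented.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Three invented groups of individuals described by two variables
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3, 3], scale=0.5, size=(100, 2)),
    rng.normal(loc=[0, 4], scale=0.5, size=(100, 2)),
])
X = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
# 'metric' was called 'affinity' in scikit-learn versions before 1.2
hier = AgglomerativeClustering(n_clusters=3, metric="manhattan",
                               linkage="average").fit(X)

print("k-means cluster sizes:", np.bincount(kmeans.labels_))
print("hierarchical cluster sizes:", np.bincount(hier.labels_))
```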
Finally, if the problem is expressed as sequences of events, such as life courses in demography, similar sequences can be clustered together according to their resemblance, in a pattern-search approach, using sequence analysis with optimal matching or alignment algorithms (Abbott, 1995). See Ritschard et al. (2008) for a comparison between sequence analysis and survival trees for mining event histories.
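A highly simplified sketch of the idea, assuming short categorical life-course sequences and unit substitution and indel costs (in real applications the cost settings are a substantive choice and dedicated software is normally used): compute pairwise edit distances between the invented sequences and cluster the resulting distance matrix.

```python
# Minimal sketch of optimal matching on invented life-course sequences:
# an edit distance with unit substitution/indel costs, followed by
# hierarchical clustering of the pairwise distance matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def om_distance(a, b, sub_cost=1.0, indel_cost=1.0):
    """Edit distance between two state sequences (dynamic programming)."""
    d = np.zeros((len(a) + 1, len(b) + 1))
    d[:, 0] = np.arange(len(a) + 1) * indel_cost
    d[0, :] = np.arange(len(b) + 1) * indel_cost
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = d[i - 1, j - 1] + (0 if a[i - 1] == b[j - 1] else sub_cost)
            d[i, j] = min(sub, d[i - 1, j] + indel_cost, d[i, j - 1] + indel_cost)
    return d[len(a), len(b)]

# Invented yearly states: S = single, U = union, C = union with child
sequences = ["SSSSUUUUCC", "SSSUUUUCCC", "SSSSSSSSUU", "SSSSSSSSSU", "SUUUCCCCCC"]
n = len(sequences)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = om_distance(sequences[i], sequences[j])

# Cluster the sequences on the basis of their pairwise distances
labels = fcluster(linkage(squareform(dist), method="average"),
                  t=2, criterion="maxclust")
print("cluster membership:", labels)
```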
Association rule learning
This machine-learning method detects co-occurrences of items in the data and indicates the strength of association among these co-occurrences. For a clear introduction, see the entry ‘Association rule learning’
in Wikipedia (2022). The method searches for the ‘if x then y’ patterns or item sets among variables that are the most frequent in the data. Several algorithms are available for this purpose. In order to avoid discovering too many such rules, a threshold has to be set. The algorithms rely on two important concepts: Support, or how frequently the item set {x, y} appears in the dataset, and Confidence, or p(y|x), i.e. the conditional probability of y given x. A third concept is Lift, i.e. the ratio of the observed frequency of co-occurrence to the expected frequency if x and y were independent. If x and y are actually independent, Lift = 1. Positive or negative associations lead respectively to Lift
> 1 and Lift < 1. In high-dimensional spaces, the method may require preliminary
dimensionality reduction. Of course, some associations detected may be spurious and
nothing guarantees that the associations are relevant from a causal viewpoint.
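To make the three concepts concrete, the following sketch computes support, confidence and lift for a single hypothetical ‘if x then y’ rule from a small invented set of records; real applications rely on dedicated algorithms such as Apriori (Agrawal et al., 1993) rather than this brute-force computation.

```python
# Minimal sketch: support, confidence and lift of a single "if x then y"
# rule, computed directly from a small set of invented records.
records = [
    {"x", "y", "z"},
    {"x", "y"},
    {"x", "z"},
    {"y", "z"},
    {"x", "y", "w"},
    {"z"},
]
n = len(records)

def support(itemset):
    """Share of records containing every item of the itemset."""
    return sum(itemset <= r for r in records) / n

supp_x = support({"x"})
supp_y = support({"y"})
supp_xy = support({"x", "y"})

confidence = supp_xy / supp_x            # p(y | x)
lift = supp_xy / (supp_x * supp_y)       # > 1 indicates a positive association

print(f"support(x and y) = {supp_xy:.2f}")
print(f"confidence(x -> y) = {confidence:.2f}")
print(f"lift(x -> y) = {lift:.2f}")
```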
these methods by splitting segments, merging segments, and avoiding islands, i.e. connected
clusters of pixels that are surrounded by pixels of another segment.
With climate change and the expected increases in emigration from the more affected
areas, the analysis of satellite imagery combined with other data, such as digital traces from
mobile phones and field studies, could become a major tool in migration research. For an
example relating to migration caused by armed conflict, see Pech and Lakes (2017).
as it can open the door to proposing new explanatory mechanisms that would then have
to be tested. These newer methods are also being used for prediction, with qualified success, as the predictions are based on past knowledge. In the social sciences, causes and
causal relations change over time and nothing guarantees that the past and present
determinants of union formation, for example, will remain identical in the future.
and better causes of C than A. Abduction thus requires testing the validity of the
proposed explanation A and comparing A to other possible causes.
Scientists use induction, abduction and deduction in their current practice according
to their needs; much depends on the availability of theory and data. If theory is unavailable, induction can come to the rescue of deduction by proposing possible new hypotheses. But these then have to be tested, induction thus leading to deduction.
To conclude
In an inductive approach, big data analytics can detect novel patterns and associations in
the data. However, as Kitchin (2014) has observed, it is one thing to identify patterns in
big data; it is another thing to explain them. In other words, data do not speak for
themselves: they have to be interpreted. One can presume that among the various
associations that will be detected, only a few will possibly make causal sense. In a
data-driven approach, the problem then consists in proposing and testing a suitable
mechanism that can explain why a variation observed in one variable produces a variation in another variable, observation bias and confounding being under control. When
background knowledge is available, abduction can be used for this purpose.
To bring this article to a close, big data, machine learning, and artificial intelligence
will have a profound effect on scientific discovery but they will not replace human
judgment in the construction and testing of explanatory models.
Acknowledgements
The author wishes to thank Catherine Gourbin and Federica Russo, and two anonymous reviewers,
for their valuable insights in the preparation of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Notes
1. Actually, the experience did not turn out too well, as there were many missing cases due to
under-reporting.
2. EigenCentrality is the eigenvector corresponding to the largest eigenvalue of the adjacency
matrix (Friedkin, 1991).
3. The term was coined by the Russian literary scholar Mikhail Bakhtin.
References
Abbott A (1995) Sequence analysis: New methods for old ideas. Annual Review of Sociology 21:
93-113.
Agrawal R, Imieliński T and Swami A (1993) Mining association rules between sets of items in
large databases. ACM SIGMOD Record 22(2): 207-216.
Ianni M, Masciari E and Sperlì G (2021) A survey of Big Data dimensions vs Social Networks
analysis. Journal of Intelligent Information Systems 57(1): 73-100.
Kitchin R (2014) Big Data, new epistemologies and paradigm shifts. Big Data & Society 1(1):
1-12.
Kucharski A (2021) The rules of contagion. London: Wellcome Collection.
Laney D (2001) 3D Data Management: Controlling Data Volume, Velocity and Variety. Application Delivery Strategies, Meta Group.
Le Bras H (1971) Géographie de la fécondité française depuis 1921. Population 26(6): 1093-1124.
Loriaux M (1971) La segmentation, un outil méconnu au service du démographe. Recherches
Economiques de Louvain 37(4): 293-327.
Marucci-Wellman HR, Lehto MR and Corns HL (2015) A practical tool for public health surveillance: Semi-automated coding of short injury narratives from large administrative databases using Naïve Bayes algorithms. Accident Analysis & Prevention 84: 165-176.
Mencarini L, Hernández-Farías DI, Lai M, Patti V, Sulis E and Vignoli D (2019) ‘Happy parents’ tweets: An exploration of Italian Twitter data using sentiment analysis. Demographic Research 40(25): 693-724.
Newman MEJ (2004) Analysis of weighted networks. Physical Review E 70, 056131.
Nigri A, Levantesi S and Aburto JM (2022) Leveraging deep neural networks to estimate
age-specific mortality from life expectancy at birth. Demographic Research 47(8): 199-232.
Nirisha V (2021) Regularization and its techniques in Machine Learning. LinkedIn. Available at
https://www.linkedin.com/pulse/regularization-its-techniques-machine-learning-nirisha-
voggu-1c/ (accessed 21 February 2023).
O’Malley AJ and Marsden PV (2008) The analysis of social networks. Health Services Outcomes
Research Methodology 8(4): 222-269.
Pech L and Lakes T (2017) The impact of armed conflict and forced migration on urban expansion
in Goma: Introduction to a simple method of satellite-imagery analysis as a complement to
field research. Applied Geography 88: 161-173.
Ritschard G, Gabadinho A, Müller NS and Studer M (2008) Mining event histories: A social
science perspective. International Journal of Data Mining, Modelling and Management 1(1):
68-90.
Robette N (2022) Trees and forest. Recursive partitioning as an alternative to parametric regression models in social sciences. Bulletin of Sociological Methodology 156(1): 7-56.
Schochastics (2017) Dimensionality Reduction Methods Using FIFA 18 Player Data. R-Bloggers. Available at http://blog.schochastics.net/post/dimensionality-reduction-methods (accessed 20 January 2022).
Titiunik R (2015) Can big data solve the fundamental problem of causal inference? Political
Science & Politics 48 (1): 75-7.
Tsai CW, Lai CF, Chao HC and Vasilakos A (2015) Big data analytics: a survey. Journal of Big
Data 2(21).
Tukey JW (1977) Exploratory Data Analysis. Reading, Massachusetts: Addison-Wesley Pub. Co.
Wang L (2017) Heterogeneous data and big data analytics. Automatic Control and Information
Sciences 3(1): 8-15.
Wikipedia (2022) Association rule learning. Available at https://en.wikipedia.org/wiki/
Association_rule_learning (accessed 25 September 2022).
Wunsch G (1988) Causal Theory and Causal Modeling. Beyond Description in the Social
Sciences. Leuven: Leuven University Press.
Wunsch G, Gourbin C and Russo F (2024) Big data, demography, and causality. Open Journal of
Social Sciences 12(1): 181-206.