Models and protocols/Modèles et protocoles

Bulletin de Méthodologie Sociologique, 2024, Vol. 161-162, 243–255
© The Author(s) 2024
DOI: 10.1177/07591063241236071

Big data analytics. A demographer’s perspective

Guillaume Wunsch
Demography, UCLouvain, Louvain-la-Neuve, Belgium

Résumé
L’analyse des données massives. Le point de vue d’un démographe. Ces dernières
années, plusieurs démographes ont souligné la nécessité de prendre en compte les données
massives (big data) dans les études démographiques. Certains sont favorables aux approches
inductives, car les algorithmes statistiques pourraient permettre de découvrir de nouvelles
relations entre les données. Cet article examine certaines des méthodes, anciennes et
nouvelles, qui ont été développées pour détecter des relations et des associations dans les
données. Il se termine par une discussion sur la façon dont les big data et leur analyse peuvent
contribuer à améliorer le pouvoir explicatif des modèles en sciences sociales, et en
démographie en particulier.

Abstract
In the past few years, several demographers have pointed out the need to consider big
data in population studies. Some are in favour of data-driven approaches, as statistical
algorithms could discover novel patterns in the data. This paper examines some of the
methods, both old and new, that have been developed for detecting patterns and
associations in the data. It concludes with a discussion on how big data and big data
analytics can contribute to improving the explanatory power of models in the social
sciences and in demography in particular.

Mots clés
abduction, analyse exploratoire des données, déduction, données massives, induction,
méthodes d’analyse des données massives

Keywords
abduction, big data, big data analytics, deduction, exploratory data analysis, induction

Corresponding Author:
Guillaume Wunsch, Demography, University of Louvain (UCLouvain), Louvain-la-Neuve, Belgium
Email: [email protected]

Introduction
In the past few years, several demographers have pointed out the need to consider big
data in population studies. For example, Stephanie Bohon (2018) has argued that demo-
graphers have long collected and analysed big data but in a small way, focusing only on a
subset of the data. She considers that demographers should target big deep data, i.e.
population-generalizable data such as censuses, rather than data created for purposes other than research, such as social media data. Bohon is in favour of data-driven approaches that look at the data holistically with a view to detecting possible patterns.
Bohon’s proposal is in fact rooted in the inductive tradition of research.
In the recent past, individual-level anonymized data from censuses and registers have
become increasingly available and demographers have taken advantage of this situation.
A review of the literature has shown that, in the field of big data, demographers analyse
huge amounts of data at the level of the individual, coming mainly from censuses and
from various administrative registers (Wunsch et al., 2024). In addition, more and more
national institutes are now linking data sources together at the individual level, namely
census with registers, census with census, and registers with registers. For demographers,
one can truly speak of a (big) data revolution. Of course, in all these sources some
persons remain uncovered and thus undocumented. For example, population registers
usually include only the legal residents of the country.
The availability of big microdata has considerably expanded the scope of demo-
graphic research in all fields of study. For instance, multilevel analyses can now be
conducted on a large scale and much more individual information is available for feeding
agent-based models. However, the same review of the literature cited above has also
shown that demographers are not very involved in causal inference and often have
recourse to single-equation models that cannot spell out the structure of relationships
among the variables. Lately, the author of the present text has co-authored several
articles on structural causal modelling in a hypothetico-deductive perspective. This
approach, however, requires strong background knowledge and theory that are often
lacking. As an alternative, could data-driven approaches based on big data analytics
improve the explanatory power of models in the social sciences? Could induction come
to the rescue of deduction?
Following Doug Laney (2001), big data are characterised by the ‘3 Vs’: their volume, their
variety and their velocity. Volume relates to the huge amount of data produced by public
and private sources; variety to the various sources and formats of the data (such as text,
sensor data, satellite imagery, etc.); and velocity to data creation in real-time. Big data
analytics is the process of extracting information from big data, such as discovering
patterns and correlations in the data. It corresponds to an exploratory analysis of the data
and, as such, is nothing new. Indeed, for decades social scientists have used various
forms of exploratory multivariate analysis in the search for non-random patterns or
structures in the data. As pointed out by Brian Everitt (1978), one may not know in
advance what the structural characteristics of one’s multivariate data are. In this case,
one should rely on exploratory techniques rather than on confirmatory ones. To give but
one example, some fifty years ago Michel Loriaux (1971) recommended taking a data-
driven approach in demography, as few tried and true theories were available in this
field. He proposed segmentation analysis for the exploration of the data, a stepwise
application of the one-way analysis of variance model. Segmentation analysis is actually
a special case of survival trees that are based on the principle of recursive partitioning
algorithms. The latter could be used as an alternative to parametric regression in the
social sciences (Robette, 2022).
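As a minimal sketch of what recursive partitioning looks like in practice (not Loriaux's original segmentation procedure), a regression tree can be grown on synthetic data with scikit-learn; the variables, sample size and tree settings below are invented for illustration.

```python
# Minimal sketch of recursive partitioning (a regression tree) as an
# exploratory alternative to parametric regression. Data and variable
# names are synthetic and purely illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
n = 5_000
education = rng.integers(0, 4, n)          # hypothetical ordinal covariate
urban = rng.integers(0, 2, n)              # hypothetical binary covariate
age_at_union = 20 + 2 * education + rng.normal(0, 2, n)

# Outcome loosely depending on the covariates, plus noise
children = 3.0 - 0.4 * education - 0.5 * urban + rng.normal(0, 0.7, n)

X = np.column_stack([education, urban, age_at_union])
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=200).fit(X, children)

# The printed tree shows the successive splits, i.e. the 'segments'
print(export_text(tree, feature_names=["education", "urban", "age_at_union"]))
```

Each printed split defines a segment of the data, which is the exploratory spirit shared by segmentation analysis and survival trees.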
This paper examines some of the methods, both old and new, that have been devel-
oped for detecting patterns and associations in the data. The focus is firmly on the
exploratory analysis of data. Recent methods are based on machine learning (ML) and
artificial intelligence (AI). No attempt is made, however, to cover the full field of big data analytics, which is growing larger every year. The choice of a specific method or algorithm depends upon the goal of the study. In particular, one should consider whether the objective is interpretability or predictive accuracy (see Bi et al., 2019). The paper
concludes with a discussion on how big data and big data analytics can contribute to
improving the explanatory power of models in the social sciences and in demography in
particular.

The old and the new


Some classic methods for exploring data can be given as examples (Everitt and Dunn,
2001). Cluster analysis is utilized for classification purposes and searches for distinct
groups or classes of individuals (or units) in the data. One can then see on what variables
the groups differ. Pattern recognition is usually based on some form of cluster analysis.
Cluster analysis requires assessing the distance between individual profiles and the
measurement of distance is a major consideration in this case. Methods such as principal
components (continuous variables) or multiple correspondence analysis (categorical
variables) aim at reducing the number of dimensions of the data matrix (n individuals
or units on p variables), there being as many dimensions as there are variables. For
example, Hervé Le Bras used principal components analysis as far back as 1971 to reduce the number of dimensions in a study of the fertility rates of the French ‘départements’ (Le Bras, 1971). When a population is divided into k groups known a priori, e.g. according to socio-economic status, discriminant analysis can be used to decide to which group an individual belongs.
Other classification methods, such as Support Vector Machines, naïve Bayes rule,
decision trees, and neural networks, have been developed more recently for dealing with
big data. For example, Marucci-Wellman et al. (2015) have used naïve Bayes algorithms to routinely classify injury narratives from large administrative databases. Nigri et al. (2022) have used a deep neural network model to assign a vector of age-specific death rates to an observed or predicted life expectancy. This method over-
comes the linearity assumption and data requirements of past approaches. For these
newer methods, the reader is referred, for instance, to Tsai et al. (2015) and Hassani
et al. (2018). Closer to demography, in the field of epidemiology, a good survey article is
Bi et al. (2019).
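The following sketch illustrates the general idea of naïve Bayes text classification with scikit-learn; it is not the pipeline of Marucci-Wellman et al. (2015), and the tiny corpus and labels are invented.

```python
# Minimal sketch of classifying short free-text records with a naive Bayes
# model, in the spirit of coding injury narratives; the example corpus and
# labels are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

narratives = [
    "slipped on wet floor and fell",
    "cut finger while using knife",
    "fell from ladder while painting",
    "burned hand on hot surface",
]
labels = ["fall", "cut", "fall", "burn"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(narratives, labels)

print(model.predict(["worker fell down the stairs"]))   # expected: ['fall']
```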
Since Tukey’s seminal book on exploratory data analysis (Tukey, 1977), much atten-
tion has been given to the visualization of the possible patterns and dependencies, trends,
and outliers or anomalies in the data. For example, if the first two principal components
account for a large proportion of the variance in the data, one can project the observa-
tions on the plane defined by these two components, showing the possible structure in the
data and pointing out any outliers. Other classic visualization techniques rely on non-
metric multidimensional scaling, for ordinal variables, or on non-linear mapping applied
to distance matrices. These and other traditional data visualization multivariate methods
are described, for instance, in Everitt (1978). For more recent approaches, such as Lexis
fields or composite lattice plots, see the special collection of the journal Demographic
Research on data visualization finalized on 20 April 2021.
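A minimal sketch of this kind of projection, using scikit-learn on synthetic data; with real data one would typically standardize the variables first and inspect the share of variance accounted for by the first two components.

```python
# Sketch: project multivariate observations onto the first two principal
# components and plot them; data are synthetic and purely illustrative.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]      # induce some correlation

Z = StandardScaler().fit_transform(X)        # standardize the variables
pca = PCA(n_components=2).fit(Z)
scores = pca.transform(Z)

print("share of variance explained:", pca.explained_variance_ratio_.sum())
plt.scatter(scores[:, 0], scores[:, 1], s=10)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```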

Machine-learning and artificial intelligence


In recent years, machine-learning (ML) has been proposed as a new approach for
detecting patterns and associations in the data and for making predictions, especially
in the case of big data with many variables. For example, in a data-driven approach,
Bonacini et al. (2021) have opted for a ML algorithm to identify structural breaks in the
time series of COVID-19 cases in Italy. For the same country, Cerqua et al. (2021) have
used ML to build a counterfactual scenario of mortality in the absence of COVID-19.
When hundreds or even thousands of variables are considered, as in weather forecasts,
the computer beats humans by far.
Most of the newer methods described below rely on automated ML iterative tech-
niques. Recent approaches now depend on artificial intelligence (AI). The models adapt
or ‘learn’ as they are exposed to new data. Often, the models are applied to subsets of
data, with a view to comparing results across models. For example, in a study on unauthor-
ized immigration to the USA, Azizi and Yektansani (2020) have split their dataset into
two subsets, one being a subset for training the model and the other a subset for testing
the trained model. The objective is to develop a model that generalizes well to new data,
the test set serving here as a proxy for new data. For a simple introduction to ML and AI,
see for example Alpaydin (2021). In the following paragraphs, some other methods of
big data analysis are briefly discussed, considering their possible usefulness in popula-
tion studies.
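A generic sketch of the train/test logic described above, with an arbitrary classifier and synthetic data; this is not the model or data of Azizi and Yektansani (2020).

```python
# Sketch of the train/test-set logic: fit on one subset, assess
# generalization on the held-out subset. Data and model are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(2)
X = rng.normal(size=(2_000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 2_000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```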

Dimensionality-reduction
Some classic methods are still used, such as principal components and factor analysis,
for dimensionality-reduction purposes. However, the analysis of big data can require
new techniques, traditional methods often being ill suited for analysing complex large-
scale data. A more recent approach in machine learning is feature subset selection, which searches the space of feature (or attribute, or variable) subsets for the optimal subset. The method is based on the relevance and redundancy of the features for the
problem at hand. Relevance and redundancy are evaluated respectively by entropy and
similarity measures (GeeksforGeeks, 2021). Another method is t-distributed stochastic
neighbour embedding, a non-linear dimensionality reduction algorithm that seeks to find
patterns in the data by identifying clusters based on similarity of data points (Schochastics, 2017). It can be used for visualizing the data in a two- or three-dimensional space.
One should also point out the regularization approach in machine learning that reduces
the dimensionality of the training set in order to avoid overfitting. For example, Lasso
regression minimizes the complexity of the model by a penalty function limiting the sum
of the absolute values of the model coefficients. In Ridge regression, the penalty function is based on the sum of the squared coefficients. Another technique is
Elastic-Net regression that improves on both Ridge and Lasso by using a modified
penalty function based on the combination of the penalties of the Lasso and Ridge
methods (see e.g. Nirisha, 2021). Dimensionality reduction can however lead to a high
bias error.
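A brief sketch of the three penalized regressions just mentioned, fitted with scikit-learn on synthetic high-dimensional data; the penalty strengths (alpha, l1_ratio) are arbitrary choices for illustration.

```python
# Sketch of Lasso, Ridge and Elastic-Net regression on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(3)
n, p = 500, 60
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]        # only 5 'real' effects
y = X @ beta + rng.normal(0, 1, n)

models = {
    "lasso": Lasso(alpha=0.1),                       # penalty on sum of |coefficients|
    "ridge": Ridge(alpha=1.0),                       # penalty on sum of squared coefficients
    "elastic-net": ElasticNet(alpha=0.1, l1_ratio=0.5),  # mix of the two penalties
}
for name, m in models.items():
    m.fit(X, y)
    print(name, "non-zero coefficients:", int(np.sum(m.coef_ != 0)))
```

The Lasso and Elastic-Net typically set many coefficients exactly to zero, whereas Ridge only shrinks them, which illustrates the trade-off between the penalties.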

Cluster analysis
Clustering is of particular interest for demographers. It groups together low-level data
and can uncover hidden similarities between members of a group or feature differences
between groups. Outliers can be detected as falling outside the clusters. Cluster analysis
aims at putting units (e.g. individuals) into groups or clusters, minimizing the intra-group
differences among units and maximizing the inter-group differences. For example,
Duchêne and Thiltgès (1993) have clustered the 43 regional sub-divisions (‘arrondisse-
ments’) of Belgium into 4 groups with common characteristics of mortality over age
15 in order to examine regional disparities in adult mortality. Clustering could be espe-
cially useful in the analysis of big data, such as census microdata, where the very large
number of individuals could be grouped into a much smaller number of units that are more convenient to analyse and in which individuals share common characteristics.
Several distance measures are available for the purpose of clustering, the most familiar one being Euclidean distance. Other distance measures can be preferred in some
applications. A well-known one is the Mahalanobis distance, but recent algorithms also
have recourse to the Manhattan distance or the Minkowski distance among others,
according to different data-mining problems (Tsai et al., 2015). Algorithms usually fall
into one of three main categories of clustering: density-based clustering, partitioning
clustering, and hierarchical clustering, though others exist such as grid-based algorithms.
For their pros and cons, see Wang (2017). Though not a new approach, cluster analysis
has been developed considerably in recent years. Some techniques now rely on fuzzy
clustering, such as the fuzzy c-means (FCM) algorithm, which generates fuzzy partitions.
A major problem is how to reduce the possible complexity of the data in big data
clustering, the data being structured (potentially available in tabular form) or unstruc-
tured (such as blogs or images), and often provided by various sources.
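A minimal sketch of how the choice of distance measure enters a clustering procedure, here agglomerative (hierarchical) clustering with SciPy on synthetic profiles; Euclidean and Manhattan ('cityblock') distances are compared.

```python
# Sketch: hierarchical clustering of individual profiles under two
# different distance measures. The profiles are synthetic.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
group_a = rng.normal(loc=0.0, size=(100, 5))
group_b = rng.normal(loc=3.0, size=(100, 5))
profiles = np.vstack([group_a, group_b])

for metric in ("euclidean", "cityblock"):        # cityblock = Manhattan distance
    d = pdist(profiles, metric=metric)           # pairwise distances between profiles
    dendrogram = linkage(d, method="average")
    labels = fcluster(dendrogram, t=2, criterion="maxclust")   # cut into 2 clusters
    print(metric, "cluster sizes:", np.bincount(labels)[1:])
```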
Finally, if the problem is expressed as sequences of events, such as life courses in
demography, clustering sequences together according to their resemblance can
be performed, in a pattern-search approach, using sequence analysis with optimal match-
ing or alignment algorithms (Abbott, 1995). See Ritschard et al. (2008) for a comparison
between sequence analysis and survival trees for mining event-histories.
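As a sketch of the idea behind optimal matching, the code below computes a simple edit distance between two hypothetical yearly state sequences, using unit insertion, deletion and substitution costs; full optimal matching typically uses state-specific substitution costs.

```python
# Sketch of the optimal-matching idea: the cost of turning one life-course
# sequence of states into another (a unit-cost edit distance).
def optimal_matching_distance(seq_a, seq_b):
    n, m = len(seq_a), len(seq_b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[n][m]

# Hypothetical yearly states: S = single, U = in union, C = union with child
a = list("SSSUUUCCCC")
b = list("SSSSSUUCCC")
print(optimal_matching_distance(a, b))   # small distance -> similar trajectories
```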

Association rule mining


The method was first developed for market-basket analysis by Agrawal et al. (1993). The
purpose here is to discover the most frequent relationships among the data attributes.
This machine-learning method also indicates the strength of association among the co-
occurrences in the data. For a clear introduction, see the entry ‘Association rule learning’
in Wikipedia (2022). The method searches for the “if x, then y” patterns or item sets among variables that are the most frequent in the data. Several algorithms are available
for this purpose. In order to avoid discovering too many such rules, a threshold has to be
set. The algorithms rely on two important concepts: Support, or how frequently the
itemset x and y appears in the dataset, and Confidence, or p(y|x), i.e. the conditional probability of y given x. A third concept is Lift, i.e. the ratio of the observed frequency of co-occurrence to the expected frequency if x and y were independent. If x and y are actually independent, Lift = 1; positive and negative associations lead respectively to Lift > 1 and Lift < 1. In high-dimensional spaces, the method may require preliminary
dimensionality reduction. Of course, some associations detected may be spurious and
nothing guarantees that the associations are relevant from a causal viewpoint.
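For concreteness, a minimal sketch of computing Support, Confidence and Lift for a single "if x, then y" rule from a toy 0/1 table (items and values are invented); dedicated algorithms such as Apriori automate the search over many candidate rules.

```python
# Sketch: support, confidence and lift for one rule from a tiny 0/1 table.
import pandas as pd

data = pd.DataFrame({
    "owns_car":    [1, 1, 0, 1, 0, 1, 1, 0],
    "lives_rural": [1, 1, 0, 1, 0, 0, 1, 0],
})

x = data["owns_car"] == 1
y = data["lives_rural"] == 1

support_xy = (x & y).mean()            # how often x and y occur together
confidence = (x & y).sum() / x.sum()   # p(y | x)
lift = confidence / y.mean()           # ratio to the frequency expected under independence

print(f"support={support_xy:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```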

Social network analysis


People have never been so highly connected as now, thanks to social media such as
Facebook (Meta) or Twitter (X). The study of these myriads of connections has stimu-
lated the development of social network analysis. Social networks are also very impor-
tant in the study of epidemics and the way contagion spreads (Kucharski, 2021). For
example, the structure of the network has an impact on the rate of contagion. If it is fully
connected, the infection can spread from a single infected person to everyone else. If the
network contains closed loops, it can increase transmission due to the variety of routes
available. Moreover, some people have far more contacts than others do. It is therefore
important to know who these high-contact persons or central agents in the network are,
as they can possibly be high transmitters of the disease. Much has been written by
demographers on the recent incidence and lethality of Covid. Some countries, such as
Belgium, have developed a register of the proximate contacts of persons affected by this
disease that could be used to study the network of transmission (see Note 1).
Social networks are often represented by graphs where agents are vertices or nodes
and edges represent non-null relationships or interactions between agents. These net-
works can be expressed by binary matrices. A network with n nodes is represented by an
n × n adjacency matrix A with elements Aij = 1 if i and j are connected and 0 otherwise.
Various descriptive statistics of the network can be computed; see O’Malley and Marsden (2008) for a good survey. For example, size is the number of nodes or agents while
density is the number of actual direct connections relative to the number of potential
ones. An agent’s degree is the number of other agents to which she is directly connected.
The degree distribution is the frequency distribution giving the number of agents having
particular degrees. The length of a path between agents is the number of edges it
contains. Measures of centrality reflect the prominence of agents within a network.
In the case of big data, much of the research relating to social networks has focused on
social network topology (such as assortative or disassortative networks) and especially
on centrality measures (Ianni et al., 2021). An important measure is degree centrality, which is based on the number of direct connections each node has to other nodes. Highly
connected individuals are probably the most popular ones and high transmitters of
information, or of contagion in the case of an infectious disease. A measure of EigenCentrality reflects not only how many links a node has with other nodes but also how many links its connected nodes have, and so on (see Note 2). An agent can acquire high centrality
either by being connected to many others or by being connected to others that themselves
are highly central. EigenCentrality identifies nodes with influence over the whole net-
work. Edges can also be weighted according to the strength of the ties; the entries of the
adjacency matrix are then the weights on the edges. A weighted graph can be mapped
onto an unweighted multigraph with multiple direct edges between nodes. See Newman
(2004) for a useful paper on this subject.
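A minimal sketch of these descriptive measures on a small, hypothetical contact network, using the Python networkx library.

```python
# Sketch: basic descriptive measures of a small hypothetical contact network.
import networkx as nx

edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e"), ("e", "f")]
g = nx.Graph(edges)

print("size (nodes):", g.number_of_nodes())
print("density:", nx.density(g))                    # actual vs potential edges
print("degree centrality:", nx.degree_centrality(g))
print("eigenvector centrality:", nx.eigenvector_centrality(g))
```

Node "a" has the highest degree centrality here, while eigenvector centrality also rewards being connected to well-connected neighbours.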
As a complement to this network-oriented approach, one should also consider
content-oriented approaches focusing on the topics and opinions or sentiments
exchanged among the network members. Natural Language Processing (NLP), nowadays largely relying on machine learning and neural networks, has become the major tool to
analyse contents and sentiments in big data. As an example using Twitter (X) data for
examining to what extent social media users report negative or positive sentiments on
topics relevant to fertility, see Mencarini et al. (2019). A very good literature review of
the network- and content-oriented approaches is presented in Bazzaz et al. (2021).

Analysing satellite imagery


The analysis of spatial data is an important component of research in such fields as
geography or ecology, and a wide range of methods have been developed for this
purpose (for an overview, see among others Dale et al., 2002). We conclude this non-
exhaustive survey of various methods of big data analytics by pointing out the impor-
tance of satellite imagery for some demographic applications. There are now a variety of
high-resolution satellite imagery sources, often freely available, that can provide infor-
mation on landscapes and infrastructures such as buildings and roads (GISGeography,
2023). These data are used e.g. for census mapping, in combination with geographical
information systems and census questionnaires on tablets. AI-powered algorithms can
now derive highly detailed digitized coloured maps from geospatial big data, distin-
guishing buildings, road networks, green areas, water, etc. (see for instance Ecopia AI at
https://www.ecopiatech.com/). Darin et al. (2022) have, for example, used satellite imagery for Burkina Faso to obtain information on some areas where census
data cannot be collected, by combining observations of buildings in satellite images with
complementary demographic data. Another application is the creation of population
density maps by age and gender, combining satellite imagery with census information,
such as those that can be found on the HDX open platform.
Efficient analysis of aerial digital video data requires reliable segmentation algorithms, which create subsets or segments based on common charac-
teristics among the units of observation. This approach should not be confused with ‘seg-
mentation analysis’ as used by Loriaux (see Introduction). In the present case, image
segmentation is the partitioning of an image into connected regions of pixels defined by
similar colour or texture. For instance, González-Acuña et al. (2016) compare four segmen-
tation methods (clustering) on aerial data and show that each has its own merits and draw-
backs. They propose post-processing in order to improve the segmentation performance of
these methods by splitting segments, merging segments, and avoiding islands, i.e. connected
clusters of pixels that are surrounded by pixels of another segment.
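As a rough sketch of the idea, pixels can be clustered on colour values alone with k-means; real segmentation pipelines for aerial imagery are considerably more elaborate (texture, spatial contiguity, post-processing).

```python
# Sketch: crude colour-based image segmentation by clustering pixel values.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
image = rng.random((64, 64, 3))             # stand-in for an RGB aerial image
pixels = image.reshape(-1, 3)               # one row per pixel

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)
segments = kmeans.labels_.reshape(64, 64)   # segment label for each pixel

print("pixels per segment:", np.bincount(segments.ravel()))
```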
With climate change and the expected increases in emigration from the more affected
areas, the analysis of satellite imagery combined with other data, such as digital traces from
mobile phones and field studies, could become a major tool in migration research. For an
example relating to migration caused by armed conflict, see Pech and Lakes (2017).

Discussion and conclusion


Big data
Demographers analyse huge amounts of microdata coming from censuses and from various
administrative registers. In many studies, the volume of observations reaches hundreds of
thousands and even several million. In recent years, individual-level anonymized data
from censuses and registers have become increasingly available. Moreover, in the high-
income countries, many national institutes are now linking data sources together.
Big data can help improve the explanatory power of models in several ways. A large
number n of observations increases the precision of the estimates and the power of hypo-
thesis tests. In addition, following Titiunik (2015), a large number of observations can allow
for a wider range of estimation methods that would be unreliable with fewer observations. A
large n also enables studying small subpopulations that would, for instance, be overlooked in sample surveys. On the other hand, if n is very large, even small differences of no
theoretical interest will be “statistically significant”, blurring the causal picture. Once again,
statistical significance should not be confused with theoretical significance (Bijak, 2019).
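A small numerical sketch of this point: with a very large n, even a substantively trivial difference between two groups yields a very small p-value; the numbers below are purely illustrative.

```python
# Sketch: a negligible group difference becomes "significant" at large n.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 1_000_000
group_a = rng.normal(loc=0.00, scale=1.0, size=n)
group_b = rng.normal(loc=0.01, scale=1.0, size=n)   # difference of 1% of a standard deviation

t, p = stats.ttest_ind(group_a, group_b)
print(f"t={t:.2f}, p={p:.1e}")   # p is typically far below 0.05 despite a trivial effect
```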
Furthermore, a large number p of variables helps in better describing the phenomenon
under study and in reducing omitted-variable bias. However, as the number of variables
increases, the differences among individuals will also increase, as their profiles will
differ more and more. As each individual becomes increasingly unique, will grouping
e.g. life courses together still have causal meaning? Theoretical reflection on the plau-
sible subset of causal variables (not too large but not too small either) will be mandatory
in these circumstances.
On the other hand, the representativeness of big datasets is unknown in many cases as
well as their population of reference, especially in the study of found data that are neither originally collected nor designed for a specific research purpose, such as those provided
by Facebook or Twitter (X). These datasets are therefore often problematic. For instance,
social media data come from a particular subset of users that is neither represen-
tative nor random.

Big data analytics


The methods of big data analytics, both old and new, are efficient in detecting correla-
tions and patterns in the data. Actually, the newer methods of data-mining based on
machine-learning and artificial intelligence can automatically detect unknown links
among variables based on the millions of observations available. This bottom-up
approach is a useful complement to the traditional top-down hypothesis-based approach,
as it can open the door to proposing new explanatory mechanisms that would then have
to be tested. These newer methods are also being used for prediction, with qualified success, as the predictions are based on past knowledge. In the social sciences, causes and
causal relations change over time and nothing guarantees that the past and present
determinants of union formation, for example, will remain identical in the future.

Deduction, induction, and abduction


This paper ends with a more general discussion of the scientific method based on
deduction and induction in this era of big data. It also recalls the value of abduction.
It was pointed out in the introduction that the hypothetico-deductive approach, where
one tests an explanatory hypothesis by way of data, is often hampered in demography by
the lack of sound background knowledge and theory. Selecting a hypothesis or theory to
test is also a deterrent to considering other possible hypotheses. And if one tests several hypotheses,
according to what rule does one choose the winner? On this issue, see for instance
Wunsch (1988, chapter 2). Moreover, there are no general laws in the social sciences
and possible explanations are highly context-dependent. For example, the determinants
of present-day fertility in France are quite different from those at the time of the Sun
King. Even in the same country, behaviours can differ quite drastically from one population group to another. Not only can causes differ among different populations, but the
mechanisms leading from the causes to their effects can also be different. Causal claims
are therefore only valid for a specific time and place, or chronotope (see Note 3).
There are no such difficulties with induction. The latter starts with a series of obser-
vations and, if a common pattern is detected, one seeks a general explanation. For
example, non-contracepting populations can present different levels of fertility. One
observes a strong association between the fertility level and the duration of breastfeed-
ing. As the latter influences the duration of post-partum amenorrhea, one concludes that
the fertility differences are caused by different breastfeeding practices. In this case, a
causal mechanism can even be proposed. The larger the data set, the more confident one
can be in the conclusion. However, other explanations may also be valid, e.g. separa-
tion of couples due to male labour migration, different infertility prevalences, or differ-
ential durations of post-partum sexual abstinence. Though induction is a useful
companion to deduction when theory is lacking, one can never be sure that induction
leads to the right cause.
Demographers have recently suggested adopting also abductive reasoning (Hauer and
Bohon, 2020; Bijak, 2022). Abduction seeks to propose the most plausible explanation
for a novel pattern observed. In other words, if C is observed, abduction consists in
selecting a hypothesis A from one’s background knowledge, considered the most plau-
sible for the case at hand, such that if A is true then C is explained (Catellin, 2004). Abduction therefore links inductive and deductive approaches. Doctors use it when proposing an explanation of the symptoms observed in a patient, based on their knowledge of the causal relations between diseases and symptoms; more generally, scientists use it, on the basis of their knowledge, when invoking an explanation for novel
patterns discovered in an exploratory analysis of the data. However, there can be other
and better causes of C than A. Abduction thus requires testing the validity of the
proposed explanation A and comparing A to other possible causes.
Scientists use induction, abduction and deduction in their current practice according
to their needs; much depends on the availability of theory and data. If theory is unavail-
able, induction can come to the rescue of deduction by proposing possible new hypoth-
eses. But these have then to be tested, induction thus leading to deduction.

To conclude
In an inductive approach, big data analytics can detect novel patterns and associations in
the data. However, as Kitchin (2014) has observed, it is one thing to identify patterns in
big data; it is another thing to explain them. In other words, data do not speak for
themselves: they have to be interpreted. One can presume that among the various
associations that will be detected, only a few will possibly make causal sense. In a
data-driven approach, the problem then consists in proposing and testing a suitable
mechanism that can explain why a variation observed in one variable produces a varia-
tion in another variable, observation bias and confounding being under control. When
background knowledge is available, abduction can be used for this purpose.
To bring this article to a close, big data, machine-learning, and artificial intelligence
will have a profound effect on scientific discovery but they will not replace human
judgment in the construction and testing of explanatory models.

Acknowledgements
The author wishes to thank Catherine Gourbin and Federica Russo, and two anonymous reviewers,
for their valuable insights in the preparation of this article.

Declaration of Conflicting Interests


The author declared no potential conflicts of interest with respect to the research, authorship, and/
or publication of this article.

Funding
The author received no financial support for the research, authorship, and/or publication of this article.

Notes
1. Actually, the experience did not turn out too well, as there were many missing cases due to
under-reporting.
2. EigenCentrality is the eigenvector corresponding to the largest eigenvalue of the adjacency
matrix (Friedkin, 1991).
3. The term was coined by the Russian literary scholar Mikhail Bakhtin.

References
Abbott A (1995) Sequence analysis: New methods for old ideas. Annual Review of Sociology 21:
93-113.
Agrawal R, Imieliński T and Swami A (1993) Mining association rules between sets of items in
large databases. ACM SIGMOD Record 22(2): 207-216.
Alpaydin E (2021) Machine Learning. Cambridge, MA: MIT Press.
Azizi S and Yektansani K (2020) Artificial Intelligence and Predicting Illegal Immigration to the
USA. International Migration 58(5): 183-193.
Bazzaz A S, Kashani M H, Mahdipour E and Jameii M (2021) Big data analytics meets social
media: A systematic review of techniques, open issues, and future directions. Telematics and
Informatics 57.
Bi Q, Goodman K E, Kaminsky J and Lessler J (2019) What is machine learning? A primer for the
epidemiologist. American Journal of Epidemiology 188(12): 2222-2239.
Bijak J (2019) Editorial: P-values, theory, replicability, and rigour. Demographic Research 41
(32): 949-952.
Bijak J (2022) Towards Bayesian Model-based Demography. Agency, Complexity and Uncertainty
in Migration Studies. Cham: Springer.
Bohon S A (2018) Demography in the Big Data revolution: Changing the culture to forge new
frontiers. Population Research and Policy Review 37(3): 323-341.
Bonacini L, Gallo G and Patriarca F (2021) Identifying policy challenges of COVID-19 in hardly
reliable data and judging the success of lockdown measures. Journal of Population Economics
34(1): 275-301.
Catellin S (2004) L’abduction : une pratique de la découverte scientifique et littéraire. Hermès, La
Revue 39(2): 179-185.
Cerqua A, Di Stefano R, Letta M and Miccoli S (2021) Local mortality estimates during the
COVID-19 pandemic in Italy. Journal of Population Economics 34(4): 1189-1217.
Dale MRT, Dixon P, Fortin M-J, Legendre P Myers DE and Rosenberg MS (2002) Conceptual and
mathematical relationships among methods for spatial analysis. Ecography 25(5): 558-577.
Darin E, Kuépié M, Bassinga H, Boo G and Tatem AJ (2022) The population seen from the space:
When satellite images come to the rescue of the census. Population 77(3): 437-464.
Duchêne J and Thiltgès E (1993) La mortalité des plus de 15 ans en Belgique : les disparités
régionales en 1985-1987. Espace Populations Sociétés 1: 61-74.
Everitt BS (1978) Graphical Techniques for Multivariate Data. London: Heinemann Educational.
Everitt BS and Dunn G (2001) Applied Multivariate Data Analysis. New York, London: Arnold;
Oxford University Press.
Friedkin NE (1991) Theoretical foundations for centrality measures. American Journal of Sociol-
ogy 96(6): 1478-1504.
GeeksforGeeks (2021) Feature subset selection process. Available at https://www.geeksforgeeks.org/feature-subset-selection-process/ (accessed 20 January 2022).
GISGeography (2023) 15 Free Satellite Imagery Data Sources. Available at https://gisgeography.com/free-satellite-imagery-data-list/ (accessed 21 February 2023).
González-Acuña RG, Tao J, Breen D, Breen BA, Pointing S, Gillman L and Klette R (2016)
Robust Segmentation of Aerial Image Data Recorded for Landscape Ecology Studies. In:
Huang F and Sugimoto A (eds) Image and Video Technology – PSIVT 2015 Workshops.
PSIVT 2015. Lecture Notes in Computer Science, vol 9555. Cham: Springer.
Hassani H, Huang X and Silva E (2018) Banking with blockchained big data. Journal of
Management Analytics 5(4): 256-275.
Hauer ME and Bohon SA (2020) Causal inference in population trends: searching for demographic
anomalies in Big Data. SocArXiv. November 19.
Ianni M, Masciari E and Sperlì G (2021) A survey of Big Data dimensions vs Social Networks
analysis. Journal of Intelligent Information Systems 57(1): 73-100.
Kitchin R (2014) Big Data, new epistemologies and paradigm shifts. Big Data & Society 1(1):
1-12.
Kucharski A (2021) The rules of contagion. London: Wellcome Collection.
Laney D (2001) 3D Data Management: Controlling data Volume, Velocity and Variety. Applica-
tion Delivery Strategies, Meta Group.
Le Bras H (1971) Géographie de la fécondité française depuis 1921. Population 26(6): 1093-1124.
Loriaux M (1971) La segmentation, un outil méconnu au service du démographe. Recherches
Economiques de Louvain 37(4): 293-327.
Marucci-Wellman HR, Lehto MR and Corns HL (2015) A practical tool for public health surveil-
lance: Semi-automated coding of short injury narratives from large administrative databases
using Naïve Bayes algorithms. Accident Analysis & Prevention 84: 165-176.
Mencarini L, Hernández-Farías DI, Lai M, Patti V, Sulis E and Vignoli D (2019) ‘Happy parents’
tweets: An exploration of Italian Twitter data using sentiment analysis. Demographic Research
40 (25): 693-724.
Newman MEJ (2004) Analysis of weighted networks. Physical Review E 70, 056131.
Nigri A, Levantesi S and Aburto JM (2022) Leveraging deep neural networks to estimate
age-specific mortality from life expectancy at birth. Demographic Research 47(8): 199-232.
Nirisha V (2021) Regularization and its techniques in Machine Learning. LinkedIn. Available at https://www.linkedin.com/pulse/regularization-its-techniques-machine-learning-nirisha-voggu-1c/ (accessed 21 February 2023).
O’Malley AJ and Marsden PV (2008) The analysis of social networks. Health Services Outcomes
Research Methodology 8(4): 222-269.
Pech L and Lakes T (2017) The impact of armed conflict and forced migration on urban expansion
in Goma: Introduction to a simple method of satellite-imagery analysis as a complement to
field research. Applied Geography 88: 161-173.
Ritschard G, Gabadinho A, Müller NS and Studer M (2008) Mining event histories: A social
science perspective. International Journal of Data Mining, Modelling and Management 1(1):
68-90.
Robette N (2022) Trees and forest. Recursive partitioning as an alternative to parametric regres-
sion models in social sciences. Bulletin of Sociological Methodology 156(1): 7-56.
Schochastics (2017) Dimensionality Reduction Methods Using FIFA 18 Player Data, R-Bloggers.
http://blog.schochastics.net/post/dimensionality-reduction-methods (accessed 20 January 2022).
Titiunik R (2015) Can big data solve the fundamental problem of causal inference? Political
Science & Politics 48 (1): 75-7.
Tsai CW, Lai CF, Chao HC and Vasilakos A (2015) Big data analytics: a survey. Journal of Big
Data 2(21).
Tukey JW (1977) Exploratory Data Analysis. Reading, Massachusetts: Addison-Wesley Pub. Co.
Wang L (2017) Heterogeneous data and big data analytics. Automatic Control and Information
Sciences 3(1): 8-15.
Wikipedia (2022) Association rule learning. Available at https://en.wikipedia.org/wiki/Association_rule_learning (accessed 25 September 2022).
Wunsch G (1988) Causal Theory and Causal Modeling. Beyond Description in the Social
Sciences. Leuven: Leuven University Press.
Wunsch G, Gourbin C and Russo F (2024) Big data, demography, and causality. Open Journal of
Social Sciences 12(1): 181-206.
