Data Science For Entrepreneurship

This document summarizes a study that uses data science methods to analyze a dataset of 7.7 million job vacancies in the Netherlands from 2012 to 2017. The study finds: 1) Demand for both entrepreneurial and digital skills increased for managerial positions over this period, but increased more for entrepreneurial skills. 2) Entrepreneurial skills were in higher demand than digital skills across professions and over the entire 2012-2017 period. 3) Demand for certain entrepreneurial skills was particularly important for some professions but not others. The study concludes that further research on entrepreneurial skills, especially outside of entrepreneurs, could provide valuable insights.

Small Bus Econ

https://doi.org/10.1007/s11187-019-00208-y

Data science for entrepreneurship research: studying demand dynamics for entrepreneurial skills in the Netherlands
Jens Prüfer & Patricia Prüfer

Accepted: 13 March 2019

© The Author(s) 2019

Abstract The recent rise of big data and artificial intelligence (AI) is changing markets, politics, organizations, and societies. It also affects the domain of research. Supported by new statistical methods that rely on computational power and computer science (data science methods), we are now able to analyze data sets that can be huge, multidimensional, and unstructured and are diversely sourced. In this paper, we describe the most prominent data science methods suitable for entrepreneurship research and provide links to literature and Internet resources for self-starters. We survey how data science methods have been applied in the entrepreneurship research literature. As a showcase of data science techniques, based on a dataset of 95% of all job vacancies in the Netherlands over a 6-year period with 7.7 million data points, we provide an original analysis of the demand dynamics for entrepreneurial skills in the Netherlands. We show which entrepreneurial skills are particularly important for which type of profession. Moreover, we find that demand for both entrepreneurial and digital skills has increased for managerial positions, but not for others. We also find that entrepreneurial skills were significantly more demanded than digital skills over the entire period 2012–2017 and that the absolute importance of entrepreneurial skills has even increased more than digital skills for managers, despite the impact of datafication on the labor market. We conclude that further study of entrepreneurial skills in the general population, outside the domain of entrepreneurs, is a rewarding subject for future research.

Keywords Data science · Machine learning · Entrepreneurship · Entrepreneurial skills · Big data · Artificial intelligence

JEL classification L26 · C50 · C55 · C87 · O32

J. Prüfer (*)
Department of Economics, CentER, TILEC, Tilburg University, P.O. Box 90153, 5000 LE Tilburg, The Netherlands
e-mail: [email protected]

P. Prüfer
CentERdata, Tilburg University, P.O. Box 90153, 5000 LE Tilburg, The Netherlands
e-mail: [email protected]

1 Introduction

We are drowning in data. Ninety percent of the world’s data today has been created in the last 2 years alone.1 Most of it is unstructured text, images, and videos, which is hard to categorize, let alone understand, for human beings.2 There are sensor data in (self-driving) cars, smart home and office equipment, social media data, mobile data, data on Internet and browsing behavior, or digital camera images, to name just a few. This explosion of data is accompanied by tremendous progress in data science methods, which can make sense of all the available information. Those methods are fueled by artificial intelligence (AI). And this may just be the beginning (Taddy 2018). The McKinsey Global Institute recently projected that the adoption of AI by firms may follow an S-curve pattern: a slow start given the investment associated with learning and deploying the technology and then acceleration driven by competition and improvements in complementary capabilities (Bughin et al. 2018). At the macro level, they expect that AI could potentially deliver additional economic output of around US$13 trillion by 2030, boosting global GDP by about 1.2% a year. The increased output from efficiency gains and innovations could be passed to workers in the form of wages and to entrepreneurs and firms in the form of profits.3

These rapid and ongoing changes in the economic, political, and social spheres also affect the domain of research. Massively improved AI offers better, i.e., cheaper, prediction (Agrawal et al. 2018). Improved prediction capabilities allow us to work with huge data sets that are representative of entire populations, simply because they contain nearly complete data on that population (see Section 4 for an example). Even more, the statistical methods relying on AI (data science methods) allow us to tackle novel types of questions, such as the following: How can we study the role of geographic and social proximity for entrepreneurial interactions by using huge social media data sets, e.g., from Twitter, instead of traditional case studies? How can we classify the personalities of more than 1000 CEOs, identify the entrepreneurial ones, and study to what extent being entrepreneurial has positive effects on firm performance? To what degree are entrepreneurial skills and personality traits helpful for workers in all kinds of sectors and jobs? Crucially, these questions, if they could be asked at all, could not be seriously studied, let alone be answered, by traditional empirical methods that have been taught in graduate schools in economics and management in the past decades.

In their seminal article, Shane and Venkataraman (2000) defined entrepreneurship as the identification, evaluation, and exploitation of opportunities. Shane (2012) underlined that entrepreneurship is a process, not a one-time event. The questions listed above relate to opportunities initiated by very recent technological progress, which by itself is an ongoing process. Thus, the very object of entrepreneurship research changes along the development of the technological frontier. Today, due to the availability of much more data and computer power, this frontier is shaped strongly by the state of data science techniques. Today, we can analyze and interpret large amounts of complex and unstructured data and make predictions based on correlations and inductive modeling. Researchers can benefit by understanding and, where appropriate, embracing statistical methods that are driven by AI algorithms. This process has already started and has had disruptive effects on the social sciences, such as economics (Einav and Levin 2014) and management (George et al. 2014). It has created the new field of computational social science, which may reveal new patterns of individual and group behavior and allow us to model economic and social interactions more precisely (Lazer et al. 2009).

We contribute to the entrepreneurship literature in two dimensions. First, in the next section, we describe the most prominent data science methods suitable for entrepreneurship research. The goal is to give the interested reader a concise overview of what is technically possible today, with enough input and references to start educating oneself. Section 2 is complemented by the Appendix, where we provide links to literature and Internet resources and where we also delineate key technical terms and list the most relevant text mining tools and download resources for self-starters. Our second contribution comes in Sections 3 and 4. Section 3 surveys how data science methods have been applied in the entrepreneurship research literature and sketches how they have been used to study important research questions that could not, or not to the same extent, be studied without these techniques. Along these lines, in Section 4, we provide an original analysis of a data set with 7.7 million data points and study the dynamics of demand for entrepreneurial skills in the Dutch population. In Section 5, we conclude by discussing opportunities and risks of data science techniques and relate them to traditional empirical research methods and theory.

1 https://public.dhe.ibm.com/common/ssi/ecm/wr/en/wrl12345usen/watson-customer-engagement-watson-marketing-wr-other-papers-and-reports-wrl12345usen-20170719.pdf.
2 https://blog.microfocus.com/how-much-data-is-created-on-the-internet-each-day/.
3 Note, however, that in practice this redistribution of profits is often not occurring. Profits are accumulated and stay mostly in the top “superstar firms” (Mayer-Schönberger and Ramge 2018).

2 Data science methods for entrepreneurship research

2.1 Background

In conventional statistical research, you start with the formulation and testing of hypotheses with the help of
data, assuming that the data are generated by a given stochastic data model. In data science, by contrast, you churn large volumes of data looking for patterns by using algorithmic models and treating the data mechanism as unknown.4 Thus, data science “not only provides new tools, it solves a different problem” (Mullainathan and Spiess 2017, p. 88) and is able to discover complex structures that were not specified in advance (Breiman 2001). In other words, whereas conventional statistics is deductive, data science is inductive: the approaches are complementary.

Data science relies heavily on computational power and computer science to derive knowledge from the unprecedented, exponentially growing, complex, and unstructured data, the so-called big data. By making software autonomous or using iterative feedback to discover associations in data, we can find generalizable patterns and anomalies. Thus, instead of teaching machines to do things, the goal of data science is to design them to “think” for themselves and then allow them access to the mass of available data so they can learn. Moreover, while the human brain can associate two or three dimensions of information with each other, algorithms allow hundreds of dimensions. This leads to a system searching for much more fine-grained associations, clusters, and classifications, extracting meaningful information from the data. As a next step, an understandable structure can be developed to facilitate data-driven decision-making.

Due to the nature of “big data” and the complexity of the algorithms used, data science often requires special ways of data storage, accessibility, and processing. Analyses are often done by using multiple computers and multiple calculation units (so-called high-performance computing), for instance Hadoop clusters and Spark Streaming, or parallel virtual environments.5 Usually the basic steps for analysis include writing an algorithm, setting up an automated process (script), and linking it with open data protocols and application programming interfaces (APIs). Collecting large amounts of unstructured information often generates a complex information set. With the help of visualization techniques and tools, such as chord charts and network graphs, we can observe clusters within that information and present results of data analyses.

Of course, traditional data sources such as surveys and large administrative data sets (old data) can be analyzed and interpreted with the help of data science techniques, too. The computational power of these techniques allows for a much broader and varied search on existing data, which may lead to the revelation of new patterns and insights even in traditional data sources. A notable example is the use of machine learning techniques on the huge United States Patent and Trademark Office (USPTO) database. Various papers have shown that these methods can improve inventor disambiguation from this database and, thereby, help to add a more accurate understanding of inventor careers (Li et al. 2014; Ventura et al. 2015). Machine learning algorithms can not only match patents more correctly to inventors, they can also include more information from other useful data sources, for example co-authorships, collaboration variables, and geographic location. Based on this information, “large-scale innovation studies across time and space with visualization of inventor mobility across the United States” (Li et al. 2014, p. 941) are possible with much lower error rates than before. Similarly, disambiguation approaches based on machine learning are more consistent across contexts as they can cope better with varying features and detect the best features automatically and more precisely (Ventura et al. 2015).

4 This section mildly overlaps with one section in Prüfer and Prüfer (2018). That paper, however, is significantly shorter and focuses on institutional economics, not entrepreneurship research and the dynamics of entrepreneurial skills.
5 See Box A1 in the Appendix for detailed explanations of these, and more, technical terms.

2.2 Key data science methods

A multitude of different tools and techniques are available, of which we highlight the most interesting ones for entrepreneurship research. In general, Python, currently the fastest growing (general purpose) programming language, features a large range of very effective scripts and open-source libraries for these tasks.

2.3 Machine learning

Within the field of data science, machine learning (ML) is an advanced field of research dealing with the techniques that teach computers to learn without being programmed explicitly (Samuel 1959). ML is not a synonym for AI, though; it is technically a branch of AI. AI, in fact, is a much broader concept, in which machines mimic cognitive functions of learning and problem solving. Therefore, AI algorithms and machines are able to adapt to different situations and to carry out tasks in a way that we would consider “smart” or “intelligent,” that is, with human-like cognitive functions (OECD 2017; Taddy 2018).

Within ML, the two most important categories are supervised learning and unsupervised learning. Supervised ML is the name of a set of advanced algorithms that use information from known results, the so-called labels, to optimize predictions. Technically, in a supervised learning task a computer learns a relation between some observed input (usually a vector of many predictors) and some desired output (one outcome variable of interest) (Hastie et al. 2009). A supervised learning algorithm analyzes the labeled training data and produces an inferred function to map novel (test) data. Supervised learning helps to predict unseen patterns, to understand which input best predicts the outcome, and to assess the quality of previously tested predictions or inferences. It can also be combined with methods that reduce the “curse of dimensionality,” for example an algorithm for dimensionality reduction such as principal component analysis, in which an orthogonal transformation replaces possibly correlated variables with a smaller number of uncorrelated components, so that dimensions carrying little information can be dropped.

Depending on the type of data, one can choose from regression and classification techniques within supervised ML. If one has to predict continuous values, regression techniques are the way to go, while classification techniques are used in discrete settings; they identify which of a set of categories (classes) a new observation belongs to. An easy to interpret and widely used classification method is a decision tree. Starting from the root, the training observations are split into two subgroups that differ from each other as much as possible. At each node, the algorithm examines which variable it can best split on to create two new nodes. In this way, the data is split up further and further, until a stop criterion is met (for example, fewer than n training observations per node). Depending on the values of the variables, each observation ultimately falls into one class (i.e., a single leaf). The results of a decision tree can be interpreted and graphically displayed relatively easily. However, decision trees are prone to instability: a relatively small change in the data can result in another tree. Thereby, a decision tree has a large “generalization error,” a phenomenon that is also called “overfitting the data,” meaning that it can contain nodes that have been created by specific cases in the training data set, making the model poorly generalizable to other data. This means that the model can only perfectly rationalize a specific outcome based on the given training data but is not able to predict variants that were not used for training.6

Deep learning (DL) is a special class of machine learning algorithms that is frequently used for feature extraction from complex, multidimensional data such as images. For instance, Google uses DL to automatically suggest the next word(s) of a search term when one has started typing a word. DL uses the so-called (artificial) neural networks, which allow computers to more closely mimic human brains while still being faster, more accurate and less biased. Neural networks are especially suited for deriving patterns from (highly) non-linear processes. Depending on the form of the model underlying the DL algorithm, a neural network falls either within the category of supervised learning or within unsupervised learning, which we will explain below.

In a pioneering example, Tan and Koh (1996) trained a neural network based on information from psychological, demographic, and family characteristics to predict entrepreneurial inclination. Results from a survey administered among 200 business undergraduates served as training and testing data to model entrepreneurial inclination in an individual. Then, the neural network predicted inclination in any other person based on knowledge of the imputed social and psychological correlates. In this early case, the ML algorithm had an accuracy of 80% for predicting entrepreneurial inclination in individuals not encountered before.

Machine learning can also be performed unsupervised. Then it is used to learn and establish baseline profiles for different entities. In unsupervised ML, “natural” groups or clusters of observations are made,

6 One way to overcome instability is to use ensemble methods, such as a random forest (RF). This is a tree-based supervised learning technique, in which a large number of decision trees are combined to arrive at the final prediction. Thereby, the method is more stable than a single decision tree. If the target variable that we want to predict is categorical, the final outcome is determined by means of “majority voting.” In other words: the outcome of most trees is considered to be the final outcome. The collection of trees is called random, because each tree is trained on a random selection of variables and observations. When multiple models are combined in a large model, we speak of an ensemble model. Combining many separate decision trees into an ensemble model results in higher precision and in more stable predictions. Therefore, an RF generally provides much better predictions than a decision tree. Other well-known techniques are gradient boosting and Support Vector Machines (SVM). For more information on these (and other) techniques, see Hastie et al. (2009) or Provost and Fawcett (2013). There you can also find a discussion on performance metrics to test which of the available models works best for a given dataset and research question.
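The random forest idea described in the footnote above (many decision trees, each trained on a random selection of observations and variables, combined by majority voting) can be illustrated with a minimal sketch. It uses only the Python standard library. All function names (`build_tree`, `build_forest`, `predict_forest`), the toy data, and the parameter choices (25 trees, depth 3, midpoint thresholds, Gini impurity as split criterion) are illustrative assumptions and are not taken from the paper; in applied work one would more likely rely on an established library implementation.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Return the (feature, threshold) pair with the lowest weighted Gini impurity."""
    best, best_score = None, gini(labels)
    for f in features:
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between neighbouring observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, rng, n_features, max_depth=3, min_size=2):
    """Grow a tree recursively; a leaf is a class label, an inner node is a tuple."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(rows) < min_size or len(set(labels)) == 1:
        return majority
    # Random selection of variables considered at this node.
    features = rng.sample(range(len(rows[0])), n_features)
    split = best_split(rows, labels, features)
    if split is None:
        return majority
    f, t = split
    left = [(row, y) for row, y in zip(rows, labels) if row[f] <= t]
    right = [(row, y) for row, y in zip(rows, labels) if row[f] > t]
    return (
        f,
        t,
        build_tree([r for r, _ in left], [y for _, y in left], rng, n_features, max_depth - 1, min_size),
        build_tree([r for r, _ in right], [y for _, y in right], rng, n_features, max_depth - 1, min_size),
    )


def predict_tree(tree, row):
    """Walk from the root to a leaf and return the leaf's class label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def build_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees trees, each on its own bootstrap sample of the observations."""
    rng = random.Random(seed)
    n_features = max(1, int(len(rows[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        # Random selection of observations: draw with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], rng, n_features))
    return forest


def predict_forest(forest, row):
    """Majority voting: the class predicted by most trees is the final outcome."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: class 0 lies near (0, 0), class 1 near (5, 5).
rows = [(0.1 * i, 0.1 * (9 - i)) for i in range(10)]
rows += [(5 + 0.1 * i, 5 + 0.1 * (9 - i)) for i in range(10)]
labels = [0] * 10 + [1] * 10

forest = build_forest(rows, labels)
print(predict_forest(forest, (0.2, 0.3)), predict_forest(forest, (5.4, 5.1)))
```

Because the final prediction is a majority vote, one unusual bootstrap sample changes at most one vote, which is the intuition behind the more stable predictions mentioned in the footnote.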
whereby observations that are “equal” or “close” to each other belong to the same group. This allows trends and patterns in data to be properly mapped out, for instance, when customers have to be grouped into different segments based on their characteristics so that services can be tailored individually (Alsayat and El-Sayed 2016). In this type of cluster analysis, it is necessary to optimize the number of clusters and to thoroughly investigate the stability of the clusters. The latter can be done by adding noise or using multiple algorithms to check whether a certain change in data gives rise to a new cluster.

For the clustering, a distance metric is often used as (in)equality score. This can be the Euclidean distance or another distance function. Whereas most distances can only be used for numerical and complete data, Gower’s distance can deal with both categorical and missing data. The more data there is, the more computationally expensive it becomes to run an algorithm that minimizes distance. Frequently used algorithms are K-means and K-modes, which measure distances to cluster centroids and are, thereby, less computationally expensive.

A very early example of unsupervised learning using a neural network for clustering is Rutherford et al. (2001). This paper uses a so-called self-organizing map (SOM) approach to study the relation of firm size with firm success and survival. Using information from the National Survey of Small Business Finances (NSSBF), Rutherford et al. (2001) classify small firms (having fewer than 500 employees) into multiple groups based on size and ownership as well as firm characteristics. The 4637 small firms in their sample cluster naturally into two distinct groups: a larger group with 3311 members of very small firms and a smaller group with 1326 members of larger (but still small) firms. Given that these two groups differ significantly on other background characteristics, this early paper provides evidence that differences in firm size and structures matter to predict antecedents of firm survival and success.

Yet another category is reinforcement learning (RL), which differs from standard supervised learning because correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Thus, in reinforcement learning, there is no given answer. Instead, the reinforcement agent decides how to perform the given task. The only feedback given to the algorithm takes the form of rewards and punishments. In the absence of training data, it is bound to learn from its experience.

Calvano et al. (2018) use RL to experiment with AI pricing agents interacting repeatedly in a controlled environment (computer-simulated marketplaces). Their algorithmic price-setting experiments show that, when replacing human decision-making, even relatively simple pricing algorithms systematically learn to play sophisticated collusive strategies without communicating with each other at all.

2.4 Text analytics and web data scraping

In addition, data science methods are also suitable for obtaining information from unstructured data, often scraped from the Internet. This is very useful because about 80% of big data is available in unstructured text form, for example in blogs, websites, and social media (Cogburn and Hine 2017). This way, all data sources that relate to natural language can be used, such as open answers, text files, notes from customer contacts, reports or e-mails. There are several useful tools and techniques for handling text, semantic, and social data to extract valuable information from these sources. Here, we describe how and what we can infer from the data and discuss useful techniques for mining and analyzing text data to discover interesting patterns, extract useful knowledge, and support decision-making. Even more information can be found in Section 4.

In addition, Internet log files and the metadata of search engines can provide interesting information about trends over time. Search engines register when and where a search query was performed in their search logs and process this information for the answers provided to subsequent related search queries.7 The numbers of searches on certain topics and the presented order of search results often show interesting patterns, which Google Trends makes use of, for instance. In a recent book, Stephenson-Davidowitz (2017) presents research that uses different kinds of Internet data: Google Trends, online search data, information on views and clicks, and even patterns of swipes in mobile apps. A famous example is what happened after Facebook introduced the “News Feed” in 2006. With this function, users would get automated updates of the activities of all their friends. It provoked immediate fierce protests of nearly a million users but Facebook did not remove the News Feed. The company had what Stephenson-Davidowitz calls the “digital truth serum” (Stephenson-Davidowitz 2017, p. 154): numbers on clicks and visits increased tremendously after the introduction of the News Feed. In his book, the author provides many more examples of how to use Internet data to derive new insights into human nature and behavior, especially for sensitive issues such as sexual orientation, sexism, customers’ revealed preferences, and stereotypes.

Mining, clustering, and analyzing these unstructured data sources requires the use of analytical techniques for natural language. This so-called natural language processing (NLP) can be performed in different programming languages, for example, Python or R, and researchers can use well-established packages and toolboxes. Sentiment analysis, for instance, can extract subjective information from language, while topic modeling can discover the abstract “topics” in a collection of documents. Other techniques, such as named entity recognition (NER) or Part-of-Speech (POS) tagging, recognize entities such as organizations, people, locations, dates, time, or currency (NER) or word types such as verb or noun (POS) in text. Box A2 in the Appendix lists the most common concepts and tools in general and Section 4 exemplifies the steps one has to take when working with text data from online sources.

7 The economic consequences of this usage are studied in Prüfer and Schottmüller (2017).

3 Applying data science to entrepreneurship research

In this section, we highlight some recent papers using data science methods for research questions on various aspects of entrepreneurial characteristics, processes, and entrepreneurship (success). ML and/or text analytics have been applied to issues such as funding (via venture capital and via crowdfunding), (product) innovation, inventors’ disambiguation, and entrepreneurial traits. The fundamental contributions of these studies fall in two categories: the utilization of new sources of information and data, advancing the data frontier, and applying novel techniques to existing data and/or problems, thereby advancing the knowledge frontier. We now take them in order.

A classical question in entrepreneurship research relates to the factors predicting a start-up’s success (Stuart and Abetti 1987; Hisrich et al. 2007). Today, the role of online and social media communication and information for the development, identification, and success of entrepreneurial activities and agents has received a lot of attention. To arrive at deeper, richer, and more fine-grained insights on the entrepreneurial mindset, the so-called digital footprints from social media are increasingly used. Lee et al. (2017) measure overconfidence of CEOs by classifying their messages sent on Twitter. They distinguish “professional CEOs” and “founder CEOs” and find that the latter use more optimistic language on Twitter and during earnings conference calls. Founder CEOs are also more likely to issue earnings forecasts that are too high.

Aggarwal and Singh (2013) show that social media can also be used as a means to an end for entrepreneurial success. They study company blogs across multiple stages of venture capitalists’ decision-making and find that blogging can help managers in getting their products and services selected at the screening stage, but that, beyond that, blogging does not help directly. The authors show that blogs can help indirectly in the last stage of the venture capital process when negotiating a contract with the venture capitalist: blogs (with good coverage) attract the attention of competing venture capitalists, which drives up venture prices, and hence improves the blogger’s outside option.

Since the success of the managerial “upper echelons” perspective, it is rather undisputed that the individual characteristics and values of decision makers have a significant impact on the performance of firms (Andrews 1980; Hambrick and Mason 1984). A key question is how certain managerial characteristics translate into better performance. A popular empirical approach to this question has been to measure actual behavior of decision makers through real-time personal observation (Mintzberg 1973). This time-consuming procedure, however, creates the problem of small sample sizes and suffers from selection issues.

Bandiera et al. (2017) tackle the issue by developing a new methodology: First, via daily phone calls with 1114 CEOs or their assistants, they collected 42,233 data points about the decision makers’ diaries. Then they employed an unsupervised learning algorithm (a latent Dirichlet allocation, LDA), which provides them with a complete probabilistic description of time-use patterns, despite the high dimensionality of their data set. The algorithm posits that the actual behavior of each CEO is a mixture of a small number of “pure” behaviors and that the creation of each activity is attributable to one of these pure behaviors. In their case, the algorithm finds two “pure” behaviors and generates a one-dimensional behavior index that represents a CEO as a convex
combination of the two pure behaviors. Following Kotter (1999), they classify the first pure behavior as “manager” and the second as “leader”: “manager” refers to more time of the CEO spent in meetings with production-level workers and one-to-one meetings with firm employees or suppliers; “leader” refers to more time spent with top executives and in interactions with several participants and functions from inside and outside the firm together. Kotter associated “managers” with a focus on monitoring and implementation tasks, whereas “leaders” focus on the creation of organizational alignment and communication across a broad variety of characteristics. Clearly, the characterization of “leaders” is related to the characterization of entrepreneurial skills in the entrepreneurship literature (see Section 4).

As a final step, Bandiera et al. (2017) correlate their managerial behavior index with firms’ balance sheet data and find that “leader” CEOs are more likely to be found in larger and more productive firms: an increase of the behavior index by one standard deviation is associated with an increase of 7% in sales, controlling for a battery of factors. This suggests that decision makers with entrepreneurial characteristics can also do well in more established organizations. More important for the study at hand, Bandiera et al. (2017) show an innovative way to use data science techniques to give a more robust answer to an existing question involving personal characteristics of decision makers. There is a lot of scope to apply this to a host of questions in the entrepreneurship literature.

Obschonka et al. (2017a) use Twitter data to identify the personality traits of superstar entrepreneurs and compare them to the characteristics of superstar man-

these subjective and self-reported measures, digital footprints, where individuals willingly and unwillingly spread (personal) information to a large and diverse audience, can be used to derive objective and accurate information, revealing individuals’ true preferences. Obschonka et al. (2017a) show that this new tool delivers valid results for univariate and multivariate analyses of personality differences between (superstar) entrepreneurs and (superstar) managers and that, surprisingly and contrary to earlier findings, the latter category shows more entrepreneurial characteristics than the former one.9

Tata et al. (2017) use Twitter data to arrive at the “psycholinguistics of entrepreneurship” and demonstrate that even though entrepreneurs are fundamentally different from the general population, the organizational life cycle also matters for the emotions and sentiments attached to entrepreneurship and to the work-life balance in general. The use of language as a robust means for revealing individuals’ (work-life) concerns, motives, traits, and emotions is not new to the field. For instance, Tausczik and Pennebaker (2010) have shown that language is a robust means for revealing individuals’ work-life concerns and emotions. “Entrepreneurial emotion” is a topic in itself and describes a package of feelings that often come with being an entrepreneur (Cardon et al. 2012), a topic that has gained increased importance through big data and AI as enablers of new self-employed businesses: “Approximately 150 million workers in North America and Western Europe have left the relatively stable confines of organizational life, sometimes by choice, sometimes not, to work as independent contractors” (Petriglieri et al. 2018). However,
agers, Ba hitherto understudied population in entrepre- by using Twitter data for these analyses, Tata et al.
neurship research^ (p. 14). To do this, the authors use a (2017) are able to overcome several limitations of tradi-
sample of 106 Twitter accounts of (superstar) entrepre- tional data sources, such as surveys. Social media data
neurs and managers. They analyze information from can not only avoid response and recall biases; they also
these accounts by using a novel language-based person- offer a real-time window into peoples’ thoughts over
ality assessment tool that is capable of dealing with the long periods, for more actors than any existing alterna-
huge number of observations from social media data.8 tive, at any point in time, and across diverse geograph-
Up to now, traditional, survey-based methods, such as a ical locations. Moreover, content analysis of Twitter
standard Big Five questionnaire, have been used to data allows collecting information on emotions, con-
assess an individual’s personality traits. In contrast to structs, and concerns simultaneously.

8 9
The tool, Receptiviti, is used for top-down language analysis along a In another interesting paper, Obschonka and Fisch (2018) use a NLP
large number of psychological metrics. Receptiviti is the commercial approach on Twitter data to analyze whether entrepreneurial personal-
variant of the Linguistic Inquiry and Word Count (LIWC) text analysis ities are increasingly more numerous and more influential in political
platform, which allows for an assessment of language and text for leadership. They test the underlying hypothesis, that an entrepreneurial
psychological purposes in more than 80 languages. See Box A2 in the personality benefits from the rise of the Bentrepreneurial society,^ on
appendix for more details. US President Donald J. Trump, who was an entrepreneur before.

Wang et al. (2017) use Twitter data for yet another type of entrepreneurship research. They apply social network analysis to entrepreneurial networks in the USA to identify and locate entrepreneurs jointly with important regional subtleties within the network. They find that although Twitter enables interactions across geographically (and socially) distant locations, the highest intensity can be detected in regional interactions characterized by similar socioeconomic and demographic profiles. This suggests that, even in our digitally connected world, geographic and social proximity are important for entrepreneurial interactions. Hence, earlier results about the important role of social relationships for entrepreneurship are still valid; see the work of Olav Sorenson (Rickne et al. 2018). For instance, Sorenson (2018) shows that both professional and private social relationships are original reasons for industry concentrations in a small number of places, even when firms do not benefit from this clustering. Wang et al. (2017) extend the research on the relevance of networks in various ways: they simultaneously examine the types of actors engaged in digital networks and the specific regions that are active on the Twitter entrepreneurship domain. Moreover, they analyze the regional characteristics that explain the intensity of activity on this social media platform. The use of big data allows for social network analyses on a much larger scale than when using data from primary survey collection efforts. Thereby, Wang et al. (2017) offer a seminal example of bringing data science into the entrepreneurial social networks literature that has mostly been dominated by case studies (Greve and Salaff 2003). They also demonstrate how the incorporation of additional quantitative and qualitative information can mitigate issues of representativeness inherent in social media data.

Data science methods have also been applied to assess performance of crowdsourcing and crowdfunding platforms. Crowdsourcing taps into a crowd with talent to get a large number of business projects or tasks solved by dividing them into microtasks. These microprojects assigned to a skilled on-demand workforce provide many business opportunities, but can also deliver interesting research questions. Crowdfunding, on the other hand, is a unique form of entrepreneurial finance that combines elements of private and public equity (Cummings et al. 2019). It taps into the power of the crowd to acquire financial support. Platforms match individuals or entities in need of funding with individuals or groups willing to contribute financially, often in the form of microfunding.10 Apart from being interesting sources of big data, these platforms themselves can be viewed as big data and datafication phenomena as a result of the ongoing digitization and automation. Both innovative developments use mass collaboration, mostly via online tools, to accomplish certain goals, for example the funding of an idea or a project. There is serious money involved in crowdfunding11 and, therefore, reliable predictions on the success rates of these products or projects are important. Obviously, big data and data science methods also play an important role for the internal business processes at a crowdfunding or crowdsourcing platform. With social media promotions, statistics on earlier projects, market dynamics, and other activities, a huge amount of data is generated, which can be used to predict the success of ideas or products based on past analytics results.12

10 The idea of crowdfunding also exists in the financing of research, among other reasons due to high rejection rates prevalent in (scientific) research. Various scientific crowdfunding platforms emerged, for example Experiment.com (https://experiment.com/) or SciFund Challenge (https://scifundchallenge.org/). Vachelard et al. (2016) wrote: “Crowdfunding represents an attractive new option for funding research projects, especially for students and early-career scientists or in the absence of governmental aid in some countries. The number of successful science-related crowdfunding campaigns is growing, which demonstrates the public’s willingness to support and participate in scientific projects” (p. 1).

11 According to the Crowdfunding Industry Report, global crowdfunding was expected to reach $34.4 billion in 2015 (http://crowdexpert.com/crowdfunding-industry-statistics/), while the Crowdfund Campus reports that “in 2016, equity raised from crowdfunding passed VC funding for the first time, and, by 2025, the World Bank Report estimates that global investment through crowdfunding will reach $93 billion.” (https://crowdfundcampus.com/blog/2017/01/crowdfunding-in-2017-three-key-trends/).

12 On the other hand, even data collection itself can be crowdsourced. This method originates from the scientific world where the first known case of crowdsourcing, taking place around 150 years earlier than Wikipedia, is the Oxford English Dictionary. The aim of this dictionary, to list all the words known in the English language with their definition and explanation of usage, could only be reached after people all over the world contributed to it (https://dictionarylab.stanford.edu/crowdsourcing-oed). Another great example of crowdsourcing in practice is OpenStreetMap, an alternative to Google Maps launched in 2004. Since then, more than 1 million mappers have worked together to collect and supply data (https://www.openstreetmap.org/).

Hoornaert et al. (2017) build an ML model to predict the success and failure of business and product ideas generated within the crowd based on 3C’s: its content,
the contributor proposing it, and the crowd’s feedback on the idea. A non-linear, supervised algorithm identifies the variables that are most predictive of an idea’s distinctiveness and successful implementation. The authors find that considering immediately available information about the content and contributor improves the ranking performance by around 25% over random idea selection, while adding crowd-related information that accumulates over time further improves performance by up to nearly 50%. The last C, crowd feedback, is thus the best predictor, but also the one that needs most time to develop.

Courtney et al. (2017) use data from Kickstarter, a large and popular crowdfunding portal. They examine the interplay of three signal types obtained from different sources within the platform on the viability of a certain idea: the direct actions a start-up takes regarding a proposed idea and/or product (the content), its characteristics (mainly crowdfunding experience; the contributor), as well as third-party endorsements (sentiments expressed in backer comments; the crowd). For the last type of signal, the authors implement a novel sentiment analysis technique, with which the underlying tone of textual comments by backers can be derived. This allows for a continuous feedback measure of a large and heterogeneous group of individuals commenting on a project, a major improvement on the dichotomous variable that is usually used to measure third-party endorsements (Courtney et al. 2017).

On a higher level, Hartmann et al. (2016) connect data science and entrepreneurship. They derive a taxonomy of business models used by start-up firms that rely on data as a key resource for business, which they call data-driven business models. Their taxonomy consists of six different types of such business models among start-ups and thereby develops a basis for understanding how start-ups build business models that capture value from data as a key resource.

Whereas the above-cited papers use new (big) data sources, Hoberg and Phillips (2016) apply a novel technique, text analysis, to study an existing administrative database.13 They use the product descriptions that firms filed with the US Securities and Exchange Commission (SEC) to develop new time-varying industry classifications. These new, more flexible measures of industry membership are better suited to explain differences in key characteristics across industries, such as profitability, sales growth, and market risk. The text-based network classification is also informative for identifying rival firms. Moreover, these classifications show endogenously how industries and their competitors change due to external shocks and how R&D activities and advertisement are endogenously adjusted to the behavior of relevant competitors. Hoberg and Phillips (2016) combine two central ideas: first, that the product features and bundles a firm offers can be consistently derived from SEC product descriptions and that these descriptions can be used to assign a spatial location based on product descriptions, generating a Hotelling-like product location space for these firms. Second, this study uses text analysis to build a network of firms, in which the similarity of each firm to every other firm is calculated by firm-by-firm pairwise word similarity scores using the original product descriptions. Based on these pairwise similarity scores, firms are grouped into industries and the general industry classification can be interpreted as an unrestricted network of firms. There, a firm’s competitors are analogous to a group of friends on social media, with each firm having its own distinctive set of competitors.

13 Li et al. (2014) and Ventura et al. (2015), discussed in Section 3, fall in the same category as Hoberg and Phillips (2016).

4 A case in point: using NLP to study dynamics of entrepreneurial skill demand in a large population

Complementing the (admittedly selective) broad literature review in the previous section, now we go into some depth. To exemplify the general statements made above, we offer an original analysis of a novel big data set by using various data science methods. Specifically, we study the consequences of the ongoing technological and economic developments on the demand for entrepreneurial skills.

Developing entrepreneurial skills is increasingly seen as important to foster entrepreneurship (Baumol et al. 2007). Several recent articles picked up the call and approached the topic from several disciplinary and methodological angles.14 As the purpose of this section is not to review the literature on entrepreneurial skills (for that, see the cited articles) but to exemplify data science methods, we restrict ourselves to two observations. First, there is no generally accepted delineation, let alone definition, of “entrepreneurial skills.” Second, one of the restrictions of existing studies using traditional empirical methods is the small number of available data points: the cited articles report sample sizes of 39, 523, and 1126 subjects, respectively. Consequently, it is hard to draw general, robust lessons that can be applied to different contexts than those studied. An additional characteristic of these articles, which is interrelated with sample size, is that they focus on (would-be) entrepreneurs, which does not allow statements about the importance of and demand for entrepreneurial skills in the general population. Here, we try to alleviate these constraints.

14 These include RezaeiZadeh et al. (2017), Obschonka et al. (2017b), and Rosique-Blasco et al. (2018).

The starting point is that digitization, automation, and the development of new (adaptable) technologies have an increasing impact on the labor market. The boundary between “ICT jobs” and other professions in which ICT-related skills are required is becoming increasingly blurred. Moreover, the specific skills demanded and the tasks that have to be fulfilled in all occupations have changed considerably in recent years (Spitz-Oener 2006).

These changes have led to increased demand for employees with sufficient digital skills in many countries, including the Netherlands (ROA 2017). For employees who, in the longer term, cannot acquire the necessary digital skills through training and retraining, suitable measures and career development paths are required that avoid insufficient qualifications and, eventually, unemployment. In contrast, research has shown that employees can adapt sufficiently to the changes on the labor market and that the negative effects of digitization and automation might be exaggerated, as many jobs may change but also new jobs will be created (Autor 2015; Arntz et al. 2016). Given that innovative capacity is directly related to economic growth, a lack of people with sufficient digital, technical, and ICT skills, in combination with a broader set of the so-called twenty-first century skills, limits innovative capacity (Obschonka et al. 2017b). Alternatively, abundance of these types of employees helps to mitigate the negative effects on innovative capacity and the labor market (Elliott 2017; McAfee and Brynjolfsson 2017).

What are the dynamics in skill demand on the labor market? What are the consequences for different occupations and for employees with different educational backgrounds and different levels of expertise? How do they affect certain types of professions such as managers, ICT professionals, and employees in non-IT/technical jobs? Prüfer et al. (2019) answer these questions by making use of a novel approach of “labor market analytics” in which information from online vacancies, thus from unstructured (big) Internet data, is combined with information from labor market forecasts, that is, with structured data from administrative sources.15 Thereby, an innovative and very rich source of information, as well as a unique dataset, is created, with which the authors analyze the impact of digitization and automation on the labor market in general, on specific economic sectors, on 371 different occupations, and on 3 types of professions. Prüfer et al. (2019) measure the change in skills requirements over time by taking into account digital, technical, and ICT skills compared to general cognitive and non-cognitive skills.

15 The World Economic Forum (2018) together with the Boston Consulting Group and Burning Glass Technologies used a similar approach for the US labor market. On top of their approach, Prüfer et al. (2019) analyze dynamics in the demand for skills in various professions based on vacancy data.

In an original extension to Prüfer et al. (2019), the current section derives insights on the consequences of the ongoing digitization and automation on the dynamics of entrepreneurial skills. We distinguish three types of professions: managers, ICT jobs, and non-ICT jobs. This helps to understand how the requirements have changed over time and among types of professions and, thus, not only provides insights into ongoing skills dynamics but also into the need for additional qualifications and retraining of specific groups.

This approach is not without caveats, either. Vacancy data are not necessarily representative and we do not know who applies for a certain vacancy and who is employed in the end. On the other hand, (online) vacancies give a much more fine-grained and real-time picture of labor market demand. This data source can provide information over a longer period of time, for a larger sample, and across various locations. Moreover, vacancy data are less prone to response and recall bias, which are prevalent in survey data, even more so as it is fairly expensive to place a (clearly visible and widely distributed) vacancy. Finally, vacancy data are much cheaper than other sources of information such as questionnaires within a representative sample or register data that have to be linked from multiple sources.

4.1 Data and methods

4.1.1 Data

We use data from the vacancy database Jobfeed, which is administered by TextKernel, a tech company.16 This online job portal contains more than 95% of all vacancies published on the Dutch labor market in the last 10 years. Therefore, it offers a nearly complete (and hence nearly representative) data set of online job ads in the Netherlands. Jobfeed searches the Internet for new vacancies on a daily basis and applies ML algorithms to crawl for vacancies and filter out redundancies. The data mainly contain (unstructured) text, but Jobfeed also extracts structured data such as profession, education, location, and company name.

16 See https://www.textkernel.com/hr-software/jobfeed/.

We use data for a period of 6 years, from January 2012 until December 2017, in total about 7.7 million vacancies. Most of the vacancies are written in Dutch; about 8% are in English. As long as a candidate or job description is available, we use all vacancies in our analyses; this holds for 7.32 million vacancies relating to 371 different occupations. The candidate and job descriptions contain relevant information about required skills, experience, and education. In addition, we use information gathered from multiple sources, including the Occupational Information Network (O*NET), an online database with information about the knowledge, skills, tasks, training, and experience required for a large number of occupations. Another data source is ISCO (International Standard Classification of Occupations; version ISCO-2008), a classification of 436 professions supplied by the International Labor Organization (ILO). In the ISCO-08 classification, a profession has a skill level (1 to 4) and is a combination of the nature of the work, the required training, and the required experience. Other sources for skills data we used include the EU skills framework, Stackoverflow, and Dbpedia, Wikipedia’s skills database.17

17 More information can be found on the various websites: https://www.onetonline.org/help/onet/database; http://dbpedia.org/page/Category:Skills; https://stackoverflow.com/tags?page=1&tab=popular.

4.1.2 Methods

As the collected vacancies from the Internet consist of unstructured text, we apply natural language processing (NLP) techniques. However, an initial step is data pre-processing of the vacancy texts, which helps to improve text mining results. An important step of pre-processing is the removal of stop words, such as articles and prepositions. These words often appear in the candidate and job descriptions, but do not describe skills, education, knowledge, or experience. Standard Dutch and English stop word lists exist to remove these stop words. In addition, we have identified high-frequency words that do not provide information about the profile, for instance “experience” or “knowledge,” and removed all this non-usable information from our dataset.18

18 Natural Language Toolkit in Python, https://www.nltk.org/.

Moreover, we removed structured fields in the Jobfeed database, such as e-mail addresses, telephone numbers, and links to websites, by using a so-called regular expression, a sequence of characters that defines a search pattern. Using, for instance, the popular library re (for regular expression operations) in Python allows us to match or search sequences of characters by checking if a given word/phrase is present in a text.19 Hence, it is useful for dictionary-based skill extraction. It is also useful for text cleaning operations by matching the specified sequence of characters, for example website links or e-mail addresses, which we then remove because this type of information is not relevant for our analysis and could even have negative effects (for instance, web links could be incorrectly recognized as HTML skills).

19 Regular expression: https://docs.python.org/3/library/re.html.

A final step is to normalize the text because in unstructured data words can appear in various forms, such as “required,” “require,” and “requiring.” There are also derived words with similar meaning, such as “entrepreneurial,” “entrepreneur,” and “entrepreneurship.” The purpose of text normalization is to reduce inflections (i.e., derivations) of a word into a common basic form to arrive at a single canonical form the text might not have had before. The form of text normalization that we apply is called stemming, in which the ends of words are chopped off by applying a heuristic process. To do this for Dutch text, we apply an existing algorithm.20

20 This is called the Dutch Snowball stemmer, available in the Python NLTK package.

The specific NLP tool we use for this project is the bag-of-words model. This model helps to retrieve information from an unstructured data source by representing
a text as the bag (multiset) of its words, disregarding grammar and word order, but keeping information on the frequency of each word and using it as a feature for training a classifier. To make text suitable for analysis, we transformed it into a vector of numbers that relate to the meaning of each word and how it relates to other words. We then applied a mathematical distance measure to calculate the difference (distance) between all the words in our text fragments.

After the pre-processing steps, we can finally extract all necessary information from our text data. To this end, we categorize the mentioned skills into two distinct lists: (i) digital and technical skills and (ii) other skills. Because the vacancies are partly in English, we use both Dutch and English skill labels. We also included as many different forms and expressions of skills as possible based on the frequency of words in the vacancy texts. In addition, to make the extraction process of skills more reliable and robust, the entire list of skills is normalized and divided into two parts: skills that consist of one character, one word, or an abbreviation, and a second list with skills of more than one word. For both categories, the skills are searched within one vacancy. If the exact skill is found in the text, it is counted; if a skill occurs several times within one vacancy, this counts as one. A unigram model was used for the first category. In this model, the text (candidate and job description) is fragmented word by word. The text is first partly cleaned up, for example by removing brackets and converting everything into lowercase. Also, noise related to line breaks, special characters, and white space is removed, again by using regular expressions. Splitting into words then only needs to happen on single spaces, and the words can be looked up in the list of skills.21 For the skills from the second category, skills with more than one word (the so-called bigrams or trigrams), we used regular expressions to match the skills after having done the necessary cleaning.

21 This approach helps to prevent that skills are incorrectly recognized on the basis of only part of a word or sentence, and thus avoids any spurious matching of skills. A special exception is the “Microsoft Word” skill. This skill is sometimes referred to only as “Word.” However, “word” and “Word” are also common Dutch words. Only the word “Word,” case sensitive, is recognized as a skill, with the exception of cases where it is followed by “you” or “then.” The same problem occurs with the programming language “C,” which must not be erroneously recognized when a vacancy asks for a driving license C or the Dutch nursing diploma C.

As mentioned above, there is no generally accepted definition of entrepreneurial skills. Moreover, there are semantic problems: for example, one job ad could mention “solution-oriented,” whereas another one requires applicants to be “capable of solving problems”; often multiple skills fall into the same category. Therefore, within the other skills list, we created 11 broader categories for the entrepreneurial skills reflecting frequently mentioned skills in the framework of twenty-first century skills and in the entrepreneurship literature (see Table 1).

4.1.3 Results

Figure 1 shows the ranking of entrepreneurial skills in all vacancies that require at least one entrepreneurial skill. This ranking is based on the cumulative fraction of appearance of the skills of a certain category in all vacancies. Thus, it is the total number of skills appearing in the job descriptions normalized by the total number of jobs of that year in that category. The more often the skills from a certain category are demanded in vacancies, the higher the rank of this category on our heat map (and the darker the color). In other words, this shows the (change in) total demand for the skills in the different categories.

Overall, communication skills are in highest demand in the years 2015–2017, followed closely by self-starter skills.22 Planning and organization skills, also including the project management skills “agile” and “scrum,” rank highly for managers and ICT professionals.23 Other skills categories that are more relevant in these two occupation types than in general are the well-known entrepreneurial skills collaboration and leadership. Surprisingly, creativity and flexibility are less in demand in these two occupation types than overall, although the difference is less pronounced for flexibility. In contrast, self-starter skills are ranked first for other professions; flexibility comes third, while planning and organization skills end up in fourth position.

Moving to the dynamic dimension of our study, if we look at the trend in entrepreneurial skills between 2012

22 The majority of vacancies (more than 80%) are related to other professions (including the category “unknown”), while managerial occupations account for 7% of all vacancies and ICT jobs for about 10%. Communication skills and self-starter skills do not differ much from each other in the vacancies for other professions, while self-starter skills rank only fourth for managers and ICT professions. Therefore, it is possible that communication skills are number 2 in all the three types of professions, but rank number 1 overall.

23 Scrum and agile are project management methods (http://www.mountaingoatsoftware.com/agile/scrum).
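The pre-processing steps described in Section 4.1.2 (removing links, e-mail addresses, and telephone numbers with regular expressions, then dropping stop words and uninformative high-frequency words) can be illustrated with a minimal sketch. It is not the authors' code: the stop-word lists, the patterns, and the function names `clean_text` and `remove_stop_words` are hypothetical and heavily shortened. The stemming step is left out to keep the sketch within the standard library; the paper names the Dutch Snowball stemmer from NLTK for that purpose.

```python
import re

# Hypothetical, heavily shortened stop-word list (Dutch and English mixed).
STOP_WORDS = {"de", "het", "een", "en", "van", "met", "voor", "of",
              "the", "a", "an", "and", "for", "with", "in"}
# High-frequency words that say nothing about the candidate profile.
UNINFORMATIVE = {"ervaring", "kennis", "experience", "knowledge"}
DROP = STOP_WORDS | UNINFORMATIVE

# Illustrative patterns; real vacancy texts would need more careful ones.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# At least nine digits, optionally separated by spaces, brackets, or hyphens.
PHONE_RE = re.compile(r"(?<!\w)\+?\d(?:[\s()-]*\d){8,}(?!\w)")
# Keep "+", "#", and "-" so that labels such as "c++" or "c#" survive.
NON_WORD_RE = re.compile(r"[^\w\s+#-]")


def clean_text(text):
    """Remove links, e-mail addresses, phone numbers, and punctuation."""
    for pattern in (URL_RE, EMAIL_RE, PHONE_RE):
        text = pattern.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    # Collapse line breaks and repeated white space, then lowercase.
    return re.sub(r"\s+", " ", text).strip().lower()


def remove_stop_words(cleaned_text):
    """Split on single spaces and drop stop words and symbol-only tokens."""
    return [
        token
        for token in cleaned_text.split(" ")
        if token not in DROP and any(ch.isalnum() for ch in token)
    ]
```

Removing the link before tokenizing is what keeps a URL ending in `.html` from being counted as an HTML skill, which is the failure mode the text warns about.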
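The dictionary-based skill extraction described in Section 4.1.2 (whole-word lookup for one-word skills, regular expressions for multi-word skills, each skill counted once per vacancy) could look roughly as follows. This is a sketch under stated assumptions: the two skill lists and the function name `extract_skills` are hypothetical, the input is assumed to be already cleaned and lower-cased, and the case-sensitive special handling of “Word” and “C” from footnote 21 is not reproduced.

```python
import re

# Hypothetical skill lists; the lists used in the study are far longer
# and contain both Dutch and English labels.
SINGLE_TOKEN_SKILLS = {"python", "scrum", "agile", "sql"}
MULTI_WORD_SKILLS = ["project management", "problem solving", "time management"]

# One compiled pattern per multi-word skill. The lookarounds make sure the
# phrase is not matched inside a longer word.
MULTI_WORD_PATTERNS = {
    skill: re.compile(r"(?<!\w)" + re.escape(skill) + r"(?!\w)")
    for skill in MULTI_WORD_SKILLS
}


def extract_skills(cleaned_text):
    """Return the set of skills found in one cleaned, lower-cased vacancy."""
    found = set()
    # Unigram pass: split on single spaces and look up each whole token.
    for token in cleaned_text.split(" "):
        if token in SINGLE_TOKEN_SKILLS:
            found.add(token)
    # Multi-word pass: search for bigrams and trigrams with regular expressions.
    for skill, pattern in MULTI_WORD_PATTERNS.items():
        if pattern.search(cleaned_text):
            found.add(skill)
    return found
```

Returning a set implements the rule that a skill mentioned several times in one vacancy counts only once, and looking up whole tokens avoids matching a skill on part of a word (for example "scrum" inside "scrummaster").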

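The cumulative fraction used for the rankings in Section 4.1.3 can be sketched as below. The sketch rests on one reading of the definition in the text, namely that the number of skills found per skill category is divided by the number of vacancies of the same year; the paper additionally splits by type of profession, which is omitted here. The mapping `SKILL_TO_CATEGORY` and the function name `cumulative_fractions` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from skill labels to categories of Table 1.
SKILL_TO_CATEGORY = {
    "presentation skills": "Communication",
    "writing": "Communication",
    "proactive": "Self-starter",
    "project management": "Planning and organization",
}


def cumulative_fractions(vacancies):
    """Compute skill counts per category, normalized by vacancies per year.

    vacancies: iterable of (year, set_of_skills) pairs, one per vacancy.
    Returns {year: {category: skills found / number of vacancies}}.
    """
    skill_counts = defaultdict(Counter)
    vacancy_counts = Counter()
    for year, skills in vacancies:
        vacancy_counts[year] += 1
        for skill in skills:
            category = SKILL_TO_CATEGORY.get(skill)
            if category is not None:
                skill_counts[year][category] += 1
    return {
        year: {
            category: count / vacancy_counts[year]
            for category, count in skill_counts[year].items()
        }
        for year in vacancy_counts
    }
```

Because one vacancy can list several skills from the same category, the resulting value can exceed 1, which is consistent with the cumulative fractions above 3 that the paper reports for managers.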
Table 1 Categories of entrepreneurial skills with examples
Category: Skill examples
Critical thinking: Reasoning/ability to reason, research, judgment and decision-making, critical thinking, systems analysis, systems evaluation, business analysis, business modeling, business process improvement
Creativity: Creative, innovation, originality
Collaboration: Active listening, team-oriented, participation in discussions, collaboration, ability to work together
Communication: Speaking/oral communication, writing, reporting, reading comprehension, written understanding, bilingual/multi-lingual (Dutch, German, French, English), presentation skills
Computational thinking: Mathematics, analytical, science, econometrics, statistics
Flexibility: Adapting, flexibility, ability to adjust
Leadership: Coordination, negotiation, leadership, delegating, coaching, persuasiveness, ability to lead a team/group
Self-starter: Self-motivated, initiative, proactive, entrepreneurship, inquisitive, enthusiastic, independence, curious go-getter
Problem solving: Root cause analysis, problem management, problem sensitivity, problem solving, solution-oriented, perseverance
Active learning: Active learning, learning strategies, learning assessment and evaluation, development management, eager to learn, ability to learn
Planning and organization: Time management, risk management, organization design and implementation, project management, facility management, strategic thinking, systemic thinking, change management, program management, sustainability strategy, requirement definition and management, requirement gathering, monitoring

and 2017 (Figs. 2 and 3), we observe an increase in the demand for cooperation (related to communication (by factor 1.0) and collaboration (by factor 1.4) skills) and in skills for planning and organization (by factor 1.4), self-starter (by factor 1.0), computational thinking (by factor 1.2), problem solving (by factor 1.2), and active learning (by factor 1.7). Flexibility (by factor 1.1) and leadership skills (by factor 1.0) are also in increasing demand, while the remaining skills remain more or less stable. Overall, the demand for active learning skills is rising most in this period, indicating an increasing need for employees that are intrinsically interested in achieving higher skill levels and in lifelong learning. This also highlights the repercussions from the ongoing digitization and automation, which lead to faster technological change and, therefore, impose higher demand for a highly skilled, self-managing, and continuously learning labor force.

Within the class of entrepreneurial skills, we thus find that there is an increase in the demand for communication, collaboration, computational thinking, planning and organizational, self-starter, problem solving, and active learning skills, highlighting the importance of the so-called twenty-first century skills.

Comparing the dynamics in entrepreneurial skills to the dynamics in digital skills and making a distinction between managers and non-managerial professions, we find that demand for entrepreneurial skills has increased by a factor of 1.3 for managers between 2012 and 2017 (Fig. 4) (from a cumulative fraction of 3.17 to 4.07). The demand for this type of skills has also increased slightly for non-managerial occupations (combining ICT/technical jobs and non-ICT/technical jobs). For digital skills, we find an increase of factor 1.6 for managers (from a cumulative fraction of 0.54 to 0.87) but none for the other occupation types. Prüfer et al. (2019) explain the latter result by steeply increasing demand for skills related to “digital transformation” and “big data and analytics.”

The relatively larger increase for managers’ digital skills (due to their low baseline demand in 2012) notwithstanding, Fig. 4 shows that the cumulative fraction of entrepreneurial skills demanded by managers is significantly larger than the cumulative fraction of managers’ demanded digital skills. Moreover, the absolute demand increase for managers’ entrepreneurial skills over the 5-year period studied (0.9 points) is also larger than the absolute increase for their digital skills (0.4 points).

Summarizing, we conclude that both entrepreneurial and digital skills are in increased demand for managerial positions in the Netherlands over the entire period 2012–2017. Given the hugely growing importance of datafication and our finding that, among digital skills, those on “digital transformation” and “big data and analytics” are most valued by employers, one could expect that demand for digital skills would increase most. Our empirical results, however, show the opposite: entrepreneurial skills were significantly more relevant over the six-year period studied. Moreover, the

Fig. 1 Ranking of entrepreneurial skills overall and per job type (2015-2017)

absolute importance of this skill type in managerial job vacancies has increased even more than digital skills.

5 Discussion and conclusion: opportunities and risks for researchers

The ongoing datafication, coupled with gigantic technological progress in the domain of AI, is changing all aspects of our lives: work, politics, community interactions, economic transactions, and many more. Agrawal et al. (2018, p. 194) summarize:

AI can lead to disruption because incumbent firms often have weaker economic incentives than start-ups to adopt the technology. AI-enabled products are often inferior at first because it takes time to train a prediction machine to perform as well as a hard-coded device that follows human instructions rather than learning on its own. However, once deployed, an AI can continue to learn and improve, leaving its unintelligent competitors’ products behind. It is tempting for established companies to take a wait-and-see approach, standing on the sidelines and observing the progress in AI applied to their industry. That may work for

Fig. 2 Dynamics of top entrepreneurial skills (2012-2017)

Fig. 3 Dynamics of bottom entrepreneurial skills (2012-2017)

some companies, but others may find it difficult to catch up once their competitors get ahead in the training and deployment of AI tools.

Now, substitute “researchers” for “firms”/“companies” in this quotation and “research projects” for “products.”

The disruption occurring at the economy-level is mirrored in the world of research, fueled by developments in data science methods. Distinguishing themselves from traditional statistics and econometrics, these methods use algorithmic models and treat the data mechanism as unknown in order to discover complex structures that were not specified in advance. Where conventional statistics is deductive, data science is inductive. These inductive methods facilitate the automated collection of information, especially on, but not restricted to, the Internet. Via text analysis, computers can learn to understand the meaning of words, relate them to each other, and analyze them at scales that otherwise would require the help of hordes of research assistants. The new techniques and technologies also allow the use of many more (unstructured) real-time data sources to conduct analyses that would not have been possible otherwise, for instance by using sensor data from mobile devices (Blumenstock et al. 2015). By making reliance on subjective and self-reported surveys largely unnecessary and substituting these sources with objective data on revealed preferences, they improve the accuracy, robustness and, hence, the value of entrepreneurship research.

Given that these methods are usually freely available and relatively easy to learn, data science techniques thereby contribute to a democratization of empirical research tools, where scholars or students with fewer resources have a higher chance to compete with established researchers from resource-rich countries regarding the types of research questions they can study.

However, given the current state of data science methods, they cannot completely substitute human creativity and research design skills.24 According to Agrawal et al. (2018), AI algorithms are better than humans at factoring in complex interactions among different indicators if enough data are available. If this condition does not hold, however, humans are often

24 This may be less important in hard sciences and may also change in entrepreneurship research once an artificial general intelligence is developed, which is expected to take 10–100 years (OECD 2017).
Fig. 4 Dynamics of entrepreneurial digital skills

better than machines when understanding the data generation process confers a prediction advantage. In the social sciences, data science methods appear to be especially well suited for first, inductive analyses that guide further research efforts. This occurs, for instance, by pointing researchers at relevant correlations and helping them to design better (field) experiments, to make better comparisons between more precise populations of interest, and to reveal behavior that was difficult to detect previously (Monroe et al. 2015). The inductive, data-driven approach can also point theorists at the key variables of interest for a specific question that deserve being modeled. This may alleviate the need for expert interviews or the use of small, unrepresentative surveys to obtain a first understanding of the main influence factors for a given research question. In Section 4, we showed the advantages of this approach, and the details of how to apply it to a specific question from the domain of entrepreneurship research, the demand dynamics of entrepreneurial skills. Our study, based on a dataset of 95% of all job vacancies in the Netherlands over a 6-year period with 7.7 million data points, has visualized that with data science methods we can study questions that could not have been studied on smaller, non-representative data sets. It has allowed us to state that demand for both entrepreneurial and digital skills has increased for managerial positions but that entrepreneurial skills were significantly more relevant over the entire period 2012–2017 and that the absolute importance of entrepreneurial skills has even increased more than digital skills. This finding may serve as motivation for more research on the role of entrepreneurial skills in the general population, and not only among (would-be) entrepreneurs.

Moreover, data science techniques may also reduce the risk that theorists fall victim to confirmation bias (Mahmoodi et al. 2017). Dreaming ahead, this may lead to a norm for the best theoretical researchers having to motivate their models by the results of big data analyses.

Notably, data science methods are no substitute for theoretical research or conventional statistics. They complement those established methodologies. A fruitful avenue for further research is to combine big data and ML with administrative and survey data. In all social sciences, data science techniques have been largely applied to Internet data (often by scraping and analyzing big social media data sets). Entrepreneurship research is no exception, as Section 3 has shown. However, this approach ignores both potential selection effects that are due to differences between online (social media) users and the entire population and measurement errors that are due to the unreliability of social media data as a representative measure of social phenomena. Comparing the results of a (small) representative survey with results of (big) unrepresentative data, of which the representativeness can even be assessed empirically, therefore looks like an ideal way forward for empirical research.25

Just as all technologies based on AI, data science methods come with risks. Agrawal et al. (2018) conclude their insightful book on the consequences of AI by focusing on three trade-offs. The first is productivity versus distribution. Bughin et al. (2018) note: “A key challenge is that adoption of AI could widen [performance and outcome] gaps between countries, companies, and workers.” Applied to research, data science methods can increase the number, breadth, and speed of questions we can work on, increasing our productivity. But researchers who neglect technological progress or who miss the train may feel very disadvantaged as some traditional methods may be dominated by data science techniques. Consequently, there may be a watershed moment for every researcher, where she either invests some time to familiarize herself with data science methods (for these the above-described democratization of research tools may kick in), or not (which saves time and effort in the short run but may come at significant risk for the relevance of her research in the long run).

The second trade-off is innovation versus competition. In business, the successes of Google and Facebook, both of which are highly data-driven firms that have embraced AI early, have shown that data-driven markets display first-mover advantages and are prone to market tipping. Importantly, watching the dismal fate of their competitors underlines how important it is not to fall behind.26 To some degree, data science methods could introduce a similar spiral, where those researchers who embrace them early could produce higher-quality research, which may have positive feedback effects on

25 Hal Varian (2014, p. 23), Google’s Chief Economist, comments: “A good predictive model can be better than a randomly chosen control group, which is usually thought to be the gold standard.” Stephens-Davidowitz (2017, p. 255) notes: “[E]ven a spectacularly successful Big Data organization like Facebook sometimes makes use of […] a small survey.”
26 Prüfer and Schottmüller (2017) provide more empirical details, rationalize dominant firms’ strategies, and introduce “data-driven indirect network effects” as the source of market tipping on data-driven markets.

their consecutive projects. As long as data sets from one project can be merged and, hence, be partly reused in future projects, the prediction power of those researchers’ models might outcompete latecomers repeatedly, discouraging entry of new researchers in their fields.27 The quality of top researchers’ work might be(come) stellar but the competitive supply of answers to important research questions might decrease, giving the top researchers significant opinion leadership.

The third trade-off is performance versus privacy. Using AI successfully depends on huge amounts of data because it is the very power of personalization of services and inference about an individual’s preferences and characteristics that can be made if only sufficient data about other individuals are available.28 But the benefits of aggregate data may come at individuals’ costs, especially for privacy.29 Doing research by analyzing big data sets with data science methods is subject to the same trade-off as running a firm in a data-driven market. Therefore, such research is subject to the same laws. As a direct policy response to datafication and AI, the General Data Protection Regulation (GDPR) became effective in the EU in May 2018, regulating the legal use of privacy-sensitive data, especially those relating to Internet services.30 The GDPR is already affecting researchers doing empirical research that uses data from the EU or about EU citizens.31

Crucially, the one-to-one translation of the three trade-offs listed by Agrawal et al. (2018) from the business to the research domain is subject to further scrutiny. For instance, it is unclear whether empirical research using data science methods is subject to the same indirect network effects as competition on data-driven markets (which leads to market tipping and one highly dominant firm per market).

By contrast, what is certainly true is that we as researchers need to keep up the standards of verifiability, reliability, and replicability of research results. However, this is particularly difficult when ML algorithms are used because, by definition, the algorithm is learning: it adapts based on feedback.32 Therefore, it is harder than with conventional research methods to reproduce predictions (read: results) based on ML. What is necessary, thus, is to make the decision-making processes of algorithms more transparent. This would facilitate trust in the new technologies and replicability would be easier. One option to achieve this goal is to build algorithms with an internal self-evaluation or calibration stage such that the machine can test its own certainty and report back to the researcher. One attempt in this direction is the Automatic Statistician, which was developed at Cambridge University.33 The tool is set up with funding from Google and helps researchers to analyze their datasets while also providing a report in a human-understandable form that explains what it is doing and how certain it is about its predictions. This technology is related to a recent development within ML, Automated Machine Learning (AutoML). This approach tackles the fundamental problems of accountability and verifiability. Here, ML methods and hyper-parameter settings are automatically selected and, thereby, reduce the necessity of handcrafted human interventions. Apart from substantial performance improvements, AutoML can provide evaluations of all tested methods and specifications. Thereby, it can help non-experts to effectively and reliably apply ML techniques.

In all social sciences, including entrepreneurship research, there is a lot of ground to cover.

Acknowledgments We are grateful to Freek van Gils, George Knox, and Marcia den Uijl for comments on an earlier draft and to Pradeep Kumar and Chayanin Wipusanawan for valuable research assistance. All errors are our own.

27 Our exercise in Section 4 about the dynamics of demand for entrepreneurial skills, despite all its flaws and omissions, may produce relevant intuition for this point: if it is possible to study the entire population of a country in one research project, the value of studying small sample sizes (with traditional methods) may diminish.
28 For instance, Facebook offers marketers targeting of more than 29,000 categories of users. As the firm has multidimensional data on its users, it is easy to place a given individual in one category even if some data are missing (https://www.propublica.org/article/facebook-doesnt-tell-users-everything-it-really-knows-about-them).
29 See Acquisti et al. (2016) for an overview of the privacy literature and Dengler and Prüfer (2018) for a rationalization of consumers’ privacy choices even if they have no exogenous taste for privacy.
30 Regulation (EU) 2016/679 of the European Parliament and of the European Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (http://ec.europa.eu/justice/data-protection/).
31 On the positive side, as all of us are also data subjects, not just researchers, our personalities and digital footprints are protected much better in the EU than in other jurisdictions.
32 Recently, a Google employee mentioned in personal conversation that the algorithm of Google’s search engine would be changed about 2500 times per year. While the exact number is irrelevant, the high frequency of changes, which complicates accountability for an algorithm’s results, is not.
33 See https://www.automaticstatistician.com/index/.

Appendix

Getting started yourself

Using data science for your own research does not require a PhD or other academic credentials in that field (The Economist 2018). Bishop (2011), Hastie et al. (2009), Murphy (2012), and Provost and Fawcett (2013) are excellent entry points in book form. Moreover, many high-quality online resources are available, for which good knowledge of basic linear algebra and probability theory is a big help. Useful resources to learn these methods are the video lectures of Andrew Ng at Stanford University, GitHub repositories, and Coursera, Udacity, Udemy, or edX courses. To gain practical experience with various kinds of challenging data, data science enthusiasts can try many open projects available at Kaggle. It can really help to learn fast and hone the practical skills related to data science further.

Kaggle, the biggest data science community in the world, is actually itself a crowdsourcing initiative for data science.34 Companies that need help sorting data turn to Kaggle, a new platform that leverages the data science crowd via competitions. These data scientists work to solve a company’s data questions in an attempt to win the company-sponsored financial reward or just for the pleasure of showing off. Founded in 2010, Kaggle boasts a community of “tens of thousands” experts from over 100 countries and 200 universities in any fields related to data science. The Kaggle ranking has become an essential metric in the world of data science. Some employers have begun listing a Kaggle rank as an essential qualification and Facebook uses Kaggle competitions as part of its recruiting strategy. An interview for a job as a data scientist at Facebook is the prize.

Hadoop ecosystem

The big data open-source movement potentially commenced in 2004 when Google published a paper on their in-house processing framework popularly known as MapReduce (Dean and Ghemawat 2004). Later, Yahoo released an open-source implementation based on this framework called Hadoop. Subsequently, many other frameworks and tools were released as open-source projects. As of now, there are more than 100 open-source projects for big data, a fast-growing number.

The following layered diagram (also called stacked diagram) organizes the capability or functionality of the components in the layer. In a layer diagram, a component uses the functionality of the components in the layer below it. Normally components at the same layer do not communicate.

Key points of this framework:

1. The Hadoop distributed file system (HDFS) is the foundation for many big data frameworks as it provides scalable and reliable storage.
2. Hadoop YARN provides flexible scheduling and resource management over the HDFS storage.
3. MapReduce is a programming model that simplifies parallel computing.
4. Hive and Pig are two additional programming models on top of MapReduce. Hive was created at Facebook to issue SQL-like queries using MapReduce on their data in HDFS. Pig was created at Yahoo to model data flow based programs using MapReduce.
5. Giraph was built for processing large-scale graphs efficiently. For example, Facebook uses Giraph to analyze the social graphs of its users.
6. Storm, Spark, and Flink (in-memory processing) are useful for real-time and in-memory processing of big data on top of the YARN resource scheduler and HDFS.
7. Cassandra, MongoDB, and HBase are NoSQL databases. Cassandra was created at Facebook. Facebook also used HBase for its messaging platform.
8. Zookeeper was created at Yahoo. It is a centralized management system for synchronization, configuration, and to ensure high availability of all these tools.
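To make the MapReduce programming model from key point 3 more concrete, the following single-machine sketch counts words in three explicit phases (map, shuffle, reduce). It is a plain Python illustration under simplifying assumptions: it does not use Hadoop or any of its APIs, and the function names map_phase, shuffle, reduce_phase, and word_count are hypothetical names chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big analytics"]))
```

Because each call to map_phase sees only one document and each call to reduce_phase sees only one key, both steps could in principle be distributed across many machines, which is the property a framework such as Hadoop exploits.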

34 https://www.kaggle.com/competitions.

Box A1 Technical terms and tools for big data applications and data science (a)
Application programming interfaces (APIs): A set of subroutine definitions, protocols, and tools for building application software. In general, a set of clearly defined methods of communication between various software components. An API may be for a web-based system, operating system, database system, or computer hardware. The use of open APIs has resulted in an exponential growth of user-generated data (via apps and software programs) as an API reports any of its use, e.g., for web scraping, back to the API provider. Hence, it has given a direct and indirect boost to big data and analytics.
Cassandra: Apache Cassandra is a free and open-source distributed NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
Flink: An open-source framework for processing data in both real-time mode and batch mode. It provides several benefits such as fault-tolerant and large-scale computation. Its programming model is similar to MapReduce. In contrast to MapReduce, it offers additional high-level functions such as join, filter and aggregation.
Flume: A highly distributed, reliable, robust, fault-tolerant, and configurable tool, which collects streaming data (log data) from various web servers to HDFS.
Git: Free and open-source software used as a version control system for tracking changes in computer files and coordinating work on those files among multiple people.
Hadoop: An open-source software library that establishes a framework for the distributed processing of big data using simple programming models. Hadoop has two main components: Hadoop Distributed File System (HDFS), a scalable distributed file system for storing large files over distributed machines in a reliable and efficient way; and MapReduce programming, a model to process huge data in parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner.
HBase: A column-oriented NoSQL database management system that runs on top of Hadoop Distributed File System (HDFS), similar to Google’s Bigtable, and well suited for sparse data sets, which are common in many big data use cases.
Hive: Built on top of Hadoop; allows SQL developers to read, write, and manage large datasets in distributed storage using Hive Query Language (HQL) statements similar to standard SQL.
Library: A collection of pre-written programs, scripts, or functions that can be loaded on disk for immediate use. All of the available functions within a (software) library can be used within the program body without explicitly defining them. With the help of these libraries, one can implement (complex) algorithms by writing a few lines of code.
Pig: An abstraction over MapReduce and tool/platform to analyze larger sets of data, representing them as data flows. It is generally used with Hadoop to perform the data manipulation operations in Hadoop. Pig is amenable and well suited for parallelization and, thus, provides capability to handle very large data sets.
Spark: An open-source cluster computing framework operating on distributed data collections (in-memory distributed data analysis platform; without a storage component such as Hadoop) primarily targeted at speeding up batch analysis jobs, iterative ML jobs, interactive queries, and graph processing. The Spark Big Data platform can be combined with other analytics platforms, such as Databricks, which uses the well-known programming language Scala. An extension of the core Spark API is Spark Streaming that enables scalable, high-throughput, and fault-tolerant stream processing of data.
Storm: An open-source framework for processing large structured and unstructured data in real time. Storm is a fault-tolerant framework that is suitable for real-time data analysis, ML, sequential and iterative computation. Storm is geared for real-time applications while Hadoop is effective for batch applications.
Tableau: A commercial package that can be very useful to visualize big data and to get actionable insights in a fast and efficient way.
Virtual environment software: Refers to any software, program, or system that implements, manages, and controls multiple virtual environment instances. The software is installed within an organization’s existing IT infrastructure and controlled from within the organization itself. At its core, the main purpose of (Python) virtual environments is to create an isolated environment for (Python) projects. This means that each project can have its own dependencies, regardless of what dependencies every other project has.
(a) Not all of the terms explained are mentioned in this paper. Given that these are all frequently used terms, we include them anyway as a service to the reader.

Box A2 Techniques and tools for text mining
Basic text mining: Natural Language Toolkit (NLTK) is a widely used open-source toolkit for text mining and NLP. It has several handy tools, gives access to many text corpora, and to the most suitable algorithms for such tasks. [http://www.nltk.org/]
Web scraping: BeautifulSoup is a tool to work with web-based data. It facilitates the scraping, parsing, and reading of web data, as well as data access using web APIs in different formats of data, for example in HTML, XML, and JSON formats. [https://www.crummy.com/software/BeautifulSoup/bs4/doc/]
Text classification: One of the important and typical tasks in supervised machine learning. Assigning categories to documents, which can be web pages, library books, media articles, etc., has many applications, for instance, spam filtering, e-mail routing, or sentiment analysis. Several toolkits are available for supervised text classification. Scikit-learn, an open-source machine learning library in Python, is a prominent one. [http://scikit-learn.org/stable/]
Information extraction (IE): An important task for natural language understanding and making sense of textual data. The main goal of IE is to identify and extract fields of interest from free text. It is the first step in converting the unstructured text to more structured forms. The so-called Stanford NLP is a suite of very useful NLP tools for IE. [https://nlp.stanford.edu/software/]
Semantic similarity and topic modeling: Algorithms to detect semantic similarity are used to group similar words into semantic concepts that have the same meaning, or appear to have the same meaning. For example, currency–money–coin are semantically similar. One of the resources useful for semantic similarity is WordNet, which is a semantic dictionary of words interlinked by semantic relationships (a). Topic modeling is a widely used text mining tool for discovering hidden patterns in a text body. A good topic model for example gives “school,” “university,” “college,” “teacher,” “professor” for a topic “education.”
Sentiment analysis: Opinion mining (sometimes known as sentiment analysis or emotion AI) refers to the use of NLP to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely used to analyze reviews, survey responses, and online and social media discussions. There are two ways to perform sentiment analysis: the lexicon-based approach and the machine learning approach. For both approaches, different tools and algorithms exist as well as databases of positive and negative words (b).
Linguistic inquiry and word count (LIWC): LIWC is an application of computer-based text analysis tools in psychology. It has two features: the processing component and the dictionaries. The processing feature is the program that opens a series of text files such as essays, poems, blogs, novels, and social media data and then analyzes each file word by word. Each word in a given text file is compared with the dictionary file. This tool reflects how language correlates with emotional state, social relationships, thinking styles, and individual differences. [http://liwc.wpengine.com/]
(a) WordNet was developed for English but exists for many other languages today, too. WordNet includes rich linguistic information, e.g., part of speech, different meanings of the same word, synonyms, words with same meaning, hypernyms, and hyponyms. WordNet is freely available in NLTK (http://www.nltk.org/howto/wordnet.html) or on the website of Princeton University (https://wordnet.princeton.edu/wordnet/download/). It is extensively used in many natural language processing tasks and, more broadly, in text mining tasks.
(b) For example, Liu and Hu’s opinion lexicon contains around 6800 positive and negative opinion or sentiment words for English: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon. SentiWordNet is a good lexical resource for opinion mining that assigns three sentiment scores (positivity, negativity, and objectivity) to the words in WordNet (http://sentiwordnet.isti.cnr.it/). NLTK and TextBlob are Python libraries that are frequently used for sentiment analysis based on machine learning. TextBlob is built on the top of NLTK, is more convenient than NLTK for new users, and has a lot of functionality in NLP tasks. Similar libraries are also available in R and RapidMiner.
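The lexicon-based approach to sentiment analysis mentioned in Box A2 can be illustrated with a few lines of plain Python. This is a sketch under strong simplifying assumptions: the two tiny word sets below are invented for the example and merely stand in for a real resource such as an opinion lexicon, and the function names sentiment_score and sentiment_label are hypothetical.

```python
import re

# Toy lexicon, invented for this illustration; a real analysis would load
# a published list of positive and negative words instead.
POSITIVE = {"good", "great", "excellent", "helpful"}
NEGATIVE = {"bad", "poor", "terrible", "slow"}


def sentiment_score(text):
    """Count positive minus negative lexicon words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


def sentiment_label(text):
    """Map the score to a coarse label: positive, negative, or neutral."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(sentiment_label("Great course, very helpful"))
    print(sentiment_label("The platform is slow and the support is poor"))
```

Such a word-counting rule ignores negation and context ("not good" still counts as positive), which is one reason the machine learning approach is often used alongside it.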

The Hadoop ecosystem consists of a growing number of open-source tools, providing opportunities to pick the right tool for the right tasks for better performance and lower costs. We describe some tools in further detail and recommend optimal use in the following table.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

References Operating System Design and Implementation, San

Francisco, CA: 137–150.
Dengler, S. and Prüfer, J. (2018). Consumers’ privacy choices in
Acquisti, A., Taylor, C., & Wagman, L. (2016). The economics of the era of big data. TILEC Discussion Paper No. 2018-014,
privacy. Journal of Economic Literature, 54(2), 442–492. CentER Discussion Paper No. 2018-012.
Aggarwal, R., & Singh, H. (2013). Differential influence of blogs Einav, L., & Levin, J. (2014). Economics in the age of big data.
across different stages of decision making: the case of venture Science, 346(6210), 1243089.
capitalists. MIS Quarterly, 37(4), 1033–1112. Elliott, S. (2017). Computers and the future of skill demand,
Agrawal, A., Gans, J., & Goldfarb, A. (2018). Prediction ma- educational research and innovation. Paris: OECD
chines: the simple economics of artificial intelligence. Publishing. https://doi.org/10.1787/9789264284395-en.
Cambridge: Harvard Business Review Press. George, G., Haas, M., & Pentland, A. (2014). Big data and
Alsayat, A. and El-Sayed, H. (2016), Social media analysis using management. Academy of Management Journal, 57(2),
optimized K-means clustering, Proceedings of 2016 IEEE 321–332.
14th International Conference on Software Engineering Greve, A., & Salaff, J. (2003). Social networks and entrepreneur-
Research, Management and Applications (SERA). ship. Entrepreneurship Theory and Practice, 28, 1–22.
Andrews, K. R. (1980). The concept of corporate strategy (2nd Hambrick, D. C., & Mason, P. A. (1984). Upper echelons: the
ed.). Illinois: Irwin, Homewood. organization as a reflection of its top managers. Academy of
Arntz, M, Gregory, T and U. Zierahn (2016), The risk of automa- Management Review, 9, 193–206.
tion for jobs in OECD countries: a comparative analysis, Hartmann, P. M., Zaki, M., Feldmann, N., & Neely, A. (2016).
OECD Social, Employment and Migration Working Papers, Capturing value from big data–a taxonomy of data-driven
No. 189, OECD Publishing: Paris. business models used by start-up firms. International Journal
Autor, D. (2015). Why are there still so many jobs? The history of Operations & Production Management, 36(10), 1382–
and future of workplace automation. Journal of Economic 1406.
Perspectives, 29(3), 3–30. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of
Bandiera, O,. Prat, A., Hanse, S. and R. Sadun (2017). CEO statistical learning: data mining, inference, and prediction.
behavior and firm performance, Harvard Business School New York: Springer.
Working Paper 17–083. Hisrich, R., Langan-Fox, J., & Grant, S. (2007). Entrepreneurship
Baumol, W. J., Litan, R. E., & Schramm, C. J. (2007). Good research and practice: a call to action for psychology.
capitalism, bad capitalism, and the economics of growth American Psychologist, 62(6), 575–589.
and prosperity. New Haven: Yale University Press. Hoberg, G., & Phillips, G. (2016). Text-based network industries
Bishop, C. (2011). Pattern recognition and machine learning. and endogenous product differentiation. Journal of Political
New York: Springer. Economy, 124(5), 1423–1465.
Blumenstock, J., Cadamuro, G., & On, R. (2015). Predicting Hoornaert, S., Ballings, M., Malthouse, E. C., & Van den Poel, D.
poverty and wealth from mobile phone metadata. Science, (2017). Identifying new product ideas: waiting for the
350, 1073–1076. wisdom of the crowd or screening ideas in real time.
Breiman, L. (2001). Statistical modeling: the two cultures. Journal of Product Innovation Management, 34(5), 580–
Statistical Science, 16(3), 199–231. 597.
Bughin, J., Seong, J. Manyika, J., Chui, M., and R. Joshi (2018). Kotter, J. P. (1999). John Kotter on what leaders really do. Boston:
Notes from the AI frontier: modeling the impact of AI on the Harvard Business School Press.
world economy. Discussion Paper McKinsey Global Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A.,
Institute. Brewer, D., Christakis, N., Contractor, N., Fowler, J.,
Calvano, E., Calzolari, G., Denicolo, V., & Pastorello, S. (2018). Gutmann, M., Jebara, T., King, G., Macy, M., Roy, D., &
Artificial intelligence, algorithmic pricing and collusion. Van Alstyne, M. (2009). Computational social science.
Mimeo: University of Bologna. Science, 323(5915), 721–723.
Cardon, M. S., Foo, M. D., Shepherd, D., & Wiklund, J. (2012). Lee, J., Hwang, B., & Che, H. (2017). Are founder CEOs more
Exploring the heart: entrepreneurial emotion is a hot topic. overconfident than professional CEOs? Evidence from S&P
Entrepreneurship Theory and Practice, 36(1), 1–10. 1500 companies. Strategic Management Journal, 38, 751–
Cogburn, D. and Hine, M. (2017). Introduction to text mining in 769.
big data analytics. Proceedings of the 50th Hawaii Li, G., Lai, R., D’Amour, A., Doolin, D., Sun, Y., Torvik, V., Yu,
International Conference on System Sciences, HICSS 2017. A., & Fleming, L. (2014). Disambiguation and co-authorship
Courtney, C., Dutta, S., & Li, Y. (2017). Resolving information networks of the U.S. patent inventor database (1975–2010).
asymmetry: signaling, endorsement, and crowdfunding suc- Research Policy, 43(6), 941–955.
cess. Entrepreneurship Theory and Practice, 41(2), 265–290. Mahmoodi, J., Leckelt, M., Van Zalk, M., Geukes, K., & Black,
Cummings, M. E., Rawhouser, H., Vismara, S., Hamilton, E. L. M. (2017). Big data approaches in social and behavioral
(2019). An Equity Crowdfunding Research Agenda: science: four key trade-offs and a call for integration.
Evidence from Stakeholder Participation in the Rulemaking Current Opinion in Behavioral Sciences, 18, 57–62.
Process. Small Business Economics 1–26. https://doi. Mayer-Schönberger, V., & Ramge, T. (2018). Reinventing capital-
org/10.1007/s11187-018-00134-5. ism in the age of big data. London: John Murray.
Dean, J. and Ghemawat, S. (2004), MapReduce: simplified data McAfee, A., & Brynjolfsson, E. (2017). Machine–platform–
processing on large clusters, OSDI'04: Sixth Symposium on crowd: harnessing our digital future. New York: Norton.
J. Prüfer, P. Prüfer

Mintzberg, H. (1973). The nature of managerial work. New York: approach. Journal of Business and Economic Studies, 7(2),
Harper and Row. 64–79.
Monroe, B. L., Pan, J., Roberts, M. E., Sen, M., & Sinclair, B. Samuel, A. (1959). Some studies in machine learning using the
(2015). No! Formal theory, causal inference, and big data are game of checkers. IBM Journal, 3(3), 535–554.
not contradictory trends in political science. PS-Political Shane, S. (2012). Reflections on the 2010 AMR decade award:
Science and Politics, 48(1), 71–74. delivering on the promise of entrepreneurship as a field of
Mullainathan, S., & Spiess, J. (2017). Machine learning: an ap- research. Academy of Management Review, 37(1), 10–20.
plied econometric approach. Journal of Economic Shane, S., & Venkataraman, S. (2000). The promise of entrepre-
Perspectives, 31(2), 87–106. neurship as a field of research. Academy of Management
Murphy, K. (2012). Machine learning: a probabilistic perspective. Review, 25, 217–226.
Cambridge: MIT Press. Sorenson, O. (2018). Social networks and the geography of entre-
Obschonka, M. and Fisch, C. (2018), Entrepreneurial personalities preneurship. Small Business Economics, 51(3), 527–537.
in political leadership. Small Business Economics 50(4),
Spitz-Oener, A. (2006). Technical change, job tasks, and rising
851–869.
educational demands: looking outside the wage structure.
Obschonka, M., Fisch, C., & Boyd, R. (2017a). Using digital
Journal of Labor Economics, 24, 235–270.
footprints in entrepreneurship research: a twitter-based per-
sonality analysis of superstar entrepreneurs and managers. Stephenson-Davidowitz, S. (2017). Everybody lies—big data, new
Journal of Business Venturing Insights, 8, 13–23. data, and what the internet can tell us about who we really
Obschonka, M., Hakkarainen, K., Lonka, K., & Salmela-Aro, K. are. New York: Harper Collins.
(2017b). Entrepreneurship as a twenty-first century skill: Stuart, R., & Abetti, P. A. (1987). Start-up ventures: towards the
entrepreneurial alertness and intention in the transition to prediction of initial success. Journal of Business Venturing,
adulthood. Small Business Economics, 48, 487–501. 2(3), 215–230.
OECD. (2017). OECD digital economy outlook 2017, ch.7. Paris: Taddy, M. (2018). The technological elements of artificial intelli-
OECD Publishing. gence. NBER Working Paper No. 24301.
Petriglieri, G., Ashford, S.J. and A. Wrzesniewski (2018). Tan, S., & Koh, H. C. (1996). Modelling entrepreneurial inclina-
Thriving in the gig economy, Harvard Business Review, tion with an artificial neural network. Journal of Small
March–April: 140–143. Business & Entrepreneurship, 13(2), 14–24.
Provost, F., & Fawcett, T. (2013). Data science for business. Tata, A., Martinez, D., Garcia, D., Oesch, A., & Brusoni, S.
Sebastopol: O’Reilly. (2017). The psycholinguistics of entrepreneurship. Journal
Prüfer, J., & Prüfer, P. (2018). Data science for institutional and of Business Venturing Insights, 7, 38–44.
organizational economics. In C. Ménard & M. M. Shirley Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological
(Eds.), A research agenda for new institutional economics meaning of words: LIWC and computerized text analysis
(pp. 248–259). Cheltenham: Edward Elgar Publishers. methods. Journal of Language and Social Psychology,
Prüfer, J. and Schottmüller, C. (2017) Competing with big data. 29(1), 24–54.
CentER Discussion Paper No. 2017–007. The Economist (2018). No PhD, no problem—new schemes teach
P r ü f e r , P. , K u m a r , P. , & d e n U i j l , M . ( 2 0 1 9 ) . the masses to build AI, October 25, San Francisco.
Arbeidsmarktonderzoek Digitalisering in Topsectoren, Vachelard, J., Gambarra-Soares, T., Augustini, G., Riul, P., &
Mimeo. Tilburg: CentERdata. Maracaja-Coutinho, V. (2016). A guide to scientific
RezaeiZadeh, M., Hogan, M., O’Reilly, J., Cunningham, J., & crowdfunding. PLoS Biology, 14(2), 1–7.
Murphy, E. (2017). Core entrepreneurial competencies and Varian, H. (2014). Big data: new tricks for econometrics. Journal
their interdependencies: insights from a study of Irish and of Economic Perspectives, 28(2), 3–27.
Iranian entrepreneurs, university students and academics.
Ventura, S., Nugent, R., & Fuchs, E. (2015). Seeing the non-stars:
International Entrepreneurship and Management Journal,
(some) sources of bias in past disambiguation approaches and
13, 35–73.
a new public tool leveraging labeled records. Research
Rickne, A., Ruef, M., & Wennberg, K. (2018). The socially and
Policy, 44(9), 1672–1701.
spatially bounded relationships of entrepreneurial activity:
Olav Sorenson—recipient of the 2018 Global Award for Wang, F., Mack, E., & Maciewjewski, R. (2017). Analyzing
Entrepreneurship Research. Small Business Economics, entrepreneurial social networks with big data. Annals of the
51(3), 515–525. American Association of Geographers, 107(1), 130–150.
ROA (2017). De Arbeidsmarkt naar Opleiding en Beroep tot World Economic Forum. (2018). Towards a reskilling
2022, ROA-R-2017/10, Maastricht. revolution—a future of jobs for all. Zwitserland: Davos.
Rosique-Blasco, M., Madrid-Guijarro, A., & García-Pérez-de-
Lema, D. (2018). The effects of personal abilities and self- Publisher’s note Springer Nature remains neutral with regard to
efficacy on entrepreneurial intentions. International jurisdictional claims in published maps and institutional
Entrepreneurship and Management Journal, 14, 1025–1052. affiliations.
Rutherford, M., McMullen, P., & Oswald, S. (2001). Examining
the issue of size and the small business: a self organizing map