Technology and Cryptocurrency Valuation
February 2022
Abstract
This paper studies whether technology aspects of cryptocurrencies drive their valuation. We fo-
cus on the initial coin offering (ICO) market which allows us to both measure the technological
sophistication of cryptocurrencies and observe their subsequent successes and valuations. Us-
ing various machine learning methods, we construct technology indexes from ICO whitepapers
to capture technological sophistication for all cryptocurrencies. We find that the cryptocurren-
cies with high technology indexes are more likely to succeed and less likely to be delisted
subsequently. Moreover, the technology indexes strongly and positively predict the long-run
performances of the ICOs. Overall, the results suggest that technological sophistication is an
important determinant of cryptocurrency valuations.
* Yukun Liu is with Simon School of Business at University of Rochester. Jinfei Sheng (Corresponding Author)
is with Merage School of Business at University of California Irvine (Email: [email protected]). Wanyi Wang is
with Merage School of Business at University of California Irvine. For helpful comments, we thank David Hirshleifer,
Chong Huang, Arthur Iinuma, Alan Kwan, Jongsub Lee (discussant), Ye Li, Chuchu Liang, Fangzhou Lu (discus-
sant), Evgeny Lyandres (discussant), Feng Mai, Asaf Manela, Daniel Rabetti (discussant), Amin Shams, Yushui Shi,
Donghwa Shin (discussant), Siew Hong Teoh, Aleh Tsyvinski, David Yang, Lu Zheng, Chenqi Zhu, and conference
and seminar participants at University of California Irvine, 2019 Conference on Financial Economics and Account-
ing at New York University, 2020 American Finance Association Conference Poster Session, 2020 CAFR Research
Workshop on FinTech, 2nd Future of Financial Information Conference, 4th Shanghai-Edinburgh Fintech Conference,
Miami Research Conference on Machine Learning and Business, UWA Blockchain and Cryptocurrency Conference,
and 2021 Global AI Finance Research Conference. All errors are our own.
1 Introduction
The rise of blockchain technology is one of the most critical innovations in recent decades.
An early application of blockchain technology that has received much attention is cryptocurrency,
which has experienced exponential growth since the debut of Bitcoin in 2009. To date, there are
over 10,000 cryptocurrencies with a market capitalization of over 1.5 trillion dollars.1 The rapid
growth of the cryptocurrency market has sparked extensive debate among practitioners, policy makers,
and academics. On the one hand, there are concerns about whether cryptocurrencies have any fundamental
value. Speculation, price manipulation, and fraud in this space are prevalent.
For example, Satis—a security token advisory firm—claims that over 80 percent of initial coin
offerings (ICOs) in 2017 were scams.2 Recent papers also find evidence of price and volume
manipulations in the cryptocurrency market (Griffin and Shams, 2020; Cong et al., 2020). On
the other hand, many cryptocurrency investors believe that blockchain technology is an important
innovation and has intrinsic value, and cryptocurrencies represent a stake in the future of this
technology.
For companies, we typically use dividends, earnings, or book value to measure their funda-
mental value. However, cryptocurrencies do not distribute dividends, and there is no traditional
accounting information readily available. Cryptocurrencies are also different from fiat currencies
in the sense that their value is not backed by any government. Therefore, it is difficult to evaluate
the fundamental value of cryptocurrencies in the traditional framework. Recent theories of
cryptocurrency address these differences by emphasizing the role of technological sophistication in
determining the viability and valuation of coins (see, e.g., Fanti et al., 2019; Irresberger et al., 2020;
Iyengar et al., 2020). In this paper, we develop technology indexes to measure the technology
sophistication of individual cryptocurrencies and study whether investors value the technology
aspects of cryptocurrencies.
1 This is based on information from coinmarketcap.com.
2 For the full report, please see: https://research.bloomberg.com/pub/res/d28giW28tf6G7T_Wr77aU0gDgFQ.
Measuring the technology sophistication of cryptocurrencies is challenging because of the limited
information disclosed. The only widely adopted disclosures for cryptocurrencies are their
whitepapers during initial coin offerings (ICOs). To attract funding, developers need to carefully
describe all aspects of the initial coin offerings, especially the blockchain technology employed.
This feature of whitepapers gives us a unique opportunity to evaluate the technological sophistication
of coins at the individual level. Following the ICOs, we can also observe the outcome
and valuation of the cryptocurrencies. Therefore, the ICO market is an ideal laboratory to study
investors’ valuation of cryptocurrency technologies.
To measure the technological components employed in the cryptocurrencies, we use textual
analysis to analyze the content of whitepapers. In particular, we use both supervised machine
learning and unsupervised machine learning methods (i.e., word embedding and Latent Dirich-
let Allocation (LDA)) to construct three measures of technological sophistication (Technology
indexes) from a comprehensive database of ICO whitepapers. The supervised machine learning
method we employ is a top-down approach that closely imitates the way investors assess ICOs. The
unsupervised machine learning methods are bottom-up approaches to study the textual elements
of whitepapers. The advantage of the unsupervised methods is that they require little human input.
We also construct a composite index as the fourth technology index, which is the average of the
three indexes mentioned above. We study the determinants of the tech indexes using various cryptocurrency
characteristics. We find that cryptocurrencies that merely use the Ethereum blockchain,
have less GitHub activity, have ambiguous whitepapers, and have less reliable teams tend to
have lower tech indexes. However, the R-squared is only 0.136 when we use all the cryptocurrency
characteristics, suggesting that the majority of the variation in the tech indexes is not captured by
these characteristics.
To understand the role of technological sophistication in cryptocurrency pricing, we start by
studying the relationship between the technology indexes and ICO successes. We first examine
whether the technology indexes are related to ICO fundraising. If the entrepreneurs cannot raise
any funding, the ICO is not likely to succeed, so the ability to raise funding is one of the most important
steps in a successful ICO. If ICO performance were fully driven by speculation, investors
would not care about the technology associated with the ICOs. Under this hypothesis, the technol-
ogy indexes would not predict ICO successes. However, we find that ICOs with high technology
indexes are more likely to raise capital and more likely to be traded in the secondary market subsequently.
The economic magnitude of the effect is large. For instance, a one standard deviation
increase in the composite technology index is associated with a 10.4 percentage point increase in the listing
probability, which is 40.1 percent of the sample average. The results suggest that investors take
the underlying technology of the ICOs into consideration.
Next, we investigate whether the underlying technology of ICOs is associated with subsequent
performances. The process to fully incorporate technology-related information may take months
due to the complexity of blockchain technology. To test this conjecture, we examine the relation-
ship between the technology indexes and the long-run performance of ICOs. We measure long-run
performance using cumulative post-ICO returns, abnormal returns, and liquidity measures. We
find that the ICOs with higher technology indexes tend to have better performance in the long run
compared to other ICOs. A one standard deviation increase in the composite index is associated
with a 23.9 percent increase in cumulative returns at the 300-day horizon.
We also investigate whether our indexes help understand ICO failure measured by delisting.
We find that the ICOs with higher technology indexes are less likely to be delisted subsequently.
The economic magnitude of the effect is also large. For instance, a one standard deviation increase
in the composite technology index leads to a 2.52 percent decrease in delisting probability.
So far, we have shown that the technology indexes strongly and positively predict ICO suc-
cesses and subsequent performances. We argue that the results are consistent with the notion that
investors care about the technological sophistication of the cryptocurrencies, but it takes time for
the market to incorporate the information, leading to predictable returns. We present additional
evidence in support of the delayed reaction mechanism and attempt to rule out potential alternative
explanations. An implication of the delayed reaction mechanism is that investors should be able to
quickly incorporate the fundamental information if the whitepapers are written clearly. Consistent
with the implication, we show that among the whitepapers with better readability, the long-horizon
predictive power of the technology indexes is weaker. We also find that there is no return reversal
phenomenon, suggesting that the return predictability results are unlikely to be driven by investor
overreactions.
Overall, these results suggest that the underlying technology is an important determinant of
cryptocurrency prices, and support the argument that investors do take the technological components
in the ICO whitepapers into consideration. However, it takes time for investors to
fully differentiate the fundamentally sound ICOs from the others. The delayed reaction from investors
may be caused by investor inattention and the complex nature of the technologies, both of
which necessitate more time to process related information.
This paper contributes to the fast-growing literature on the economics of ICOs and digital assets
in general. Yermack (2017) is the first paper to explore the financial implications of blockchain.
Liu and Tsyvinski (2018) provide one of the first comprehensive analyses of the risk-return tradeoff
of cryptocurrencies. Liu et al. (2019) examine the cross-section of cryptocurrency and establish
a cryptocurrency three-factor model. Recently, several theoretical papers examine the rationale
and mechanisms of ICOs and cryptocurrencies (Cong and He, 2019; Cong et al., 2019; Catalini
and Gans, 2018; Sockin and Xiong, 2018). Our paper is closely related to Cong et al. (2019)
and Sockin and Xiong (2018), which argue that the value of cryptocurrency is fundamentally
anchored by the underlying utility value. In other words, their models predict that coins have
fundamental values and the fundamental values are crucial for performance. However, there is
little evidence showing the importance of the fundamental values of coins because it is hard to
measure that empirically. A set of empirical papers study factors that contribute to ICO success,
including Howell et al. (2020), Deng et al. (2018), and Lee et al. (2019). Lyandres et al. (2020)
study the determinants of ICO successes and performances, and overturn some existing findings
in the literature. In general, they find that social media and team play a significant role in ICO success
and performance. Although some prior papers touch on whitepapers (e.g., Dittmar and Wu,
2019 and Florysiak and Schandlbauer, 2019), our paper is the first that tries to measure
the technological sophistication of cryptocurrencies using various machine learning methods and
account for the relationship between whitepapers and ICO short- and long-run performance.3 Our
tech indexes appear to play a significant role in explaining ICO success and short- and long-horizon
performance, all of which are not well understood in the literature.
This paper provides support to the theoretical literature that links the technological advances
of blockchain to the fundamentals and valuations of cryptocurrencies. Budish (2018), Abadi and
Brunnermeier (2018), and Hinzen et al. (2019) discuss the limitations of proof-of-work technologies
and their pricing implications. Fanti et al. (2019) show the pricing implications of
proof-of-stake. Consistent with the theoretical implications of the literature, our paper shows that
the technological components affect the valuations of cryptocurrencies.
Our study also adds to the literature on machine learning and textual analysis in finance.4 The
application of machine learning in finance is a new and growing literature. Existing studies use
machine learning methods to construct text-based uncertainty (Manela and Moreira, 2017), predict
stock returns (Gu et al., 2020), measure corporate culture (Li et al., 2019), and analyze online
reviews (e.g., Sheng, 2019). Buehlmaier and Whited (2018) measure firms’ financial constraints
using textual analysis of firms’ annual reports. Kelly et al. (2018) use textual analysis to construct
indicators of patent quality. Recently, Bybee et al. (2020) use machine learning to measure the
state of the economy via textual analysis of business news. To the best of our knowledge, this paper is
the first to use machine learning methods to conduct textual analyses of cryptocurrencies.
Our paper employs both supervised and unsupervised machine learning methods, which allows us
to draw reliable inferences.
The rest of the paper is organized as follows. Section 2 explains the background of ICOs and
the data we use. Section 3 introduces the construction and validation of the technology indexes.
Section 4 describes our main empirical results and Section 5 documents the subsample and addi-
tional results. We conclude and discuss implications for policy in Section 6.
3 Some studies also look at the text of social media about cryptocurrencies. For example, Shams (2019) uses text
from Reddit to measure the connectivity among cryptocurrencies.
4 See Tetlock (2014) and Gentzkow et al. (2019) for reviews on textual analysis. Textual analysis includes both
machine learning methods and other methods, such as word count. For recent studies using the word count method,
please see Liu and Matthies (2018) and Fisher et al. (2020).
2 Background and Data
In a typical ICO, entrepreneurs issue digital assets (“tokens”) that are implemented on a blockchain
or a contract to deliver such tokens in the future (e.g., a Simple Agreement for Future Tokens, or
SAFT). Entrepreneurs then use the raised capital to create an online platform or ecosystem where
the native token can be used.
In general, these tokens can be classified into three types based on their purposes. The first
type is called “utility token” because its purpose is to redeem a product or service in the future—
this is the largest group of tokens. The second type is called “security token”, which is similar
to conventional securities but recorded and exchanged on a blockchain to reduce transaction costs
and create a record of ownership. This type of token gives holders rights to the associated cash
flows, such as dividends. The third type is called “asset token”, which serves as a general-purpose
medium of exchange and store of value. These are often termed “coins”, such as Bitcoin.
Initial coin offerings are appealing to both start-up companies and investors. The start-up companies
that choose to issue ICOs are usually those that "conventionally finance themselves with
angel or venture capital (VC) investment" (see Howell et al., 2020). ICOs are attractive to these
start-up companies because ICOs allow them to avoid SEC regulations and intermediaries
such as venture capitalists and banks, leading to lower financing costs and easier access to capital.
Investors participate in ICOs for various reasons. Some investors may believe in the intrinsic value
of the project and are optimistic about the technological innovations embedded therein. Other
investors may be speculators who are attracted by the quick cash-out ability.
The first ICO was issued by Mastercoin in July 2013. In 2014, Ethereum also launched a token
sale and raised over $15 million to support its development. In 2017, ICOs became popular,
and 875 startups successfully raised capital using token sales during the year. As of February 2019,
ICOs had raised over 25 billion USD.5
As a new source for seed and early-stage funding, ICOs raise money from many small investors
over the Internet. In that sense, the ICOs are similar to crowdfunding, where investors get future
rewards or deals on products and get securities for exchange. However, ICOs are different from
crowdfunding in that they are blockchain-based and involve more advanced technology for their
products and services. ICOs are also similar to initial public offerings (IPOs) in the sense that
tokens can be listed on one or more cryptocurrency exchanges, so investors can benefit from the
price appreciation of a listed token even before the project launches. This process is usually much
faster than that of IPOs. The whole process ranges from several days to several months, but there
is no guarantee of listing.
Our dataset consists of three different components: ICO characteristics from trackico.com,
daily trading data from coinmarketcap.com, and textual measures from ICO whitepapers. There
are over 4,100 ICOs on trackico.com, with 2,452 closed, 575 trading, 264 ongoing, 82 pre-sale,
307 upcoming and 422 unknown. We focus on ICOs between January 2017 and December 2018.
The final sample consists of 2,916 ICOs, which raised more than $17 billion in total. For each
ICO, we collect the following information: ICO start and end date, ICO price, total capital raised,
trading status, pre-ICO, bonus, platform, accepted currency, the founder team, country, industry,
links of whitepapers, official website, GitHub and Twitter.
We define two measures of ICO success. The first one is “Trading”, a self-reported indicator
variable by fundraisers to trackico.com, indicating whether the token is trading on cryptocurrency
exchanges. The second one is “Success”, which equals 1 if an ICO successfully raised any
capital (Benedetti and Kostovetsky, 2018). Other ICO characteristics serve as control variables.
“ICO length” is the number of days between the start and end of an ICO. “ICO price” is the cost
per token in US dollars. “Total Raised” is the amount of money raised in millions of US dollars.
5 Source: https://icobench.com/.
“Pre ICO”, “Bonus”, “Ethereum Based” and “Accept BTC” are indicator variables for whether
the ICO has a pre-ICO, offers a bonus to investors, is built on the Ethereum platform and accepts Bitcoin
as a payment currency, respectively. “Team size” is calculated as the number of team members.
We define “Has GitHub” and “Has Twitter” to be indicator variables of whether the fundraiser has
a GitHub or a Twitter homepage. We further control for Bitcoin price on the ICO start date or the
coin’s listing day as a proxy for the market sentiment. Finally, we control for quarterly, categorical
and geographical (continent-level) fixed effects.
Next, we merge ICO data with information from coinmarketcap.com, the leading information
source of cryptocurrency trading data, which is also a primary information source in the ICO literature.
By the end of 2018, coinmarketcap.com had provided data for over 3,600 cryptocurrencies,
among which 2,070 were active while 1,583 had been delisted. We collect daily opening price and 24h
dollar trading volume on all coins from August 2013 to December 2018. We then use token names,
ticker symbols, and website slugs to merge these variables with our ICO data. Since many coins
on coinmarketcap.com were not issued through ICO, and many ICOs do not list their coins on any
exchange, we get a merged sample of 765 observations.
With the merged sample, we first define a third ICO success measure, “CMC Trading”, which
equals one if the coin has ever appeared on coinmarketcap.com. This measure captures
the same fact (i.e., whether the coin is traded on an exchange) as the self-reported
measure “Trading”, but is more comprehensive.6 Therefore, we use “CMC Trading” in our main
analysis and consider the other measures in the robustness tests. We define “First Open/ICO Price”
to measure the premium on the listing day and “Delist” to characterize whether the coin is delisted
from cryptocurrency exchanges. We also calculate the cumulative rate of return, Bitcoin-adjusted
rate of return and 24h trading volume after the coin has been listed for 7 days, 30 days, 90 days,
180 days, 240 days and 300 days. These measures capture the short- and long-term performance
and liquidity of cryptocurrencies.
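The performance measures above can be illustrated with a minimal sketch. It assumes a list of daily opening prices whose first element is the listing-day open. The Bitcoin adjustment shown here (coin return minus Bitcoin return over the same window) is our assumption for illustration, since the exact adjustment is not spelled out in this section, and all function names are hypothetical.

```python
def cumulative_return(opens, horizon):
    """Return from the listing-day open (index 0) to the open `horizon` days later."""
    if horizon >= len(opens):
        raise ValueError("not enough price history for this horizon")
    return opens[horizon] / opens[0] - 1.0


def btc_adjusted_return(coin_opens, btc_opens, horizon):
    # Assumption: adjustment = coin return minus Bitcoin return over the same window.
    return cumulative_return(coin_opens, horizon) - cumulative_return(btc_opens, horizon)


def first_open_to_ico_price(first_open, ico_price):
    """Listing-day premium measure: first opening price relative to the ICO price."""
    return first_open / ico_price
```

For example, a coin that opens at 2.0 on the listing day and at 3.0 seven days later has a 7-day cumulative return of 50 percent under this definition.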
6 The correlation between CMC Trading and Trading is 75.8%. Trading is highly accurate if it equals 1, but is
not comprehensive, as we identified approximately 200 more trading tokens on coinmarketcap.com.
The last set of variables comes from textual analysis of ICO whitepapers, which are downloaded
from trackico.com. We obtained 1,629 valid whitepapers in PDF format. In Table OA.4,
we list all other variations of whitepaper status. Next, we convert PDF files into TXT format,
which can be used as the raw input for textual analysis.
Using this whitepaper corpus, we first construct our main measures of technology indexes,
which we explain in detail in Section 3. Moreover, we consider three well-known textual measures
as control variables: Readability, Tone, and Uncertainty. “Readability” is characterized by the
Fog Index, a widely adopted measure in the finance and accounting literature. Developed by Robert
Gunning in 1952, the Fog Index is a linear combination of the percentage of complex words and the
average number of words per sentence.7 “Tone” is the difference between positive and negative
words divided by the total number of words, and “Uncertainty” is the percentage of uncertainty
words among all words used in a whitepaper. All lexical categories are defined in Loughran and
McDonald (2011).
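As a rough illustration of the three control measures, the sketch below computes them for a single text. The Fog Index follows Gunning's standard formula, 0.4 × (average words per sentence + percentage of complex words), where complex words are those with three or more syllables. The syllable counter is a crude vowel-group heuristic, and the tiny word lists are hypothetical stand-ins for the Loughran and McDonald (2011) categories used in the paper.

```python
import re

# Hypothetical mini word lists; the paper uses the Loughran-McDonald (2011) categories.
POSITIVE = {"efficient", "innovative", "secure"}
NEGATIVE = {"failure", "loss", "risk"}
UNCERTAIN = {"may", "might", "possibly", "uncertain"}


def count_syllables(word):
    """Crude heuristic: number of groups of consecutive vowels (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_measures(text):
    """Return Fog Index, Tone and Uncertainty for a non-empty text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text.lower())
    n = len(words)
    complex_share = sum(count_syllables(w) >= 3 for w in words) / n
    fog = 0.4 * (n / len(sentences) + 100 * complex_share)
    tone = (sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)) / n
    uncertainty = 100 * sum(w in UNCERTAIN for w in words) / n
    return {"fog": fog, "tone": tone, "uncertainty": uncertainty}
```

A higher Fog Index indicates text that is harder to read, so a whitepaper with long sentences and many multi-syllable words scores high on this measure.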
We report the summary statistics of the sample characteristics in Table 1. Panel A of Table
1 presents summary statistics on variables related to ICO characteristics. On average, it takes 51
days to complete an ICO with a team of 11 people. 18% of the ICOs are self-reported as trading
and 38% have non-zero values of capital raised. Moreover, 60% have a GitHub homepage for their
project and over 90% have set up their Twitter accounts.
Panel B of Table 1 presents summary statistics on the merged sample. Consistent with the
literature, we identify that 26% of ICOs have listed tokens on an exchange at some point in time.
Among these listed cryptocurrencies, only 10% are delisted while the remaining 90% are still
active. On average, investing in a cryptocurrency during an ICO can earn a premium of 120% on
the first trading day, indicating a large amount of first-day price reaction. Moreover, the return
of cryptocurrency investment increases as time goes by, from 19% during a 7-day holding period
to 151% during a 300-day holding period. The 24h trading volume fluctuates with different time
spans, varying from $1.5 million to $2.78 million. ICO characteristics with respect to the merged
subsample are also reported in this panel.
In this section, we discuss how we measure the technological sophistication for the cryptocur-
rencies based on their whitepapers. We first present how we construct the technology indexes using
different machine learning methods, and then we validate these measures.
We use several machine learning techniques to capture the technological components of the
whitepapers, including both supervised and unsupervised methods. We first construct a super-
vised machine learning index. We mimic the way investors evaluate whitepapers and manually
assign scores to 200 whitepapers, which we use as the training set. Then, we use a supervised
machine learning algorithm, train it on the training sample, and extrapolate scores to the remaining
whitepapers. The supervised machine learning method we employ is a top-down approach that
closely imitates the way investors assess ICOs.
Additionally, we use two different techniques in the unsupervised machine learning literature
to measure the technological aspect of the whitepapers: word embedding along with K-means
clustering and Latent Dirichlet Allocation topic modeling approach. The unsupervised machine
learning methods are bottom-up approaches to study the textual elements of whitepapers. One
important advantage of unsupervised machine learning methods is that they require little human
input. In other words, they do not require researchers to have good prior knowledge about what
type of words they are looking for in the texts. Below, we briefly summarize the three methods and
their estimations, and refer interested readers to the original papers for details.
Supervised Machine Learning
First, we use supervised machine learning methods to construct a technology index. Supervised
machine learning methods learn from a training set in which both the input and the output are
known. To construct the training sample, we read through 200 randomly selected whitepapers and
give a score from 1 to 4 based on their technical sophistication. The process closely imitates the
way investors evaluate the whitepapers. All the whitepapers emphasize the use of blockchain and
related technologies. Thus, these projects either employ more advanced blockchain technology
or apply existing blockchain technology to different areas. The readers assign a high score (e.g.,
3 or 4) to a whitepaper when they think the ICO project involves more advanced and convincing
technology. For example, Filecoin uses a novel class of Proof-of-Storage schemes called Proof-of-Replication,
and receives an average score of 4. Then, we preprocess the training
set. We form all two-word phrases in the corpus, remove unigrams and bigrams that appear in
fewer than ten documents, and convert the corpus to a document-term matrix. The final training set
consists of 200 documents and 20,586 unique terms.
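The preprocessing step can be sketched as follows. This is a minimal illustration rather than our actual pipeline: it assumes the documents are already tokenized into lists of lowercase words, and the function name and the `min_docs` parameter are hypothetical.

```python
from collections import Counter


def build_document_term_matrix(docs, min_docs=10):
    """docs: list of token lists. Returns (vocabulary, matrix of term counts)."""
    doc_counts = []
    for tokens in docs:
        # Bigrams are adjacent two-word phrases joined by a space.
        bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
        doc_counts.append(Counter(tokens + bigrams))
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for counts in doc_counts for term in counts)
    # Drop unigrams and bigrams that appear in fewer than `min_docs` documents.
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_docs)
    matrix = [[counts[t] for t in vocab] for counts in doc_counts]
    return vocab, matrix
```

Each row of the resulting matrix corresponds to one whitepaper and each column to one retained unigram or bigram.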
We consider the following supervised machine learning approaches as potential candidates:
penalized linear methods (LASSO, ridge, and elastic net), dimension reduction methods (PCR and
PLS), tree-based ensemble methods (random forest, gradient boosting), and neural networks. In
the Online Appendix, we provide a brief introduction for each supervised method. In order to tune
the hyperparameters of the supervised learning models and find the best model for constructing our
supervised technology index, we need to quantify the performance of the model. We evaluate the
model performance based on out-of-sample $R^2$. We use 5-fold cross-validation to build a validation
set whose labels are known but are not used for training. Specifically, we divide the training set
into five subsets, each of which contains 40 observations. Following that, each subset will be
used as the validation set to evaluate the model based on $R^2$, while the remaining four subsets are
used as the training set. The average out-of-sample $R^2$ is the simple average of $R^2$ on the five
subsamples. Table 3 shows the best out-of-sample R-square ($R^2_{OOS}$) for each supervised method
and their corresponding hyperparameters. For our sample, partial least square (PLS) performs the
best and has an $R^2_{OOS}$ of 45.88%. Hence, we use the predicted technology score from PLS as our
supervised technology index.
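The cross-validation procedure can be sketched as below. This is an illustration under stated assumptions, not our implementation: the one-regressor OLS model is a stand-in for PLS and the other candidate methods, the out-of-sample R-squared is computed against the mean of each validation fold, and all function names are hypothetical.

```python
import random


def r_squared(y_true, y_pred):
    """1 - SSE/SST, with SST taken around the mean of y_true (an assumption)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def fit_simple_ols(x, y):
    """Stand-in model with a single regressor; returns a prediction function."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return lambda new_x: [intercept + slope * v for v in new_x]


def cross_validated_r2(x, y, fit, k=5, seed=0):
    """Average out-of-sample R-squared over k folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds of (nearly) equal size
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        predict = fit([x[i] for i in train], [y[i] for i in train])
        scores.append(r_squared([y[i] for i in fold], predict([x[i] for i in fold])))
    return sum(scores) / k
```

With 200 labeled whitepapers and k = 5, each validation fold contains 40 observations, matching the split described above. Hyperparameters would be chosen by repeating this calculation for each candidate setting and keeping the one with the highest average.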
Word Embedding
Word embedding is one of the most popular word representation methods in natural language
processing (NLP) in recent years. Developed by Mikolov et al. (2013), its goal is to map words
to numerical vectors, such that the semantic similarity between words is captured by the geometric
distance in the vector space. How are such vectors constructed? The intuition comes from the
famous quotation of Firth (1957)—“You shall know a word by the company it keeps.” In other
words, the meaning of a word can be inferred from the context, so words appearing in similar con-
texts should have similar meanings.8 Word embedding has two main advantages over traditional
“bag-of-words” methods. First, it greatly reduces the number of dimensions. Word embedding
vectors usually have only a few hundred dimensions, while bag-of-words models are typically
sparse vectors of thousands of dimensions. Hence, it is a more efficient representation of the raw
text. Second, word embedding maps synonyms to adjacent vectors, so we can use clustering meth-
ods on the vector space to divide words into different topics. We use K-means as the clustering
algorithm. It is one of the simplest and most popular unsupervised machine learning methods.
Given a fixed number of clusters (k), K-means seeks a partition of the dataset, such that the within-
cluster sum of squared distances between each observation and its closest centroid is minimized.
In the Online Appendix, we provide details on the theoretical background of word embedding and
K-means clustering and how to choose the optimal number of topics.
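The K-means objective described above is commonly minimized with Lloyd's algorithm, which alternates between assigning points to their closest centroid and recomputing centroids. The sketch below is a minimal illustration on generic numeric vectors; the random initialization and the function names are our assumptions, and it is not the implementation used to cluster the word vectors.

```python
import random


def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def k_means(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: initialize centroids at k distinct randomly chosen points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
    return centroids, labels
```

In our setting, each point would be a word vector and each resulting cluster would be interpreted as a topic.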
We find that the optimal number of topics detected by the algorithm is 20. Hence, we use K-
means to cluster word embedding vectors into 20 topics. To interpret the embedding and clustering
results, we give each topic a label. For clustering methods, topics are mutually exclusive, so each
word can only be grouped into one topic. We name the topics based on the most frequent words in
each cluster. Table OA.1 lists the top 15 most frequent terms of each topic.
8 Li et al. (2019) provide a good example in the Appendix to illustrate the intuition.
To further understand the relationship between topics, we apply two machine learning tech-
niques. The first one is hierarchical agglomerative clustering (Murtagh and Legendre, 2014),
which can be used to construct a taxonomy of our topic model. Following Bybee et al. (2020),
we agglomerate topics recursively according to the semantic similarity between topics, as captured
by the distance between cluster centroids. Figure 1 displays the result and shows that three topics
(“blockchain”, “system”, and “algorithm”) belong to the same cluster. Another technique we use is
multidimensional scaling (MDS, Torgerson, 1958), which is a non-linear dimensionality reduction
algorithm such that the two-dimensional representation best preserves the distance between topics
in the original space. “Blockchain”, “system” and “algorithm” are combined into a broader topic in
the taxonomy. They are also adjacent to each other in the inter-cluster distance map. Therefore, we
consider these three topics as technology-related topics. For each whitepaper, we calculate the percentage
of words that belong to the “blockchain”, “system” or “algorithm” topics, normalize
it to zero mean and unit standard deviation, and define it as our embedding-based tech index.
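The index calculation can be sketched as follows, assuming each word has already been assigned to exactly one topic by the clustering step. The topic labels follow the taxonomy result above; the word-to-topic dictionary, the function names, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev

# Technology-related topics identified by the taxonomy and distance map.
TECH_TOPICS = {"blockchain", "system", "algorithm"}


def tech_share(tokens, word_to_topic):
    """Share of a whitepaper's words whose cluster is a technology-related topic."""
    hits = sum(word_to_topic.get(w) in TECH_TOPICS for w in tokens)
    return hits / len(tokens)


def standardize(values):
    """Rescale to zero mean and unit standard deviation (population formula)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]
```

Applying `tech_share` to every whitepaper and then `standardize` to the resulting list gives one index value per whitepaper, with positive values indicating above-average attention to technology-related topics.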
The second unsupervised machine learning method we use is Latent Dirichlet Allocation (LDA).
LDA is a popular method in the finance and economics literature. It has been used to analyze the
structure of economic news (Bybee et al., 2020) and to detect latent topics among employee re-
views (Sheng, 2019). The basic idea is that each document can be represented as a probability
distribution over various topics, where each topic is a probability distribution over the vocabulary
of a corpus. Similar to other textual analysis methods, LDA methods involve a step to remove
useless information (i.e., stop words) and then represent the text as data. In the Online Appendix,
we introduce the LDA model and describe the preprocessing procedures and the choice of topics
in more detail. We find that 20 topics is optimal based on the selection process.
To understand the LDA output with 20 topics, we interpret these topics by looking at top words
associated with each topic. This is a common approach adopted in most finance and economics
literature (e.g., Hansen et al., 2018; Sheng, 2019). Table OA.2 displays the top 15 most relevant
terms of each LDA topic (see the Online Appendix for details on how to find the top words for each topic).
We assign a label to each topic based on these key terms. Similar to word embedding, we also
use hierarchical agglomerative clustering and multidimensional scaling (MDS) to understand the
correlation between LDA topics. Panel A of Figure 2 shows the tree structure of LDA topics
and Panel B shows the MDS results. These results suggest that we should group “information”,
“blockchain” and “system” together and define the normalized proportional attention allocated to
the three topics as our LDA-based tech index.
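The recursive agglomeration used for both the embedding topics and the LDA topics can be illustrated with a minimal sketch that repeatedly merges the two clusters whose centroids are closest. The two-dimensional topic vectors are hypothetical, and the plain centroid-distance rule follows the verbal description in the text; it is not necessarily the exact Ward variant discussed in Murtagh and Legendre (2014).

```python
from itertools import combinations
from math import dist

# Hypothetical 2-d topic vectors; real topic representations are high-dimensional.
topic_vectors = {
    "blockchain": (0.9, 0.8),
    "system": (0.8, 0.9),
    "algorithm": (1.0, 1.0),
    "team": (-0.9, 0.1),
    "marketing": (-1.0, -0.1),
}

def centroid(members):
    """Mean vector of the topics in a cluster."""
    vectors = [topic_vectors[name] for name in members]
    return tuple(sum(coord) / len(vectors) for coord in zip(*vectors))

def agglomerate(names):
    """Merge the two clusters with the closest centroids until one remains.

    Returns the merge history as a list of (cluster_a, cluster_b) tuples.
    """
    clusters = [(name,) for name in names]
    history = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: dist(centroid(pair[0]), centroid(pair[1])),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        history.append((a, b))
    return history

merges = agglomerate(list(topic_vectors))
```

Reading the merge history from first to last gives the tree structure: topics that merge early are semantically close, which is the basis for grouping them into one technology cluster.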
Composite Index
Finally, we create a composite index to aggregate the information from the above three indexes.
This is done by taking the simple average of the supervised, embedding-based and LDA-based
technology indexes. Both supervised and unsupervised machine learning methods have pros and
cons. The composite index can potentially reduce the noise of each index, resulting in a useful
proxy. For most of the empirical analysis, we show the results for all of them, including the
composite index.
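As a minimal sketch of this aggregation, the composite index for each whitepaper is the simple average of its three standardized scores. The scores below are hypothetical and the function name is ours.

```python
from statistics import mean

def composite_index(supervised, embedding, lda):
    """Simple average of the three standardized indexes, per whitepaper."""
    return [mean(triple) for triple in zip(supervised, embedding, lda)]

# Hypothetical standardized scores for three whitepapers.
supervised = [1.0, -0.5, 0.25]
embedding = [0.5, -1.0, 0.25]
lda = [1.5, 0.0, 0.25]
tech_comp = composite_index(supervised, embedding, lda)
```

Because the three inputs are imperfectly correlated, their average need not have unit standard deviation even though each input does.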
Given that the construction of the technology index is one of the key components of this paper,
we use different machine learning methods to capture the technological sophistication of cryp-
tocurrencies. These methods have different advantages. Supervised machine learning methods are
relatively easy to interpret. Unsupervised machine learning methods require little human input and
do not require prior knowledge about the subject from the researchers.
In our paper, we employ multiple methods to construct the technology indexes and find
evidence that the measures capture meaningful information. First, the measures from different
machine learning methods are highly correlated with each other, suggesting that the measures are not just
driven by noise. Second, we construct a composite index to reduce the noise of each measure.
It is possible that each method captures only some aspect of the true technological components of
cryptocurrencies. Then, the composite index would provide a better proxy because it aggregates
the information from the three individual indexes. Third, we find consistent results based on all
four measures of technology indexes.
To better understand the indexes, we study the determinants of the technology indexes. We uti-
lize cryptocurrency characteristics from several dimensions, including whether they use Ethereum
blockchain, GitHub data, whitepaper information, and other characteristics. GitHub is an
open-source online platform that provides repository hosting service for developers. Using the API
provided by GitHub, we obtain the number of (1) users subscribing to updates of the repository (watch),
(2) “likes” received by the repository (star), (3) copies made by other developers (fork), (4) code
revisions (commit), (5) pointers to specific versions (branch) and (6) developers who have con-
tributed to the source code (contributor). These measures are often used by researchers to proxy
for product quality and post-ICO technology development (Deng et al., 2018; Dittmar and Wu,
2019). For the determinant results, we use commit as the GitHub measure. In the Online Ap-
pendix, we also use other GitHub measures as robustness checks and obtain qualitatively similar
results.
Table 2 documents the results that relate the technology indexes to these cryptocurrency char-
acteristics. We use the composite index as the dependent variable and we show qualitatively sim-
ilar results using the other indexes in the Online Appendix. Each of columns (1)–(4) reports the
determinant models based on a dimension of coin characteristics. Column (1) shows that cryp-
tocurrencies that use Ethereum blockchain tend to have lower tech indexes, confirming the prior
that cryptocurrencies that build their own blockchain on average have higher tech indexes. Column
(2) shows that cryptocurrencies with more code revisions in GitHub have higher tech indexes. In
Column (3), we find that cryptocurrencies with ambiguous whitepapers tend to have lower tech in-
dexes. In Column (4), we find that cryptocurrencies with more reliable and supportive teams have
higher tech indexes. For example, team size and the Twitter account dummy positively predict tech
indexes. Column (5) combines all the cryptocurrency characteristics and delivers consistent mes-
sages. However, the R-squared of Model (5) is 0.136, suggesting that the majority of the variation
in the tech indexes is not captured by the cryptocurrency characteristics.
4 Main Results
In this section, we examine whether the technological component of ICOs is associated with
ICO success, short-run, and long-run performances. We capture the technological component of
ICOs using the four technology indexes we defined above, and we evaluate an ICO using both its
fund-raising stage information and its subsequent performance data.
First, we study the set of characteristics in ICO whitepapers that are most related to ICO suc-
cess. We use two ways to measure ICO success. The first measure of ICO success is based on
whether the cryptocurrency is listed on coinmarketcap.com (CMC trading) and the second
measure is based on whether the ICO successfully raised capital. If the entrepreneur cannot raise
any funding, the ICO is not likely to succeed. Therefore, the ability to raise funding is one of the
most important steps in a successful ICO. If investors care about the technological components
of ICOs, we should expect that it is easier for ICOs with more sophisticated technologies to raise
funding. Companies voluntarily disclose whitepapers to communicate with investors in the fund-
raising stage, and one of the primary ways that investors evaluate coins is through whitepapers.
If whitepapers indeed inform investors about the different aspects of the ICOs, we would be able
to extract information from the whitepapers. As discussed in Section 3, we form four measures
to summarize the technological component of ICOs: (1) an index based on word embedding, (2)
an index based on LDA, (3) an index based on supervised machine learning, and (4) a composite
index.
Table 4 documents the results that relate ICO whitepapers’ characteristics to ICO successes.
Panel A of Table 4 presents results based on CMC trading and Panel B presents results based on
whether the cryptocurrency successfully raised capital. We report coefficient estimates for each
of the four whitepaper indexes as well as the control variables. Time, categorical, and geographic
fixed effects are included in the specifications when indicated.
Panel A shows that the CMC trading indicator positively loads on all four tech indexes, sug-
gesting that when the tech indexes are high, the cryptocurrencies are more likely to be listed on
coinmarketcap.com. The coefficient estimates are 0.070, 0.107, 0.086, and 0.124 for the four in-
dexes, respectively. The relationships are highly significant at the 1 percent level for all four cases.
The economic magnitudes are large. For example, the coefficient estimate on the composite tech-
nology index is 0.124 in the univariate specification and the standard deviation of the tech_comp
index is 0.84. In other words, a one standard deviation increase in the composite index leads to
an increase of the listed probability by 10.4 percent—a 40.1 percent increase of the sample aver-
age of the listed probability. In the multivariate specification with controls and fixed effects, the
coefficient estimate on the composite technology index is 0.066. That is, a one standard deviation
increase in the composite index is associated with an increase of the listed probability by 5.54
percent under the multivariate specification.
Panel B measures ICO success based on whether the ICO raised capital (Success indicator).
The coefficient estimates are largely consistent with those in Panel A—the coefficients on the four
indexes are 0.061, 0.077, 0.056, and 0.091. The loadings on the four technology indexes remain
positive and highly statistically significant at the 1 percent level for all the specifications. The
coefficient estimate on the composite technology index is 0.091 in the univariate regression, which
suggests that a one standard deviation increase in the composite index is associated with a 7.64
percent increase of the probability that the ICO raised capital—a 20.1 percent increase of the
sample average. In the multivariate specification with controls and fixed effects, the coefficient
estimate on the composite technology index is 0.060. That is, a one standard deviation increase
in the composite index is associated with a 5.04 percent increase in the probability that the ICO
raised capital.
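The economic magnitudes quoted in Panels A and B follow from multiplying each coefficient by the standard deviation of the composite index. The short sketch below reproduces that arithmetic using only numbers stated in the text; the function and variable names are ours. Since the dependent variables are binary indicators in a linear regression, the products are changes in probability, that is, percentage points.

```python
# Coefficients and the composite-index standard deviation are taken from the text.
SD_TECH_COMP = 0.84

coefficients = {
    ("CMC trading", "univariate"): 0.124,
    ("CMC trading", "multivariate"): 0.066,
    ("Raised capital", "univariate"): 0.091,
    ("Raised capital", "multivariate"): 0.060,
}

def one_sd_effect(coef, sd=SD_TECH_COMP):
    """Change in predicted probability for a one standard deviation increase."""
    return coef * sd

effects = {key: one_sd_effect(coef) for key, coef in coefficients.items()}
for (outcome, spec), effect in effects.items():
    print(f"{outcome:15s} {spec:13s} {100 * effect:.2f} percentage points")
```

The four products are approximately 10.4, 5.54, 7.64, and 5.04 percentage points, matching the figures reported above.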
Further evidence that the technological component serves as an important factor for ICO
success is the R-squared. For example, Panel A of Table 4 shows that each of the
technology indexes on its own already explains between 3 percent and 6 percent of the variation of CMC trading.
Interestingly, in untabulated results, we find that the Quarterly Fixed Effects seem to be the most
important factor–they explain 14 percent of the variation of the CMC Trading variable. In other
words, the timing of the ICOs is important in determining whether they can successfully raise
capital. Nevertheless, our technology indexes are still some of the most important factors that con-
tribute to the success of an ICO. Overall, the results show that when an ICO whitepaper contains
more discussion on technology-related topics as captured by our indexes, the ICO is more likely to
be successful.
Industry Subsample
In this subsection, we test whether the technology indexes are stronger predictors of ICO
successes in industries where technological components are deemed more important. In certain
industries (e.g., platform; trading), investors may scrutinize the technological components of the ICOs,
while in other industries (e.g., gaming; charity), this is not the case. We categorize “platform”,
“cryptocurrency”, and “trading” as the technology-related industry, and construct an indicator
variable (“industry”) to denote the technology-related industries. We test whether the technology
indexes predict ICO successes more strongly for coins in the technology-related industries.
We present the subsample results based on industries in Table 5. Consistent with the base-
line results, the technology indexes positively predict ICO successes. The cross-terms between
the technology indexes and the “industry” indicator are all positive and largely significant. The
economic magnitudes are large. For example, judging from the composite index, the coefficient
estimates almost double for the coins in the technology-related industries relative to the rest of the
coins. These results also support the view that investors value the technological components of the
cryptocurrencies, especially for the coins in the technology-related industries.
4.2 Long-Horizon Performance
In this section, we investigate whether the technology indexes help forecast the medium- to
long-horizon ICO returns. In the equity market, initial public offerings tend to underperform in
the long run (see Ritter, 1991; Loughran and Ritter, 1995). In sharp contrast, initial coin offerings
perform well in the medium- to long-horizon (see Benedetti and Kostovetsky, 2018). In order
to study the speed of information acquisition of the investors, we ask whether the long-horizon
performance of the ICOs is related to the technology indexes.
We track the subsequent returns of the ICOs over different horizons—from 7-days ahead to
300-days ahead. Shumway (1997) documents that stock delisting is associated with a negative 10
percent return on average. In the robustness test section, we experiment with alternative assump-
tions and show that the results are consistent.
First, we look at how the technology indexes predict the subsequent performance of initial coin
offerings. The results are documented in Table 6. Panels A, B, C, and D of Table 6 document the
results for the index based on supervised machine learning, index based on word embedding, index
based on LDA, and the composite index, respectively. We regress the cumulative ICO returns on
current technology indexes, controls, and fixed effects. In general, we find that the technology
indexes positively predict the subsequent performances of the ICOs. For example, based on the
composite technology index, the point estimates are positive across all horizons. The point
estimates steadily increase but are insignificant at short horizons. The point estimates become
significant at longer horizons. At the 240-day horizon, the point estimate increases to 0.280,
indicating a 23.5 percent increase in cumulative returns at this horizon for a one standard deviation
increase in the composite technology index. At the 300-day horizon, a one standard deviation
increase in the composite technology index leads to a statistically significant 23.9 percent increase
in cumulative returns.
The ICOs took place at different times, and our return measures do not take the time component
information into consideration. A common factor that is important for the ICO market is Bitcoin
returns. Thus, we also conduct a similar exercise with abnormal returns that are adjusted for Bitcoin
returns. Table 7 reports the results of this test, which are similar to those in Table 6 in terms of statistical
significance and economic magnitude. For example, based on the composite technology
index, the point estimates remain positive across all horizons. The point estimates become
significant at the 180-day horizon. At the 240-day horizon, the point estimate increases to 0.300,
indicating a 25.2 percent increase in cumulative returns at this horizon for a one standard deviation
increase in the composite technology index. At the 300-day horizon, a one standard deviation in-
crease in the composite technology index leads to a statistically significant 29.7 percent increase
in cumulative returns.
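To make the two return measures concrete, the sketch below computes a raw cumulative return and a Bitcoin-adjusted return over a fixed horizon. The definitions are assumptions for illustration only: the text does not spell out the exact formulas, so we take the cumulative return to be the simple price ratio minus one and the adjustment to be a subtraction of the Bitcoin return over the same window. All prices are hypothetical.

```python
def cumulative_return(prices, horizon):
    """Simple return from the first price to the price `horizon` days later.

    Returns None when the series is too short for the horizon.
    """
    if horizon >= len(prices):
        return None
    return prices[horizon] / prices[0] - 1.0

def bitcoin_adjusted_return(coin_prices, btc_prices, horizon):
    """Coin cumulative return minus the Bitcoin return over the same window."""
    coin = cumulative_return(coin_prices, horizon)
    btc = cumulative_return(btc_prices, horizon)
    if coin is None or btc is None:
        return None
    return coin - btc

# Hypothetical daily prices; index 0 is the coin's first trading day.
coin_prices = [1.00, 1.10, 1.05, 1.20, 1.50, 1.40, 1.60, 1.80]
btc_prices = [100.0, 101.0, 99.0, 102.0, 105.0, 104.0, 108.0, 110.0]

raw_7d = cumulative_return(coin_prices, 7)
adj_7d = bitcoin_adjusted_return(coin_prices, btc_prices, 7)
```

Aligning both series on the coin's own listing date is what lets the adjustment absorb market-wide movements that differ across ICOs launched at different times.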
Overall, the medium- and long-horizon results are consistent with the idea that it takes time
for the market to fully incorporate information about technological sophistication. Although coins
with high technology scores have a high probability of raising funds, investors undervalue these
high-tech coins on average.
In this section, we use two additional measures to evaluate ICO performances. The first one is
the liquidity measure and the second one is the delisting probability measure.
We measure coins’ liquidity as the log transformation of the 24-hour trading volume. On
average, we find that liquidities are higher for older coins, consistent with Howell et al. (2020). We
examine the relationships between characteristics of whitepapers and coins’ liquidity measures.
We report the results in Table 8. In our model specifications, we include quarterly, categorical, and
geographic fixed effects. We find that the four technology indexes are positively associated with
coin liquidity. These results are always statistically significant across the different horizons since
inception.
We then investigate the relationships between coins’ delisting probability and the characteris-
tics of the whitepapers. We define Delist as an indicator variable, which is equal to 1 if a token
is delisted from CMC. The results are reported in Table 9. The results show that coins with high
technology scores are less likely to be delisted subsequently. The economic magnitude of the ef-
fect is large. For instance, in the standalone specification, a one standard deviation increase in the
composite technology index leads to a 2.52 percent decrease in delisting probability.
The results in this section highlight that coins with high technology scores are intrinsically
superior. The results support our argument that investors in the coin market take
technical aspects of the ICOs into consideration. However, as we have shown above, it takes a
considerable amount of time for the market to reach the proper pricing of the ICOs eventually.
5 Discussion
In the previous section, we show that the technology indexes strongly and positively predict
ICO successes and subsequent performances. We argue that the results are consistent with the
notion that investors care about the technological sophistication of the cryptocurrencies, but it takes
time for the market to incorporate the information, leading to predictable returns. In the first two
parts of this section, we present additional evidence in support of the delayed reaction mechanism
and attempt to rule out potential alternative explanations. In the last two parts of the section, we
present additional robustness checks.
In this subsection, we compare our tech indexes with other measures that may contain infor-
mation on the technological sophistication of cryptocurrencies, including a GitHub measure and a
simple word count measure.
One candidate that potentially captures some information about the technological sophistication
of cryptocurrencies is the set of GitHub measures. However, the GitHub measures are ex post measures
that capture information about the successes of the ICOs. Moreover, these measures may contain
information such as the hype around cryptocurrencies.
In addition, there are multiple methods to conduct textual analysis. For example, the word-
count method where we can just count the number of words that belong to a dictionary is well-
accepted in the finance and economic literature (e.g., Manela and Moreira, 2017; Liu and Matthies,
2018; Fisher et al., 2020). The word-count method is particularly useful when researchers have
good prior knowledge about what they are looking for and the list of words is straightforward.
However, cryptocurrency and blockchain are new phenomena and researchers have limited knowl-
edge about what should be a good list of words to describe the technology involved. In this case,
unsupervised machine learning methods, such as LDA, are more appropriate and can overcome this
issue. One important advantage of machine learning methods, especially the unsupervised machine
learning methods such as word embedding and LDA, is that they do not require researchers to have
good prior knowledge about what type of words they are looking for in the texts.
That said, we construct technology measures from GitHub and from a simple word
count method to compare with our tech indexes. The GitHub measure we use is commits, the
number of code revisions of a project on GitHub. The simple word count measure captures the
percentage of technology words in a whitepaper, where the technology words are defined by a blockchain
dictionary.9 The complete word list can be found in Table OA.3. In Table 10, we present results
using the tech indexes to predict CMC trading, controlling for these two types of measures. Columns
(1)–(3) report results using the composite tech index, the GitHub commits measure, and the simple
word count measure, and the point estimates on the variables are all positive and significant at the 5
percent level. Columns (4) and (5) report results using the composite tech index controlling for the
GitHub commits measure and the simple word count measure, respectively. When both the com-
posite tech index and the GitHub measure are included, the coefficient estimates on both measures
remain positive and highly statistically significant. However, the tech index completely subsumes
the explanatory power of the simple word count measure. When all three variables are included,
the coefficient estimate on the composite tech index remains positive and statistically significant at
the 1 percent level, and the point estimate on the simple word count measure is insignificant.
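For comparison with the machine learning indexes, the simple word count measure can be sketched in a few lines. The dictionary below is a hypothetical excerpt (the paper's full list is in Table OA.3), and the lowercase alphabetic tokenizer is an assumption.

```python
import re

# Hypothetical excerpt of a blockchain glossary; not the paper's actual word list.
TECH_DICTIONARY = {"blockchain", "consensus", "hash", "node", "ledger", "cryptography"}

def tech_word_percentage(text, dictionary=TECH_DICTIONARY):
    """Percentage of word tokens in `text` that appear in the dictionary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(token in dictionary for token in tokens)
    return 100.0 * hits / len(tokens)

sample = "Each node validates the hash and appends it to the shared ledger."
share = tech_word_percentage(sample)
```

The measure depends entirely on the chosen word list, which is the limitation that motivates the unsupervised methods discussed above.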
9 See https://consensys.net/knowledge-base/a-blockchain-glossary-for-beginners/;
https://blockgeeks.com/guides/blockchain-glossary-from-a-z/; https://www.blockchaintechnologies.com/glossary/.
5.2 Subsample on Whitepaper Readability
In the main result section, we argue that the findings are consistent with investors’ delayed
reaction to technological sophistication due to the complex nature of the cryptocurrency whitepapers. An implication of
this argument is that investors should be able to quickly incorporate the fundamental information if
the whitepapers are written clearly. Therefore, among the whitepapers with high readability, we
should expect weaker results on long-horizon performances.
We measure the whitepaper readability using the Fog index. We construct an indicator variable
(“Easy”) that equals 1 if the whitepaper has a below-median Fog index and 0 otherwise. We
present the results in Table 11. Panel A of Table 11 presents results based on the rate of returns.
Consistent with the baseline results, we find that the technology indexes positively and significantly
predict the long-horizon returns. The cross-terms between the technology indexes and the indicator
variable (“Easy”) are negative and significant at the long horizons, suggesting that the long-horizon
return predictability of the technology indexes concentrates among the cryptocurrencies with low
readability. For example, the coefficient estimate on the composite index at the 300-day horizon
is 1.217, while the cross term between the composite index and the indicator variable at the same
horizon is -1.160. That is, the long-horizon return predictive power of the composite index is almost entirely
concentrated in the cryptocurrencies with low readability.
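The readability split can be illustrated with a rough sketch of the Gunning Fog index, 0.4 × (average words per sentence + percentage of complex words), followed by the below-median indicator. The syllable counter is a crude vowel-run heuristic and the complex-word rule (three or more syllables, with no exclusions) is a simplification, so the scores are indicative only. The sample texts are hypothetical.

```python
import re
from statistics import median

def count_syllables(word):
    """Crude heuristic: count runs of vowels (not a dictionary-based count)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(text):
    """Gunning Fog: 0.4 * (words per sentence + percent of complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100.0 * len(complex_words) / len(words))

def easy_indicator(fog_scores):
    """1 if a whitepaper's Fog index is below the sample median, else 0."""
    cutoff = median(fog_scores)
    return [1 if score < cutoff else 0 for score in fog_scores]

# Hypothetical whitepaper excerpts.
texts = [
    "We sell coins. You buy them.",
    "Our token runs on a fast chain. It is cheap to use.",
    "Decentralized cryptographic verification eliminates intermediary counterparties.",
]
fog_scores = [fog_index(t) for t in texts]
easy = easy_indicator(fog_scores)
```

A lower Fog index means easier text, so the indicator equals 1 for the more readable half of the sample.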
Panel B of Table 11 shows results based on the Bitcoin-adjusted rate of returns. Similar to the
results in Panel A, we find that the cross terms between the technology indexes and the indicator
variables are negative and significant at the long-horizons across all the specifications. Overall,
we confirm the implication of the investor delayed reaction mechanism: the long-horizon return
predictability results are weaker for cryptocurrencies with high readability.
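The interaction estimates can be read as two separate slopes. Assuming a specification in which returns load on the composite index and on its product with the readability indicator (the exact specification is in Table 11, not reproduced here), the implied slope for each group at the 300-day horizon in Panel A is:

```python
# Point estimates at the 300-day horizon, taken from the text (Panel A of Table 11).
BETA_TECH = 1.217          # coefficient on the composite index
BETA_TECH_X_EASY = -1.160  # coefficient on composite index x Easy

def implied_slope(easy):
    """Slope of returns on the composite index for Easy = 0 or Easy = 1."""
    return BETA_TECH + BETA_TECH_X_EASY * easy

slope_hard = implied_slope(0)  # low-readability whitepapers
slope_easy = implied_slope(1)  # high-readability whitepapers
```

The slope falls from 1.217 for low-readability whitepapers to about 0.057 for high-readability ones, which is the sense in which the predictability is concentrated in hard-to-read whitepapers.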
5.3 Return Reversal
In the previous section, we find that the technology indexes positively and significantly predict
cumulative ICO returns over the long horizons. We argue that the findings are consistent with in-
vestors’ delayed reaction to the technical aspects of the cryptocurrencies. An important alternative
interpretation of the findings is that investors may overreact to technological sophistication of the
cryptocurrencies, which would also generate ICO return predictability. Barberis et al. (1998) theoretically
demonstrate that investor overreaction to fundamentals can lead to overvaluation of asset values.
Pastor and Veronesi (2003) show that investor learning about uncertain fundamentals can lead to a
bubble-like phenomenon. A common implication of the models based on investor overreaction or
learning of asset fundamentals is that the asset values would eventually reverse to the fundamental
values. Technology is one important aspect of the fundamentals of cryptocurrencies. Therefore, we
would expect return reversal if investors overreact to the technological fundamentals of
cryptocurrencies.
To test this alternative mechanism, in this section, we test whether there is a long-horizon return
reversal phenomenon for the ICOs with high technology indexes. To detect any return reversal
effect, we use the technology indexes to predict ICO returns from 180 days onward. The results
are documented in Table 12. Panel A and Panel B of the table document results for the rate of
returns and Bitcoin-adjusted rate of returns, respectively. Overall, we do not find evidence of more
subsequent return reversal for coins with high technology indexes.
Extensive research has shown that there is a substantial amount of first-day performance in
initial public offerings in the equity market.10 Recently, Benedetti and Kostovetsky (2018) docu-
ment a similar first-day price reaction in the initial coin offering market. In this section, we study
whether the technology indexes help predict not only the long-horizon phenomenon but also the
first-day price reaction of ICOs.
Our measure of first-day price reaction is defined as the natural logarithm of the ratio between
the first opening price and the ICO offer price. By definition, the sample only includes coins
with trading records. Table 13 reports the results for ICO first-day price. Quarterly, categorical
and geographic fixed effects are included in the specifications when indicated. We find that the
coefficient estimates of the four technology indexes are all positive and significant at the 1 percent
level. In other words, the technology indexes positively and significantly predict the first-day
price reaction. The coefficient estimates are 0.337, 0.414, 0.310, and 0.458 for the four indexes,
respectively. The economic magnitudes of the coefficient estimates are large. The coefficient
estimate remains stable in the multivariate specification with controls and fixed effects, where the
coefficient estimate on the composite index is 0.417.
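The dependent variable in this test, as defined above, is the natural logarithm of the ratio between the first opening price and the ICO offer price. A minimal sketch with hypothetical prices (the function name is ours):

```python
from math import log

def first_day_reaction(first_open_price, offer_price):
    """Natural log of the first opening price over the ICO offer price."""
    if first_open_price <= 0 or offer_price <= 0:
        raise ValueError("prices must be positive")
    return log(first_open_price / offer_price)

# Hypothetical coin: offered at 0.10 and first opens at 0.15.
reaction = first_day_reaction(0.15, 0.10)
```

A coin that opens exactly at its offer price has a reaction of zero, and positive values indicate that the first opening price exceeds the offer price.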
Overall, the technology indexes strongly and positively predict both short-horizon and
long-horizon ICO performances. These two sets of results suggest that, although coin market investors
take the technical aspects of coins into consideration, they fail to incorporate the information fully.
5.5 Robustness
In this section, we conduct several robustness tests. First, we use an alternative measure of
success, Trading, which indicates whether the token is traded on a cryptocurrency exchange. We
examine whether the technology indexes predict ICO success under this measure and run a similar
regression as in Table 4. Table 14 Panel A reports the result. The coefficients on the technology
indexes are positive and significant and support the same conclusion as in Table 4.
Second, we use linear regression in Table 4 where the dependent variable is a binary variable.
Alternatively, we can use a Logit or Probit model. Table 14 Panel B reports the results from a
Logit regression, which are similar to those in Table 4. In untabulated results, we show that the
results under the Probit model are qualitatively similar.
Third, it is well-documented that we have to impute delisted returns for equity to avoid delisting
bias in the data (Shumway, 1997). The equity return data from CRSP automatically contain im-
puted returns for delisted stocks. For the same reason, we may need to impute returns for delisted
ICOs. For all delisted ICOs, we set their returns after listing to a large negative value of -99%. We then
redo the tests on whether the technology indexes affect short-run and long-run returns with and
without adjusting for Bitcoin returns as in Tables 6 and 7. Table 14 Panels C and D report the results.
Similar to the results in Tables 6 and 7, ICOs with higher technology indexes tend to outperform
in the long-run. The economic magnitudes are also close.
6 Conclusion
There are two views about cryptocurrency and blockchain technology. The first view is that
the cryptocurrency market represents bubbles and fraud. The second one believes that the value of
the cryptocurrency market comes from the innovative technologies and that a stake in cryptocur-
rencies is an investment in the future of the technology. This study contributes to this debate by
providing novel measures of technological sophistication of cryptocurrencies via textual analysis
of ICO whitepapers. We construct a set of four text-based technology indexes from a comprehensive
sample of ICOs’ whitepapers. We find that ICOs with higher tech indexes are more likely to
succeed and less likely to be delisted subsequently. Although the tech indexes do not statistically
significantly affect the short-run returns of ICOs, they have a positive impact on their long-run
performance. In short, our findings suggest that technological sophistication is an important driving
force for the performances and valuations of ICOs.
Our findings have important policy implications. Although the SEC has launched several initiatives
on regulating ICOs, there are no clear disclosure requirements. Our results show that disclosures
such as whitepapers are potentially important for the long-term development of the cryptocurrency
market. Thus, it might be useful to set up a requirement or guideline for formats and necessary
components in the whitepaper, which is a natural analogy for disclosure requirements for public
firms (e.g., 10K) and financial firms (e.g., 497K for mutual funds).
References
Abadi J, Brunnermeier M. 2018. Blockchain economics. Working Paper, National Bureau of
Economic Research.
Beatty RP, Ritter JR. 1986. Investment banking, reputation, and the underpricing of initial public
offerings. Journal of Financial Economics 15: 213–232.
Benedetti H, Kostovetsky L. 2018. Digital tulips? returns to investors in initial coin offerings.
Working Paper, Boston College .
Blei DM, Ng AY, Jordan MI. 2003. Latent dirichlet allocation. Journal of Machine Learning
Research 3: 993–1022.
Budish E. 2018. The economic limits of bitcoin and the blockchain. Working paper, University of
Chicago and NBER .
Buehlmaier MM, Whited TM. 2018. Are financial constraints priced? evidence from textual
analysis. The Review of Financial Studies 31: 2693–2728.
Bybee L, Kelly BT, Manela A, Xiu D. 2020. The structure of economic news. Working paper, Yale
University .
Catalini C, Gans JS. 2018. Initial coin offerings and the value of crypto tokens. Working paper,
University of Toronto and NBER .
Cong LW, He Z. 2019. Blockchain disruption and smart contracts. Review of Financial Studies
32: 1754–1797.
Cong LW, Li X, Tang K, Yang Y. 2020. Crypto wash trading. Available at SSRN 3530220 .
Cong LW, Li Y, Wang N. 2019. Tokenomics: Dynamic adoption and valuation. Working paper,
University of Chicago .
Deng X, Lee YT, Zhong Z. 2018. Decrypting coin winners: Disclosure quality, governance mech-
anism and team networks. Working paper, Shanghai University of Finance and Economics .
Dittmar RF, Wu DA. 2019. Initial coin offerings hyped and dehyped: An empirical examination.
Working paper, University of Michigan .
Firth JR. 1957. A synopsis of linguistic theory, 1930-1955. Studies in Linguistic Analysis .
Fisher AJ, Martineau C, Sheng J. 2020. Macroeconomic attention and announcement risk premia.
Working paper, University of British Columbia .
Florysiak D, Schandlbauer A. 2019. The information content of ICO white papers. Working paper,
Available at SSRN 3265007.
Gentzkow M, Kelly B, Taddy M. 2019. Text as data. Journal of Economic Literature 57: 535–574.
Griffin JM, Shams A. 2020. Is Bitcoin really un-tethered? Journal of Finance, forthcoming.
Griffiths TL, Steyvers M. 2004. Finding scientific topics. Proceedings of the National Academy of
Sciences 101: 5228–5235.
Gu S, Kelly B, Xiu D. 2020. Empirical asset pricing via machine learning. Review of Financial
Studies, forthcoming.
Hansen S, McMahon M, Prat A. 2018. Transparency and deliberation within the FOMC: A
computational linguistics approach. Quarterly Journal of Economics 133: 801–870.
Hinzen FJ, John K, Saleh F. 2019. Proof-of-work’s limited adoption problem. Working paper, New
York University.
Howell ST, Niessner M, Yermack D. 2020. Initial coin offerings: Financing growth with
cryptocurrency token sales. Review of Financial Studies, forthcoming.
Irresberger F, John K, Saleh F. 2020. The public blockchain ecosystem: An empirical analysis.
NYU Stern School of Business.
Kelly B, Papanikolaou D, Seru A, Taddy M. 2018. Measuring technological innovation over the
long run. Technical report, National Bureau of Economic Research.
Lee J, Li T, Shin D. 2019. The wisdom of crowds in fintech: Evidence from initial coin offerings.
Working paper, University of Florida.
Li K, Mai F, Shen R, Yan X. 2019. Measuring corporate culture using machine learning. Available
at SSRN 3256608.
Liu Y, Matthies B. 2018. Long run risk: Is it there? Working paper, Yale University.
Liu Y, Tsyvinski A. 2018. Risks and returns of cryptocurrency. Working paper, Yale University
and NBER.
Liu Y, Tsyvinski A, Wu X. 2019. Common risk factors in cryptocurrency. Working paper, Yale
University and NBER.
Loughran T, McDonald B. 2011. When is a liability not a liability? Textual analysis, dictionaries,
and 10-Ks. Journal of Finance 66: 35–65.
Loughran T, Ritter JR. 1995. The new issues puzzle. Journal of Finance 50: 23–51.
Lyandres E, Palazzo B, Rabetti D. 2020. ICO success and post-ICO performance. Working paper.
Manela A, Moreira A. 2017. News implied volatility and disaster concerns. Journal of Financial
Economics 123: 137–162.
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. 2013. Distributed representations of words
and phrases and their compositionality. In Advances in Neural Information Processing Systems.
3111–3119.
Murtagh F, Legendre P. 2014. Ward’s hierarchical agglomerative clustering method: Which
algorithms implement Ward’s criterion? Journal of Classification 31: 274–295.
Pastor L, Veronesi P. 2003. Stock valuation and learning about profitability. Journal of Finance
58: 1749–1789.
Ritter JR. 1991. The long-run performance of initial public offerings. Journal of Finance 46: 3–27.
Röder M, Both A, Hinneburg A. 2015. Exploring the space of topic coherence measures. In
Proceedings of the Eighth ACM International Conference on Web Search and Data Mining.
399–408.
Russell SJ, Norvig P. 2010. Artificial Intelligence: A Modern Approach (3rd internat. edn.). Pearson
Education.
Satopaa V, Albrecht J, Irwin D, Raghavan B. 2011. Finding a “kneedle” in a haystack: Detecting
knee points in system behavior. In 2011 31st International Conference on Distributed Computing
Systems Workshops. IEEE, 166–171.
Shams A. 2019. What drives the covariation of cryptocurrency returns? Working paper, Ohio State
University.
Sheng J. 2019. Asset pricing in the information age: Employee expectations and stock returns.
Working paper, University of California Irvine.
Shumway T. 1997. The delisting bias in CRSP data. Journal of Finance 52: 327–340.
Sievert C, Shirley K. 2014. LDAvis: A method for visualizing and interpreting topics. In
Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces. 63–70.
Taddy M. 2012. On estimation and selection for topic models. In Artificial Intelligence and
Statistics. 1184–1193.
Tetlock PC. 2014. Information transmission in finance. Annual Review of Financial Economics 6:
365–384.
Yermack D. 2017. Corporate governance and blockchains. Review of Finance 21: 7–31.
Appendix: Variable Definition
Variable Definition
ICO Success Measures:
CMC Trading A dummy variable that equals one if a cryptocurrency is shown as listed on
coinmarketcap.com (CMC).
Trading A self-reported dummy by ICO fundraisers about whether the cryptocurrency
is traded on an exchange.
Success A dummy variable indicating whether the ICO raises any capital.
Trading Variables:
First Open/ICO Price The ratio between the first day’s opening price and the ICO price.
Delist An indicator for whether a token is delisted from CMC.
Rate of Return The rate of return that investors earn if they buy a cryptocurrency at the
opening price on the first listing day and sell it after a certain holding
period.
Trading Volume The 24-hour trading volume in millions of USD after a cryptocurrency has been
listed on CMC for a certain period of time.
Whitepaper Measures:
Tech_sup The normalized predicted technology score from partial least squares (PLS), a
supervised machine learning approach.
Tech_embed The normalized percentage of words in the “blockchain”, “information” or
“algorithm” topics of the word embedding and clustering approach.
Tech_lda The normalized proportional attention allocated to the “information”,
“blockchain” and “system” topics of the LDA topic modelling approach.
Tech_comp The simple average of Tech_sup, Tech_embed and Tech_lda.
Fog Index A readability measure defined as
$0.4\,[(\text{words}/\text{sentences}) + 100\,(\text{complex words}/\text{words})]$, where “complex
words” are words with three or more syllables.
Tone The difference between the numbers of positive and negative words defined in
Loughran and McDonald (2011), divided by the total number of words in a
whitepaper.
Uncertainty The number of uncertainty words defined in Loughran and McDonald (2011)
divided by the total number of words in a whitepaper.
ICO characteristics:
Has GitHub A dummy variable that equals one if the ICO project has a GitHub
homepage.
Has Twitter A dummy variable that equals one if the ICO project has a Twitter account.
ICO Length The number of days from the start to the end of an ICO campaign.
Team Size The number of ICO team members.
Pre ICO A dummy variable indicating whether a pre-ICO exists.
Bonus A dummy variable indicating whether the fundraiser offers a bonus to investors.
Ethereum Based A dummy variable indicating whether the ICO project is built on Ethereum.
Accept BTC A dummy variable indicating whether the ICO accepts Bitcoin as a currency
of payment.
BTC Price (ICO) The price of Bitcoin in thousands of US dollars on the day an ICO initiates.
BTC Price (List) The price of Bitcoin in thousands of US dollars on the day an ICO is shown
as listed on CMC.
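The text-based whitepaper measures defined above can be illustrated with a minimal sketch. It assumes a crude vowel-group syllable counter and toy word lists; both are hypothetical stand-ins, not the paper's tokenizer or the actual Loughran and McDonald (2011) dictionaries, and the scaling of Tone reported in the tables may differ.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable count from vowel groups (a heuristic assumption)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing silent "e" as non-syllabic, except in "-le" endings.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fog_index(text: str) -> float:
    """0.4 * [(words / sentences) + 100 * (complex words / words)]."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    # "Complex words" are words with three or more syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))


def tone(text: str, positive: set, negative: set) -> float:
    """(positive words - negative words) / total words, as a fraction."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return (pos - neg) / len(words)


sample = "The token is secure. The decentralized protocol is efficient."
sample_fog = fog_index(sample)
sample_tone = tone(sample, {"secure", "efficient"}, {"risk"})
```

Uncertainty follows the same pattern as `tone`, with a single word list in the numerator.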
Figures & Tables
Figure 2: LDA Visualization
This figure plots the relationship between LDA-based topics. Panel (a) displays the taxonomy generated
by hierarchical agglomerative clustering. Panel (b) shows the similarity between topics in a two-dimensional
space. The size of the circle represents the relative topic prevalence in the corpus. “Information”,
“blockchain” and “system” are used to construct the LDA-based tech index.
(a) Taxonomy
Table 1: Summary Statistics
This table presents summary statistics on variables related to ICO characteristics, outcomes and whitepaper
measures. Panel A shows descriptive statistics for 2,916 ICOs completed before December 31st, 2018.
Panel B summarizes a subsample of 765 ICOs listed on coinmarketcap.com. For each variable, we show the
number of non-missing observations, the mean, the standard deviation and the 10th, 50th and 90th percentile
values. Please refer to the “variable definition” in the Appendix for the definition of each variable.
Panel A: Full Sample
Obs. Mean SD p10 p50 p90
ICO Success Measures
CMC Trading 2916 0.26 0.44 0 0 1
Trading 2916 0.18 0.39 0 0 1
Success 2916 0.38 0.49 0 0 1
Whitepaper Measures
Tech_sup 1629 0 1.00 -0.96 -0.19 1.44
Tech_embed 1629 0 1.00 -1.02 -0.23 1.35
Tech_lda 1629 0 1.00 -0.62 -0.49 1.63
Tech_comp 1629 0 0.84 -0.81 -0.25 1.20
Fog Index 1629 16.7 12.6 13.2 15.7 18.5
Tone 1629 0.28 0.73 -0.58 0.29 1.10
Uncertainty 1629 0.75 0.39 0.35 0.67 1.25
ICO Characteristics
Has GitHub 2916 0.60 0.49 0 1 1
Has Twitter 2916 0.91 0.29 1 1 1
ICO Length 2683 50.7 45.8 14 32 100
Team Size 2916 11.0 7.05 3 10 20
Pre ICO 2916 0.51 0.50 0 1 1
Bonus 2916 0.20 0.40 0 0 1
Ethereum Based 2916 0.83 0.37 0 1 1
Accept BTC 2916 0.40 0.49 0 0 1
BTC Price (ICO) 2669 7.80 3.07 4.23 7.28 11.3
BTC Price (List) 710 7.59 3.70 2.73 7.03 13.5
ICO Price 1684 1.57 17.8 0.01 0.10 1
Panel B: Listed Sample
Obs. Mean SD p10 p50 p90
Trading Variables
First Open/ICO Price 413 2.20 4.66 0.16 0.97 3.79
Delist 765 0.10 0.30 0 0 1
Rate of Return
7 Days 741 0.19 0.83 -0.45 -0.04 1.03
30 Days 730 0.30 1.84 -0.71 -0.28 1.60
90 Days 686 0.65 3.11 -0.87 -0.43 3.21
180 Days 566 1.00 5.14 -0.95 -0.64 4.00
210 Days 530 0.84 4.27 -0.96 -0.68 3.57
240 Days 486 0.69 3.89 -0.96 -0.70 3.20
270 Days 438 1.46 8.40 -0.96 -0.72 3.69
300 Days 397 1.51 8.91 -0.97 -0.74 3.85
330 Days 356 1.34 8.21 -0.98 -0.72 3.31
360 Days 289 1.60 7.74 -0.96 -0.67 4.20
Trading Volume ($ MIL)
Listing Days 751 2.40 11.9 0.0023 0.12 3.90
7 Days 739 1.63 5.58 0.0015 0.083 3.60
30 Days 725 1.50 5.62 0.0011 0.066 2.53
90 Days 680 1.60 5.58 0.00045 0.11 3.22
180 Days 564 2.78 13.5 0.00039 0.069 3.99
210 Days 526 2.24 8.30 0.00020 0.065 3.90
240 Days 482 2.60 12.4 0.00023 0.048 3.31
270 Days 436 1.73 5.91 0.00016 0.061 3.09
300 Days 393 2.60 11.5 0.00025 0.058 3.50
330 Days 352 2.22 8.48 0.00021 0.067 3.19
360 Days 285 2.45 9.65 0.000048 0.084 3.39
Whitepaper Measures
Tech_sup 422 0.27 1.13 -0.92 0.0014 1.88
Tech_embed 422 0.41 1.17 -0.79 0.14 2.28
Tech_lda 422 0.33 1.23 -0.62 -0.26 2.51
Tech_comp 422 0.34 1.03 -0.73 0.033 1.87
Fog Index 422 17.2 18.9 13.3 15.5 18.3
Tone 422 0.20 0.72 -0.70 0.23 1.03
Uncertainty 422 0.79 0.40 0.35 0.71 1.30
ICO Characteristics
Has GitHub 765 0.70 0.46 0 1 1
Has Twitter 765 0.96 0.19 1 1 1
ICO Length 656 34.9 42.0 2 30 63
Team Size 765 12.1 8.02 3 11 22
Pre ICO 765 0.26 0.44 0 0 1
Bonus 765 0.075 0.26 0 0 0
Ethereum Based 765 0.80 0.40 0 1 1
Accept BTC 765 0.30 0.46 0 0 1
BTC Price (ICO) 642 7.45 3.94 2.54 7.10 13.8
BTC Price (List) 710 7.59 3.70 2.73 7.03 13.5
ICO Price 420 2.37 19.9 0.01 0.12 1.22
Table 2: Technology Indexes Determinant
This table presents the determinants of our tech index. The dependent variable is the composite tech index
(Tech_comp). Column (1) links the tech index to whether an ICO uses the Ethereum blockchain; column (2)
presents the relation between the tech index and GitHub commits (the number of code revisions); column
(3) considers other text-based measures of ICO whitepapers; column (4) presents estimates with ICO
characteristics; column (5) includes all variables. The reported t-statistics are based on robust standard errors.
***, **, and * indicate statistical significance at the 1%, 5%, and 10% levels respectively.
(1) (2) (3) (4) (5)
Ethereum Based -0.219*** -0.122**
(0.066) (0.061)
Ln(commits) 0.104*** 0.071***
(0.011) (0.011)
Has GitHub -0.115** -0.058
(0.050) (0.050)
Fog Index -0.003*** -0.002*
(0.001) (0.001)
Tone -0.265*** -0.226***
(0.031) (0.030)
Uncertainty -0.186*** -0.217***
(0.054) (0.050)
ICO Length -0.002*** -0.001***
(0.001) (0.000)
Team Size 0.015*** 0.012***
(0.003) (0.003)
Has Twitter 0.212** 0.156*
(0.085) (0.084)
BTC Price (ICO) -0.018** -0.010
(0.008) (0.008)
Pre ICO -0.040 -0.004
(0.044) (0.042)
Bonus -0.077* -0.059
(0.047) (0.045)
Accept BTC -0.095** -0.064
(0.043) (0.041)
Constant 0.184*** -0.156*** 0.260*** -0.094 0.127
(0.063) (0.031) (0.055) (0.112) (0.125)
R2 0.009 0.098 0.047 0.048 0.136
Observations 1629 1629 1629 1483 1483
Table 3: Technology Indexes
This table presents results related to the construction of tech indexes. Panel A shows the correlation between
the four tech indexes. Panel B compares various supervised machine learning methods with their out-of-
sample (OOS) R2 and corresponding hyperparameters.
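As a rough illustration of how the composite index and the Panel A correlations relate to the three component indexes, the sketch below standardizes each raw score and averages them. The z-score normalization with a population standard deviation is an assumption (the Appendix only says "normalized"), and the input numbers are invented for illustration.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize to mean 0 and unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def composite_index(sup, embed, lda):
    """Simple average of the three normalized indexes, per observation."""
    normalized = [zscore(sup), zscore(embed), zscore(lda)]
    return [mean(triple) for triple in zip(*normalized)]


def pearson(x, y):
    """Pearson correlation computed as the mean product of z-scores."""
    return mean(a * b for a, b in zip(zscore(x), zscore(y)))


# Hypothetical raw scores for three whitepapers.
raw_sup = [1.0, 2.0, 3.0]
raw_embed = [2.0, 4.0, 6.0]
raw_lda = [3.0, 2.0, 1.0]

tech_comp = composite_index(raw_sup, raw_embed, raw_lda)
corr_sup_embed = pearson(raw_sup, raw_embed)
corr_sup_lda = pearson(raw_sup, raw_lda)
```

Because each component has unit variance but the components are not perfectly correlated, the average has a standard deviation below one, which is consistent with the 0.84 reported for Tech_comp in Table 1.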
Table 4: ICO Success
This table examines the relationship between tech indexes and ICO success. The dependent variable is CMC
Trading in Panel A and Success in Panel B. For each tech index, the first column presents the univariate
result, and the second column displays estimates with control variables and fixed effects. The reported t-
statistics are based on robust standard errors. ***, **, and * indicate statistical significance at the 1%, 5%,
and 10% levels respectively.
Panel A: CMC Trading
(1) (2) (3) (4) (5) (6) (7) (8)
Supervised Embedding LDA Composite
Tech_sup 0.070*** 0.039***
(0.012) (0.012)
Tech_embed 0.107*** 0.048***
(0.011) (0.012)
Tech_lda 0.086*** 0.047***
(0.012) (0.013)
Tech_comp 0.124*** 0.066***
(0.013) (0.015)
ICO Length -0.001*** -0.001*** -0.001*** -0.001***
(0.000) (0.000) (0.000) (0.000)
Team Size 0.008*** 0.009*** 0.009*** 0.008***
(0.001) (0.001) (0.001) (0.001)
Has GitHub 0.053** 0.044** 0.048** 0.045**
(0.021) (0.021) (0.021) (0.021)
Has Twitter 0.163*** 0.172*** 0.168*** 0.166***
(0.035) (0.036) (0.035) (0.035)
BTC Price (ICO) 0.010 0.010 0.010 0.011
(0.007) (0.006) (0.006) (0.006)
Pre ICO -0.035 -0.031 -0.031 -0.032
(0.023) (0.023) (0.023) (0.023)
Bonus 0.008 0.010 0.007 0.009
(0.022) (0.022) (0.022) (0.022)
Accept BTC -0.013 -0.011 -0.010 -0.010
(0.021) (0.021) (0.021) (0.021)
Ethereum Based -0.017 -0.009 -0.012 -0.009
(0.030) (0.030) (0.030) (0.030)
Fog Index -0.000 0.000 -0.000 0.000
(0.001) (0.001) (0.001) (0.001)
Tone -0.000 0.003 0.004 0.006
(0.014) (0.014) (0.015) (0.015)
Uncertainty 0.014 0.031 0.023 0.025
(0.028) (0.028) (0.028) (0.028)
Constant 0.259*** 0.610*** 0.259*** 0.521*** 0.259*** 0.503*** 0.259*** 0.511***
(0.011) (0.094) (0.011) (0.099) (0.011) (0.099) (0.011) (0.097)
Fixed Effects No Yes No Yes No Yes No Yes
R2 0.026 0.322 0.060 0.324 0.038 0.323 0.057 0.327
Observations 1629 1382 1629 1382 1629 1382 1629 1382
Panel B: Capital Raised > 0
(1) (2) (3) (4) (5) (6) (7) (8)
Supervised Embedding LDA Composite
Tech_sup 0.061*** 0.041***
(0.012) (0.013)
Tech_embed 0.077*** 0.043***
(0.012) (0.014)
Tech_lda 0.056*** 0.037***
(0.012) (0.014)
Tech_comp 0.091*** 0.060***
(0.014) (0.016)
ICO Length -0.001*** -0.001*** -0.001*** -0.001***
(0.000) (0.000) (0.000) (0.000)
Team Size 0.008*** 0.009*** 0.009*** 0.008***
(0.002) (0.002) (0.002) (0.002)
Has GitHub 0.089*** 0.082*** 0.086*** 0.082***
(0.025) (0.025) (0.025) (0.025)
Has Twitter 0.083 0.092* 0.089* 0.086
(0.053) (0.053) (0.054) (0.053)
BTC Price (ICO) -0.005 -0.005 -0.005 -0.004
(0.007) (0.007) (0.007) (0.007)
Pre ICO -0.012 -0.008 -0.009 -0.010
(0.029) (0.028) (0.028) (0.028)
Bonus 0.103*** 0.104*** 0.102*** 0.104***
(0.031) (0.031) (0.031) (0.031)
Accept BTC 0.068*** 0.070*** 0.070*** 0.071***
(0.024) (0.024) (0.024) (0.024)
Ethereum Based -0.014 -0.007 -0.011 -0.007
(0.033) (0.034) (0.034) (0.033)
Fog Index -0.001 -0.000 -0.001 -0.000
(0.001) (0.001) (0.001) (0.001)
Tone 0.012 0.013 0.014 0.017
(0.018) (0.018) (0.018) (0.018)
Uncertainty 0.066** 0.081** 0.073** 0.076**
(0.033) (0.033) (0.033) (0.033)
Constant 0.377*** 0.633*** 0.377*** 0.557*** 0.377*** 0.554*** 0.377*** 0.545***
(0.012) (0.121) (0.012) (0.124) (0.012) (0.126) (0.012) (0.123)
Fixed Effects No Yes No Yes No Yes No Yes
R2 0.016 0.256 0.025 0.256 0.013 0.254 0.025 0.258
Observations 1629 1382 1629 1382 1629 1382 1629 1382
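The univariate columns of Table 4 regress a binary outcome on a single tech index, with robust standard errors. A minimal sketch of such a linear probability model follows. The HC0 (White) variance formula is an assumption, since the table notes do not say which robust estimator is used, and the data are invented.

```python
import math


def ols_univariate_robust(x, y):
    """OLS of y on a constant and x, with an HC0 robust SE for the slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # HC0: sum of (x_i - x_bar)^2 * e_i^2, divided by sxx squared.
    var_slope = sum((xi - x_bar) ** 2 * e ** 2
                    for xi, e in zip(x, residuals)) / sxx ** 2
    return intercept, slope, math.sqrt(var_slope)


# Hypothetical data: a standardized tech index and a 0/1 listing dummy.
tech_index = [-1.0, 0.0, 1.0, 2.0]
listed = [0, 0, 1, 1]

intercept, slope, robust_se = ols_univariate_robust(tech_index, listed)
t_stat = slope / robust_se
```

In a linear probability model the slope reads as the change in listing probability per one-standard-deviation increase in the index, which is how the Tech_* coefficients in Panels A and B can be interpreted.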
Table 5: ICO Success—Industry Subsample
This table examines the relationship between tech indexes and ICO success for different technology-related
industries. The dependent variable is CMC Trading. Industry is a dummy that equals 1 if the ICO belongs
to the “platform”, “cryptocurrency”, or “trading” industries. For each tech index, the first column presents
the univariate result, and the second column displays estimates with control variables and fixed effects. The
reported t-statistics are based on robust standard errors. ***, **, and * indicate statistical significance at the
1%, 5%, and 10% levels respectively.
Supervised Embedding LDA Composite
Tech_sup 0.052*** 0.030*
(0.017) (0.016)
Tech_sup*Industry 0.036 0.017
(0.023) (0.021)
Tech_embed 0.085*** 0.028*
(0.017) (0.016)
Tech_embed*Industry 0.040* 0.035*
(0.022) (0.021)
Tech_lda 0.053*** 0.012
(0.016) (0.015)
Tech_lda*Industry 0.069*** 0.049**
(0.023) (0.022)
Tech_comp 0.089*** 0.034*
(0.020) (0.019)
Tech_comp*Industry 0.067** 0.049*
(0.027) (0.026)
Industry -0.008 0.010 -0.009 0.011 0.003 0.016 -0.004 0.014
(0.021) (0.020) (0.021) (0.020) (0.021) (0.020) (0.021) (0.020)
ICO Length -0.001*** -0.001*** -0.001*** -0.001***
(0.000) (0.000) (0.000) (0.000)
Team Size 0.008*** 0.009*** 0.009*** 0.008***
(0.001) (0.001) (0.001) (0.001)
Has GitHub 0.053*** 0.044** 0.048** 0.045**
(0.021) (0.021) (0.021) (0.021)
Has Twitter 0.160*** 0.169*** 0.169*** 0.163***
(0.036) (0.036) (0.036) (0.035)
BTC Price (ICO) 0.010 0.010 0.010 0.010
(0.007) (0.006) (0.007) (0.006)
Pre ICO -0.036 -0.030 -0.029 -0.032
(0.023) (0.023) (0.023) (0.023)
Bonus 0.009 0.009 0.007 0.009
(0.021) (0.022) (0.022) (0.022)
Accept BTC -0.017 -0.013 -0.015 -0.013
(0.020) (0.020) (0.020) (0.020)
Ethereum Based -0.012 -0.002 -0.007 -0.004
(0.030) (0.030) (0.030) (0.030)
Fog Index -0.000 -0.000 -0.000 -0.000
(0.001) (0.002) (0.001) (0.001)
Tone 0.001 0.004 0.002 0.006
(0.014) (0.014) (0.014) (0.014)
Uncertainty 0.025 0.043 0.033 0.037
(0.028) (0.028) (0.028) (0.028)
Constant 0.263*** 0.579*** 0.264*** 0.468*** 0.259*** 0.447*** 0.262*** 0.455***
(0.016) (0.090) (0.016) (0.097) (0.016) (0.098) (0.016) (0.096)
Fixed effects No Yes No Yes No Yes No Yes
R2 0.028 0.309 0.062 0.313 0.044 0.310 0.061 0.314
Observations 1629 1382 1629 1382 1629 1382 1629 1382
Table 6: Rate of Return
This table presents the effects of tech indexes on cryptocurrency returns. The dependent variable is the log
transformation of gross return over a given period. Panels A, B, C and D display the supervised,
embedding-based, LDA-based and composite tech indexes respectively. Columns (1)-(6) display results for six horizons:
7 days, 30 days, 90 days, 180 days, 240 days and 300 days. We include control variables related to ICO
characteristics and whitepapers in all columns. Quarterly, categorical and geographical fixed effects are
considered under all circumstances. The reported t-statistics are based on robust standard errors. ***, **,
and * indicate statistical significance at the 1%, 5%, and 10% levels respectively.
Panel A: Supervised Index
(1) (2) (3) (4) (5) (6)
7 Days 30 Days 90 Days 180 Days 240 Days 300 Days
Tech_sup 0.013 0.036 -0.001 -0.031 0.019 -0.005
(0.031) (0.062) (0.090) (0.106) (0.120) (0.142)
ICO Length 0.000 0.000 -0.001 -0.003 -0.003 0.003
(0.001) (0.001) (0.001) (0.004) (0.005) (0.008)
Team Size -0.002 0.003 0.005 0.001 0.002 -0.001
(0.004) (0.007) (0.010) (0.013) (0.016) (0.018)
Has GitHub 0.033 0.017 0.178 0.282 0.351 0.601
(0.076) (0.135) (0.176) (0.239) (0.259) (0.363)
Has Twitter 0.066 0.028 0.025 -0.016 0.067 0.165
(0.148) (0.555) (0.711) (0.840) (0.755) (0.820)
BTC Price (ICO) -0.000 -0.030 -0.061** -0.098*** -0.084** -0.093**
(0.013) (0.019) (0.026) (0.029) (0.032) (0.044)
Pre ICO -0.012 -0.169 -0.023 0.114 -0.118 0.471
(0.076) (0.131) (0.176) (0.284) (0.356) (0.575)
Constant 0.114 -0.359 -1.298 -0.381 -0.400 -0.807
(0.654) (1.292) (0.896) (1.218) (1.203) (1.425)
Panel C: LDA Index
(1) (2) (3) (4) (5) (6)
7 Days 30 Days 90 Days 180 Days 240 Days 300 Days
Tech_lda 0.045 0.066 0.124 0.136 0.241** 0.238*
(0.034) (0.056) (0.076) (0.086) (0.098) (0.133)
ICO Length 0.000 0.000 -0.001 -0.003 -0.003 0.003
(0.001) (0.001) (0.001) (0.004) (0.005) (0.008)
Team Size -0.002 0.003 0.004 -0.003 -0.001 -0.003
(0.004) (0.007) (0.010) (0.013) (0.016) (0.018)
Has GitHub 0.027 0.013 0.154 0.247 0.269 0.473
(0.077) (0.136) (0.176) (0.241) (0.259) (0.376)
Has Twitter 0.063 0.019 0.022 -0.026 0.057 0.146
(0.153) (0.587) (0.740) (0.862) (0.818) (0.854)
BTC Price (ICO) -0.000 -0.029 -0.059** -0.096*** -0.082** -0.090**
(0.012) (0.019) (0.026) (0.029) (0.033) (0.044)
Pre ICO 0.004 -0.146 0.020 0.203 -0.007 0.534
(0.078) (0.134) (0.178) (0.291) (0.354) (0.577)
Constant -0.033 -0.555 -1.713* -0.830 -1.077 -1.429
(0.665) (1.322) (0.938) (1.225) (1.179) (1.384)
Table 7: Bitcoin-Adjusted Rate of Return
This table presents the effects of tech indexes on Bitcoin-adjusted returns. The dependent variable is the log
transformation of gross return, log(1+ROR), minus the log transformation of Bitcoin gross return over the
same period. Panels A, B, C and D display the supervised, embedding-based, LDA-based and composite tech
indexes respectively. Columns (1)-(6) display results for six different horizons: 7 days, 30 days, 90 days, 180
days, 240 days and 300 days. We include control variables related to ICO characteristics and whitepapers
in all columns. Quarterly, categorical and geographical fixed effects are considered under all circumstances.
The reported t-statistics are based on robust standard errors. ***, **, and * indicate statistical significance
at the 1%, 5%, and 10% levels respectively.
Panel A: Supervised Index
(1) (2) (3) (4) (5) (6)
7 Days 30 Days 90 Days 180 Days 240 Days 300 Days
Tech_sup 0.012 0.037 0.011 0.061 0.091 0.058
(0.029) (0.057) (0.085) (0.095) (0.111) (0.133)
ICO Length 0.000 0.000 -0.001 -0.002 -0.002 0.004
(0.001) (0.001) (0.001) (0.003) (0.004) (0.007)
Team Size -0.001 0.002 0.008 0.004 0.006 -0.000
(0.004) (0.006) (0.009) (0.011) (0.014) (0.017)
Has GitHub 0.053 -0.034 0.111 0.221 0.249 0.424
(0.068) (0.126) (0.162) (0.199) (0.241) (0.331)
Has Twitter 0.191 -0.003 0.145 0.527 0.348 0.342
(0.178) (0.650) (0.912) (0.881) (0.788) (0.864)
BTC Price (ICO) 0.009 -0.009 -0.034 -0.079*** -0.061* -0.060
(0.012) (0.018) (0.023) (0.027) (0.031) (0.045)
Pre ICO -0.017 -0.166 -0.005 0.062 -0.075 0.333
(0.071) (0.113) (0.171) (0.240) (0.310) (0.492)
Constant -0.056 -0.500 -1.873* -2.879** -3.431*** -2.894**
(0.620) (1.206) (1.031) (1.227) (1.135) (1.366)
Panel C: LDA Index
(1) (2) (3) (4) (5) (6)
7 Days 30 Days 90 Days 180 Days 240 Days 300 Days
Tech_lda 0.040 0.066 0.119 0.136* 0.226** 0.290**
(0.033) (0.056) (0.073) (0.077) (0.094) (0.126)
ICO Length 0.000 0.000 -0.001 -0.003 -0.002 0.004
(0.001) (0.001) (0.001) (0.003) (0.004) (0.006)
Team Size -0.001 0.002 0.007 0.004 0.005 0.000
(0.004) (0.006) (0.010) (0.011) (0.014) (0.016)
Has GitHub 0.048 -0.039 0.090 0.199 0.175 0.274
(0.068) (0.127) (0.162) (0.199) (0.244) (0.342)
Has Twitter 0.189 -0.015 0.143 0.511 0.357 0.420
(0.179) (0.692) (0.958) (0.955) (0.895) (0.928)
BTC Price (ICO) 0.009 -0.008 -0.034 -0.078*** -0.059* -0.054
(0.012) (0.018) (0.024) (0.027) (0.031) (0.046)
Pre ICO -0.003 -0.143 0.041 0.138 0.020 0.360
(0.073) (0.116) (0.174) (0.245) (0.312) (0.492)
Constant -0.188 -0.693 -2.278** -3.331*** -4.117*** -3.765***
(0.630) (1.244) (1.082) (1.271) (1.166) (1.352)
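The dependent variables in Tables 6 and 7 can be sketched as follows, using the Appendix definition of the rate of return (buy at the first-day opening price, sell after the holding period). Which price is used on the selling day is not stated in this section, so `price_after` is a hypothetical placeholder, and the prices are invented.

```python
import math


def rate_of_return(first_open: float, price_after: float) -> float:
    """Simple return from the first listing day's open to a later price."""
    return price_after / first_open - 1.0


def log_gross_return(ror: float) -> float:
    """log(1 + ROR), the dependent variable of Table 6."""
    return math.log(1.0 + ror)


def btc_adjusted_log_return(token_ror: float, btc_ror: float) -> float:
    """log(1 + ROR) minus the Bitcoin log gross return over the same period."""
    return log_gross_return(token_ror) - log_gross_return(btc_ror)


# Hypothetical example: a token halves while Bitcoin falls by 20 percent.
token_ror = rate_of_return(first_open=2.0, price_after=1.0)
btc_ror = rate_of_return(first_open=10.0, price_after=8.0)
adjusted = btc_adjusted_log_return(token_ror, btc_ror)
```

In this example the adjusted return equals log(0.5 / 0.8), so the token underperforms Bitcoin even after netting out the market-wide decline.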
Table 8: Trading Volume
This table presents the relationship between tech indexes and cryptocurrency liquidity. The dependent vari-
able is the log transformation of 24-hour trading volume in USD. Column (1) displays results on the listing
day. Columns (2) to (7) display results for six time points: 7 days, 30 days, 90 days, 180 days, 240 days and
300 days. We include control variables in all columns. Quarterly, categorical and geographical fixed effects
are considered under all circumstances. The reported t-statistics are based on robust standard errors. ***,
**, and * indicate statistical significance at the 1%, 5%, and 10% levels respectively.
Panel A: Supervised Index
(1) (2) (3) (4) (5) (6) (7)
Listing 7 Days 30 Days 90 Days 180 Days 240 Days 300 Days
Tech_sup 0.334** 0.333** 0.346* 0.501** 0.491* 0.464* 0.344
(0.165) (0.166) (0.198) (0.204) (0.253) (0.251) (0.286)
ICO Length -0.005 -0.006 -0.004 -0.006 -0.006 -0.001 -0.004
(0.005) (0.006) (0.005) (0.008) (0.010) (0.009) (0.018)
Team Size 0.014 0.017 0.027 0.022 0.074*** 0.047 0.083**
(0.021) (0.018) (0.020) (0.020) (0.027) (0.029) (0.036)
Has GitHub 0.263 0.510 0.444 0.805 0.889* 1.272** 1.580**
(0.361) (0.386) (0.462) (0.512) (0.536) (0.575) (0.738)
Has Twitter -0.487 -0.046 -0.676 -1.101 0.697 -0.431 -1.184
(1.162) (1.073) (1.648) (1.614) (2.080) (1.508) (1.557)
BTC Price (ICO) 0.073 0.096 -0.039 0.012 -0.049 -0.053 0.036
(0.069) (0.065) (0.072) (0.072) (0.087) (0.087) (0.108)
Pre ICO -0.399 -0.487 -0.453 -1.005* -1.335* -2.061** -0.553
(0.411) (0.474) (0.501) (0.575) (0.749) (0.836) (1.238)
Constant 11.971*** 14.015*** 11.312*** 9.960*** 5.065 6.954*** 9.246***
(1.650) (1.722) (2.339) (2.125) (3.157) (2.558) (2.992)
Panel C: LDA Index
(1) (2) (3) (4) (5) (6) (7)
Listing 7 Days 30 Days 90 Days 180 Days 240 Days 300 Days
Tech_lda 0.497*** 0.429*** 0.578*** 0.725*** 0.503** 0.621*** 0.600*
(0.148) (0.159) (0.170) (0.171) (0.209) (0.224) (0.306)
ICO Length -0.004 -0.006 -0.003 -0.006 -0.006 -0.001 -0.003
(0.004) (0.005) (0.005) (0.007) (0.009) (0.008) (0.017)
Team Size 0.018 0.022 0.030 0.029 0.083*** 0.055** 0.092**
(0.019) (0.018) (0.020) (0.020) (0.026) (0.026) (0.037)
Has GitHub 0.223 0.487 0.415 0.738 0.854 1.123* 1.310*
(0.351) (0.380) (0.448) (0.498) (0.545) (0.570) (0.772)
Has Twitter -0.552 -0.117 -0.762 -1.198 0.583 -0.519 -1.256
(1.274) (1.236) (1.899) (1.902) (2.229) (1.674) (1.518)
BTC Price (ICO) 0.075 0.098 -0.032 0.014 -0.046 -0.045 0.048
(0.069) (0.065) (0.071) (0.073) (0.088) (0.089) (0.108)
Pre ICO -0.235 -0.339 -0.247 -0.752 -1.096 -1.826** -0.485
(0.412) (0.482) (0.499) (0.574) (0.752) (0.815) (1.218)
Constant 10.524*** 12.772*** 9.618*** 7.839*** 3.503 5.193* 7.778**
(1.805) (1.902) (2.573) (2.330) (3.302) (2.664) (3.003)
Table 9: Delisting Probability
This table presents OLS estimates of the relationship between tech indexes and ICO delisting probabilities.
The dependent variable is Delist, a dummy variable that equals 1 if a token was shown as “inactive” on
coinmarketcap.com by the end of 2018. For each tech index, the first column presents the univariate result, and
the second column displays estimates with control variables and fixed effects. The reported t-statistics are
based on robust standard errors. ***, **, and * indicate statistical significance at the 1%, 5%, and 10%
levels respectively.
(1) (2) (3) (4) (5) (6) (7) (8)
Supervised Embedding LDA Composite
Tech_sup -0.028*** -0.043***
(0.008) (0.015)
Tech_embed -0.030*** -0.025*
(0.010) (0.014)
Tech_lda -0.013 -0.016
(0.008) (0.012)
Tech_comp -0.030*** -0.037**
(0.009) (0.015)
ICO Length -0.000 -0.000 -0.000 -0.000
(0.000) (0.000) (0.000) (0.000)
Team Size 0.002 0.001 0.001 0.002
(0.002) (0.002) (0.002) (0.002)
Has GitHub -0.052 -0.050 -0.055 -0.050
(0.040) (0.039) (0.040) (0.040)
Has Twitter -0.203 -0.187 -0.189 -0.191
(0.205) (0.203) (0.207) (0.203)
BTC Price (ICO) -0.011** -0.012** -0.011** -0.011**
(0.005) (0.005) (0.005) (0.005)
Pre ICO 0.018 0.008 0.014 0.009
(0.042) (0.045) (0.043) (0.043)
Bonus 0.013 0.021 0.023 0.019
(0.064) (0.065) (0.065) (0.065)
Accept BTC -0.003 -0.002 -0.005 -0.004
(0.034) (0.034) (0.034) (0.034)
Ethereum Based -0.021 -0.027 -0.027 -0.030
(0.053) (0.052) (0.054) (0.053)
Fog Index -0.001** -0.001** -0.001* -0.001**
(0.001) (0.001) (0.001) (0.001)
Tone -0.025 -0.018 -0.017 -0.022
(0.021) (0.021) (0.021) (0.021)
Uncertainty 0.010 0.006 0.012 0.006
(0.051) (0.051) (0.052) (0.052)
Constant 0.079*** 0.693** 0.083*** 0.684** 0.075*** 0.692** 0.081*** 0.718**
(0.014) (0.336) (0.015) (0.332) (0.013) (0.340) (0.014) (0.334)
Fixed Effects No Yes No Yes No Yes No Yes
R2 0.015 0.166 0.018 0.152 0.004 0.148 0.015 0.157
Observations 422 329 422 329 422 329 422 329
Table 10: Comparison with Other Measures
This table compares our tech index with other technology measures: GitHub commits and simple word
counts. Ln(commits) is the logarithm of the number of code revisions on GitHub. Simple_word_count
measures the percentage of words in a whitepaper that belong to a self-defined technology word list. The
complete word list can be found in Table OA.3. The dependent variable is CMC Trading. We include
control variables related to ICO characteristics and whitepapers in all columns. Quarterly, categorical and
geographical fixed effects are considered under all circumstances. The reported t-statistics are based on
robust standard errors. ***, **, and * indicate statistical significance at the 1%, 5%, and 10% levels respec-
tively.
(1) (2) (3) (4) (5) (6)
Tech_comp 0.066*** 0.059*** 0.066*** 0.059***
(0.015) (0.015) (0.017) (0.017)
Ln(commits) 0.017*** 0.013*** 0.013***
(0.005) (0.005) (0.005)
Simple_word_count 0.024** 0.001 0.001
(0.011) (0.012) (0.012)
Control Variables Yes Yes Yes Yes Yes Yes
Fixed Effects Yes Yes Yes Yes Yes Yes
R2 0.327 0.322 0.318 0.331 0.327 0.331
Observations 1382 1382 1382 1382 1382 1382
Table 11: Long Horizon Performance—Subsample on Readability
This table examines the long horizon performance of cryptocurrencies for different readability subsamples.
The dependent variable is rate of return in panel A and Bitcoin-adjusted return in panel B. Easy is a dummy
that equals 1 if the whitepaper has a below-median Fog Index. We include control variables related
to ICO characteristics and whitepapers in all columns. Quarterly, categorical and geographical fixed effects
are considered under all circumstances. The reported t-statistics are based on robust standard errors. ***,
**, and * indicate statistical significance at the 1%, 5%, and 10% levels respectively.
Panel A: Rate of Return
(1) (2) (3) (4) (5) (6)
7 Days 30 Days 90 Days 180 Days 240 Days 300 Days
Tech_sup 0.015 0.065 0.068 0.202 0.323 0.489*
(0.049) (0.093) (0.127) (0.181) (0.201) (0.271)
Tech_sup*Easy -0.008 -0.069 -0.143 -0.335* -0.445* -0.645**
(0.060) (0.115) (0.175) (0.195) (0.236) (0.314)
Easy 0.038 0.017 0.020 0.026 -0.127 -0.069
(0.069) (0.122) (0.168) (0.222) (0.255) (0.328)
Controls Yes Yes Yes Yes Yes Yes
Fixed effects Yes Yes Yes Yes Yes Yes
R2 0.076 0.147 0.225 0.357 0.412 0.386
Observations 316 310 286 218 184 140
7 Days 30 Days 90 Days 180 Days 240 Days 300 Days
Tech_embed 0.024 0.107 0.174 0.402** 0.673*** 0.913***
(0.052) (0.092) (0.131) (0.186) (0.223) (0.202)
Tech_embed*Easy 0.008 -0.082 -0.133 -0.300 -0.479** -0.698***
(0.061) (0.105) (0.155) (0.206) (0.231) (0.221)
Easy 0.029 0.015 0.000 -0.018 -0.132 -0.110
(0.071) (0.120) (0.168) (0.226) (0.249) (0.328)
Controls Yes Yes Yes Yes Yes Yes
FEs Yes Yes Yes Yes Yes Yes
R2 0.078 0.151 0.230 0.371 0.456 0.448
Observations 316 310 286 218 184 140
7 Days 30 Days 90 Days 180 Days 240 Days 300 Days
Tech_lda 0.017 0.060 0.193 0.437*** 0.713*** 0.946***
(0.052) (0.086) (0.122) (0.167) (0.196) (0.231)
Tech_lda*Easy 0.040 -0.009 -0.136 -0.415** -0.632*** -0.865***
(0.062) (0.109) (0.150) (0.186) (0.220) (0.248)
Easy 0.022 -0.008 -0.021 -0.030 -0.195 -0.150
(0.068) (0.119) (0.166) (0.225) (0.253) (0.339)
Controls Yes Yes Yes Yes Yes Yes
FEs Yes Yes Yes Yes Yes Yes
R2 0.082 0.148 0.231 0.370 0.450 0.437
Observations 316 310 286 218 184 140
7 Days 30 Days 90 Days 180 Days 240 Days 300 Days
Tech_comp 0.027 0.111 0.206 0.517** 0.886*** 1.217***
(0.062) (0.108) (0.156) (0.227) (0.266) (0.273)
Tech_comp*Easy 0.015 -0.078 -0.192 -0.511** -0.814*** -1.160***
(0.072) (0.126) (0.191) (0.238) (0.281) (0.286)
Easy 0.027 0.009 0.007 0.024 -0.091 -0.034
(0.070) (0.121) (0.168) (0.225) (0.250) (0.325)
Controls Yes Yes Yes Yes Yes Yes
FEs Yes Yes Yes Yes Yes Yes
R2 0.079 0.149 0.229 0.370 0.452 0.445
Observations 316 310 286 218 184 140
Panel B: Adjusted Rate of Returns
(1) (2) (3) (4) (5) (6)
7 Days 30 Days 90 Days 180 Days 240 Days 300 Days
Tech_sup 0.005 0.026 0.025 0.256* 0.383** 0.508**
(0.048) (0.086) (0.121) (0.154) (0.179) (0.249)
Tech_sup*Easy 0.008 0.001 -0.046 -0.279 -0.427* -0.589**
(0.058) (0.107) (0.163) (0.173) (0.217) (0.286)
Easy 0.005 -0.024 0.000 0.050 -0.056 0.028
(0.066) (0.115) (0.155) (0.198) (0.236) (0.302)
Controls Yes Yes Yes Yes Yes Yes
FEs Yes Yes Yes Yes Yes Yes
R2 0.096 0.139 0.155 0.319 0.302 0.284
Observations 311 305 281 213 180 137
7 Days 30 Days 90 Days 180 Days 240 Days 300 Days
Tech_embed 0.027 0.101 0.159 0.342** 0.605*** 0.870***
(0.050) (0.088) (0.121) (0.158) (0.190) (0.188)
Tech_embed*Easy 0.000 -0.038 -0.101 -0.173 -0.399* -0.600***
(0.059) (0.102) (0.145) (0.174) (0.202) (0.200)
Easy 0.002 -0.029 -0.006 -0.016 -0.072 -0.035
(0.068) (0.113) (0.154) (0.200) (0.234) (0.304)
Controls Yes Yes Yes Yes Yes Yes
FEs Yes Yes Yes Yes Yes Yes
R2 0.098 0.146 0.164 0.338 0.349 0.366
Observations 311 305 281 213 180 137
7 Days 30 Days 90 Days 180 Days 240 Days 300 Days
Tech_lda 0.007 0.032 0.153 0.337** 0.576*** 0.913***
(0.052) (0.085) (0.115) (0.142) (0.178) (0.210)
Tech_lda*Easy 0.049 0.037 -0.075 -0.276* -0.465** -0.754***
(0.062) (0.107) (0.140) (0.161) (0.200) (0.220)
Easy -0.008 -0.041 -0.027 0.001 -0.126 -0.071
(0.065) (0.112) (0.152) (0.202) (0.235) (0.311)
Controls Yes Yes Yes Yes Yes Yes
FEs Yes Yes Yes Yes Yes Yes
R2 0.102 0.142 0.164 0.328 0.329 0.348
Observations 311 305 281 213 180 137
7 Days 30 Days 90 Days 180 Days 240 Days 300 Days
Tech_comp 0.020 0.079 0.163 0.463** 0.819*** 1.193***
(0.061) (0.105) (0.149) (0.190) (0.230) (0.253)
Tech_comp*Easy 0.023 -0.009 -0.110 -0.364* -0.692*** -1.033***
(0.070) (0.122) (0.179) (0.202) (0.249) (0.257)
Easy -0.003 -0.034 -0.009 0.029 -0.032 0.034
(0.067) (0.114) (0.155) (0.201) (0.233) (0.299)
Controls Yes Yes Yes Yes Yes Yes
FEs Yes Yes Yes Yes Yes Yes
R2 0.098 0.143 0.161 0.335 0.347 0.362
Observations 311 305 281 213 180 137
Table 12: Is There Return Reversal?
This table presents the effects of tech indexes on long term returns of cryptocurrencies. The dependent
variable is $\log(1 + ROR_{180 \to j})$, the gross return from 180 listing days onward. Panel A displays results on
rate of returns and panel B shows Bitcoin-adjusted rate of returns. Columns (1)-(6) display results for six
horizons from the listing day: 210 days, 240 days, 270 days, 300 days, 330 days and 360 days. We include
control variables related to ICO characteristics and whitepapers in all columns. Quarterly, categorical and
geographical fixed effects are considered under all circumstances. The reported t-statistics are based on
robust standard errors. ***, **, and * indicate statistical significance at the 1%, 5%, and 10% levels respec-
tively.
Panel A: Rate of Returns
(1) (2) (3) (4) (5) (6)
210 Days 240 Days 270 Days 300 Days 330 Days 360 Days
Tech_sup 0.052 0.047 0.081* 0.034 0.061 0.035
(0.036) (0.041) (0.043) (0.045) (0.041) (0.082)
R2 0.297 0.306 0.404 0.553 0.481 0.495
Tech_embed 0.048 0.086** 0.080* 0.089* 0.070 0.008
(0.035) (0.038) (0.041) (0.048) (0.059) (0.077)
R2 0.296 0.319 0.403 0.563 0.482 0.493
Tech_lda 0.051 0.065* 0.012 0.032 -0.049 -0.115
(0.034) (0.039) (0.041) (0.050) (0.063) (0.071)
R2 0.298 0.312 0.390 0.553 0.479 0.512
Tech_comp 0.071* 0.093** 0.082* 0.073 0.040 -0.037
(0.039) (0.046) (0.046) (0.056) (0.063) (0.085)
R2 0.301 0.316 0.400 0.557 0.477 0.495
Controls &FEs Yes Yes Yes Yes Yes Yes
Observations 207 184 156 140 132 103
Panel B: Bitcoin-Adjusted Rate of Returns
210 Days 240 Days 270 Days 300 Days 330 Days 360 Days
Tech_sup 0.037 0.021 0.039 -0.001 0.031 0.024
(0.030) (0.038) (0.041) (0.039) (0.041) (0.062)
R2 0.287 0.343 0.419 0.529 0.448 0.564
Tech_embed 0.045 0.037 0.076* 0.069 0.042 0.028
(0.030) (0.035) (0.042) (0.046) (0.059) (0.071)
R2 0.291 0.346 0.429 0.538 0.449 0.565
Tech_lda 0.047 0.017 0.014 0.025 -0.064 -0.082
(0.030) (0.033) (0.043) (0.045) (0.059) (0.064)
R2 0.292 0.342 0.416 0.530 0.453 0.574
Tech_comp 0.060* 0.035 0.061 0.044 0.005 -0.016
(0.034) (0.041) (0.048) (0.051) (0.062) (0.076)
R2 0.293 0.344 0.421 0.531 0.447 0.564
Controls &FEs Yes Yes Yes Yes Yes Yes
Observations 202 180 153 137 129 101
Table 13: ICO First-Day Price
This table presents OLS estimates of the relationship between tech indexes and ICO first-day price. The
dependent variable is the log transformation of the ratio between the first day’s opening price and ICO price.
For each tech index, the first column presents univariate result, and the second column displays estimates
with control variables and fixed effects. The reported t-statistics are based on robust standard errors. ***,
**, and * indicate statistical significance at 1%, 5%, and 10% respectively.
Ln(First Opening Price/ICO Price)
(1) (2) (3) (4) (5) (6) (7) (8)
Supervised Embedding LDA Composite
Tech_sup 0.337*** 0.299***
(0.090) (0.109)
Tech_embed 0.414*** 0.368***
(0.076) (0.098)
Tech_lda 0.310*** 0.287***
(0.074) (0.091)
Tech_comp 0.458*** 0.417***
(0.092) (0.117)
ICO Length -0.002 -0.001 -0.001 -0.001
(0.002) (0.002) (0.002) (0.002)
Team Size -0.001 0.007 0.008 0.003
(0.014) (0.012) (0.011) (0.012)
Has GitHub -0.059 -0.128 -0.046 -0.100
(0.263) (0.253) (0.254) (0.254)
Has Twitter -0.074 -0.162 -0.284 -0.151
(0.368) (0.392) (0.431) (0.376)
BTC Price (ICO) -0.012 0.010 -0.008 -0.002
(0.048) (0.048) (0.049) (0.047)
Pre ICO -0.363 -0.205 -0.278 -0.266
(0.341) (0.348) (0.352) (0.346)
Bonus 0.152 0.128 0.137 0.158
(0.402) (0.407) (0.414) (0.404)
Accept BTC -0.014 -0.049 0.014 -0.015
(0.236) (0.229) (0.236) (0.232)
Ethereum Based -0.299 -0.206 -0.136 -0.183
(0.385) (0.367) (0.384) (0.371)
Fog Index -0.007 -0.005 -0.007 -0.005
(0.007) (0.007) (0.007) (0.007)
Tone 0.103 0.105 0.102 0.133
(0.146) (0.147) (0.148) (0.145)
Uncertainty -0.036 0.121 0.021 0.070
(0.318) (0.328) (0.327) (0.325)
Constant -0.295*** -2.998*** -0.404*** -3.392*** -0.324*** -3.578*** -0.380*** -3.552***
(0.105) (0.923) (0.111) (0.916) (0.105) (0.974) (0.110) (0.924)
Fixed Effects No Yes No Yes No Yes No Yes
R2 0.064 0.305 0.111 0.328 0.066 0.306 0.102 0.325
Observations 238 199 238 199 238 199 238 199
Table 14: Robustness Tests
This table displays several robustness tests. Panel A redoes Table 4 using Trading as the dependent variable.
Panel B is the Logit regression version of Table 4. Besides, to mitigate the concern of survivorship bias, we
impute -99% to returns of delisted cryptocurrencies and redo table 6 and table 7. Results are presented in
Panel C and panel D. The reported t-statistics are based on robust standard errors. ***, **, and * indicate
statistical significance at the 1%, 5%, and 10% levels respectively.
Panel A: Trading
(1) (2) (3) (4) (5) (6) (7) (8)
Supervised Embedding LDA Composite
Tech_sup 0.058*** 0.034***
(0.011) (0.010)
Tech_embed 0.101*** 0.047***
(0.010) (0.010)
Tech_lda 0.082*** 0.051***
(0.011) (0.011)
Tech_comp 0.114*** 0.065***
(0.013) (0.013)
ICO Length -0.001*** -0.001*** -0.001*** -0.001***
(0.000) (0.000) (0.000) (0.000)
Team Size 0.004*** 0.005*** 0.005*** 0.004***
(0.001) (0.001) (0.001) (0.001)
Has GitHub 0.004 -0.005 -0.002 -0.005
(0.016) (0.016) (0.016) (0.016)
Has Twitter 0.102*** 0.110*** 0.106*** 0.104***
(0.028) (0.028) (0.029) (0.028)
BTC Price (ICO) 0.007 0.007 0.007 0.007
(0.006) (0.006) (0.006) (0.006)
Pre ICO -0.029* -0.025 -0.025 -0.027
(0.017) (0.017) (0.017) (0.017)
Bonus 0.021 0.022* 0.020 0.022*
(0.013) (0.013) (0.013) (0.012)
Accept BTC 0.013 0.016 0.017 0.017
(0.015) (0.015) (0.015) (0.015)
Ethereum Based 0.007 0.015 0.013 0.015
(0.023) (0.023) (0.022) (0.022)
Fog Index -0.000 -0.000 -0.000 -0.000
(0.001) (0.001) (0.001) (0.001)
Tone -0.005 -0.001 0.001 0.002
(0.011) (0.011) (0.011) (0.011)
Uncertainty 0.011 0.027 0.020 0.022
(0.021) (0.021) (0.021) (0.021)
Constant 0.167*** 0.767*** 0.167*** 0.678*** 0.167*** 0.649*** 0.167*** 0.668***
(0.009) (0.065) (0.009) (0.068) (0.009) (0.069) (0.009) (0.067)
Fixed Effects No Yes No Yes No Yes No Yes
R2 0.024 0.383 0.074 0.389 0.049 0.390 0.066 0.393
Observations 1629 1382 1629 1382 1629 1382 1629 1382
Panel B: Logit regression
(1) (2) (3) (4) (5) (6) (7) (8)
Supervised Embedding LDA Composite
Tech_sup 0.343*** 0.287***
(0.054) (0.094)
Tech_embed 0.527*** 0.345***
(0.055) (0.085)
Tech_lda 0.398*** 0.293***
(0.052) (0.089)
Tech_comp 0.596*** 0.443***
(0.065) (0.108)
ICO Length -0.012*** -0.012*** -0.012*** -0.012***
(0.003) (0.003) (0.003) (0.003)
Team Size 0.064*** 0.069*** 0.069*** 0.066***
(0.012) (0.012) (0.012) (0.012)
Has GitHub 0.426** 0.374** 0.402** 0.381**
(0.175) (0.175) (0.175) (0.175)
Has Twitter 1.463*** 1.491*** 1.461*** 1.459***
(0.417) (0.432) (0.418) (0.425)
BTC Price (ICO) 0.051 0.055 0.051 0.055
(0.034) (0.033) (0.034) (0.034)
Pre ICO -0.263 -0.211 -0.236 -0.229
(0.186) (0.186) (0.187) (0.187)
Bonus 0.122 0.129 0.108 0.126
(0.245) (0.245) (0.246) (0.247)
Accept BTC -0.186 -0.179 -0.167 -0.164
(0.174) (0.174) (0.174) (0.175)
Ethereum Based -0.148 -0.094 -0.097 -0.082
(0.255) (0.257) (0.259) (0.261)
Fog Index -0.000 0.000 -0.000 0.001
(0.008) (0.008) (0.008) (0.008)
Tone -0.007 0.023 0.014 0.036
(0.118) (0.118) (0.120) (0.120)
Uncertainty 0.059 0.220 0.137 0.152
(0.239) (0.237) (0.233) (0.239)
Constant -1.077*** -6.461*** -1.108*** -6.693*** -1.083*** -6.538*** -1.101*** -6.593***
(0.058) (0.986) (0.059) (1.010) (0.058) (0.987) (0.059) (0.999)
Fixed Effects No Yes No Yes No Yes No Yes
Pseudo R2 0.0214 0.322 0.050 0.325 0.031 0.322 0.046 0.326
Observations 1629 1351 1629 1351 1629 1351 1629 1351
Panel C: Rate of return (-99% return for delisted coins)
(1) (2) (3) (4) (5) (6)
7 Days 30 Days 90 Days 180 Days 240 Days 300 Days
Tech_sup 0.044 0.081 0.037 0.006 0.094 0.071
(0.049) (0.070) (0.093) (0.104) (0.117) (0.132)
Constant -1.905 -2.869* -2.378** -0.430 0.029 0.028
(1.246) (1.490) (1.027) (1.162) (1.258) (1.440)
Controls & FEs Yes Yes Yes Yes Yes Yes
R2 0.152 0.204 0.256 0.373 0.438 0.428
Observations 323 319 293 228 198 157
Panel D: Bitcoin-adjusted rate of return (-99% return for delisted coins)
(1) (2) (3) (4) (5) (6)
7 Days 30 Days 90 Days 180 Days 240 Days 300 Days
Tech_sup 0.011 0.050 0.019 0.077 0.152 0.096
(0.029) (0.058) (0.085) (0.095) (0.113) (0.130)
Constant -0.315 -1.844 -1.716* -2.876** -3.648*** -2.607*
(0.658) (1.578) (1.014) (1.265) (1.348) (1.466)
Controls & FEs Yes Yes Yes Yes Yes Yes
R2 0.091 0.156 0.163 0.311 0.302 0.287
Observations 312 308 282 217 186 144
Online Appendix: Technical Notes on Measure Construction
1.1 Basics
Supervised learning is “the machine learning task of learning a function that maps an input to
an output based on example input-output pairs.” (Russell and Norvig, 2010). Mathematically, this
can be expressed as estimating a function f(·) given input variables (X) and output variables (Y),
such that the mapping function Y=f(X) is satisfied as much as possible. Various machine learning
models impose different constraints on the function, resulting in different optimization results.
We use the simplest regression model, ordinary least squares (OLS), as our benchmark. The
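objective function of OLS is:
$$\min_{\beta} \frac{1}{N} \lVert y - X\beta \rVert_2^2 ,$$
where $y$ is the vector of outputs, $X$ is the matrix of predictors, $\beta$ is the coefficient vector and
$N$ is the number of observations.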
OLS works well when there are only a few predictors, but its performance deteriorates significantly
as the dimension of predictors increases. Unfortunately, high dimensionality and sparseness are
both common features of text data. Hence, we apply more advanced machine learning methods to
avoid the "curse of dimensionality".
The first set of methods we use are penalized linear approaches. The idea is to add a penalty
term to the objective function to reduce a model’s fit to noise and hence enhance prediction
accuracy. We consider LASSO, ridge regression and elastic net for this approach. Another common
approach to deal with high-dimensional data is dimension reduction. While penalized linear
methods select a subset of predictors that have strong predictive power, dimension reduction methods
combine predictors into several main components while retaining as much information as possible.
We apply principal component regression (PCR) and partial least squares (PLS) in this vein. All
the methods above are linear regression models, but we are also interested in using non-linear
approaches to get better prediction accuracy. We consider decision tree methods (random forest and
gradient boosting) and neural network algorithms. Next, we briefly introduce each of the machine
learning methods that we consider as candidates to construct our supervised tech index.
1.1.1 LASSO
LASSO (least absolute shrinkage and selection operator) is a common approach employed to
deal with high-dimensional sparse data. The objective function for LASSO is:
$$\min_{\beta} \left\{ \frac{1}{N} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \right\}.$$
The first term is the same as OLS, while the second term is a penalty on non-zero coefficients,
with $\lambda$ representing the regularization strength. The effect of LASSO is to select only a subset of
predictors by pushing other predictor coefficients to 0.
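To make the selection effect concrete, the following is a minimal sketch of cyclic coordinate descent for the LASSO objective $\frac{1}{N}\lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1$. It is an illustration only, not the estimation code used in the paper; the function names `soft_threshold` and `lasso_coordinate_descent` are hypothetical, and predictors are assumed to be passed as plain lists of rows.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; return exactly 0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/N) * ||y - X b||^2 + lam * ||b||_1 one coefficient at a time."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # (1/N) * sum_i x_ij * (partial residual without predictor j)
            z = 0.0    # (1/N) * sum_i x_ij^2
            for i in range(n):
                pred_wo_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_wo_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            # The L1 penalty enters as a threshold of lam / 2 on rho
            beta[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
    return beta
```

Whenever the (scaled) correlation of a predictor with the partial residual falls below the threshold $\lambda/2$, its coefficient is set exactly to zero, which is the variable selection behavior described above.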
1.1.2 Ridge
Ridge regression (also known as Tikhonov regularization) is another useful method to mitigate
the problem of dimensionality by adding an L2-norm regularization term as a penalty. The objective
function is:
$$\min_{\beta} \left\{ \frac{1}{N} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \right\}.$$
Unlike LASSO, ridge regression shrinks the coefficients of unimportant
predictors but does not set them to 0. Hence, ridge regression is a regularization approach, but not a
variable selection approach.
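As a worked step, the ridge objective has a closed-form minimizer. The derivation is a sketch under the assumption that $X$ is the $N \times p$ predictor matrix, $y$ is the outcome vector and $I$ is the $p \times p$ identity matrix (notation introduced here for illustration):

```latex
\begin{aligned}
\frac{\partial}{\partial \beta}
  \Big[ \tfrac{1}{N}\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \Big]
  &= -\tfrac{2}{N} X^{\top}(y - X\beta) + 2\lambda\beta = 0 \\
\Longrightarrow \quad (X^{\top}X + N\lambda I)\,\beta &= X^{\top}y \\
\Longrightarrow \quad \hat{\beta}^{\text{ridge}} &= (X^{\top}X + N\lambda I)^{-1} X^{\top}y .
\end{aligned}
```

For any $\lambda > 0$ the matrix $X^{\top}X + N\lambda I$ is invertible even when there are more predictors than observations, and the solution is generically non-zero in every coordinate, which is consistent with ridge shrinking coefficients without selecting among them.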
Elastic net is a combination of LASSO and ridge regression. It optimizes the following objec-
tive function:
$$\min_{\beta} \left\{ \frac{1}{N} \lVert y - X\beta \rVert_2^2 + \lambda\alpha \lVert \beta \rVert_1 + \lambda (1 - \alpha) \lVert \beta \rVert_2^2 \right\}.$$
$\alpha$ controls the weight between the L1 and L2 norm penalties. If $\alpha = 1$, it is the same as LASSO; if
$\alpha = 0$, it becomes ridge regression. By averaging between LASSO and ridge regression, elastic net
is expected to combine the advantages of both methods.
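The role of the mixing weight can be seen in the update for a single coefficient. The sketch below assumes the elastic net objective $\frac{1}{N}\lVert y - X\beta\rVert_2^2 + \lambda\alpha\lVert\beta\rVert_1 + \lambda(1-\alpha)\lVert\beta\rVert_2^2$ and holds all other coefficients fixed; `rho` and `z` are hypothetical names for $\frac{1}{N}\sum_i x_{ij} r_i$ (with $r_i$ the partial residual) and $\frac{1}{N}\sum_i x_{ij}^2$.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; return exactly 0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def elastic_net_update(rho, z, lam, alpha):
    """Coordinate-wise minimizer of the elastic net objective.

    The L1 part thresholds rho (selection); the L2 part inflates the
    denominator (shrinkage). alpha = 1 gives the LASSO update and
    alpha = 0 gives the ridge update.
    """
    return soft_threshold(rho, lam * alpha / 2.0) / (z + lam * (1.0 - alpha))
```

With `alpha=1.0` a weak predictor is set exactly to zero, while with `alpha=0.0` the same predictor keeps a small non-zero coefficient.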
Principal component regression combines standard linear regression with principal component
analysis (PCA). Specifically, PCR regresses the dependent variable (Y) on principal components
of independent variables (X), as opposed to regressing Y directly on X in OLS. Since the principal
components are extracted based on their ability to explain the variation in X, the forecasting goal
(Y) does not come into play until the final regression step.
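A minimal sketch of PCR with a single component illustrates the two-step logic: the component is extracted from X alone, and Y enters only in the final regression. The sketch assumes a power-iteration estimate of the leading principal component and X with non-zero variance; `first_principal_component` and `pcr_one_component` are hypothetical names, not the paper's implementation.

```python
import math


def first_principal_component(Xc, n_iter=100):
    """Leading eigenvector of Xc'Xc by power iteration (Xc is column-centered)."""
    p = len(Xc[0])
    v = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-symmetric starting vector
    for _ in range(n_iter):
        scores = [sum(row[j] * v[j] for j in range(p)) for row in Xc]           # Xc v
        w = [sum(Xc[i][j] * scores[i] for i in range(len(Xc))) for j in range(p)]  # Xc'Xc v
        norm = math.sqrt(sum(a * a for a in w))
        v = [a / norm for a in w]
    return v


def pcr_one_component(X, y):
    """Regress y on the first principal component of X; return a predict function."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    v = first_principal_component(Xc)  # step 1: uses X only
    scores = [sum(r[j] * v[j] for j in range(p)) for r in Xc]
    # step 2: OLS slope of (centered) y on the component score
    gamma = sum(s * (yi - y_mean) for s, yi in zip(scores, y)) / sum(s * s for s in scores)

    def predict(x_new):
        s = sum((x_new[j] - x_mean[j]) * v[j] for j in range(p))
        return y_mean + gamma * s

    return predict
```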
Partial least squares (PLS) regression shares some similarities with PCR, but it constructs the
components of X with the goal of best explaining the covariance between X and Y. It first
projects both the independent variables (X) and the dependent variables (Y) to a new space, chosen so
that the projection of the X-space explains the most variation in the Y-space. It then runs a
linear regression model in the new space. PLS is especially helpful when there are more predictors
than available observations or when predictors are highly collinear.
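The contrast with PCR can be sketched with a single PLS component, whose weight vector is proportional to the covariance between each predictor and Y. This is a simplified one-component illustration under the assumption of a single outcome variable and at least one predictor correlated with it; `pls_one_component` is a hypothetical name.

```python
import math


def pls_one_component(X, y):
    """One-component PLS: direction w proportional to X'y, then OLS of y on the score."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [yi - y_mean for yi in y]
    # Weight of each predictor is its covariance with y (up to scale)
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(a * a for a in w))
    w = [a / norm for a in w]
    t = [sum(r[j] * w[j] for j in range(p)) for r in Xc]  # component scores
    gamma = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    def predict(x_new):
        s = sum((x_new[j] - x_mean[j]) * w[j] for j in range(p))
        return y_mean + gamma * s

    return predict
```

In the toy case where the first predictor has a large variance but is unrelated to Y, the PLS direction loads entirely on the second predictor, whereas the first principal component of X would load on the first.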
Random forests come from decision trees. A decision tree is a set of logic conditions on input
variables (X) that lead to predictions on the target output (Y). The following figure illustrates a
regression tree example. The first condition used to determine y is whether $x_1$ is greater than a.
Conditional on the answer to this question, another logic condition will be raised. This process
iterates until the value of y is determined. Unlike linear regression, the regression tree is a
non-linear and non-parametric method. A random forest is an ensemble of multiple decision trees.
It outputs the average of the predictions of the individual trees. Although a single tree may be a weak
prediction model, through combination the random forest can have a strong performance.
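The idea of averaging many weak trees can be sketched with depth-one regression trees ("stumps") fitted on bootstrap samples. This is a deliberately simplified illustration with a single predictor, so it shows only the bagging part of a random forest (a full random forest also samples predictors at each split); `fit_stump` and `fit_forest` are hypothetical names.

```python
import random


def fit_stump(xs, ys):
    """Depth-1 regression tree: one condition 'x <= a' with a leaf mean on each side."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        a = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= a]
        right = [y for x, y in zip(xs, ys) if x > a]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, a, ml, mr)
    if best is None:  # only one distinct x value: fall back to the sample mean
        m = sum(ys) / len(ys)
        return lambda x: m
    _, a, ml, mr = best
    return lambda x: ml if x <= a else mr


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Average the predictions of stumps fitted on bootstrap samples."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(t(x) for t in trees) / len(trees)
```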
Gradient boosting is another approach to ensemble regression trees. At each step, a new tree
is fitted on the negative gradient of a given loss function. Hence, new trees aim at correcting the
errors of preceding trees. To avoid overfitting on residuals, the contribution of each subsequent tree is
discounted (shrunk) at each step. This process is repeated until a total number of N trees is reached.
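For squared-error loss the negative gradient is simply the current residual, so boosting amounts to repeatedly fitting a small tree to the residuals and adding a discounted version of it to the model. The sketch below assumes squared-error loss, a single predictor with at least two distinct values, and depth-one trees; the names `fit_stump`, `gradient_boost`, `mse` and the default `learning_rate` are illustrative assumptions.

```python
def fit_stump(xs, targets):
    """Depth-1 regression tree on one predictor (needs >= 2 distinct x values)."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        a = (lo + hi) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= a]
        right = [t for x, t in zip(xs, targets) if x > a]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, a, ml, mr)
    _, a, ml, mr = best
    return lambda x: ml if x <= a else mr


def gradient_boost(xs, ys, n_trees, learning_rate=0.1):
    """Boosted stumps for squared-error loss; each new stump is fitted to residuals."""
    f0 = sum(ys) / len(ys)  # initial prediction: the sample mean
    stumps = []
    preds = [f0] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of 0.5*(y - f)^2
        h = fit_stump(xs, residuals)
        stumps.append(h)
        # Discount the new tree by the learning rate before adding it
        preds = [p + learning_rate * h(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + learning_rate * sum(h(x) for h in stumps)


def mse(model, xs, ys):
    """In-sample mean squared error of a fitted model."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

In-sample error falls as trees are added; the number of trees is therefore the natural hyperparameter to tune out of sample.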
Artificial neural networks are a broad set of machine learning algorithms inspired by the biological
neural structure of human brains. A network has a layer-by-layer structure, where each layer is composed
of “neurons”, and the layers are connected by “edges”. The following figure shows an example,
the feedforward neural network. The input layer is the input variables (X), and the output layer is
the outcome (Y). Each node of the hidden layer represents the following operation:
$$h = f(w_0 + XW),$$
where $w_0$ is the intercept term and W, the linear weight matrix on the inputs, represents the “edges”
connecting the input layer and the hidden layer. There are multiple choices for $f(\cdot)$, one of which
is the sigmoid function:
$$f(x) = \frac{1}{1 + e^{-x}}.$$
The output of the hidden layer ($h$) can then be used as the input of the output layer or another
concatenated hidden layer. This process continues until the output layer is reached.
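The hidden-layer operation can be sketched as a forward pass through one hidden layer with sigmoid activation. The linear output node with weights `v` and intercept `v0` is an assumption made for this illustration (a common choice for regression targets), and all function names are hypothetical.

```python
import math


def sigmoid(x):
    """Sigmoid activation f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_layer(x, W, w0):
    """Output of hidden node k: f(w0[k] + sum_j x[j] * W[j][k])."""
    return [
        sigmoid(w0[k] + sum(x[j] * W[j][k] for j in range(len(x))))
        for k in range(len(w0))
    ]


def forward(x, W, w0, v, v0):
    """One hidden layer followed by a linear output node (assumed for illustration)."""
    h = hidden_layer(x, W, w0)
    return v0 + sum(hk * vk for hk, vk in zip(h, v))
```

The number of free parameters is roughly (number of inputs + 1) times the number of hidden nodes, which grows quickly when the inputs are high-dimensional text features.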
We tune one hyperparameter for each of the supervised machine learning methods. For LASSO
and ridge regression, we change the regularization strength ($\lambda$); for elastic net, we alter the linear
weight ($\alpha$); for PCR and PLS, we vary the number of principal components; for random forest and
gradient boosting, we adjust the number of trees; for neural network, we tune the number of nodes
of the hidden layer. Figure OA.1 presents the hyperparameter search results. Table 3 shows the
best out-of-sample R-square ($R^2_{OOS}$) for each supervised method and the corresponding parameter.
It may be surprising that the most popular and advanced approach, the neural network, performs the
worst among all methods and even underperforms the most basic OLS. This is due to the mismatch
between the high-dimensional predictors and the relatively small sample size. NN is a highly
parameterized model, and we do not have enough observations to estimate all parameters well. This
mismatch can also explain why dimension reduction methods (especially PLS) work particularly
well on our dataset. By limiting the predicting variables to only a few principal components, the
number of parameters is manageable for our training set.
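A one-dimensional hyperparameter search of this kind can be sketched as follows. The sketch assumes a common definition of out-of-sample R-square, one minus the ratio of the model's squared prediction errors to those of a benchmark forecast (here the training-sample mean); the exact definition and sample split used in the paper are given in the main text, and the names `r2_oos`, `tune` and the `fit(X, y, param)` interface are hypothetical.

```python
def r2_oos(y_true, y_pred, benchmark):
    """1 - SSE(model) / SSE(benchmark forecast) on held-out data."""
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sst = sum((y - benchmark) ** 2 for y in y_true)
    return 1.0 - sse / sst


def tune(candidates, fit, X_train, y_train, X_val, y_val):
    """Return the candidate hyperparameter with the highest held-out R-square.

    `fit(X_train, y_train, param)` must return a function mapping one
    observation to a prediction.
    """
    benchmark = sum(y_train) / len(y_train)
    best_param, best_score = None, float("-inf")
    for param in candidates:
        predict = fit(X_train, y_train, param)
        score = r2_oos(y_val, [predict(x) for x in X_val], benchmark)
        if score > best_score:
            best_param, best_score = param, score
    return best_param, best_score
```

A negative value means the model predicts worse than the benchmark forecast on the held-out data, which is how an over-parameterized model can underperform a very simple one.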
2.1 Model
In practice, word embedding vectors are estimated with a two-layer neural network, the Skip-gram
model. Given a sequence of words $w_1, w_2, \ldots, w_T$, the inference problem is to maximize the
average log probability of the context of $w_t$:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t),$$
where $c$ is the size of the context window and
$$p(w_O \mid w_I) = \frac{\exp\left( {v'_{w_O}}^{\top} v_{w_I} \right)}{\sum_{w=1}^{W} \exp\left( {v'_{w}}^{\top} v_{w_I} \right)},$$
where $v_{w_I}$ and $v'_{w_O}$ represent the input and output representations of word $w_t$, and W denotes
vocabulary size. The embedding of $w_t$ is the projection vector between the input and output layer.
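The softmax probability in the Skip-gram objective can be sketched directly. The dictionaries `input_vecs` and `output_vecs` are hypothetical containers for the input and output representations of each word; in practice these vectors are learned by maximizing the average log probability, which this sketch does not attempt.

```python
import math


def skipgram_probabilities(center, input_vecs, output_vecs):
    """p(w_O | w_I) for every candidate output word, given the center word w_I."""
    v_in = input_vecs[center]
    scores = {
        w: sum(a * b for a, b in zip(v_out, v_in))  # inner product of output and input vectors
        for w, v_out in output_vecs.items()
    }
    m = max(scores.values())  # subtract the maximum for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}
```

Words whose output vectors align with the input vector of the center word receive higher probability, so words that share contexts end up with similar embeddings.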
Given a fixed number of clusters (k), the objective of k-means is to find a partition of
the dataset such that the within-cluster sum of squared distances between each observation and its
closest centroid is minimized. Equivalently, this can be expressed as:
$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 ,$$
where $\mu_i$ is the average of data points in $S_i$.
A k-means algorithm works as follows:
1. Specify the number of clusters k. Randomly select k data points as cluster centroids ($\mu$).
2. Assign each data point to the cluster whose centroid is closest.
3. Update each centroid to the average of the data points assigned to its cluster:
$$\mu_j = \frac{1}{|S_j|} \sum_{x \in S_j} x .$$
4. Repeat 2) and 3), until the assignments of data points no longer change.
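The iteration can be sketched in a few lines. The optional `init` argument, which fixes the starting centroids, is an addition for reproducibility in this illustration; by default the sketch picks k data points at random, as in step 1. The function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Lloyd's algorithm; returns (centroids, labels, distortion)."""
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(points, k)
    centroids = [list(c) for c in start]
    labels = None
    for _ in range(max_iter):
        # Assign each point to its closest centroid
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments no longer change
            break
        labels = new_labels
        # Move each centroid to the mean of its assigned points
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    distortion = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, distortion
```

The returned distortion is the within-cluster sum of squared distances, the quantity that the objective function minimizes.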
2.2 Preprocessing
Before estimation, we preprocess the raw text step by step to get a cleaner input. We first
split all documents into words and convert them to lowercase. We then apply lemmatization to
convert all words to their root forms. Because word embedding uses contextual information, we do not
remove individual words before estimating the vector representation, so as not to affect the sentence
structure. After obtaining the word embedding vectors, we drop stop-words and low-frequency
words that appear fewer than 20 times in the corpus. Finally, we transform preprocessed text
into numerical counts that we use as the input of word embedding estimation. The corpus is
represented as a $D \times V$ document-term matrix M, where $M(d, v)$ indicates the count of the $v$-th
word in the $d$-th document. This is the “bag-of-words” representation. The underlying assumption
is that the order of words does not matter. Although this is an oversimplification of reality, it retains
a large amount of information while keeping the algorithm simple. The final corpus consists of
2,262 documents and 20,145 unique terms.
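The bag-of-words representation can be sketched as follows. The sketch uses whitespace tokenization and lowercasing only; lemmatization and stop-word removal are omitted for brevity, and the function name and the `min_count` parameter are illustrative assumptions.

```python
from collections import Counter


def document_term_matrix(docs, min_count=1):
    """Return (vocab, M) where M[d][v] counts the v-th vocabulary word in document d."""
    tokenized = [doc.lower().split() for doc in docs]
    totals = Counter(tok for doc in tokenized for tok in doc)
    # Keep only words that appear at least min_count times in the corpus
    vocab = sorted(w for w, c in totals.items() if c >= min_count)
    index = {w: v for v, w in enumerate(vocab)}
    M = [[0] * len(vocab) for _ in docs]
    for d, doc in enumerate(tokenized):
        for tok in doc:
            if tok in index:
                M[d][index[tok]] += 1
    return vocab, M
```

Because only counts are stored, two documents with the same words in a different order map to the same row of M.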
2.3 Choice of topics
One important step of k-means is to find the optimal number of topics. We take a data-driven
approach to select the best model. To be specific, we apply the “Elbow method” to the distortion
score (the sum of squared distances between each point and its assigned centroid), which is a
heuristic method to find the appropriate number of clusters on a dataset. “Elbow” refers to the
point where adding another cluster does not give much improvement to the model.11 To determine
the “Elbow”, Satopaa et al. (2011) propose an algorithm detecting the point of maximum curvature
as the elbow, where the curvature can be calculated as:
$$K_f(x) = \frac{f''(x)}{\left( 1 + f'(x)^2 \right)^{3/2}}$$
Figure OA.2 presents the results on the optimal number of topics. We find that the optimal
number of topics detected by the algorithm is 20.
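The curvature criterion can be sketched with finite differences on the distortion curve. This is a simplified illustration of the maximum-curvature idea rather than the algorithm of Satopaa et al. (2011) itself; it assumes evenly spaced candidate values of k, at least three candidates, and rescales both axes to the unit interval before differencing. The function name is hypothetical.

```python
def elbow_by_curvature(ks, scores):
    """Return the k at which the normalized score curve has maximum discrete curvature."""
    n = len(ks)
    x = [(k - ks[0]) / (ks[-1] - ks[0]) for k in ks]
    lo, hi = min(scores), max(scores)
    y = [(s - lo) / (hi - lo) for s in scores]
    h = x[1] - x[0]  # assumes evenly spaced candidates
    best_k, best_curv = None, float("-inf")
    for i in range(1, n - 1):
        d1 = (y[i + 1] - y[i - 1]) / (2 * h)            # central first difference
        d2 = (y[i + 1] - 2 * y[i] + y[i - 1]) / (h * h)  # second difference
        curv = d2 / (1 + d1 * d1) ** 1.5
        if curv > best_curv:
            best_k, best_curv = ks[i], curv
    return best_k
```

On a curve that drops steeply and then flattens, the maximum falls at the point where additional clusters stop reducing the distortion score by much.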
Latent Dirichlet Allocation (LDA), developed by Blei et al. (2003), is a generative probabilistic
modeling approach. The basic idea is that each document can be represented as a probability
distribution over various topics, where each topic is a probability distribution over the vocabulary
of a corpus. Suppose there are $K$ latent topics, $D$ documents and $V$ unique terms in the corpus.
LDA assumes the following data generating process for each document d:
1. Draw $\beta_k$ from a multinomial distribution, where $\beta_k$ (a $1 \times V$ vector) denotes the word
distribution of topic $k$ for each $k = 1, 2, \ldots, K$.
2. Draw $\theta_d$ from a Dirichlet distribution, where $\theta_d$ (a $1 \times K$ vector) denotes the topical
distribution of document $d$.
11 Hansen et al. (2018) use the method to select the number of topics for the FOMC transcripts.
Intuitively, one can think of generating a document with $N$ words as repeating the action of
"generating a word" $N$ times, where each word is generated in two steps: first, roll a $K$-sided die
to select a topic; then, conditional on the selected topic, roll a $V$-sided die to choose a
word. Note that the sides are not equally likely: their probabilities are given by $\theta_d$ and $\beta_k$,
respectively.
Given a corpus and a number of latent topics $K$, the inference problem of LDA is to compute the
posterior distribution of the hidden variables $\Theta = (\theta_1, \theta_2, \ldots, \theta_D)$ and
$B = (\beta_1, \beta_2, \ldots, \beta_K)$, such that
the generated distribution resembles the observed distribution of words in each document. Since
the distribution is usually mathematically intractable, it is solved with the Gibbs sampling algorithm
(Griffiths and Steyvers, 2004) in practice.
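The two-step "dice rolling" description can be sketched by simulating one document from given topic and word distributions. Only the generative direction is illustrated here; inference (recovering the distributions from text) is not attempted. The function name, argument names and the fixed seed are assumptions made for this sketch.

```python
import random


def generate_document(theta_d, beta, vocab, n_words, seed=0):
    """Simulate a document: for each word, draw a topic from theta_d,
    then draw a word from that topic's distribution beta[k]."""
    rng = random.Random(seed)
    topics = list(range(len(theta_d)))
    words = []
    for _ in range(n_words):
        k = rng.choices(topics, weights=theta_d)[0]  # roll the K-sided die
        w = rng.choices(vocab, weights=beta[k])[0]   # roll the V-sided die of topic k
        words.append(w)
    return words
```

Here `theta_d` has one weight per topic and each row of `beta` has one weight per vocabulary term; a document whose `theta_d` concentrates on one topic will mostly contain that topic's high-probability words.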
3.2 Preprocessing
Similar to word embedding, we preprocess the raw text to get a cleaner input for the LDA
model. First, we split all documents into words and convert them to lowercase. We then remove
common stop-words like “the”, “a” and “I”, as they appear frequently in text but convey little
information. Second, we convert all words to their root forms, so that words like “communicates” and
“communicating” both become “communicate”. Third, we identify common two-word collocations
that appear more than 20 times in the corpus. For example, “machine learning” conveys a
specific meaning different from “machine” and “learning”. Fourth, we drop infrequent unigrams
and bigrams that appear in fewer than 10 documents. Finally, we convert the preprocessed text to a
document-term matrix, as we do for the word embedding analysis. The final corpus consists of
2,262 documents and 26,410 unique terms.
An important yet challenging task of LDA is to find the optimal number of topics ($K$). As
discussed in Hansen et al. (2018), there is a trade-off between model interpretability and statistical
goodness-of-fit. If $K$ is too small, the model does not fit the data well, and the topics generated
are often too general and mix multiple themes. However, if $K$ is too large, the topics are too
fine-grained, which impairs the interpretability of the model. To balance the two effects, we adopt
a statistical measure, topic coherence, to select $K$ (Röder et al., 2015). A topic is said to be
coherent if its top words frequently co-occur with each other. In particular, we use normalized
pointwise mutual information (NPMI), which has been shown to have the largest correlation with human
topic coherence ratings, to calculate co-occurrence:
$$NPMI(w_i, w_j) = \frac{\log \dfrac{P(w_i, w_j) + \epsilon}{P(w_i) P(w_j)}}{-\log\left( P(w_i, w_j) + \epsilon \right)},$$
where $P(w_i)$, $P(w_j)$ and $P(w_i, w_j)$ denote the probability that $w_i$ appears, $w_j$ appears, and $w_i$ and
$w_j$ jointly appear in the corpus. $\epsilon$ is added to avoid taking the logarithm of zero.
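The NPMI calculation can be sketched as follows. Estimating the probabilities as document frequencies (the share of documents containing a word or a word pair) is an assumption made for this illustration; coherence implementations often use sliding windows instead. The function names and the default value of `eps` are hypothetical.

```python
import math


def npmi(p_i, p_j, p_ij, eps=1e-12):
    """Normalized pointwise mutual information of two words."""
    pmi = math.log((p_ij + eps) / (p_i * p_j))
    return pmi / (-math.log(p_ij + eps))


def doc_probabilities(docs, w_i, w_j):
    """Estimate P(w_i), P(w_j) and P(w_i, w_j) as document frequencies.

    `docs` is a list of sets of tokens.
    """
    n = len(docs)
    has_i = [w_i in d for d in docs]
    has_j = [w_j in d for d in docs]
    both = sum(a and b for a, b in zip(has_i, has_j))
    return sum(has_i) / n, sum(has_j) / n, both / n
```

The measure is close to 1 when two words always appear together, close to 0 when they appear independently, and negative when they co-occur less often than independence would imply.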
We consider candidates of topic numbers ($K$) ranging from 10 to 80 in increments of ten.
Figure OA.3 shows the topic coherence of each LDA model with different specifications of K. It
indicates that $K = 20$ maximizes the coherence measure and produces the best results. To understand
the LDA output with 20 topics, we need to interpret the estimated topics. Since each topic is
a probability distribution over all unique terms in the vocabulary, a natural way to name each topic
is to read the terms with the highest probabilities and manually assign a label. However, the most
frequent terms often appear in multiple topics, making it difficult to distinguish between topics.
An alternative is to look for terms that appear almost exclusively in a given topic. Exclusivity is defined
as the ratio of a term’s probability within a topic to its probability across all topics (Taddy, 2012).
Bybee et al. (2020) adopt this approach to analyze the structure of economics news from the Wall
Street Journal. However, this measure may put too much weight on very rare terms, which can also
be hard to interpret. Following Sievert and Shirley (2014), we use the relevance measure, which is
defined as the weighted average of the two measures above:
$$Relevance(\text{term } w \mid \text{topic } t) = \lambda \times p(w \mid t) + (1 - \lambda) \times \frac{p(w \mid t)}{p(w)}$$
We find that LDA topics with $\lambda = 0.6$ yield the best topic interpretability.
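The effect of the weight on term rankings can be sketched with the relevance formula as stated in the text (a weighted average of the within-topic probability and the exclusivity ratio, without a log transformation). The dictionaries of probabilities and the function names are hypothetical inputs for illustration.

```python
def relevance(p_w_given_t, p_w, lam=0.6):
    """Weighted average of a term's topic probability and its exclusivity ratio."""
    return lam * p_w_given_t + (1.0 - lam) * p_w_given_t / p_w


def top_terms(topic_probs, corpus_probs, lam=0.6, n=15):
    """Rank the terms of one topic by relevance and return the top n."""
    ranked = sorted(
        topic_probs,
        key=lambda w: relevance(topic_probs[w], corpus_probs[w], lam),
        reverse=True,
    )
    return ranked[:n]
```

With the weight set to 1 the ranking reduces to raw within-topic probability, so common terms dominate; lowering it moves terms that are rare overall but concentrated in the topic toward the top.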
Figure OA.1: Supervised Learning Hyperparameter Search
This figure plots the hyperparameter search results of the supervised method. For each subplot, the
solid blue line indicates how $R^2_{OOS}$ varies with different parameter choices, and the dashed red line
indicates the parameter that gives the best performance.
Figure OA.2: Elbow Method
This figure shows the elbow method used to select the most appropriate number of clusters. The
blue solid line plots the elbow curve of the distortion score, and the red dashed line indicates the
“elbow” detected by the algorithm.
Figure OA.3
This figure plots the topic coherence measure with different specifications of LDA topic numbers.
Figure OA.4: Distribution of Technology Indexes
Figure OA.5: Technology Index Validation
This figure plots the relationship between the composite technology index and GitHub measures. In panel
(a), the variable of interest is subscriber, which measures the number of users subscribing to repository
updates; in panel (b), star indicates the number of “likes” received by the repository; in panel (c), fork proxies
for repository copies made by other developers; in panel (d), commit represents how many times the code
has been revised; in panel (e), branch is the number of pointers to specific versions of the repository; and
in panel (f), contributor reflects how many developers have contributed to the source code. The red solid line
represents the linear fit of GitHub measures on the composite tech index.
Table OA.1: Embedding Key Terms
This table displays the top 15 most frequent terms of each word embedding clustering and their topic labels.
The number in parentheses indicates the percentage of terms belonging to the topic.
Table OA.2: LDA Key Terms
This table displays the top 15 most relevant terms of each LDA topic. The number in parentheses indicates
the relative prevalence of the topics in the corpus.
Table OA.3: Blockchain technology word list
This table presents the complete word list that we use to count blockchain technology words as an
alternative measure of technology sophistication.
accenture DAPP gigabyte protocal
address DDOS halve record
airdrop DDOS attack hard fork relayer
altcoin decentralize harware wallet reproduction
AML decryption hash robustness
API deposit hashcash Satoshi Nakamoto
ASIC difficulty hashrate scalability
authentication digital asset hot wallet scrypt
Bitcoin digital identity IBM self execute
BTC digital signature immutable serialization
block distributed ledger IPFS server
block height double spend KYC SHA-256
blockchain EEA ledger shard
bounty EIP liquid democracy smart contract
bug bounty encryption liquidity soft fork
chain ERC mainnet solidity
cipher ETH merkle tree stable coin
client Ether multi signature stablecoin
coin Ethereum NFT testnet
cold storage EVM node timestamp
cold wallet exchange oracle transaction fee
collective fiat private key validator
confirmation fiat currency public key wallet
consensus fork proof wallet address
cryptocurrency gartner proof of authority workflow
cryptography gas proof of stake (PoS)
DAO genesis block proof of work (PoW)
Table OA.4: Summary Statistics on Whitepaper Status
This table lists all possible whitepaper status and their frequencies.
Frequency Percent (%)
Downloaded. 1629 55.90
URL response: client error. 535 18.36
URL response: server error. 104 3.57
Unable to get URL response. 403 13.83
Invalid PDF files. 155 5.32
Whitepaper not found. 54 1.85
Whitepaper is accessible but not downloadable. 27 0.93
Permission is required to access. 7 0.24
Total 2914 100.00
Table OA.6: Technology Indexes Validation
This table validates tech indexes with measures from GitHub. Panel A, B, C and D display the supervised,
embedding-based, LDA-based and composite tech index respectively. For each column, watch measures
the number of users subscribing to repository updates; star indicates the number of “likes” received by the
repository; fork proxies for the copies made by other developers; commit represents the number of times
the code has been revised; branch is the number of pointers to specific versions of the repository; and
contributor reflects how many developers have contributed to the source code. All GitHub measures are
in logarithmic forms. The reported t-statistics are based on robust standard errors. ***, **, and * indicate
statistical significance at the 1%, 5%, and 10% levels respectively.
Panel A: Supervised Index
(1) (2) (3) (4) (5) (6)
ln(watch) ln(star) ln(fork) ln(commits) ln(branch) ln(contributor)
Tech_sup 0.561*** 0.635*** 0.530*** 0.754*** 0.429*** 0.453***
(0.064) (0.081) (0.073) (0.089) (0.052) (0.057)
Constant 1.985*** 1.842*** 1.428*** 4.081*** 1.887*** 1.943***
(0.056) (0.063) (0.055) (0.084) (0.047) (0.049)
Observations 861 861 861 861 861 861
R2 0.107 0.106 0.098 0.090 0.091 0.094
Panel B: Embedding Index
(1) (2) (3) (4) (5) (6)
ln(watch) ln(star) ln(fork) ln(commits) ln(branch) ln(contributor)
Tech_embed 0.651*** 0.792*** 0.665*** 0.995*** 0.563*** 0.580***
(0.061) (0.073) (0.068) (0.078) (0.050) (0.051)
Constant 1.950*** 1.795*** 1.389*** 4.018*** 1.852*** 1.908***
(0.054) (0.060) (0.051) (0.080) (0.044) (0.046)
Observations 861 861 861 861 861 861
R2 0.148 0.170 0.158 0.160 0.161 0.158
Panel C: LDA Index
(1) (2) (3) (4) (5) (6)
ln(watch) ln(star) ln(fork) ln(commits) ln(branch) ln(contributor)
Tech_lda 0.504*** 0.600*** 0.496*** 0.726*** 0.440*** 0.454***
(0.066) (0.079) (0.072) (0.088) (0.052) (0.056)
Constant 1.983*** 1.836*** 1.424*** 4.072*** 1.879*** 1.936***
(0.056) (0.062) (0.054) (0.083) (0.046) (0.048)
Observations 861 861 861 861 861 861
R2 0.095 0.105 0.095 0.092 0.106 0.104
Panel D: Composite Index
(1) (2) (3) (4) (5) (6)
ln(watch) ln(star) ln(fork) ln(commits) ln(branch) ln(contributor)
Tech_comp 0.796*** 0.940*** 0.785*** 1.148*** 0.665*** 0.691***
(0.072) (0.089) (0.083) (0.094) (0.059) (0.062)
Constant 1.949*** 1.796*** 1.390*** 4.023*** 1.853*** 1.908***
(0.054) (0.060) (0.052) (0.080) (0.044) (0.046)
Observations 861 861 861 861 861 861
R2 0.161 0.174 0.161 0.155 0.164 0.163
Table OA.7: Determinant of Technology Index
This table presents the determinants of tech indexes. The dependent variable in panel A, B and C is the
supervised, embedding-based and LDA-based tech index, respectively. Column (1) links the tech index to
whether an ICO uses Ethereum blockchain; column (2) presents the relation between the tech index and
GitHub commits (the number of code revisions); column (3) considers other text-based measures of ICO
whitepapers; column (4) presents estimates with ICO characteristics; column (5) includes all variables. The
reported t-statistics are based on robust standard errors. ***, **, and * indicate statistical significance at the
1%, 5%, and 10% levels respectively.
Panel A: Supervised Index
                 (1)        (2)        (3)        (4)        (5)
Ethereum Based   -0.102                                      -0.063
                 (0.076)                                     (0.073)
ln_commits                  0.093***                         0.066***
                            (0.013)                          (0.013)
Has GitHub                  -0.125**                         -0.136**
                            (0.062)                          (0.060)
Fog Index                              -0.003***             -0.002
                                       (0.001)               (0.002)
Tone                                   -0.211***             -0.193***
                                       (0.035)               (0.033)
Uncertainty                            0.028                 -0.029
                                       (0.070)               (0.062)
ICO Length                                        -0.002***  -0.002***
                                                  (0.001)    (0.000)
Team Size                                         0.031***   0.028***
                                                  (0.004)    (0.003)
Has Twitter                                       0.281***   0.246***
                                                  (0.096)    (0.095)
BTC Price (ICO)                                   -0.016*    -0.009
                                                  (0.009)    (0.009)
Pre ICO                                           0.043      0.072
                                                  (0.053)    (0.052)
Bonus                                             -0.036     -0.007
                                                  (0.059)    (0.057)
Accept BTC                                        -0.074     -0.050
                                                  (0.050)    (0.049)
Constant         0.085      -0.124***  0.081      -0.414***  -0.348**
                 (0.071)    (0.039)    (0.065)    (0.123)    (0.148)
R²               0.001      0.052      0.026      0.075      0.119
Observations     1629       1629       1629       1483       1483
Panel B: Embedding Index
                 (1)        (2)        (3)        (4)        (5)
Ethereum Based   -0.294***                                   -0.172**
                 (0.077)                                     (0.073)
ln_commits                  0.122***                         0.083***
                            (0.013)                          (0.013)
Has GitHub                  -0.119**                         -0.014
                            (0.060)                          (0.061)
Fog Index                              -0.004**              -0.004
                                       (0.002)               (0.003)
Tone                                   -0.282***             -0.227***
                                       (0.040)               (0.039)
Uncertainty                            -0.364***             -0.383***
                                       (0.065)               (0.065)
ICO Length                                        -0.002***  -0.002***
                                                  (0.001)    (0.001)
Team Size                                         0.008**    0.004
                                                  (0.004)    (0.004)
Has Twitter                                       0.150      0.075
                                                  (0.124)    (0.126)
BTC Price (ICO)                                   -0.027***  -0.016*
                                                  (0.010)    (0.009)
Pre ICO                                           -0.115**   -0.071
                                                  (0.053)    (0.051)
Bonus                                             -0.103*    -0.094*
                                                  (0.055)    (0.053)
Accept BTC                                        -0.134***  -0.095*
                                                  (0.051)    (0.048)
Constant         0.247***   -0.193***  0.419***   0.202      0.569***
                 (0.072)    (0.037)    (0.072)    (0.154)    (0.175)
R²               0.012      0.097      0.042      0.040      0.131
Observations     1629       1629       1629       1483       1483
Panel C: LDA Index
                 (1)        (2)        (3)        (4)        (5)
Ethereum Based   -0.262***                                   -0.132*
                 (0.078)                                     (0.074)
ln_commits                  0.098***                         0.063***
                            (0.014)                          (0.013)
Has GitHub                  -0.100*                          -0.023
                            (0.060)                          (0.062)
Fog Index                              -0.002*               -0.001
                                       (0.001)               (0.002)
Tone                                   -0.304***             -0.257***
                                       (0.036)               (0.034)
Uncertainty                            -0.222***             -0.240***
                                       (0.059)               (0.057)
ICO Length                                        -0.001*    -0.001
                                                  (0.001)    (0.001)
Team Size                                         0.007*     0.004
                                                  (0.004)    (0.004)
Has Twitter                                       0.205**    0.148*
                                                  (0.090)    (0.086)
BTC Price (ICO)                                   -0.012     -0.003
                                                  (0.010)    (0.010)
Pre ICO                                           -0.047     -0.015
                                                  (0.054)    (0.052)
Bonus                                             -0.092     -0.077
                                                  (0.058)    (0.056)
Accept BTC                                        -0.076     -0.047
                                                  (0.053)    (0.052)
Constant         0.220***   -0.151***  0.281***   -0.071     0.161
                 (0.073)    (0.036)    (0.061)    (0.130)    (0.140)
R²               0.009      0.062      0.042      0.016      0.081
Observations     1629       1629       1629       1483       1483
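
As a rough illustration of the kind of regression summarized in Table OA.7, the sketch below fits a univariate OLS regression of a tech index on ln_commits and computes a heteroskedasticity-robust standard error for the slope. The function name `ols_robust_slope`, the five data points, and the choice of the HC1 small-sample correction are assumptions made for illustration only; the table notes say the standard errors are robust but do not state which estimator is used, and the reported specifications include additional regressors.

```python
import math


def ols_robust_slope(x, y):
    """Fit y = a + b*x by OLS; return estimates, an HC1 robust SE for b, and R^2.

    Hypothetical helper for illustration; not the authors' code.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length, at least 3")

    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)
    if sxx == 0:
        raise ValueError("x must not be constant")

    # OLS point estimates.
    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Sandwich (HC0) variance of the slope, then the HC1 rescaling n / (n - k)
    # with k = 2 estimated parameters (assumed correction).
    var_hc0 = sum((d * e) ** 2 for d, e in zip(dx, resid)) / sxx ** 2
    var_hc1 = var_hc0 * n / (n - 2)

    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum(e * e for e in resid)
    r2 = 1 - ssr / sst if sst > 0 else float("nan")

    return {
        "intercept": intercept,
        "slope": slope,
        "robust_se_slope": math.sqrt(var_hc1),
        "r2": r2,
    }


# Made-up data: NOT taken from the paper's sample.
ln_commits = [0.0, 1.0, 2.0, 3.0, 4.0]
tech_index = [-0.9, -0.4, 0.1, 0.3, 0.9]
fit = ols_robust_slope(ln_commits, tech_index)
print(fit)
```

Dividing the slope by its robust standard error gives the t-statistic that underlies the significance stars; with several regressors, as in columns (2) to (5), the same sandwich formula applies in matrix form.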