Abstract. In this paper, we analyze the topology and the content found on the “darknet”,
the set of websites accessible via Tor. We created a darknet spider and crawled the darknet
starting from a bootstrap list by recursively following links. We explored the whole connected
component of more than 34,000 hidden services, of which we found 10,000 to be online. Con-
trary to folklore belief, the visible part of the darknet is surprisingly well-connected through
hub websites such as wikis and forums. We performed a comprehensive categorization of the
content using supervised machine learning. We observe that about half of the visible dark web
content is related to apparently licit activities based on our classifier. A significant amount
of content pertains to software repositories, blogs, and activism-related websites. Among
unlawful hidden services, most pertain to fraudulent websites, services selling counterfeit
goods, and drug markets.
1 Introduction
Crawlers and search engines have fundamentally changed the web. In June 1993, the first
crawler-based search engine Wandex knew just 130 web sites. Soon after, the world wit-
nessed the emergence of many more search engines such as Excite, Lycos, AltaVista, and
Google. The usability and success of these search engines may have fueled the exponential
growth of the web in the mid-1990s.
Today, many web pages offer content that is dynamically generated for each visit, and is as
such difficult to crawl. Also, a large part of the web is hidden from crawlers: web sites that
do not want to be crawled can publish a robots.txt file, and a large part of the web is protected
behind passwords. All these web sites (dynamic content, password- or robot-protected) are
known as the invisible part of the web.
While interactions between users and web sites have become increasingly encrypted (https),
the very nature of the internet protocol (IP) cannot prevent an observer from learning who is
communicating with whom, e.g. whether a user visits a certain web page.
Like the regular web (the “clearnet”), the Tor darknet also has a large invisible part
(dynamic content or password protected). Because names of web servers are generally
non-mnemonic, being invisible in the darknet is the default rather than the exception.
On the other hand, the darknet also features web servers that absolutely want to be
visible. The most prominent example is facebookcorewwwi.onion, an anonymous entrance
to the regular Facebook web site. More surprisingly, even if you are in an unlawful business
(e.g. selling drugs, guns, or passports), you may want to be visible, as there is no business
for those who are completely invisible. Figure 1 summarizes all these possible web site
types.
So does it make sense to build a crawler and search engine for the visible part of the
darknet, a “Google for the darknet?” While it is generally believed that the darknet is
impenetrable [4,27], we show that the darknet has a surprisingly substantial visible part
(Section 3); moreover, we show that this visible part of the darknet is well-connected with
hyperlinks, and that crawling indeed is possible. In addition, in Section 5, we use machine
            Clearnet                      Darknet
Visible     Googleable web                [our paper]
            Publicly accessible sites     Visible part of the Darknet
Invisible   Your e-mail inbox             Where spies meet?
            Protected networks            Dark side of the Darknet
Table 1. Overview of different web site types. The deep web consists of the italicized cells.
learning techniques to understand the content of the visible darknet: language, topic of
content (e.g. drugs, blogs, adult etc), and legality of content. We compare our results to
previous work in Section 6.
Our contributions are summarized as follows:
1. We develop and publish novel open source implementations of a darknet crawler and
classifier, which address unique challenges not typically encountered in clearnet settings.
2. We provide the first topological analysis of the Tor network graph, concluding that it
is well connected and scale-free.
3. We apply supervised machine learning on downloaded content and report our findings
on physical language, topic and legality.
Fig. 1. The architecture of the crawler. Modules are depicted in bold. [20]

2 Crawling the Darknet

In this section, we describe the structure of the spider we built to crawl the visible darknet.
To understand some of our design choices, we present the encountered problems and the
different strategies applied to resolve these issues.
The goal of the crawler is to discover and explore a large part of the darknet (we measure the size of the darknet in number of hidden services) by
recursively following any links it finds in the downloaded content. We distinguish a hidden
service, which refers to a particular .onion address, from a path, which is a URL of a
particular website hosted on a specific hidden service. We store the downloaded content
itself, the extracted URIs, the link structure, as well as some metadata such as timestamps
and data type of the content. We seeded our crawler with a list of 20,000 hidden service
addresses from previous work [17], which allowed us to reach a high initial crawl speed.
Out of those 20,000 hidden services, only approximately 1,500 were still online.
Our crawler uses the following modules:
– Conductor: Manages and orchestrates the download, processing, and storage of con-
tent by scheduling the other components.
– Network module: Downloads content and handles network related exceptions and
issues.
– Parser: Parses the downloaded content and extracts URIs that will be used recursively
to download further content.
– Database: Stores the downloaded content as well as the network’s structure and URIs.
– Tor proxy: Enables the crawler to access the Tor network.
The modules are depicted in Figure 1; they are described below in detail.
The crawler is newly developed for this work and made public under the MIT license
for further use by other researchers. It implements a complete data analysis chain, from
collection to data preprocessing and classification. The collected data contains not only
the content of the hidden services but also the link structure, status information, and network
topology information; the crawler can be configured to track network changes over time.
Early in the process we discovered an issue with blindly downloading everything that
is passed on as a download task: if a large fraction of the download tasks point to the same
hidden service and the network module opens a new connection to that hidden service
for each of those download tasks, the darknet web server might classify our behaviour as an
attempted DoS attack. This is problematic for two reasons: first, we do not want to disturb
the targeted services; second, servers may initiate rate-limiting countermeasures against what
they consider a DoS attack, e.g. captchas or similar barriers.
In order to prevent this, we implemented a push back mechanism. This mechanism
keeps track of the number of already running requests per hidden service and only starts
a new download task to a hidden service if there are fewer than four live concurrent
connections to it. If all available slots for this hidden service are in use, the download
task is put into a waiting queue. This queue is checked for waiting requests as soon as
one of the four pending requests finishes.
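The push back mechanism boils down to simple per-hidden-service bookkeeping. The following TypeScript sketch illustrates the idea under our reading of the description; it is not the published crawler code, and names such as HiddenServiceLimiter and the download callback are purely illustrative.

```typescript
// Minimal sketch of a per-hidden-service push back mechanism (illustrative only).

type Task = { hiddenService: string; path: string };

class HiddenServiceLimiter {
  private running = new Map<string, number>(); // live connections per hidden service
  private waiting = new Map<string, Task[]>(); // waiting queue per hidden service

  constructor(private readonly download: (t: Task) => Promise<void>,
              private readonly maxConcurrent = 4) {}

  submit(task: Task): void {
    const live = this.running.get(task.hiddenService) ?? 0;
    if (live < this.maxConcurrent) {
      this.start(task);
    } else {
      // all slots for this hidden service are busy: park the task in its waiting queue
      const queue = this.waiting.get(task.hiddenService) ?? [];
      queue.push(task);
      this.waiting.set(task.hiddenService, queue);
    }
  }

  private start(task: Task): void {
    this.running.set(task.hiddenService, (this.running.get(task.hiddenService) ?? 0) + 1);
    this.download(task).finally(() => this.finish(task.hiddenService));
  }

  private finish(hiddenService: string): void {
    this.running.set(hiddenService, (this.running.get(hiddenService) ?? 0) - 1);
    // drain the waiting queue of this hidden service first; otherwise the conductor
    // would repopulate the pool with a fresh task
    const next = this.waiting.get(hiddenService)?.shift();
    if (next) this.start(next);
  }
}
```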
The value of four concurrent requests was chosen to be lower than the Tor Browser's
default setting of six. Staying below this limit makes the crawler less detectable. Note that four
concurrent requests is higher than the two simultaneous connections proposed
by RFC 2616 [29]; however, most modern browsers use a value between four and eight as
well.
The entries in the queues consume memory: if the pool is repeatedly repopulated while many
download tasks reside in the queues, memory consumption grows unboundedly. This issue
cannot be solved by the network module itself; the modules that repopulate the pool must
choose entries wisely, so that not too many download tasks point to the same hidden services
and end up in the waiting queues.
Additionally, many concurrent download tasks in the waiting queues can slow down
the crawling progress. To ensure the queues will be emptied eventually we do the following
every time a download task returns: we check whether there is a download task in a waiting
queue that does not violate our limit of four connections per hidden service. If such a task
exists we pop it out from the queue, else we get a new entry from the pool. However, in
a worst case scenario all hidden services in progress have four concurrent tasks from the
waiting queues. Then, all available connections will be busy emptying these queues and no
further progress will be made towards other hidden services. This implies that the number
of hidden services needed to delay the overall progress of the crawler on the Tor network
is 25% of the total number of connections.
data in memory, but this data will also be immediately discarded if the detected MIME
type is not whitelisted.
After the whitelist MIME type filter, only text documents should be available. These
can still contain textual representations of media content in the form of data URIs such
as base64 encoded data. Here we employ a blacklist filter that replaces any occurrence of
a non-textual data URI by a placeholder.
The previous steps all happen in memory, and the data is not accessible to anybody at this
stage. At this point, all rich media content should be removed, and therefore the content is
ready to be passed back for persistent storage.
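A minimal sketch of this filtering step is shown below, using the mmmagic library [26] for MIME detection. This is our illustration rather than the crawler's actual code; the whitelist contents and the data URI pattern are assumptions made for the example.

```typescript
// Sketch of the MIME filtering step: detect the MIME type from the raw bytes,
// drop non-whitelisted content, and blank out embedded non-textual data URIs.
import * as mmm from "mmmagic";

const WHITELIST = ["text/html", "text/plain", "application/json", "application/xml"]; // example whitelist
const DATA_URI = /data:(?!text\/)[a-z0-9.+-]+\/[a-z0-9.+-]+;base64,[a-z0-9+\/=]+/gi;

function detectMime(buf: Buffer): Promise<string> {
  const magic = new mmm.Magic(mmm.MAGIC_MIME_TYPE);
  return new Promise((resolve, reject) =>
    magic.detect(buf, (err, result) => (err ? reject(err) : resolve(result as string)))
  );
}

async function filterContent(buf: Buffer): Promise<string | null> {
  const mime = await detectMime(buf);
  if (!WHITELIST.includes(mime)) return null; // discard non-text content immediately
  // replace embedded non-textual data URIs (e.g. base64 images) by a placeholder
  return buf.toString("utf8").replace(DATA_URI, "[data-uri removed]");
}
```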
The conductor balances the needs of the other modules. It is the conductor’s responsibility
to dispatch URIs for download to the network module and get new URIs from the database
into the download task pool of the network module. It is important for the conductor to keep
the number of download tasks in the pool balanced in order to prevent an unpredictable
increase in memory usage. It also uses information about the network module queues when
repopulating the pool, by only requesting paths from hidden services that do not yet have
entries in their queue.
Prioritization of Downloads
Deciding which URI to schedule next for download is important not only to prevent DoS-
like behaviour as described in Section 2.1, but also to ensure we make progress on the
discovery of the visible part of the darknet. We applied several prioritization schemes to
ensure good progress and to prevent issues with some pathological examples. We will now
introduce those different schemes and explain for each why we introduced them, when they
work and where they fail.
– Naïve: The first approach was to greedily fill up the pool with new download tasks in
the order they were previously inserted into the database. However, most pages link
heavily to pages of the same hidden service and contain only a small number
of links that point to different hidden services. When populating the pool with such
entries, we immediately fill up the queues in the network module and increase
memory consumption. In that case, the exploration of the darknet might be
slowed down, as the network module is busy processing the queues.
– Random: To prevent this issue, we started randomizing the selection of paths inserted
into the download task pool. This works well when all hidden services in the database
have a similar number of paths available for download. However, there exist pages that
contain a huge number of links to themselves – we call these black hole pages. An
example for a black hole page is a bitcoin explorer that lists every action ever recorded
on the bitcoin blockchain. This explorer took up to 90% of the available paths in our
database in an early version of our crawler. Even randomization then will lead to a
pool filled with 90% download tasks pointing to the same hidden service. The random
strategy also does not do a good job at discovering new hidden services, since we select
the next page to be downloaded at random.
– New-hidden-service-first: This strategy tries to advance through the darknet as
fast as possible by downloading only a single page per hidden service. This method
is efficient when the collection of yet uncrawled hidden services is large because it
advances very quickly to new hidden services. However, the first page downloaded from
a hidden service is not always useful for classification or collection of further links. Thus
this strategy only works well for seed generation but not for data collection.
– Prioritized: Since the new-hidden-service-first strategy does not guarantee good data
for link collection or classification, we attempted to use another part of the dataset
we collected along the crawling progress to serve as an indicator of which path should
be dispatched next. For each path, we calculate the number of unique hidden services
linking to this path. This count then serves as a proxy to assess the importance of a
path [31]. When many unique hidden services link to a particular path, that path likely
contains important information. We then start downloading the path with the highest
distinct incoming link count. Although this strategy works well in general and results
in a nice ordering of paths to be scheduled, there exist two issues with this strategy.
First, we found that the darknet contains closely interconnected subnetworks. One
such subnetwork is a bitcoin scamming scheme. The scammers spin up several hundred
hidden services, which then link to each other in order to prove their credibility. To
further support their claim, they also link to the bitcoin explorer mentioned above,
pointing to transactions that support their case. This increases the distinct incoming
link count enormously (The effects of this will be discussed in Section 3), therefore the
prioritization scheme favors pages from the scamming subnetwork. This leads to a state
where our crawler is only downloading content from a subnetwork of the darknet for
hours and not making any progress on the rest of the darknet. Second, about 1.2 million
paths have only incoming links from one single hidden service. At this stage a secondary
prioritization scheme is required, since the prioritized scheme becomes inefficient when
it returns 1.2 million candidates to be scheduled for download.
– Iterative: This strategy is the result of an attempt to make equal progress on all
hidden services as well as on the whole network. First we make a pass through all
hidden services and remember the ones that still have uncrawled paths available. In a
second phase we then choose exactly one path to be downloaded per hidden service. In
order to select a path for download in the second phase, we use the distinct incoming
count as secondary measurement. In case of a draw on the distinct incoming count, we
just pick the path we found earlier. This strategy is computationally and I/O intensive,
as it needs to do two passes over the data, once over all hidden services and once over all
paths. New hidden services are considered as soon as the pool needs to be repopulated.
This strategy is then iteratively applied to fill the pool.
In the development process, we first started off with the naïve approach and evolved
from there in the order described above. For data collection, we applied a combination of
the strategies described. In the beginning of the crawling process, we used the new-hidden-
service-first strategy to get an initial broad scan across the available hidden services. We
then switched to the iterative strategy, since it progressed best on the network. However,
we noticed that towards the end of the crawling process most hidden services do not have any
available paths, and thus at this point the iterative strategy is unnecessarily complex. Since
the number of paths available per hidden service as well as the distinct incoming counts
are of the same order of magnitude for all hidden services, the random or prioritized strategies
seem more suitable for this phase.
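To make the repopulation step concrete, the following sketch shows how the iterative selection (one path per hidden service, preferring the highest distinct incoming link count) could be expressed as a single Postgres query. Table and column names are our assumptions for illustration and need not match the actual schema.

```typescript
// Illustrative sketch of the iterative repopulation step: for every hidden service that
// still has uncrawled paths, select exactly one path, preferring the path with the
// highest count of distinct incoming hidden services.
import { Pool } from "pg";

const db = new Pool(); // connection settings via standard PG* environment variables

async function repopulatePool(limit: number): Promise<{ baseUrlId: number; path: string }[]> {
  const { rows } = await db.query(
    `SELECT DISTINCT ON (p.base_url_id)
            p.base_url_id AS "baseUrlId", p.path
       FROM paths p
      WHERE p.last_successful_download IS NULL
        AND p.in_progress = false
      ORDER BY p.base_url_id, p.distinct_incoming_count DESC
      LIMIT $1`,
    [limit]
  );
  return rows;
}
```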
Running our crawler using these combinations of prioritization schemes, we observe that
the connected component of the darknet which we have explored converges sufficiently after
7 hops from our bootstrapping list, as seen in Figure 2.
2.3 Extracting URIs
In order to store all the downloaded content as well as the link structure, we use a relational
Postgres database. First of all, the baseUrls table stores all the hidden service addresses
and a denormalized count of the number of hits, which describes how often this hidden
service was referenced. (Denormalization is an attempt to increase access speeds by storing
the result of a calculation persistently.)
We now need the path to specify a full URI: We introduce the paths table, which is keeping
the state of an entry. In this role, the paths table contains a flag indicating if a download for
this specific path is in progress, a timestamp indicating when the last successful download
finished, an integer indicating at which depth in the graph the path was found, the path
itself, and a possible subdomain. Additionally, it contains the denormalized count of distinct
incoming links to allow for efficient use by our prioritization scheme.

Type of Data                                  Count
Hidden service addresses found                34,714
Hidden services responding                    10,957
Hidden services returning useable content     7,566
Paths                                         7,355,773
Links                                         67,296,302
Downloaded paths                              1,561,590
Content (useable)                             667,356
Table 2. Detailed numbers of our crawler.
To store the downloaded content, we introduce the contents table, which stores the
content itself as well as some meta information about the content such as the HTTP status
code and the MIME type. Furthermore, we use a crawl timestamp to track the changes of
the network over time while allowing for multiple downloads of the same path.
Last, we introduce the link table that stores the topology of the network. The link
table establishes an n-to-m mapping between entries of the paths table, since one page may
contain multiple links to other pages, which can be uniquely identified by their path and the
base URL.
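The schema described above can be summarized by the following rough TypeScript model. Field names are our reading of the text and are only illustrative; the actual Postgres columns may differ.

```typescript
// Rough model of the storage schema described in the text (illustrative field names).

interface BaseUrl {
  id: number;
  onionAddress: string;          // hidden service address
  hits: number;                  // denormalized count of references to this hidden service
}

interface Path {
  id: number;
  baseUrlId: number;             // foreign key into baseUrls
  subdomain: string | null;
  path: string;
  inProgress: boolean;           // a download for this path is currently running
  lastSuccessfulDownload: Date | null;
  depth: number;                 // graph depth at which the path was found
  distinctIncomingCount: number; // denormalized, used by the prioritization scheme
}

interface Content {
  id: number;
  pathId: number;
  crawlTimestamp: Date;          // allows multiple downloads of the same path over time
  httpStatusCode: number;
  mimeType: string;
  body: string;
}

interface Link {
  fromPathId: number;            // n-to-m mapping between paths
  toPathId: number;
}
```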
3 Darknet Structure
In this section, we present the results of the crawling process and analyze the darknet
structure. We show that at least 15% and at most 50% of the darknet is visible. In addition
we show that this visible part is well connected.
Statistics
the hidden wiki (zqktlwi4i34kbat3.onion), where a plethora of services offer IRC chat, FTP sharing or P2P sharing
and require both a different communication protocol and different ports to be accessed.
Additionally, we noticed that the network is highly volatile; about 30% of all hidden services
running an HTTP(S) server switch from online to offline and vice versa within a week, as
depicted in Figure 3. Thus, some of the stale links we followed during the crawling process
possibly point to a server that is going to be online again in the future.

Downloading content of darknet websites that include images or videos is potentially illegal.
As we already mentioned, during downloading we detected the MIME type to filter out the
potentially illegal content and stored the distribution across MIME types. In total, we found
134 different MIME types, of which the vast majority is of type text/html, as expected. The
second most common MIME types we encountered were the image types jpeg and png, which
is not surprising since almost every webpage has at least an icon that is linked somewhere on
the page. The top ten MIME types are listed in Table 3 and a more extensive list of 39 MIME
types is available in Appendix A.

Top Ten MIME Types Returned    Share of Pages
text/html                      85.33%
image/jpeg                     5.28%
image/png                      3.23%
text/plain                     0.85%
application/octet-stream       0.72%
application/zip                0.55%
application/rss+xml            0.48%
application/json               0.47%
image/gif                      0.44%
application/epub+zip           0.27%
Table 3. Top ten MIME types. Text and images contribute about 95% of all content.
Network Structure
We also studied the network structure (captured links and hidden services) and observed
that the darknet behaves like a scale-free network, similarly to the clearnet and various
social networks. In Figure 4, we present a snapshot of the visible darknet.
The degree distribution of a scale-free network follows a power law, and the average path
length of the network is proportional to ln N / ln ln N. The diameter of the network is 11. In
Figure 5, it is evident that the degree of the hidden services is power law distributed, except
for an outlier which stems from a bitcoin scamming scheme, closely interlinking a high number
of hidden services in an attempt to prove their credibility. Moreover, we note that the
calculated average path length of 4.1 is close to ln 10957 / ln ln 10957 = 4.17. Thus, the
structure of the visible part of the darknet seems similar to the structure of the clearnet, and
typical approaches such as crawlers work well on both networks.

Fig. 5. A histogram depicting the count of pages having a particular number of incoming links.
Figure 5 depicts the distribution of incoming link counts. Most paths can only be
reached with a single hyperlink. In general the distribution of hyperlinks per path follows
a power law distribution. There is an obvious outlier: Several thousand paths have exactly
Fig. 4. A visualization of approximately 10% of the visible darknet. The size of a node represents how
large the part of the network reachable from it is. The color indicates the ranking when applying PageRank;
darker (violet) nodes are the ones with the highest rank.
196 distinct incoming links. This is a bitcoin scamming scheme that tries to increase its
credibility by spinning up multiple hidden services which all link to each other and claim
that the provided service is valid. Many of these also point to the same bitcoin transactions
on a bitcoin explorer site to prove their claim, explaining the second, smaller outlier.
Comparing Figure 5 with Figure 6,
we find that measuring the in-degree is
much more stable than measuring the
out-degree. This is expected to some
degree since the incoming links are typ-
ically more resistant to manipulation.
On the contrary, the outgoing links
are fully controlled by the provider of
a hidden service, and thus a single
page can have arbitrarily many outgo-
ing links.
Fig. 6. A histogram depicting the count of pages having a particular number of outgoing links.

Our most surprising structural finding is that the visible part of the darknet is well connected.
While examining the degree distribution, we noticed that the
network contains hubs, web sites that feature a large number of hyperlinks. These hubs
clearly tie the network together, as they connect sites that do not necessarily have anything
in common. One may expect that removing important hubs will partition the darknet into
many small components. However, rather surprisingly, even removing all the largest hubs
(with degree greater than 1000) does not disconnect the network.
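A robustness check of this kind can be reproduced on the captured link graph by dropping all nodes above a degree threshold and measuring the largest remaining connected component. The sketch below is our illustration of such a check, not the analysis code used for the paper.

```typescript
// Sketch: remove all hubs with degree > threshold from an undirected link graph and
// report the size of the largest remaining connected component (illustrative only).

type Graph = Map<string, Set<string>>; // onion address -> neighbouring onion addresses

function largestComponentWithoutHubs(graph: Graph, maxDegree = 1000): number {
  const kept = new Set([...graph.keys()].filter(v => (graph.get(v)?.size ?? 0) <= maxDegree));
  const seen = new Set<string>();
  let largest = 0;

  for (const start of kept) {
    if (seen.has(start)) continue;
    // BFS over the subgraph induced by the kept (non-hub) nodes
    let size = 0;
    const queue = [start];
    seen.add(start);
    while (queue.length > 0) {
      const v = queue.shift()!;
      size++;
      for (const w of graph.get(v) ?? []) {
        if (kept.has(w) && !seen.has(w)) {
          seen.add(w);
          queue.push(w);
        }
      }
    }
    largest = Math.max(largest, size);
  }
  return largest;
}
```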
In Table 5, we present the top ten hidden services ranked by the number of incoming
links from unique hidden services. First, we observe that one market place is clearly dom-
inant in this list. Moreover, we notice that the second most linked hidden service provides
hosting and related services for darknet pages for free. In exchange many of the pages
running on this hoster’s infrastructure link back to the hidden service. The third most
linked hidden service is the same blockchain explorer that breaks the random strategy, as
mentioned in subsection 2.2. We found that the bitcoin scamming scheme that breaks the
prioritized strategy links to this particular bitcoin explorer. The bitcoin scamming scheme
accounts for more than 50% of the distinct incoming links of the bitcoin explorer. However,
the most remarkable observation is that a library made it into the list of the most referenced
services.
Hub Name URI Size
DreamMarket [Multiple] 2130
Daniel’s Hosting dhosting4okcs22v.onion 889
Blockchain Explorer blockchainbdgpzk.onion 523
The Tor Mail Box torbox3uiot6wchz.onion 280
DuckDuckGo (Search engine) 3g2upl4pq6kufc4m.onion 166
Deep Dot Web (News) deepdot35wvmeyd5.onion 151
Matrix Trilogy (Image sharing) matrixtxri745dfw.onion 139
TORCH: Tor Search xmh57jrzrnw6insl.onion 134
Hidden Wiki zqktlwi4fecvo6ri.onion 126
Imperial Library (Books) xfmro77i3lixucja.onion 124
Table 5. Top ten hidden services ranked by the number of incoming links from unique other hidden
services.
4 Analysis Methodology
Given the size of the collected data set of 667,356 paths, we relied on automated preprocessing
and machine learning techniques to classify their contents. The main modules used to achieve
this are the preprocessor and the classifier, described in detail below.
The goal of the classifier is twofold: Firstly, to determine the legality of a particular
piece of content, which we express as a binary decision. Secondly, to classify the content
into a particular category describing its topic. The modules we developed to perform these
tasks are part of our open source release and are described in detail below.
The preprocessor has two main tasks to accomplish: First it linguistically preprocesses
the contents and then builds the indexes needed for classification. Initially, we extract the
clean text from the content. For this purpose, we use cheerio [8] to strip any HTML tags
and other script tags we find in the input string. After getting a clean string, we identify
the physical language of the content, in order to decide which linguistic preprocessing will
be applied to the string. We employ franc [39] and the minimal set of 82 languages franc
provides. Franc also offers more inclusive language sets; however, a more inclusive language
set might produce worse language classification results, since it is too fine grained (e.g.,
Scottish English vs. British English).
We found that approximately 66% of the content is in English, followed by 11% in
Russian. The physical language distribution is depicted in Figure 8. In total, we found 60
different languages. For a more extensive list of 26 languages, see Appendix B. Since the
vast majority of content is written in English, we solely consider English language in the
classification process.
Fig. 8. Language distribution in the visible darknet.
For the following steps we need an array of terms instead of a string; thus we tokenize
the content string into words. Now that the language of the content is known, we remove
stop words [23] depending on the language. As it turns out, some of the content contains
only stop words, resulting in empty content after stop-word removal. Such content
is immediately marked as empty.
After stop word removal, we normalize the content further and apply stemming [40]
based on the Porter stemmer [33] when appropriate. Previous work [15,16,19,36] suggests
that stemming does not work equally well in all languages, hence we only apply stemming
for English content. Lemmatization (determining the canonical form of a set of words
depending on the context) is not considered since previous work showed that it can even
degrade performance when applied to English content.
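Put together, the preprocessing chain can be sketched as follows with the cited libraries (cheerio [8], franc [39], stopword [23], stemmer [40]). Exact import styles and APIs depend on the package versions, so treat this as an approximation of the pipeline rather than the actual implementation.

```typescript
// Sketch of the linguistic preprocessing chain: strip markup, detect the language,
// tokenize, remove stop words, and stem English content.
import * as cheerio from "cheerio";
import { franc } from "franc";
import { removeStopwords, eng } from "stopword";
import { stemmer } from "stemmer";

function preprocess(html: string): { language: string; terms: string[] } {
  // 1. Strip HTML and script tags and keep the visible text only.
  const $ = cheerio.load(html);
  $("script, style").remove();
  const clean = $.root().text().replace(/\s+/g, " ").trim();

  // 2. Detect the physical language (ISO 639-3 code, e.g. "eng", "rus", "und").
  const language = franc(clean);

  // 3. Tokenize, remove stop words, and stem (English only, as described above).
  let terms = clean.toLowerCase().split(/\W+/).filter(t => t.length > 0);
  if (language === "eng") {
    terms = removeStopwords(terms, eng).map(stemmer);
  }
  return { language, terms };
}
```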
Next, we build a positional index, which allows us to choose whether the classification
should happen on a list of words, a bag of words or a set of words approach. Since we want
to keep the most information possible and employ an SVM classifier, we decided to use a
bag of words approach for the classification of labels. However, the software provides the
functionality for all three approaches, which makes it easier to experiment with different
machine learning techniques on the collected data set. For the bag of words model, we need
to store the frequency of each word; the frequency can be extracted from the positional
index, which is discussed in the following subsection. In addition, we keep the extracted
clean content string to facilitate manual classification. Thereby, the clean content string
can be directly presented to a human classifier, in order to get the labeled training data
set.
We store the extracted content in the table cleanContents with a foreign key to the
original content. This table has a one-to-one assignment for both the language as well as
the primary class label to the labels table. Since the legality of a document is expressed
as a binary decision, we directly use a boolean column to store the legal value, where true
corresponds to legal, and false to illegal.
The goal of the training phase is to employ active learning [41]; hence, it is essential
to measure the certainty of the classifier’s decisions. For this purpose, we insert a column
to store the probability output of the classifier for both label and legal classification. Now,
we can apply active learning techniques in the classification phase.
A positional index stores information about which term appears in which document
at which positions. The positional index itself is constructed with three different tables.
The first table is the list of terms, which has a one-to-many mapping to the postings
table. A posting is a pair, mapping a term to a document. Therefore each posting in the
posting table has a foreign relation to a clean content and a term, thus creating the tuple
<term ID, cleanContent ID>. In order to store a positional index, every posting points
to multiple positions, describing the positions at which the referenced term appears in
the specified clean content.
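The in-memory analogue of this three-table structure is a nested map from terms to postings to positions; the sketch below is only meant to illustrate the relationships and how term frequencies for the bag-of-words model fall out of it.

```typescript
// Minimal in-memory analogue of the positional index (terms -> postings -> positions).
// The actual system stores this in three Postgres tables.

type DocumentId = number;

// term -> (cleanContent ID -> positions of the term in that document)
type PositionalIndex = Map<string, Map<DocumentId, number[]>>;

function addDocument(index: PositionalIndex, docId: DocumentId, terms: string[]): void {
  terms.forEach((term, position) => {
    let postings = index.get(term);
    if (!postings) {
      postings = new Map();
      index.set(term, postings);
    }
    const positions = postings.get(docId) ?? [];
    positions.push(position);
    postings.set(docId, positions);
  });
}

// The bag-of-words frequency used by the classifier can be derived directly from the index.
function termFrequency(index: PositionalIndex, term: string, docId: DocumentId): number {
  return index.get(term)?.get(docId)?.length ?? 0;
}
```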
As discussed in [27,38], a Support Vector Machine (SVM) is well suited for text classifi-
cation, mainly because the SVM efficiently copes with high dimensional features that are
typical for text classification. Moreover, an SVM is easy to train and apply, compared to
different types of neural networks. We use k-fold cross-validation and compare the output
scores to decide which parameters and kernels perform best. The parameters that
deliver the best results are then used to train the model.
We begin the classification process with a very small training set of about 200 entries,
in an attempt to minimize the size of the training set, and then apply active learning [41].
Active learning requires knowledge of the classifier's uncertainty for each sample,
since in every iteration we manually label the samples with the highest uncertainty. The
idea is to manually label the entries with the highest entropy, so as to convey the most
information to the classifier.
These entries are then added to the data set, and the previously trained classifier
is trained again on the updated data set. Since the original training set was small, we
reevaluated different SVM classifiers from time to time to ensure we are still using the best
possible configuration. We found that a linear kernel worked best for both classification
problems. With this approach, we reached an F1 score of 60% and an accuracy of 75% for
the label classifier, and an F1 score of 85% and an accuracy of 89% for the legal classifier.
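Schematically, each active learning round looks as follows. The Classifier interface is hypothetical and merely stands in for whichever SVM implementation is used; the entropy-based ranking is the uncertainty sampling described in [41].

```typescript
// Schematic uncertainty-sampling round (hedged illustration; the Classifier interface
// is hypothetical and not part of the published code).

interface Classifier {
  train(samples: { terms: string[]; label: string }[]): void;
  predictProbabilities(terms: string[]): Map<string, number>; // label -> probability
}

// Shannon entropy of a predicted label distribution: high entropy = uncertain sample.
function entropy(probs: Map<string, number>): number {
  let h = 0;
  for (const p of probs.values()) if (p > 0) h -= p * Math.log2(p);
  return h;
}

function activeLearningRound(
  clf: Classifier,
  labeled: { terms: string[]; label: string }[],
  unlabeled: { id: number; terms: string[] }[],
  batchSize: number
): { id: number; terms: string[] }[] {
  clf.train(labeled);
  // Rank unlabeled samples by entropy and hand the most uncertain ones to a human labeler.
  return [...unlabeled]
    .map(s => ({ sample: s, h: entropy(clf.predictProbabilities(s.terms)) }))
    .sort((a, b) => b.h - a.h)
    .slice(0, batchSize)
    .map(x => x.sample);
}
```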
During the training of the classifier and the examination of preliminary results, we
observed that some pages did not contain enough information to be assigned a label. In this
case, the Empty label was assigned and the page was excluded from the corpus. Pages not
containing enough information were either actually (or almost) empty, often containing only a
placeholder like "coming soon", or were login pages without further information about what a
user is logging into or for what service the user could register. If other pages of the same hidden
service are available, they can be sufficient to assign the hidden service a label. If no other
pages are available, or they do not contain enough information for labeling, we are not able to
classify such a hidden service.
The labels are chosen in an attempt to represent the most common categories and
answer often posed questions about content found in the Tor network. We always labeled
the contents with the most specific label possible. For example, if we have a page that is
a blog about information security, this page would be labeled “Security”.
In order to label a hidden service, we take the label with the maximum number of labeled pages
on that hidden service and assign this label to the hidden service as well. For legal classification,
we assign illegal to a hidden service as soon as at least 10% of its contents are illegal. This
threshold is used to reduce false positives from the classification process. In Figure 9, we
show how the number of illegal hidden services changes with different thresholds. Lower
values than 1% only introduce minor changes to the result. Overall, we observe that the
number of illegal hidden services is quite stable.
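The aggregation rules described here amount to a majority vote over page labels plus a threshold on the share of illegal pages; the sketch below is our paraphrase of those rules, with the 10% threshold as default.

```typescript
// Sketch of the per-hidden-service aggregation: majority page label, and "illegal"
// as soon as at least `threshold` of the non-empty pages are classified illegal.

interface PageResult { label: string; illegal: boolean }

function aggregateHiddenService(pages: PageResult[], threshold = 0.1):
    { label: string | null; illegal: boolean } {
  const nonEmpty = pages.filter(p => p.label !== "Empty"); // empty pages are not judged
  const counts = new Map<string, number>();
  let illegalPages = 0;

  for (const page of nonEmpty) {
    counts.set(page.label, (counts.get(page.label) ?? 0) + 1);
    if (page.illegal) illegalPages++;
  }

  // Majority label over the non-empty pages (null if nothing classifiable).
  let label: string | null = null;
  let best = 0;
  for (const [l, c] of counts) if (c > best) { label = l; best = c; }

  return {
    label,
    illegal: nonEmpty.length > 0 && illegalPages / nonEmpty.length >= threshold,
  };
}
```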
Fig. 9. Estimated number of hidden services containing illegal content for different thresholds.
The labels are listed below, each with a description and examples.
– Activism: Any content that is conveying activism, e.g. human rights activism.
– Adult: Pornographic content of any type except paedophilic content. It does not have
to contain media content to be classified as Adult; textual descriptions are also included
if they are clearly of pornographic nature.
– Art: Anything ranging from literature to images (either photographs or computer gen-
erated) to whole web pages being an art project. This label does not include pages that
share books or other pieces of art; such a page would belong to the Sharing label.
– Blog: Any web page that has a clearly defined set of authors that can publish content
on the page and the content is not generated interactively as in a forum. Any web page,
even if not in typical blog form, is counted here if it fulfills the above condition and
cannot be labeled more specifically, including marketing or news content.
– Counterfeit: Any page that offers or describes how to create counterfeits is listed here.
Most common topics are passports and credit cards.
– Drugs: Any page where the main topic is drugs. This does not only include market-
places selling drugs but also forums and information pages about drugs, including drug
prevention pages.
– Empty: The page does not contain enough information to be labeled with some other
label. This category mainly includes empty pages, pages with too short messages and
login pages that have no reference to the service they belong to. Furthermore, pages
containing an error code (e.g. 404) in the returned HTML were also assigned the Empty
label.
– Exchange: Any currency exchange page, typically between cryptocurrencies or from
cryptocurrency to fiat money and back.
– Forum: Either forums or chats. We consider them to belong to the same label since in
both cases users are populating the page with content and responding to each other.
The only difference is of a temporal nature, how fast one expects a response. This label
is only assigned when no other label fits better. For example, a forum focused on drugs
would be classified into the Drugs label.
– Fraud: Offers that are of a fraudulent nature, such as bitcoin frauds that attempt to
trick the user into sending them bitcoin for some promised higher return value.
– Gambling: Mainly online casinos running with cryptocurrencies or other betting games
that promise to pay out money for certain events.
– Hacking: Contains anything related to breaching computer systems, independent of
being black hat or white hat hacking.
– Hosting: Offers to host a web page, darknet as well as clearnet.
– Laundry: Pages offering money laundering services, often for cryptocurrencies stem-
ming from dishonest sources.
– Hub: Hubs are pages listing links of darknet sites, such as “The Hidden Wiki”.
– Mail: E-mail provider, anonymized or not.
– Market: Any market selling goods that do not qualify for other labels. Also markets
that sell multiple goods of different types.
– Murder: Pages promising to execute a murder in exchange for money. Some of these
pages could also be considered scams as indicated in news and blog publications [24].
However, we introduce this label separately since hitmen services are often linked to
the darknet.
– Paedophilic: Any page containing paedophilic content, either media content or textual
representations of fantasies. This label also includes forums and support groups for
paedophiles.
– Phishing: Fake pages that try to grab login credentials for different services from the
user (e.g. Facebook or Google).
– Search Engine: Any page offering to search for darknet hidden services. Often these
keep a manually edited index.
– Security: Pages containing information about computer security and personal security.
– Seized: Pages taken offline by law enforcement agencies.
– Service: Any service that does not fit into one of the other labels, e.g. a darknet
calendar.
– Sharing: Any page that shares content, either illegal (pirating) or legal, e.g., Dropbox.
– Social Network: Social networks similar to Facebook or Twitter. Facebook itself is
accessible through the darknet [2] and thus belongs to this label.
– Software Development: Anything that has to do with software or its development.
We consolidated selling software and development into one label since in most cases,
the page serves as both the marketing page as well as the git repository or test page.
– Terrorism: Anything related to extremism and terrorism. We do not classify the type
of terrorism.
– Weapons: Selling or discussing goods around weapons, be it guns, rifles or other devices
that can kill or harm people.
– Whistleblowing: Pages similar to Wikileaks. Pages that contain a drop off for whistle-
blowers also qualify for this label.
between legal and illegal. Since our goal is to provide a rough estimate on the distribution
of the content, we follow a rather conservative approach when assessing legality: when the
purpose of the content is an obviously illegal action, the content is classified as illegal, even
if it might not technically be. One might say that this classifier is slightly biased towards
classifying contents as illegal.
5 Darknet Content
In this section, we present the results of the content classification, both in terms of label and
legality. The largest single share of the visible darknet consists of empty pages; just above 30%
of the content is classified with this label. For the legal classification of both pages and hidden
services, we
exclude the content with an empty label. Such content usually does not contain enough
information for labeling, as we cannot assess the intended content from a vanilla login page;
also regarding legality, empty pages are impossible to judge, and hence not considered.
Label Classification
In Figure 10, we present the top thirteen labels of darknet pages, while the complete list can
be found in Appendix C.1. We notice that, quite unexpectedly, Blog is the top category
in the classification per page. However, as we discussed earlier, we classified as Blog all
pages that did not fit any other label, such as a company blog presenting its new
employees or a news article about an arbitrary topic.
On the other hand, the classification per hidden service was more challenging since
typically most hidden services have multiple pages with sometimes different labels. To
deal with this issue we decided on the following classification strategy: we counted how
many pages each hidden service hosts from each label and assigned to the hidden service
the label with the highest count. Consolidating all the labeled data per hidden service, as
depicted in Figure 11, we observe only minor changes in the relative occurrence of labels.
The complete list can be found in Appendix C.2.
The top category in classification per hidden service is Software, which also seems sur-
prising. Exploring further, however, we observed a specific pattern which justifies the high
occurrence of Software labels: first of all, many pages rely on open source software, which
is later developed further. To manage such developments, the pages introduce a version
control system which is often hosted by the same hidden service as the actual service. In
addition, many pages store information for the operation team in status information pages
which are also classified in the Software category. Both version control systems and opera-
tion tools often contain a large number of sub pages, so Software may be overrepresented.
Furthermore, the way we accumulate the labels of single pages in order to label the hidden
services also contributes to this result. A different feature selection could probably improve
the results and allow the classifier to distinguish between the main topic of a hidden service
and its git repository.
If we compare the results by page and by hidden service, we see two major differences.
Fraud makes up only 1.82% when grouping by page; however, it is the second largest
category when grouping by hidden services, with 8.13%. We find that many hidden services
that try to scam their users contain only very few sub pages and rely mainly on external,
more trusted services to place their links and faked testimonies. In contrast, hubs make up
11.1% of all pages, while they only constitute 3.8% of all hidden services. We observe
that many link lists have a large number of sub pages, as is to be expected.
Legal Classification
After excluding all empty pages, we find that about 60% of the collected pages are legal,
as illustrated in Figure 12; this seems an unexpectedly high number for the darknet. To
explain this high number of legal pages, we look more closely into the assigned labels. For
example, Software is often considered legal because the development of software such as a
market place is legal; however, the goods sold in such a marketplace might be illegal. Thus,
the label we assigned might be misleading in assessing the purpose of the page.
Fig. 12. Approximately 60% of all pages are legal in the visible part of the darknet.
Fig. 13. When accumulating the pages by hidden service, we find that above 50% of all hidden services
contain illegal content.
If we accumulate the labels by hidden service, the number of hidden services contain-
ing illegal content slightly increases compared to the corresponding number of pages, as
depicted in Figure 13. However, we found that a lot of forums have dedicated threads for
the illegal content to enable users to choose in which discussion to participate. When clas-
sifying the content of the forum, some part contributes to the legal and some to the illegal
category. However, we classify the hidden service as illegal if at least 10% of the content is
illegal. Thus, the classification by hidden service is biased towards the illegal category; this
is the most probable cause of the percentage difference between pages and hidden services
regarding legality.
Note that all the results apply to content written in English only, cf. Figure 8.
6 Related Work
Starting with the rise of the web, crawling has been a topic of interest for the academic
community as well as society in general. Some of the earliest crawlers [14,21,10,32] were
presented just shortly after the web became public. However, since the early web was
smaller than the visible darknet today, clearnet crawlers did not face the same issues as
they do today. Castillo [6] and Najork and Heydon [28] addressed a wide variety of issues,
e.g. scheduling. Castillo found that applying different scheduling strategies works best.
Both acknowledged the need to limit the number of requests sent to the targets, and
described several approaches similar to ours.
Contrary to aforementioned work, we are interested in exploring the darknet. Previous
work [12,34] has estimated the size of the Tor darknet using non-invasive methods that
allow us to measure what portion of the darknet we have explored. Similarly to our work, Al
Nabki et al. [27] want to understand the Tor network; they created a dataset of active Tor
hidden services and explored different methods for content classification. However, instead
of using a real crawler, they browsed a big initial dataset as seed, and then followed all
links they found exactly twice. In turn, Moore and Rid [25] went to depth five, which seems
to be good enough to catch most of the darknet, as their results are similar to ours. Our
crawler does a real recursive search and as such understands the darknet topology in more
detail, in contrast to both of these previous works.
Other work applied different approaches to data collection for classification. Owen et
al. [30] measured the activity on each page of the darknet by operating a large number of
Tor relays, a measure which is not possible with our methodology. Similarly to our work,
Biryukov et al. [4] have classified darknet content with findings comparable to ours. Their
methodology relied on a Tor bug which was fixed in 0.2.4.10-alpha and can no longer be
used for data collection. Unlike both of these works which can have adverse effects on the
anonymity of Tor users, our method is less invasive as it only utilizes publicly available
information and does not require operating any Tor relays or participating in the DHT.
Furthermore, Terbium labs [13] proposed content classification via a small but random
sample from a crawled collection. Despite the small size of the dataset, their results are
also comparable to ours.
While we focus on the web, Chaabane et al. [7] wanted to know what other protocols are
using Tor, by analyzing the traffic on Tor’s exit routers. They concluded that the largest
amount of transmitted data stems from peer-to-peer file sharing traffic. As shown by McCoy
et al. [22], measurements of traffic can be easily skewed by data-heavy applications such
as file sharing. This work estimates that HTTP makes up for approximately 92% of all
connections.
Content classification is also a topic in the clearnet, and depending on the size of the
collected dataset, automated content classification is required. Both Verma [38] and Kim
and Kuljis [18] analyzed existing techniques and proposed different methods to adapt
content analysis to Web-based content. Xu et al. [41] compared different methods to improve
SVM performance, including the active learning approach. They showed that an active
learning approach coupled with an SVM classifier can result in better accuracy with fewer
samples than other compared methods. These results were applied in the machine learning
process we employed, combining active learning and SVM estimators.
Specifically for the darknet, Fidalgo et al. [11] described how to classify illegal activ-
ity on the darknet based on image classification. While adding images as an additional
feature could improve the performance of our classifier (especially for pages that did not
contain enough textual information), we abstained from it because it requires downloading
potentially illegal images.
A more specific content analysis was performed on marketplaces residing on the darknet.
Christin et al. [9] and Soska et al. [37] concentrated on a few large markets and developed
strategies to circumvent barriers such as captchas and password entries. They used this
data to estimate sales volumes, product categories and activity of vendors.
7 Conclusion & Future Work
We performed a thorough analysis of both the structure and the content of the visible
part of the darknet. To the best of our knowledge, we were the first to systematically
(recursively) explore the network topology of the visible darknet. To that end, we developed
a novel open source spider which captures the graph of hyperlinks between web pages.
Remarkably, we found that the darknet is well connected and behaves like a scale-
free network, similarly to the clearnet. The degree distribution follows a power law and
the diameter of the network is 11. Furthermore, we identified the existence of large hubs
connecting the network. Although these hubs have high degrees, their impact on the con-
nectivity of the network is not as significant as anticipated; at least half of the darknet remains
well connected even if we remove all such hubs. Last but not least, we observed that the
darknet is highly volatile, and experiences a churn of about 30% within a single week.
We employed supervised learning techniques to classify the content of the darknet
both per page and per hidden service. Specifically, we used active learning and an SVM
classifier to label the content in terms of physical language, topic and legality. Surprisingly,
60% of the pages and approximately half of the hidden services in the darknet are legal.
Furthermore, the label classification indicates that (contrary to popular belief) about half
of the pages in the darknet concern licit topics such as blogs, forums, hubs, activism etc.
On the other hand, we revealed that a significant amount of hidden services perform fraud.
Our crawler was able to uncover a significant but small (15%) percentage of the darknet
as compared to Tor Metrics estimates. Future work can address the reason we were unable
to penetrate the rest of the darknet. We already recommended several mechanisms in
this direction, e.g. looking for non-HTTP(S) services that could be running on the darknet
as well as employing the crawler repeatedly for a longer period of time to increase the
probability of discovering services which are often offline.
The machine learning techniques can also be improved to obtain better results. In
particular, we classified hidden services into labels by taking a majority of labels of their
individual pages. However, treating hidden services independently could probably yield
better classification results. Furthermore, taking into account additional features in the
classification process could significantly improve the outcome. Explicitly handling special
cases such as git repositories and excluding them from the accumulation of the labels could
have significant impact on the results. Moreover, notable improvement could arise from
the use of images as an additional classification feature when the content is insufficient;
however, this method can only be adopted if the legality of the image can be automatically
determined.
Lastly, our collected data is comprehensive enough to be useful if made available to
the public. To that end, future work could build an open source Tor search engine that
allows searching the indexed data, similarly to rudimentary closed source solutions such as
TORCH [3]. Our indexing techniques could also benefit existing open source search engine
attempts such as Ahmia [1]. An interactively browsable version of our graph that presents
an in-depth, comprehensible view of the darknet topology could also be a valuable
extension for future work.
References
1. Ahmia.fi - hidden service search engine. Online [Search Engine].
2. Facebook. Online [Tor network].
3. Torch tor search. Online [Search Engine].
4. A. Biryukov, I. Pustogarov, F. Thill, and R.-P. Weinmann. Content and popularity analysis of tor
hidden services. In 2014 IEEE 34th International Conference on Distributed Computing Systems
Workshops. IEEE, 2014.
5. Z. Boyd. tor-router. Online [Software].
6. C. Castillo. Effective web crawling. ACM SIGIR Forum, 39(1):55, jun 2005.
7. A. Chaabane, P. Manils, and M. A. Kaafar. Digging into anonymous traffic: A deep analysis of the
tor anonymizing network. In 2010 Fourth International Conference on Network and System Security.
IEEE, sep 2010.
8. cheeriojs. cheerio. Online [Software].
9. N. Christin. Traveling the silk road. In Proceedings of the 22nd international conference on World
Wide Web - WWW ’13. ACM Press, 2013.
10. D. Eichmann. The RBSE spider — balancing effective search against web load. Computer Networks
and ISDN Systems, 27(2):308, nov 1994.
11. E. Fidalgo, E. Alegre, V. González-Castro, and L. Fernández-Robles. Illegal activity categorisation
in DarkNet based on image classification using CREIC method. In International Joint Confer-
ence SOCO’17-CISIS’17-ICEUTE’17 León, Spain, September 6–8, 2017, Proceeding, pages 600–609.
Springer International Publishing, 2017.
12. K. L. George Kadianakis. Extrapolating network totals from hidden-service statistics. resreport 2015-
01-001, Tor Project, 2015.
13. C. Gollnick and E. Wilson. Separating fact from fiction: The truth about the dark web. Online.
14. M. Gray. World wide web wanderer, 1993. Discontinued.
15. D. Harman. How effective is suffixing? Journal of the American Society for Information Science,
42(1):7–15, jan 1991.
16. D. A. Hull. Stemming algorithms: A case study for detailed evaluation. Journal of the American
Society for Information Science, 47(1):70–84, 1996.
17. G. Kadianakis, C. V. Roberts, L. M. Roberts, and P. Winter. "Major key alert!" Anomalous
keys in Tor relays. 2017.
18. I. Kim and J. Kuljis. Applying content analysis to web-based content. Journal of Computing and
Information Technology, 18(4):369, 2010.
19. R. J. Krovetz. Word sense disambiguation for large text databases. phdthesis, University of Mas-
sachusetts, 1996.
20. A. Loconte. Orbot-logo.svg. Online, CC BY-SA 4.0.
21. O. McBryan. GENVL and WWWW: Tools for taming the web. Computer Networks and ISDN
Systems, 27(2):308, nov 1994.
22. D. McCoy, K. Bauer, D. Grunwald, T. Kohno, and D. Sicker. Shining light in dark places: Under-
standing the tor network. In Privacy Enhancing Technologies, pages 63–76. Springer Berlin Heidelberg,
2008.
23. F. McDowall. stopword. Online [Software].
24. C. Monteiro. Pirate dot london - assassination. Online [Blog].
25. D. Moore and T. Rid. Cryptopolitik and the darknet. Survival, 58(1):7–38, jan 2016.
26. mscdex. mmmagic. Online [Software].
27. M. W. A. Nabki, E. A. Eduardo Fidalgo, and I. de Paz. Classifying illegal activities on tor network
based on web textual contents. In Proceedings of the 15th Conference of the European Chapter of the
Association for Computational Linguistics: Volume 1, Long Papers. Association for Computational
Linguistics, 2017.
28. M. Najork and A. Heydon. High-performance web crawling. In Massive Computing, pages 25–45.
Springer US, 2002.
29. H. F. Nielsen, J. Mogul, L. M. Masinter, R. T. Fielding, J. Gettys, P. J. Leach, and T. Berners-Lee.
Hypertext Transfer Protocol – HTTP/1.1. RFC 2616, June 1999.
30. G. Owen and N. Savage. Empirical analysis of tor hidden services. IET Information Security,
10(3):113–118, 2016.
31. L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the
web. Technical Report 1999-66, Stanford InfoLab, November 1999. Previous number = SIDL-WP-
1999-0120.
32. B. Pinkerton. Finding what people want: Experiences with the WebCrawler. Proc. of the Second
International WWW Conference, 1994.
33. M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
34. T. T. Project. Tor metrics - onion services. Online.
35. T. T. Project. Tor metrics - performance. Online.
36. G. Salton. Automatic text processing: the transformation, analysis, and retrieval of information by
computer. Choice Reviews Online, 27(01):27–0351–27–0351, sep 1989.
37. K. Soska and N. Christin. Measuring the longitudinal evolution of the online anonymous marketplace
ecosystem. In USENIX Security Symposium, 2015.
38. T. Verma. Automatic text classification and focused crawling. In 2013 International Journal of
Emerging Trends & Technology in Computer Science (IJETTCS), volume 2, 2013.
39. T. Wormer. franc. Online [Software].
40. T. Wormer. stemmer. Online [Software].
41. H. Xu, X. Wang, Y. Liao, and C. Zheng. An uncertainty sampling-based active learning approach for
support vector machines. In 2009 International Conference on Artificial Intelligence and Computa-
tional Intelligence. IEEE, 2009.
A Appendix: MIME Types
MIME types of the collected content. Either the server returned a MIME type header or we
detected the type by data inspection through mmmagic [26].
MIME Type Share of pages
text/html 85.33%
image/jpeg 5.28%
image/png 3.23%
text/plain 0.85%
application/octet-stream 0.72%
application/zip 0.55%
application/rss+xml 0.48%
application/json 0.47%
image/gif 0.44%
application/epub+zip 0.27%
application/xml 0.25%
application/javascript 0.25%
application/x-fictionbook+xml 0.20%
application/x-bittorrent 0.19%
application/atom+xml 0.19%
application/x-mpegURL 0.18%
text/xml 0.17%
application/pdf 0.15%
text/css 0.15%
audio/x-wav 0.13%
application/x-download 0.08%
video/mp4 0.08%
video/webm 0.08%
application/vnd.comicbook+zip 0.05%
text/csv 0.03%
audio/mpeg 0.02%
application/rdf+xml 0.02%
application/pgp-signature 0.02%
image/jpg 0.01%
text/tab-separated-values 0.01%
application/x-tar 0.01%
application/vnd.comicbook-rar 0.01%
audio/x-scpls 0.01%
text/calendar 0.01%
application/owl+xml 0.01%
video/x-matroska 0.01%
image/svg+xml 0.01%
application/x-empty 0.01%
Other 0.06%
B Appendix: Languages in the Darknet
We found content in 60 languages. The share of pages per language is listed below.
Language Share of pages
English 66.14%
Russian 11.56%
French 5.82%
German 4.49%
Spanish 3.00%
Bulgarian 1.78%
Portuguese 1.20%
Italian 0.75%
Mandarin 0.68%
Iloko 0.61%
Undetermined 0.56%
Dutch 0.53%
Swedish 0.40%
Ukrainian 0.34%
Turkish 0.20%
Polish 0.20%
Hausa 0.14%
Madurese 0.14%
Korean 0.13%
Standard Arabic 0.11%
Japanese 0.10%
Czech 0.10%
Romanian 0.10%
Malay 0.08%
Indonesian 0.08%
Other 0.79%
C Appendix: Complete List of Labels
Hub 3.83%
Laundry 3.48%
Forum 3.45%
Whistleblowing 3.19%
Hosting 3.07%
Hacking 2.88%
Search Engine 2.02%
Pedophilic 1.64%
Gambling 1.64%
Exchange 1.26%
Seized 0.95%
Murder 0.47%
Security 0.44%
Social Network 0.28%
This work is licensed under the Creative Commons Attribution 4.0 International License.
To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or
send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.