
Seminar Report 2021-22 Deep Web

CHAPTER 1

INTRODUCTION

The deep Web (also called the Deepnet, the invisible Web, or the hidden Web) refers to World Wide Web content that is not part of the surface Web, the portion of the Web indexed by standard search engines.

Mike Bergman, credited with coining the phrase, has said that searching on the
Internet today can be compared to dragging a net across the surface of the ocean; a
great deal may be caught in the net, but there is a wealth of information that is deep
and therefore missed. Most of the Web's information is buried far down on
dynamically generated sites, and standard search engines do not find it. Traditional
search engines cannot "see" or retrieve content in the deep Web – those pages do not
exist until they are created dynamically as the result of a specific search. The deep
Web is several orders of magnitude larger than the surface Web.

Fig.1.1 Surface Web vs. Deep Web


1.1 SURFACE WEB

The surface Web is the topmost layer of the Internet and the part we use for our everyday browsing. It is the most heavily used portion of the Web: everything here is publicly accessible, it is visited by all kinds of users every day, and it is the comparatively safer part of the Internet, built for ordinary users. The surface Web is an open portal, available to anyone from any corner of the world through a regular browser, and it is the only layer whose content is almost entirely indexed by the popular search engines. In short, the surface Web is the part of the Internet that users can reach by default. We access it through popular search engines such as Google, Yahoo, Bing and DuckDuckGo, using browsers such as Google Chrome, Mozilla Firefox, Internet Explorer, Netscape Navigator, Microsoft Edge, Opera, UC Browser and others. The surface Web is the most open part of the World Wide Web; this is where websites served over HTTP/HTTPS (on TCP or UDP) with SSL (Secure Sockets Layer) or TLS (Transport Layer Security) certificates are found.

The surface Web is filled with content indexed by the search engines' automated web crawlers: website URLs, images, videos, GIFs and similar data, all stored in the large databases of the respective search engines. This content is indexed legitimately and verified by the search engines.

The irony is that no one could browse, learn, or take in everything that is available and accessible on the surface Web in a single lifetime, because the surface Web is itself a vast area of the Internet and grows constantly as more data is indexed. The more surprising fact is that, by common estimates of the Internet's layers in terms of data volume, the surface Web makes up only about 4% of the total Internet. Only this 4% is considered the surface Web, and even so it appears to us as an almost limitless layer with an uncountable amount of indexed data. The remaining roughly 96%, the deeper and hidden side of the Internet, is called the deep Web.


1.2 DEEP WEB

The deep Web is the deeper part of the Internet. The deep web, invisible web, or hidden web is the part of the World Wide Web whose contents are not indexed by standard web search engines. This is in contrast to the surface web, which is accessible to anyone using the Internet. Computer scientist Michael K. Bergman is credited with coining the term in 2001 as a search-indexing term.

It is the most sensitive part of the Internet and is not indexed by search engines; much of it is deliberately kept out of search engine indexes so that it is not exposed publicly. It is accessible only to its respective owners, that is, to those who hold the credentials or permissions needed to reach the underlying database information.

The content of the deep web can be located and accessed through a direct URL or IP address, but it may require a password or other security access to get past the public-facing pages of a website. The deep web also holds the server-side information behind surface Web sites and the huge collection of database stacks filled with sensitive data.

Fig.1.2 The Internet and the Tor network


1.3 NAMING

Bergman, in a seminal, early paper on the deep Web published in the Journal of
Electronic Publishing, mentioned that Jill Ellsworth used the term invisible Web in
1994 to refer to websites that are not registered with any search engine. Bergman
cited a January 1996 article by Frank Garcia:

"It would be a site that's possibly reasonably designed, but they didn't bother to
register it with any of the search engines. So, no one can find them! You're hidden. I
call that the invisible Web."

Another early use of the term invisible Web was by Bruce Mount and Matthew B.
Koll of Personal Library Software, in a description of the @1 deep Web tool found in
a December 1996 press release.

The first use of the specific term deep Web, now generally accepted, occurred in the
aforementioned 2001 Bergman study.

1.4 SIZE

In 2000, it was estimated that the deep Web contained approximately 7,500 terabytes of data and 550 billion individual documents. Estimates based on extrapolations from a study done at the University of California, Berkeley, show that the deep Web consists of about 91,000 terabytes. By contrast, the surface Web (which is easily reached by search engines) amounts to only about 167 terabytes; the Library of Congress, in 1997, was estimated to hold perhaps 3,000 terabytes.


CHAPTER 2

LEVELS OF THE INTERNET

Level 1: This is the most familiar part of the web, the one we use more or less daily and generally know and understand. It comprises the 'open to the public' part of the Web.

Level 2: This level is commonly known as the Surface Web. It includes services such as Reddit, Digg and temporary email. Chat boards and other socially enabled content are found at this level, since it is essentially a communications platform. It is not difficult to reach in any fashion.

Level 3: Level 3 is called the Bergie Web. It includes services beyond ordinary WWW or Web services, such as Internet newsgroups, Google locked results, FTP sites, honeypots and sites such as 4chan. If you know where you are going, this level is relatively simple to reach.

Fig.2.1 Levels of the Internet

Level 4: Level 4 is known as the Charter Web or Deep Web. Hacker groups, activist communications, banned media and other darker layers of online society are found on these websites. This is what we usually refer to as the Deep Web. Typical Web search engines cannot find the sites on this layer, and for some of them you have to be invited by an existing member in order to gain access.

Level 5: At this level, things get a little creepy. The level is known as the Dark Web, and its websites are not accessible through the normal Internet. You will need to get on the Tor network or some other private network. Dark Web sites are also referred to as Tor hidden services or onion sites. On the Tor network there is a variety of legal and illegal content. Illegal material such as bounty hunting, drugs, human trafficking, hacker exploits, rare animal trade and other black market items can be found on these sites. Whenever we refer to the Dark Web, we are normally referring to the Tor network.


CHAPTER 3

TOR NETWORK

Fig.3.1 Representation of the Tor network

Tor stands for "The Onion Router". The name covers both the Tor software installed on your computer and the network of computers that manages Tor connections. In simple terms, Tor lets you route web traffic through several other computers in the Tor network so that the party on the other side of the connection cannot trace the traffic back to you. More Tor users means more protection for your information, since your connections and sessions are routed through other users' computers. As the name implies, it creates a number of layers that conceal your identity from the rest of the world.


CHAPTER 4

INDEXING METHODS

Methods that prevent web pages from being indexed by traditional search
engines may be categorized as one or more of the following:

1.Contextual web: pages with content varying for different access contexts (e.g.,
ranges of client IP addresses or previous navigation sequence).

2.Dynamic content: dynamic pages, which are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge.

3.Limited access content: sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard, CAPTCHAs, or the no-store directive), which prohibits search engines from browsing them and creating cached copies.

4.Non-HTML/text content: textual content encoded in multimedia (image or video) files or in specific file formats not handled by search engines.

5.Private web: sites that require registration and login (password-protected resources).

6.Scripted content: pages that are only accessible through links produced by
JavaScript as well as content dynamically downloaded from Web servers via Flash or
Ajax solutions.

7.Software: certain content is intentionally hidden from the regular Internet and is accessible only with special software, such as Tor, I2P, or other darknet software. For example, Tor allows users to access websites using the .onion server address anonymously, hiding their IP address.

8.Unlinked content: pages which are not linked to by other pages, which may prevent web crawling programs from accessing the content. This content is referred to as pages without backlinks (also known as inlinks). Also, search engines do not always detect all backlinks from searched web pages.

9.Web archives: Web archival services such as the Wayback Machine enable users to see archived versions of web pages across time, including websites which have become inaccessible and are not indexed by search engines such as Google. The Wayback Machine can be regarded as a tool for viewing the deep web, because archived past versions of websites are not indexed and cannot be reached through an ordinary search. Since all websites are updated at some point, their earlier versions survive only in such archives, which is why web archives are considered deep Web content.

10.robots.txt files: A robots.txt file can advise search engine bots not to crawl a website by specifying user-agent: * followed by disallow: /. This tells all search engine bots not to crawl any part of the website or add it to their indexes.
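
For illustration, a minimal robots.txt of this kind, placed at the root of the site (for example at https://www.example.com/robots.txt), would look like this:

    # Applies to every crawler that honours the Robots Exclusion Standard
    User-agent: *
    # Disallow crawling of the entire site
    Disallow: /

Well-behaved crawlers read this file before fetching any other page and will skip the site entirely, which is one simple way a site ends up in the deep Web.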

Dept. of Computer Engg. 9 S.R.G.P.T.C Thriprayar


Seminar Report 2021-22 Deep Web

CHAPTER 5

ACCESSING DEEP WEB

To discover content on the Web, search engines use web crawlers that follow
hyperlinks. This technique is ideal for discovering resources on the surface Web but is
often ineffective at finding deep Web resources. For example, these crawlers do not
attempt to find dynamic pages that are the result of database queries due to the infinite
number of queries that are possible. It has been noted that this can be (partially) overcome by providing links to query results, but this could unintentionally inflate the popularity (e.g., PageRank) of a member of the deep Web.

One way to access the deep Web is via federated search engines. Search tools such as Science.gov are being designed to retrieve information from the deep Web. These tools identify and interact with searchable databases, aiming to provide access to deep Web content.
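
The sketch below illustrates the federated-search idea under stated assumptions: the per-source search functions are hypothetical placeholders standing in for connectors to individual searchable databases, and a real system such as Science.gov would forward the query to each source's own search interface and merge the results.

    # Minimal federated-search sketch. The per-source functions are hypothetical
    # placeholders; a real federated engine would call each database's search API.

    def search_medical_db(query):
        # Placeholder: would submit `query` to a medical literature database.
        return [("medical-db (placeholder)", f"results for {query!r}")]

    def search_patent_db(query):
        # Placeholder: would submit `query` to a patent database.
        return [("patent-db (placeholder)", f"results for {query!r}")]

    def federated_search(query, sources):
        # Fan the query out to every registered source and merge the results.
        merged = []
        for source in sources:
            merged.extend(source(query))
        return merged

    print(federated_search("deep web crawling", [search_medical_db, search_patent_db]))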

Fig.5.1 Accessing Deep web

Another way to explore the deep Web is to use human crawlers instead of algorithmic crawlers. In this paradigm, referred to as Web harvesting, humans find interesting deep Web links that algorithmic crawlers cannot find. This human-based computation technique for discovering the deep Web has been used by the StumbleUpon service since February 2002.

In 2005, Yahoo! made a small part of the deep Web searchable by releasing
Yahoo! Subscriptions. This search engine searches through a few subscription-only
Web sites. Some subscription websites display their full content to search engine
robots so they will show up in user searches, but then show users a login or
subscription page when they click a link from the search engine results page.


CHAPTER 6

CRAWLING THE DEEP WEB

Researchers have been exploring how the deep Web can be crawled automatically. In 2001, Sriram Raghavan and Hector Garcia-Molina presented an architectural model for a hidden-Web crawler that used key terms provided by users or collected from query interfaces to query a Web form and crawl the deep Web resources. Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho of UCLA created a hidden-Web crawler that automatically generated meaningful queries to issue against search forms. Their crawler generated promising results, but the problem is far from solved, as the authors recognized. Another effort is DeepPeep, a project of the University of Utah sponsored by the National Science Foundation, which gathered hidden-Web sources (Web forms) in different domains based on novel focused crawler techniques.

Commercial search engines have begun exploring alternative methods to crawl the deep Web. The Sitemap Protocol (first developed by Google) and mod_oai are mechanisms that allow search engines and other interested parties to discover deep Web resources on particular Web servers. Both mechanisms allow Web servers to advertise the URLs that are accessible on them, thereby allowing automatic discovery of resources that are not directly linked to the surface Web. Google's deep Web surfacing system pre-computes submissions for each HTML form and adds the resulting HTML pages into the Google search engine index. The surfaced results account for a thousand queries per second to deep Web content. In this system, the pre-computation of submissions is done using three algorithms:

(1) selecting input values for text search inputs that accept keywords

(2) identifying inputs which accept only values of a specific type (e.g., date)

(3) selecting a small number of input combinations that generate URLs suitable for
inclusion into the Web search index.
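
A highly simplified sketch of the surfacing idea is given below. The form URL, field name, and keyword list are illustrative assumptions rather than part of Google's actual system, and a production crawler would also handle typed inputs (dates, numbers) and carefully limit its query load.

    import requests

    # Hypothetical searchable site and form field, for illustration only.
    FORM_URL = "https://example.org/search"
    FORM_FIELD = "q"
    CANDIDATE_KEYWORDS = ["physics", "genome", "patent"]

    def surface_form(form_url, field, keywords):
        # Submit the search form once per candidate keyword and keep the
        # dynamically generated result pages so they can later be indexed.
        surfaced_pages = {}
        for word in keywords:
            response = requests.get(form_url, params={field: word}, timeout=30)
            surfaced_pages[word] = response.text
        return surfaced_pages

    # pages = surface_form(FORM_URL, FORM_FIELD, CANDIDATE_KEYWORDS)

Each submission produces a results page that exists only once the query has been issued, which is exactly the kind of content a link-following crawler never reaches on its own.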


CHAPTER 7

CLASSIFYING RESOURCES

It is difficult to automatically determine whether a Web resource is a member of the surface Web or the deep Web. If a resource is indexed by a search engine, it is not necessarily a member of the surface Web, because the resource could have been found using another method (e.g., the Sitemap Protocol, mod_oai, OAIster) instead of
traditional crawling. If a search engine provides a backlink for a resource, one may
assume that the resource is in the surface Web. Unfortunately, search engines do not
always provide all backlinks to resources. Even if a backlink does exist, there is no
way to determine if the resource providing the link is itself in the surface Web without
crawling all of the Web. Furthermore, a resource may reside in the surface Web, but it
has not yet been found by a search engine. Therefore, if we have an arbitrary resource,
we cannot know for sure if the resource resides in the surface Web or deep Web
without a complete crawl of the Web.

The concept of classifying search results by topic was pioneered by Yahoo! Directory search and is gaining importance as search becomes more relevant in day-to-day decisions. However, most of the work here has been in categorizing the surface Web by topic. For the classification of deep Web resources, Ipeirotis et al. presented an algorithm that classifies a deep Web site into the category that generates the largest number of hits for some carefully selected, topically focused queries. Deep Web directories under development include OAIster at the University of Michigan, Intute at the University of Manchester, INFOMINE at the University of California at Riverside, and DirectSearch (by Gary Price). This classification poses a challenge when searching the deep Web, whereby two levels of categorization are required. The first level is to categorize sites into vertical topics (e.g., health, travel, automobiles) and sub-topics according to the nature of the content underlying their databases.
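
A toy sketch of this query-probing idea is shown below: probe queries for each candidate category are issued against a site's search interface, and the site is assigned to the category whose probes return the most hits. The categories, probe terms, and hit-counting function are illustrative assumptions, not details from the cited work.

    # Toy query-probing classifier: assign a deep Web site to the category whose
    # topically focused probe queries produce the most hits on that site.
    PROBE_QUERIES = {
        "health": ["diabetes", "vaccine", "cardiology"],
        "travel": ["airfare", "itinerary", "hotel"],
        "automobiles": ["sedan", "horsepower", "transmission"],
    }

    def classify_site(count_hits):
        # `count_hits(term)` is assumed to submit `term` to the site's search
        # form and return the number of matching results.
        totals = {}
        for category, terms in PROBE_QUERIES.items():
            totals[category] = sum(count_hits(term) for term in terms)
        return max(totals, key=totals.get)

    # Example with a fake hit counter that favours health-related terms:
    print(classify_site(lambda term: 50 if term in {"diabetes", "vaccine"} else 1))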

The more difficult challenge is to categorize and map the information extracted from multiple deep Web sources according to end-user needs. Deep Web search reports cannot display URLs in the way traditional search reports do. End users expect their search tools not only to find what they are looking for quickly, but also to be intuitive and user-friendly. To be meaningful, the search reports have to convey something about the nature of the content that underlies the sources, or else the end user will be lost in a sea of URLs that do not indicate what content lies underneath them. The format in which search results are to be presented varies widely with the particular topic of the search and the type of content being exposed. The challenge is to find and map similar data elements from multiple disparate sources so that search results may be exposed in a unified format on the search report irrespective of their source.

7.1 CONTENT ON THE DEEP WEB

When we refer to the deep Web, we are usually talking about the following:

* The content of databases. Databases contain information stored in tables created by programs such as Access, Oracle, SQL Server, and MySQL. (There are other types of databases, but we will focus on database tables for the sake of simplicity.) Information stored in databases is accessible only by query: the database must be searched, and the data retrieved and then displayed on a Web page. This is distinct from static, self-contained Web pages, which can be accessed directly. A significant amount of valuable information on the Web is generated from databases; a small sketch of this query-then-render pattern follows after this list.

* Non-text files such as multimedia, images, software, and documents in formats such
as Portable Document Format (PDF). For example, see Digital Image Resources on
the Deep Web for a good indication of what is out there for images.

* Content available on sites protected by passwords or other restrictions. Some of this is fee-based content, such as subscription content paid for by libraries and available to their users based on various authentication schemes.

* Special content not presented as Web pages, such as full-text articles and books.

* Dynamically changing, updated content.
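
The sketch below illustrates the query-then-render pattern mentioned in the first item, using an in-memory SQLite table as a stand-in for a site's backing database; the table name, fields, and sample row are illustrative.

    import sqlite3

    # Stand-in for a website's backing database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (title TEXT, body TEXT)")
    conn.execute(
        "INSERT INTO articles VALUES ('Deep Web overview', 'Content reachable only by query...')"
    )

    def render_page(search_term):
        # The "page" is assembled from query results at request time, so a
        # link-following crawler never sees it unless the query is issued.
        rows = conn.execute(
            "SELECT title, body FROM articles WHERE title LIKE ?",
            (f"%{search_term}%",),
        ).fetchall()
        return "\n".join(f"<h2>{title}</h2><p>{body}</p>" for title, body in rows)

    print(render_page("Deep Web"))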


CHAPTER 8

DEEP WEB VS DARK WEB

Fig.8.1 Deep Web vs. Dark Web

The Internet: The regular Internet that everyone uses to read news, visit social media sites, and shop.

The Deep Web: The deep web is a subset of the Internet that is not indexed by the major search engines. This means that instead of being able to search for places, you have to visit them directly; they are waiting if you have an address, but there are no directions to get there. The Internet is simply too large for search engines to cover completely, which is why the deep web is so large.

Deep web commonly refers to all web pages that search engines cannot find. It therefore includes the 'Dark Web' along with all webmail pages, user databases, registration-required web forums and pages behind paywalls.

The Dark Web: The Dark Web, also known as the darknet, is a part of the deep web that is not indexed and that requires something special to gain access, e.g. authentication or dedicated software.

The Dark Web often resides on top of additional subnetworks such as Tor, Freenet and I2P. It is often associated with criminal activity of various degrees, including pornography, buying and selling drugs, gambling, etc.

While the Dark Web is used for criminal purposes more than the standard Internet or the Deep Web, there are many legitimate uses for it as well. Legitimate uses include things like using Tor to report domestic abuse, government oppression, and other crimes that have serious consequences for those calling out the issues.

Fig.8.2 Contents of the World Wide Web

8.1 BENEFITS OF DEEP WEB

Dark web sites serve multiple purposes, but the most important reason for using them is to remain anonymous; the hidden web is used above all out of concern for privacy.

Benefits

• Information remains safe, and you do not give your information away.

• Dark web sites are not crawled by any spiders.

• The Tor browser is used to access the hidden web and communicate anonymously.

• Services such as anonymous email, I2P, Freenet, Tor, P2P, Tails OS and VPNs are provided.


CHAPTER 9

FUTURE

The lines between search engine content and the deep Web have begun to blur,
as search services start to provide access to part or all of once-restricted content. An
increasing amount of deep Web content is opening up to free search as publishers and
libraries make agreements with large search engines. In the future, deep Web content
may be defined less by opportunity for search than by access fees or other types of
authentication.


CHAPTER 10

CONCLUSION

The deep web contains a huge amount of new and rare knowledge that could help us advance in various fields, and it thus connects us to the other side of information. It gives us the ability to be fully free and independent in our own views, without any censorship. At the same time, the deep web is also a channel for the distribution of many illegal goods and activities.


