Seminar Report 2021-22 Deep Web
CHAPTER 1
INTRODUCTION
The deep Web (also called Deepnet, the invisible Web, dark Web or the
hidden Web) refers to World Wide Web content that is not part of the surface Web,
which is indexed by standard search engines.
Mike Bergman, credited with coining the phrase, has said that searching on the
Internet today can be compared to dragging a net across the surface of the ocean; a
great deal may be caught in the net, but there is a wealth of information that is deep
and therefore missed. Most of the Web's information is buried far down on
dynamically generated sites, and standard search engines do not find it. Traditional
search engines cannot "see" or retrieve content in the deep Web – those pages do not
exist until they are created dynamically as the result of a specific search. The deep
Web is several orders of magnitude larger than the surface Web.
The Surface Web is the top-most layer of the Internet. Everything we surf on the
Internet for our daily needs lies on the Surface Web. It is the most commonly used
surfing area: everything here is publicly accessible, it is crowded with traffic from all
types of visitors every day, and it is the comparatively safer portion of the Internet,
built for every regular user. The Surface Web is an open portal, easily available and
accessible to anyone from any corner of the world using a regular browser, and it is
the only layer in which almost all information is indexed by the popular search
engines. In short, the Surface Web is the only default user-accessible part of the
Internet. We can access the Surface Web using popular search engines such as Google,
Yahoo, Bing and DuckDuckGo, and popular browsers such as Google Chrome, Mozilla
Firefox, Internet Explorer, Netscape Navigator, Microsoft Edge, Opera, UC Browser
and more. The Surface Web is the most open and public part of the World Wide Web:
here we find websites served over HTTP/HTTPS on top of TCP or UDP, many of them
certified with SSL (Secure Socket Layer)/TLS (Transport Layer Security).
The Surface Web is filled with plenty of content indexed by the search engines'
automated Web crawlers: many kinds of website URLs, images, videos, GIFs and
similar data, all stored in the large databases of the respective search engines. This
content is indexed legitimately and verified by the search engines.
The big irony is that it is impossible for a person, in one lifetime, to surf, know,
learn, see and understand all the information that is available and accessible on the
Surface Web, because the Surface Web is itself a large area of the Internet that grows
constantly through its huge amount of data indexing. The most surprising fact is that,
by common measurements of the Internet's layers in terms of data volume, the Surface
Web makes up only about 4% of the total Internet. Even that 4% already seems like a
limitless Web layer, with an uncountable amount of data indexed. The remaining 96%,
the deeper and hidden part of the Internet, is called the Deep Web.
The Deep Web is the deeper part of the Internet. The deep web, invisible web,
or hidden web consists of the parts of the World Wide Web whose contents are not
indexed by standard web search engines. This is in contrast to the "surface web",
which is accessible to anyone using the Internet. Computer scientist Michael K.
Bergman is credited with coining the term in 2001 as a search-indexing term.
It is the most sensitive part of the Internet and is not indexed by search engines;
in fact, much of it is deliberately kept out of search engine indexes and is not meant
to be shown publicly. It is accessible only to its respective owners, or to users who
hold the credentials or permissions needed to access the underlying database
information.
The content of the deep web can be located and accessed by a direct URL or
IP address, but it may require a password or other security access to get past the
public-facing pages. The deep web also acts as the mass storage of server-side
information belonging to surface-Web sites, and it holds huge stacks of databases
filled with sensitive data.
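As a minimal sketch of what this means in practice, the Python snippet below fetches a
resource by its direct URL but only succeeds with valid credentials; the URL and the
username/password are hypothetical placeholders, not a real site.

    import requests

    # Hypothetical direct URL to a password-protected deep-web resource.
    url = "https://example.org/members/report.pdf"

    # Placeholder credentials; without them the server refuses the request.
    response = requests.get(url, auth=("alice", "s3cret"), timeout=10)

    if response.status_code == 200:
        print("Access granted, content length:", len(response.content))
    else:
        print("Refused without valid credentials, status:", response.status_code)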
1.3 NAMING
Bergman, in a seminal, early paper on the deep Web published in the Journal of
Electronic Publishing, mentioned that Jill Ellsworth used the term invisible Web in
1994 to refer to websites that are not registered with any search engine. Bergman
cited a January 1996 article by Frank Garcia:
"It would be a site that's possibly reasonably designed, but they didn't bother to
register it with any of the search engines. So, no one can find them! You're hidden. I
call that the invisible Web."
Another early use of the term invisible Web was by Bruce Mount and Matthew B.
Koll of Personal Library Software, in a description of the @1 deep Web tool found in
a December 1996 press release.
The first use of the specific term deep Web, now generally accepted, occurred in the
aforementioned 2001 Bergman study.
1.4 SIZE
In 2000, it was estimated that the deep Web contained approximately 7,500 terabytes
of data and 550 billion individual documents. Estimates based on extrapolations from
a study done at the University of California, Berkeley, show that the deep Web consists
of about 91,000 terabytes. By contrast, the surface Web (which is easily reached by
search engines) is only about 167 terabytes; the Library of Congress, in 1997, was
estimated to have perhaps 3,000 terabytes.
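Taken at face value, these figures imply a ratio of roughly five hundred to one, which
is consistent with the "several orders of magnitude" claim above. A quick check, using
only the numbers quoted in this section:

    # Rough scale comparison using the estimates quoted above (in terabytes).
    deep_web_tb = 91_000
    surface_web_tb = 167

    ratio = deep_web_tb / surface_web_tb
    print(f"The deep Web is roughly {ratio:.0f} times the surface Web")  # about 545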
CHAPTER 2
LEVELS OF THE WEB
Level 1: This is the most common web or Internet. We use it pretty much daily, and we
know and understand it well. It generally comprises the 'open to the public' part of
the Web.
Level 2: It is commonly known as the Surface Web. Services such as Reddit, Digg
and temporary email belong to it. Chat boards and other social content can be found
at this level, as it is essentially a communications platform. It is not difficult to reach
in any fashion.
Level 3: Level 3 is called the Bergie Web. It includes services besides the WWW or
ordinary Web services: Internet newsgroups, Google locked results, FTP sites,
honeypots and other sites such as 4chan. If you know where you are going, this level
is relatively simple to reach.
Level 4: Level 4 is known as the Charter Web or Deep Web. Hacker groups, activist
communications, banned media and other darker layers of online society are found on
these websites. This is what we basically refer to as the Deep Web. Typical web
search engines cannot find the sites on this layer; in order to access them, you have
to be invited by an existing member.
Level 5: At this level, things get a little creepy. The level is known as the Dark Web.
These websites are not accessible through the normal Internet; you will need to get on
the TOR network or some other private network. Dark Web sites are also referred to
as TOR hidden services or onion sites. On the TOR network there is a variety of legal
and illegal content. Illegal material such as bounty hunting, drugs, human trafficking,
hacker exploits, rare animal trade and other black-market items can be found on these
sites. Whenever we refer to the Dark Web, we are normally referring to the TOR
network.
CHAPTER 3
TOR NETWORK
TOR stands for "The Onion Router". The name covers both the TOR software
installed on your computer and the network of computers that manages TOR
connections. In simple terms, TOR permits you to route web traffic through several
other computers in the TOR network so that the party on the other side of the
connection cannot trace the traffic back to you. More TOR users means more
protection for your information, since your connections and sessions are routed
through other users' computers. As the name implies, it creates a number of layers
that conceal your identity from the rest of the world.
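As an illustration, the short Python sketch below routes an ordinary HTTP request
through a locally running TOR client. It assumes TOR is listening on its default SOCKS
port 9050 and that SOCKS support for the requests library is installed
(pip install requests[socks]); check.torproject.org is a commonly used way to confirm
that the traffic really left through the TOR network.

    import requests

    # Tor's client exposes a local SOCKS proxy, by default on port 9050.
    # "socks5h" makes DNS resolution also happen inside the Tor network.
    proxies = {
        "http":  "socks5h://127.0.0.1:9050",
        "https": "socks5h://127.0.0.1:9050",
    }

    # Ask the Tor Project's check service whether the request arrived via Tor.
    response = requests.get("https://check.torproject.org/api/ip",
                            proxies=proxies, timeout=30)
    print(response.json())   # e.g. {"IsTor": true, "IP": "..."}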
CHAPTER 4
INDEXING METHODS
Methods that prevent web pages from being indexed by traditional search
engines may be categorized as one or more of the following:
1. Contextual web: pages with content varying for different access contexts (e.g.,
ranges of client IP addresses or previous navigation sequence).
2. Limited-access content: sites that limit access to their pages in a technical way
(e.g., using the Robots Exclusion Standard, CAPTCHAs, or a no-store directive), which
prohibits search engines from browsing them and creating cached copies.
3. Private web: sites that require registration and login (password-protected resources).
4. Scripted content: pages that are only accessible through links produced by
JavaScript, as well as content dynamically downloaded from web servers via Flash or
Ajax solutions.
5. Unlinked content: pages which are not linked to by other pages, which may prevent
web crawling programs from accessing the content. This content is referred to as
pages without backlinks (also known as inlinks). Also, search engines do not always
detect all backlinks from searched web pages.
6. Web archives: Web archival services such as the Wayback Machine enable users to
see archived versions of web pages across time, including websites that have become
inaccessible and are not indexed by search engines such as Google. The Wayback
Machine can be seen as a tool for viewing the deep web, because archived past
versions of websites cannot be reached through an ordinary search; since all websites
are updated at some point, those older versions count as deep Web content.
7. robots.txt files: a robots.txt file can advise search engine bots not to crawl a
website by using "User-agent: *" followed by "Disallow: /". This tells all search
engine bots not to crawl the site or add it to the search index, as sketched below.
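A minimal sketch of how a compliant crawler interprets those two directives, using
Python's standard urllib.robotparser module (the example.com URLs are placeholders):

    from urllib import robotparser

    # The directives from the text: block every crawler from the whole site.
    rules = [
        "User-agent: *",
        "Disallow: /",
    ]

    parser = robotparser.RobotFileParser()
    parser.parse(rules)

    # Any compliant search engine bot must now skip every URL on the site.
    print(parser.can_fetch("Googlebot", "https://example.com/"))             # False
    print(parser.can_fetch("*", "https://example.com/private/page.html"))    # False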
CHAPTER 5
ACCESSING DEEP WEB CONTENT
To discover content on the Web, search engines use web crawlers that follow
hyperlinks. This technique is ideal for discovering resources on the surface Web but is
often ineffective at finding deep Web resources. For example, these crawlers do not
attempt to find dynamic pages that are the result of database queries due to the infinite
number of queries that are possible. It has been noted that this can be (partially)
overcome by providing links to query results, but this could unintentionally inflate the
popularity (e.g., PageRank) of a member of the deep Web.
One way to access the deep Web is via federated search based search engines. Search
tools such as Science.gov are being designed to retrieve information from the deep
Web. These tools identify and interact with searchable databases, aiming to provide
access to deep Web content.
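Conceptually, a federated search tool fans the same query out to several searchable
databases and merges the answers into one report. The sketch below illustrates the idea
only; the endpoint URLs and the "q"/"results" fields are hypothetical placeholders, not
the real Science.gov interface.

    import requests

    # Hypothetical searchable databases behind a federated search front end.
    SOURCES = [
        "https://database-one.example.org/search",
        "https://database-two.example.org/search",
    ]

    def federated_search(query):
        merged = []
        for endpoint in SOURCES:
            try:
                reply = requests.get(endpoint, params={"q": query}, timeout=10)
                merged.extend(reply.json().get("results", []))
            except (requests.RequestException, ValueError):
                continue   # a slow or offline source should not break the whole search
        return merged

    print(len(federated_search("ocean temperature measurements")))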
Another way to explore the deep Web is by using human crawlers instead of
algorithmic crawlers. In this paradigm, referred to as Web harvesting, humans find
interesting links of the deep Web that algorithmic crawlers can't find. This human-
based computation technique to discover the deep Web has been used by the
StumbleUpon service since February 2002.
In 2005, Yahoo! made a small part of the deep Web searchable by releasing
Yahoo! Subscriptions. This search engine searches through a few subscription-only
Web sites. Some subscription websites display their full content to search engine
robots so they will show up in user searches, but then show users a login or
subscription page when they click a link from the search engine results page.
CHAPTER 6
CRAWLING THE DEEP WEB
Researchers have been exploring how the deep Web can be crawled in an
automatic fashion. In 2001, Sriram Raghavan and Hector Garcia-Molina presented an
architectural model for a hidden-Web crawler that used key terms provided by users
or collected from the query interfaces to query a Web form and crawl the deep Web
resources. Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho of UCLA created a
hidden-Web crawler that automatically generated meaningful queries to issue against
search forms. Their crawler generated promising results, but the problem is far from
being solved, as the authors recognized. Another effort is DeepPeep, a project of the
University of Utah sponsored by the National Science Foundation, which gathered
hidden-Web sources (Web forms) in different domains based on novel focused
crawler techniques.
In general, automatically surfacing deep Web content by submitting values to such
search forms raises three problems (a rough sketch follows the list below):
(1) selecting input values for text search inputs that accept keywords
(2) identifying inputs which accept only values of a specific type (e.g., date)
(3) selecting a small number of input combinations that generate URLs suitable for
inclusion into the Web search index.
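A rough sketch of this surfacing idea is given below: candidate keywords are submitted
to a site's search form and the result links are recorded so they can later be added to
a search index. The form URL, the "query" parameter name and the keyword list are
hypothetical placeholders, not any particular crawler's real configuration.

    from html.parser import HTMLParser
    import requests

    class LinkCollector(HTMLParser):
        """Collects the href of every <a> tag in a result page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    FORM_URL = "https://catalog.example.org/search"      # hypothetical search form
    candidate_keywords = ["census", "patent", "genome"]   # seed terms the crawler tries

    surfaced_urls = set()
    for keyword in candidate_keywords:
        page = requests.get(FORM_URL, params={"query": keyword}, timeout=10)
        collector = LinkCollector()
        collector.feed(page.text)
        surfaced_urls.update(collector.links)   # result links become crawlable URLs

    print(f"Surfaced {len(surfaced_urls)} candidate deep-Web URLs")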
CHAPTER 7
CLASSIFYING RESOURCES
Users of deep Web search services expect their search tools not only to find what they
are looking for quickly, but also to be intuitive and user-friendly. In order to be
meaningful, the search reports have to offer some depth to the nature of the content
that underlies the sources, or else the end-user will be lost in the sea of URLs that do
not indicate what content lies underneath them. The format
in which search results are to be presented varies widely by the particular topic of the
search and the type of content being exposed. The challenge is to find and map similar
data elements from multiple disparate sources so that search results may be exposed
in a unified format on the search report irrespective of their source.
When we refer to the deep Web, we are usually talking about the following:
* Non-text files such as multimedia, images, software, and documents in formats such
as Portable Document Format (PDF). For example, see Digital Image Resources on
the Deep Web for a good indication of what is out there for images.
* Special content not presented as Web pages, such as full text articles and books
CHAPTER 8
THE INTERNET, THE DEEP WEB AND THE DARK WEB
The Internet: the regular Internet that everyone uses to read news, visit social media
sites, and shop.
The Deep Web: The deep web is a subset of the Internet that is not indexed by the
major search engines. This means that instead of being able to search for these places,
you have to visit them directly. They are waiting if you have an address, but there are
no directions to get there. The Internet is too large for search engines to cover
completely, which is why the deep web is so large.
The deep web commonly refers to all web pages that search engines cannot find. Hence
it includes the 'Dark Web' along with all webmail pages, user databases,
registration-required web forums and pages behind paywalls.
The Dark Web: The Dark Web, also known as the darknet, is a part of the deep Web
that is not indexed; to be able to access it requires something special.
The Dark Web often resides on top of additional subnetworks such as Tor,
Freenet and I2P. It is often associated with criminal activity of various degrees,
including pornography, buying and selling drugs, gambling, etc.
While the Dark Web is used for criminal purposes more than the standard Internet or
the Deep Web, there are many legitimate uses for it as well. Legitimate uses include
things like using Tor to discuss or report domestic abuse, government oppression, and
other crimes that have serious consequences for those calling out the issues.
There are multiple purposes for dark web sites, but the most important reason behind
their usage is to remain anonymous; the hidden web is used out of concern for privacy.
Benefits
• Your information remains safe; you do not let your information out.
• Services such as email, I2P, Freenet, Tor, P2P, Tails OS and VPNs are provided.
CHAPTER 9
FUTURE
The lines between search engine content and the deep Web have begun to blur,
as search services start to provide access to part or all of once-restricted content. An
increasing amount of deep Web content is opening up to free search as publishers and
libraries make agreements with large search engines. In the future, deep Web content
may be defined less by opportunity for search than by access fees or other types of
authentication.
CHAPTER 10
CONCLUSION
The deep web contains a huge amount of new and rare knowledge that could help
us evolve in various fields and thus connects us to the other side of information. It
gives us the ability to be fully free and independent in our own views, without any
censorship. At the same time, the deep web is also a channel for the distribution of
many illegal goods and activities.
REFERENCES
http://transparint.com/blog/2016/01/26/introduction-to-the-surface-dark-and-deep-web/
https://en.wikipedia.org/wiki/Deep_web
http://www.doctorchaos.com/the-ultimate-guide-tothe-deep-dark-invisible-web-darknet-unleased/
https://www.quora.com/How-can-someone-accessthe-deep-web
https://danielmiessler.com/study/internet-deepdark-web
https://www.deepweb-sites.com/positive-uses-ofthe-dark-deep-web
https://www.seminar4u.blogspot.com/2009/07/deep-web