
Deep Web

Part II.B. Techniques and Tools:


Network Forensics

CSF: Forensics Cyber-Security


Fall 2015
Nuno Santos
Summary

}  The Surface Web

}  The Deep Web

2 CSF - Nuno Santos 2015/16


Remember where we are
}  Our journey in this course:

}  Part I: Foundations of digital forensics

}  Part II: Techniques and tools

}  A. Computer forensics

}  B. Network forensics (current focus)

}  C. Forensic data analysis



Previously: Three key instruments in cybercrime

}  Anonymity systems: how criminals hide their IDs

}  Botnets: how to launch large-scale attacks

}  Digital currency: how to make untraceable payments



Today: One last key instrument – The Web itself

}  The Web allows access to services for criminal activity

}  E.g., drug selling, weapon selling, etc.

}  Provides a huge source of information, used in:

}  Crime premeditation, privacy violations, identity theft, extortion, etc.

}  To find services and info, there are powerful search engines
}  Google, Bing, Shodan, etc.



The Web: powerful also for crime investigation

}  A powerful investigation tool for learning about suspects

}  Find evidence in blogs, social networks, browsing activity, etc.

}  The playground where the crime itself is carried out


}  Illegal transactions, cyber stalking, blackmail, fraud, etc.



An eternal cat & mouse race (who’s who?)

}  The sophistication of offenses (and investigations) is driven by the nature and complexity of the Web



The web is deep, very deep…
}  What’s “visible” through typical search engines is minimal



What can be found in the Deep Web?

}  The Deep Web is not necessarily bad: its content is just not directly indexed

}  The part of the deep web where criminal activity is carried out is called the Dark Web



Some examples of services in the Web “ocean”



Offenders operate at all layers

}  Investigators too!



Roadmap

}  The Surface Web

}  The Deep Web



The Surface Web



The Surface Web

}  The Surface Web is that portion of the World Wide Web
that is readily available to the general public and
searchable with standard web search engines
}  AKA Visible Web, Clearnet, Indexed Web, Indexable Web or Lightnet

}  As of June 14, 2015, Google's index of the surface web contained about 14.5 billion pages



Surface Web characteristics
}  Distributed data
}  80 million web sites (hostnames responding) in April 2006
}  40 million active web sites (don’t redirect, …)

}  High volatility


}  Servers come and go …

}  Large volume


}  One study found 11.5 billion pages in January 2005 (at that
time Google indexed 8 billion pages)



Surface Web characteristics
}  Unstructured data
}  Lots of duplicated content (30% estimate)
}  Semantic duplication much higher

}  Quality of data


}  No required editorial process
}  Many typos and misspellings (impacts IR)

}  Heterogeneous data


}  Different media
}  Different languages



Surface Web composition by file type
}  As of 2003, about 70% of Web content consisted of images, HTML, PHP, and PDF files



How to find content and services?
}  Using search engines

1. A web crawler gathers a snapshot of the Web
2. The gathered pages are indexed for easy retrieval
3. User submits a search query
4. Search engine ranks pages that match the query and returns an ordered list
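The indexing and querying steps can be sketched with a toy inverted index. This is a hedged illustration in Python, not how any production engine is implemented; the three pages and their IDs are invented for the example:

```python
from collections import defaultdict

def build_index(pages):
    """Step 2: map each term to the set of page IDs containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index

def search(index, query):
    """Steps 3-4: return IDs of pages containing ALL query terms (unranked)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# A crawler's "snapshot of the Web" (step 1), reduced to three toy pages
pages = {
    "p1": "deep web content not indexed by search engines",
    "p2": "surface web pages indexed by standard search engines",
    "p3": "dark web markets and criminal activity",
}
idx = build_index(pages)
print(sorted(search(idx, "web indexed")))  # ['p1', 'p2']
```

A real engine adds ranking on top of this retrieval step; the intersection here only decides which pages match.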



How a typical search engine works
}  Architecture of a typical search engine
[Figure: users submit queries through the interface to the query engine, which answers them from the index; the indexer builds that index from pages fetched from the Web by the crawler, all running on lots and lots of computers]
What a Web crawler does

}  The Web crawler is a foundational species


}  Without crawlers, there would be nothing to search

}  Creates and repopulates the search engine's data by navigating the web, fetching documents and files



What a Web crawler is
}  In general, it’s a program for downloading web pages
}  Crawler AKA spider, bot, harvester

}  Given an initial set of seed URLs, recursively download every page that is linked from pages in the set
}  A focused web crawler downloads only those pages whose content satisfies some criterion

}  The set of URLs still to be crawled is called the URL frontier
}  Can include multiple pages from the same host
Crawling the Web: Start from the seed pages

[Figure: crawling starts from the seed pages; the URL frontier separates URLs already crawled and parsed from the unseen Web]



Crawling the Web: Keep expanding URL frontier

[Figure: crawling threads pop URLs from the frontier; newly parsed pages add links, pushing the frontier further into the unseen Web]
Web crawler algorithm is conceptually simple

}  Basic algorithm:

   Initialize queue Q with the initial set of known URLs
   Until Q is empty, or the page or time limit is exhausted:
       Pop URL L from the front of Q
       If L is not an HTML page (.gif, .jpeg, .ps, .pdf, .ppt, ...): continue loop
       If L was already visited: continue loop
       Download page P for L
       If P cannot be downloaded (e.g. 404 error, robot excluded): continue loop
       Index P (e.g. add to inverted index or store cached copy)
       Parse P to obtain list of new links N
       Append N to the end of Q
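The pseudocode above translates fairly directly to Python. This is a hedged sketch: `fetch` and `extract_links` are stand-ins supplied by the caller (in a real crawler they would wrap an HTTP client and an HTML parser), which also keeps the example runnable without network access; the fake URLs are invented:

```python
from collections import deque

# File extensions the pseudocode treats as non-HTML
SKIP_EXTENSIONS = (".gif", ".jpeg", ".ps", ".pdf", ".ppt")

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Breadth-first crawl following the basic algorithm above.

    fetch(url) returns page text, or None on failure (404, robot excluded, ...).
    extract_links(url, page) returns the links found in the page.
    """
    queue = deque(seed_urls)   # Q, the URL frontier
    visited = set()
    index = {}                 # url -> cached page, standing in for the real index
    while queue and len(index) < max_pages:
        url = queue.popleft()                      # pop L from front of Q
        if url.lower().endswith(SKIP_EXTENSIONS):  # not an HTML page
            continue
        if url in visited:                         # already visited
            continue
        visited.add(url)
        page = fetch(url)
        if page is None:                           # download failed
            continue
        index[url] = page                          # "Index P"
        queue.extend(extract_links(url, page))     # append new links N to Q
    return index

# Tiny fake web so the sketch runs without network access
fake_web = {
    "http://a.example/": ["http://b.example/", "http://a.example/pic.gif"],
    "http://b.example/": ["http://a.example/", "http://dead.example/"],
}
result = crawl(
    ["http://a.example/"],
    fetch=lambda url: "page" if url in fake_web else None,
    extract_links=lambda url, page: fake_web.get(url, []),
)
print(sorted(result))  # ['http://a.example/', 'http://b.example/']
```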



But not so simple to build in practice

}  Performance: How do you crawl 1,000,000,000 pages?

}  Politeness: How do you avoid overloading servers?

}  Failures: Broken links, timeouts, spider traps.

}  Strategies: How deep to go? Depth first or breadth first?

}  Implementations: How do we store and update the URL list and other data structures needed?



Crawler performance measures
}  Completeness
Is the algorithm guaranteed to find a solution when
there is one?

}  Optimality
Is this solution optimal?

}  Time complexity


How long does it take?

}  Space complexity


How much memory does it require?



No single crawler can crawl the entire Web
}  Crawling technique may depend on goal

}  Types of crawling goals:


}  Create large broad index
}  Create a focused topic or domain-specific index
}  Target topic-relevant sites
}  Index preset terms
}  Create subset of content to model characteristics
of the Web
}  Need to survey appropriately
}  Cannot use simple depth-first or breadth-first
}  Create up-to-date index
}  Use estimated change frequencies



Crawlers can also be used for nefarious purposes

}  Spiders can be used to collect email addresses for unsolicited communication
}  From: http://spiders.must.die.net



Crawler code available for free



Spider traps
}  A spider trap is a set of web pages that may be used to
cause a web crawler to make an infinite number of
requests or cause a poorly constructed crawler to crash
}  To “catch” spambots or similar that waste a website's bandwidth

}  Common techniques used are:


•  Creation of indefinitely deep directory structures like http://foo.com/bar/foo/bar/foo/bar/foo/bar/.....
•  Dynamic pages like calendars that produce an infinite number of pages for a web crawler to follow
•  Pages filled with many characters, crashing the lexical analyzer parsing the page
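One simple defense is to bound what the crawler will even queue. A minimal sketch in Python; the thresholds and heuristics are illustrative assumptions, not from the slides:

```python
from urllib.parse import urlparse

def looks_like_trap(url, max_depth=8, max_length=256):
    """Heuristic guard a crawler can apply before queueing a URL
    (thresholds are illustrative, not from the slides)."""
    if len(url) > max_length:          # absurdly long URL
        return True
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:      # suspiciously deep directory structure
        return True
    # Same two-segment pattern repeated three times in a row, as in
    # http://foo.com/bar/foo/bar/foo/bar/foo/bar/...
    if len(segments) >= 6 and segments[-2:] == segments[-4:-2] == segments[-6:-4]:
        return True
    return False

print(looks_like_trap("http://foo.com/bar/foo/bar/foo/bar/foo"))  # True
print(looks_like_trap("http://example.com/products/view"))       # False
```

Real crawlers combine such URL-level checks with per-host page budgets, since no single heuristic catches every trap.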
Search engines run specific and benign crawlers

}  Search engines obtain their listings in two ways:


}  The search engines "crawl" or "spider" documents by following hypertext links from one page to another
}  Authors may submit their own Web pages

}  As a result, only static Web content can be found on public search engines

}  Nevertheless, a lot of info can be retrieved by criminals and investigators, especially when using "hidden" features of the search engine



Google hacking
}  Google provides keywords for advanced searching
}  Logic operators in search expressions
}  Advanced query attributes: “login password filetype:pdf”
}  intitle, allintitle
}  inurl, allinurl
}  filetype
}  allintext
}  site
}  link
}  inanchor
}  daterange
}  cache
}  info
}  related
}  phonebook, rphonebook, bphonebook
}  author
}  group
}  msgid
}  insubject
}  stocks
}  define



There are entire books dedicated to Google hacking
}  Dornfest, Rael, Google Hacks, 3rd ed., O'Reilly (2006)

}  Ethical Hacking:
http://www.nc-net.info/2006conf/Ethical_Hacking_Presentation_October_2006.ppt

}  A cheat sheet of Google search features:
http://www.google.com/intl/en/help/features.html

}  A Cheat Sheet for Google Search Hacks -- how to find information fast and efficiently:
http://www.expertsforge.com/Security/hacking-everything-using-google-3.asp



Google hacking examples: Simple word search

}  A simple search: “cd ls .bash_history ssh”

}  Can return surprising results: e.g., the contents of a live .bash_history file
Google hacking examples: URL searches
}  inurl: find the search term within the URL

}  Examples: inurl:admin; inurl:admin users mbox; inurl:admin users passwords



Google hacking examples: File type searches

}  filetype: narrow down search results to a specific file type

}  Example: filetype:xls "checking account" "credit card"



Google hacking examples: Finding servers

intitle:"Under construction" "does not currently have"

intitle:"Welcome to Windows 2000 Internet Services"



Google hacking examples: Finding webcams

}  To find open unprotected Internet webcams that broadcast to the web, use the following query:
}  inurl:/view.shtml

}  Can also search by manufacturer-specific URL patterns


}  inurl:ViewerFrame?Mode=
}  inurl:ViewerFrame?Mode=Refresh
}  inurl:axis-cgi/jpg
}  ...
Google hacking examples: Finding webcams

}  How to Find and View Millions of Free Live Web Cams:
http://www.traveltowork.net/2009/02/how-to-find-view-free-live-web-cams/

}  How to Hack Security Cameras:
http://www.truveo.com/How-To-Hack-Security-Cameras/id/180144027190129591

}  How to Hack Security Cams all over the World:
http://www.youtube.com/watch?v=9VRN8BS02Rk&feature=related



And we’re just scratching the surface…

What can be found in the depths of the Web?



The Deep Web



The Deep Web

}  The Deep Web is the part of the Web which is not indexed by conventional search engines and therefore doesn't appear in search results

}  Why is it not indexed by typical search engines?



Some content can’t be found through URL traversal

•  Dynamic web pages and searchable databases


–  Response to a query or accessed only through a form
•  Unlinked contents
–  Pages without any backlinks
•  Private web
–  Sites requiring registration and login
•  Limited access web
–  Sites with captchas, no-cache pragma http headers
•  Scripted pages
–  Pages produced by JavaScript, Flash, etc.



At other times, content won’t be found
}  Crawling restrictions by site owner
}  Use a robots.txt file to keep files off limits from spiders

}  Crawling restrictions by the search engine


}  E.g.: a page may be found this way:
http://www.website.com/cgi-bin/getpage.cgi?name=sitemap
}  Most search engines will not read past the ? in that URL

}  Limitations of the crawling engine


}  E.g., real-time data – changes rapidly – too “fresh”



How big is Deep Web?
}  Studies suggest it’s approx. 500x the surface Web
}  But cannot be determined accurately

}  A 2001 study showed that 60 deep sites exceeded the size of the surface web (at that time) by 40x



Distribution of Deep Web sites by content type

}  Back in 2001, the biggest fraction went to databases



Approaches for finding content in Deep Web

1.  Specialized search engines

2.  Directories



Specialized search engines

}  Crawl deeper


}  Go beyond top page, or homepage

}  Crawl focused


}  Choose sources to spider—topical sites only

}  Crawl informed


}  Indexing based on knowledge of the specific subject



Specialized search engines abound
}  There are hundreds of specialized search engines for almost every topic



Directories

}  Collections of pre-screened websites organized into categories based on a controlled ontology
}  Including access to content in databases

}  Ontology: classification of human knowledge into topics, similar to traditional library catalogs

}  Two maintenance models: open or closed


}  Closed model: paid editors; quality control (Yahoo)
}  Open model: volunteer editors; (Open Directory Project)



Example of ontology
}  Ontologies allow for adding structure to Web content



A particularly interesting search engine

}  Shodan lets the user find specific types of computers connected
to the internet using a variety of filters
}  Routers, servers, traffic lights, security cameras, home heating systems
}  Control systems for water parks, gas stations, water plants, power grids,
nuclear power plants and particle-accelerating cyclotrons

}  Why is it interesting?

}  Many devices use "admin" as user name and "1234" as password, and the only software required to connect to them is a web browser



How does Shodan work?

“Google crawls URLs – I don’t do that at all. The only thing I do is randomly pick an IP out of all the IPs that exist, whether it’s online or not being used, and I try to connect to it on different ports. It’s probably not a part of the visible web in the sense that you can’t just use a browser. It’s not something that most people can easily discover, just because it’s not visual in the same way a website is.”

John Matherly, Shodan's creator

}  Shodan collects data mostly on HTTP servers (port 80)

}  But also from FTP (21), SSH (22), Telnet (23), and SNMP (161)
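This style of probing, connecting to a port and reading whatever the service announces, can be sketched in a few lines of Python. A hedged illustration, not Shodan's actual implementation; the banner-matching heuristics are illustrative assumptions:

```python
import socket

def guess_service(banner):
    """Guess a service from the first line of its banner (illustrative heuristics)."""
    first = banner.strip().splitlines()[0] if banner.strip() else ""
    if first.startswith("SSH-"):       # SSH version exchange string
        return "SSH"
    if first.startswith("220 "):       # classic FTP greeting code
        return "FTP"
    if first.startswith("HTTP/"):      # HTTP status line
        return "HTTP"
    return "unknown"

def grab_banner(host, port, timeout=3.0):
    """Connect to host:port and read up to 1 KiB of whatever the service sends first."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        try:
            return sock.recv(1024).decode("ascii", errors="replace")
        except OSError:
            return ""

print(guess_service("SSH-2.0-OpenSSH_6.6p1"))               # SSH
print(guess_service("220 ftp.example.com FTP server ready")) # FTP
```

Only probe hosts you are authorized to test; unsolicited scanning may be illegal in your jurisdiction.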



One can see through the eye of a webcam



Play with the controls for a water treatment facility



Find the creepiest stuff…
}  Controls for a crematorium; accessible from your computer



No words needed
}  Controls of Caterpillar trucks connected to the Internet



A Deep Web’s particular case

Dark Web



Dark Web

}  Dark Web is the Web content that exists on darknets

}  Darknets are overlay networks which use the public Internet but require specific software or authorization to access
}  Delivered over small peer-to-peer networks
}  As hidden services on top of Tor

}  The Dark Web forms a small part of the Deep Web,
the part of the Web not indexed by search engines



The Dark Web is a haven for criminal activities
}  Hacking services

}  Fraud and fraud services

}  Markets for illegal products

}  Hitmen

}  …
Surface Web vs. Deep Web

Surface Web
}  Size: Estimated to be 8+ billion (Google) to 45 billion (About.com) web pages
}  Static, crawlable web pages
}  Large amounts of unfiltered information
}  Limited to what is easily found by search engines

Deep Web
}  Size: Estimated to be 5 to 500x larger (BrightPlanet)
}  Dynamically generated content that lives inside databases
}  High-quality, managed, subject-specific content
}  Growing faster than the surface web (BrightPlanet)



Conclusions

}  The Web is a major source of information for both criminal and legal investigation activities

}  The Web content that is typically accessible through conventional search engines is named the Surface Web and represents only a small fraction of the whole Web

}  The Deep Web comprises the largest bulk of the Web, with a small part of it (the Dark Web) being used specifically for carrying out criminal activities



References

}  Primary bibliography
}  Michael K. Bergman, The Deep Web: Surfacing Hidden Value,
http://brightplanet.com/wp-content/uploads/2012/03/12550176481-deepwebwhitepaper1.pdf



Next class
}  Flow analysis and intrusion detection

