Untangling The Web: Alena Kaltunevich
ALENA KALTUNEVICH
Contents
• Introduction to Searching
• Search Engines
• Specialized Search
Document declassified by the NSA in 2013
Introduction to Searching
2. The search engine index, catalog, or database, where everything the spider
found is stored.
3. The search engine is software that actually sifts through everything in the
index to find matches and then ranks or sorts them into a list of results or hits.
When you use a search engine, you are searching the index or database, not
the web pages themselves. This is important to remember because no
search engine operates in "real time."
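The point that a query runs against the index rather than the live pages can be illustrated with a toy sketch. The page addresses and text here are hypothetical, and the index is a plain word-to-page mapping, which is an assumption made for illustration and not how any particular search engine stores its data.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page URLs that contained it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return URLs whose indexed text contained every query word."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical pages as the spider saw them at crawl time.
pages = {
    "example.org/a": "cardinal birds of north america",
    "example.org/b": "st louis cardinals baseball schedule",
}
index = build_index(pages)

# The live page changes after the crawl, but the index is not updated,
# so the search still reports the old content as a hit.
pages["example.org/a"] = "page moved"
stale_hits = search(index, "cardinal birds")
```

The stale hit for example.org/a is the practical consequence of searching a stored copy instead of the page itself.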
Most search engines use statistical interfaces. The search engine assigns
relative weights to each search term, depending on:
When you query the database, the search engine adds up all the weights
that match your query terms and returns the documents with the highest
weight first. Each search engine has its own algorithm for assigning
weights, and they tweak these frequently. In general, rare, unusual terms
are easier to find than common ones because of the weighting system.
However, remember that "popularity" measured by various means often
trumps any statistical interface.
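The idea of adding up term weights, with rare terms counting for more, can be sketched as follows. The weighting formula is a simple term-frequency and inverse-document-frequency scheme chosen for illustration; it is an assumption, since each search engine uses its own unpublished algorithm, and it ignores the "popularity" factors mentioned above.

```python
import math
from collections import Counter

def rank(query, docs):
    """Rank document names by the summed weights of the query terms they contain."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        total = 0.0
        for term in query.lower().split():
            # Number of documents that contain the term at least once.
            doc_freq = sum(1 for w in tokenized.values() if term in w)
            if doc_freq == 0 or counts[term] == 0:
                continue
            # Rare terms (low doc_freq) receive a larger weight.
            weight = math.log(1 + n_docs / doc_freq)
            total += (counts[term] / len(words)) * weight
        scores[name] = total
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical three-document collection.
docs = {
    "a": "the cat sat on the mat",
    "b": "the quokka sat on the mat",
    "c": "the dog ran",
}
ranking = rank("the quokka", docs)
```

In this sketch the common word "the" contributes little to any document, while the rare word "quokka" pushes document b to the top.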
Search engines are not the only and often not even
the best way to access information on the Internet.
The growth in the number of search engines has led to the creation of "meta" search
sites. These services allow you to invoke several or even many search engines
simultaneously.
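A metasearch service can be pictured as sending one query to several engines and merging the ranked lists. In this sketch the two engine functions are hypothetical stand-ins that return fixed lists, and reciprocal-rank merging is one possible choice, not the method of any named service.

```python
def metasearch(query, engines):
    """Send one query to several engines and merge their ranked results."""
    scores = {}
    for engine in engines:
        for position, url in enumerate(engine(query), start=1):
            # A top hit counts more than a low one, and a URL returned
            # by several engines accumulates score from each of them.
            scores[url] = scores.get(url, 0.0) + 1.0 / position
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda u: (-scores[u], u))

# Hypothetical engines; each returns a ranked list of URLs.
def engine_a(query):
    return ["example.org/1", "example.org/2", "example.org/3"]

def engine_b(query):
    return ["example.org/2", "example.org/4"]

merged = metasearch("cardinal", [engine_a, engine_b])
```

A page found by both engines (example.org/2) rises above pages found by only one.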
http://clusty.com/
Clusty employs its own
clustering engine, software that organizes unstructured information into
hierarchical folders.
Clusty is especially useful for searching ambiguous terms, such as cardinal,
because it clusters them by logical categories, as shown below.
Example: Iran (clusters on the left)
http://www.dogpile.com/
https://mamma.com/
Use the right tool for the job
The best starting places for general information on broad topics are web
directories/subject guides, virtual libraries, and reference desks.
http://www.about.com/
http://www.encyclopedia.com/
http://www.britannica.com/
While directories and virtual libraries contain information selected by people, search
engine databases are mostly unfiltered, that is, no human being is looking at the
data being indexed to determine its value, authenticity, and reliability.
Search engines
Google
Google first gained fame and widespread use because of its single-minded
focus on search, exemplified by its "clean" interface, and its PageRank
"weighted link popularity."
Google assumes as its default that multiple search terms are joined by the AND
operator, so that a search on the keywords [windows explorer] will find all the
webpages that contain both search terms. Furthermore, Google will first try to find
all the webpages that contain the phrase ["windows explorer"].
While Google assumes that multiple keywords are a phrase, searchers can
delimit phrases using double-quotes. For example, if I search on [the last king of france]
without double-quotes, Google will ignore the "the" and the "of" in its search. The
results I get include many irrelevant hits, such as music from a group called "The
Last King" and an article about Lance Armstrong. However, if I enclose the same
query in double-quotes, Google will search on exactly the phrase ["the last king of
france"], and return a result with the name of the last king of France. Enclosing
searches in double-quotes is much more effective for finding precise results than
relying on automatic phrase searching.
It is unnecessary to use the plus sign (+) with any terms except stop words because
by default Google searches for all keywords.
However, there are many times when searchers need to exclude certain terms
that are commonly associated with a keyword but irrelevant to their search.
Using the minus sign in front of a keyword ensures that Google excludes that term
from the search. For example, the results for the search ["pearl harbor" -movie] are
very different from the results for ["pearl harbor"].
To force Google to search only for the term with the diacritic, put a plus sign
in front of the term: [+façade].
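The basic syntax described above (implicit AND, double-quoted phrases, the minus sign, and the plus sign) can be assembled programmatically. This is a minimal sketch: the helper names are hypothetical, the operator behavior is taken from the text as written and may have changed since, and the search address with its q parameter is an assumption that does not come from the slides.

```python
from urllib.parse import urlencode

def build_query(terms=(), phrases=(), exclude=(), require=()):
    """Assemble a query string from plain terms, phrases, and +/- terms."""
    parts = list(terms)                          # joined by implicit AND
    parts += ['"%s"' % p for p in phrases]       # exact phrases
    parts += ["-" + t for t in exclude]          # terms to leave out
    parts += ["+" + t for t in require]          # terms to match exactly
    return " ".join(parts)

def search_url(query, base="https://www.google.com/search"):
    """Return a URL with the query safely percent-encoded."""
    return base + "?" + urlencode({"q": query})

query = build_query(phrases=["pearl harbor"], exclude=["movie"])
url = search_url(query)
```

Here query holds the ["pearl harbor" -movie] example from the text, and url is its encoded form, ready to paste into a browser.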
Google Advanced
[cirrus -site:mastercard.com] finds pages about the keyword cirrus that are not at
the Mastercard.com site.
intitle: restricts the results to documents containing the keyword in the title.
[intitle:amazon "rain forest"] finds all pages that include the word amazon in their
title and mention the phrase "rain forest" anywhere in the document (title or
text).
inurl: restricts the results to documents containing the keyword in the URL.
[inurl:nasa -site:gov] finds all pages that include nasa anywhere in the URL of sites
that are not in the .gov top-level domain.
link: restricts the results to documents that have links to a specific webpage.
[link:www.noaa.gov] finds all pages linking to the NOAA homepage.
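To see how the operator syntax above breaks down, a query can be split into its parts. This sketch is only a reading aid: the function and the set of recognized operators are hypothetical and limited to the four operators named in this section, and it makes no claim about how Google itself parses a query.

```python
import re

# Operators named in this section; anything else is treated as a plain term.
OPERATORS = {"site", "intitle", "inurl", "link"}
# A token is a run of non-space text that may include quoted phrases.
TOKEN = re.compile(r'(?:[^\s"]+|"[^"]*")+')

def parse_query(query):
    """Split a query into terms and operator filters, kept or excluded."""
    parsed = {"terms": [], "filters": [],
              "excluded_terms": [], "excluded_filters": []}
    for token in TOKEN.findall(query):
        negated = token.startswith("-")
        body = token[1:] if negated else token
        name, sep, value = body.partition(":")
        if sep and name in OPERATORS:
            key = "excluded_filters" if negated else "filters"
            parsed[key].append((name, value))
        else:
            key = "excluded_terms" if negated else "terms"
            parsed[key].append(body)
    return parsed

example = parse_query("inurl:nasa -site:gov")
```

For [inurl:nasa -site:gov], the result has one filter (inurl, nasa) and one excluded filter (site, gov), which matches the description given for that example.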
Video search:
genre, duration
[is:free sharknado]
Contrary to popular opinion, not everything is on the Internet. In fact, much of the kind
of information you are used to working with is not and never will be on the Internet.
Unrealistic expectations about the kinds of information you may find on the Internet can
lead to frustration and wasted time and effort. A general rule of thumb:
the more sensitive, rare, or expensive the information, the less likely it is to be on the
Internet. Also, much valuable data on the Internet requires payment.
Word Order Matters. Google gives more weight to the first term in a query, so put
the most important search term(s) first. Try these two queries and you'll see how
different the results are: [new york city] vs. [city york new].
YAHOO
https://fr.search.yahoo.com/
Boolean operator queries can give results that differ from those returned by Google.
Here is an interesting twist on link searching, that is, finding sites that link to a
specific address. This search, which works with Yahoo, finds pages that link to a specific
domain or domains but not to another specific domain or domains.
Gigablast
http://www.gigablast.com/
Strengths
• simple interface
• cached copies with date indexed [archived copies]
• cached copies of webpages without images [stripped]
• links to the Internet Archive [older copies]
• clusters results by default (can be turned off)
• no limit on number of search terms
Weaknesses
• most obviously, the Gigablast index is still smaller than those of Google or Yahoo
• no truncation
• is not case sensitive
• no wildcard
• limited file type searches
• limited language options
• poor documentation
Exalead
http://www.exalead.com/search/
The French search engine Exalead, which introduced a new look in 2006, has
features that make it worth special mention. Exalead offers both proximity searches
and truncation, two options no other major search engine offers anymore. In
addition, Exalead presents thumbnail images of websites in the results list (if you want them).
• Exalead refreshes its index continuously, not on a schedule (this is a good thing)
• default operator is AND; users may use OR.
• Exalead does not publish a search term limit
• as of now, Exalead has no sponsored links.
There are two other operators that can be used in a Boolean query:
NEAR and OPT. NEAR finds search terms within 16 words of each other, and OPT
makes a query term preferable but does not require it.
For example: [(football NEAR cardinals) OPT "st louis"]
This is nice to know because most search engines use AND as their default and will
not return results unless all terms are found.
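The behavior of NEAR and OPT described above can be mimicked on plain text. This is a sketch under stated assumptions: "within 16 words" is read as a position difference of at most 16, the OPT phrase simply raises a score, and both function names are hypothetical. It is not Exalead's implementation.

```python
def near(text, first, second, window=16):
    """True if the two words occur within `window` words of each other."""
    words = text.lower().split()
    first_pos = [i for i, w in enumerate(words) if w == first.lower()]
    second_pos = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= window for i in first_pos for j in second_pos)

def evaluate(text, near_pair, optional):
    """Return (is_match, score): the NEAR pair is required, OPT only boosts."""
    first, second = near_pair
    if not near(text, first, second):
        return (False, 0)
    score = 2 if optional.lower() in text.lower() else 1
    return (True, score)

# Hypothetical documents for the query [(football NEAR cardinals) OPT "st louis"].
doc_1 = "football news the cardinals won in st louis"
doc_2 = "football fans cheer as cardinals play in arizona"
doc_3 = "cardinals are red birds"

results = [evaluate(d, ("football", "cardinals"), "st louis")
           for d in (doc_1, doc_2, doc_3)]
```

Both of the first two documents match, but the one mentioning "st louis" scores higher; the third fails the NEAR requirement and is not returned at all.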
Ask
http://fr.ask.com/
Specialized Search
The whole problem of keeping information on the Internet private dramatically
worsened almost overnight a couple of years ago when Google quietly started
indexing whole new types of data.
Originally, most of what got spidered and indexed was HTML webpages and documents,
with some plain text thrown in for good measure.
Problem was, lots of folks had assumed these file types were "immune" to spidering
not because it couldn't be done but because no one had yet done it.
As a result, many companies,
organizations, and even governments had quite a lot of egg on their faces when
sensitive documents began turning up in the Google database.
What kinds of sensitive information can routinely be found using search engines?
The types of data most commonly discovered by Google hackers usually fall into
one of these categories:
Even when Google removes your data, there are literally hundreds of other
search engines around the world, and who knows what they have indexed from your
site. It will not be an easy task finding out. And I'll hazard a guess that not all of them
will be quite so accommodating as Google in removing pages.
Wikipedia
http://www.wikiwax.com/
To search all Wikipedias:
[site:wikipedia.org]
http://a9.com/
Amazon search
Google book search
[inpublisher:o-reilly]
[inauthor:patrick-o-brian]
[intitle:"nutmeg of consolation"]
[isbn:0393030326]
Answers
http://www.answers.com/
Wayback Machine
http://archive.org/web/
Using the Wayback Machine, you may very well be able to retrieve a page or an entire site
even if it disappeared from the web years ago.
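Archived copies can also be reached by constructing an address directly. This sketch assumes the commonly seen web.archive.org/web/ address pattern with a date stamp, which does not come from the slides; the helper name is hypothetical, and the code only builds the address without contacting the archive.

```python
from datetime import date

def wayback_url(page_url, when=None):
    """Build a Wayback Machine address for a page.

    With a date, the address asks for a capture near that day;
    without one, "*" asks for the list of all captures.
    """
    stamp = when.strftime("%Y%m%d") if when else "*"
    return "https://web.archive.org/web/%s/%s" % (stamp, page_url)

# Hypothetical page that may no longer exist on the live web.
snapshot = wayback_url("http://example.com/", date(2006, 1, 15))
all_captures = wayback_url("http://example.com/")
```

Opening either address in a browser shows whether the archive holds a copy of the page.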
Wikis, custom search engines, and directories are generally better when researching a
broad topic.
Tip 8: Learn Two Words in Any Non-English Language in Which You Are Searching
Those two words are search and links. You need to be able to push the search or
find button on a non-English web page, and you need to be able to find the links page.
Tip 18: Always Look at a Website's Native-Language Version
Usually, the native language version of a website will differ from the English version,
sometimes a little, sometimes a great deal.
Questions?