
UNTANGLING THE WEB

ALENA KALTUNEVICH

Contents

• Introduction to Searching

• Search Engines

• Specialized Search

• Research Tips & Techniques

Document declassified by NSA in 2013

Introduction to Searching

1. The spider/robot/crawler is software that "visits" sites on the Internet (each search engine does this differently). The spider reads what is there, follows links at the site, and ultimately brings all that data back to:

2. The search engine index, catalog, or database, where everything the spider found is stored.

3. The search engine is software that actually sifts through everything in the index to find matches and then ranks or sorts them into a list of results or hits.

When you use a search engine, you are searching the index or database, not the web pages themselves. This is important to remember because no search engine operates in "real time."

Most search engines use statistical interfaces. The search engine assigns relative weights to each search term, depending on:

• its rarity in the engine's database
• how frequently the term occurs on the webpage
• whether or not the term appears in the URL
• how close to the top of the page the term appears
• (sometimes) whether or not the term appears in the metatags

When you query the database, the search engine adds up all the weights that match your query terms and returns the documents with the highest weight first. Each search engine has its own algorithm for assigning weights, and they tweak these frequently. In general, rare, unusual terms are easier to find than common ones because of the weighting system. However, remember that "popularity," measured by various means, often trumps any statistical interface.
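The weighting scheme above can be sketched as a toy ranking function. The formula below (log-rarity plus on-page frequency plus a position boost) is an illustrative guess at the kind of arithmetic described, not any engine's actual algorithm, and the three-page "index" is invented for the example.

```python
import math

# Toy index: each "page" is a list of lowercased terms in page order.
DOCS = {
    "a": "rare orchid care guide for rare tropical orchids".split(),
    "b": "garden care tips and general care advice".split(),
    "c": "orchid photos".split(),
}

def weight(term, doc_terms, n_docs, doc_freq):
    """Weight one term on one page: rarity + frequency + position boost."""
    if term not in doc_terms:
        return 0.0
    rarity = math.log(n_docs / doc_freq[term])       # rarer in the database -> heavier
    freq = doc_terms.count(term) / len(doc_terms)    # how often it occurs on the page
    position = 1.0 / (1 + doc_terms.index(term))     # closer to the top -> heavier
    return rarity + freq + position

def rank(query):
    """Sum per-term weights and return page names, highest total first."""
    doc_freq = {t: sum(t in d for d in DOCS.values()) for t in query}
    scores = {name: sum(weight(t, d, len(DOCS), doc_freq) for t in query)
              for name, d in DOCS.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A query on [rare orchid] ranks page "a" first: "rare" is the rarest term in this toy database and appears at the very top of that page, so it carries the most weight, just as the text describes.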

Search engines are not the only and often not even the best way to access information on the Internet.
The growth in the number of search engines has led to the creation of "meta" search
sites. These services allow you to invoke several or even many search engines
simultaneously.

Metasearch engines do serve a purpose. If you are unsure whether a term will be found anywhere on the web, try a metasearch engine first to "size" the problem:

http://clusty.com/
Clusty employs its own clustering engine, software that organizes unstructured information into hierarchical folders. Clusty is especially useful for searching ambiguous terms, such as cardinal, because it clusters results into logical categories. Example: a search on Iran (clusters appear on the left).

http://www.dogpile.com/
https://mamma.com/
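What a metasearch service does under the hood can be sketched as a concurrent fan-out and merge: fire the same query at several engines at once, then combine the result lists and de-duplicate. The three "engines" below are stand-in functions invented for the example, not real APIs.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in "engines": each takes a query and returns a list of result URLs.
def engine_a(q): return [f"http://a.example/{q}/1", "http://shared.example/hit"]
def engine_b(q): return ["http://shared.example/hit", f"http://b.example/{q}/1"]
def engine_c(q): return []   # an engine may simply find nothing

def metasearch(query, engines):
    """Query every engine in parallel, then merge, de-duplicating URLs."""
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda e: e(query), engines))
    seen, merged = set(), []
    for results in result_lists:
        for url in results:
            if url not in seen:          # keep the first occurrence only
                seen.add(url)
                merged.append(url)
    return merged
```

If every engine returns an empty list, the term probably is not on the web at all, which is exactly the "sizing" use the text recommends.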
Use the right tool for the job

The best starting places for general information on broad topics are web directories/subject guides, virtual libraries, and reference desks:

http://www.about.com/
http://www.encyclopedia.com/
http://www.britannica.com/

While directories and virtual libraries contain information selected by people, search
engine databases are mostly unfiltered, that is, no human being is looking at the
data being indexed to determine its value, authenticity, and reliability.

However, no single search engine is best. Each has its own advantages and drawbacks. Use more than one search engine.

Search engines
Google

Google first gained fame and widespread use because of its single-minded focus on search, exemplified by its "clean" interface, and its PageRank "weighted link popularity."

In simple terms, Google gives each webpage a rank based on the number of other pages linking to it and the "importance" of those pages, where importance is derived from an overall link count.

While PageRank is imperfect, it works better than most other approaches to ranking search results and, indeed, is one of the primary reasons for Google's success.
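The PageRank idea described above can be sketched in a few lines of power iteration: each page repeatedly hands its current rank out to the pages it links to, so a page linked to by important pages becomes important itself. The damping factor and the toy four-page graph are illustrative assumptions; Google's production formula differs in many details.

```python
# Minimal power-iteration sketch of PageRank. Illustrative only.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links out to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline rank...
        new = {p: (1 - damping) / len(pages) for p in pages}
        # ...and distributes the rest of its rank evenly over its outlinks.
        for page, outlinks in links.items():
            if not outlinks:
                continue
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += share
        rank = new
    return rank
```

In a toy graph where pages b, c, and d ultimately feed rank toward a, page a ends up ranked highest, while a page nobody links to (d) ends up lowest.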

Google assumes as its default that multiple search terms are joined by the AND
operator, so that a search on the keywords [windows explorer] will find all the
webpages that contain both search terms. Furthermore, Google will first try to find
all the webpages that contain the phrase ["windows explorer"].

Google will search:

first, for phrases (keywords as one long phrase);
second, for webpages containing all the keywords with the greatest adjacency (closest together);
third, for webpages containing all the keywords, regardless of where they appear on the page.

While Google assumes that multiple keywords are a phrase, searchers can delimit phrases using double quotes. For example, if I search on [the last king of france] without double quotes, Google will ignore the "the" and the "of" in its search. The results I get include many irrelevant hits, such as music from a group called "The Last King" and an article about Lance Armstrong. However, if I enclose the same query in double quotes, Google will search on exactly the phrase ["the last king of france"] and return a result with the name of the last king of France. Enclosing searches in double quotes is much more effective for finding precise results than relying on automatic phrase searching.
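The phrase-then-adjacency-then-anywhere search order described above can be sketched as a scoring function. The adjacency window and the substring-based phrase test are simplifications invented for this example, not Google's actual logic.

```python
# Tier 3: exact phrase; tier 2: all keywords close together;
# tier 1: all keywords anywhere on the page; tier 0: not a match.
def tier(page_text, keywords):
    words = page_text.lower().split()
    kws = [k.lower() for k in keywords]
    if not all(k in words for k in kws):
        return 0
    if " ".join(kws) in " ".join(words):     # crude phrase test
        return 3
    positions = [words.index(k) for k in kws]  # first occurrence of each term
    if max(positions) - min(positions) < 2 * len(kws):
        return 2                             # arbitrary "closeness" window
    return 1

def rank(pages, keywords):
    """Return matching pages, best tier first."""
    return sorted((p for p in pages if tier(p, keywords)),
                  key=lambda p: tier(p, keywords), reverse=True)
```

A page containing the exact phrase always outranks one where the terms are merely near each other, which in turn outranks one where they are scattered.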
It is unnecessary to use the plus sign (+) with any terms except stop words because
by default Google searches for all keywords.

However, there are many times when searchers need to exclude certain terms
that are commonly associated with a keyword but irrelevant to their search.

That's where the minus sign (-) comes in.

Using the minus sign in front of a keyword ensures that Google excludes that term
from the search. For example, the results for the search ["pearl harbor" -movie] are
very different from the results for ["pearl harbor"].

To force Google to search only for the term with the diacritic, put a plus sign in front of the term: [+façade].

Google Advanced

site: restricts search to a domain.
[shuttle site:www.nasa.gov] finds pages about the space shuttle at the NASA website.
[cirrus -site:mastercard.com] finds pages about the keyword cirrus that are not at the mastercard.com site.

intitle: restricts the results to documents containing the keyword in the title.
[intitle:amazon "rain forest"] finds all pages that include the word amazon in their title and mention the phrase "rain forest" anywhere in the document (title or text).

inurl: restricts the results to documents containing the keyword in the URL.
[inurl:nasa -site:gov] finds all pages that include nasa anywhere in the URL of sites that are not in the .gov top-level domain.

link: restricts the results to documents that have links to a specific webpage.
[link:www.noaa.gov] finds all pages linking to the NOAA homepage.

filetype: Google will search the content of many file types.
[filetype:doc bulletin]
Microsoft file types are potentially dangerous to open in their native formats.

define: returns definitions of a term, e.g., [define:blog]

Video search: results can be filtered by genre and duration, e.g., [is:free sharknado]

Contrary to popular opinion, everything is not on the Internet. In fact, much of the kind of information you are used to working with is not and never will be on the Internet. Unrealistic expectations about the kinds of information you may find on the Internet can lead to frustration and wasted time and effort. A general rule of thumb: the more sensitive, rare, or expensive the information, the less likely it is to be on the Internet. Also, much valuable data on the Internet requires payment.

Word Order Matters. Google gives more weight to the first term in a query, so put the most important search term(s) first. Try these two queries and you'll see how different the results are: [new york city] vs. [city york new]

YAHOO
https://fr.search.yahoo.com/
Boolean operator queries can give results that are different from those returned by Google:

[cardinals AND (bird OR catholic) AND NOT (baseball OR football)]
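That boolean query can be expressed directly as a predicate over a page's set of words. This is a sketch of the logic only, not Yahoo's actual query parser, and it treats a page as an unordered bag of words.

```python
# [cardinals AND (bird OR catholic) AND NOT (baseball OR football)]
# evaluated against one page's text.
def matches(text):
    words = set(text.lower().split())
    return ("cardinals" in words
            and ("bird" in words or "catholic" in words)
            and not ("baseball" in words or "football" in words))
```

The query keeps pages about the bird or the clergy and throws out the sports teams, which is exactly the disambiguation the operators are for.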

Here is an interesting twist on link searching, that is, finding sites that link to a specific address. This search, which works with Yahoo, finds pages that link to a specific domain or domains but not to another specific domain or domains.

[linkdomain:mod.ir linkdomain:ieimil.com -linkdomain:cia.gov]

This technique has obvious applicability for search engine optimization ("who is linking to my competitors but not linking to me?").
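The linkdomain query boils down to set logic: keep the pages that link to every required domain and subtract those that link to any excluded one. A sketch, with an invented map from pages to the domains they link out to (the domain names simply mirror the example query above):

```python
# Hypothetical outlink data: page -> set of domains it links to.
OUTLINKS = {
    "page1": {"mod.ir", "ieimil.com"},
    "page2": {"mod.ir", "ieimil.com", "cia.gov"},
    "page3": {"mod.ir"},
}

def linkdomain_query(outlinks, must_link, must_not_link):
    """Pages linking to all of must_link and none of must_not_link."""
    hits = []
    for page, domains in outlinks.items():
        if all(d in domains for d in must_link) and \
           not any(d in domains for d in must_not_link):
            hits.append(page)
    return hits
```

For the SEO question in the text, must_link would be your competitors' domains and must_not_link would be your own.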

Gigablast

http://www.gigablast.com/

Strengths
• simple interface
• cached copies with date indexed [archived copies]
• cached copies of webpages without images [stripped]
• links to Internet Archives [older copies]
• clusters results by default (can be turned off)
• no limit on number of search terms
Weaknesses
• most obviously, the Gigablast index is still smaller than those of Google or Yahoo
• no truncation
• is not case sensitive
• no wildcard
• limited file type searches
• limited language options
• poor documentation
Exalead
http://www.exalead.com/search/
The French search engine Exalead, which introduced a new look in 2006, has
features that make it worth special mention. Exalead offers both proximity searches
and truncation, two options no other major search engine offers anymore. In
addition, Exalead presents thumbnail images of websites in the results list (if you want them)

• Exalead refreshes its index continuously, not on a schedule (this is a good thing)
• default operator is AND; users may use OR.
• Exalead does not publish a search term limit
• as of now, Exalead has no sponsored links.
There are two other operators that can be used in a boolean query:
NEAR and OPT. NEAR finds search terms within 16 words of each other and OPT
makes a query term preferable but does not require it.
For example: [(football NEAR cardinals) OPT "st louis"]
This is nice to know because most search engines use AND as their default and will not return results unless all terms are found.
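NEAR and OPT can be sketched as follows, using the 16-word window the text gives for NEAR. The scoring values and toy documents are invented for the example; Exalead's actual ranking is not public.

```python
# NEAR: are terms a and b within `distance` words of each other?
def near(words, a, b, distance=16):
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

# Score a document for [(a NEAR b) OPT opt_term]:
# the NEAR clause is required; the OPT term boosts but is not required.
def score(text, near_pair, opt_term):
    words = text.lower().split()
    if not near(words, *near_pair):
        return 0
    return 2 if opt_term in words else 1
```

A document where the NEAR pair appears but more than 16 words apart fails entirely, while the OPT term merely reorders the survivors.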

Ask
http://fr.ask.com/
Specialized Search
The whole problem of keeping information on the Internet private dramatically
worsened almost overnight a couple of years ago when Google quietly started
indexing whole new types of data.

Originally, most of what got spidered and indexed was HTML webpages and documents, with some plain text thrown in for good measure. However, the ever-innovative Google decided this wasn't good enough and started to index PDF, PostScript, and, most importantly, a whole range of Microsoft file types: Word, Excel, PowerPoint, and Access.

The problem was, lots of folks had assumed these file types were "immune" to spidering, not because it couldn't be done but because no one had yet done it. As a result, many companies, organizations, and even governments had quite a lot of egg on their faces when sensitive documents began turning up in the Google database.
What kinds of sensitive information can routinely be found using search engines?

The types of data most commonly discovered by Google hackers usually fall into one of these categories:

• personal and/or financial information


• userids, computer or account logins, passwords
• private, confidential, or proprietary company data
• sensitive government information
• vulnerabilities in websites and servers that could facilitate breaking into the site

Ex: search by file type, site type, and keyword:

Many organizations store financial, inventory, personnel, and similar data in Excel spreadsheet format and often mark the information "Confidential," so a Google hacker looking for sensitive information about a company in South Africa might use a query such as:

[filetype:xls site:za confidential]


Other example keywords: not for distribution, login, password, etc.

Getting private information "back" is harder than preventing its disclosure in the first place. Even when Google removes your data, there are literally hundreds of other search engines around the world, and who knows what they have indexed from your site. It will not be an easy task finding out. And I'll hazard a guess that not all of them will be quite so accommodating as Google in removing pages.

Wikipedia
http://www.wikiwax.com/
To search all Wikipedias:
[site:wikipedia.org]

http://a9.com/
Amazon search

Google book search
[inpublisher:o-reilly]
[inauthor:patrick-o-brian]
[intitle:"nutmeg of consolation"]
[isbn:0393030326]

Answers
http://www.answers.com/

Wayback Machine
http://archive.org/web/
Using the Wayback Machine, you may very well be able to retrieve a page or an entire site
even if it disappeared from the web years ago.
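The Internet Archive also exposes a snapshot-lookup endpoint at archive.org/wayback/available. The sketch below only constructs the request URL (no network call is made); treat the exact parameter names as an assumption to verify against the current API documentation.

```python
from urllib.parse import urlencode

# Build a Wayback Machine availability-API request URL.
# timestamp (YYYYMMDD) hints which snapshot date to prefer.
def wayback_lookup_url(url, timestamp=None):
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return "http://archive.org/wayback/available?" + urlencode(params)
```

Fetching that URL returns JSON describing the closest archived snapshot, if one exists, which is how you would automate the "retrieve a vanished page" workflow described above.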

European search engines

http://www.searchenginesoftheworld.com/search_engines_of_europe/

International directory of search engines


http://www.searchenginecolossus.com/
Research Tips & Techniques

Tip 1: Use the Right Tool

The single biggest mistake researchers make is using the wrong search tool. For example, search engines are generally not useful for finding current news (use a specialized news search service). Wikis, custom search engines, and directories are generally better when researching a broad topic.

Tip 2: Search for the Most Obscure Term

Tip 3: Put the Most Important Search Term First

While it's not always true, search engines usually give more weight to the first term you list because the search software assumes it's the most important term (otherwise, why would you list it first?). Try these two queries in Google one after the other: [gardening roses] then [roses gardening]. The results are similar but not identical.
Tip 4: Search on the Singular Form First
While it is not always the case that search engines automatically search for plural
forms of search terms, many (including Yahoo and Google) do.
The converse, however, is not true, i.e., a search on [rose] will find roses, but a search on
[roses] will not find rose.
Therefore, it makes sense to search first on the singular form of a term.

Tip 5: Use Regional Search Services, Directories, and Databases

Tip 6: Search in the Native Language

Tip 7: Follow Those Links

Whenever you find a good website, always check its links. While in theory links at a web page that is indexed by a search engine should also have been indexed, the reality is often different. "Links" pages are often a gold mine of sites with similar information.

Tip 8: Learn Two Words in Any Non-English Language in Which You are Searching
Those two words are search and links. You need to be able to push the search or find button on a non-English web page, and you need to be able to find the links page.

Tip 9: Search on the LINK Field

Tip 10: Look Beyond Search Engines and the Web


Search engines and directories index only a tiny portion of the Internet. With some notable
exceptions, they are basically designed to index web pages. A vast amount of data is stored,
for example, in online databases, many of which are free and open to the public.

Tip 11: Configure and Use Two Browsers

Tip 12: Try URL Guessing


It works more frequently than you would imagine. For example, I found the Iranian
Ministry of Foreign Affairs by guessing www.mfa.gov.ir.

Tip 13: Change URLs to Find "Hidden" Webpages


Tip 14: Be on the Lookout for URL Errors

Not surprisingly, many URLs listed on webpages are incorrect. Among the most common mistakes are misspellings, putting a backslash (\) where a slash (/) should be, and including or excluding the L in HTML, e.g.:
http://www.examlpe.com/pathname\bigmistake.html

Tip 15: Take a Look at the "Site Map"

Tip 16: Try Using the "Mouseover"

For non-English sites where you don't know the language, try the "mouseover" trick, i.e., move your mouse over hyperlinks. Often, the link information is in English or, if it isn't, quite often the URL that appears in the toolbar at the bottom of the browser is revealing because it is likely to be written in English.
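The same trick can be done programmatically: pull every link target out of the page source so the (often English) URLs can be read even when the visible link text is in an unfamiliar language. A minimal sketch using Python's standard-library HTML parser; the sample markup is invented.

```python
from html.parser import HTMLParser

# Collect the href of every <a> tag in a page, in document order.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

Run against a Persian-language page, for instance, the link text may be unreadable to you, but the extracted URLs usually are not.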

Tip 17: Try Alternative Spellings, Especially of Non-English Names or Terms

Tip 18: Always Look at a Website's Native-Language Version
Usually, the native language version of a website will differ from the English version,
sometimes a little, sometimes a great deal.

Tip 19: Use Wildcards to Maximize Effectiveness

Tip 20: Examine Page Source Code


In addition to often revealing the webpage's language encoding, page source can
provide other helpful details, including names, dates, email addresses, type of
software used to create the page, etc.
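Some of those details can be mined automatically. The sketch below pulls the declared character set, the generator meta tag (the software that created the page), and any email addresses out of page source; the regular expressions are deliberately loose and the sample markup is invented.

```python
import re

# Extract a few of the details page source can reveal.
def inspect_source(html):
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html)
    charset = re.search(r'charset=["\']?([\w-]+)', html)
    generator = re.search(
        r'<meta\s+name=["\']generator["\']\s+content=["\']([^"\']+)',
        html, re.IGNORECASE)
    return {
        "emails": sorted(set(emails)),                       # de-duplicated
        "charset": charset.group(1) if charset else None,    # language encoding
        "generator": generator.group(1) if generator else None,  # authoring software
    }
```

The charset tells you the page's language encoding, and the generator string often dates the page and identifies the software used to create it.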

Tip 21: Ask for Help

Questions?
