
UNIT-2

Hypertext Transfer Protocol (HTTP)


2.1 Hypertext Transfer Protocol

Web browsers interact with web servers through a simple application-level protocol called HTTP
(Hypertext Transfer Protocol), which runs on top of TCP/IP network connections. HTTP is a
client-server protocol that defines how messages are formatted and transmitted, and what
action web servers and browsers should take in response to various commands. For example,
when the user enters a URL in the browser, the browser sends an HTTP command to the web
server directing it to fetch and transmit the requested web page. Some of the fundamental
characteristics of the HTTP protocol are:

Fig: A typical Web paradigm using request/response HTTP.

• The HTTP protocol uses the request/response paradigm, in which an HTTP client
program sends an HTTP request message to an HTTP server, which returns an HTTP
response message.
• HTTP is a pull protocol; the client pulls information from the server
(instead of the server pushing information down to the client).
• HTTP is a stateless protocol, that is, each request-response exchange is treated
independently. Clients and servers are not required to retain a state. An HTTP transaction
consists of a single request from a client to a server, followed by a single response from
the server back to the client. The server does not maintain any information about the
transaction. Some transactions require the state to be maintained.
• HTTP is media independent: any type of data can be sent by HTTP if both the client and
the server know how to handle the data content. It is required for both the client and
the server to specify the content type using an appropriate MIME type.
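As an illustration, the following minimal Python sketch (using only the standard http.client module) performs one such request/response exchange; the host www.example.com and the requested path are just placeholders.

import http.client

conn = http.client.HTTPConnection("www.example.com", 80)   # open a TCP connection to the server
conn.request("GET", "/")                                    # pull protocol: client sends an HTTP request
response = conn.getresponse()                               # server returns an HTTP response

print(response.status, response.reason)                     # e.g. 200 OK
print(response.getheader("Content-Type"))                   # MIME type agreed via headers
body = response.read()                                      # the entity body (the web page itself)
conn.close()                                                # stateless: nothing is remembered afterwards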
2.2 Hypertext Transfer Protocol Version

HTTP uses a <major>.<minor> numbering scheme to indicate versions of the protocol. The
version of an HTTP message is indicated by an HTTP-Version field in the first line. Here is the
general syntax of specifying HTTP version number:

HTTP-Version = "HTTP" "/" 1*DIGIT "." 1*DIGIT

The initial version of HTTP was referred to as HTTP/0.9, which was a simple protocol for raw
data transfer across the Internet. HTTP/1.0, as defined by RFC (Request for Comments) 1945,
improved the protocol. In 1997, HTTP/1.1 was formally defined, and it is currently an Internet
Draft Standard [RFC-2616]. Essentially all operational browsers and servers support HTTP/1.1.

2.3 Hypertext Transfer Protocol Connections

How a client communicates with the server depends on the type of connection
established between the two machines. An HTTP connection can either be persistent
or non-persistent. Non-persistent HTTP was used by HTTP/1.0. HTTP/1.1 uses the persistent
type of connection, also known as a keep-alive connection, with multiple
messages or objects being sent over a single TCP connection between client and server.
2.3.1 Non-Persistent Hypertext Transfer Protocol
HTTP/1.0 used a non-persistent connection in which only one object can be sent over a
TCP connection. Transmitting a file from one machine to another requires two Round
Trip Times (RTT), where an RTT is the time taken for a small packet to travel from client to server
and back:
• One RTT to initiate the TCP connection
• A second RTT for the HTTP request and the first few bytes of the HTTP response to return
• The rest of the time is taken in transmitting the file

Fig: RTT in a non-persistent HTTP.

While using non-persistent HTTP, the operating system has an extra overhead for maintaining
each TCP connection; as a result, many browsers often open parallel TCP connections to fetch
referenced objects. The steps involved in setting up a connection with non-persistent HTTP
are:

1. Client (browser) initiates a TCP connection to www.anyCollege.edu (server): handshake.
2. Server at host www.anyCollege.edu accepts the connection and acknowledges.
3. Client sends an HTTP request for file /someDir/file.html.
4. Server receives the message, finds the file, and sends it in an HTTP response.
5. Client receives the response, terminates the connection, and parses the object.
6. Steps 1-5 are repeated for each embedded object.

2.3.2 Persistent Hypertext Transfer Protocol

To overcome the issues of HTTP/1.0, HTTP/1.1 came with persistent connections through
which multiple objects can be sent over a single TCP connection between the client and server.
The server leaves the connection open after sending the response, so subsequent HTTP
messages between the same client/server are sent over the open connection. A persistent connection
also overcomes the problem of slow start: in non-persistent HTTP each object transfer suffers from
slow start, and the overall number of RTTs required for persistent HTTP is much less than for non-
persistent HTTP (see figure).
The steps involved in setting up a connection with persistent HTTP are:

1. Client (browser) initiates a TCP connection to www.sfgc.ac.in (server): handshake.
2. Server at host www.sfgc.ac.in accepts the connection and acknowledges.
3. Client sends an HTTP request for file /someDir/file.html.
4. Server receives the request, finds the object, and sends it in an HTTP response.
5. Client receives the response and parses the object; the connection remains open.
6. Steps 3-5 are repeated for each embedded object.

Fig: RTT in a persistent HTTP.


Thus, the overhead of HTTP/1.0 is 1 RTT for each start (each request/response); that is, if there
are 10 objects, then the Total Transmission Time (TTT) is as follows:

TTT = [10 * 1 TCP/RTT] + [10 * 1 REQ/RESP RTT] = 20 RTT

Whereas for HTTP/1.1, persistent connections are very helpful with multi-object requests as
the server keeps the TCP connection open by default:

TTT = [1 * 1 TCP/RTT] + [10 * 1 REQ/RESP RTT] = 11 RTT
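The two totals above can be checked with a tiny sketch; the function below simply counts RTTs under the stated assumptions and ignores the actual file transmission time.

def total_rtt(num_objects, persistent):
    if persistent:
        # one TCP handshake, then one request/response round trip per object
        return 1 + num_objects
    # non-persistent: every object needs its own TCP handshake plus a request/response
    return 2 * num_objects

print(total_rtt(10, persistent=False))  # 20 RTT (HTTP/1.0 style)
print(total_rtt(10, persistent=True))   # 11 RTT (HTTP/1.1 style)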

2.4 Hypertext Transfer Protocol Communication

In its simplest form, the communication model of HTTP involves an HTTP client, usually a
web browser on a client machine, and an HTTP server, more commonly known as a web
server. The basic HTTP communication model has four steps:

• Handshaking: Opening a TCP connection to the web server.


• Client request: After a TCP connection is created, the HTTP client sends a request
message formatted according to the rules of the HTTP standard—an HTTP Request. This
message specifies the resource that the client wishes to retrieve or includes information to
be provided to the server.
• Server response: The server reads and interprets the request. It takes action relevant to
the request and creates an HTTP response message, which it sends back to the client. The
response message indicates whether the request was successful, and it may also contain
the content of the resource that the client requested, if appropriate.
• Closing: Closing the connection (optional).
Handshaking:
For opening a TCP connection, the user on the client side inputs the URL containing the name
of the web server in the web browser. Then, the web browser asks the DNS (Domain Name
Server) for the IP address of the given URL. If the DNS fails to find the IP address of the
URL, it shows an error (for example, "Netscape (Browser) is unable to locate the server") on the
client's screen. If the DNS finds the IP address of the URL, then the client browser opens a
TCP connection to port 80 (the default port of HTTP, although one can specify another port
number explicitly in the URL) of the machine whose IP address has been found.
Request Message

After handshaking in the first step, the client (browser) requests an object (file) from the
server. This is done with a human-readable message. Every HTTP request message has the
same basic structure.

Start Line:
Request method: It indicates the type of request a client wants to send. These are also called
methods:

Method = GET | HEAD | POST | PUT | DELETE | TRACE | OPTIONS | CONNECT | COPY | MOVE

GET: Requests the server to return the resource specified by the Request-URI as the body of a response.
HEAD: Requests the server to return the same HTTP header fields that would be returned if a GET
method was used, but not the message body that would be returned to a GET.
POST: The most common use of the POST method is to submit an HTML form to the server. Since
the information is included in the body, large chunks of data such as an entire file can be sent to
the server.
PUT: It is used to upload a new resource or replace an existing document. The actual document is
specified in the body part.

DELETE: Requests the server to respond to future HTTP request messages that contain the specified
Request-URI with a response indicating that there is no resource associated with this Request-URI.

TRACE: Requests the server to return a copy of the complete HTTP request message, including start
line, header fields, and body, received by the server.
MOVE: It is similar to the COPY method except that it deletes the source file.
CONNECT: It is used to convert a request connection into a transparent TCP/IP tunnel.
COPY: The HTTP protocol may be used to copy a file from one location to another.
Headers:
The HTTP protocol specification makes a clear distinction between general headers,
request headers, response headers, and entity headers. Both request and response messages
have general headers, which have no relation to the data eventually transmitted in the body. The
headers are separated by an empty line from the request and response body. The format of a
request header is shown in the following table:

General Header | Request Header | Entity Header

A header consists of a single line or multiple lines. Each line is a single header of the following
form:

Header-name: Header-value
General Headers
General headers do not describe the body of the message. They provide information
about the messages instead of what content they carry.
• Connection: Close
This header indicates whether the client or server that generated the message
intends to keep the connection open.
• Warning: Danger. This site may be hacked!
This header stores text for human consumption, something that would be useful when
tracing a problem.
• Cache-Control: no-cache
This header shows whether caching should be used.

Request Header:

It allows the client to pass additional information about itself and about the request, such as
the data format that the client expects.

• User-Agent: Mozilla/4.75
Identifies the software (e.g., a web browser) responsible for making the request.
• Host: www.netsurf.com
This header was introduced to support virtual hosting, a feature that allows a web server to
service more than one domain.
• Referer: http://www.dtu.ac.in/~akshi/index.html
This header provides the server with context information about the request. If the request
came about because a user clicked on a link found on a web page, this header contains
the URL of that referring page.
• Accept: text/plain
This header specifies the format of the media that the client can accept.
Entity Header:

• Content-Type: mime-type/mime-subtype
This header specifies the MIME type of the content of the message body.
• Content-Length: 546
This optional header provides the length of the message body. Although it is optional, it is
useful for clients such as web browsers that wish to impart information about the
progress of a request.
• Last-Modified: Sun, 1 Sept 2016 13:28:31 GMT
This header provides the last modification date of the content that is transmitted in the
body of the message. It is critical for the proper functioning of caching mechanisms.
• Allow: GET, HEAD, POST
This header specifies the list of valid methods that can be applied to a URL.
Message Body:
The message body part is optional for an HTTP message, but, if it is available, then it is used to
carry the entity-body associated with the request. If an entity-body is present, then usually the
Content-Type and Content-Length header lines specify the nature of the associated body.
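To make this structure concrete, here is a small Python sketch that assembles a request message by hand (start line, headers, empty line, body) and sends it over a raw TCP socket; the host, path, and form data are made-up placeholders.

import socket

body = "name=akshi&unit=2"
request = (
    "POST /someDir/form.html HTTP/1.1\r\n"        # start line: method, Request-URI, HTTP version
    "Host: www.example.com\r\n"                   # required header in HTTP/1.1
    "Content-Type: application/x-www-form-urlencoded\r\n"
    f"Content-Length: {len(body)}\r\n"
    "Connection: close\r\n"
    "\r\n"                                        # empty line separates headers from the body
    + body                                        # optional message body
)

with socket.create_connection(("www.example.com", 80)) as s:
    s.sendall(request.encode("ascii"))
    reply = s.recv(4096)                          # first part of the server's response
    print(reply.decode("ascii", errors="replace"))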

Response Message: Similar to an HTTP request message, an HTTP response message consists
of a status line, header fields, and the body of the response, in the following format.

Fig: A sample HTTP request message.

Fig: A sample HTTP response message.


Status Line:
The status line consists of three parts: HTTP version, status code, and status phrase. Two consecutive
parts are separated by a space.

HTTP version   Status code   Status phrase

• HTTP version: This field specifies the version of the HTTP protocol being used by the
server. The current version is HTTP/1.1.
• Status code: It is a three-digit code that indicates the status of the response. The status
codes are classified with respect to their functionality into five groups as follows:
• 1xx series (Informational): This class of status codes represents provisional
responses.
• 2xx series (Success): This class of status codes indicates that the client's request was
received, understood, and accepted successfully.
• 3xx series (Re-directional): These status codes indicate that additional actions must
be taken by the client to complete the request.
• 4xx series (Client error): These status codes are used to indicate that the client request
had an error and therefore cannot be fulfilled.
• 5xx series (Server error): This set of status codes indicates that the server
encountered some problem and hence the request cannot be satisfied at this time. The
reason for the failure is embedded in the message body. It is also indicated whether the failure is
temporary or permanent. The user agent should accordingly display a message on the screen to
make the user aware of the server failure.
Status phrase: It is also known as the Reason-phrase and is intended to give a short textual description
of the status code.
Example:

404 Not Found: The requested resource could not be found but may be available in the
future. Subsequent requests by the client are permissible.

408 Request Timeout: The server timed out waiting for the request.

Headers:
Headers in an HTTP response message are similar to the ones in a request message except
for one aspect: in place of a request header, they contain a response header.

General Header | Response Header | Entity Header

• Response Header
Response headers help the server to pass additional information about the response that
cannot be inferred from the status code alone, such as information about the server and
the data being sent.
• Location: http://www.mywebsite.com/relocatedPage.html
This header specifies a URL towards which the client should redirect its original
request. It always accompanies the "301" and "302" status codes that direct clients to try a
new location.
• WWW-Authenticate: Basic
This header accompanies the "401" status code that indicates an authorization
challenge. It specifies the authentication scheme which should be used to access the
requested entity.
• Server: Apache/1.2.5
This header is not tied to a particular status code. It is an optional header that identifies
the server software.
• Age: 22
This header specifies the age of the resource in the proxy cache in seconds.
Message Body
Similar to HTTP request messages, the message body in an HTTP response message is
also optional. The message body carries the actual HTTP response data from the server
(including files, images, and so on) to the client.

Fig: A sample HTTP response message.
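As a rough illustration of this structure, the following Python sketch splits a hand-written sample response into its status line, headers, and body; the sample text itself is invented for the example.

raw = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Apache/1.2.5\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 25\r\n"
    "\r\n"
    "<html><p>Hello</p></html>"
)

head, _, body = raw.partition("\r\n\r\n")          # the empty line separates headers from the body
status_line, *header_lines = head.split("\r\n")
version, code, phrase = status_line.split(" ", 2)  # HTTP version, status code, status phrase
headers = dict(line.split(": ", 1) for line in header_lines)

print(version, code, phrase)        # HTTP/1.1 200 OK
print(headers["Content-Type"])      # text/html
print(body)                         # the message body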

2.5 Hypertext Transfer Protocol Secure


HTTPS is a protocol for secure communication over the Internet. It was developed by
Netscape. Strictly speaking, it is not a separate protocol, but simply the result of combining HTTP
with the SSL/TLS (Secure Socket Layer/Transport Layer Security) protocol. It is also called secure
HTTP, as it sends and receives everything in encrypted form, adding the element of safety.
HTTPS is often used to protect highly confidential online transactions like online banking and
online shopping order forms. The use of HTTPS protects against eavesdropping and man-in-
the-middle attacks. While using HTTPS, servers and clients still speak exactly the same HTTP
to each other, but over a secure SSL connection that encrypts and decrypts their requests and
responses. The SSL layer has two main purposes:
• Verifying that you are talking directly to the server that you think you are talking to.
• Ensuring that only the server can read what you send it, and only you can read what it
sends back.
2.6 Hypertext Transfer Protocol State Retention:Cookies
HTTP is a stateless protocol. Cookies are an application-based solution to provide state
retention over a stateless protocol. They are small pieces of information that are sent in a
response from the web server to the client. Cookies are the simplest technique used for storing client
state. A cookie is also known as an HTTP cookie, web cookie, or browser cookie. Cookies are not
software; they cannot be programmed, cannot carry viruses, and cannot install malware on
the host computer. However, they can be used by spyware to track a user's browsing activities.
Cookies are stored on a client's computer. They have a lifespan and are destroyed by the client
browser at the end of that lifespan.

Fig: HTTP Cookie
Creating Cookies
When receiving an HTTP request, a server can send a Set-Cookie header with the response. The
cookie is usually stored by the browser and, afterwards, the cookie value is sent along with every
request made to the same server as the content of a Cookie HTTP header.
A simple cookie can be set like this:

Set-Cookie: <cookie-name>=<cookie-value>
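For illustration, the following sketch uses Python's standard http.cookies module to build such a Set-Cookie header on the server side and to read the value a client would send back; the cookie name and value are placeholders.

from http.cookies import SimpleCookie

# Server side: create a cookie to send with the response
jar = SimpleCookie()
jar["session_id"] = "abc123"
jar["session_id"]["max-age"] = 3600           # keep it for one hour
print(jar.output())                           # Set-Cookie: session_id=abc123; Max-Age=3600

# Client side: the browser returns the value on later requests in a Cookie header
incoming = SimpleCookie()
incoming.load("session_id=abc123")            # value of the Cookie request header
print(incoming["session_id"].value)           # abc123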
There are various kinds of cookies that are used for different scenarios depending on the need.
These different types of cookies are given with a brief description below.
Types of Cookies

• Session cookie: A session cookie only lasts for the duration of the user's visit to the website. The web
browser normally deletes session cookies when it quits.
• Persistent cookie (tracking cookie): A persistent cookie will outlast user sessions. If a persistent cookie
has its max-age set to 1 year, then, within the year, the initial value set in that cookie would be sent back
to the server every time the user visited the server.
• Secure cookie: A secure cookie is used when a browser is visiting a server via HTTPS, ensuring
that the cookie is always encrypted when transmitted from client to server.
• Zombie cookie: A zombie cookie is any cookie that is automatically recreated after the user has
deleted it.

• Persistence: One of the most powerful aspects of cookies is their persistence. When a
cookie is set on the client's browser, it can persist for days, months, or even years. This
makes it easy to save user preferences and visit information and to keep this information
available every time the user returns to a website. Moreover, as cookies are stored on
the client's hard disk, they are still available even if the server crashes.
• Transparent: Cookies work transparently without the user being aware that information
needs to be stored.
• They lighten the load on the server's memory.
2.7 Hypertext Transfer Protocol Cache

Caching is the term for storing reusable responses in order to make subsequent requests
faster. The caching of web pages is an important technique to improve the Quality of Service
(QoS) of web servers. Caching can reduce the network latency experienced by clients; for
example, web pages can be loaded more quickly in the browser. Caching can also conserve
bandwidth on the network, thus increasing the scalability of the network with the help of an
HTTP proxy cache (also known as a web cache). Caching also increases the availability of web
pages.
• Browser cache: Web browsers themselves maintain a small cache. Typically, the browser
sets a policy that dictates the most important items to cache. This may be user-specific
content or content deemed expensive to download and likely to be requested again.
• Intermediary caching proxies (web proxy): Any server in between the client and your
infrastructure can cache certain content as desired. These caches may be maintained by
ISPs or other independent parties.
• Reverse cache: Your server infrastructure can implement its own cache for backend
services. This way, content can be served from the point of contact instead of hitting
backend servers on each request.
Cache Consistency

Cache consistency mechanisms ensure that cached copies of web pages are eventually updated
to reflect changes in the original web pages. There are basically two cache consistency
mechanisms currently in use for HTTP proxies:

• Pull method: In this mechanism, each web page cached is assigned a time-to-serve field,
which indicates the time of storing the web page in the cache. An expiration time of one
or two days is also maintained. If the time has expired, a fresh copy is obtained when a user
requests the page (a small sketch of this approach follows the list).

• Push method: In this mechanism, the web server is assigned the responsibility of
making all cached copies consistent with the server copy.
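Here is a rough sketch of the pull method: each cached page keeps a time-to-serve value and is refetched once the expiration time has passed. fetch_from_origin is a hypothetical helper standing in for a real HTTP request.

import time

EXPIRATION = 24 * 3600          # one day, in seconds
cache = {}                      # url -> (page, time_to_serve)

def get_page(url, fetch_from_origin):
    entry = cache.get(url)
    if entry is not None:
        page, time_to_serve = entry
        if time.time() - time_to_serve < EXPIRATION:
            return page                        # cached copy is still fresh
    page = fetch_from_origin(url)              # expired or missing: pull a fresh copy
    cache[url] = (page, time.time())
    return page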
2.8 Evolution of Web
Web 1.0:
The first generation of the web, Web 1.0, was introduced by Tim Berners-Lee in late 1990 as a
technology-based solution for businesses to broadcast their information to people. The core elements of
the web were HTTP, HTML, and URL. Web 1.0 served as an unlimited source of information.

Web 2.0:
Web 2.0 is the term used to describe the second generation of the World Wide Web that emerged in
the early 2000s. Unlike Web 1.0, which was primarily focused on the one-way dissemination of
information, Web 2.0 is characterized by a more collaborative and interactive approach to web content
and user engagement.

Web 2.0 Technologies:

Web 2.0 encourages a wider range of expression, facilitates more collaborative
ways of working, enables community creation, fosters dialogue and knowledge
sharing, and creates a setting for learners with various tools and technologies,
such as blogging, social networking sites, podcasts, wikis, and micro-blogging.
 Weblog or Blog: A Weblog, or "blog," is a personal journal or newsletter on
the Web. Some blogs are highly influential and have enormous readership,
while others are mainly intended for a close circle of family and friends.
The power of Weblogs is that they allow millions of people to easily publish
their ideas on the Web.
 Social Networking Sites:


Social networking sites, with Facebook being the best-known, allow users to set up a personal
profile page where they can post regular status updates, maintain links to contacts known as
"friends" through a variety of interactive channels, and assemble and display their interests in the
form of texts, photos, videos, group memberships, and so on.
 Podcasts:
A Podcast is basically just an audio (or video) file. A podcast is different from
other types of audio on the Internet because a podcast can be subscribed to by
the listeners, so that when new podcasts are released, they are automatically
delivered, or fed, to a subscriber’s computer or mobile device.
 Wikis
A single page in a wiki website is referred to as a wiki page.
The entire collection of wiki pages, which are usually interconnected with hyperlinks, is "the
wiki." A wiki is essentially a database for creating, browsing, and searching through information.
 Micro-blogging
Micro-blogging is the practice of posting small pieces of digital content, which could be text,
pictures, links, short videos, or other media, on the Internet. Micro-blogging enables users
to write brief messages, usually limited to fewer than 200 characters, and publish them via web
browser-based services, email, or mobile phones.
Web 3.0:
Web 3.0, also known as the semantic web, is the next generation of the World Wide Web that aims to
create a more intelligent, interconnected, and contextualized web experience. While Web 2.0 focused
on user-generated content and social interaction, Web 3.0 aims to bring a more automated and
personalized experience to the web. Web 3.0 is based on a specific set of principles, technical
parameters, and values that distinguish it from earlier iterations of the World Wide Web: Web 2.0 and
Web 1.0. Web 3.0 envisions a world without centralized companies, where people are in control of
their own data and transactions are transparently recorded on blockchains, or databases searchable by
anyone.
Features of Web 3.0

Semantic Web
Semantic means “relating to meaning in language or logic.” The Semantic Web improves the abilities
of web technologies to generate, share, and connect content through search and analysis
by understanding the meaning of language beyond simple keywords.
Artificial intelligence
Web 3.0 leans on artificial intelligence (AI) to develop computers that can understand the meaning or
context of user requests and answer complex requests more quickly. The artificial intelligence of the
Web 3.0 era goes beyond the interactivity of Web 2.0 and creates experiences for people that feel
curated, seamless, and intuitive — a central aim behind the development of the metaverse.
Decentralization
Web 3.0 envisions a truly decentralized internet, where connectivity is based completely on peer-
to-peer network connections. This decentralized web will rely on blockchain to store data and
maintain digital assets without being tracked.
Ubiquity
Ubiquity means appearing everywhere or being very common. The definition of ubiquity in terms of
Web 3.0 refers to the idea that the internet should be accessible from anywhere, through any
platform, on any device. Along with digital ubiquity comes the idea of equality. If Web 3.0 is
ubiquitous, it means it is not limited. Web 3.0 is not meant for the few, it is meant for the many.

Web 1.0 vs. Web 2.0 vs. Web 3.0

• Web 1.0: Despite only providing limited information and little to no user interaction, it was the
first and most reliable internet in the 1990s. There was no such thing as user pages or even
commenting on articles. Consumers struggled to locate valuable information in Web 1.0 since
there were no algorithms to scan through websites.
• Web 2.0: Because of developments in web technologies such as JavaScript, HTML5, CSS3, etc.,
Web 2.0 made the internet a lot more interactive. Social networks and user-generated content
production have flourished because data can now be distributed and shared. Many web inventors,
including Jeffrey Zeldman, pioneered the set of technologies used in this internet era.
• Web 3.0: Web 3.0 is the next break in the evolution of the Internet, allowing it to understand data
in a human-like manner. It will use AI technology, Machine Learning, and Blockchain to provide
users with smart applications. This will enable the intelligent creation and distribution of highly
tailored content to every internet user.
2.9 Big Data:

Big Data is a trending set of techniques that demand new ways of consolidating various
methods to uncover hidden information from the massive and complex raw
supply of data. User-generated content on the Web has been established as a type of Big Data,
and, thus, a discussion about Big Data is inevitable in any description of the evolution and
growth of the Web. Following are the types of Big Data that have been identified across the
literature:
Social Networks (human-sourced information): Human-sourced information is now almost
entirely digitized and stored everywhere from personal computers to social networks. Data are
loosely structured and often ungoverned.
Big Data Characteristics
o Volume
o Veracity
o Variety
o Value
o Velocity

Volume

The name Big Data itself is related to enormous size. Big Data refers to the vast volumes of data
generated daily from many sources, such as business processes, machines, social media platforms,
networks, human interactions, and many more.

Variety

Big Data can be structured, unstructured, or semi-structured, collected from
different sources. In the past, data was collected only from databases and spreadsheets, but these days
data comes in many forms, such as PDFs, emails, audio, social media posts, photos, videos, etc.

The data is categorized as below:

a. Structured data: Structured data follows a defined schema with all the required columns. It is
in a tabular form and is stored in a relational database management system.
b. Semi-structured data: In semi-structured data, the schema is not appropriately defined, e.g., JSON,
XML, CSV, TSV, and email. OLTP (Online Transaction Processing) systems are built to
work with semi-structured data. It is stored in relations, i.e., tables.
c. Unstructured data: All unstructured files, such as log files, audio files, and image files, are
included in unstructured data. Some organizations have a lot of data available, but they do
not know how to derive value from it since the data is raw.
d. Quasi-structured data: This data format contains textual data with inconsistent formats
that can be formatted with effort, time, and some tools.
Veracity

Veracity refers to how reliable the data is. There are many ways to filter or translate the data. Veracity
is the process of being able to handle and manage data efficiently. Big Data is also essential in business
development.

Value

Value is an essential characteristic of Big Data. It is not just any data that we process or store; it
is valuable and reliable data that we store, process, and also analyze.

Velocity

Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at
which data is created in real time. It covers the speed of incoming data sets, the rate of change,
and activity bursts. A primary aspect of Big Data is to provide demanded data rapidly.

Big Data velocity deals with the speed at which data flows from sources like application logs, business
processes, networks, social media sites, sensors, mobile devices, etc.

2.10 Web IR: Information Retrieval on the Web


Web Information Retrieval Tools
These are automated methods for retrieving information on the Web and can be broadly
classified as search tools or search services.
Search Tools
A search tool provides a user interface (UI) where the user can specify queries and browse the
results.
Class 1 search tools: General purpose search tools completely hide the organization and content
of the index from the user.
Class 2 search tools: Subject directories feature a hierarchically organized subject catalog or
directory of the Web, which is visible to users as they browse and search.

Search Services
Search services provide a layer of abstraction over several search tools and databases, aiming
to simplify web search. Search services broadcast user queries to several search engines and various
other information sources simultaneously.
2.11 Web Information Retrieval Architecture (Search Engine Architecture)
A search engine is an online answering machine, which is used to search, understand, and
organize content in its database based on the search query (keywords) entered by end-users
(internet users). To display search results, all search engines first find the valuable results in their
database, sort them to make an ordered list based on the search algorithm, and display them in front of
end-users. The page that presents this organized list of results is commonly known as a Search Engine
Results Page (SERP).

There are the following four basic components of Search Engine -

1. Web Crawler

Web Crawler is also known as a search engine bot, web robot, or web spider. It plays an essential
role in search engine optimization (SEO) strategy. It is mainly a software component that traverses
the web, downloading and collecting all the information over the Internet (a minimal crawler sketch
is given after the list below).

There are the following web crawler features that can affect the search results –

o Included Pages
o Excluded Pages
o Document Types
o Frequency of Crawling
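The following is a deliberately simplified crawler sketch using only the Python standard library; it ignores real-world concerns such as robots.txt, politeness delays, and document types, and the seed URL and page limit are arbitrary choices.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    queue, seen, pages = deque([seed]), {seed}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                              # skip pages that cannot be fetched
        pages[url] = html                         # store the downloaded document
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:                 # follow the referenced pages breadth-first
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages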

2. Database

The search engine database is a type of Non-relational database. It is the place where all the web
information is stored. It has a large number of web resources. Some most popular search engine
databases are Amazon Elastic Search Service and Splunk.

There are the following two database variable features that can affect the search results:

o Size of the database


o The freshness of the database

3. Search Interfaces

Search Interface is one of the most important components of a search engine. It is an interface between
the user and the database. It basically helps users to query the database.

There are the following features Search Interfaces that affect the search results -

o Operators
o Phrase Searching
o Truncation

4. Ranking Algorithms

The ranking algorithm is used by Google to rank web pages according to the Google search algorithm.

There are the following ranking features that affect the search results -

o Location and frequency


o Link Analysis
o Clickthrough measurement

How do search engines work?

Every search engine performs the following tasks:

1. Crawling

Crawling is the first stage in which a search engine uses web crawlers to find, visit, and download the
web pages on the WWW (World Wide Web). Crawling is performed by software robots, known as
"spiders" or "crawlers." These robots are used to review the website content.

2. Indexing

Indexing builds an online library of websites, which is used to sort, store, and organize the content
found during crawling. Once a page is indexed, it can appear as a result for the most valuable and most
relevant queries.

3. Ranking and Retrieval

The ranking is the last stage of the search engine. It is used to provide the piece of content that will be
the best answer to the user's query. It displays the best content at the top of the results.

Search Engine Processing

There are following two major Search Engine processing functions -

1. Indexing process

Indexing is the process of building a structure that enables searching.

Indexing process contains the following three blocks -

i. Text acquisition

It is used to identify and store documents for indexing.

ii. Text transformation

It is the process of transforming documents into index terms or features.

iii. Index creation

Index creation takes the output from text transformation and creates the indexes or data structures that
enable fast searching (a small sketch of an inverted index is given below).
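As a small illustration of index creation, the sketch below builds an inverted index that maps each term to the documents (and term frequencies) containing it; the toy documents are placeholders.

from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: "raw text"} -> {term: {doc_id: term_frequency}}"""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for term in text.lower().split():         # trivial text transformation
            index[term][doc_id] += 1
    return index

index = build_inverted_index({
    1: "web information retrieval",
    2: "web search engine architecture",
})
print(dict(index["web"]))     # {1: 1, 2: 1} -- both documents contain "web"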

2. Query process

The query is the process of producing the list of documents based on a user's search query.

There are the following three tasks of the Query process -

i. User interaction
User interaction provides an interface between the users who search the content and the search engine.

ii. Ranking

The ranking is the core component of the search engine. It takes query data from the user interaction
and generates a ranked list of data based on the retrieval model.

iii. Evaluation

Evaluation is used to measure and monitor the effectiveness and efficiency. The evaluation result helps
us to improve the ranking of the search engine.

2.12 Web Information Retrieval Performance Metrics

Like in the information retrieval community, system evaluation in Web IR (search engines) also
revolves around the notion of relevant and non-relevant documents. In a binary decision
problem, a classifier labels examples as either positive or negative. The decision made by the
classifier can be represented in a structure known as a confusion matrix or contingency table.
The confusion matrix has four categories: True positives (TP) are examples correctly labeled
as positive. False positives (FP) refer to negative examples incorrectly labeled as positive; they
form Type-I errors. True negatives (TN) correspond to negatives correctly labeled as negative.
And false negatives (FN) refer to positive examples incorrectly labeled as negative; they form
Type-II errors.
Fig: confusion matrix

• Precision: This is defined as the number of relevant documents retrieved by a
search divided by the total number of documents retrieved by that search:

Precision = (relevant documents retrieved) / (total documents retrieved) = TP / (TP + FP)

• Recall: This is also known as the true positive rate, sensitivity, or hit rate.
It is defined as the number of relevant documents retrieved by a search divided by the total
number of existing relevant documents:

Recall = (relevant documents retrieved) / (total relevant documents) = TP / (TP + FN)

• F-measure (in information retrieval): This can be used as a single measure of
performance. The F-measure is the harmonic mean of precision and recall:

F = 2 * Precision * Recall / (Precision + Recall)
n = 165      | Predicted: NO | Predicted: YES | Total
Actual: NO   | TN = 50       | FP = 10        | 60
Actual: YES  | FN = 5        | TP = 100       | 105
Total        | 55            | 110            | 165

The performance measures are thus computed from the confusion matrix for a binary
classifier as follows:

• Accuracy: Overall, how often is the classifier correct?
(TP + TN)/total = (100 + 50)/165 = 0.91, which implies 91% accuracy.
• Misclassification rate or error rate: Overall, how often is it wrong?
(FP + FN)/total = (10 + 5)/165 = 0.09, which implies a 9% error rate (equivalent to 1 minus
accuracy).
• Recall, true positive rate, or sensitivity: When it's actually yes, how often does it
predict yes?
TP/actual yes = 100/105 = 0.95, which implies 95% recall.
• False positive rate or fall-out: When it's actually no, how often does it predict yes?
FP/actual no = 10/60 = 0.17
• Specificity: When it's actually no, how often does it predict no?
TN/actual no = 50/60 = 0.83 (equivalent to 1 minus the false positive rate)
• Precision: When it predicts yes, how often is it correct?
TP/predicted yes = 100/110 = 0.91, which implies 91% precision.
• F-measure: 2 * precision * recall / (precision + recall)
= 2 * 0.91 * 0.95 / (0.91 + 0.95) = 1.729/1.86 = 0.9296 ≈ 0.93
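The same numbers can be reproduced with a short sketch that computes the measures directly from the confusion-matrix counts given above.

TP, FP, TN, FN = 100, 10, 50, 5
total = TP + FP + TN + FN                                    # 165

accuracy    = (TP + TN) / total                              # 0.91
error_rate  = (FP + FN) / total                              # 0.09
recall      = TP / (TP + FN)                                 # 0.95
fallout     = FP / (FP + TN)                                 # 0.17
specificity = TN / (TN + FP)                                 # 0.83
precision   = TP / (TP + FP)                                 # 0.91
f_measure   = 2 * precision * recall / (precision + recall)  # ~0.93

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f_measure, 2))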

2.13 Web Information Retrieval Models


Standard Boolean Model:
The standard Boolean model is based on Boolean logic and classical set theory, where both
the documents to be searched and the user's query are conceived as sets of terms. Retrieval
is based on whether the documents contain the query terms. A query is represented as a
Boolean expression of terms in which terms are combined with the logical operators AND,
OR, and NOT.
Algebraic Model: Documents are represented as vectors, matrices, or tuples. Using
algebraic operations, these are transformed to a one-dimensional similarity measure.
Implementations include the vector space model, the generalized vector space model,
the (enhanced) topic-based vector space model, and latent semantic indexing.
• Vector Space Model (VSM): The VSM is an algebraic model used for information
retrieval where the documents are represented through the words that they contain.
It represents natural language documents in a formal manner by the use of vectors
in a multi-dimensional space.
Model:

– Each document is broken down into a word frequency table. The tables are called
vectors and can be stored as arrays.
– A vocabulary is built from all the words in all the documents in the system.
– Each document and user query is represented as a vector based on the vocabulary.
– Calculating a similarity measure.
– Ranking the documents for relevance (a small sketch of this computation is given after the list).
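Here is a minimal sketch of these steps: documents and the query are turned into term-frequency vectors over a shared vocabulary and ranked by cosine similarity, one common choice of similarity measure; the toy documents are placeholders.

import math
from collections import Counter

def to_vector(text, vocabulary):
    counts = Counter(text.lower().split())        # word frequency table for one document
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = ["information retrieval on the web", "web page ranking algorithms"]
query = "web information retrieval"

vocabulary = sorted({w for text in docs + [query] for w in text.lower().split()})
doc_vectors = [to_vector(d, vocabulary) for d in docs]
query_vector = to_vector(query, vocabulary)

# rank documents by similarity to the query
ranking = sorted(enumerate(docs), key=lambda item: -cosine(doc_vectors[item[0]], query_vector))
for doc_id, text in ranking:
    print(round(cosine(doc_vectors[doc_id], query_vector), 3), text)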
Extended Boolean Model:
The idea of the extended model is to make use of partial matching and term weights
as in the vector space model. It combines the characteristics of the vector space model with
the properties of Boolean algebra and ranks the similarity between queries and documents.
Documents are returned by ranking them on the basis of frequency of query terms (ranked
Boolean). The concept of term weights was introduced to reflect the (estimated) importance
of each term.
Probabilistic Models: Probability theory seems to be the most natural way to quantify
uncertainty. A document's relevance is interpreted as a probability. Document and query
similarities are computed as probabilities for a given query. The probabilistic model takes
these term dependencies and relationships into account and, in fact, specifies major
parameters, such as the weights of the query terms and the form of the query-document
similarity. Common models are the basic probabilistic model, Bayesian inference
networks, and language models.

Hyperlink-Induced Topic Search (HITS)

This is an algorithm developed by Kleinberg in 1998. It defines authorities as pages
that are recognized as providing significant, trustworthy, and useful information on a
topic. In-degree (the number of pointers to a page) is one simple measure of authority.
However, in-degree treats all links as equal. Hubs are index pages that provide lots of
useful links to relevant content pages (topic authorities). HITS attempts to computationally
determine hubs and authorities on a particular topic through analysis of a relevant sub-
graph of the Web. This is based on the mutually recursive facts that hubs point to lots of
authorities and authorities are pointed to by lots of hubs. Together, they tend to form a
bipartite graph.
Algorithm:

 Compute hubs and authorities for a particular topic specified by a normal query.
 First determine a set of relevant pages for the query, called the base set S.
 Analyze the link structure of the web sub-graph defined by S to find authority and hub pages
in this set.
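A compact sketch of the HITS iteration on a small, made-up link graph is given below; it simply applies the two mutually recursive update rules and normalizes the scores each round.

def hits(links, iterations=20):
    """links: {page: [pages it points to]}"""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of hub scores of the pages that point to p
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # hub score: sum of authority scores of the pages that p points to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

hub, auth = hits({"A": ["B", "C"], "B": ["C"], "D": ["B", "C"]})
print(max(auth, key=auth.get))   # C: pointed to by the best hubs, so the top authority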
Page Rank (Google): An alternative link-analysis method is used by Google,
known as PageRank, given by Brin and Page in 1998. It does not attempt to capture
the distinction between hubs and authorities. It ranks pages just by authority and is applied
to the entire web rather than a local neighborhood of pages surrounding the results of a
query.
2.14 Google PageRank

Google used PageRank to determine the ranking of pages in its search results. As
Google became the dominant search engine, it sparked a massive demand for backlinks. In
the original paper on PageRank, the concept was defined as "a method for computing a ranking
for every web page based on the graph of the web. PageRank is an attempt to see how good an
approximation to importance can be obtained just from the link structure."
We assume page A has pages T1...Tn which point to it (i.e., are citations). The
parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85.
There are more details about d in the next section. Also, C(A) is defined as the number of links
going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

• PR(A): PageRank of page A.

• PR(Tn): PageRank of pages Tn, which link to page A. Each page has a notion of
its own self-importance. That's PR(T1) for the first page in the Web all the way
up to PR(Tn) for the last page.
• C(Tn): Number of outbound links on page Tn. Each page spreads its vote out evenly
among all of its outgoing links. The count, or number, of outgoing links for page
1 is C(T1), C(Tn) for page n, and so on for all pages.
• PR(Tn)/C(Tn): If page A has a backlink from page Tn, the share of the
vote page A will get is PR(Tn)/C(Tn).
• All these fractions of votes are added together but, to stop the other pages having too
much influence, this total vote is "damped down" by multiplying it by 0.85 (the factor
d).
• (1 - d): The PageRanks form a probability distribution over web pages, so the sum of the
PageRanks of all web pages will be one.
The (1 - d) bit at the beginning is a bit of probability math magic so the sum of all
web pages' PageRanks will be one; it adds in the bit lost by the d. It also means that if a
page has no links to it (no backlinks), it will still get a small PR of 0.15 (i.e., 1 -
0.85).

Note that the PageRanks form a probability distribution over web pages, so the sum of all web
pages' PageRanks will be one. This formula calculates the PageRank for a page by summing
a percentage of the PageRank value of all pages that link to it. Therefore, backlinks from pages
with greater PageRank have more value. In addition, pages with more outbound links pass a
smaller fraction of their PageRank to each linked web page.

According to this formula, the three primary factors that impact a page's PageRank are:

 The number of pages that backlink to it

 The PageRank of the pages that backlink to it
 The number of outbound links on each of the pages that backlink to it

In the example above, web page A links to web page B and web page C, web page B links to
web page C, and web page C has no outbound links. Based upon this, we already know that A
will have the lowest PageRank and C will have the greatest PageRank. Here are the PageRank
formulas and results for the first iteration, assuming d = 0.85:

 Page A: (1 - 0.85) = 0.15

 Page B: (1 - 0.85) + (0.85) * (0.15 / 2) = 0.21375
 Page C: (1 - 0.85) + (0.85) * (0.15 / 2) + (0.85) * (0.21375 / 1) = 0.3954375
This is just the first iteration of the calculation. To get the final PageRank of each page, the
calculation must be repeated until the average PageRank for all pages is 1.0.
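The following sketch iterates the PageRank formula for this three-page example (A links to B and C, B links to C, C has no outbound links) with d = 0.85; run long enough, the values settle at PR(A) = 0.15, PR(B) = 0.21375, and PR(C) = 0.3954375, matching the per-page values computed above.

def pagerank(out_links, d=0.85, iterations=50):
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}                   # arbitrary starting values
    for _ in range(iterations):
        new = {}
        for p in pages:
            incoming = [q for q in pages if p in out_links[q]]
            # PR(p) = (1-d) + d * sum of PR(q)/C(q) over pages q linking to p
            new[p] = (1 - d) + d * sum(pr[q] / len(out_links[q]) for q in incoming)
        pr = new
    return pr

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": []}))
# A keeps the minimum 0.15; C, with the most (and best) backlinks, ranks highest.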
