Lecture 1 - On Internet


AcSIR Course: Computer Applications and Informatics

October 2021

Elizabeth Jacob
Chief Scientist
CSIR-NIIST
The Internet Past to Present

• Short History
• The Web
• Web Information Retrieval by Search Engines
• Search Engine Optimization
• Case Study Google
• About other Search Engines
History
Internet
• 1962: Joseph Licklider of MIT proposed the
earliest ideas of global networking. He had a
vision of the Internet as we see it today.
• The dream was realized in 1969. Funded by the
US Department of Defense, the first internet was called
ARPANET (Advanced Research Projects Agency Network).
• Leonard Kleinrock developed the theory of packet
switching. His laboratory's host computer at UCLA
became the first ARPANET node in September
1969.
• Packet switching is a method of grouping data that
is transmitted over a digital network as packets
made of a header and a payload. Data in the header
is used by networking hardware to direct the packet
to its destination, where the payload is extracted
and used by application software.
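The header/payload idea can be illustrated with a minimal sketch. The field names (src, dst, seq, total), the 8-byte packet size and the host names are assumptions made for illustration; real protocols such as IP define their own header layouts.

```python
import random
from dataclasses import dataclass


@dataclass
class Packet:
    # Header fields (hypothetical names, for illustration only)
    src: str        # sender address
    dst: str        # destination address, used to direct the packet
    seq: int        # position of this piece within the message
    total: int      # number of packets that make up the message
    # Payload: the piece of data carried by this packet
    payload: bytes


def packetize(message: bytes, src: str, dst: str, size: int = 8) -> list[Packet]:
    """Split a message into packets of at most `size` payload bytes."""
    chunks = [message[i:i + size] for i in range(0, len(message), size)]
    return [Packet(src, dst, seq, len(chunks), chunk)
            for seq, chunk in enumerate(chunks)]


def reassemble(packets: list[Packet]) -> bytes:
    """Put packets back in order at the destination and extract the payloads."""
    if not packets:
        return b""
    ordered = sorted(packets, key=lambda p: p.seq)
    if len(ordered) != ordered[0].total:
        raise ValueError("some packets are missing")
    return b"".join(p.payload for p in ordered)


message = b"LOGIN request from UCLA to SRI"
packets = packetize(message, src="ucla-host", dst="sri-host")
random.shuffle(packets)          # packets may arrive in any order
received = reassemble(packets)   # the sequence numbers restore the order
```

Because each packet carries its own addressing and sequence information, the pieces can travel independently and still be put back together at the destination.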
History of Internet
• University of California at Los Angeles, University
of California at Santa Barbara, University of Utah,
and Stanford Research Institute were linked
together in the first ever truly wide-area network.
• The system crashed as they typed the G in LOGIN.
• The Internet consists of a complex network of
computers connected by high-speed
communication technologies (wired and wireless).
• The term “Internet” was formally defined in 1995 by
the FNC (Federal Networking Council, USA).
• E-mail was adapted for ARPANET by Ray Tomlinson
in 1972. He picked the @ symbol to link the
username and the host address.
• TELNET (TELetype NETwork) is a protocol used by
terminal emulation programs that allow you to log
into a remote host computer. It sends plain text
with no encryption.
• 1973: FTP (File Transfer Protocol) transfers files
between computers over a network.
• 1983: ARPANET adopts TCP/IP, with 200 routers to direct
the traffic.
• 1984: NSF funds a TCP/IP-based backbone
network. This backbone grows into the
NSFNET, which becomes the successor of the
ARPANET.

• In 1989 Tim Berners-Lee invents the concept of
hypertext systems that can run across the
Internet independent of a computer’s operating
system. (This is the idea of a browser and the WWW.)
• 1995: NSF stops funding NSFNET. The Internet
becomes completely commercial.
– Search engines Yahoo and AltaVista appear
– Microsoft releases Internet Explorer
• 1997
– AOL Instant Messenger (AIM) released, changing
the way people communicate over the Internet
• 1998
– Google arrives, with a new kind of search
mechanism using ranking rather than categories.
• The Internet Society, through bodies such as the
Internet Engineering Task Force (IETF), oversees the
rules, known as protocols, for communication over the
Internet.
• As of January 2021 there were 4.66 billion active
internet users worldwide - 59.5 percent of the
global population. Of this total, 92.6 percent
(4.32 billion) accessed the internet via mobile
devices.
• In 2020, India had over 749 million internet users
across the country. This figure is projected to
grow to over 1.5 billion users by 2040
The Present - Surge towards universal wireless access
• Travellers search for Wi-Fi hotspots to connect
their gadgets. City-wide access, WiMAX, 4G and 5G
will battle for dominance.
• Responsive web design: small devices like
smartphones, tablets and GPS devices want to
tap into the web.
• The Internet of Things is adding devices:
refrigerators, personal robots, VR headsets,
cameras.
Global/Galactic Information Infrastructure
• As the Internet grows and becomes accessible to
non-technical communities, social networking and
services are boosting sites like Facebook,
Twitter, LinkedIn, YouTube and Instagram.
• The Internet is driving businesses.
• Protecting privacy and preventing data breaches
are challenges for cybersecurity.
Internet and the World Wide Web
• The Internet is a huge network of computers all
connected together.
• The World Wide Web is a global collection of
documents and other resources, linked by hyperlinks
and URIs.
• The World Wide Web (WWW) is defined as a
system of interlinked hypertext documents
accessed via the Internet.
• Anyone who has an Internet connection can see
web pages, which combine media such as text,
images and videos. The 1989 proposal of
Tim Berners-Lee, with Robert Cailliau,
was to use hypertext to integrate information
into a web of nodes that users can browse.
Web 1.0
• Described by Tim Berners-Lee as the “read-only”
web. It is the first generation of the WWW and
lasted from 1989 to 2005. Internet users
were only reading information presented to
them.
• The primary aim of the websites was to make
information public for anyone, and set up an
online presence.
• The focus was on content delivery rather than
interaction and production.
Web 2.0
• Web 2.0 (2000-2010 and continuing) is described
as the people-centric, participative, read-write
web. Unlike Web 1.0, Web 2.0 gives more
control to users and is also called the social web.
• It facilitates interaction between web users and sites,
which in turn allows users to communicate with
other users.
• Examples of Web 2.0 applications are Facebook, YouTube,
Flickr and Twitter.
Web 3.0
• Web 3.0 was suggested by John Markoff as a new
kind of web in 2006. It is defined as the semantic web
and includes integration, automation, discovery, and
machine-based understanding of data.
• It encourages mobility and globalization. Web 3.0,
the Semantic Web or intelligent web, is the era (2010
onwards) that refers to the future of the web. In this
era computers can interpret information like humans
via Artificial Intelligence and Machine Learning.
• Examples of Web 3.0 are Apple’s Siri and Wolfram
Alpha.
Information Retrieval from the Web
The Search Process
Search Engines to Surf the Web
• YouTube is not simply a website; it is a search
engine. YouTube's user-friendliness, combined
with the soaring popularity of video content,
makes it the 2nd largest search engine with 3
billion searches per month.
• It aims to find the most relevant videos and
channels according to what people type in the
search box.
• The videos are ranked on how well the title,
descriptions and the video match the query and
which videos have had the most watch time.
• Facebook has acquired WhatsApp, the messaging
service, and Instagram, the photo-sharing app.
Facebook allows search engines like Google to
index your profile and publicly available
information.
• Twitter is a social networking and microblogging service.
Search through words and hashtags to find what
you're looking for. You can search a date range to
get old tweets.
How does a Search Engine work?
Information Retrieval by Search Engines
Web Crawlers
• Search engine crawlers, also called spiders (why?),
robots or just bots, are programs or scripts that
systematically and automatically browse pages
on the web.
• They gather information from across hundreds of
billions of webpages and organize it in the Search
index.
• The number of Internet pages is extremely large;
even the largest crawlers fall short of making a
complete index.
The process of crawling
• Web pages are connected to each other by
hyperlinks (what are these links ?)
• The spider follows hyperlinks from page to page,
aiming to visit every page it can reach.
• Each URL is fetched and parsed. Each time the spider visits
a page, it adds information about it to a database
called the Search Index.
• As the Net has over a billion websites, crawlers work
in advance so that results come back fast when we search.
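The crawling process above can be sketched on a toy, in-memory "web" so that no network access is needed. The page names, texts and the TOY_WEB structure are hypothetical; a real crawler would fetch pages over HTTP and parse the HTML for links.

```python
from collections import deque

# Hypothetical miniature web: URL -> (page text, outgoing hyperlinks)
TOY_WEB = {
    "a.example/home": ("welcome to the home page",
                       ["a.example/about", "b.example/news"]),
    "a.example/about": ("about this site", ["a.example/home"]),
    "b.example/news": ("internet news today",
                       ["a.example/home", "b.example/archive"]),
    "b.example/archive": ("old news archive", []),
    "c.example/island": ("no page links here", []),
}


def crawl(seed, web):
    """Breadth-first crawl: follow hyperlinks outward from the seed page."""
    search_index = {}          # URL -> words found on that page
    frontier = deque([seed])   # pages waiting to be visited
    seen = {seed}              # never queue the same page twice
    while frontier:
        url = frontier.popleft()
        text, links = web[url]            # stands in for fetch + parse
        search_index[url] = text.split()  # record what the page contains
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                frontier.append(link)
    return search_index


search_index = crawl("a.example/home", TOY_WEB)
```

In this sketch c.example/island never enters the index because no crawled page links to it, which is one reason even large crawlers fall short of a complete index.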
Indexing
• The crawler crawls the web, building the list of
documents, figuring out which words appear in
each page. Documents are then indexed.
• Indexing is the process by which search engines
organise information before a search to enable
instantaneous responses to queries.
• Search engines use an inverted index, also
known as a reverse index for fast fetching of
results.
Creating Index
• The inverted index is the list of words, and the
documents in which they appear. Words are
indexed.
• Forward indexing maps documents to words.
• Inverted indexing maps words to documents.
• In web search, you provide the list of
words (your search query), and the SE produces
the documents (search result links).
• The text of each document is first broken up into
words and sentences by tokenization.
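A minimal sketch of tokenization, forward indexing and inverted indexing follows. The three sample documents and the function names are made up for illustration, and the tokenizer is deliberately simple (lowercase runs of letters and digits); production search engines do considerably more.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Break text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Hypothetical documents, for illustration only
docs = {
    "doc1": "The Internet is a network of networks.",
    "doc2": "The Web runs on the Internet.",
    "doc3": "Search engines index the Web.",
}

# Forward index: document -> words
forward_index = {doc_id: tokenize(text) for doc_id, text in docs.items()}

# Inverted index: word -> documents containing it
inverted_index = defaultdict(set)
for doc_id, words in forward_index.items():
    for word in words:
        inverted_index[word].add(doc_id)


def search(query):
    """Return the documents that contain every word of the query."""
    word_sets = [inverted_index.get(word, set()) for word in tokenize(query)]
    if not word_sets:
        return []
    return sorted(set.intersection(*word_sets))


results = search("internet web")   # only doc2 contains both words
```

Answering a query only requires looking up each query word and intersecting the document sets, which is why the inverted index is built before any search arrives.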
Case Study
• Google was kick-started by two Stanford Univ.
students in 1998
• 'Googol' in mathematics means 10^100.
Google is a play on the term 'Googol,' which
means a number of nearly incomprehensible
size.
• Google’s website index contains billions of
pages and 100,000,000 gigabytes of data
• From noun to verb: “to google” now means to search the web
• Larry Page and Sergey Brin developed
PageRank at Stanford University in 1998 as part
of a SE research project. Eponymous?
• Three years prior, in 1995, Bradley Love, an undergrad in
Brown’s Cognitive and Linguistic Sciences
program, and Steven Sloman
published an identical algorithm to PageRank, the
centrality algorithm.
Page Rank Algorithm
• Finds the importance of a page to estimate
how good a website is.
• The PageRank of a page is determined by the other
pages that link to it.
• An inbound link to page B increases
B’s rank.
• An outbound link from B increases
the rank of another page, C.
Page Rank Algorithm
• We assume page A has pages T1…Tn which point to it (i.e.,
are citations). The parameter d is a damping factor which
can be set between 0 and 1. We usually set d to 0.85.
• Also C(Ti) is defined as the number of links going out of
page Ti. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))


• Iterate till the PRs converge.
• PageRanks, once normalized so that they sum to 1, form a
probability distribution over web pages.
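Worked example with d = 0.85 and initial values PR(A) = PR(B) = PR(C) = 1.
The formulas below imply the links A → B, A → C, B → C and C → A,
so C(A) = 2 and C(B) = C(C) = 1.

Iteration   A   B       C
0           1   1       1
1           1   0.575   1.06375

Iteration 1:
PR(A)=(1-d)+d[PR(C)/C(C)]
= (1-0.85)+0.85[1/1]
= 1
PR(B)=(1-d)+d[PR(A)/C(A)]
= (1-0.85)+0.85[1/2]
= 0.575
PR(C)=(1-d)+d[PR(A)/C(A)+PR(B)/C(B)]
= (1-0.85)+0.85[1/2 + 0.575/1]
= 1.06375
(PR(C) uses the value of PR(B) just computed.)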
Iteration   A          B          C
0           1          1          1
1           1          0.575      1.06375
2           1.054187   0.598029   1.106354

Iteration 2:
PR(A)=(1-d)+d[PR(C)/C(C)]
= (1-0.85)+0.85[1.06375/1]
= 1.054187
PR(B)=(1-d)+d[PR(A)/C(A)]
= (1-0.85)+0.85[1.054187/2]
= 0.598029
PR(C)=(1-d)+d[PR(A)/C(A)+PR(B)/C(B)]
= (1-0.85)+0.85[1.054187/2 + 0.598029/1]
= 1.106354
https://dnschecker.org/pagerank.php
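The iteration can also be run as a short program. This is a minimal sketch, assuming the three-page graph implied by the worked example (A links to B and C, B links to C, C links to A) and updating pages in the order A, B, C so that each new value is used as soon as it is computed. The function name and the dictionary representation are choices made for this example.

```python
def pagerank_in_place(links, d=0.85, iterations=2):
    """Apply PR(A) = (1-d) + d * sum(PR(T)/C(T)) page by page.

    links maps each page to the list of pages it links to.
    Returns the PageRank values after every iteration (index 0 = start).
    """
    pr = {page: 1.0 for page in links}                      # initial values
    out_count = {page: len(targets) for page, targets in links.items()}
    inbound = {page: [src for src, targets in links.items() if page in targets]
               for page in links}
    history = [dict(pr)]
    for _ in range(iterations):
        for page in links:                                  # order: A, B, C
            incoming = sum(pr[src] / out_count[src] for src in inbound[page])
            pr[page] = (1 - d) + d * incoming
        history.append(dict(pr))
    return history


# Graph assumed from the worked example: A -> B, A -> C, B -> C, C -> A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
history = pagerank_in_place(links, d=0.85, iterations=2)

for step, values in enumerate(history):
    print(step, {page: round(value, 6) for page, value in values.items()})
# Iteration 1 gives A = 1.0, B = 0.575, C = 1.06375
```

Raising the iterations argument lets you watch the values settle, which is the "iterate till the PRs converge" step.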
Exercise to calculate Page Rank
[Figure: link graphs for the exercise; only the node label A survives in the extracted text]
Refining Google Search
• Search for an “exact phrase”: good for referencing
• Boolean search with the AND and OR operators, e.g. solar AND system
• To exclude a term from the search, use -term
• To search for a phrase with missing words, use * in their place
• Reverse image search finds the origin of a
special image. Go to images.google.com, click the
camera icon and upload an image from your computer, OR
right-click an image hosted online, copy its URL and
paste it in the search field to get similar matches.
Choose the Search by image option for an exact match.
• Search within a single website: term site:URL
e.g. RTI site:niist.res.in
• Search for similar websites: related:URL
e.g. related:myntra.com
• Search for a filetype
e.g. big data filetype:pdf
• Leave no space between the operator and its value: site:bbc.com, not site: bbc.com
• Search with Startpage.com to get Google results while
protecting privacy: no tracking of IP address,
personal info or cookies, and SSL encryption
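The operators above are just text typed into the search box, so a refined query can be assembled as a string. The sketch below uses hypothetical helper names (build_query, search_url), and the https://www.google.com/search?q=... address form is an assumption based on how the results page normally appears in a browser.

```python
from urllib.parse import urlencode


def build_query(terms="", exact=None, exclude=(), site=None,
                filetype=None, related=None):
    """Combine plain terms with search operators into one query string."""
    parts = []
    if terms:
        parts.append(terms)
    if exact:
        parts.append(f'"{exact}"')               # exact phrase
    parts.extend(f"-{word}" for word in exclude)  # excluded terms
    if site:
        parts.append(f"site:{site}")             # no space after the colon
    if filetype:
        parts.append(f"filetype:{filetype}")
    if related:
        parts.append(f"related:{related}")
    return " ".join(parts)


def search_url(query):
    """URL-encode the query so that it can be opened in a browser."""
    return "https://www.google.com/search?" + urlencode({"q": query})


q1 = build_query("RTI", site="niist.res.in")
q2 = build_query("big data", filetype="pdf")
q3 = build_query(exact="internet of things", exclude=["refrigerators"])
url = search_url(q1)
```

Here q1 and q2 reproduce the two examples from the list, and q3 combines an exact phrase with an excluded term.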
Search Engine Optimization
• SEO, or “Search Engine Optimization”, is the process of
improving your site to increase its visibility when people
search for products, services or information related to
their work.
• Organic search, also known as natural search, refers to
unpaid search results.
• In contrast to paid search results
(pay-per-click advertising), which are populated via
an auction system, organic search results are based on
relevance to the user's search query.
• Unlike paid search ads, you can’t pay search engines to
get higher organic search rankings.
7 Simple Steps for SEO
• Know Your Keywords.
• Write High Quality Content (Naturally)
• Use Keywords in Your Website Page URLs.
• Don't Overlook Page Titles.
• Review Every Page for Additional Keyword Placement.
• Improve User Experience.
• Hire an Expert.
SEO Tools: Google Search Console, Semrush,
BuzzStream, DreamHost SEO Toolkit, Moz Pro, Linkody
Search Engines of 2020
• Google
• Microsoft Bing
• Yahoo
• Baidu
• Yandex
• DuckDuckGo
• Ask.com
• Ecosia
• Aol.com
• Internet Archive
Study and compare these SEs.
AI-powered Engines

• Semantic Scholar is an AI-backed search
engine for academic publications developed at
the Allen Institute for AI and publicly released in
November 2015. It uses advanced NLP to provide
summaries for scholarly papers.
• It provides a one-sentence summary of scientific
literature, with useful options to narrow down a search by
field of study and date range, filter by journal, author
and news, and sort by recency, relevance, citation count
and landmark papers.
• It covers 200 million papers.
• MATSCHOLAR is an AI-based search
engine for information extraction from materials
science literature.
• This website uses NLP to power search. It was
created as part of a research effort at Lawrence
Berkeley National Laboratory.
• It provides a one-sentence summary, with options to filter
a search by material, properties, applications, sample
descriptors, synthesis method and characterization
method.
https://matscholar.com
Vulnerability of Internet
• Human error
• Hardware and software failure
• Communication disruption due to natural
phenomena
Shortly after 9 pm IST on Oct 4th, 2021, Facebook’s
services, including WhatsApp and Instagram, went down. Why?
Configuration changes were made on the backbone routers that
coordinate network traffic between the company’s
data centers. Facebook’s machines stopped
communicating with each other because of a DNS
error.
