
Data Structure and Files

SCE

Name             Roll No.   GR No.
Siddhant Jain    221027     21910811
Anish Kataria    221034     21911105
Anjali More      221082     22020114
Shraddha Mulay   221083     22020260
Web Crawler
A web crawler, sometimes called a spider or spiderbot, is an
Internet bot that systematically browses the World Wide Web,
typically operated by search engines for the purpose of Web
indexing.

Basic Crawler Operation

1. Begin with known "seed" pages
2. Fetch and parse them
3. Extract the URLs they point to
4. Add the extracted URLs to an ArrayList
5. Fetch each URL from the ArrayList and repeat
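The five steps above can be sketched as a simple loop. This is a minimal, self-contained illustration: the link graph is a hardcoded map standing in for real fetch-and-parse results, and the names (CrawlSketch, LINKS, frontier) are ours, not from the slides.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class CrawlSketch {
    // Hypothetical link graph: in a real crawler these edges would come
    // from fetching each page and extracting its <a href> links.
    static final Map<String, List<String>> LINKS = Map.of(
        "seed", List.of("a", "b"),
        "a",    List.of("b", "c"),
        "b",    List.of("seed"),
        "c",    List.of());

    // Steps 1-5: start from the seeds, "fetch" each URL, extract its links,
    // queue the unseen ones, and repeat until nothing is left to fetch.
    static List<String> crawl(List<String> seeds) {
        List<String> visited = new ArrayList<>();
        Queue<String> frontier = new ArrayDeque<>(seeds);   // URLs still to fetch
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (visited.contains(url)) continue;            // skip already-crawled pages
            visited.add(url);                               // "fetch and parse" step
            for (String next : LINKS.getOrDefault(url, List.of())) {
                if (!visited.contains(next)) frontier.add(next); // step 4: queue new URLs
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl(List.of("seed"))); // [seed, a, b, c]
    }
}
```

Using a queue here gives breadth-first order; the recursive jsoup crawler shown later in these slides explores depth-first instead, but both follow the same fetch-extract-repeat cycle.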
How Google's Crawler Works

• Google uses software known as web crawlers to discover
publicly available webpages. Its most well-known crawler is
called Googlebot.
• Crawlers look at webpages, follow the links on those pages,
and go from link to link, bringing data about those webpages
back to Google's servers.
Code

import java.io.IOException;
import java.util.ArrayList;
import java.util.Scanner;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Crawler {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        System.out.print("Enter a website: ");
        String url = sc.nextLine();  // e.g. "https://en.wikipedia.org/"
        crawl(1, url, new ArrayList<String>());
        sc.close();
    }

    // Recursively follow links up to 5 levels deep, skipping URLs
    // that have already been visited.
    private static void crawl(int level, String url, ArrayList<String> visited) {
        if (level <= 5) {
            Document doc = request(url, visited);
            if (doc != null) {
                for (Element link : doc.select("a[href]")) {
                    String nextLink = link.absUrl("href");
                    if (!visited.contains(nextLink)) {
                        // Pass level + 1: with level++ the original value is
                        // passed on, so the depth limit is never reached.
                        crawl(level + 1, nextLink, visited);
                    }
                }
            }
        }
    }

    // Fetch a page; on HTTP 200, print its URL and title, record it
    // as visited, and return the parsed document.
    private static Document request(String url, ArrayList<String> visited) {
        try {
            Connection con = Jsoup.connect(url);
            Document doc = con.get();
            if (con.response().statusCode() == 200) {
                System.out.println("Link: " + url);
                System.out.println(doc.title());
                visited.add(url);
                return doc;
            }
            return null;
        } catch (IOException e) {
            return null;
        }
    }
}
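The code above depends on the jsoup HTML parser, which is not part of the JDK. One common way to pull it in is via Maven; the coordinates below are jsoup's real ones, but the version shown is only an example and should be replaced with a current release.

```xml
<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <!-- example version; check jsoup.org for the latest release -->
  <version>1.17.2</version>
</dependency>
```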
Output

[Screenshot of the crawler's console output: visited links and their page titles]
Time Complexity

Function   Lines of Code   Time Complexity
main       6               O(1)
crawl      15              O(5) = O(1)
request    16              O(1+1+1) = O(3) = O(1)

(These are per-call bounds; the overall running time grows with the
number of pages actually fetched.)

Thank you
