Creating A Web Crawler in 3 Steps: Issac Goldstand Mirimar Networks

The document describes creating a web crawler in 3 steps: 1) Creating a user agent using LWP::RobotUA to interact with websites politely, 2) Creating a content parser using HTML::Parser to extract links and data from pages, and 3) Combining the user agent and parsers into a program that crawls a queue of URLs, extracts links and author metadata, and adds new links to the queue. The full code example shows implementing each step to create a basic crawling bot that reports the number of pages for each detected author.


Creating a Web Crawler in 3 Steps

Issac Goldstand
[email protected]
Mirimar Networks
http://www.mirimar.net/
The 3 steps
• Creating the User Agent
• Creating the content parser
• Tying it together
Step 1 – Creating the User Agent
• libwww-perl (LWP)
• An OO interface for creating user agents that interact with remote websites and web applications
• We will look at LWP::RobotUA
Creating the LWP Object
• User agent
• Cookie jar
• Timeout
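For reference, a minimal sketch (not from the original slides) of setting these three options on a plain LWP::UserAgent; the agent string, cookie file name and timeout value are illustrative placeholders:

use LWP::UserAgent;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new(
    agent   => 'MyBot/1.0',        # User agent string sent to servers
    timeout => 30,                 # Give up on a request after 30 seconds
);
$ua->cookie_jar(HTTP::Cookies->new(
    file     => 'mybot_cookies.dat',   # Keep cookies between runs
    autosave => 1,
));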
Robot UA extras
• Robot rules
• Delay
• use_sleep
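The implementation below sets the delay and use_sleep options; robot rules can also be supplied explicitly. A hedged sketch, assuming WWW::RobotRules::AnyDBM_File is used so fetched robots.txt rules persist between runs (the cache file name and e-mail address are placeholders):

use LWP::RobotUA;
use WWW::RobotRules::AnyDBM_File;

# Keep robots.txt rules in a DBM file so they survive between runs
my $rules = WWW::RobotRules::AnyDBM_File->new('MyBot/1.0', 'mybot_rules');

my $ua = LWP::RobotUA->new('MyBot/1.0', 'bot-owner@example.com', $rules);
$ua->delay(1);        # Wait at least 1 minute between requests to the same server
$ua->use_sleep(1);    # Sleep instead of failing when a request comes too soon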
Implementation of Step 1
use LWP::RobotUA;

# First, create the user agent - MyBot/1.0
my $ua = LWP::RobotUA->new('MyBot/1.0',
                           '[email protected]');

$ua->delay(15/60);    # 15 second delay (delay is given in minutes)
$ua->use_sleep(1);    # Sleep if delayed
Step 2 – Creating the content parser
• HTML::Parser
• Event-driven parser mechanism
• OO and function oriented interfaces
• Hooks to functions at certain points
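A minimal sketch (not from the original slides) of the event-driven handler interface: a function is hooked to start tags and called for each one. The handler name and the sample HTML are illustrative.

use HTML::Parser;

# Print every <a href="..."> target as it is encountered
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&start_tag, 'tagname, attr' ],   # Hook for start tags
);

sub start_tag {
    my ($tagname, $attr) = @_;
    print "$attr->{href}\n" if $tagname eq 'a' && $attr->{href};
}

$parser->parse('<a href="http://example.com/">example</a>');
$parser->eof;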
Subclassing HTML::Parser
• Biggest issue is non-persistence
• CGI authors may be used to this, but it still makes for many caveats
• You must implement your own state-preservation mechanism
Implementation of Step 2
package My::LinkParser;               # Parser class
use base qw(HTML::Parser);

use constant START    => 0;           # Define simple constants
use constant GOT_NAME => 1;

sub state {                           # Simple access methods
    return $_[0]->{STATE};
}

sub author {
    return $_[0]->{AUTHOR};
}
Implementation of Step 2 (cont)
sub reset {                           # Clear parser state
    my $self = shift;
    undef $self->{AUTHOR};
    $self->{STATE} = START;
    return 0;
}

sub start {                           # Parser hook
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
        $self->{STATE} = GOT_NAME;
        $self->{AUTHOR} = $attr->{content};
    }
}
Shortcut – HTML::SimpleLinkExtor
• Simple package to extract links from HTML
• Handles many link types – we only want HREF-type links
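A short sketch of the shortcut, assuming $html holds a page that has already been fetched:

use HTML::SimpleLinkExtor;

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse($html);

my @all_links = $extor->links;   # Every link the module recognizes
my @hrefs     = $extor->a;       # Only the HREF links from <a> tags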
Step 3 – Tying it together
• Simple application
• Instantiate objects
• Enter request loop
• Spit data to somewhere
• Add parsed links to queue
Implementation of Step 3
for (my $i = 0; $i < 10; $i++) {              # Parse loop
    my $response = $ua->get(pop @urls);       # Get HTTP response
    if ($response->is_success) {              # If response is OK
        $p->reset;
        $p->parse($response->content);        # Parse for author
        $p->eof;
        if ($p->state == 1) {                 # If state is GOT_NAME
            $authors{$p->author}++;           # then add author count
        } else {
            $authors{'Not Specified'}++;      # otherwise add default count
        }
        $linkex->parse($response->content);   # Parse for links
        unshift @urls, $linkex->a;            # and add links to the queue
    }
}
End result
#!/usr/bin/perl
use strict;
use LWP::RobotUA;
use HTML::Parser;
use HTML::SimpleLinkExtor;

my @urls;        # List of URLs to visit
my %authors;     # Count of pages per detected author

# First, create & set up the user agent
my $ua = LWP::RobotUA->new('AuthorBot/1.0', '[email protected]');
$ua->delay(15/60);    # 15 second delay (delay is given in minutes)
$ua->use_sleep(1);    # Sleep if delayed

my $p = My::LinkParser->new;                 # Create parsers
my $linkex = HTML::SimpleLinkExtor->new;

$urls[0] = "http://www.beamartyr.net/";      # Initialize list of URLs


End result
for (my $i = 0; $i < 10; $i++) {              # Parse loop
    my $response = $ua->get(pop @urls);       # Get HTTP response
    if ($response->is_success) {              # If response is OK
        $p->reset;
        $p->parse($response->content);        # Parse for author
        $p->eof;
        if ($p->state == 1) {                 # If state is GOT_NAME
            $authors{$p->author}++;           # then add author count
        } else {
            $authors{'Not Specified'}++;      # otherwise add default count
        }
        $linkex->parse($response->content);   # Parse for links
        unshift @urls, $linkex->a;            # and add links to the queue
    }
}
print "Results:\n"; # Print results
map {print "$_\t$authors{$_}\n"} keys %authors;
End result
package My::LinkParser;               # Parser class
use base qw(HTML::Parser);

use constant START    => 0;           # Define simple constants
use constant GOT_NAME => 1;

sub state {                           # Simple access methods
    return $_[0]->{STATE};
}

sub author {
    return $_[0]->{AUTHOR};
}

sub reset {                           # Clear parser state
    my $self = shift;
    undef $self->{AUTHOR};
    $self->{STATE} = START;
    return 0;
}
End result
sub start {                           # Parser hook
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq "meta" && lc($attr->{name}) eq "author") {
        $self->{STATE} = GOT_NAME;
        $self->{AUTHOR} = $attr->{content};
    }
}
What’s missing?
• Full URLs for relative links
• Non-HTTP links
• Queues & caches
• Persistent storage
• Link (and data) validation
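A hedged sketch of how the first three gaps might be narrowed with the core URI module; $response, $linkex and @urls are the variables from the full program above, while %seen is an illustrative addition:

use URI;

# Turn extracted links into absolute URLs, keep only HTTP(S),
# and avoid revisiting pages we have already queued
my %seen;
my $base = $response->base;                    # Base URL of the fetched page
for my $link ($linkex->a) {
    my $uri = URI->new_abs($link, $base);      # Resolve relative links
    next unless $uri->scheme && $uri->scheme =~ /^https?$/;   # Skip mailto:, ftp:, ...
    next if $seen{$uri->canonical}++;          # Simple in-memory cache
    push @urls, $uri->as_string;
}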
In review
• Create a robot user agent to crawl websites nicely
• Create parsers to extract data from sites, and links to the next sites
• Create a simple program to process a queue of URLs
Thank you!

For more information:


Issac Goldstand
[email protected]
http://www.beamartyr.net/
http://www.mirimar.net/
