Perl Project: Siddhant Sanjeev 337/CO/11 Siddharth Saluja 338/CO/11

This document describes a Perl project to mine code from webpages related to a search query. It uses the Bing search API to retrieve URLs, then extracts C/C++ code snippets from the pages through regex pattern matching. The code is organized and output to an HTML file. A GUI allows users to enter queries and see results. Key aspects include using various Perl libraries, building the regex patterns, removing JavaScript, and outputting the filtered code snippets.

Uploaded by

sansid12

PERL PROJECT

Submitted by:-
SIDDHANT SANJEEV
337/CO/11
SIDDHARTH SALUJA
338/CO/11





Aim
To mine code snippets from the web that relate to a given topic.
To give programmers faster access to code that answers their query.
To reduce reliance on Google by offering our application as an alternative.










Technical procedure
Created a Windows Live account to use the Bing Search API.
The API was given a query and returned the relevant URLs.
The results arrived in JSON format; the URL and description of each webpage were extracted and stored.
Each URL was fetched and the page was matched against a regex to extract all the C/C++ code on it.
The regex matches declarations that begin with int, double, void, and long.
It also matches conditionals (if, else) and loops such as for, while, and do-while.
The extracted code was written in order to an HTML file called out.html.
To make input and output interactive, we built a GUI with Perl's Tk library: the query is entered in a text box and the user is notified when the search completes.
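The storage step above can be sketched as follows. This is a minimal illustration, not the project's actual output routine: html_escape and render_results are hypothetical helper names, and the sketch assumes each snippet should be escaped and wrapped in a pre element so that characters such as < and & in C/C++ code display literally in out.html.

```perl
use strict;
use warnings;

# Escape the characters that HTML would otherwise interpret as markup.
sub html_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first so later entities stay intact
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build a complete HTML page with one <pre> block per snippet.
# (Hypothetical helper; the report writes its page with heredocs instead.)
sub render_results {
    my (@snippets) = @_;
    my $body = join '', map { '<pre>' . html_escape($_) . "</pre>\n" } @snippets;
    return "<html>\n<head><title>CodeFINDER</title></head>\n"
         . "<body>\n<h1>RESULTS</h1>\n$body</body>\n</html>\n";
}

print render_results('if (a < b && c > d) { return a; }');
```

The returned string could then be printed to the out.html file handle in place of the raw snippet text.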


Header Code
use LWP::UserAgent;
use Data::Dumper;
use JSON;
use Text::Balanced qw(extract_codeblock);
use Tk;
Header code explanation
LWP::UserAgent is a class implementing a web user agent. LWP::UserAgent objects are used to dispatch web requests.
Data::Dumper converts Perl data structures into readable text, which is useful for printing them on the screen while debugging.
The JSON library decodes the JSON data returned by the web service into Perl data structures. Alternatively, XML could be used.
Text::Balanced supplies extract_codeblock, which extracts a block of text enclosed in balanced delimiters such as { }.
Tk is Perl's graphics library, used to add text boxes and dialog boxes that make the program interactive.
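As a rough illustration of how the JSON and Data::Dumper libraries work together, the sketch below decodes a small response and dumps the extracted links. Several things here are assumptions: the sample JSON is invented for the example, the field name Url is assumed (the report only shows DisplayUrl and Description), extract_links is a hypothetical helper, and the core module JSON::PP stands in for the JSON module so the example runs on a stock Perl.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);    # core stand-in for the JSON module's from_json
use Data::Dumper;

# Invented response body, shaped like the d/results layout used in this report.
my $sample_json = <<'END_JSON';
{"d": {"results": [
  {"Url": "http://example.com/a", "DisplayUrl": "example.com/a", "Description": "Sorting in C"},
  {"Url": "http://example.com/b", "DisplayUrl": "example.com/b", "Description": "Linked lists"}
]}}
END_JSON

# Return a list of [url, description] pairs from a JSON response body.
sub extract_links {
    my ($json_text) = @_;
    my $data    = decode_json($json_text);
    my $results = $data->{d}{results} || [];
    return map { [ $_->{Url}, $_->{Description} ] } @$results;
}

my @links = extract_links($sample_json);

# Data::Dumper prints the nested structure, which helps when inspecting a response.
$Data::Dumper::Sortkeys = 1;
print Dumper( \@links );
```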



GUI code
$mw = MainWindow->new;
$mw->title("FINDER");
$frm_name = $mw->Frame();
$lab = $frm_name->Label( -text => "Enter Query:" );
$ent = $frm_name->Entry();
$but = $mw->Button( -text => "Search", -command => \&button_handler );
$textarea = $mw->Frame(); # Creating another frame
$txt = $textarea->Text( -width => 100, -height => 10 );
$srl_y = $textarea->Scrollbar( -orient => 'v', -command => [ yview => $txt ] );
$srl_x = $textarea->Scrollbar( -orient => 'h', -command => [ xview => $txt ] );
$txt->configure(
-yscrollcommand => [ 'set', $srl_y ],
-xscrollcommand => [ 'set', $srl_x ]
);
$lab->grid( -row => 1, -column => 1 );
$ent->grid( -row => 1, -column => 2 );
$frm_name->grid( -row => 1, -column => 1, -columnspan => 2 );
$but->grid( -row => 4, -column => 1, -columnspan => 2 );
MainLoop;

GUI code explanation
MainWindow->new creates the top-level window in which we place the text boxes and dialog boxes.
$mw->title assigns a title to the window.
$lab is the label that prompts the user for the query, and $ent is the entry box in which the query is typed.
$txt is a text widget in which status messages, such as the query being searched, are shown.
$but is the button which, when clicked, invokes the button_handler function, which in turn runs the main search.








Button handler code
sub button_handler
{
$input = $ent->get();
$txt->insert( "end", "You searched for $input\n" );
$txt->update();

$accnt_key = 'hO+FNgghOI5lq3i5TILA4TFVKHBdtLsXZBFj67UaeMw';
$root = 'https://api.datamarket.azure.com/Bing/Search/v1/Web';
$query = $input;

$offset = 10;
$format = 'JSON';

$url = $root . build_args( $query, 10, $offset, $format );
$ua = LWP::UserAgent->new;
$req = HTTP::Request->new( GET => $url );
$req->authorization_basic( '', $accnt_key );
$response = $ua->request($req);
if ( !$response->is_success ) {
die 'Error connecting to BING API';
}
$json = $response->content;

$perl = from_json($json);
$next_url = $perl->{'d'}->{'__next'};
@results = @{ $perl->{'d'}->{'results'} };
open( $FH, '>', "out.html" )
or die 'Cannot open output file out.html';
print $FH <<ENDHTML;
<HTML>
<HEAD>
<TITLE>CodeFINDER</TITLE>
</HEAD>
<BODY>
<H1 align = "center"><u>RESULTS</u></H1>
ENDHTML

foreach $result (@results) {

code($result);
}
print $FH <<ENDHTML;
</body>
</html>
ENDHTML
close($FH);

$ans = $mw->messageBox(-title=>"done", -type=>"ok", -message=>"completed.", -icon=>"info");
$txt -> delete('1.0','end');
}

Button handler code explanation
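$accnt_key is the Bing Search API key.
$root is the base URL used to access the Bing API.
$query is the user's input.
The function build_args builds the query string from the query, the number of results, the offset, and the format; the result is appended to $root to form the final URL. $accnt_key is not part of the URL: it is sent as the password in HTTP basic authentication.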
sub build_args {
$q = '?Query=%27' . shift(@_) . '%27';
$c = '$top=' . shift(@_);
$o = '$skip=' . shift(@_);
$f = '$format=' . shift(@_);
return join( '&', $q, $c, $o, $f );
}
$top is the number of results to return, and $skip is the number of results to skip from the start of the result list; the handler passes 10 for both.
$format is the format in which the data is returned, i.e. JSON.
@results is an array of all the webpage results that the Bing API search returns.
Each link is then processed in turn by our subroutine code().
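The build_args function above inserts the query into the URL as typed, so spaces or characters such as & in the query would produce a malformed URL. One possible variant is sketched below; it is not the project's code. The names percent_encode and build_query_string are hypothetical, and the encoder is hand-written so the example needs no modules beyond core Perl.

```perl
use strict;
use warnings;

# Percent-encode everything except RFC 3986 unreserved characters.
sub percent_encode {
    my ($text) = @_;
    utf8::encode($text);    # work on UTF-8 bytes; $text is a local copy
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $text;
}

# Build the query string: quoted query, result count, offset, and format.
sub build_query_string {
    my ( $query, $top, $skip, $format ) = @_;
    my @pairs = (
        'Query=' . percent_encode("'$query'"),    # the API expects the query in single quotes
        '$top=' . int($top),
        '$skip=' . int($skip),
        '$format=' . percent_encode($format),
    );
    return '?' . join( '&', @pairs );
}

print build_query_string( 'binary search in c', 10, 0, 'JSON' ), "\n";
# prints ?Query=%27binary%20search%20in%20c%27&$top=10&$skip=0&$format=JSON
```

The single quotes around the query become %27, matching the literal %27 that build_args writes by hand.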









Subroutine Code
if ( $url2 =~ /\.(pdf|ppt|doc|docx)$/ )
{
next;
}
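$url2 is the URL extracted from one entry of the array @results.
If the URL ends in one of the extensions pdf, ppt, doc or docx, the link is skipped, since such files are not webpages. Because code() is called from inside the foreach loop, next leaves the subroutine and moves on to the next result.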
$disp_url = $result->{'DisplayUrl'};
$description = $result->{'Description'};
$disp_url holds the display URL of the site and $description holds its description.
Each URL is then fetched, and the page content is matched against the regex.
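The skip test can also be written as a small predicate so that the loop itself decides which results to process. The sketch below is one way to do that under stated assumptions: is_document_url is a hypothetical helper, the Url field name and the sample records are invented, and unlike the original test it ignores case and any trailing query string.

```perl
use strict;
use warnings;

# True when the URL path ends in a document extension that cannot be scanned as HTML.
sub is_document_url {
    my ($url) = @_;
    $url =~ s/[?#].*\z//s;    # ignore any query string or fragment
    return $url =~ /\.(?:pdf|ppt|doc|docx)\z/i ? 1 : 0;
}

# Invented result records; 'Url' is an assumed field name.
my @results = (
    { Url => 'http://example.com/tutorial.html' },
    { Url => 'http://example.com/slides.ppt' },
    { Url => 'http://example.com/notes.PDF' },
);

# Keep only the results that point at ordinary webpages.
my @pages = grep { !is_document_url( $_->{Url} ) } @results;
print "$_->{Url}\n" for @pages;
```

Filtering before the loop avoids calling next from inside a subroutine, which Perl allows but reports as a warning when warnings are enabled.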





Regex used
$func = '(?:int|long|double|float|void|long\ double)\s?\*?\s+?\w{1,30}';
$grp1 = '(?:if|else\ if|for|while)';

$grp2 = '(?:class|struct|typedef\ struct)\s*?\w{1,30}\s*?';
$delim = '{}';
pos($page) = 0;
$regex1 = $grp1 . $reg1;
$regex2 = $grp2;
$regex3 = $func . $reg1;

$regex = join( '|', $regex1, $regex2, $regex3);
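$grp1 handles if, else if, for, and while.
$grp2 handles classes, type definitions, and structures.
$func handles a type name (int, long, double, float, void, or long double), an optional pointer star, and an identifier of up to 30 characters, which covers the start of function definitions and declarations.
$regex is the final matching pattern: the alternation of $regex1, $regex2, and $regex3, which are built from $grp1, $grp2, and $func. $regex1 and $regex3 also append a further pattern, $reg1, whose definition is not shown here.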
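To show how a type-name pattern like $func can be paired with brace matching to pull a whole function out of a page, here is a small self-contained sketch. It is not the project's matching code: it swaps in a recursive regex for Text::Balanced's extract_codeblock, it only looks for function definitions, and the name extract_functions and the sample text are invented for the example.

```perl
use strict;
use warnings;

# A type name, a function name, a parameter list, and a brace-balanced body.
my $function_re = qr/
    (                                    # group 1: the whole definition
        \b (?: long\ double | int | long | double | float | void ) \b
        [\s*]+ \w{1,30}                  # optional pointer star, then the name
        \s* \( [^()]* \) \s*             # parameter list
        ( \{ (?: [^{}]++ | (?2) )* \} )  # group 2: body, recursing on nested braces
    )
/x;

# Return every function definition found in the given text.
sub extract_functions {
    my ($text) = @_;
    my @found;
    while ( $text =~ /$function_re/g ) {
        push @found, $1;
    }
    return @found;
}

# Invented page text: one C function surrounded by prose.
my $sample = <<'END_PAGE';
<p>Example:</p>
int add(int left, int right) {
    if (left > right) { return left; }
    return left + right;
}
Some prose about void things.
END_PAGE

my @found = extract_functions($sample);
print "$_\n" for @found;
```

The recursion on group 2 is what lets the inner if block sit inside the function body without ending the match early; the trailing prose mentions void but has no parameter list or body, so it is not matched.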


Removing Javascript
while ( $page =~ s/<script.*?>.*?<\/script>//gsi ) { }
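The <script></script> tag is used in HTML to embed scripts.
JavaScript contains braces, keywords, and loops that the C/C++ regex could match by mistake, so the substitution above deletes every script element, including everything between its opening and closing tags, before the page is matched.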
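The same clean-up can be wrapped in a function and checked on a small page. This is a minimal sketch: strip_scripts is a hypothetical name, the sample page is invented, and the pattern is slightly stricter than the one above in that it requires a word boundary after the tag name.

```perl
use strict;
use warnings;

# Remove every <script>...</script> element, including its contents.
sub strip_scripts {
    my ($html) = @_;
    # /s lets . match newlines, /i ignores tag case, /g removes all occurrences.
    $html =~ s/<script\b[^>]*>.*?<\/script\s*>//gis;
    return $html;
}

# Invented page: a script element spanning several lines between two paragraphs.
my $page = "<p>before</p><script type=\"text/javascript\">\n"
         . "var i = 0; if (i < 1) { i++; }\n"
         . "</script><p>after</p>";

print strip_scripts($page), "\n";
# prints <p>before</p><p>after</p>
```

Because the substitution already uses the /g flag, a single call removes all script elements and no enclosing while loop is needed.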












Conclusion
When our program finishes executing, the output can be seen in the out.html file.
Hence we have successfully separated the C/C++ code from the webpages.
