Basic PHP Web Scraping Script Tutorial - Oooff
Alright, I'm sure you're saying to yourself, OK, I have all this
data (web page, file data, it's all the same to us) but I really
want to extract some very specific data out of it. Does that
sound like what you're looking for? Well, what we'll do is a
basic PHP web scrape just like in the first tutorial, but this time
we're going to pull some data out of it. For our example,
what we'd like to do is find out how many pages of our site are
indexed by MSN and just return that scraped number. Sound
like something useful? Hopefully this is going to give you the
very basics of parsing out data. So let's go!
Script Explanation -
Ok here goes with the basic explanation...
Line 2.
$data = file_get_contents('http://search.msn.com/results.aspx?q=site%3Afroogle.com');
Now if you studied up on the first tutorial you'll know that
we're pulling data from MSN search using the
file_get_contents command and assigning the data to the $data
variable.
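One thing worth knowing: file_get_contents() hands back false when it can't read what you asked for, so it pays to check before you start parsing. Here's a minimal sketch of that check. The fetch_source() helper is my own made-up name, not something from the first tutorial, and the demo reads a temp file (file data, same thing to us) so it runs without a network connection.

```php
<?php
// Hypothetical helper (not from the tutorial): read a URL or file path
// and fail loudly instead of carrying on with a false value.
function fetch_source(string $location): string
{
    // @ suppresses the PHP warning; we check the return value ourselves.
    $data = @file_get_contents($location);
    if ($data === false) {
        throw new RuntimeException("Could not read: $location");
    }
    return $data;
}

// Demo with a local temp file so it runs without a network connection.
$path = tempnam(sys_get_temp_dir(), 'scrape');
file_put_contents($path, '<h5>Page 1 of 9,138 results</h5>');
$data = fetch_source($path);
unlink($path);

echo strlen($data), " bytes read\n";
```

Swap the temp file path for a url and the same check applies.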
However we're also passing some data in the url to get the
specific page from MSN that we want to scrape. If you already
know about passing variables in the url you can go to Line 3.
You might be asking, what is all that stuff after the MSN url?
I'm sure you've seen it a lot of times but might not have been sure
what it was. Basically, all that stuff is just like passing
a variable in a PHP script, but you're doing it through a url. Let's
take a peek at the url we're using here to get a better
understanding. Our url, if you don't remember, is
"http://search.msn.com/results.aspx?q=site%3Afroogle.com".
Let's break it into two parts, split on the question mark. Why,
you ask? That's where the url ends and the data being passed
begins. With it separated we have:
http://search.msn.com/results.aspx
and
q=site%3Afroogle.com
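In that second part, q is the variable name and site%3Afroogle.com is its value, where %3A is just the url-encoded version of a colon. If you want to see the split for yourself, here's a small sketch using PHP's built-in parse_url(), parse_str() and urlencode(). The $site variable is only there for illustration.

```php
<?php
$url = 'http://search.msn.com/results.aspx?q=site%3Afroogle.com';

// Everything after the question mark is the query string.
$query = parse_url($url, PHP_URL_QUERY);   // "q=site%3Afroogle.com"

// parse_str() splits it into name => value pairs and decodes %3A to ":".
parse_str($query, $params);
echo $params['q'], "\n";                   // site:froogle.com

// Going the other way: urlencode() turns the colon back into %3A.
$site = 'froogle.com';  // swap in your own domain
$rebuilt = 'http://search.msn.com/results.aspx?q=' . urlencode("site:$site");
echo $rebuilt, "\n";
```

Building the url with urlencode() means you can drop in any domain without hand-encoding it.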
Line 3.
$regex = '/Page 1 of (.+?) results/';
First things first when we're scraping a page we're scraping the
source code of the page, so that's always what we're going to
want to be looking at when we're picking out what we want to
grab. If you don't know this already, you'd better learn it or you're
probably going to be lost. Go to view source in your browser, then search for what
you're looking to pull out. Here's a chunk of the source code
we're going to pull our value out of.
<div id="search_header"><h1>site:froogle.com</h1><h5>Page 1
of 9,138 results</h5> <b>
Now that we have the data we want to get the result from,
we can get into the meat of the parsing. I know to most of you
regex is a big scary thing with all those crazy symbols and
patterns. And well, if you want to be a regex master, yes, it's
pretty daunting. But don't let all those funny chars scare you,
'cause there's a real simple way to use regex. The regex gurus
and preachers will mock you and say you're bastardizing it, but
I say whatever works.
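To make the Line 3 pattern concrete: it's some known text before the value ("Page 1 of "), a (.+?) that captures whatever is in between, and some known text after it (" results"). The question mark matters. Here's a rough sketch of the difference it makes, using the chunk of source from above with an extra " results" tacked on the end (that extra bit is made up for the demo).

```php
<?php
// The chunk of source from above, with an extra " results" added at the
// end (that extra text is made up, to show the difference).
$data = '<h5>Page 1 of 9,138 results</h5> <b>More results</b>';

// Lazy: (.+?) stops at the first " results" it can.
preg_match('/Page 1 of (.+?) results/', $data, $lazy);
echo $lazy[1], "\n";    // 9,138

// Greedy: (.+) runs on to the last " results" in the data.
preg_match('/Page 1 of (.+) results/', $data, $greedy);
echo $greedy[1], "\n";  // 9,138 results</h5> <b>More
```

So for this kind of scraping, stick with (.+?) and you'll grab the smallest chunk between your two pieces of text.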
Pretty easy huh? Yeah, I thought so. The only other thing to
note in this is the forward slashes in the '/stuff/';
that's a regex thing. Just know that in PHP you always need to
wrap what you want regex to match in delimiters, and forward
slashes are the usual choice.
Of course I can talk about regex all day and type 1000 pages
on it. But for now I'm trying to keep it super simple.
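If you find yourself writing that same before/after pattern over and over, you could wrap it in a little function. This is only a sketch under my own assumptions: extract_between() is a made-up name, and it uses # instead of forward slashes as the delimiter so the surrounding text can contain slashes (handy for HTML closing tags). preg_quote() is a real PHP function that escapes any regex symbols in the text you pass it.

```php
<?php
// Hypothetical helper, not part of the tutorial's script:
// grab whatever sits between two pieces of known text.
function extract_between(string $data, string $before, string $after): ?string
{
    // preg_quote() escapes regex symbols in the anchors; passing '#'
    // as the delimiter escapes that too, so anchors may contain slashes.
    $regex = '#' . preg_quote($before, '#') . '(.+?)' . preg_quote($after, '#') . '#';

    return preg_match($regex, $data, $match) === 1 ? $match[1] : null;
}

$data = '<div id="search_header"><h1>site:froogle.com</h1><h5>Page 1 of 9,138 results</h5> <b>';

echo extract_between($data, 'Page 1 of ', ' results'), "\n";  // 9,138
echo extract_between($data, '<h1>', '</h1>'), "\n";           // site:froogle.com
```

It returns null when nothing matches, so you can tell "not found" apart from an empty result.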
Line 4.
preg_match($regex,$data,$match);
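Ah, a new function's in town, preg_match(). preg_match() is
the PHP function to call regex for a single match. It takes the
pattern, the data to search, and a variable ($match here) that it
fills with whatever it found. So anytime we want to match one
thing in our data we're going to call the parsing function
preg_match().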
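preg_match() also returns a value of its own: 1 if it found a match, 0 if it didn't, and false if something went wrong. Pages change their markup all the time, so it's worth checking that before you trust $match. A quick sketch, with two made-up inputs (one that has our text and one that doesn't):

```php
<?php
$regex = '/Page 1 of (.+?) results/';

// Two made-up inputs: one with the text we expect, one without.
$pages = [
    'good' => '<h5>Page 1 of 9,138 results</h5>',
    'bad'  => '<h5>No results found</h5>',
];

$found = [];
foreach ($pages as $name => $data) {
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    if (preg_match($regex, $data, $match) === 1) {
        $found[$name] = $match[1];
    } else {
        $found[$name] = null;  // nothing to read from $match here
    }
}

var_dump($found);
```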
Line 5.
var_dump($match);
The function var_dump() is your best friend as a programmer.
It says: whatever is in this variable or array, dump it out onto
the screen so I can see what's happening. So this line will
output this onto the screen:
array(2) {
[0]=>
string(23) "Page 1 of 9,138 results"
[1]=>
string(5) "9,138"
}
Line 6.
echo $match[1];
What's with the new notation? If you hadn't already guessed,
that's how we access the cars in our train. We know that if we have
an array and what we want is in car 1, we access it by
'referencing' that car, which is what the [1] means. We want to
output only what's in the second cell (car 0 holds the whole
match) because we don't want the anchors included. This will
output to our screen:
9,138
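Keep in mind that what you scraped is a string with a comma in it, not a number. If you want to do math with it or compare it later, one simple way (just a sketch, and it assumes the page uses commas as thousands separators) is to strip the comma and cast to an integer:

```php
<?php
$match = [0 => 'Page 1 of 9,138 results', 1 => '9,138'];  // as dumped above

// Strip the thousands separator, then cast to an integer.
$count = (int) str_replace(',', '', $match[1]);

var_dump($count);            // int(9138)
echo $count + 1, "\n";       // 9139
```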
2. See if you can get the title of any web page. Hint: your
anchors are going to be <title> and </title>.
Conclusion -
You can make some pretty cool tools with just the two very
basic things I've shared with you so far: pulling data from
somewhere using the file_get_contents() function and parsing it
with the preg_match() function. Have fun with it and I'll see you
on the next data scraping tutorial.