Basic PHP Web Scraping Script Tutorial - Oooff

Uploaded by

getit0

Very basic Parsing, on returned web data - tutorial

Alright, I'm sure you're saying to yourself, "OK, I have all this
data (web page, file data, it's all the same to us), but I really
want to extract some very specific data out of it." Does that
sound like what you're looking for? Well, what we'll do is a
basic PHP web scrape just like in the first tutorial, but we're
going to pull some data out of it. For our example, what we'd
like to do is find out how many pages of our site are indexed by
MSN and return just that scraped number. Sound like something
useful? Hopefully this is going to give you the very basics of
parsing out data. So let's go!

Most Basic Web Data Parsing Script

Whole script -
The whole script, minus the line numbers of course. Those are
just there for our reference.

1. <?php
2. $data = file_get_contents('http://search.msn.com/results.aspx?q=site%3Afroogle.com');
3. $regex = '/Page 1 of (.+?) results/';
4. preg_match($regex,$data,$match);
5. var_dump($match);
6. echo $match[1];
7. ?>

Script Explanation -
Ok here goes with the basic explanation...

Line 2.
$data = file_get_contents('http://search.msn.com/results.aspx?q=site%3Afroogle.com');
Now if you studied up on the first tutorial you'll know that
we're pulling data from MSN search using the
file_get_contents() function and assigning the data to the $data
variable.
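
One thing worth knowing about this step is that file_get_contents() returns false when the read fails, and fetching a url with it needs the allow_url_fopen setting turned on. Here's a minimal sketch of checking for that; the helper name fetch_source is made up for illustration and is not part of the tutorial's script.

```php
<?php
// Hypothetical helper, not part of the original script.
// file_get_contents() returns false when the read fails, so we
// turn that into null to make the failure easy to test for.
function fetch_source(string $location): ?string
{
    // @ hides the PHP warning; we inspect the return value instead.
    $data = @file_get_contents($location);
    return $data === false ? null : $data;
}
```

You'd call it the same way as line 2 of the script, for example $data = fetch_source($url);, and bail out if $data comes back as null.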

However we're also passing some data in the url to get the
specific page from MSN that we want to scrape. If you already
know about passing variables in the url you can go to Line 3.

You might be asking, what is all that stuff after the MSN url?
I'm sure you've seen it a lot of times but might not have been
sure what it was. Basically, all that stuff is just like passing
a variable in a php script, but you're doing it through a url.
Let's take a peek at the url we're using here to get a better
understanding. Our url, if you don't remember, is
"http://search.msn.com/results.aspx?q=site%3Afroogle.com".

Let's break it into two parts, split on the question mark. Why,
you ask? That's where the url ends and the data being passed
begins. With it separated we have:

http://search.msn.com/results.aspx
and
q=site%3Afroogle.com
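
If you'd rather let PHP do the splitting, here's a small sketch using explode(). The variable names $address and $query are just picked for this example.

```php
<?php
$url = 'http://search.msn.com/results.aspx?q=site%3Afroogle.com';

// Split on the first question mark only (a limit of 2 pieces).
list($address, $query) = explode('?', $url, 2);

echo $address, "\n"; // http://search.msn.com/results.aspx
echo $query, "\n";   // q=site%3Afroogle.com
```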

Now I hope I don't need to go into an explanation on the first
part, so I'm really only going to talk about the second. I'll
also do some basic tutorials on accepting data later, so you'll
understand what happens to this url on the other side. When
you look at the second part of the url you'll always see a field
and a value for the field, although sometimes that value is
blank. How do you know which is the field and which is the
value, you ask? The field always comes before the equal sign =
and the value comes after. Basically, think of it like assigning
a value to a variable. In the data being passed by the url, our
field is "q", if you didn't already guess, and our value is
site%3Afroogle.com. The field 'q' that MSN takes stands for
query. So passing data assigned to the 'q' field is telling MSN,
"hey, look this search/query up for me."
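
To see the field and value idea in action, here's a quick sketch with PHP's parse_str(), which breaks a query string into an array of fields. The $fields name is just for this example. Notice it also decodes the value for us, so %3A comes back out as a colon.

```php
<?php
$query = 'q=site%3Afroogle.com';

// parse_str() splits the string into field => value pairs
// and decodes each value along the way.
parse_str($query, $fields);

echo $fields['q'], "\n"; // site:froogle.com
```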

The value assigned to the field 'q' is site%3Afroogle.com. The
first thing you're probably thinking is, what in the world is
that %3A? I didn't type that. Well, to keep things very simple,
certain characters such as colons, quotes and semicolons have a
special meaning in a url, so they get sent in an encoded form
instead. In this case we're converting the ':' in
site:froogle.com to an encoded value: %3A is a percent sign
followed by 3A, the hexadecimal character code for a colon (more
on that later). What we're asking with the site: command in MSN
is how many pages from site X are in your search engine.
Specifically, how many pages from froogle are indexed in MSN.
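
You don't have to do the encoding by hand. As a rough sketch, PHP's urlencode() and http_build_query() will do it for you; the variable names here are only for illustration.

```php
<?php
$search = 'site:froogle.com';

$encoded = urlencode($search);  // site%3Afroogle.com
$decoded = urldecode($encoded); // site:froogle.com

// http_build_query() encodes a whole field => value list in one go.
$query = http_build_query(['q' => $search]); // q=site%3Afroogle.com

echo $encoded, "\n", $decoded, "\n", $query, "\n";
```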

Line 3.
$regex = '/Page 1 of (.+?) results/';
First things first: when we're scraping a page we're scraping the
source code of the page, so that's always what we want to be
looking at when we're picking out what we want to grab. You'd
better know this, or you're probably lost. Go to view source in
your browser, then search for what you're looking to pull out.
Here's a chunk of the source code we're going to pull our value
out of.

<div id="search_header"><h1>site:froogle.com</h1><h5>Page 1 of 9,138 results</h5> <b>&#01

Now that we have the data we want to get the result from,
we can get into the meat of the parsing. I know to most of you
regex is a big scary thing with all those crazy symbols and
patterns. And well, if you want to be a regex master, yes, it's
pretty daunting. But don't let all those funny chars scare you,
'cause there's a real simple way to use regex. The regex gurus
and preachers will mock you and say you're bastardizing it, but
I say whatever works.

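I'm not going to go into much detail here; we're just assigning
a string to a variable in this statement. Anytime you see a
$varname = 'something here'; or $varname = "something here"; you
know it's just a value being assigned to a variable. Also note
that for plain text like this you can use either single ' or
double " quotes. The difference is that PHP expands variables
and escapes such as \n inside double quotes, so single quotes
are the safer pick for a regex.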

(.+?) is our best friend when it comes to regex. It basically
means grab everything that comes after our opening text and stop
at the first place our closing text shows up (I'll call those
pieces of text anchors too, so be prepared for me to use the
terms interchangeably). Something like this:

opening anchor text here (.+?) closing anchor text here
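
The question mark inside (.+?) is what makes it stop at the first closing anchor instead of the last one. Here's a small sketch on a made-up string (not from the MSN page) comparing it with the plain (.+) version:

```php
<?php
$html = '<b>one</b> and <b>two</b>';

// Lazy (.+?): stops at the first closing anchor it can.
preg_match('/<b>(.+?)<\/b>/', $html, $lazy);

// Greedy (.+): runs on to the last closing anchor.
preg_match('/<b>(.+)<\/b>/', $html, $greedy);

echo $lazy[1], "\n";   // one
echo $greedy[1], "\n"; // one</b> and <b>two
```

Also worth knowing: by default the dot doesn't match a line break, so both anchors need to be on the same line of the source.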

Pretty easy huh? Yeah I thought so. The only other thing to
note is the forward slashes in the '/stuff/'; that's a regex
thing. Just know that in PHP the pattern always has to sit
between a pair of delimiters, and forward slashes are the usual
choice.
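
One catch with forward slashes: if the text you're matching has a / in it, you have to escape it with a backslash. PHP also lets you wrap the pattern in another character such as # instead. A quick sketch on a made-up sentence (example.com is just a placeholder):

```php
<?php
$text = 'Visit http://example.com/about today';

// With / around the pattern, every / inside must be escaped as \/.
preg_match('/http:\/\/example\.com\/(.+?) /', $text, $a);

// With # around the pattern, the slashes can stay as they are.
preg_match('#http://example\.com/(.+?) #', $text, $b);

echo $a[1], "\n"; // about
echo $b[1], "\n"; // about
```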

Of course I can talk about regex all day and type 1000 pages
on it. But for now I'm trying to keep it super simple.

Line 4.
preg_match($regex,$data,$match);
Ah, a new function's in town: preg_match(). preg_match() is
the PHP function that runs a regex for a single match. So anytime
we want to match one thing in our data, we're going to call the
parsing function preg_match().

With preg_match() we're doing something called passing data to
the function for it to work on. In this case we're passing
$regex, $data, $match. We know what both $regex (the parsing
string we just made) and $data (the scraped page from MSN) are,
but what is the $match variable? It's just the variable that our
parsed data is going to be returned in. In plain English we're
saying: take $data and apply the filter $regex to it, then dump
whatever comes through that filter into $match. Make sense?
I sure hope you said yep, that's easy.
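
preg_match() also hands back a number telling you whether it found anything: 1 for a match, 0 for no match (or false if something went wrong). Here's a minimal sketch of checking that before using $match. The two sample strings stand in for the scraped page, and the names $hit, $miss and $none are made up for the example.

```php
<?php
$regex = '/Page 1 of (.+?) results/';

// One string that contains our anchors, and one that doesn't.
$hit  = preg_match($regex, '<h5>Page 1 of 9,138 results</h5>', $match);
$miss = preg_match($regex, '<h5>No results found</h5>', $none);

if ($hit === 1) {
    echo $match[1], "\n"; // 9,138
}
if ($miss === 0) {
    echo "nothing matched\n"; // $none is an empty array here
}
```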

Line 5.
var_dump($match);
The function var_dump() is your best friend as a programmer.
It says: whatever is in this variable or array, dump it out onto
the screen so I can see what's happening. So this line will
output this onto the screen:

array(2) {
[0]=>
string(23) "Page 1 of 9,138 results"
[1]=>
string(5) "9,138"
}

Array? What's that? Well, this is as good a time as any to
introduce what an array is. They're extremely useful tools for
you to know. So let's back up a little. We know that a variable
is something that holds 1 thing, right? Well, an array is just
like a variable except it holds multiple things. I like to think
of it like this. Stop and imagine a train for a second. It has
all these cars on it that hold things, right? Well, a variable is
a single car and can only hold a single thing, whereas an array
is like a train that has multiple cars to hold things. In the
output above we have a two (2) cell array, which is just like a
2 car train. In car 0 we have the string 'Page 1 of 9,138
results' and in car 1 we have the string '9,138', which is the
result we want, right? You might be asking why preg_match()
returns an array rather than just a simple string. It does this
to give you two options on how to match things. You'll notice
car/cell 0 has the anchors included as well as the matched text,
whereas car 1 only has the text inside the anchors.
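
The train gets one more car for every set of parentheses in the regex. As a small sketch (this two-group pattern is my own variation, not the one from the script), here's what happens if we also capture the page number:

```php
<?php
$data = '<h5>Page 1 of 9,138 results</h5>';

// Two sets of parentheses means two captured cars after car 0.
preg_match('/Page (\d+) of (.+?) results/', $data, $match);

echo count($match), "\n"; // 3
echo $match[0], "\n";     // Page 1 of 9,138 results
echo $match[1], "\n";     // 1
echo $match[2], "\n";     // 9,138
```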

Line 6.
echo $match[1];
What's with the new notation? If you hadn't already guessed,
that's how we access the cars in our train. If we have an array
and what we want is in car 1, we access it by 'referencing' that
car, which is what the [1] means. We want to output only what's
in the second cell because we don't want the anchors included.
This will output to our screen:

9,138

Which is exactly what we aimed to do.
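
Keep in mind that what we scraped is a string, comma and all. If you want to do math with it, here's one way you might turn it into a real number. This goes a step past the script above, and the variable names are just for illustration.

```php
<?php
$scraped = '9,138';

// Strip the thousands separator, then cast to an integer.
$count = (int) str_replace(',', '', $scraped);

echo $count + 1, "\n"; // 9139
```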

Other things to try -

Some fun stuff to try using our new skills.
1. Use the link: command in MSN and see if you can get the
number of links for a domain. Don't forget that : = %3A

2. See if you can get the title of any web page. Hint:
anchors are going to be <title> and </title>.
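
If you get stuck on number 2, here's one possible sketch. It runs on a made-up chunk of page source rather than a live page, and the # delimiters and the i and s flags are my own picks, not something from the script above.

```php
<?php
// Sample page source; in practice this would come from file_get_contents().
$data = "<html><head>\n<title>My Example Page</title>\n</head></html>";

// i = ignore case (so <TITLE> matches too), s = let . match line breaks.
preg_match('#<title>(.+?)</title>#is', $data, $match);

echo $match[1] ?? 'no title found', "\n"; // My Example Page
```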

Conclusion -
You can make some pretty cool tools with just the two very
basic things I've shared with you so far: pulling data from
somewhere using the file_get_contents() function, and parsing
that data with the preg_match() function. Have fun with it and
I'll see you on the next data scraping tutorial.

Next: Parsing Multiple Items from A Data Source

Copyright Me Bitches! - Web 1.0 Style - Represent
