Scraping data in network traffic using Python
Last Updated: 13 Jul, 2021
In this article, we will learn how to scrape data from network traffic using Python.
Modules Needed
- selenium: Selenium is a portable framework for automating web browsers.
- time: This module provides various time-related functions.
- json: This module is required to work with JSON data.
- browsermobproxy: This module helps us capture the HAR file from network traffic.
There are two ways to scrape network traffic data.
Method 1: Using selenium's get_log() method
To begin, download and extract the Chrome WebDriver matching the version of your Chrome browser and copy the path of the executable.
Approach:
- Import DesiredCapabilities from the selenium module and enable performance logging.
- Start the Chrome WebDriver with the executable_path, the default chrome-options (or add some arguments to it), and the modified desired_capabilities.
- Send a GET request to the website using driver.get() and wait a few seconds for the page to load.
Syntax:
driver.get(url)
- Get the performance logs using driver.get_log() and store them in a variable.
Syntax:
driver.get_log("performance")
- Iterate over every log and parse it using json.loads() to filter all the network-related logs.
- Write the filtered logs to a JSON file by converting them to a JSON string using json.dumps().
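The parse-and-filter step above can be sketched in isolation. The entries below are hypothetical stand-ins for what driver.get_log("performance") returns: each entry's "message" field is a JSON string wrapping a nested "message" object whose "method" names a Chrome DevTools event.

```python
import json

# Hypothetical stand-ins for entries returned by
# driver.get_log("performance").
raw_logs = [
    {"message": json.dumps({"message": {
        "method": "Network.requestWillBeSent",
        "params": {"request": {"url": "https://example.com/logo.png"}}}})},
    {"message": json.dumps({"message": {
        "method": "Page.loadEventFired", "params": {}}})},
]

network_events = []
for entry in raw_logs:
    # Each "message" value is itself a JSON string.
    event = json.loads(entry["message"])["message"]
    # Keep only network-related DevTools events.
    if event["method"].startswith(("Network.request",
                                   "Network.response",
                                   "Network.webSocket")):
        network_events.append(event)

print(len(network_events))  # only the request event survives: 1
```

Only the Network.requestWillBeSent event passes the filter; page-lifecycle events such as Page.loadEventFired are discarded.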
Example:
Python3
# Import the required modules
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
import json

# Main Function
if __name__ == "__main__":

    # Enable Performance Logging of Chrome.
    desired_capabilities = DesiredCapabilities.CHROME
    desired_capabilities["goog:loggingPrefs"] = {"performance": "ALL"}

    # Create the webdriver object and pass the arguments
    options = webdriver.ChromeOptions()

    # Chrome will start in Headless mode
    options.add_argument('headless')

    # Ignores any certificate errors if there are any
    options.add_argument("--ignore-certificate-errors")

    # Start the Chrome WebDriver with the executable path and
    # pass the chrome options and desired capabilities as
    # parameters.
    driver = webdriver.Chrome(executable_path="C:/chromedriver.exe",
                              chrome_options=options,
                              desired_capabilities=desired_capabilities)

    # Send a request to the website and let it load
    driver.get("https://www.geeksforgeeks.org/")

    # Sleep for 10 seconds
    time.sleep(10)

    # Get all the performance logs from Chrome
    logs = driver.get_log("performance")

    # Open a writable JSON file and write the logs to it
    with open("network_log.json", "w", encoding="utf-8") as f:
        f.write("[")

        # Iterate over every log and parse it using JSON
        for log in logs:
            network_log = json.loads(log["message"])["message"]

            # Check if the current 'method' key has any
            # network-related value.
            if ("Network.response" in network_log["method"]
                    or "Network.request" in network_log["method"]
                    or "Network.webSocket" in network_log["method"]):

                # Write the network log to the JSON file by
                # converting the dictionary to a JSON string
                # using json.dumps().
                f.write(json.dumps(network_log) + ",")

        # The empty object keeps the JSON valid despite
        # the trailing comma.
        f.write("{}]")

    print("Quitting Selenium WebDriver")
    driver.quit()

    # Read the JSON file and parse it using
    # json.loads() to find the URLs containing images.
    json_file_path = "network_log.json"
    with open(json_file_path, "r", encoding="utf-8") as f:
        logs = json.loads(f.read())

    # Iterate over the logs
    for log in logs:

        # The except block is reached if any of the
        # following keys are missing.
        try:
            # The URL is present inside the following keys
            url = log["params"]["request"]["url"]

            # Check if the extension is .png or .jpg
            if url[len(url)-4:] == ".png" or url[len(url)-4:] == ".jpg":
                print(url, end='\n\n')
        except Exception:
            pass
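The slice-based extension check above only matches URLs that end exactly with the extension; a URL such as logo.png?v=2 would be missed. A small alternative using the standard library, shown here as a sketch with example.com URLs as illustrative inputs:

```python
from urllib.parse import urlparse
import os

def is_image_url(url):
    """Return True if the URL path ends in .png or .jpg,
    ignoring any query string or fragment."""
    path = urlparse(url).path
    return os.path.splitext(path)[1].lower() in (".png", ".jpg")

print(is_image_url("https://example.com/logo.png?v=2"))  # True
print(is_image_url("https://example.com/index.html"))    # False
```

Because the check runs on the parsed path only, query strings and fragments no longer hide image URLs, and the .lower() call also catches uppercase extensions like .PNG.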
Output:
(Screenshot: the image URLs are highlighted in the console output.)
(Screenshot: network_log.json containing the image URLs.)
Method 2: Using browsermobproxy to capture the HAR file from the network tab of the browser
For this, the following requirements need to be satisfied.
- Download and install Java v8.
- Download and extract browsermobproxy and copy the path of its bin folder.
- Install browsermob-proxy using pip with the following terminal command:
pip install browsermob-proxy
- Download and extract the Chrome WebDriver according to the version of your Chrome browser and copy the executable path.
Approach:
- Import the Server module from browsermobproxy and start the server with the copied bin folder path, setting the port to 8090.
- Call the create_proxy method on the Server object to create the proxy, setting the "trustAllServers" parameter to true.
- Start the Chrome WebDriver with the executable_path and the chrome-options discussed in the code below.
- Now, create a new HAR file using the proxy object with the domain of the website.
- Send a GET request using driver.get() and wait a few seconds for the page to load properly.
Syntax:
driver.get(url)
- Write the network traffic captured by the proxy object to a HAR file by converting it to a JSON string using json.dumps().
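A HAR file is just JSON following the HAR 1.2 layout: a top-level "log" object whose "entries" list pairs each request with its response. The dict below is a minimal, hypothetical stand-in for what json.dumps(proxy.har) serializes, to show the shape the parsing code relies on.

```python
# A minimal, hypothetical HAR-shaped dict (HAR 1.2 layout),
# mimicking what json.dumps(proxy.har) writes to disk.
har = {"log": {"entries": [
    {"request": {"url": "https://example.com/"},
     "response": {"status": 200}},
    {"request": {"url": "https://example.com/app.js"},
     "response": {"status": 404}},
]}}

# Each entry pairs one request with its response.
for entry in har["log"]["entries"]:
    print(entry["request"]["url"], entry["response"]["status"])
```

Real HAR entries carry many more fields (timings, headers, content), but the request/response pairing above is all the URL-extraction code in the example needs.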
Example:
Python3
# Import the required modules
from selenium import webdriver
from browsermobproxy import Server
import time
import json

# Main Function
if __name__ == "__main__":

    # Enter the path of the bin folder obtained by
    # extracting browsermob-proxy-2.1.4-bin
    path_to_browsermobproxy = "C:\\browsermob-proxy-2.1.4\\bin\\"

    # Start the server with the path and port 8090
    server = Server(path_to_browsermobproxy
                    + "browsermob-proxy", options={'port': 8090})
    server.start()

    # Create the proxy with the following parameter as true
    proxy = server.create_proxy(params={"trustAllServers": "true"})

    # Create the webdriver object and pass the arguments
    options = webdriver.ChromeOptions()

    # Chrome will start in Headless mode
    options.add_argument('headless')

    # Ignores any certificate errors if there are any
    options.add_argument("--ignore-certificate-errors")

    # Set up the proxy for Chrome
    options.add_argument("--proxy-server={0}".format(proxy.proxy))

    # Start the Chrome WebDriver with the executable path and
    # the chrome options as parameters.
    driver = webdriver.Chrome(executable_path="C:/chromedriver.exe",
                              chrome_options=options)

    # Create a new HAR file for the following domain
    # using the proxy.
    proxy.new_har("geeksforgeeks.org/")

    # Send a request to the website and let it load
    driver.get("https://www.geeksforgeeks.org/")

    # Sleep for 10 seconds
    time.sleep(10)

    # Write the captured traffic to a HAR file.
    with open("network_log1.har", "w", encoding="utf-8") as f:
        f.write(json.dumps(proxy.har))

    print("Quitting Selenium WebDriver")
    driver.quit()

    # Stop the proxy server once the capture is done.
    server.stop()

    # Read the HAR file and parse it using JSON
    # to find the URLs containing images.
    har_file_path = "network_log1.har"
    with open(har_file_path, "r", encoding="utf-8") as f:
        logs = json.loads(f.read())

    # Store the network logs from the 'entries' key and
    # iterate over them
    network_logs = logs['log']['entries']
    for log in network_logs:

        # The except block is reached if any of the
        # following keys are missing
        try:
            # The URL is present inside the following keys
            url = log['request']['url']

            # Check if the extension is .png or .jpg
            if url[len(url)-4:] == '.png' or url[len(url)-4:] == '.jpg':
                print(url, end="\n\n")
        except Exception:
            pass
Output:
(Screenshot: the image URLs are highlighted in the console output.)
(Screenshot: network_log1.har containing the image URLs.)
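Once the HAR is on disk, it can be analyzed beyond URL extraction. As a sketch, the snippet below totals the decoded body sizes of image responses using the standard HAR 1.2 response.content fields (mimeType, size); the dict is a hypothetical stand-in for a parsed HAR file.

```python
# A minimal, hypothetical HAR-shaped dict; real files written by
# json.dumps(proxy.har) follow the same HAR 1.2 layout.
har = {"log": {"entries": [
    {"request": {"url": "https://example.com/a.png"},
     "response": {"content": {"mimeType": "image/png", "size": 2048}}},
    {"request": {"url": "https://example.com/"},
     "response": {"content": {"mimeType": "text/html", "size": 512}}},
]}}

# Total the decoded body sizes of image responses only,
# identified by their MIME type rather than the URL extension.
image_bytes = sum(
    e["response"]["content"].get("size", 0)
    for e in har["log"]["entries"]
    if e["response"]["content"].get("mimeType", "").startswith("image/"))

print(image_bytes)  # 2048
```

Filtering on the MIME type also catches images served from extension-less URLs, which the URL-based check in the example above would miss.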