Scraping data in network traffic using Python

Last Updated : 13 Jul, 2021

In this article, we will learn how to scrape data from network traffic using Python.

Modules Needed

- selenium: Selenium is a portable framework for controlling web browsers.
- time: This module provides various time-related functions.
- json: This module is required to work with JSON data.
- browsermobproxy: This module helps us get the HAR file from network traffic.

There are two ways in which we can scrape network traffic data.

Method 1: Using selenium's get_log() method

To start, download and extract the Chrome webdriver from here according to the version of your Chrome browser, and copy the executable path.

Approach:

1. Import DesiredCapabilities from the selenium module and enable performance logging.
2. Start the Chrome webdriver with executable_path, the default chrome options (or add some arguments to them), and the modified desired capabilities.
3. Send a GET request to the website using driver.get() and wait a few seconds for the page to load.

Syntax: driver.get(url)

4. Get the performance logs using driver.get_log() and store them in a variable.

Syntax: driver.get_log("performance")

5. Iterate over every log and parse it using json.loads() to filter all the Network-related logs.
6. Write the filtered logs to a JSON file by converting them to a JSON string using json.dumps().

Example:

```python
# Import the required modules
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
import json

# Main Function
if __name__ == "__main__":

    # Enable Performance Logging of Chrome (copy the class-level
    # dict so the shared default is not mutated).
    desired_capabilities = DesiredCapabilities.CHROME.copy()
    desired_capabilities["goog:loggingPrefs"] = {"performance": "ALL"}

    # Create the webdriver object and pass the arguments
    options = webdriver.ChromeOptions()

    # Chrome will start in Headless mode
    options.add_argument('--headless')

    # Ignores any certificate errors if there are any
    options.add_argument("--ignore-certificate-errors")

    # Startup the chrome webdriver with executable path and
    # pass the chrome options and desired capabilities as
    # parameters.
    driver = webdriver.Chrome(executable_path="C:/chromedriver.exe",
                              chrome_options=options,
                              desired_capabilities=desired_capabilities)

    # Send a request to the website and let it load
    driver.get("https://www.geeksforgeeks.org/")

    # Sleeps for 10 seconds
    time.sleep(10)

    # Gets all the logs from performance in Chrome
    logs = driver.get_log("performance")

    # Opens a writable JSON file and writes the logs in it
    with open("network_log.json", "w", encoding="utf-8") as f:
        f.write("[")

        # Iterates over every log and parses it using JSON
        for log in logs:
            network_log = json.loads(log["message"])["message"]

            # Checks if the current 'method' key has any
            # Network related value.
            if ("Network.response" in network_log["method"]
                    or "Network.request" in network_log["method"]
                    or "Network.webSocket" in network_log["method"]):

                # Writes the network log to the JSON file by
                # converting the dictionary to a JSON string
                # using json.dumps().
                f.write(json.dumps(network_log) + ",")

        # An empty object closes the trailing comma left by the loop.
        f.write("{}]")

    print("Quitting Selenium WebDriver")
    driver.quit()

    # Read the JSON file and parse it using
    # json.loads() to find the URLs containing images.
    json_file_path = "network_log.json"
    with open(json_file_path, "r", encoding="utf-8") as f:
        logs = json.loads(f.read())

    # Iterate over the logs
    for log in logs:

        # The except block is reached if any of the
        # following keys are missing.
        try:
            # The URL is present inside the following keys
            url = log["params"]["request"]["url"]

            # Checks if the extension is .png or .jpg
            if url.endswith(".png") or url.endswith(".jpg"):
                print(url, end='\n\n')
        except Exception:
            pass
```

Output: The image URLs are highlighted above; network_log.json contains the image URLs.

Method 2: Using browsermobproxy to capture the HAR file from the network tab of the browser

For this, the following requirements need to be satisfied:

- Download and install Java v8 from here.
- Download and extract browsermobproxy from here and copy the path of the bin folder.
- Install browsermob-proxy with pip, using this command in the terminal: pip install browsermob-proxy
- Download and extract the Chrome webdriver from here, according to the version of your Chrome browser, and copy the executable path.

Approach:

1. Import the Server module from browsermobproxy, start the Server with the copied bin folder path, and set the port as 8090.
2. Call the create_proxy method of the Server to create the proxy object, setting the "trustAllServers" parameter to true.
3. Start the Chrome webdriver with executable_path and the chrome options discussed in the code below.
4. Create a new HAR file using the proxy object with the domain of the website.
5. Send a GET request using driver.get() and wait a few seconds for the page to load properly.

Syntax: driver.get(url)

6. Write the network traffic captured by the proxy object to a HAR file by converting it to a JSON string using json.dumps().
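A HAR file written this way is plain JSON: a top-level "log" object whose "entries" list holds one record per request. The sketch below builds a synthetic, minimal HAR (the URLs are made up for illustration) and extracts image URLs from it the same way the example that follows does with the real capture:

```python
import json

# A synthetic, minimal HAR document. Real HAR files produced by the
# proxy have the same top-level shape, with many more fields per entry.
har = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.com/logo.png"}},
            {"request": {"url": "https://example.com/index.html"}},
        ]
    }
}

# Serialize as the example does with proxy.har ...
har_text = json.dumps(har)

# ... then read it back, walking log -> entries -> request -> url.
image_urls = [
    entry["request"]["url"]
    for entry in json.loads(har_text)["log"]["entries"]
    if entry["request"]["url"].endswith((".png", ".jpg"))
]
print(image_urls)  # ['https://example.com/logo.png']
```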
Example:

```python
# Import the required modules
from selenium import webdriver
from browsermobproxy import Server
import time
import json

# Main Function
if __name__ == "__main__":

    # Enter the path of the bin folder obtained by
    # extracting browsermob-proxy-2.1.4-bin
    path_to_browsermobproxy = "C:\\browsermob-proxy-2.1.4\\bin\\"

    # Start the server with the path and port 8090
    server = Server(path_to_browsermobproxy + "browsermob-proxy",
                    options={'port': 8090})
    server.start()

    # Create the proxy with the following parameter set to true
    proxy = server.create_proxy(params={"trustAllServers": "true"})

    # Create the webdriver object and pass the arguments
    options = webdriver.ChromeOptions()

    # Chrome will start in Headless mode
    options.add_argument('--headless')

    # Ignores any certificate errors if there are any
    options.add_argument("--ignore-certificate-errors")

    # Setting up the proxy for Chrome
    options.add_argument("--proxy-server={0}".format(proxy.proxy))

    # Startup the chrome webdriver with executable path and
    # the chrome options as parameters.
    driver = webdriver.Chrome(executable_path="C:/chromedriver.exe",
                              chrome_options=options)

    # Create a new HAR file for the following domain
    # using the proxy.
    proxy.new_har("geeksforgeeks.org/")

    # Send a request to the website and let it load
    driver.get("https://www.geeksforgeeks.org/")

    # Sleeps for 10 seconds
    time.sleep(10)

    # Write the capture to a HAR file.
    with open("network_log1.har", "w", encoding="utf-8") as f:
        f.write(json.dumps(proxy.har))

    print("Quitting Selenium WebDriver")
    driver.quit()

    # Read the HAR file and parse it using JSON
    # to find the URLs containing images.
    har_file_path = "network_log1.har"
    with open(har_file_path, "r", encoding="utf-8") as f:
        logs = json.loads(f.read())

    # Store the network logs from the 'entries' key and
    # iterate over them
    network_logs = logs['log']['entries']
    for log in network_logs:

        # The except block is reached if any of the
        # following keys are missing
        try:
            # The URL is present inside the following keys
            url = log['request']['url']

            # Checks if the extension is .png or .jpg
            if url.endswith('.png') or url.endswith('.jpg'):
                print(url, end="\n\n")
        except Exception:
            pass
```

Output: The image URLs are highlighted above; network_log1.har contains the image URLs.

Author: anilabhadatta
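Both methods identify images by checking the end of the URL string, which misses .jpeg files and any URL that carries a query string (for example logo.png?v=2). A small alternative, suggested here as an improvement rather than part of the original code, checks only the URL's path using the standard library's urllib.parse:

```python
from urllib.parse import urlparse

def is_image_url(url, extensions=(".png", ".jpg", ".jpeg")):
    """Return True if the URL's path (query string ignored)
    ends with one of the given image extensions."""
    return urlparse(url).path.lower().endswith(extensions)

print(is_image_url("https://example.com/logo.png"))      # True
print(is_image_url("https://example.com/logo.png?v=2"))  # True
print(is_image_url("https://example.com/page.html"))     # False
```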