Web scraping is the automated process of extracting information from the internet using scripts or programs. It has various applications, including product comparison, review analysis, and data tracking. The document outlines the general process of web scraping using Python, including tools like BeautifulSoup, and emphasizes the importance of legality and permissions when scraping data.
Web-Scraping-With-Python
Web Scraping with Python
By Zachary King

What is Web Scraping?
Web scraping is the process of using a script or computer program to retrieve information from the Internet. The process is usually automatic but can involve manual input if desired.

Purpose of Web Scraping
➢ Web scraping makes it easy to retrieve exactly what you need from a webpage.
➢ No tedious manual searching of pages, long or short.
➢ Supply data to statistical programs for research, testing, tracking, etc.
➢ Automate common visits to the web.

Applications
➢ Scrape product pages from retailer or manufacturer websites to show on your own website or to provide spec/price comparisons
➢ Scrape product reviews from retailers to detect fraudulent reviews
➢ Scrape news websites for analysis, often to provide better-targeted news to an audience
➢ Scrape sports pages for stat tracking on individual teams or players
➢ Scrape your Facebook news feed (or other social media) for your own Facebook application!

General Process
1. Fetch a web page
2. Download web page content (optional)
3. Parse data (HTML)
4. Apply parsed data (your usage)

Using Python
Some packages:
-bs4 (BeautifulSoup4)**
-urllib2 (for Python 2)
-urllib (for Python 3)**
-requests (for Python 3)
-urllib.request (Python 3)**

Go Fetch!
The simplest scraper gets the HTML content of a web page and outputs it.

Specific Searches
With BeautifulSoup, create a “soup” object that allows for easy searching within the contents of the web page.

*More Specific Searches
Use multiple “soups” to search specific parts of the web page.

Child Elements
One way to retrieve all the child elements of a given tag is the .children attribute of BeautifulSoup objects.

Extending your Scraper
I have my scraped data, now what?
➢ Graphs/charts for visual representation
➢ Output to a file
➢ Store in an organized manner (data structures)
➢ Reformat into a new web page

What Now?
➢ Bear in mind the legality of web scraping (it’s a blurry line).
➢ Always get the green light from the owner of the site (preferably recorded/signed) before scraping their data.
➢ Check out the docs for BeautifulSoup at http://www.crummy.com/software/BeautifulSoup/bs4/doc/
➢ Take a refresher with the bs4 beginner article at http://www.pythonforbeginners.com/python-on-the-web/beautifulsoup-4-python/

Questions?
You can download all of my example files from this presentation, as well as my more complete Python web scraping files, from my GitHub at https://github.com/zach-king/Python-Web-Scraping
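
Example Sketch
The fetch, search, and child-element steps described above might be sketched as follows. This is a minimal illustration and not the presenter's original code: SAMPLE_HTML and the function names fetch_html, link_texts, product_links, and child_tag_names are hypothetical, and it assumes Python 3 with the bs4 package installed. The searches run on an inline sample page so nothing is scraped without permission.

```python
import urllib.request

from bs4 import BeautifulSoup, Tag

# Hypothetical sample page so the sketch runs without network access.
SAMPLE_HTML = """
<html>
  <head><title>Sample Store</title></head>
  <body>
    <div id="products">
      <ul>
        <li class="item"><a href="/widget">Widget</a></li>
        <li class="item"><a href="/gadget">Gadget</a></li>
      </ul>
    </div>
    <div id="footer"><a href="/about">About</a></div>
  </body>
</html>
"""


def fetch_html(url):
    """Go Fetch: open a URL and return its content as text."""
    with urllib.request.urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def link_texts(html):
    """Specific search: the text of every <a> tag on the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text() for a in soup.find_all("a")]


def product_links(html):
    """More specific search: narrow to one <div>, then search inside it.

    Assumption: the tag returned by find() is searched like a smaller
    soup, instead of building a second BeautifulSoup object.
    """
    soup = BeautifulSoup(html, "html.parser")
    products = soup.find("div", id="products")
    return [a["href"] for a in products.find_all("a")]


def child_tag_names(html):
    """Child elements: names of the direct child tags of the first <ul>."""
    soup = BeautifulSoup(html, "html.parser")
    ul = soup.find("ul")
    # .children also yields whitespace strings, so keep only real tags.
    return [child.name for child in ul.children if isinstance(child, Tag)]


if __name__ == "__main__":
    print(link_texts(SAMPLE_HTML))
    print(product_links(SAMPLE_HTML))
    print(child_tag_names(SAMPLE_HTML))
```

To run the same searches on a live page, pass the result of fetch_html(url) in place of SAMPLE_HTML, using a URL whose owner has given permission.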