getting data
The problem!
Getting data from:
* web data: JSON, XML, HTML
* or database servers: MySQL, Postgres, MongoDB
RESTful Web Services
Example: Epicurious search
http://www.epicurious.com
HTTP GET
new url: http://www.epicurious.com/search/Tofu%20Chili
Example: NYTIMES login
https://myaccount.nytimes.com/auth/login
HTTP POST
new url: http://www.nytimes.com/
Example: Google Geocoding API
https://maps.googleapis.com/maps/api/geocode/json?address=Columbia_University,_New_York,_NY
base url: https://maps.googleapis.com/maps/api/geocode/
What we need
The ability to send HTTP GET and POST requests from a python program and process the responses: the requests library
http://docs.python-requests.org/en/master/
using requests
* Import the library
import requests
* Construct the url
url = "http://www.epicurious.com/search/Tofu+Chili"
* Send the request and get a response
response = requests.get(url)
* Check if the request was successful
if response.status_code == 200:
    print("SUCCESS")
else:
    print("FAILURE")
response status codes
* 200 or 201
the request-response cycle worked as planned
* Other 2xx codes
the request-response cycle worked, but there is additional information associated with the response
* 4xx codes
there was an error (page not found, malformed request, etc.)
* General rule of thumb
check that the status code is 200 when accessing data through a GET or POST HTTP request (see the sketch below)
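A minimal sketch of that rule of thumb, reusing the Epicurious search url from earlier:

import requests

response = requests.get("http://www.epicurious.com/search/Tofu%20Chili")
if response.status_code in (200, 201):
    data = response.content   # safe to use the response body
else:
    print("Request failed with status", response.status_code)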
response content
* response.content
returns the content of the HTTP response
* response.content.decode('utf-8')
if the content is byte encoded (which it usually is!), converts it into unicode - a python str
* What is unicode?
http://unicode.org/standard/WhatIsUnicode.html
* General rule of thumb
web pages are usually returned as byte strings and need to be decoded. utf-8 is the usual encoding (but not always!)
* What is utf-8?
https://www.w3schools.com/charsets/ref_html_utf8.asp
Try this
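For example, a minimal sketch of the fetch-and-decode cycle using the search url from before:

import requests

response = requests.get("http://www.epicurious.com/search/Tofu+Chili")
raw_bytes = response.content            # a bytes object
html_text = raw_bytes.decode('utf-8')   # a python str; raises UnicodeDecodeError if the page uses another encoding
print(type(raw_bytes), type(html_text))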
data formats
* HTML
the common format when scraping web pages for data
* JSON or XML
the usual formats when accessing data through an API or when the server is explicitly sharing data with you
JSON
JavaScript Object Notation

JSON          Python
number        int, float
string        str
null          None
true/false    True/False
object        dict
array         list
python json library
* json.dumps(<python_object>): converts a python object into a JSON formatted string (a str)
* json.loads(<json_string>): converts a JSON formatted string into the corresponding python object

import json
json_data = '[{"b": [2, 4], "c": 3.0, "a": "A"}]'   # a str
python_data = json.loads(json_data)                 # a list (of dicts)
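The reverse direction is a one-line round trip back to a JSON formatted string:

print(json.dumps(python_data))   # prints [{"b": [2, 4], "c": 3.0, "a": "A"}]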
requests and json
Let’s take a look at the JSON object returned by Google Geocoding API
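An abridged sketch of the shape of that JSON object (real responses contain many more fields; the coordinates here are illustrative):

{
  "results": [
    {
      "geometry": {
        "location": {"lat": 40.8075355, "lng": -73.9625727}
      }
    }
  ],
  "status": "OK"
}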
Working with json
Problem 1: write a function that takes an address string and returns its latitude and longitude
def get_lat_lng(address_string):
    #python code goes here
Solution to problem 1
import requests

def get_lat_lng(address):
    url = "https://maps.googleapis.com/maps/api/geocode/json?address=%s" % (address)
    response = requests.get(url)
    if response.status_code == 200:
        lat = response.json()['results'][0]['geometry']['location']['lat']
        lng = response.json()['results'][0]['geometry']['location']['lng']
        return lat, lng
    else:
        return None
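A usage sketch (note that Google now generally requires an API key on this endpoint, so the bare request may be rejected):

print(get_lat_lng("Columbia_University,_New_York,_NY"))
# e.g. (40.8075355, -73.9625727) if the request succeeds; None otherwise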
xml
- Tree structure
- Tagged elements (nested)
- Attributes
- Text (leaves of the tree)
(excerpt from a larger <Bookstore> document)
      <Last_Name>Berenholtz</Last_Name>
   </Book>
   <Book ...>
      <Remark>
         Five Hundred Buildings of New York and over one million other books are available for Amazon Kindle.
      </Remark>
      <Authors>
         <Author Residence="Beijing">
            <First_Name>Bill</First_Name>
            <Last_Name>Harris</Last_Name>
         </Author>
      ...
XML Tree
[tree diagram: the Bookstore root has Book subtrees; each Book node carries attributes (ISBN="ISBN-13:978-1599620787", Price="15.23" i.e. $15.23, Weight="1.5" i.e. 1.5oz), a title such as "New York Deco", and Author subtrees whose leaves are first and last names such as Bill Harris]
lxml: Python xml library
http://lxml.de/1.3/tutorial.html
Examining the tree
from lxml import etree
# root is the parsed tree, e.g. root = etree.fromstring(xml_string)
print(etree.tostring(root, pretty_print=True).decode("utf-8"))
lxml: Iterating over children of a tag
root is a collection of children, so we can iterate over it:
for child in root:
    print(child.tag)
lxml: Iterating over elements
iter is an 'iterator': it generates a sequence of elements in the order they appear in the xml document (see the sketch below)
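A minimal sketch of iter on the bookstore tree (tag names are those from the XML excerpt above):

# every element, in document order
for element in root.iter():
    print(element.tag)

# pass a tag name to filter, e.g. only Author elements
for author in root.iter('Author'):
    print(author.find('Last_Name').text)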
lxml: Finding elements by path
Find the last names of all authors in the tree "root" using a path
Solution
for last_name in root.findall('Book/Authors/Author/Last_Name'):
    print(last_name.text)
Paths can also filter on attributes, e.g. the first name of an author of the book weighing 1.5oz:
root.find('Book[@Weight="1.5"]/Authors/Author/First_Name').text
Problem 3
HTML/CSS
- Formats text
- Tagged elements (nested)
- Attributes
- Derived from SGML (but who cares!)
- Closely related to XML
- Can contain runnable scripts
Web scraping: Automating the process of extracting information from web pages
* for data collection and analysis
* for incorporating in a web app
APIs (Application Programming Interface): Functions and libraries for communicating with specific web servers
* for data collection and analysis
* for incorporating in a web app
import requests
from bs4 import BeautifulSoup
url = "http://www.epicurious.com/search/Tofu%20Chili"
response = requests.get(url)
page_soup = BeautifulSoup(response.content,'lxml')
print(page_soup.prettify())
page_soup is the object from which we will extract the data we need
Unique data identifiers
print(tag.get('class'))
prints
['gallery-content-card']
['article-content-card']
['recipe-content-card']
['recipe-content-card']
['article-content-card']
['recipe-content-card']
['recipe-content-card']
...
This gets the innermost tags with the recipe name.
Looks like class='recipe-content-card' gives us the recipes (see the sketch below for the full loop).
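A sketch of that inspection loop (it assumes the result cards are <article> tags, as they were on the Epicurious results page at the time):

for tag in page_soup.find_all('article'):
    print(tag.get('class'))   # the list of CSS classes on each card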
bs4 functions
<tag>.find(<tag_name>, attribute=value) finds the first matching tag, searching children recursively
<tag>.find_all(<tag_name>, attribute=value) finds all matching tags, searching children recursively, and returns them as a list
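For example, pulling the recipe cards out of page_soup using the class identified above (a sketch; it assumes each card contains a link with the recipe name, and note the trailing underscore in class_, since class is a python keyword):

recipes = page_soup.find_all('article', class_='recipe-content-card')
for recipe in recipes:
    link = recipe.find('a')   # first link inside the card
    if link:
        print(link.get_text(), link.get('href'))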
Problem 4: Step 1
def get_recipe_detail(url):
    import requests
    from bs4 import BeautifulSoup
    html_data = requests.get(url)
    if not html_data.status_code == 200:
        return '', []
    recipe_data = BeautifulSoup(html_data.content, 'lxml')
    description = get_description(recipe_data)   # need to write these two functions
    ing_list = get_ingredients(recipe_data)
    return description, ing_list
Problem 4: Step 2
def get_description(recipe_page_data):
    description_tag = recipe_page_data.find('div', itemprop='description')
    if description_tag:
        return description_tag.get_text()
    return ''
Problem 4: Step 3
def get_ingredients(ing_page_data):
    ing_list = list()
    for item in ing_page_data.find_all('li', class_='ingredient'):
        ing_list.append(item.get_text())
    return ing_list
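A usage sketch (this recipe url is hypothetical; in practice the urls come from the hrefs collected on the search results page):

description, ingredients = get_recipe_detail('http://www.epicurious.com/recipes/food/views/some-tofu-chili-recipe')
print(description)
print(ingredients)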
Problem 4: Step 4
Logging in with a requests session: Wikipedia example
The login form expects these parameters:
payload = {
    'wpName': username,
    'wpPassword': password,
    'wploginattempt': 'Log in',
    'wpEditToken': "+\\",
    'title': "Special:UserLogin",
    'authAction': "login",
    'force': "",
    'wpForceHttps': "1",
    'wpFromhttp': "1",
    #'wpLoginToken': '',   #We need to read this from the page
}
Extracting token information
login_page_response = s.get('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page')
soup = BeautifulSoup(login_page_response.content, 'lxml')
token = soup.find('input', {'name': "wpLoginToken"}).get('value')
Finalizing session parameters
username = "<your username>"
password = "<your password>"

def get_login_token(response):
    soup = BeautifulSoup(response.text, 'lxml')
    token = soup.find('input', {'name': "wpLoginToken"}).get('value')
    return token

payload = {
    'wpName': username,
    'wpPassword': password,
    'wploginattempt': 'Log in',
    'wpEditToken': "+\\",
    'title': "Special:UserLogin",
    'authAction': "login",
    'force': "",
    'wpForceHttps': "1",
    'wpFromhttp': "1",
    #'wpLoginToken': '',   # filled in from the login page at session time
}
Activating session
with requests.session() as s:
    response = s.get('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page')
    payload['wpLoginToken'] = get_login_token(response)
    #Send the login request
    response_post = s.post('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                           data=payload)
    #Get another page and check if we're still logged in
    response = s.get('https://en.wikipedia.org/wiki/Special:Watchlist')
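One heuristic check that the login stuck (a sketch: a logged-in session gets the actual watchlist page, while a logged-out one gets a login prompt):

soup = BeautifulSoup(response.content, 'lxml')
print(soup.title.string)   # e.g. "My watchlist - Wikipedia" when logged in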