
Getting data

The problem!

Data exists in different sources and different formats, and we have to work with whatever format we get.

Or: we go to data analysis with data in the format we have, not the format we want!
Sources of data
local data
csv files
pdf files
xls files

web data
json
xml
html

database servers
mysql
postgres
mongoDB
RESTful Web Services

REST: Representational State Transfer

"A network of web pages connected through links and HTTP commands (GET, POST, etc.)"

RESTful: A web service that conforms to the REST standards
RESTful Web Services

URLs: RESTful Web Services deliver resources to the client. Each resource (html, json, image, etc.) is associated with a URL and an HTTP method.

RESTful: A web service that conforms to the REST standards
Example: Epicurious

http://www.epicurious.com

Type "Tofu Chili" in the search box

HTTP GET

new url:
http://www.epicurious.com/search/Tofu%20Chili
Example: NYTIMES login

https://myaccount.nytimes.com/auth/login

Type email address and password in the form

HTTP POST

new url:
http://www.nytimes.com/
Example: Google Geocoding API

HTTP GET request with a JSON response:

https://maps.googleapis.com/maps/api/geocode/json?address=Columbia_University,_New_York,_NY

All Google API requests take the form:

<api_url>/<response_type>?<parameters>

  api_url:        https://maps.googleapis.com/maps/api/geocode/
  response_type:  json (or xml)
  parameters:     address=Columbia_University,_New_York,_NY
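As a sketch, the same request can be built with the requests library's params argument. Note that the live Geocoding API now also expects a key parameter; that parameter is an assumption added here, not something shown on the slide.

import requests

# Sketch: build the geocoding request with params instead of hand-encoding the query string.
base_url = "https://maps.googleapis.com/maps/api/geocode/json"
params = {
    "address": "Columbia University, New York, NY",
    # "key": "YOUR_API_KEY",   # required by the current API; placeholder only
}
response = requests.get(base_url, params=params)
print(response.url)           # requests URL-encodes the address for us
print(response.status_code)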
What we need

The ability to:

* create and send HTTP requests
* receive and process HTTP responses
* convert data residing in JSON/XML/HTML format into python objects
Python libraries for getting web data

* Send an http request and get an http response
  * requests
  * urllib.request (urllib2 on python2)

* Parse the response and extract data
  * json
  * lxml
  * BeautifulSoup, Selenium (for html data)
http requests

requests: Python library for handling http requests and responses

http://docs.python-requests.org/en/master/
using requests

* Import the library
import requests

* Construct the url
url = "http://www.epicurious.com/search/Tofu+Chili"

* Send the request and get a response
response = requests.get(url)

* Check if the request was successful
if response.status_code == 200:
    print("SUCCESS!")
else:
    print("FAILURE!")
response status codes

* 200 or 201
  the request-response cycle worked as planned
* Other 200s
  the request-response cycle worked but there is additional information associated with the response
* 400s
  there was an error (page not found/malformed request/etc.)
* General rule of thumb
  check that the status code is 200 when accessing data through a GET or POST HTTP request
response content

* response.content
  returns the content of the HTTP response
* response.content.decode('utf-8')
  if the content is byte encoded (which it usually is!), converts it into unicode - a python str
* What is unicode?
  http://unicode.org/standard/WhatIsUnicode.html
* General rule of thumb
  web pages are usually returned as byte strings and need to be decoded. utf-8 is the usual encoding (but not always!)
* What is utf-8?
  https://www.w3schools.com/charsets/ref_html_utf8.asp
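A minimal sketch of the decode step. The response.encoding check is an addition for illustration, not from the slide.

import requests

# Fetch a page and decode the raw bytes ourselves.
response = requests.get("https://en.wikipedia.org/wiki/Main_Page")
raw_bytes = response.content          # bytes
print(response.encoding)             # charset requests guessed from the headers
text = raw_bytes.decode("utf-8")     # usually utf-8, but not always!
print(text[:100])                     # first 100 characters of the decoded str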
Try this

1. Open https://en.wikipedia.org/wiki/main_page using the requests library
2. Check the status code. Did your request work?
3. Get the content. Decode it. Then search the page for the string "Did you know" using the str find function
4. If your find function returned a positive number - Great!
5. If it returned -1 (that means it was not found), you've done something wrong. Try figuring out what went wrong
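One possible sketch of a solution (whether "Did you know" is actually found depends on the live page content when you run it):

import requests

# Sketch solution to the exercise above.
response = requests.get("https://en.wikipedia.org/wiki/Main_Page")
if response.status_code == 200:
    text = response.content.decode("utf-8")
    position = text.find("Did you know")   # -1 means the string was not found
    print("Found at index:", position)
else:
    print("Request failed with status", response.status_code)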
web data formats

* HTML
the common format when scraping web pages for data
* JSON or XML
usually when accessing data through an API or when
the server is explicitly sharing data with you
JSON
JavaScript Object Notation

- Standard for "serializing" data objects for storage or transmission
- Human-readable, useful for data interchange
- Also useful for representing and storing semistructured data
- Stored as plain text (byte strings or utf-8 strings)
JSON constructs and Python equivalents

JSON          Python
number        int, float
string        str
null          None
true/false    True/False
object        dict
array         list
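A small sketch showing the mapping in action (the values are chosen arbitrarily for illustration):

import json

# JSON text exercising each construct: number, string, null, booleans, object, array.
json_text = '{"count": 3, "name": "tofu", "note": null, "vegan": true, "tags": ["easy", "quick"]}'
obj = json.loads(json_text)
print(type(obj))             # <class 'dict'>   (JSON object -> dict)
print(type(obj["count"]))    # <class 'int'>    (JSON number -> int)
print(obj["note"])           # None             (JSON null   -> None)
print(obj["vegan"])          # True             (JSON true   -> True)
print(type(obj["tags"]))     # <class 'list'>   (JSON array  -> list)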
python json library

json.loads(<str>): converts a JSON string to python objects

json.dumps(<python_object>): converts a python object into a JSON formatted string
python json library

Converts a json string to an equivalent python type:

import json
json_data = '[{"b": [2, 4], "c": 3.0, "a": "A"}]'   # str
python_data = json.loads(json_data)                  # list
python json library

Converts a python object to an equivalent json string:

import json
data_string = json.dumps(python_data)                # list -> str
requests and json

The response object handles json, and the requests library handles spaces in a url for you.

address = "Columbia University, New York, NY"
url = "https://maps.googleapis.com/maps/api/geocode/json?address=%s" % (address)
response = requests.get(url).json()

response now points to a python data object
requests and json

response.json() raises an exception if the JSON object is ill-formed

note that some http errors are returned as JSON

always check for exceptions!
requests and json

import requests

address = "Columbia University, New York, NY"
url = "https://maps.googleapis.com/maps/api/geocode/json?address=%s" % (address)
try:
    response = requests.get(url)
    if not response.status_code == 200:
        print("HTTP error", response.status_code)
    else:
        try:
            response_data = response.json()
        except:
            print("Response not in valid JSON format")
except:
    print("Something went wrong with requests.get")
print(type(response_data))
requests and json: example

Let’s take a look at the JSON object returned by Google Geocoding API
Working with json

Problem 1: Write a function that takes an address as an argument and returns a (latitude, longitude) tuple

def get_lat_lng(address_string):
    #python code goes here
Solution to problem 1

def get_lat_lng(address):
url="https://maps.googleapis.com/maps/api/geocode/json?address=%s"%(address)
import requests
response = requests.get(url)
if response.status_code == 200:
lat = response.json()['results'][0]['geometry']['location']['lat']
lng = response.json()['results'][0]['geometry']['location']['lng']
return lat,lng
else:
return None
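A hedged usage sketch (the live Geocoding API now requires an API key, so a successful lookup is an assumption here):

# Hypothetical call; requires a working (possibly key-authenticated) geocoding endpoint.
result = get_lat_lng("Columbia University, New York, NY")
if result:
    lat, lng = result
    print("Latitude:", lat, "Longitude:", lng)
else:
    print("Lookup failed")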
xml

Extensible Markup Language

- Tree structure
- Tagged elements (nested)
- Attributes
- Text (leaves of the tree)
xml: Example (excerpt)

        ...
        <Last_Name>Berenholtz</Last_Name>
      </Author>
    </Authors>
  </Book>
  <Book ISBN="ISBN-13:978-1579128562" Price="15.80">
    <Remark>
      Five Hundred Buildings of New York and over one million other books are available for Amazon Kindle.
    </Remark>
    <Title>Five Hundred Buildings of New York</Title>
    <Authors>
      <Author Residence="Beijing">
        <First_Name>Bill</First_Name>
        <Last_Name>Harris</Last_Name>
      </Author>
      ...
XML Tree

[Diagram: the Bookstore tree. The root (Bookstore) has Book children; each Book carries attributes such as ISBN="ISBN-13:978-1599620787", Price="15.23", Weight="1.5" and contains Title and Authors subtrees. Titles (e.g. "New York Deco") and the Author First_Name/Last_Name elements (e.g. "Bill", "Harris") are the leaves of the tree.]
lxml: Python xml library

xml tree object definition:

from lxml import etree
root = etree.XML(data)
print(root.tag)    # prints the root tag (Bookstore in our example)

http://lxml.de/1.3/tutorial.html
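The slides assume an XML string named data already exists. Here is a minimal, self-contained sketch under that assumption; the first name and residence of the Berenholtz author are filled in purely for illustration (only the last name appears in the excerpt above).

from lxml import etree

# Abbreviated stand-in for the Bookstore document used in the slides.
data = """
<Bookstore>
  <Book ISBN="ISBN-13:978-1599620787" Price="15.23" Weight="1.5">
    <Title>New York Deco</Title>
    <Authors>
      <Author Residence="New York City">
        <First_Name>Richard</First_Name>
        <Last_Name>Berenholtz</Last_Name>
      </Author>
    </Authors>
  </Book>
  <Book ISBN="ISBN-13:978-1579128562" Price="15.80">
    <Title>Five Hundred Buildings of New York</Title>
    <Authors>
      <Author Residence="Beijing">
        <First_Name>Bill</First_Name>
        <Last_Name>Harris</Last_Name>
      </Author>
    </Authors>
  </Book>
</Bookstore>
"""
root = etree.XML(data)
print(root.tag)   # Bookstore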
lxml: Python xml library
Examining the tree

print(etree.tostring(root, pretty_print=True).decode("utf-8"))
lxml: Iterating over children of a tag

root is a collection of children, so we can iterate over it:

for child in root:
    print(child.tag)
lxml: Iterating over elements

for element in root.iter():
    print(element.tag)

iter() is an 'iterator': it generates a sequence of elements in the order they appear in the xml document
lxml: Iterating over elements

root.iter("Author") iterates over the tree, matching only those elements that have "Author" as the tag. element.find('First_Name') returns the first element that has "First_Name" as the tag.

for element in root.iter("Author"):
    print(element.find('First_Name').text, element.find('Last_Name').text)
lxml: using XPath

XPath: an expression language for navigating through an xml tree

for element in root.findall('Book/Title'):
    print(element.text)
Try this

Find the last names of all authors in the tree "root" using a path expression
Solution

for element in root.findall('Book/Authors/Author/Last_Name'):
    print(element.text)
lxml: Finding by attribute value

root.find('Book[@Weight="1.5"]/Authors/Author/First_Name').text
Problem 3

Print the first and last names of all authors who live in New York City

from lxml import etree
root = etree.XML(data_string)
for author_element in root.findall('Book/Authors/Author[@Residence="New York City"]'):
    f_name = author_element.find('First_Name').text
    l_name = author_element.find('Last_Name').text
    print(f_name, l_name)
HTML

HyperText Markup Language

- Formats text
- Tagged elements (nested)
- Attributes
- Derived from SGML (but who cares!)
- Closely related to XML
- Can contain runnable scripts
HTML/CSS

Study this on your own!

Make sure you've reviewed the first two topics ("Intro to HTML" and "Intro to CSS") on Khan Academy:
https://www.khanacademy.org/computing/computer-programming/html-css
getting data from the web
creeping, crawling, pouncing!

Web scraping: Automating the process of extracting information from web pages
* for data collection and analysis
* for incorporating in a web app

APIs (Application Programming Interface): Functions and libraries for communicating with specific web servers
* for data collection and analysis
* for incorporating in a web app

Web crawling: Automating the process of traversing links on web pages
* for indexing the web
* for collecting data from multiple web sites
Legal and ethical issues

Legal issues
➡ Often against the 'Terms of Use' of a web site
➡ but, regardless, murky and not fully settled
➡ See: https://en.wikipedia.org/wiki/Web_scraping#Legal_issues
➡ probably depends upon three things:
   ๏ factual, non-proprietary data is generally ok
   ๏ proprietary data scraping depends on what you do with it
   ๏ potential or actual damage to the scrapee

Ethical issues
➡ Public vs. private information. Scraping public information is rarely unethical
➡ Purpose. Ethical or unethical depends on why you're scraping the web site
➡ Always better to try and get the information openly using APIs or contacting the owner of the server
➡ Is there a public interest involved? If yes, it's probably ethical to scrape
Libraries for web scraping

requests: Python library for connecting to a web page, managing the connection and retrieving the contents of the page

Beautiful Soup: A library that utilizes the 'tag structure' of an html page to quickly parse the contents of a page and retrieve data

Selenium: A library that automates a browser, so the javascript on a page runs before you retrieve data. Slower than Beautiful Soup but gets around the 'javascript' problem
BeautifulSoup4
➡ HTML (and XML) parser
➡ Uses 'tags'
➡ Creates a parse tree (using lxml/html5lib or another python parser)
➡ Can handle incomplete tagging
➡ tags are organized in hierarchical dictionaries

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
bs4

initialize bs4 object: BeautifulSoup(document, parser)
parser: lxml (fast) or html5lib (slower but more robust)

import requests
from bs4 import BeautifulSoup
url = "http://www.epicurious.com/search/Tofu%20Chili"
response = requests.get(url)
page_soup = BeautifulSoup(response.content,'lxml')
print(page_soup.prettify())

page_soup is the object from which we will extract the data we need
Unique data identifiers

➡ We want to create a list of recipes and links to the recipes
➡ We need to figure out how to 'programmatically' extract each recipe name and recipe link
➡ Search for the tag with a unique attribute value that identifies recipes and recipe links
➡ The easiest way is to examine the page source in a browser
Finding unique data identifiers

for tag in page_soup.find_all('article'):
    print(tag.get('class'))

prints:

['article-content-card']
['gallery-content-card']
['recipe-content-card']
['article-content-card']
['recipe-content-card']
['recipe-content-card']
['recipe-content-card']
['article-content-card']
['recipe-content-card']
['recipe-content-card']
['recipe-content-card']
['recipe-content-card']
['article-content-card']
['article-content-card']
['recipe-content-card']

This gets the innermost tags with the recipe names. It looks like class='recipe-content-card' gives us the recipes.
bs4 functions
<tag>.find(<tag_name>,attribute=value) finds the first matching child tag (recursively)

<tag>.find_all(<tag_name>,attribute=value) finds all matching child tags (recursively)

<tag>.get_text() returns the marked up text

<tag>.parent returns the (immediate) parent

<tag>.parents returns all parents (recursively)

<tag>.children returns the (direct) children

<tag>.descendants returns all children (recursively)

<tag>.get(attribute) returns the value of the specified attribute

<tag>.name returns the name of a tag

<tag>.attrs returns all the attributes of a tag
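A small sketch exercising a few of these calls on a hand-made snippet (the HTML below is invented purely for illustration):

from bs4 import BeautifulSoup

# Invented snippet, just to demonstrate the tag navigation functions above.
html = '<div class="card"><a href="/recipe/1">Tofu Chili</a><p class="dek">Spicy!</p></div>'
soup = BeautifulSoup(html, 'lxml')

link = soup.find('a')                         # first matching tag
print(link.get_text())                        # Tofu Chili
print(link.get('href'))                       # /recipe/1
print(link.parent.name)                       # div
print(link.parent.attrs)                      # {'class': ['card']}
print(soup.find('p', class_='dek').get_text())  # Spicy!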


Problem 4: Extract recipes and recipe links

Write a function epicurious_recipes(search_string) that returns the list of recipes and links associated with search_string.

Call the function with a search_string, open the link associated with the first recipe, then return the ingredients and preparation instructions associated with that link.
Problem 4: Step 1

given a recipe url, get the description and the ingredients

def get_recipe_detail(url):
    import requests
    from bs4 import BeautifulSoup
    html_data = requests.get(url)
    if not html_data.status_code == 200:
        return '',[]
    recipe_data = BeautifulSoup(html_data.content,'lxml')
    description = get_description(recipe_data)   # need to write these two functions
    ing_list = get_ingredients(recipe_data)
    return description,ing_list
Problem 4: Step 2

write the get_description function

def get_description(recipe_page_data):
    description_tag = recipe_page_data.find('div', itemprop='description')
    if description_tag:
        return description_tag.get_text()
    return ''
Problem 4: Step 3

write the get_ingredients function

def get_ingredients(ing_page_data):
    ing_list = list()
    for item in ing_page_data.find_all('li', class_='ingredient'):
        ing_list.append(item.get_text())
    return ing_list
Problem 4: Step 4

write the function that:
  takes key words as an argument
  and returns a list of tuples (name, link, description)

def get_recipes(keywords):
    recipe_list = list()
Problem 4: Step 4 code (continues the body of get_recipes)

    import requests
    from bs4 import BeautifulSoup
    url = "http://www.epicurious.com/search/" + keywords
    response = requests.get(url)
    if not response.status_code == 200:
        return None
    try:
        results_page = BeautifulSoup(response.content,'lxml')
        recipes = results_page.find_all('article', class_="recipe-content-card")
        for recipe in recipes:
            recipe_link = "http://www.epicurious.com" + recipe.find('a').get('href')
            recipe_name = recipe.find('a').get_text()
            try:
                recipe_description = recipe.find('p', class_='dek').get_text()
            except:
                recipe_description = ''
            recipe_list.append((recipe_name, recipe_link, recipe_description))
        return recipe_list
    except:
        return None
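A hedged end-to-end usage sketch. Epicurious has changed its markup since these slides were written, so the class names used above may no longer match and the calls may return empty results.

# Tie the pieces together: search, then fetch detail for the first hit.
recipes = get_recipes("Tofu%20Chili")
if recipes:
    name, link, blurb = recipes[0]
    print(name, link)
    description, ingredients = get_recipe_detail(link)
    print(description)
    for ingredient in ingredients:
        print("-", ingredient)
else:
    print("No recipes found (markup may have changed)")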
logging in with requests and beautifulsoup
➡ Figure out the login url
   ➡ https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page
➡ Look for the login form in the html source
   ➡ form_tag = page_soup.find('form')
➡ Look for ALL the inputs in the login form (some may be tricky!)
   ➡ input_tags = form_tag.find_all('input')
➡ Create a Python dict object with key,value pairs for each input
➡ Use requests.session to create an open session object
➡ Send the login request (POST)
➡ Send followup requests keeping the session object open
Setting up the inputs

payload = {
    'wpName': username,
    'wpPassword': password,
    'wploginattempt': 'Log in',
    'wpEditToken': "+\\",
    'title': "Special:UserLogin",
    'authAction': "login",
    'force': "",
    'wpForceHttps': "1",
    'wpFromhttp': "1",
    #'wpLoginToken': '',   # We need to read this from the page
}
Extracting token information

wpLoginToken: the value of this attribute is provided by the page. We need to extract it.

login_page_response = s.get('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page')
soup = BeautifulSoup(login_page_response.content,'lxml')
token = soup.find('input',{'name':"wpLoginToken"}).get('value')
Finalizing session parameters

username = <your username>
password = <your password>

def get_login_token(response):
    soup = BeautifulSoup(response.text, 'lxml')
    token = soup.find('input',{'name':"wpLoginToken"}).get('value')
    return token

payload = {
    'wpName': username,
    'wpPassword': password,
    'wploginattempt': 'Log in',
    'wpEditToken': "+\\",
    'title': "Special:UserLogin",
    'authAction': "login",
    'force': "",
    'wpForceHttps': "1",
    'wpFromhttp': "1",
    #'wpLoginToken': '',
}
Activating session

with requests.session() as s:
    response = s.get('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page')
    payload['wpLoginToken'] = get_login_token(response)
    # Send the login request
    response_post = s.post('https://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login',
                           data=payload)
    # Get another page and check if we're still logged in
    response = s.get('https://en.wikipedia.org/wiki/Special:Watchlist')