Lecture03 Data II
Harvard IACS
CS109A
Pavlos Protopapas, Kevin Rader, and Chris Tanner
ANNOUNCEMENTS
Background
• So far, we’ve learned:
Background
• The Data Science Process:
Learning Objectives
Agenda
What are common sources
for data?
(For Data Science and computation purposes.)
Obtaining Data
• You curate it
• Someone else provides it, all pre-packaged for you (e.g., files)
Obtaining Data: Web scraping
Web scraping
• Transfer the data into a form that is compatible with your code
Obtaining Data: Web scraping
• Automate tasks
• Fun!
Obtaining Data: Web scraping
Robots.txt
• E.g., http://google.com/robots.txt
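A minimal sketch of checking robots.txt rules before scraping, using Python's standard `urllib.robotparser`. The robots.txt content, the user agent name `my-scraper`, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network access is needed here.
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(user_agent, url) reports whether the rules permit fetching the URL.
allowed_home = parser.can_fetch("my-scraper", "http://example.com/index.html")
allowed_private = parser.can_fetch("my-scraper", "http://example.com/private/data.html")

print("index.html allowed:", allowed_home)
print("private page allowed:", allowed_private)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a string.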
Obtaining Data: Web scraping
Web Servers
• A server maintains a long-running process (also called a daemon),
which listens on a pre-specified port
• It responds to requests, which are sent using a protocol called HTTP
(HTTPS is the secure version)
• Our browser sends these requests and downloads the content, then
displays it
• Status codes: 2xx means the request was successful; 4xx is a client error,
such as `page not found` or an incorrectly formed request; 5xx is a server error
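A rough sketch of the request/response cycle described here, using only the Python standard library. `DemoHandler`, `get_status`, and the paths are hypothetical names for this example; the server listens on a free local port and answers with a 2xx or 4xx status code.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: serves one page at "/" and 404 for anything else."""

    def do_GET(self):
        if self.path == "/":
            body = b"<html><body><p>Hello</p></body></html>"
            self.send_response(200)  # 2xx: request was successful
        else:
            body = b"page not found"
            self.send_response(404)  # 4xx: client error
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def get_status(port, path):
    """Send one GET request to the local server and return the status code."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("GET", path)
    response = conn.getresponse()
    response.read()
    conn.close()
    return response.status


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

ok_status = get_status(server.server_port, "/")
missing_status = get_status(server.server_port, "/nope")

server.shutdown()
server.server_close()
print(ok_status, missing_status)
```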
Obtaining Data: Web scraping
HTML
Example
• Tags are denoted by angle
brackets
• Almost all tags are in pairs e.g.,
<p>Hello</p>
• Some tags do not have a closing tag
e.g., <br/>
Obtaining Data: Web scraping
HTML
• <html>, indicates the start of an HTML page
• <body>, contains the items on the actual webpage
(text, links, images, etc.)
● <p>, the paragraph tag. Can contain text and links
● <a>, the link tag. Contains a link url, and possibly a description of the link
● <input>, a form input tag. Used for text boxes, and other user input
● <form>, a form start tag, to indicate the start of a form
● <img>, an image tag containing the link to an image
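To see these tags from code, here is a small sketch using the standard library's `html.parser`. The HTML snippet and the `TagCollector` class are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page, invented for this example.
HTML = """
<html>
  <body>
    <p>Hello<br/>world</p>
    <a href="http://example.com">An example link</a>
    <img src="cat.png">
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Records every opening tag and the URL of every <a> link."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        self.start_tags.append(tag)
        if tag == "a":
            self.links.append(dict(attrs).get("href"))


collector = TagCollector()
collector.feed(HTML)
print(collector.start_tags)
print(collector.links)
```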
Obtaining Data: Web scraping
• Documentation: http://crummy.com/software/BeautifulSoup
The Big Picture Recap
Obtaining Data: Web scraping
import requests

page = requests.get(url)  # send an HTTP GET request to url
page.status_code          # the HTTP status code of the response (e.g., 200)
page.content              # the content of the response, in bytes
Obtaining Data: Web scraping
BeautifulSoup
• Helps make messy HTML digestible
• Provides functions for quickly accessing certain sections of
HTML content
Example
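A minimal sketch, assuming the third-party `bs4` package is installed. The HTML string is made up for illustration and is deliberately a little messy: the second `<p>` is never closed.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical, slightly malformed HTML, invented for this example.
HTML = """
<html><body>
<p class="intro">Welcome to <a href="http://example.com/a">page A</a></p>
<p>See also <a href="http://example.com/b">page B</a>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra parser is required.
soup = BeautifulSoup(HTML, "html.parser")

# find() returns the first matching tag; find_all() returns every match.
first_paragraph = soup.find("p").get_text()
links = [a["href"] for a in soup.find_all("a")]

print(first_paragraph)
print(links)
```

In a scraping workflow, the string passed to `BeautifulSoup` would typically be the `page.content` returned by `requests.get(url)`.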
Exercise 1 time!
PANDAS
Store and Explore Data: PANDAS
What / Why?
Store and Explore Data: PANDAS
How
• Import the pandas library (by convention, aliased as pd)
• Use the read_csv() function to load a CSV file into a DataFrame
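A short sketch of those two steps. The CSV content is made up and read from an in-memory string so the example is self-contained; with a real file, a path such as `pd.read_csv("data.csv")` would take its place (that filename is hypothetical).

```python
import io

import pandas as pd  # "pd" is the conventional alias

# Hypothetical CSV content, invented for this example.
CSV_TEXT = """name,age,city
Ana,23,Boston
Ben,31,Cambridge
Cy,27,Somerville
"""

# read_csv() accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(CSV_TEXT))
print(df)
```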
Store and Explore Data: PANDAS
Visit https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/01_table_oriented.html
for a more in-depth walkthrough
Store and Explore Data: PANDAS
Example
• Say we have the following, tiny DataFrame of just 3 rows and 3 columns
Store and Explore Data: PANDAS
Example continued
df2.iloc[2] returns a Series representing the row at integer position 2 (NOT the row
labelled 2, though the two are often the same, as seen here)

df2.sort_values(by=['c']) returns the DataFrame with its rows reordered so that they
are in ascending order according to column c. In this example, df2 would remain the
same, as the values were already sorted
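A small sketch of the position-versus-label distinction and of sorting. The DataFrame `demo` and its values are hypothetical (it is not the slide's df2); its row labels are chosen so that positions and labels differ.

```python
import pandas as pd

# Hypothetical data: row labels 10, 20, 30 differ from positions 0, 1, 2.
demo = pd.DataFrame(
    {"a": [1, 2, 3], "b": [6, 5, 4], "c": [30, 10, 20]},
    index=[10, 20, 30],
)

row_by_position = demo.iloc[2]  # third row, whatever its label is
row_by_label = demo.loc[20]     # the row whose label is 20

# sort_values() returns a new DataFrame; demo itself is left unchanged.
sorted_demo = demo.sort_values(by=["c"])

print(row_by_position)
print(row_by_label)
print(sorted_demo)
```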
Store and Explore Data: PANDAS
• High-level viewing:
• head() – first N observations
• tail() – last N observations
• describe() – statistics of the quantitative data
• dtypes – the data types of the columns
• columns – names of the columns
• shape – the # of (rows, columns)
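The viewing tools listed above can be tried on a toy DataFrame. The column names and values below are made up for illustration.

```python
import pandas as pd

# Hypothetical data, invented for this example.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Di", "Ed", "Flo"],
    "age": [23, 31, 27, 45, 38, 29],
    "height_cm": [160.5, 175.0, 168.2, 180.1, 172.3, 158.9],
})

first_two = df.head(2)   # first N observations (N defaults to 5)
last_two = df.tail(2)    # last N observations
summary = df.describe()  # count, mean, std, min, quartiles, max of numeric columns

print(first_two)
print(last_two)
print(summary)
print(df.dtypes)   # data type of each column
print(df.columns)  # column names
print(df.shape)    # (rows, columns)
```

Note that `head`, `tail`, and `describe` are methods (called with parentheses), while `dtypes`, `columns`, and `shape` are attributes.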
Exploratory Data Analysis (EDA)
Why?
• EDA encompasses the “explore data” part of the data science process
• EDA is crucial but often overlooked:
• If your data is bad, your results will be bad
• Conversely, understanding your data well can help you create smart,
appropriate models
Exploratory Data Analysis (EDA)
What?
1. Store data in data structure(s) that will be convenient for exploring/processing
(Memory is fast. Storage is slow)
2. Clean/format the data so that:
• Each row represents a single object/observation/entry
• Each column represents an attribute/property/feature of that entry
• Values are numeric whenever possible
• Columns contain atomic properties that cannot be further decomposed*
Exploratory Data Analysis (EDA)
What? (continued)
3. Explore global properties: use histograms, scatter plots, and aggregation
functions to summarize the data
4. Explore group properties: group like-items together to compare subsets of the
data (are the comparison results reasonable/expected?)
This process transforms your data into a format which is easier to work
with, gives you a basic overview of the data's properties, and likely
generates several questions for you to follow-up in subsequent analysis.
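Steps 3 and 4 can be sketched with pandas aggregation functions. The tiny dataset and its column names are hypothetical, chosen only to show a global summary next to a per-group comparison.

```python
import pandas as pd

# Hypothetical data: one row per observation, one column per attribute.
df = pd.DataFrame({
    "species": ["cat", "dog", "cat", "dog", "cat"],
    "weight_kg": [4.0, 20.0, 5.0, 30.0, 3.0],
})

# Step 3, global properties: summarize a whole column.
overall_mean = df["weight_kg"].mean()
counts = df["species"].value_counts()

# Step 4, group properties: group like items together and compare subsets.
group_means = df.groupby("species")["weight_kg"].mean()

print(overall_mean)
print(counts)
print(group_means)
```

Here the overall mean hides a large difference between the two groups, which is the kind of follow-up question EDA is meant to surface. With matplotlib installed, `df["weight_kg"].hist()` would draw the corresponding histogram.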
Up Next
Exercise 2 time!