Lecture03 Data II

This document discusses different ways to obtain data, including web scraping, and how to parse and explore data using Python libraries like Requests, BeautifulSoup, and Pandas. It provides an overview of obtaining data from various sources like files, APIs and webpages, parsing data using regular expressions and BeautifulSoup, and exploring data using Pandas.

Lecture 3: Data II

How to get it, methods to parse it, and ways to explore it.

Harvard IACS
CS109A
Pavlos Protopapas, Kevin Rader, and Chris Tanner
ANNOUNCEMENTS

• Homework 0 isn't graded for accuracy. If your questions were surface-level / clarifying questions, you're in good shape.

• Homework 1 is graded for accuracy

• it’ll be released today (due in a week)

• Study Break this Thurs @ 8:30pm and Fri @ 10:15am

• After lecture, please update your Zoom to the latest version

Background
• So far, we’ve learned:

Lecture 1: What is Data Science?
Lectures 1 & 2: The Data Science Process
Lecture 2: Data (types, formats, issues, etc.)
Lecture 2: Regular Expressions (briefly)
This lecture: How to get data and parse web data + PANDAS
Future lectures: How to model data
Background
• The Data Science Process:

Ask an interesting question

Get the Data (this lecture)

Explore the Data (this lecture)

Model the Data

Communicate/Visualize the Results
Learning Objectives

• Understand different ways to obtain data

• Be able to extract any web content of interest

• Be able to use basic PANDAS commands to store and explore data

• Feel comfortable using online resources to help with these libraries (Requests, BeautifulSoup, and PANDAS)
Agenda

How to get web data?

How to parse basic elements using BeautifulSoup

Getting started with PANDAS

What are common sources for data?
(For data science and computation purposes.)
Obtaining Data

Data can come from:

• You curate it

• Someone else provides it, all pre-packaged for you (e.g., files)

• Someone else provides an API

• Someone else has available content, and you try to take it (web scraping)
Obtaining Data: Web scraping

Web scraping

• Using programs to get data from the web

• Often much faster than manually copying data!

• Transfer the data into a form that is compatible with your code

• Legal and moral issues (per Lecture 2)

Obtaining Data: Web scraping

Why scrape the web?

• Vast source of information; can combine with multiple datasets

• Companies have not provided APIs

• Automate tasks

• Keep up with sites / real-time data

• Fun!

Obtaining Data: Web scraping

Web scraping tips:

• Be careful and polite

• Give proper credit

• Care about media law / obey licenses / privacy

• Don't be evil (no spamming, no overloading sites, etc.)
Obtaining Data: Web scraping

robots.txt

• Specified by web site owner

• Gives instructions to web robots (e.g., your code)

• Located at the top-level directory of the web server

• E.g., [Link]
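
As a rough illustration of how a program can follow these instructions, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the user-agent name `my-scraper`, and the `example.com` URLs are all made up for the example; a real scraper would read the file from the top-level directory of the site it targets.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(useragent, url) reports whether the rules allow a given URL.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/articles/1")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")

# crawl_delay(useragent) returns the requested pause between requests, if any.
delay = parser.crawl_delay("my-scraper")

print(allowed_public, allowed_private, delay)
```

Checking `can_fetch` before each request, and sleeping for the crawl delay between requests, is one simple way to stay polite.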

Obtaining Data: Web scraping

Web Servers
• A server maintains a long-running process (also called a daemon), which listens on a pre-specified port
• It responds to requests, which are sent using a protocol called HTTP (HTTPS is the encrypted variant)
• Our browser sends these requests, downloads the content, and then displays it
• Status codes: 2xx means the request was successful; 4xx is a client error (often 404, `page not found`, or 400, an incorrectly formed request); 5xx is a server error (the server failed while handling the request)
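
The status-code classes can be sketched in a few lines. The helper name `describe_status` is hypothetical; `http.HTTPStatus` is from the Python standard library and supplies the official phrase for each code.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Map an HTTP status code to its broad class."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


for code in (200, 404, 500):
    # HTTPStatus(code).phrase is the standard text, e.g. "Not Found" for 404.
    print(code, HTTPStatus(code).phrase, "->", describe_status(code))
```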

Obtaining Data: Web scraping

HTML
• Tags are denoted by angle brackets
• Almost all tags come in pairs, e.g., <p>Hello</p>
• Some tags do not have a closing tag, e.g., <br/>
Obtaining Data: Web scraping

HTML
• <html>, indicates the start of an HTML page
• <body>, contains the items on the actual webpage (text, links, images, etc.)
  • <p>, the paragraph tag. Can contain text and links
  • <a>, the link tag. Contains a link URL, and possibly a description of the link
  • <input>, a form input tag. Used for text boxes and other user input
  • <form>, a form start tag, to indicate the start of a form
  • <img>, an image tag containing the link to an image
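
To make the tags above concrete, here is a small sketch that parses a made-up page with BeautifulSoup and pulls out each kind of element. The HTML string, URLs, and variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page that uses the tags listed above.
html = """
<html>
  <body>
    <p>Read the <a href="https://example.com/docs">documentation</a>.</p>
    <img src="logo.png" alt="Logo"/>
    <form action="/search">
      <input type="text" name="q"/>
    </form>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Text inside the first <p>, including the text of the nested link.
paragraph_text = soup.p.get_text()

# Attribute values are read with dictionary-style access on each tag.
link_urls = [a["href"] for a in soup.find_all("a")]
image_sources = [img["src"] for img in soup.find_all("img")]
input_names = [field["name"] for field in soup.find_all("input")]

print(paragraph_text, link_urls, image_sources, input_names)
```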

Obtaining Data: Web scraping

How to Web scrape:

1. Get the webpage content

• Requests (Python library) gets a webpage for you

2. Parse the webpage content

• (e.g., find all the text or all the links on a page)

• BeautifulSoup (Python library) helps you parse the webpage.

• Documentation: [Link]

The Big Picture Recap

Data Sources: Files, APIs, Webpages (via Requests)

Data Parsing: Regular Expressions, Beautiful Soup

Data Structures/Storage: Traditional lists/dictionaries, PANDAS

Models: Linear Regression, Logistic Regression, kNN, etc.

BeautifulSoup only concerns webpage data
Obtaining Data: Web scraping

1. Get the webpage content

Requests (Python library) gets a webpage for you

page = requests.get(url)
page.status_code
page.content

• page.status_code gets the status from the webpage request. 200 means success. 404 means page not found.
• page.content returns the content of the response, in bytes.
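
Step 1 might be wrapped up roughly as follows. This is only a sketch: the function name `fetch_html`, the `get` parameter, and the 10-second timeout are assumptions for the example, not part of the Requests library itself.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the response body in bytes, or None if the request did not succeed.

    `get` defaults to requests.get; it is a parameter only so the function
    can be tried out without network access.
    """
    page = get(url, timeout=timeout)
    if page.status_code == 200:   # 200 means success
        return page.content       # raw bytes of the response
    return None                   # e.g., 404 means page not found
```

Calling `fetch_html("https://example.com")` (a placeholder URL) would return the page's bytes on success, ready to hand to a parser.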
Obtaining Data: Web scraping

2. Parse the webpage content

BeautifulSoup (Python library) helps you parse a webpage

soup = BeautifulSoup(page.content, "html.parser")

soup.title
soup.title.text

• soup.title returns the full element, including the title tag, e.g., <title data-rh="true">The New York Times – Breaking News</title>
• soup.title.text returns the text part of the title tag, e.g., The New York Times – Breaking News
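
A self-contained sketch of step 2, using a made-up byte string in place of a real response body so that it runs without a network connection. The page content and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in for the bytes returned by a web request (hypothetical content).
content = b"<html><head><title>Example Site - Home</title></head><body></body></html>"

soup = BeautifulSoup(content, "html.parser")

title_tag = soup.title         # the whole element, including the <title> tag
title_text = soup.title.text   # only the text inside the tag

print(title_tag)
print(title_text)
```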
Obtaining Data: Web scraping

BeautifulSoup
• Helps make messy HTML digestible
• Provides functions for quickly accessing certain sections of HTML content
Obtaining Data: Web scraping

HTML is a tree

• You don't have to access the HTML as a tree, though
• You can immediately search for tags/content of interest (à la the previous slide)
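
The two styles of access can be compared in a short sketch. The HTML string here is invented for the example; both approaches use standard BeautifulSoup calls.

```python
from bs4 import BeautifulSoup

# Made-up page for illustration.
html = "<html><body><h1>News</h1><p>First <a href='/a'>story</a></p><p>Second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Option 1: walk the tree explicitly, one level at a time.
body = soup.html.body
child_names = [child.name for child in body.children]

# Option 2: skip the traversal and search directly for what you want.
paragraphs = [p.get_text() for p in soup.find_all("p")]
first_link = soup.find("a")

print(child_names)
print(paragraphs)
print(first_link["href"], "is inside a", first_link.parent.name, "tag")
```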
Exercise 1 time!

PANDAS

Kung Fu Panda is property of DreamWorks and Paramount Pictures

Store and Explore Data: PANDAS

What / Why?

• Pandas is an open-source Python library (anyone can contribute)

• Provides high-performance, easy-to-use data structures and data analysis tools
• Unlike the NumPy library, which provides multi-dimensional arrays, Pandas provides a 2D table object called a DataFrame (akin to a spreadsheet with column names and row labels)
• Used by a lot of people
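
The contrast with NumPy can be seen in a small sketch. The column names and row labels below are made up for illustration.

```python
import numpy as np
import pandas as pd

# A plain NumPy array: values are addressed by integer position only.
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

# The same values as a DataFrame, with column names and row labels.
df = pd.DataFrame(arr, columns=["a", "b", "c"], index=["row1", "row2"])

print(arr[1, 2])             # position-based access
print(df.loc["row2", "c"])   # label-based access to the same cell
```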

Store and Explore Data: PANDAS

How
• Import the pandas library (it is convenient to rename it, conventionally to pd)
• Use the read_csv() function to load a CSV file into a DataFrame
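
A minimal sketch of these two steps. An in-memory string stands in for a file so the example runs anywhere; the column names and values are invented.

```python
import io

import pandas as pd  # "pd" is the conventional short name

# Stand-in for a CSV file on disk; with a real file you would pass its
# path instead, e.g. pd.read_csv("data.csv") (a hypothetical filename).
csv_text = io.StringIO(
    "name,age,city\n"
    "Ana,34,Boston\n"
    "Ben,27,Cambridge\n"
)

df = pd.read_csv(csv_text)
print(df)
```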

Store and Explore Data: PANDAS

What it looks like

Visit [Link] for a more in-depth walkthrough
Store and Explore Data: PANDAS

Example
• Say we have a tiny DataFrame, df2, of just 3 rows and 3 columns

df2['a'] selects column a

df2['a'] == 4 returns a Boolean Series representing which rows of column a equal 4: [False, True, False]

df2['a'].min() returns 1 because that's the minimum value in the a column

df2[['a', 'c']] selects columns a and c
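
These selections can be tried on a stand-in DataFrame. The slide's actual table is not reproduced in the text, so the values below are an assumption chosen to agree with the results described above.

```python
import pandas as pd

# Assumed contents for df2 (3 rows, 3 columns); not the original table.
df2 = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

col_a = df2["a"]           # a Series holding column a
mask = df2["a"] == 4       # a Boolean Series, one entry per row
smallest = df2["a"].min()  # minimum value in column a
subset = df2[["a", "c"]]   # a DataFrame with only columns a and c

print(mask.tolist(), smallest, list(subset.columns))
```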

Store and Explore Data: PANDAS

Example continued

df2['a'].unique() returns each distinct value of the a column once

df2.loc[2] returns a Series representing the row with the label 2

df2.loc[df2['a'] == 4] returns all rows for which the passed-in Boolean mask is True; here the mask is [False, True, False]
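
A short sketch of these calls, again on an assumed df2 whose values are chosen only to match the description above.

```python
import pandas as pd

# Assumed contents for df2; the original table is not in the text.
df2 = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

distinct = df2["a"].unique()        # array of the distinct values in column a
row = df2.loc[2]                    # Series for the row labelled 2
matches = df2.loc[df2["a"] == 4]    # DataFrame of rows where the mask is True

print(distinct.tolist())
print(row.tolist())
print(matches)
```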

Store and Explore Data: PANDAS

Example continued

df2.iloc[2] returns a Series representing the row at position 2 (NOT the row labelled 2, though they are often the same, as seen here)

df2.sort_values(by=['c']) returns a new DataFrame with the rows reordered so that they are in ascending order according to column c. In this example, the result would look the same as df2, as the values were already sorted
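
The difference between label and position only shows up when the two disagree, so this sketch uses a made-up DataFrame whose row labels are deliberately out of order.

```python
import pandas as pd

# Made-up frame whose row labels differ from the row positions.
df = pd.DataFrame({"a": [7, 1, 4], "c": [9, 3, 6]}, index=[2, 0, 1])

by_label = df.loc[2]      # row labelled 2, which is the first row here
by_position = df.iloc[2]  # row at position 2, which is the third row here

# sort_values returns a new DataFrame; df itself is left unchanged.
sorted_df = df.sort_values(by=["c"])

print(by_label["a"], by_position["a"])
print(sorted_df)
```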

Store and Explore Data: PANDAS

Common PANDAS functions

• High-level viewing:
• head(N) – first N observations (5 by default)
• tail(N) – last N observations (5 by default)
• describe() – statistics of the quantitative data
• dtypes – the data types of the columns
• columns – names of the columns
• shape – the # of (rows, columns)
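
A quick sketch of the viewing functions on a small DataFrame. The column names and numbers are made up purely for illustration.

```python
import pandas as pd

# Made-up data; the values carry no real-world meaning.
df = pd.DataFrame({
    "label": ["p", "q", "r", "s", "t", "u"],
    "value": [650, 118, 81, 101, 88, 59],
})

first_rows = df.head(2)   # first 2 observations
last_rows = df.tail(2)    # last 2 observations
summary = df.describe()   # count, mean, std, min, quartiles, max of numeric columns

print(df.dtypes)            # data type of each column
print(df.columns.tolist())  # column names
print(df.shape)             # (rows, columns)
```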

Store and Explore Data: PANDAS

Common PANDAS functions

• Accessing/processing:
• df[“column_name”]
• df.column_name
• .max(), .min(), .idxmax(), .idxmin()
• df[<conditional statement>] – Boolean filtering, e.g., df[df["column_name"] > 0]
• .loc[] – label-based accessing
• .iloc[] – integer position-based accessing
• .sort_values()
• .isnull(), .notnull()
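
Several of the accessors above, sketched on a made-up DataFrame that includes one missing value.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing score.
df = pd.DataFrame({"item": ["x", "y", "z"], "score": [3.5, np.nan, 9.0]})

same_column = df["score"].equals(df.score)  # two spellings of the same column
best_row_label = df["score"].idxmax()       # label of the row with the largest score
high = df[df["score"] > 5]                  # conditional (Boolean) filtering
missing = df["score"].isnull()              # True where a value is missing

print(same_column, best_row_label)
print(high)
print(missing.tolist())
```

The attribute form (`df.score`) only works when the column name is a valid Python identifier, so the bracket form is the safer default.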

Store and Explore Data: PANDAS

Common PANDAS functions

• Grouping/Splitting/Aggregating:
• .groupby(), .get_group()
• .merge()
• pd.concat()
• .aggregate()
• .append() (deprecated in pandas 1.4 and removed in 2.0; use pd.concat() instead)
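
A compact sketch of grouping, aggregating, merging, and concatenating. The two small tables and their column names are invented for the example.

```python
import pandas as pd

# Made-up tables for illustration.
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "units": [3, 5, 2, 8],
})
managers = pd.DataFrame({"store": ["A", "B"], "manager": ["Kim", "Lee"]})

grouped = sales.groupby("store")
store_a = grouped.get_group("A")                      # rows belonging to one group
totals = grouped["units"].aggregate(["sum", "mean"])  # per-group statistics

merged = sales.merge(managers, on="store")            # SQL-style join on a key column

# Stacking rows from two frames with pd.concat.
more_sales = pd.DataFrame({"store": ["C"], "units": [4]})
combined = pd.concat([sales, more_sales], ignore_index=True)

print(totals)
print(merged)
print(combined)
```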

Exploratory Data Analysis (EDA)

Why?
• EDA encompasses the “explore data” part of the data science process
• EDA is crucial but often overlooked:
• If your data is bad, your results will be bad
• Conversely, understanding your data well can help you create smart, appropriate models
Exploratory Data Analysis (EDA)

What?
1. Store data in data structure(s) that will be convenient for exploring/processing (memory is fast; storage is slow)
2. Clean/format the data so that:
• Each row represents a single object/observation/entry
• Each column represents an attribute/property/feature of that entry
• Values are numeric whenever possible
• Columns contain atomic properties that cannot be further decomposed*

* Unlike food waste, which can be composted.

Please consider composting food scraps.
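
As one possible illustration of the cleaning step, the sketch below splits a non-atomic column and converts text to numbers. The names, values, and column names are all made up.

```python
import pandas as pd

# Made-up raw data: "name" is not atomic and "height_cm" is stored as text.
raw = pd.DataFrame({
    "name": ["Ana Silva", "Ben Okoro"],
    "height_cm": ["165", "178"],
})

clean = raw.copy()

# Decompose the combined name into two atomic columns.
clean[["first_name", "last_name"]] = clean["name"].str.split(" ", n=1, expand=True)
clean = clean.drop(columns=["name"])

# Store values as numbers whenever possible.
clean["height_cm"] = pd.to_numeric(clean["height_cm"])

print(clean)
print(clean.dtypes)
```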

Exploratory Data Analysis (EDA)

What? (continued)
3. Explore global properties: use histograms, scatter plots, and aggregation functions to summarize the data
4. Explore group properties: group like items together to compare subsets of the data (are the comparison results reasonable/expected?)

This process transforms your data into a format that is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to follow up on in subsequent analysis.
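
Steps 3 and 4 might look roughly like this in pandas. The data is made up, and the histogram is computed as text-only bin counts (via `pd.cut`) rather than plotted, to keep the sketch free of plotting dependencies.

```python
import pandas as pd

# Made-up observations: one row per entry, one column per attribute.
df = pd.DataFrame({
    "group": ["x", "x", "y", "y", "y"],
    "value": [1.0, 3.0, 10.0, 12.0, 14.0],
})

# Global properties: summary statistics and a coarse histogram of bin counts.
overall = df["value"].describe()
histogram = pd.cut(df["value"], bins=[0, 5, 10, 15]).value_counts().sort_index()

# Group properties: compare subsets of the data.
group_means = df.groupby("group")["value"].mean()

print(overall)
print(histogram)
print(group_means)
```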

Up Next

We will address EDA more and dive into advanced PANDAS operations
Exercise 2 time!
