Data Collection - Python
Collection
Zakaria KERKAOU
[email protected]
Sources of Data
Company data:
• Collected by companies.
• Helps them make data-driven decisions.
Open data:
• Free open data sources.
• Can be used, shared, and built on by anyone.
Company Data
Web events.
Survey data.
Customer data.
Logistics data.
Financial transactions.
Company Data
Example: Web events
When you visit a web page or click on a link, usually this
information is tracked by companies in order to calculate
conversion rates or monitor the popularity of different pieces of
content.
Fields tracked per event: user_id, event_name, time/date, links_clicked.
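One such web event might be represented in Python as a simple dictionary; a sketch with illustrative field names and values:

```python
# A hypothetical web-event record; field names mirror the columns above.
event = {
    "user_id": 42,
    "event_name": "click",
    "timestamp": "2024-01-15T10:30:00",
    "links_clicked": ["/pricing", "/signup"],
}

# A toy conversion rate: the share of events whose clicks include /signup.
events = [event]
signups = sum(1 for e in events if "/signup" in e["links_clicked"])
print(signups / len(events))  # 1.0
```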
API (Application Programming Interface)
Public records.
Public data APIs
• Using APIs we can request data from third-party companies over the internet.
• Many companies have public APIs to let
anyone access their data.
• Some APIs:
YouTube Data API.
Facebook Graph API.
MediaWiki Action API (Wikipedia).
OpenWeather API.
Many more.
Public data APIs
• Example: an OpenWeather API call and the JSON response it returns.
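As a sketch, a current-weather request to OpenWeather with the requests library might look like this; the API key is a placeholder you would obtain from openweathermap.org, and the q and units parameters follow the public documentation:

```python
import requests

# Placeholder key -- sign up at openweathermap.org for a real one.
API_KEY = "YOUR_API_KEY"
url = "https://api.openweathermap.org/data/2.5/weather"
params = {"q": "Rabat", "appid": API_KEY, "units": "metric"}

response = requests.get(url, params=params)
print(response.status_code)  # 200 on success, 401 with an invalid key
print(response.json())       # weather data (or an error message) as JSON
```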
Public Record
Public records are another great way of gathering data. They can
be collected and shared by:
International organisations:
World Bank, United Nations, World Health Organization.
National statistical offices:
Censuses, surveys.
Governmental agencies:
Weather, environment, population.
Public Record
For example www.covidmaroc.ma
Application Programming Interface (API)
Python APIs
To use an API, you make a request to a remote web server, and retrieve the data you
need.
But why use an API instead of a static CSV dataset you can download from the web? APIs
are useful in the following cases:
The data is changing quickly. An example of this is stock price data.
You want a small piece of a much larger set of data. Reddit comments are one example.
There is repeated computation involved. Spotify has an API that can tell you the genre
of a piece of music.
Status codes are returned with every request that is made to a web server. Status codes
indicate information about what happened with a request. Here are some codes that are
relevant to GET requests:
200: Everything went okay, and the result has been returned (if any).
301: The server is redirecting you to a different endpoint.
400: The server thinks you made a bad request.
401: The server thinks you’re not authenticated.
403: The resource you’re trying to access is forbidden.
404: The resource you tried to access wasn’t found on the server.
503: The server is not ready to handle the request.
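Checking the status code before trusting a response is a good habit; a minimal sketch:

```python
import requests

response = requests.get("https://example.com/")

if response.status_code == 200:
    print("OK:", len(response.text), "characters received")
elif response.status_code == 404:
    print("Not found")
else:
    print("Unexpected status:", response.status_code)
```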
Requests in Python
The documentation for the API we’ll use tells us that the response we’ll get is in JSON
format. In the next section we’ll learn about JSON, but first let’s use the response.json()
method to see the data we received back from the API.
Example :
The Open Notify API, which gives access to data
about the international space station. It’s a
great API for learning because it has a very
simple design, and doesn’t require
authentication. We’ll teach you how to use an
API that requires authentication in a later
post.
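A minimal sketch of calling the Open Notify astronauts endpoint and decoding its JSON body (the endpoint path is taken from open-notify.org):

```python
import requests

# No API key needed: Open Notify is a free, unauthenticated service.
response = requests.get("http://api.open-notify.org/astros.json")
print(response.status_code)      # 200 if the service is up

data = response.json()           # decode the JSON body into Python objects
print(data["number"])            # how many people are in space right now
for person in data["people"]:
    print(person["name"], "-", person["craft"])
```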
JSON Data in Python
JSON (JavaScript Object Notation) is the language of APIs. JSON is a way to encode data
structures that ensures that they are easily readable by machines.
You can think of JSON as Python dictionaries, lists, strings, and numbers
encoded as a single string.
JSON Data in Python
Python has great JSON support with the json package. The json package is part of the
standard library, so we don’t have to install anything to use it. We can both convert lists
and dictionaries to JSON, and convert strings to lists and dictionaries.
The dumps() function is particularly useful, as we can use it to print a formatted string
that makes the JSON output easier to read:
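For example, round-tripping a dictionary through dumps() and loads():

```python
import json

astronauts = {"number": 2, "people": [{"name": "Ada", "craft": "ISS"}]}

# dumps(): Python object -> JSON string; indent makes it readable.
text = json.dumps(astronauts, indent=4)
print(text)

# loads(): JSON string -> Python object.
restored = json.loads(text)
print(restored["people"][0]["name"])  # Ada
```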
API with Query Parameters
In order to use request parameters with your API request, we need to add the params
argument to our request.
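With the requests library this means passing a dictionary as the params keyword argument; requests encodes it into the URL’s query string. A sketch using httpbin.org, a service that echoes back what it received:

```python
import requests

# httpbin.org/get simply echoes the request back, handy for experiments.
params = {"lat": 34.02, "lon": -6.83}
response = requests.get("https://httpbin.org/get", params=params)

print(response.url)             # full URL with the encoded query string
print(response.json()["args"])  # the parameters as the server saw them
```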
HTML Structure
child — a child is a tag inside another tag. So the two tags p and h1 above
are both children of the body tag.
parent — a parent is the tag another tag is inside. Above, the html tag is the
parent of the body tag.
sibling — a sibling is a tag that is nested inside the same parent as another
tag. For example, head and body are siblings, since they’re both inside
html. Both tags p and h1 are siblings, since they’re both inside body.
We can also add properties to HTML tags that change their behaviour.
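These relationships can be explored programmatically; a sketch using BeautifulSoup on a tiny illustrative document:

```python
from bs4 import BeautifulSoup

# A tiny document: p and h1 are children of body; head and body are siblings.
html = """
<html>
  <head><title>A page</title></head>
  <body>
    <h1>Heading</h1>
    <p>First paragraph.</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.body.parent.name)                # html  (body's parent)
print([c.name for c in soup.body.find_all(recursive=False)])  # direct children
print(soup.head.find_next_sibling().name)   # body  (head's sibling)
```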
The Requests Library
The first thing we’ll need to do to scrape a web page is to download the page.
We can download pages using the Python requests library.
To find the first instance of a tag in the parsed page, we can use BeautifulSoup’s find method.
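Putting the two steps together: download with requests, parse with BeautifulSoup, then use find to get the first matching tag (the URL is illustrative):

```python
import requests
from bs4 import BeautifulSoup

# Download the page; .content holds the raw HTML bytes.
page = requests.get("https://example.com/")
soup = BeautifulSoup(page.content, "html.parser")

# find() returns the first matching tag, or None if there is none.
first_paragraph = soup.find("p")
if first_paragraph is not None:
    print(first_paragraph.get_text())
```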
Searching for tags by class
You can use the find_all method to search for items by class or by id using the argument
class_ or id.
Example :
Let’s search for any tag that has the class outer-text:
Let’s search for any p tag that has the class outer-text:
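A sketch of both searches on an illustrative document (the class names mirror the examples above):

```python
from bs4 import BeautifulSoup

# An illustrative document with classes, as in the examples above.
html = """
<body>
  <p class="inner-text">First paragraph.</p>
  <p class="outer-text">Second paragraph.</p>
  <b class="outer-text">Bold text.</b>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# Any tag with the class outer-text (matches the p and the b tag):
print(soup.find_all(class_="outer-text"))

# Only p tags with the class outer-text:
print(soup.find_all("p", class_="outer-text"))
```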
Searching for tags by Id
The same thing can be done to search for elements by id, using the argument id
Example :
Let’s search for any tag that has the id first:
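A sketch on an illustrative document (ids are unique within a page, so at most one tag matches):

```python
from bs4 import BeautifulSoup

html = '<body><p id="first">First paragraph.</p><p>Second paragraph.</p></body>'
soup = BeautifulSoup(html, "html.parser")

# find_all with the id argument returns a list with at most one element.
print(soup.find_all(id="first"))
```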
Searching for Tags Using CSS Selectors
We can also search for items using CSS selectors. These selectors are how the CSS
language allows developers to specify HTML tags to style. Here are some examples:
p a — finds all a tags inside of a p tag.
body p a — finds all a tags inside of a p tag inside of a body tag.
html body — finds all body tags inside of an html tag.
p.outer-text — finds all p tags with a class of outer-text.
p#first — finds all p tags with an id of first.
body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.
Challenges of web scraping
Variety: every page is special.
Durability: websites change their structures.