Data Collection - Python
Collection
Zakaria KERKAOU
[email protected]
Sources of Data
Company data:
• Collected by companies.
• Helps them make data-driven decisions.
Open data:
• Free open data sources.
• Can be used, shared, and built on by anyone.
Company Data
Web events.
Survey data.
Customer data.
Logistics data.
Financial transactions.
Company Data
Example: Web events
When you visit a web page or click on a link, usually this
information is tracked by companies in order to calculate
conversion rates or monitor the popularity of different pieces of
content.
Fields tracked per event: user_id, event_name, time/date, links_clicked.
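One such web event might be represented in Python as a simple dictionary; a sketch with illustrative field names and values:

```python
# A hypothetical web-event record; field names mirror the columns above.
event = {
    "user_id": 42,
    "event_name": "click",
    "timestamp": "2024-01-15T10:30:00",
    "links_clicked": ["/pricing", "/signup"],
}

# A toy conversion rate: the share of events whose clicks include /signup.
events = [event]
signups = sum(1 for e in events if "/signup" in e["links_clicked"])
print(signups / len(events))  # 1.0
```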
API (Application Programming Interface)
Public records.
Public data APIs
• Using APIs we can request data from third-party companies over the internet.
• Many companies have public APIs to let
anyone access their data.
• Some APIs:
YouTube Data API.
Facebook Graph API.
MediaWiki Action API (Wikipedia).
OpenWeather API.
Many more.
Public data APIs
• Example: an OpenWeather API call and the JSON response it returns.
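As a sketch, a current-weather request to OpenWeather with the requests library might look like this; the API key is a placeholder you would obtain from openweathermap.org, and the q and units parameters follow the public documentation:

```python
import requests

# Placeholder key -- sign up at openweathermap.org for a real one.
API_KEY = "YOUR_API_KEY"
url = "https://api.openweathermap.org/data/2.5/weather"
params = {"q": "Rabat", "appid": API_KEY, "units": "metric"}

response = requests.get(url, params=params)
print(response.status_code)  # 200 on success, 401 with an invalid key
print(response.json())       # weather data (or an error message) as JSON
```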
Public Record
Public records are another great way of gathering data. They can
be collected and shared by:
International organisations:
World Bank, United Nations, World Health Organization.
National statistical offices:
Censuses, surveys.
Governmental agencies:
Weather, environment, population.
Public Record
For example www.covidmaroc.ma
Application Programming Interface (API)
Python APIs
To use an API, you make a request to a remote web server, and retrieve the data you
need.
But why use an API instead of a static CSV dataset you can download from the web? APIs
are useful in the following cases:
The data is changing quickly. An example of this is stock price data.
You want a small piece of a much larger set of data. Reddit comments are one example.
There is repeated computation involved. Spotify has an API that can tell you the genre
of a piece of music.
Status codes are returned with every request that is made to a web server. Status codes
indicate information about what happened with a request. Here are some codes that are
relevant to GET requests:
200: Everything went okay, and the result has been returned (if any).
301: The server is redirecting you to a different endpoint.
400: The server thinks you made a bad request.
401: The server thinks you’re not authenticated.
403: The resource you’re trying to access is forbidden.
404: The resource you tried to access wasn’t found on the server.
503: The server is not ready to handle the request.
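Checking the status code before trusting a response is a good habit; a minimal sketch:

```python
import requests

response = requests.get("https://example.com/")

if response.status_code == 200:
    print("OK:", len(response.text), "characters received")
elif response.status_code == 404:
    print("Not found")
else:
    print("Unexpected status:", response.status_code)
```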
Requests in Python
The documentation for the API we’ll use tells us that the response we’ll get is in JSON
format. In the next section we’ll learn about JSON, but first let’s use the response.json()
method to see the data we received back from the API.
Example :
The Open Notify API, which gives access to data
about the international space station. It’s a
great API for learning because it has a very
simple design, and doesn’t require
authentication. We’ll teach you how to use an
API that requires authentication in a later
post.
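A minimal sketch of calling the Open Notify astronauts endpoint and decoding its JSON body (the endpoint path is taken from open-notify.org):

```python
import requests

# No API key needed: Open Notify is a free, unauthenticated service.
response = requests.get("http://api.open-notify.org/astros.json")
print(response.status_code)      # 200 if the service is up

data = response.json()           # decode the JSON body into Python objects
print(data["number"])            # how many people are in space right now
for person in data["people"]:
    print(person["name"], "-", person["craft"])
```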
JSON Data in Python
JSON (JavaScript Object Notation) is the language of APIs. JSON is a way to encode data
structures that ensures that they are easily readable by machines.
You can think of JSON as Python dictionaries, lists, strings, and numbers
encoded as a single string.
JSON Data in Python
Python has great JSON support with the json package. The json package is part of the
standard library, so we don’t have to install anything to use it. We can both convert lists
and dictionaries to JSON, and convert strings to lists and dictionaries.
The dumps() function is particularly useful, as we can use it to print a formatted string
that makes the JSON output easier to read:
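For example, round-tripping a dictionary through dumps() and loads():

```python
import json

astronauts = {"number": 2, "people": [{"name": "Ada", "craft": "ISS"}]}

# dumps(): Python object -> JSON string; indent makes it readable.
text = json.dumps(astronauts, indent=4)
print(text)

# loads(): JSON string -> Python object.
restored = json.loads(text)
print(restored["people"][0]["name"])  # Ada
```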
API with Query Parameters
In order to use request parameters with your API request, we need to add the params
argument to our request.
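With the requests library this means passing a dictionary as the params keyword argument; requests encodes it into the URL’s query string. A sketch using httpbin.org, a service that echoes back what it received:

```python
import requests

# httpbin.org/get simply echoes the request back, handy for experiments.
params = {"lat": 34.02, "lon": -6.83}
response = requests.get("https://httpbin.org/get", params=params)

print(response.url)             # full URL with the encoded query string
print(response.json()["args"])  # the parameters as the server saw them
```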
HTML Structure
child — a child is a tag inside another tag. So the two tags p and h1 above
are both children of the body tag.
parent — a parent is the tag another tag is inside. Above, the html tag is the
parent of the body tag.
sibling — a sibling is a tag that is nested inside the same parent as another
tag. For example, head and body are siblings, since they’re both inside
html. Both tags p and h1 are siblings, since they’re both inside body.
We can also add properties to HTML tags that change their behaviour.
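These relationships can be explored programmatically; a sketch using BeautifulSoup on a tiny illustrative document:

```python
from bs4 import BeautifulSoup

# A tiny document: p and h1 are children of body; head and body are siblings.
html = """
<html>
  <head><title>A page</title></head>
  <body>
    <h1>Heading</h1>
    <p>First paragraph.</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.body.parent.name)                # html  (body's parent)
print([c.name for c in soup.body.find_all(recursive=False)])  # direct children
print(soup.head.find_next_sibling().name)   # body  (head's sibling)
```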
The Requests Library
The first thing we’ll need to do to scrape a web page is to download the page.
We can download pages using the Python requests library.
To find the first instance of a tag in the parsed page, we can use BeautifulSoup’s find method.
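Putting the two steps together: download with requests, parse with BeautifulSoup, then use find to get the first matching tag (the URL is illustrative):

```python
import requests
from bs4 import BeautifulSoup

# Download the page; .content holds the raw HTML bytes.
page = requests.get("https://example.com/")
soup = BeautifulSoup(page.content, "html.parser")

# find() returns the first matching tag, or None if there is none.
first_paragraph = soup.find("p")
if first_paragraph is not None:
    print(first_paragraph.get_text())
```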
Searching for tags by class
You can use the find_all method to search for items by class or by id using the argument
class_ or id.
Example :
Let’s search for any tag that has the class outer-text:
Let’s search for any p tag that has the class outer-text:
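A sketch of both searches on an illustrative document (the class names mirror the examples above):

```python
from bs4 import BeautifulSoup

# An illustrative document with classes, as in the examples above.
html = """
<body>
  <p class="inner-text">First paragraph.</p>
  <p class="outer-text">Second paragraph.</p>
  <b class="outer-text">Bold text.</b>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# Any tag with the class outer-text (matches the p and the b tag):
print(soup.find_all(class_="outer-text"))

# Only p tags with the class outer-text:
print(soup.find_all("p", class_="outer-text"))
```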
Searching for tags by Id
The same thing can be done to search for elements by id, using the argument id
Example :
Let’s search for any tag that has the id first:
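A sketch on an illustrative document (ids are unique within a page, so at most one tag matches):

```python
from bs4 import BeautifulSoup

html = '<body><p id="first">First paragraph.</p><p>Second paragraph.</p></body>'
soup = BeautifulSoup(html, "html.parser")

# find_all with the id argument returns a list with at most one element.
print(soup.find_all(id="first"))
```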
Searching for Tags Using CSS Selectors
We can also search for items using CSS selectors. These selectors are how the CSS
language allows developers to specify HTML tags to style. Here are some examples:
p a — finds all a tags inside of a p tag.
body p a — finds all a tags inside of a p tag inside of a body tag.
html body — finds all body tags inside of an html tag.
p.outer-text — finds all p tags with a class of outer-text.
p#first — finds all p tags with an id of first.
body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.
Challenges of web scraping
Variety: every page is special.
Durability: websites change their structures.