Topic 02 - Data Collection
Introduction to Data Science Course
Data Collection
Le Ngoc Thanh
[email protected]
Department of Computer Science
Data Science Process
Collecting data
◎ General notes when collecting data
○ Is the data correct and sufficient to answer the question?
○ Garbage in → garbage out
○ Is it legitimate to collect this data? Does collecting it affect others?
◎ Ways to collect data
○ Data is already available within the company or organization: just use it
○ Data is available but out there (online); this is the scope of the course
◉ Pre-packaged data (CSV files, Excel files, ...): download it
◉ Data provided through the website's API: use the API
◉ Data is on the site but there is no API: parse the HTML
○ Data is not yet available: create it yourself, e.g. by conducting surveys, using sensor devices, ...
Ask Question
What is the current recruitment situation for Data Science in Vietnam?
○ Initially, the question is often broad and vague
○ Later on, we will return to this step several times to make the question clearer and more specific.
Collecting data: Planned
Q: Where to collect data?
A: On recruitment sites in Vietnam
Q: On each recruitment site, which keywords should be used to search for job postings?
A: “Khoa học dữ liệu”, “data science”, “data scientist”, ...
Q: On each recruitment site, after searching with a given keyword, how do I get the posting information?
A: For each posting, copy and paste the information into a file :(
Q: After collecting data from different sites, or from the same site with different keywords, how do you merge the data?
A: ...
Collecting data: Planned
Q: Where to collect data?
A: On recruitment sites in Vietnam
Q: On each recruitment site, which keywords should be used to search for job postings?
A: “Khoa học dữ liệu”, “data science”, “data scientist”, ...
Q: On each recruitment site, after searching with a given keyword, how do I get the posting information?
A: Write a program that automatically parses the HTML, extracts the required information, and writes it to a file :)
Q: After collecting data from different sites, or from the same site with different keywords, how do you merge the data?
A: ...
Contents
◎ Review Data Science Process
◎ Data Collection from Website
◎ Data Preprocessing
◎ Working with Dynamic Webpage
◎ API Data Collection
Collecting data from the CareerBuilder site with the keyword
"data scientist"
Retrieving and parsing the HTML of a website using Python
◎ Use the Requests-HTML library
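A minimal sketch of this step, assuming the third-party requests-html package is installed (pip install requests-html). The function name fetch_elements_text, the example URL, and the CSS selector in the usage comment are hypothetical; the right selector has to be found by inspecting the real page.

```python
def fetch_elements_text(url, css_selector):
    """Download a page and return the text of every element matching css_selector."""
    # Imported inside the function so the sketch can be loaded
    # even on a machine where requests-html is not installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    response = session.get(url)
    response.raise_for_status()  # stop early on HTTP errors (4xx/5xx)
    elements = response.html.find(css_selector)
    return [element.text for element in elements]


# Usage (hypothetical URL and selector, for illustration only):
# titles = fetch_elements_text("https://example.com/jobs", "a.job-title")
```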
Notes on Privacy and Copyright of Data
Note: avoid doing bad things
○ Check the website's "robots.txt" file to see which data may be collected and which may not
◉ For example: https://careerbuilder.vn/robots.txt
○ Do not send too many requests to the site in a short time (for example, let the program sleep briefly between requests)
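Both notes can be sketched with the standard library alone. The robots.txt content, the example.com URLs, and the helper names allowed and polite_fetch_all below are made up for illustration; a real crawler would download the site's own robots.txt and pass in its own fetch function.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of downloading a real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="*"):
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def polite_fetch_all(urls, fetch, delay_seconds=1.0):
    """Fetch only the allowed URLs, sleeping between consecutive requests."""
    results = []
    for url in urls:
        if not allowed(url):
            continue  # skip pages the site asks crawlers not to visit
        results.append(fetch(url))
        time.sleep(delay_seconds)  # give the server a short rest
    return results
```

Here fetch is any callable that takes a URL and returns its content, so the same loop works whether the page is retrieved with Requests-HTML or another library.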
Ask Question
How can we get answers?
◎ Go through the detailed content of each job posting, see which programming languages are required, and update the corresponding counters
◎ How do I write a program that does this automatically?
○ First, make a list of the programming languages to be counted
◉ Where to get this list?
○ Then, for the detailed content of each posting and for each language in the list, check whether the language appears in the content; if so, update the corresponding counter
◉ The string can be converted to a set of words, and then checked
◉ Example: 'Proficiency requirements in python, R.'
◉ → {'Proficiency', 'requirements', 'in', 'python', 'R'}
?
Content → Set of words
How to use Regular Expression
Example 1
s = 'An has a student ID number 1612345 and email
[email protected]\nHà has a student ID number 1654321 and email
[email protected]'
# Task: find the occurrences of the string 'hcmus' in s
import re
results = re.findall(r'hcmus', s)
# results: ['hcmus']
Note: r'hcmus' is a raw string. A normal string also works here, but in some cases it is more troublesome than a raw string, because backslashes must be escaped.
How to use Regular Expression
Example 2
s = 'An has a student ID number 1612345 and email
[email protected]\nHà has a student ID number 1654321 and email
[email protected]'
# Task: find the student IDs (7 digits) in s
import re
results = re.findall(r'\d{7}', s)
# Results: ['1612345', '1654321', '1654321']
# Cast to the set type to remove the duplicates
The pattern matches a string:
• made of digit characters (0 to 9)
• exactly 7 of them
How to use Regular Expression
Example 3
s = 'An has a student ID number 1612345 and email
[email protected]\nHà has a student ID number 1654321 and email
[email protected]'
# Task: find the email addresses in s
import re
results = re.findall(r'\w+@[\w.]+', s)
# Results:
# ['[email protected]', '[email protected]']
The pattern matches a string:
• starting with one or more word characters (letters, digits, underscore)
• then the character @
• then one or more characters from the set made of word characters and the character .
How to use Regular Expression
Example 4
s = 'Required to know c, c++, c#, r, python.'
# Task: find the words in s
import re
results = re.findall(r'[\w+#]+', s)
# Results:
# ['Required', 'to', 'know', 'c', 'c++', 'c#', 'r',
# 'python']
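The three patterns from Examples 2 to 4 can be tried together in one runnable script. The sample text below is hypothetical: the email addresses are invented placeholders at example.edu, not the ones from the slides.

```python
import re

# Hypothetical sample text (the addresses are made up for illustration).
s = ('An has student ID 1612345 and email an1612345@example.edu\n'
     'Ha has student ID 1654321 and email ha1654321@example.edu')

# Exactly 7 consecutive digits; each ID also appears inside an email address,
# so every ID is found twice.
ids = re.findall(r'\d{7}', s)
unique_ids = sorted(set(ids))  # casting to a set removes the duplicates

# One or more word characters, then @, then word characters or dots.
emails = re.findall(r'\w+@[\w.]+', s)

# Words that may contain + or #, so that 'c++' and 'c#' stay whole.
words = re.findall(r'[\w+#]+', 'Required to know c, c++, c#, r, python.')

print(unique_ids)
print(emails)
print(words)
```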
Content → set of words (using re), then count the number of occurrences of each language
Demo
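One possible shape for this demo, as a sketch rather than the code shown in class. The function names to_word_set and count_languages are hypothetical, and the language list is assumed to be given in lowercase because the content is lowercased before matching.

```python
import re


def to_word_set(content):
    """Lowercase the text and return its set of words (keeping '+' and '#')."""
    return set(re.findall(r'[\w+#]+', content.lower()))


def count_languages(postings, languages):
    """For each language, count how many postings mention it at least once."""
    counts = {language: 0 for language in languages}
    for content in postings:
        words = to_word_set(content)
        for language in languages:
            if language in words:
                counts[language] += 1
    return counts


postings = [
    'Proficiency requirements in python, R.',
    'Required to know c, c++, c#, r, python.',
]
print(count_languages(postings, ['python', 'r', 'c', 'c++', 'c#', 'java']))
```

Checking membership in a set of whole words, instead of searching for a substring, keeps 'c' from being counted in a posting that only mentions 'c++' and keeps 'r' from matching inside 'requirements'.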
What is the problem with JavaScript?
How to solve the problem of a website with JavaScript?
◎ According to the Requests-HTML documentation: “Full JavaScript support” :)
○ session = HTMLSession()
○ r = session.get('...')
○ r.html.render()
◎ The .render() method runs a headless browser (without an interface) to fetch the HTML content after the JavaScript has run, then replaces the existing content (before JavaScript) with this content (after JavaScript)
◎ The .render() method currently does not run in Jupyter Notebook because the two conflict with each other
◎ One workaround is to write the code in a *.py file and run it from PowerShell/cmd by typing:
○ python file-name.py
Selenium Library
Trying with Selenium?
How to use Selenium?
◎ Which Vietjet flight from Ho Chi Minh City to Da Nang has the cheapest price in the next 5 days (not including today)?
◎ Steps:
1. Use Selenium to open a web browser and go to https://www.vietjetair.com/Sites/Web/vi-VN/Home
2. Use Selenium to set the origin to "Ho Chi Minh City (SGN)" and the destination to "Da Nang (DAD)", select "One Way", set the departure date to tomorrow, then press the "Find flights" button
3. After the results page has loaded, use Selenium to obtain the HTML content, then hand it to Requests-HTML to handle the rest (parse the HTML and search for the data you need)
4. Repeat steps 1 to 3 with the next departure date, looping until all 5 days are covered
5. From the collected data, find the cheapest flight
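The loop in steps 4 and 5 can be sketched independently of the browser automation. In the sketch below, collect_flights is a hypothetical callable standing in for steps 1 to 3 (the Selenium and Requests-HTML part); it is assumed to return a list of dictionaries with a "price" key for one departure date.

```python
from datetime import date, timedelta


def next_days(today, number_of_days=5):
    """Return the next number_of_days dates, not including today."""
    return [today + timedelta(days=offset)
            for offset in range(1, number_of_days + 1)]


def cheapest_flight(today, collect_flights, number_of_days=5):
    """Collect flights for each upcoming day and return the cheapest one."""
    all_flights = []
    for departure_date in next_days(today, number_of_days):
        # collect_flights stands in for steps 1 to 3 (hypothetical helper).
        all_flights.extend(collect_flights(departure_date))
    if not all_flights:
        return None
    return min(all_flights, key=lambda flight: flight["price"])
```

Keeping the scraping behind a callable makes the date loop and the price comparison easy to check with fake data before a real browser is involved.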
Collecting data using Web APIs
◎ Some websites offer an API (Application Programming Interface) to make it easier for external apps to retrieve data
◎ Using the web API is "more official" than parsing HTML
○ It is the door that the "host" opens for "guests" to access the data → if the site has an API, use it first.
◎ You need to read the host's documentation to know what data is available, how to request it, …
◎ This is an (incomplete) list of sites providing APIs
○ https://github.com/public-apis/public-apis
○ Large sites like Twitter, Facebook, Google, ... often provide APIs
○ Some sites require registration to use the API (charges may apply)
Example: Get information about current weather in Ho Chi Minh
City
Parse HTML
JSON
“JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy
for humans to read and write. It is easy for machines to parse and generate. It is based
on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition
- December 1999. JSON is a text format that is completely language independent but
uses conventions that are familiar to programmers of the C-family of languages,
including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These
properties make JSON an ideal data-interchange language.”
Source: http://www.json.org/
Example: JSON
{"employees": [
  { "firstName":"John", "lastName":"Doe" },
  { "firstName":"Anna", "lastName":"Smith" },
  { "firstName":"Peter", "lastName":"Jones" }
]}

[
  { "id" : 1,
    "name" : "Hoa",
    "student": true,
    "email": null
  },
  { "id" : 2,
    "name" : "Mai",
    "student": true,
    "email": null
  }
]

File *.ipynb
{ "cells": [
  { "cell_type": "markdown",
    "metadata": {},
    "source": ["# Continue "]
  },
  ...
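JSON text such as the student list above maps directly onto Python lists and dictionaries. A short sketch using the standard json module, with the variable names chosen here for illustration:

```python
import json

# The student list from the example, written as JSON text.
text = """
[
  {"id": 1, "name": "Hoa", "student": true, "email": null},
  {"id": 2, "name": "Mai", "student": true, "email": null}
]
"""

students = json.loads(text)  # JSON array -> list, JSON object -> dict

names = [student["name"] for student in students]
first = students[0]

print(names)
print(first["student"], first["email"])  # JSON true -> True, JSON null -> None
```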
How to use a Web API in Python?
Q: How do I get the JSON content that the site returns through the API?
A: Use the Requests library
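A minimal sketch of that answer, assuming the third-party requests package is installed (pip install requests). The function name fetch_json is hypothetical, and the sample response body is invented to show the shape of the result; each real API defines its own URL, parameters, and fields in its documentation.

```python
import json


def fetch_json(url, params=None):
    """Send a GET request to an API endpoint and return the decoded JSON body."""
    # Imported inside the function so the rest of the sketch runs
    # even where requests is not installed.
    import requests

    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()  # raise an exception on HTTP errors
    return response.json()       # parse the JSON text into Python objects


# Hypothetical response body (made up; not taken from any real weather API).
sample_body = '{"city": "Ho Chi Minh City", "weather": {"temp_c": 31.5, "description": "cloudy"}}'

data = json.loads(sample_body)  # the same structure response.json() would return
temperature = data["weather"]["temp_c"]
print(data["city"], temperature)
```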
References
◎ Slides from Tran Trung Kien