Topic 02 - Data Collection

University of Science, VNU-HCM

Faculty of Information Technology

Introduction to Data Science Course

Data Collection

Le Ngoc Thanh
[email protected]
Department of Computer Science

Ho Chi Minh City


Contents
◎ Review Data Science Process
◎ Data Collection from Website
◎ Data Preprocessing
◎ Working with Dynamic Webpage
◎ API Data Collection

2
Data Science Process

◎ Ask the question to answer
◎ Collect data
◎ Data discovery & preprocessing, to obtain data that can be analyzed
◎ Data analysis (statistics, visualization, machine learning)
○ → answers (hypotheses) for the question
◎ Evaluation
◎ Decision making

From one step you will probably need to go back to previous steps to
readjust, and this back-and-forth may happen a number of times.
Required attitude: calm, patient.
Tools to know how to use: Python and its libraries, Jupyter Notebook.

3
Collecting data
◎ General notes when collecting data
○ Is the data correct and sufficient to answer the question?
○ Garbage in → garbage out
○ Is collecting such data valid? Does it affect others?
◎ Ways to collect data
○ Data is available in your company/organization: OK, use it
○ Data is available but out there (online) [scope of this course]:
◉ Pre-packaged data (CSV, Excel files, ...): download it
◉ Data provided through the website's API: use the API
◉ Data is on the site but there is no API: parse the HTML
○ The data is not yet available: create it yourself in ways such as
conducting surveys, using sensor devices, ...

4
Ask Question
What is the current recruitment situation for Data Science in Vietnam?
○ Initially, the question is often broad and vague
○ Later, we will come back to this step a number of times to adjust
the question to be clearer and more specific.

5
Collecting data: Planned
Q: Where to collect data?
A: On recruitment sites in Vietnam

Q: What are the recruitment sites in Vietnam?


A: Ask Google ...
A: → http://www.vietnamworks.com/, http://careerbuilder.vn/, ...

Q: For each job site, which keywords should we search with?
A: "Khoa học dữ liệu", "data science", "data scientist", ...

Q: For each job site, after searching with a certain keyword, how do we get the recruitment information?
A: On each posting, copy-paste the information into a file :(

Q: After you've got data from different job sites, or from the same site but with different
keywords, how do you merge these data?
A: ...

6
Collecting data: Planned
Q: Where to collect data?
A: On recruitment sites in Vietnam

Q: What are the recruitment sites in Vietnam?


A: Ask Google ...
A: → http://www.vietnamworks.com/, http://careerbuilder.vn/, ...

Q: For each job site, which keywords should we search with?
A: "Khoa học dữ liệu", "data science", "data scientist", ...

Q: For each job site, after searching with a certain keyword, how do we get the recruitment information?
A: Write a program that automatically parses the HTML, extracts the needed information, and writes it to a file :)

Q: After you've got data from different job sites, or from the same site but with different keywords, how
do you merge these data?
A: ...

7
Contents
◎ Review Data Science Process
◎ Data Collection from Website
◎ Data Preprocessing
◎ Working with Dynamic Webpage
◎ API Data Collection

8
Collecting data from the CareerBuilder site with the keyword
"data scientist"

9
Collecting data from the CareerBuilder site with the keyword
"data scientist"

◎ For each recruitment posting, extract the information:

○ Title
○ Employer
○ Locations
○ Salary
○ Posting date
○ Link to the detailed content
○ Detailed content
◎ Save to a CSV file (each posting is one line)

10
Collecting data from the CareerBuilder site with the keyword
"data scientist"

◎ For each recruitment posting, extract the information:

○ Title
○ Employer
○ Locations
○ Salary
○ Posting date
○ Link to the detailed content
○ Detailed content
◎ Save to a CSV file (each posting is one line)

The steps to take:
1. Get the website's HTML content
2. Parse the HTML to retrieve the needed data
3. Write the data to a CSV file

11
HTML Code of a Web page
◎ HTML code is composed of tags and has a tree structure with the <html> tag
as the root node
◎ Common tags:
○ <head>...</head>: contains meta information about the site
○ <body>...</body>: contains the content that the site will display
○ <h1>...</h1>: defines a Heading 1
○ <p>...</p>: defines a paragraph
○ ...
◎ Tags can have attributes that provide more information about the tag
○ <a href="https://www.google.com/" class="link">google link</a>: tag
containing a link
○ <h1 id="myHeader">my header</h1>
○ ...

12
Retrieving and parsing the HTML of a website using Python
◎ Use the library requests-HTML

◎ Install: in PowerShell / cmd, type

pip install requests-html
13
Basic usage of the requests-HTML library
(look up the documentation as needed; a short end-to-end sketch follows below)
◎ Import the library
○ from requests_html import HTMLSession
◎ Get the website's HTML code
○ session = HTMLSession()
○ r = session.get('web address')
○ # r contains all the data sent from the site's server, including the HTML of the website
◎ Parse the HTML and search for tags
○ tag = r.html.find(selectors, first=True)
○ # selectors are written as CSS selectors (for example, '#about' means find the
tag with the ID "about"); to work out the search criteria, use the inspect function of the web
browser
○ # first=True means return only the first tag found; first=False returns the list
of all found tags
○ # On a found tag, you can call .find(...) again to search inside that tag
◎ Retrieving tag elements
○ tag.html: the tag's HTML string
○ tag.text: the tag's text string
○ tag.attrs: dictionary containing the tag's attributes
14
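A minimal end-to-end sketch of the three steps (get HTML, parse, write CSV). The URL and all CSS selectors here are assumptions for illustration only, not CareerBuilder's real markup; find the real selectors with the browser's inspect function.

import csv
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://example.com/jobs?keyword=data+scientist')  # hypothetical URL

rows = []
for job in r.html.find('.job-item'):        # hypothetical selector for one posting
    title = job.find('.title', first=True)  # hypothetical selector for the title
    link = job.find('a', first=True)
    rows.append([title.text if title else '',
                 link.attrs.get('href', '') if link else ''])

# each posting becomes one line of the CSV file
with open('jobs.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'link'])
    writer.writerows(rows)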
Demo
◎ Do it by yourself

15
Note Privacy and Copyright about Data
Note: avoid doing bad things
○ Check the website's "robots.txt" file to see what data you are allowed
to collect and what data you are not
◉ For example: https://careerbuilder.vn/robots.txt
○ Do not send too many requests to the site in a short time
(for example, let the program sleep a little between requests;
see the sketch after this slide)

16
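A small sketch of "sleeping between requests"; the URLs and the 2-second delay are arbitrary examples, not a recommendation from the site.

import time
from requests_html import HTMLSession

session = HTMLSession()
urls = ['https://example.com/page1', 'https://example.com/page2']  # hypothetical URLs

for url in urls:
    r = session.get(url)
    # ... process r.html here ...
    time.sleep(2)  # pause so we do not flood the server with requests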
Note Privacy and Copyright about Data

◎ Check file “robots.txt” of the site (Example,


https://careerbuilder.vn/robots.txt)
◎ The following Python code can be used to check automatically
○ import urllib.robotparser
○ rp = urllib.robotparser.RobotFileParser()
○ rp.set_url('https://careerbuilder.vn/robots.txt')
○ rp.read()
○ rp.can_fetch('*', 'https://careerbuilder.vn/viec-lam/data-science-k-vi.html')
○ # The result will be True or False
17
Contents
◎ Review Data Science Process
◎ Data Collection from Website
◎ Data Preprocessing
◎ Working with Dynamic Webpage
◎ API Data Collection

18
Ask Question

What is the current recruitment situation for Data Science in Vietnam?

A more specific question:

What programming languages are often required in
DS recruitment in Vietnam now?

Assumption: in this demo we focus only on CareerBuilder with the keyword
"data scientist"

19
How can we get answers?
◎ Go through the detailed content of each posting, see which
programming languages are required, and update the
corresponding counting variables
◎ How do we create a program to do that automatically?
○ We need a list of the programming languages to be counted
◉ Where do we get this list?
○ Then, for the detailed content of each posting and for each language
in the list, check whether the language appears in the content; if so, update the
corresponding count variable
◉ The string can be converted to a set of words and then checked
◉ Example: 'Proficiency requirements in python, R.'
◉ → {'Proficiency', 'requirements', 'in', 'python', 'R'}
20
Content → set of words

◎ One way is to use regular expressions

◎ Regular expressions allow performing complex searches
on strings

21
How to use Regular Expression
Example 1
s = 'An has a student ID number 1612345 and email
[email protected]\nHà has a student ID number 1654321 and email
[email protected]'
# Request: find the string 'hcmus' in s
import re
results = re.findall(r'hcmus', s)
# results: ['hcmus']

Raw string (the r prefix): using ordinary strings also works, but in some
cases it is more troublesome than raw strings

22
How to use Regular Expression
Example 2
s = 'An has a student ID number 1612345 and email
[email protected]\nHà has a student ID number 1654321 and email
[email protected]'
# Request: find the student IDs (7 digits) in s
import re
results = re.findall(r'\d{7}', s)
# results: ['1612345', '1654321', '1654321']
# Can cast to the set type to remove the duplication (see below)
The pattern finds strings:
• consisting of digit characters (0 to 9)
• exactly 7 of them

23
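As the comment above suggests, casting the result list to a set removes the duplicate:

unique_ids = set(results)
# unique_ids: {'1612345', '1654321'}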
How to use Regular Expression
Example 3
s = 'An has a student ID number 1612345 and email
[email protected]\nHà has a student ID number 1654321 and email
[email protected]'
# Request: find the email addresses in s
import re
results = re.findall(r'\w+@[\w.]+', s)
# results:
# ['[email protected]', '[email protected]']
The pattern finds strings:
• with one or more word characters
• then the character @
• then one or more characters from the set consisting of word characters and '.'
24
How to use Regular Expression
Example 4
s = 'Required to know c, c++, c#, r, python.'
# Request: find the words in s
import re
results = re.findall(r'[\w+#]+', s)
# results:
# ['Required', 'to', 'know', 'c', 'c++', 'c#', 'r',
# 'python']

25
Using re: content → set of words,
then count the number of occurrences of the languages
Demo (a sketch follows) ...

26
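A rough sketch of this counting procedure. The language list and the sample posting contents below are made-up examples; real contents would come from the crawled detail pages.

import re

languages = ['python', 'r', 'c++', 'c#', 'java']        # list of languages to count
counts = dict.fromkeys(languages, 0)

postings = ['Proficiency requirements in python, R.',
            'Required to know c, c++, c#, r, python.']  # sample detailed contents

for content in postings:
    # content -> set of (lowercased) words, using the regex from Example 4
    words = set(w.lower() for w in re.findall(r'[\w+#]+', content))
    for lang in languages:
        if lang in words:
            counts[lang] += 1   # update the corresponding count variable

print(counts)  # {'python': 2, 'r': 2, 'c++': 1, 'c#': 1, 'java': 0}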
Contents
◎ Review Data Science Process
◎ Data Collection from Website
◎ Data Preprocessing
◎ Working with Dynamic Webpage
◎ API Data Collection

27
What is the problem with JavaScript?

◎ Example: get the string "Yay! Supports javascript" from
http://avi.im/stuff/js-or-no-js.html
◎ Using the inspect function of the web browser, you will see
the string in the element with the ID "intro-text"
◎ Use Requests-HTML to retrieve it ...
◎ The result is the string "No javascript support"
○ Cause: the HTML content obtained by Requests-HTML is the original
content sent from the server, before any JavaScript in it has run, whereas
the HTML content shown by the inspect function of the web browser on the
client is the HTML content after the JavaScript has run

28
How to solve the problem of a website with JavaScript?
◎ According to the Requests-HTML documentation: "Full JavaScript support" :)
○ session = HTMLSession()
○ r = session.get('...')
○ r.html.render()
◎ The .render() function runs a browser (without an interface) to
fetch the HTML content after the JavaScript has run, and then
replaces the existing (pre-JavaScript) content with this content
◎ The .render() function currently does not run inside Jupyter Notebook,
because the two clash with each other
◎ One way to run it is to write the code in a *.py file and run this file in
PowerShell/cmd by typing:
python file-name.py
29
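Putting this slide together, a minimal sketch might look like the following. The file name render_demo.py is just an example; note that on first use .render() downloads a headless Chromium browser.

# save as render_demo.py and run: python render_demo.py
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://avi.im/stuff/js-or-no-js.html')
print(r.html.find('#intro-text', first=True).text)  # "No javascript support"

r.html.render()  # headless browser runs the JavaScript; r.html is replaced
print(r.html.find('#intro-text', first=True).text)  # "Yay! Supports javascript"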
Selenium Library

◎ Rather than using the .render() method of Requests-HTML,
we can programmatically control a web browser and retrieve
the HTML content after the JavaScript has run.
◎ In Python, the Selenium library does exactly that
○ Selenium doesn't clash with Jupyter Notebook :)
○ Selenium allows the programmer to interact (fill in information, select,
check, push buttons, ...) with the web browser :) (Requests-HTML can't do this)
○ Selenium can handle everything from A to Z, but it usually runs faster if
Selenium does only what Requests-HTML cannot do, and the rest is left to
Requests-HTML (see the sketch after this slide)

30
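A minimal sketch of this division of labour, assuming Selenium 4 with Chrome and a matching chromedriver installed; the demo URL is the JavaScript test page from earlier.

from selenium import webdriver
from requests_html import HTML

driver = webdriver.Chrome()  # opens a Chrome window under program control
driver.get('http://avi.im/stuff/js-or-no-js.html')

# hand the post-JavaScript page source to Requests-HTML for parsing
html = HTML(html=driver.page_source)
print(html.find('#intro-text', first=True).text)  # "Yay! Supports javascript"

driver.quit()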
Trying with Selenium?

◎ Which Vietjet flight from Ho Chi Minh City to Da Nang has
the cheapest price in the next 5 days (not including today)?

31
How to use Selenium?
◎ Which Vietjet flight from Ho Chi Minh City to Da Nang has the
cheapest price in the next 5 days (not including today)?
◎ Steps (a code skeleton follows this slide):
1. Use Selenium to open the web browser at https://www.vietjetair.com/Sites/Web/vi-VN/Home
2. Use Selenium to choose the origin "Ho Chi Minh City (SGN)" and the destination
"Da Nang (DAD)", select "One way", select tomorrow as the departure date, then
press the "Find flights" button
3. After the results page has loaded, use Selenium to obtain the HTML content, then
hand it to Requests-HTML to handle the rest (parse the HTML
and search for the data you need)
4. Repeat steps 1 to 3 with each following travel date until all 5 days are covered
5. From the collected data, find the cheapest flight

32
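A skeleton of steps 1-3. Every element locator and result-page class below is hypothetical; the real ones must be found with the browser's inspect function on vietjetair.com.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from requests_html import HTML

driver = webdriver.Chrome()
driver.get('https://www.vietjetair.com/Sites/Web/vi-VN/Home')

# step 2: fill in the search form (all IDs here are hypothetical)
driver.find_element(By.ID, 'origin').send_keys('Ho Chi Minh City (SGN)')
driver.find_element(By.ID, 'destination').send_keys('Da Nang (DAD)')
driver.find_element(By.ID, 'one-way').click()
driver.find_element(By.ID, 'search-button').click()

time.sleep(10)  # crude wait for the results page; WebDriverWait is more robust

# step 3: hand the loaded HTML to Requests-HTML for parsing
html = HTML(html=driver.page_source)
prices = [tag.text for tag in html.find('.fare-price')]  # hypothetical class
driver.quit()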
Contents
◎ Review Data Science Process
◎ Data Collection from Website
◎ Data Preprocessing
◎ Working with Dynamic Webpage
◎ API Data Collection

33
Collecting data using Web APIs
◎ Some websites offer an API (Application Programming Interface)
to let external apps retrieve data more easily
◎ Using the web API is "more official" than parsing HTML
○ It is the path the "host" opens for "guests" to reach the data
→ If the site has an API, use it first.
◎ You need to read the host's documentation to know what data can be
taken, which paths to call, ...
◎ An (incomplete) list of sites providing APIs:
○ https://github.com/public-apis/public-apis
○ Large sites like Twitter, Facebook, Google, ... often provide APIs
○ Some sites require registration to use the API (charges may apply)

34
Example: Get information about the current weather in Ho Chi Minh City

Parse HTML

35
Example: Get information about the current weather in Ho Chi Minh City

Use the API: almost immediately receive the data :)

This is the XML (eXtensible Markup Language) format, which is similar to HTML:
- HTML is used to display data to viewers
- XML is used to represent data exchanged between computer applications
over a network
- XML is easier to parse than HTML

36
Example: Get information about the current weather in Ho Chi Minh City

Use the API: almost immediately receive the data :)

Another format used by APIs is JSON (JavaScript Object Notation)

• JSON is simpler and easier to parse than XML (though its representational
power is not equal to XML's)
• The simplicity of JSON is sufficient for many cases in practice →
JSON is more common than XML
• In this course, we will focus on JSON

37
Source: http://www.json.org/

JSON
“JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy
for humans to read and write. It is easy for machines to parse and generate. It is based
on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition
- December 1999. JSON is a text format that is completely language independent but
uses conventions that are familiar to programmers of the C-family of languages,
including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These
properties make JSON an ideal data-interchange language.

JSON is built on two structures:


◎ A collection of name/value pairs. In various languages, this is realized as an object,
record, struct, dictionary, hash table, keyed list, or associative array.
◎ An ordered list of values. In most languages, this is realized as an array, vector, list,
or sequence.”

38
Example: JSON

{"employees": [
  { "firstName": "John", "lastName": "Doe" },
  { "firstName": "Anna", "lastName": "Smith" },
  { "firstName": "Peter", "lastName": "Jones" }
]}

[
  { "id": 1,
    "name": "Hoa",
    "student": true,
    "email": null
  },
  { "id": 2,
    "name": "Mai",
    "student": true,
    "email": null
  }
]

File *.ipynb:
{ "cells": [
  { "cell_type": "markdown",
    "metadata": {},
    "source": ["# Continue "]
  },
  ...
39
How to use Web API in Python?

Q: How to get the JSON content that the site returns through the
API?
A: Use the Requests library

Q: How to parse JSON (converting from a JSON string to a Python data
structure)?
A: Use the json library
40
Requests Library
◎ It is by the same author as the Requests-HTML library
○ If you only need the site content: use Requests
○ If you need the site content + HTML parsing: use Requests-HTML
◎ It is installed along with Requests-HTML. Otherwise:
pip install requests
◎ Basic usage (a short sketch follows below):
○ import requests
○ r = requests.get('site path')
○ r.text # content string (HTML/XML/JSON) sent from the server
41
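For example, fetching a JSON response; the endpoint and its query parameter below are hypothetical, and the real URL and parameters come from the API provider's documentation.

import requests

# hypothetical weather endpoint, for illustration only
r = requests.get('https://api.example.com/weather?city=Ho+Chi+Minh+City')
print(r.text)  # the raw JSON (or XML) string sent from the server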
JSON Library
◎ It is a built-in library of Python
◎ Basic usage (a short sketch follows below):
○ import json
○ # JSON string → Python data structure (parse JSON):
○ json_pydata = json.loads(json_str)
○ # Python data structure → JSON string:
○ json_str = json.dumps(json_pydata)
○ # JSON file → Python data structure:
○ json_pydata = json.load(json_fileobj)
○ # Python data structure → JSON file:
○ json.dump(json_pydata, json_fileobj)

42
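A quick sketch using the "employees" example from the earlier JSON slide:

import json

json_str = '{"employees": [{"firstName": "John", "lastName": "Doe"}]}'
pydata = json.loads(json_str)               # JSON string -> Python dict
print(pydata['employees'][0]['firstName'])  # John

print(json.dumps(pydata))                   # Python dict -> JSON string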
43
References
◎ Slides from Tran Trung Kien

44
