Topic 02 - Data Collection

Topic2

Uploaded by

Sơn Nguyễn Kim


University of Science, VNU-HCM

Faculty of Information Technology

Introduction to Data Science Course

Data Collection

Le Ngoc Thanh
[email protected]
Department of Computer Science

Ho Chi Minh City

Contents
◎ Review Data Science Process
◎ Data Collection from Website
◎ Data Preprocessing
◎ Working with Dynamic Webpage
◎ API Data Collection

Data Science Process

◎ Pose the question to answer
◎ Collect data
◎ Explore and preprocess the data to obtain data that can be analyzed
◎ Analyze the data (statistics, visualization, machine learning)
○ → answers (hypotheses) to the question
◎ Evaluation
◎ Decision making

From any step you will probably need to go back to earlier steps to readjust, and you will probably go back and forth a number of times.
Required attitude: calm, intuitive.
Tools to know how to use: Python and its libraries, Jupyter Notebook.

Collecting data
◎ General notes when collecting data
○ Is the data correct and sufficient to answer the question?
○ Garbage in → garbage out
○ Is it legitimate to collect this data? Does collecting it affect others?
◎ Ways to collect data
○ The data is available within the company or organization: fine, use it
○ The data is available but out there (online); this case is the scope of the course
◉ Pre-packaged data (CSV files, Excel files, ...): download it
◉ Data provided through the website's API: use the API
◉ Data is on the site but there is no API: parse the HTML
○ The data is not yet available: create it yourself, for example by conducting surveys or using sensor devices, ...

Ask Question
What is the current recruitment situation for Data Science in Vietnam?
○ Initially, the question is often broad and vague
○ Later, you will come back to this step a number of times to make the question clearer and more specific.

Collecting data: Planned
Q: Where to collect data?
A: On recruitment sites in Vietnam

Q: What are the recruitment sites in Vietnam?
A: Ask Google ...
A: → http://www.vietnamworks.com/, http://careerbuilder.vn/, ...

Q: On each job site, which keywords should we search with?
A: “Khoa học dữ liệu” (Vietnamese for “data science”), “data science”, “data scientist”, ...

Q: On each job site, after searching with a certain keyword, how do we get the job posting information?
A (manual): For each posting, copy-paste the information into a file ☹
A (better): Write a program that automatically parses the HTML, extracts the needed information, and writes it to a file ☺

Q: After collecting data from different job sites, or from the same site with different keywords, how do we merge the data?
A: ...

Collecting data from the CareerBuilder site with the keyword "data scientist"

◎ For each job posting, extract the following information:
○ Title
○ Recruiter
○ Location
○ Salary
○ Date posted
○ Link to detailed content
○ Detailed content
◎ Save to a CSV file (one posting per line)

The steps taken?
1. Get the website's HTML content
2. Parse the HTML to retrieve the needed data
3. Write the data to a CSV file
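Step 3 (writing one posting per line to a CSV file) can be sketched with Python's built-in csv module. This is only an illustration: the column names and the two sample records are made up, and in practice each record would come from the parsed HTML.

```python
import csv
import io

# Column names follow the list above; the exact names are an assumption.
FIELDS = ["title", "recruiter", "location", "salary",
          "date_posted", "link", "content"]

def write_postings(postings, fileobj):
    """Write a header row, then one job posting per CSV row."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(postings)

# Hypothetical records standing in for scraped data.
postings = [
    {"title": "Data Scientist", "recruiter": "Company A",
     "location": "Ho Chi Minh City", "salary": "Negotiable",
     "date_posted": "2024-01-15", "link": "https://example.com/job/1",
     "content": "Proficiency in Python, R."},
    {"title": "Senior Data Scientist", "recruiter": "Company B",
     "location": "Ha Noi", "salary": "Negotiable",
     "date_posted": "2024-01-16", "link": "https://example.com/job/2",
     "content": "Required to know SQL, Python."},
]

buffer = io.StringIO()            # in-memory stand-in for a real file
write_postings(postings, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, pass the result of open("jobs.csv", "w", newline="", encoding="utf-8") as fileobj. The csv module quotes fields that contain commas, so free-text content stays in one column.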
HTML Code of a Web page
◎ HTML code is composed of tags organized as a tree, with the <html> tag as the root node
◎ Some common tags:
○ <head>...</head>: contains meta information about the page
○ <body>...</body>: contains the content that the page displays
○ <h1>...</h1>: defines a level-1 heading
○ <p>...</p>: defines a paragraph
○ ...
◎ Tags can have attributes that provide more information about the tag
○ <a href="https://www.google.com/" class="link">google link</a>: a tag containing a link
○ <h1 id="myHeader">my header</h1>
○ ...
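To make the tree structure and the attributes concrete, here is a small sketch that walks a made-up page with Python's built-in html.parser module and records each opening tag with its depth and attributes. The page content and the TagCollector class name are assumptions for illustration; the course itself uses requests-HTML for parsing.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect (depth, tag, attributes) for every opening tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((self.depth, tag, dict(attrs)))
        self.depth += 1          # children of this tag are one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1

# A made-up page using the tags described above.
page = """<html>
<head><title>Demo</title></head>
<body>
<h1 id="myHeader">my header</h1>
<p>See <a href="https://www.google.com/" class="link">google link</a></p>
</body>
</html>"""

parser = TagCollector()
parser.feed(page)
for depth, tag, attrs in parser.seen:
    print("  " * depth + tag, attrs)
```

The indentation of the printed output mirrors the tree: html is the root, head and body are its children, and the a tag sits inside p.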

Retrieving and parsing the HTML of a website using Python
◎ Use the requests-HTML library
◎ Install: in PowerShell / cmd, type
○ pip install requests-html
Basic use of the requests-HTML library
(look up the documentation as needed)
◎ Import the library
○ from requests_html import HTMLSession
◎ Get the website's HTML code
○ session = HTMLSession()
○ r = session.get('web address')
○ # r contains all the data sent from the site's server, including the HTML of the page
◎ Parse the HTML and search for tags
○ tag = r.html.find(selectors, first=True)
○ # selectors are written as CSS selectors (for example, '#about' finds the tag whose ID is about); to work out the search criteria, use the inspect function of the web browser
○ # first=True returns only the first tag found; first=False returns a list of all the tags found
○ # From a found tag, you can call .find(...) again to search inside that tag
◎ Retrieving a tag's elements
○ tag.html: the tag's HTML string
○ tag.text: the tag's text string
○ tag.attrs: a dictionary containing the tag's attributes
Demo
◎ Try it yourself

Notes on Data Privacy and Copyright
Note: avoid doing anything harmful
○ Check the website's "robots.txt" file to see which data may be collected and which may not
◉ For example: https://careerbuilder.vn/robots.txt
○ It is not advisable to send too many requests to the site in a short time (for example, the program can sleep a little between consecutive requests)
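The advice about sleeping between requests could be sketched as follows. The function name fetch_politely and its parameters are hypothetical; the fetch argument stands for whatever function actually downloads a page, and a stand-in is used here so the example runs without network access.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)     # give the server a short rest
        results.append(fetch(url))
    return results

# Usage with a stand-in fetch function (no real requests are sent).
pages = fetch_politely(
    ["https://example.com/page1", "https://example.com/page2"],
    fetch=lambda url: "<html>" + url + "</html>",
    delay_seconds=0.1,
)
print(len(pages))
```

A suitable delay depends on the site; a second or so between requests is a common starting point, but that figure is a rule of thumb rather than something the slides specify.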

Notes on Data Privacy and Copyright
◎ Check the site's "robots.txt" file (for example, https://careerbuilder.vn/robots.txt)
◎ The following Python code can be used to check automatically
○ import urllib.robotparser
○ rp = urllib.robotparser.RobotFileParser()
○ rp.set_url('https://careerbuilder.vn/robots.txt')
○ rp.read()
○ rp.can_fetch('*', 'https://careerbuilder.vn/viec-lam/data-science-k-vi.html')
○ # The result will be True or False
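The same check can be tried offline. The sketch below feeds a made-up robots.txt to the parser with parse() instead of downloading one with read(), and also reads the Crawl-delay value, which some sites use to say how long to wait between requests. The rules and the example.com URLs are assumptions for illustration only.

```python
import urllib.robotparser

# A made-up robots.txt, used so the example runs without network access.
robots_lines = """User-agent: *
Disallow: /private/
Crawl-delay: 2
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_lines)

allowed = rp.can_fetch("*", "https://example.com/jobs/data-science.html")
blocked = rp.can_fetch("*", "https://example.com/private/report.html")
delay = rp.crawl_delay("*")      # None when the file sets no Crawl-delay

print(allowed, blocked, delay)   # True False 2
```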
Ask Question
What is the current recruitment situation for Data Science in Vietnam?
A more specific question is
↓
Which programming languages are often required in Data Science job postings in Vietnam now?

Assumption: the demo focuses only on CareerBuilder with the keyword “data scientist”

How can we get answers?
◎ Go to the detailed content of each job posting, see which programming languages are required, and update the corresponding count variables
◎ How do we write a program to do that automatically?
○ We need a list of the programming languages to be counted
◉ Where do we get this list?
○ Then, for the detailed content of each posting and for each language in the list, check whether the language appears in the content; if so, update the corresponding count variable
◉ One way to check is to convert the string into a set of words first
◉ Example: 'Proficiency requirements in python, R.'
◉ → {'Proficiency', 'requirements', 'in', 'python', 'R'}
Content → Set of words?
◎ One way is to use regular expressions
◎ Regular expressions allow complex searches on strings

How to use Regular Expression
Example 1
s = 'An has a student ID number 1612345 and email [email protected]\nHà has a student ID number 1654321 and email [email protected]'
# Request: find the string 'hcmus' in s
import re
results = re.findall(r'hcmus', s)
# results: ['hcmus']

Raw string: the r prefix in r'hcmus' makes the pattern a raw string. A regular string also works, but in some cases (for example, patterns containing backslashes) it is more troublesome than a raw string.

How to use Regular Expression
Example 2
s = 'An has a student ID number 1612345 and email [email protected]\nHà has a student ID number 1654321 and email [email protected]'
# Request: find the student IDs (7 digits) in s
import re
results = re.findall(r'\d{7}', s)
# results: ['1612345', '1654321', '1654321']
# Can cast to the set type to remove the duplicates
The pattern finds strings:
• made of digit characters (0 to 9)
• that are 7 characters long

How to use Regular Expression
Example 3
s = 'An has a student ID number 1612345 and email [email protected]\nHà has a student ID number 1654321 and email [email protected]'
# Request: find the email addresses in s
import re
results = re.findall(r'\w+@[\w.]+', s)
# results:
# ['[email protected]', '[email protected]']
The pattern finds strings:
• starting with one or more word characters (letters, digits, or underscore)
• then the character @
• then one or more characters from the set made of word characters and the character .
How to use Regular Expression
Example 4
s = 'Required to know c, c++, c#, r, python.'
# Request: find the words in s
import re
results = re.findall(r'[\w+#]+', s)
# results:
# ['Required', 'to', 'know', 'c', 'c++', 'c#', 'r',
# 'python']
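The patterns from Examples 2 to 4 can be run together as below. The sample text is made up for this sketch (the addresses use the reserved example.com and example.edu domains), so the values differ from the slides, but the patterns are the same.

```python
import re

# Made-up sample text; the email addresses are placeholders.
s = ("An has student ID 1612345 and email an@example.com\n"
     "Ha has student ID 1654321 and email 1654321@student.example.edu")

ids = re.findall(r"\d{7}", s)             # runs of 7 consecutive digits
unique_ids = sorted(set(ids))             # a set removes the duplicate
emails = re.findall(r"\w+@[\w.]+", s)     # word chars, '@', then word chars or '.'
words = re.findall(r"[\w+#]+", "Required to know c, c++, c#, r, python.")

print(ids)         # ['1612345', '1654321', '1654321']
print(unique_ids)  # ['1612345', '1654321']
print(emails)      # ['an@example.com', '1654321@student.example.edu']
print(words)       # ['Required', 'to', 'know', 'c', 'c++', 'c#', 'r', 'python']
```

Note that the second ID appears twice because it also occurs inside an email address, which is why the slide suggests casting the result to a set.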

Content → set of words (using re)
and count the number of occurrences of the languages
Demo..
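One possible shape for this demo is sketched below. The language list and the three sample postings are assumptions (the slides leave open where the list comes from), and lowercasing the text before matching is a design choice of this sketch so that "Python" and "python" count as the same language.

```python
import re
from collections import Counter

# Hypothetical list of languages to count.
LANGUAGES = ["python", "r", "sql", "java", "scala", "c++", "c#"]

def words_in(text):
    """Turn a posting's content into a set of lowercase words."""
    return set(re.findall(r"[\w+#]+", text.lower()))

def count_languages(postings, languages=LANGUAGES):
    """Count in how many postings each language appears."""
    counts = Counter()
    for content in postings:
        words = words_in(content)
        for lang in languages:
            if lang in words:
                counts[lang] += 1
    return counts

# Made-up posting contents standing in for scraped data.
postings = [
    "Proficiency requirements in python, R.",
    "Required to know c, c++, c#, r, python.",
    "Experience with SQL and Python is a plus.",
]
counts = count_languages(postings)
print(counts.most_common())
```

Working on a set of words rather than searching the raw string avoids false matches such as finding "r" inside "requirements". Each posting contributes at most one count per language.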

What is the problem with JavaScript?
◎ Example: get the string “Yay! Supports javascript” from http://avi.im/stuff/js-or-no-js.html
◎ Using the inspect function of the web browser, you can see that the string is in the tag with ID “intro-text”
◎ Use Requests-HTML to retrieve ...
◎ The result is the string “No javascript support”
○ Cause: the HTML content obtained by Requests-HTML is the original content sent from the server. If this content contains JavaScript, the HTML shown by the browser's inspect function on the client is the content after the JavaScript has run

How to solve the problem of a website with JavaScript?
◎ According to the Requests-HTML documentation: “Full JavaScript support” ☺
○ session = HTMLSession()
○ r = session.get('...')
○ r.html.render()
◎ The .render() function runs a browser (without an interface) to fetch the HTML content after the JavaScript has run, and then replaces the existing content (before JavaScript) with this content (after JavaScript)
◎ The .render() function currently does not run in Jupyter Notebook because the two conflict with each other
◎ One way to run it is to write the code in a *.py file and run this file in PowerShell/cmd by typing:
○ python file-name.py
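Putting the pieces together, the "intro-text" example might be written as a small function like the one below. This is a sketch, not a tested recipe for that page: the function name get_intro_text and the optional session parameter are introduced here for illustration, and the first call to .render() may need to download a browser.

```python
def get_intro_text(url, session=None, render=True):
    """Return the text of the tag with ID 'intro-text', or None if absent."""
    if session is None:
        # Third-party import kept local: pip install requests-html
        from requests_html import HTMLSession
        session = HTMLSession()
    r = session.get(url)
    if render:
        # Runs a headless browser so the page's JavaScript is executed.
        r.html.render()
    tag = r.html.find("#intro-text", first=True)
    return tag.text if tag is not None else None
```

To try it, save the function in a .py file, add a line such as print(get_intro_text("http://avi.im/stuff/js-or-no-js.html")), and run the file from PowerShell/cmd. Passing render=False should reproduce the behaviour described on the previous slide, where only the original server content is seen.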
Selenium Library
◎ Rather than using the .render() method in Requests-HTML, we can programmatically control a web browser and retrieve the HTML content after the JavaScript has run.
◎ In Python, the Selenium library does that
○ Selenium doesn't clash with Jupyter Notebook ☺
○ Selenium lets the programmer interact with the web browser (fill in information, select, check, push buttons, ...) ☺ (Requests-HTML can't do this)
○ Selenium can do everything from A to Z, but things usually run faster if Selenium does only what Requests-HTML cannot do and leaves the rest to Requests-HTML

How to use Selenium?
◎ Which Vietjet flight from Ho Chi Minh City to Da Nang has the cheapest price in the next 5 days (not including today)?
◎ Steps:
1. Use Selenium to open a web browser and go to https://www.vietjetair.com/Sites/Web/vi-VN/Home
2. Use Selenium to choose the origin "Ho Chi Minh City (SGN)" and the destination "Da Nang (DAD)", select "One Way", set the departure date to tomorrow, then press the "Find flights" button
3. After the results page has loaded, use Selenium to obtain the HTML content, then hand it to Requests-HTML to handle the rest (parse the HTML and search for the data you need)
4. Repeat steps 1 to 3 with the next travel date, looping until all 5 days are covered
5. From the data collected, find the cheapest flight
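The steps above could be organized roughly as follows. This is a sketch under loud assumptions: the element IDs ("departure-date", "find-flights"), the ".flight-row" selector, and the date format are hypothetical and must be replaced with what the browser's inspect function shows on the real page. Only the two helper functions are exercised here, with made-up prices.

```python
from datetime import date, timedelta

def next_days(today, n=5):
    """The n days after today (today itself is excluded)."""
    return [today + timedelta(days=i) for i in range(1, n + 1)]

def fetch_results_html(url, travel_date):
    """Steps 1 to 3: drive a real browser and return the rendered HTML."""
    # Third-party imports kept local: pip install selenium
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # HYPOTHETICAL element IDs and selector: inspect the real page.
        driver.find_element(By.ID, "departure-date").send_keys(
            travel_date.strftime("%d/%m/%Y"))
        driver.find_element(By.ID, "find-flights").click()
        # Wait until the results have loaded before reading the HTML.
        WebDriverWait(driver, 20).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".flight-row")))
        return driver.page_source
    finally:
        driver.quit()

def cheapest(flights):
    """Step 5: the flight with the lowest price."""
    return min(flights, key=lambda flight: flight["price"])

# Made-up prices standing in for the parsed results of steps 3 and 4.
flights = [
    {"date": d, "price": p}
    for d, p in zip(next_days(date(2024, 1, 30)), [990, 750, 1200, 750, 880])
]
best = cheapest(flights)
print(best["date"], best["price"])
```

The HTML string returned by fetch_results_html can then be parsed with Requests-HTML, as step 3 suggests, so that Selenium is used only for the interaction it is needed for.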

Collecting data using Web APIs
◎ Some websites offer an API (Application Programming Interface) to make it easier for external apps to retrieve data
◎ Using the web API is "more official" than parsing HTML
○ It is the door that the "host" opens for "guests" to get the data → if the site has an API, use it first.
◎ You need to read the host's documentation to know what data can be taken, how to get it, …
◎ This is an (incomplete) list of sites providing APIs
○ https://github.com/public-apis/public-apis
○ Large sites like Twitter, Facebook, Google, ... often provide APIs
○ Some sites require registration to use the API (charges may apply)

Example: Get information about current weather in Ho Chi Minh City

◎ Parse HTML
◎ Use API: the data is received almost immediately ☺

The API response here is in XML (eXtensible Markup Language) format, which is similar to HTML
- HTML is used to display data to viewers
- XML is used to exchange data between computer applications over a network
- XML is easier to parse than HTML

Another format used by APIs is JSON (JavaScript Object Notation)
• JSON is simpler and easier to parse than XML (however, it is less expressive than XML)
• The simplicity of JSON is sufficient for many cases in practice → JSON is more common than XML
• In the course, we will focus on JSON
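To compare the two formats, the sketch below parses the same made-up weather reading once as XML and once as JSON, using only Python's standard library. The field names and values are invented for illustration and do not come from any particular weather API.

```python
import json
import xml.etree.ElementTree as ET

# Made-up responses carrying the same information in both formats.
xml_text = """<current>
  <city name="Ho Chi Minh City"/>
  <temperature value="31.5" unit="celsius"/>
</current>"""
json_text = """{"city": "Ho Chi Minh City",
 "temperature": {"value": 31.5, "unit": "celsius"}}"""

# XML: navigate elements and read attributes; values arrive as strings.
root = ET.fromstring(xml_text)
xml_city = root.find("city").get("name")
xml_temp = float(root.find("temperature").get("value"))

# JSON: the result is already made of Python dicts, strings, and numbers.
data = json.loads(json_text)
json_city = data["city"]
json_temp = data["temperature"]["value"]

print(xml_city, xml_temp)
print(json_city, json_temp)
```

The JSON version needs no type conversion because JSON distinguishes numbers from strings, which is part of why it is convenient in practice.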

Source: http://www.json.org/

JSON
“JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy
for humans to read and write. It is easy for machines to parse and generate. It is based
on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition
- December 1999. JSON is a text format that is completely language independent but
uses conventions that are familiar to programmers of the C-family of languages,
including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These
properties make JSON an ideal data-interchange language.

JSON is built on two structures:

◎ A collection of name/value pairs. In various languages, this is realized as an object,
record, struct, dictionary, hash table, keyed list, or associative array.
◎ An ordered list of values. In most languages, this is realized as an array, vector, list,
or sequence.”

Example: JSON

{"employees": [
  { "firstName":"John", "lastName":"Doe" },
  { "firstName":"Anna", "lastName":"Smith" },
  { "firstName":"Peter", "lastName":"Jones" }
]}

[
  { "id" : 1,
    "name" : "Hoa",
    "student": true,
    "email": null
  },
  { "id" : 2,
    "name" : "Mai",
    "student": true,
    "email": null
  }
]

File *.ipynb
{ "cells": [
  { "cell_type": "markdown",
    "metadata": {},
    "source": ["# Continue "]
  },
  ...
How to use Web API in Python?

Q: How do we get the JSON content that the site returns through the API?
A: Use the Requests library

Q: How do we parse JSON (convert a JSON string to a Python data structure)?
A: Use the json library
Requests Library
◎ It is by the same author as the Requests-HTML library
○ to only get the site content: use Requests
○ to get the site content and parse HTML: use Requests-HTML
◎ It is installed along with Requests-HTML. Otherwise:
○ pip install requests
◎ Basic usage:
○ import requests
○ r = requests.get('site path')
○ r.text # Content string (HTML/XML/JSON) sent from the server
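A slightly more defensive version of the basic usage might look like this. The function name fetch_text and the get parameter are hypothetical; get exists only so that the function can be tried with a stand-in instead of a real network call. The timeout value is an arbitrary choice.

```python
def fetch_text(url, params=None, get=None, timeout=10):
    """Return the body of the response as a string."""
    if get is None:
        # Third-party import kept local: pip install requests
        import requests
        get = requests.get
    r = get(url, params=params, timeout=timeout)
    r.raise_for_status()     # raise an exception on 4xx/5xx responses
    return r.text

# Stand-in for requests.get so the example runs without network access.
class FakeResponse:
    text = '{"ok": true}'

    def raise_for_status(self):
        pass

body = fetch_text("https://api.example.com/weather",
                  params={"q": "Ho Chi Minh City"},
                  get=lambda url, params=None, timeout=None: FakeResponse())
print(body)
```

With the real library, calling fetch_text(url) without the get argument sends an actual HTTP request; query parameters passed as params are added to the URL by Requests.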
JSON Library
◎ It is a built-in library of Python
◎ Basic usage:
○ import json
○ # JSON string → Python data structure (parse JSON):
○ json_pydata = json.loads(json_str)
○ # Python data structure → JSON string:
○ json_str = json.dumps(json_pydata)
○ # JSON file → Python data structure:
○ json_pydata = json.load(json_fileobj)
○ # Python data structure → JSON file:
○ json.dump(json_pydata, json_fileobj)
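The four calls can be seen working together in the round trip below. The data mirrors the student example from the earlier JSON slide; io.StringIO is used as an in-memory stand-in for a real file opened with open(...).

```python
import io
import json

students = [
    {"id": 1, "name": "Hoa", "student": True, "email": None},
    {"id": 2, "name": "Mai", "student": True, "email": None},
]

json_str = json.dumps(students)     # Python data structure -> JSON string
restored = json.loads(json_str)     # JSON string -> Python data structure

fileobj = io.StringIO()             # stand-in for open("students.json", "w")
json.dump(students, fileobj)        # Python data structure -> JSON file
fileobj.seek(0)                     # rewind before reading back
from_file = json.load(fileobj)      # JSON file -> Python data structure

print(json_str)
print(restored == students, from_file == students)
```

Note how Python's True and None become JSON's true and null in the string, and are converted back on loading.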

References
◎ Slides from Tran Trung Kien
