L2: Data Crawling and Preprocessing
Khoat Than
Le Minh Hoa, Nguyen Van Son
School of Information and Communication Technology
Hanoi University of Science and Technology
Content
[Figure: each sample becomes a numeric vector, e.g. (-0.0920, 3.4931, -1.8493, ..., -0.2010, -1.3079), and the vectors together form the dataset 𝒟]
How?
§ Data collection
• Sampling
• Method: crawling, logging, scraping
§ Data processing
• Noise filtering, cleaning, digitizing…
[Figure: data science methodology cycle: Business understanding → Analytic approach → Data requirements → Data collection → Data understanding → Data preparation → Modeling → Evaluation → Deployment → Feedback]
Data collection
Input – Problems to be solved
Output – Data samples
Fundamentals :: Sampling
¡ WHAT – Take a small set of samples to be representative of the whole data space
¡ WHY – can’t access the whole data space due to time and computational power limitations
¡ HOW – Collect samples from real life, or web data sources, databases…
“One or more small spoon(s) can be enough to assess whether the soup is good or not.”
https://www.coursera.org/learn/inferential-statistics-intro
Fundamentals :: Sampling :: How
¡ Variety – the sample set should be diverse enough to cover all contexts of the field/domain.
¡ Bias – data needs to be general, not biased towards a small part of the field.
“One or more small spoon(s) can be enough to assess whether the soup is good or not.”
Remember to stir to avoid tasting biases.
https://www.coursera.org/learn/inferential-statistics-intro
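The "stir before tasting" advice can be sketched in code. The example below is a minimal illustration, not a prescribed method: it compares simple random sampling with stratified sampling on a hypothetical, imbalanced population of articles. The function names, the 95/5 split, and the 10% fraction are all assumptions made for the illustration.

```python
import random
from collections import Counter, defaultdict

def simple_random_sample(items, k, seed=0):
    """Draw k items uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(items, k)

def stratified_sample(items, key, fraction, seed=0):
    """Sample the same fraction from every group so small groups stay represented."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = []
    for members in groups.values():
        # Keep at least one item per group, however small the group is.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 95 "sports" articles and 5 "science" articles.
population = [("sports", i) for i in range(95)] + [("science", i) for i in range(5)]

random_counts = Counter(topic for topic, _ in simple_random_sample(population, 10))
stratified_counts = Counter(
    topic for topic, _ in stratified_sample(population, key=lambda a: a[0], fraction=0.1)
)
print("simple random:", dict(random_counts))
print("stratified:   ", dict(stratified_counts))
```

With simple random sampling the rare topic may be missing from a small sample. The stratified version always keeps at least one item per group, which addresses variety but slightly over-represents tiny groups.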
Fundamentals :: Sampling :: How
¡ Variety – do the samples vary enough to reflect reality?
[Figure: 2016 US presidential election forecast compared with the actual results]
https://projects.fivethirtyeight.com/2016-election-forecast/
http://edition.cnn.com/election/results/president
https://www.coursera.org/learn/inferential-statistics-intro
Image credit: Wikipedia, FiveThirtyEight
Techniques
Input – Problem: classifying newspaper articles
Output – Sample data: newspaper articles and labels
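As a rough sketch of the scraping step, the example below extracts a title and body text from an article page using Python's standard `html.parser` module. The HTML snippet, the `ArticleParser` class, and the hand-assigned label are hypothetical. A real crawler would first download pages and would need rules specific to each news site.

```python
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Collect the text of <h1> (title) and <p> (body) elements."""

    def __init__(self):
        super().__init__()
        self._current = None   # tag we are currently inside, if relevant
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "h1":
            self.title += data
        elif self._current == "p":
            self.paragraphs[-1] += data

# Hypothetical page content; a real crawler would download this HTML first.
page = """
<html><body>
  <h1>Local team wins the final</h1>
  <p>The match ended 2-1 after extra time.</p>
  <p>Fans celebrated in the city centre.</p>
</body></html>
"""

parser = ArticleParser()
parser.feed(page)
sample = {
    "title": parser.title.strip(),
    "text": " ".join(p.strip() for p in parser.paragraphs),
    "label": "sports",  # label assigned by hand for illustration
}
print(sample)
```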
DEMO :: Steps
Input – Raw data samples (text, image, audio...)
Output – Preprocessed data for ML/AI model(s)
[Figure: each sample becomes a numeric vector, e.g. (-0.0920, 3.4931, -1.8493, ..., -0.2010, -1.3079), and the vectors together form the dataset 𝒟]
Fundamentals :: Data “rawness”
Completeness
§ Each collected sample should have all the required attributes/features
Integrity
§ Ensure the samples correctly reflect reality
§ Jan. 1 as everyone’s birthday? – intentional (systematic) noise
Homogeneity
§ Rating “1, 2, 3” & “A, B, C”; or Age = “42” & Birthday = “03/07/2010” (inconsistency)
Structures
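The checks above could be automated along the following lines. This is only a sketch under assumed conventions: the field names, the numeric rating scale, the reading of “03/07/2010” as 3 July 2010, and the fixed reference date are all hypothetical.

```python
from datetime import date

REQUIRED = ("name", "age", "birthday", "rating")  # hypothetical schema

def check_record(record, today):
    """Return a list of human-readable problems found in one record."""
    problems = []
    # Completeness: every required attribute must be present and non-empty.
    for field in REQUIRED:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    # Homogeneity: ratings should all use one scale (here: integers).
    rating = record.get("rating")
    if rating not in (None, "") and not isinstance(rating, int):
        problems.append(f"rating {rating!r} is not on the numeric scale")
    # Consistency: the stated age should match the birthday.
    age, birthday = record.get("age"), record.get("birthday")
    if isinstance(age, int) and isinstance(birthday, date):
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        expected = today.year - birthday.year - (0 if had_birthday else 1)
        if age != expected:
            problems.append(f"age {age} does not match birthday (expected {expected})")
    return problems

today = date(2024, 1, 15)  # fixed reference date so the example is reproducible
records = [
    {"name": "A", "age": 42, "birthday": date(2010, 7, 3), "rating": 3},
    {"name": "B", "age": 13, "birthday": date(2010, 7, 3), "rating": "A"},
    {"name": "C", "age": None, "birthday": date(1990, 1, 1), "rating": 5},
]
for r in records:
    print(r["name"], check_record(r, today))
```

Integrity problems such as a default "Jan. 1" birthday are harder to detect per record. They usually show up only in aggregate, for example as an implausible spike in the distribution of birthdays.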
Techniques
Cleaning
Integrating
Transforming
Techniques :: Cleaning
A1     A2     A3  A4      A5  A6     A7  A8   y
?      3.683  ?   -0.634  1   0.409  7   30   5
?      3.096  A   1.573   1   0.639  7   30   5
?      ?      A   0.249   0   0.089  ?   80   3
2.887  3.870  C   -1.347  ?   1.276  ?   60   5
2.731  3.945  D   1.967   1   2.487  ?   100  4
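One common way to clean a table like this is to fill each missing entry ("?") from the known values in the same column. The sketch below uses the column mean for numeric attributes and the most frequent value for categorical ones. This choice of strategy is an assumption for illustration, since the slides do not prescribe a specific one.

```python
from collections import Counter
from statistics import mean

MISSING = "?"
columns = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "y"]
rows = [
    ["?", "3.683", "?", "-0.634", "1", "0.409", "7", "30", "5"],
    ["?", "3.096", "A", "1.573", "1", "0.639", "7", "30", "5"],
    ["?", "?", "A", "0.249", "0", "0.089", "?", "80", "3"],
    ["2.887", "3.870", "C", "-1.347", "?", "1.276", "?", "60", "5"],
    ["2.731", "3.945", "D", "1.967", "1", "2.487", "?", "100", "4"],
]

def is_number(text):
    """True if the string can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def impute(rows):
    """Fill '?' with the column mean (numeric) or most frequent value (categorical)."""
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        known = [r[j] for r in rows if r[j] != MISSING]
        if not known:
            continue  # nothing to learn from; leave the column untouched
        if all(is_number(v) for v in known):
            fill = f"{mean(float(v) for v in known):.3f}"
        else:
            fill = Counter(known).most_common(1)[0][0]
        for r in filled:
            if r[j] == MISSING:
                r[j] = fill
    return filled

for name, *values in zip(columns, *impute(rows)):
    print(name, values)
```

Mean imputation is a simplification: for a binary column such as A5 it produces a fractional value, so a mode or a model-based fill may be more appropriate there. Columns with many missing entries (A1, A7) may be better dropped than filled.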
Techniques :: Cleaning (cont.)
¡ Data uniformity
→ Need normalization
Examples: texts in websites, emails, articles, tweets; 2D/3D images, videos + meta; spectrograms, DNAs, …
[Figure: example values: cat 0.28, human 0.17, car 0.08, ground 0.25, building 0.22]
… and standardize: $(x - \bar{x}) / s$
• Feature discretization: some attributes are more efficient when grouped.
• One-hot encoding: 1 = 10000, 3 = 00100, …
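The three transformations named above can be sketched in a few lines. The helper names are illustrative. The sketch assumes that s is the sample standard deviation, and the age bins are hypothetical.

```python
from statistics import mean, stdev

def standardize(values):
    """z-score: subtract the mean, divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1 if c == value else 0 for c in categories]

def discretize(value, edges, labels):
    """Map a number to the label of the first bin whose upper edge exceeds it."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]

z = standardize([30, 30, 80, 60, 100])   # column A8 of the cleaning table
print([round(v, 3) for v in z])
print(one_hot(1, [1, 2, 3, 4, 5]))       # 1 -> 10000
print(one_hot(3, [1, 2, 3, 4, 5]))       # 3 -> 00100
# Hypothetical age groups: <18 child, <65 adult, otherwise senior.
print(discretize(42, edges=[18, 65], labels=["child", "adult", "senior"]))
```

After standardization the values have mean 0 and standard deviation 1, which keeps attributes measured on different scales comparable.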
Techniques :: Transforming (cont.)
¡ Data reduction:
  ¡ Helps reduce the size of the data while preserving its core semantics.
  ¡ Helps speed up learning or knowledge discovery.
¡ Some strategies:
  ¡ Feature selection: redundant attributes or dimensions can be eliminated.
  ¡ Dimensionality reduction: use an algorithm (e.g. PCA, ICA, t-SNE) to transform the original data into a low-dimensional space.
  ¡ Abstraction: raw data values are replaced by abstract concepts.
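As one simple instance of feature selection, the sketch below drops attributes whose values never change, since they cannot help distinguish samples. The variance threshold, the function name, and the toy data are assumptions. Methods such as PCA need a numerical library and are not shown here.

```python
from statistics import pvariance

def select_features(rows, names, min_variance=1e-9):
    """Keep only the columns whose variance exceeds min_variance."""
    keep = [
        j for j in range(len(names))
        if pvariance([r[j] for r in rows]) > min_variance
    ]
    reduced = [[r[j] for j in keep] for r in rows]
    return reduced, [names[j] for j in keep]

# Hypothetical data: the second feature is constant and carries no information.
names = ["height", "unit_flag", "weight"]
rows = [[1.70, 1, 65.0], [1.82, 1, 80.5], [1.65, 1, 58.2]]
reduced, kept = select_features(rows, names)
print(kept)
print(reduced)
```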
Techniques :: Transforming
example & demo
Input – Raw data samples: JSON text
Output – Numeric data for each ML/AI model(s)
DEMO :: Steps
Data Input → Tokenize → Dictionary → tf-idf vector
DEMO :: Exercise
§ Exercise: Calculate the vector representation of text for a small dataset.
§ Requirements:
• Use a word segmentation (tokenizer) module.
• Build a dictionary from 2 documents.
• Use a stopword list to filter out stopwords.
• Convert the 2 documents into 2 tf-idf vectors.
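One possible shape for a solution is sketched below. It is not the reference answer: the regex tokenizer stands in for a real word segmentation module, the stopword list and the two documents are made up, and the weighting uses the plain variant tf = count / document length, idf = log(N / df). Other tf-idf variants, for example with smoothing, give different numbers.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "is", "a", "on", "and"}  # hypothetical, tiny stopword list

def tokenize(text):
    """Lowercase, split on non-letters, drop stopwords (stand-in for a word segmenter)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def tfidf_vectors(docs):
    """Return the sorted dictionary and one tf-idf vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})  # the dictionary
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors

docs = ["The cat sat on the mat.", "The dog sat on the log."]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
for v in vectors:
    print([round(x, 3) for x in v])
```

With this unsmoothed idf, a term that occurs in both documents gets weight 0, because it does not help tell the two documents apart.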
Summary
(Take-home messages)