
Introduction to Machine Learning and Data Mining
(Học máy và Khai phá dữ liệu)

Khoat Than
Le Minh Hoa, Nguyen Van Son
School of Information and Communication Technology
Hanoi University of Science and Technology
Content

¡ Introduction to Machine Learning & Data Mining

¡ Data crawling and pre-processing


¡ Supervised learning
¡ Unsupervised learning
¡ Practical advice
Time spent on data science tasks

§ What do data scientists spend the most time doing?
• Collecting data: 19%
• Cleaning and organizing data: 60%
• Building training datasets: 3%
• Data mining: 9%
• Refining algorithms: 4%
• Others: 5%
Why?

¡ Why preprocess the data?
• Convenient for storing and querying the data
• Machine learning models often work with structured data: matrices, vectors, arrays, etc.
• Machine learning usually works well if there is a suitable representation of the data
Input: problems to be solved → Output: numeric data (matrices, vectors)

x^(i) = [-0.0920, 3.4931, -1.8493, ..., -0.2010, -1.3079]^T,   𝒟 = {x^(1), x^(2), ..., x^(N)}
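To make this concrete, here is a minimal sketch (using NumPy, which the slides do not prescribe) of a dataset 𝒟 stored as a matrix whose rows are feature vectors; the second row's values are made up for illustration:

```python
import numpy as np

# Each row is one feature vector x^(i); the whole matrix plays the role of D.
D = np.array([
    [-0.0920, 3.4931, -1.8493, -0.2010, -1.3079],   # x^(1), values from the slide
    [ 0.5412, -0.7718,  2.0043,  1.1265,  0.3308],  # x^(2), made-up values
])

print(D.shape)   # (2, 5): 2 samples, 5 features each
```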
How?

§ Data collection
• Sampling
• Method: crawling, logging, scraping
§ Data processing
• Noise filtering, cleaning, digitizing…
[Figure: the data science methodology cycle – Business understanding → Analytic approach → Data requirements → Data collection → Data understanding → Data preparation → Modeling → Evaluation → Deployment → Feedback]
Data collection

Input: problems to be solved → Output: data samples
Fundamentals :: Sampling

¡ WHAT – take a small set of samples that is representative of the whole data space. (“One or more small spoons can be enough to assess whether the soup is good or not.”)
¡ WHY – we cannot access the whole data space, due to limits on time and computational power.
¡ HOW – collect samples from real life, from web data sources, from databases, …

https://www.coursera.org/learn/inferential-statistics-intro
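As a small illustration (not part of the lecture demo), a simple random sample can be drawn with Python's standard library; the population here is a hypothetical list of record identifiers:

```python
import random

# Hypothetical population: identifiers of all records in a large data source.
population = list(range(1_000_000))

random.seed(42)                              # make the draw reproducible
sample = random.sample(population, 1_000)    # simple random sample, no replacement

print(len(sample), sample[:5])
```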
Fundamentals :: Sampling :: How

¡ Variety – the sample set should be diverse enough to cover all contexts of the field/domain. (“One or more small spoons can be enough to assess whether the soup is good or not. Remember to stir to avoid tasting biases.”)
¡ Bias – the data needs to be general, not biased towards a small part of the field.

https://www.coursera.org/learn/inferential-statistics-intro
Fundamentals :: Sampling :: How
¡ Variety – do the samples vary enough to reflect reality? (Example: the 2016 US election forecasts vs. the actual results.)

https://projects.fivethirtyeight.com/2016-election-forecast/
http://edition.cnn.com/election/results/president
https://www.coursera.org/learn/inferential-statistics-intro
Image credit: Wikipedia, FiveThirtyEight
Techniques

§ Crowd-sourcing: conduct surveys
§ Logging: record user interaction history, product access, ...
§ Scraping: search for data sources on websites; download, extract, and filter the data
§ Synthesize data with a generative model?

Techniques :: Scraping :: DEMO

§ Objective: data for a text classification problem – newspaper articles.
§ DEMO: a newspaper crawling system
DEMO

Input: problem – classifying newspaper articles → Output: sample data – newspaper articles and their labels
DEMO :: Steps

RSS → Item → Content
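A hedged sketch of the RSS → Item → Content steps using requests, the standard-library XML parser, and BeautifulSoup; the feed URL below is a placeholder, and the actual lecture demo may be implemented differently:

```python
import requests
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

FEED_URL = "https://example.com/rss/news.rss"   # hypothetical RSS feed URL

# Step 1: RSS – download the feed and parse its XML.
rss_xml = requests.get(FEED_URL, timeout=10).content
root = ET.fromstring(rss_xml)

# Step 2: Item – every <item> element carries a title and a link to the article.
for item in root.findall(".//item")[:5]:
    title = item.findtext("title")
    link = item.findtext("link")

    # Step 3: Content – fetch the article page and strip the HTML tags.
    html = requests.get(link, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

    print(title, "->", len(text), "characters")
```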
DEMO :: Sample
Data preprocessing

Input: raw data samples (text, image, audio, ...) → Output: preprocessed data for ML/AI model(s)

x^(i) = [-0.0920, 3.4931, -1.8493, ..., -0.2010, -1.3079]^T,   𝒟 = {x^(1), x^(2), ..., x^(N)}
Fundamentals :: Data “rawness”
Completeness (đầy đủ)
§ Each collected sample should have all the required attributes/features.

Integrity (trung thực)
§ Ensure the samples correctly reflect reality.
§ Jan. 1 recorded as everyone’s birthday? – intentional (systematic) noise.

Homogeneity (đồng nhất)
§ Rating “1, 2, 3” & “A, B, C”; or Age = “42” & Birthday = “03/07/2010” (inconsistency).

Structures (cấu trúc)
§ Heterogeneous data sources / schemas.

Techniques

Cleaning
Integrating
Transforming
Techniques :: Cleaning

¡ Completeness and integrity
• Data samples should be collected from reliable sources and should reflect the problem to be solved.
• Eliminate noise (outliers): remove data samples that are significantly different from the other samples? (See the sketch after this list.)
• A data sample may have empty (missing, incomplete) fields. A suitable strategy is needed:
• Remove the sample?
• Fill a value into the missing fields of the sample?
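One common way to flag outliers, sketched below with NumPy on synthetic data (the threshold of three standard deviations is an assumption, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))      # synthetic data: 1000 samples, 3 features
X[0] = [15.0, -12.0, 20.0]          # inject one obvious outlier

# z-score of every value; keep only rows whose features all lie within 3 std devs
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
X_clean = X[(z < 3).all(axis=1)]

print(X.shape, "->", X_clean.shape)
```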
Techniques :: Cleaning

¡ Strategies to fill in a missing value (see the sketch after the table below):
1. Fill in the missing value manually
2. Use a global constant
3. Use an “average” value
4. Use the average value over all samples belonging to the same class/group
5. Use the most probable value (regression, Bayesian inference)

A1 A2 A3 A4 A5 A6 A7 A8 y
? 3.683 ? -0.634 1 0.409 7 30 5
? 3.096 A 1.573 1 0.639 7 30 5
? ? A 0.249 0 0.089 ? 80 3
2.887 3.870 C -1.347 ? 1.276 ? 60 5
2.731 3.945 D 1.967 1 2.487 ? 100 4
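A sketch of strategies 2–4 with pandas (one possible tool, not prescribed by the slides), on a toy table shaped like the one above, where NaN marks the “?” cells:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A1": [np.nan, np.nan, np.nan, 2.887, 2.731],
    "A2": [3.683, 3.096, np.nan, 3.870, 3.945],
    "A4": [-0.634, 1.573, 0.249, -1.347, 1.967],
    "y":  [5, 5, 3, 5, 4],                       # class label
})

filled_const = df.fillna(0.0)                                 # 2. a global constant
filled_mean  = df.fillna(df.mean(numeric_only=True))          # 3. the column "average"
filled_group = df.fillna(df.groupby("y").transform("mean"))   # 4. the class/group mean

print(filled_group)
```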
Techniques :: Cleaning (cont.)
¡ Data uniformity
Different data representations, units of measure, metrics, etc.
→ Normalization is needed (see the sketch below).

Examples:
Rating “1, 2, 3” & “A, B, C”;
Age = 42 & Birthday = 03/08/2020
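A small sketch of this kind of normalization with pandas; the column names, the letter-to-number mapping, and the date format are assumptions made for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "rating":   ["1", "2", "A", "C", "3"],
    "birthday": ["03/08/2020", "15/06/1982", "01/01/1990", "22/11/2001", "30/04/1975"],
})

# Unify the two rating scales onto one numeric scale (assumed mapping A=1, B=2, C=3).
df["rating"] = pd.to_numeric(df["rating"].replace({"A": 1, "B": 2, "C": 3}))

# Replace the free-form birthday with a single derived attribute: approximate age in years.
born = pd.to_datetime(df["birthday"], format="%d/%m/%Y")
df["age"] = (pd.Timestamp("today") - born).dt.days // 365

print(df[["rating", "age"]])
```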
Techniques :: Integrating w/ some Transforming
Un-structured data: texts from websites, emails, articles, tweets; 2D/3D images and videos with metadata; spectrograms, DNA sequences, …

image credits: wikipedia, shutterstock, CNN
Techniques :: Transforming
Semantics?
Extract semantic features, normalize
Semantics example: visual data
Low-level semantics (raw pixels) vs. mid-/high-level semantics (e.g., human-interpretable features):
- class scores: cat 0.28, human 0.17, car 0.08, ground 0.25, building 0.22
- relations: cat → not on → car; people ← behind ← building; car → is → red

The minimum semantic level needed depends on the task, e.g.:
- Text classification
- Sentiment (emotional) analysis
- AI chatbot (various semantic levels)

Image credits: CS231n, Stanford University; Lee et al., 2009; Socher et al., 2011
Techniques :: Transforming (cont.)
¡ Objective: extract semantic features.

• For a specific field and type of data (text, images, ...): use suitable methods for extracting semantic features … and standardize them.

• Feature discretization (rời rạc hoá): some attributes are more effective when their values are grouped (binned).

• Feature normalization (chuẩn hóa): normalize attribute values to the same domain (see the sketch below).
One-hot encoding: 1 → [1 0 0 0 0], 3 → [0 0 1 0 0]
Normalization (z-score): z = (x − x̄) / s
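A minimal NumPy sketch of both operations on this slide: one-hot encoding a category out of five and z-score normalization z = (x − x̄)/s (the sample values are made up):

```python
import numpy as np

# One-hot encoding: category i out of 5 becomes a 5-dimensional indicator vector.
def one_hot(value, num_categories=5):
    vec = np.zeros(num_categories)
    vec[value - 1] = 1            # assumes categories are numbered 1..num_categories
    return vec

print(one_hot(1))                 # [1. 0. 0. 0. 0.]
print(one_hot(3))                 # [0. 0. 1. 0. 0.]

# Feature normalization (z-score): subtract the mean, divide by the standard deviation.
x = np.array([30.0, 30.0, 80.0, 60.0, 100.0])
z = (x - x.mean()) / x.std()
print(z.round(3))
```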
Techniques :: Transforming (cont.)
¡ Data reduction:
¡ Helps reduce the size of the data while preserving its core semantics.
¡ Helps speed up the process of learning or knowledge discovery.

¡ Some strategies:
¡ Feature selection: redundant attributes or dimensions can be eliminated.
¡ Dimensionality reduction: use algorithms (e.g., PCA, ICA, t-SNE) to transform the original data into a low-dimensional space (see the sketch below).
¡ Abstraction: raw data values are replaced by abstract concepts.
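For illustration only, a sketch of dimensionality reduction with scikit-learn's PCA (one of the algorithms named above), applied to random synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # 200 samples in a 50-dimensional space

pca = PCA(n_components=2)               # keep the 2 directions of largest variance
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)        # (200, 50) -> (200, 2)
print(pca.explained_variance_ratio_)    # fraction of variance kept by each component
```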

Techniques :: Transforming
example & demo

Transforming text data


DEMO

Input: raw data samples (JSON text) → Output: numeric data for each ML/AI model
DEMO :: Steps

Data input → Tokenize → Dictionary → tf-idf vector
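A hedged sketch of the same pipeline with scikit-learn's TfidfVectorizer; the lecture demo may use a different library, and the two toy documents and the stopword list below are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents standing in for the crawled newspaper articles.
docs = [
    "the team won the football match yesterday",
    "the new phone was announced at the technology event",
]

# Tokenize, build the dictionary, filter stopwords, and compute tf-idf in one step.
vectorizer = TfidfVectorizer(stop_words=["the", "at", "was"])
X = vectorizer.fit_transform(docs)       # sparse matrix: 2 documents x vocabulary size

print(vectorizer.get_feature_names_out())
print(X.toarray().round(3))
```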
DEMO :: Exercise
§ Exercise: compute the vector representation of texts on a small dataset.

§ Data: 2 articles from Dantri.com.vn

§ Requirements:
§ Use a word segmentation (tokenizer) module.
§ Build a dictionary from the 2 documents.
§ Use a list of stopwords to filter out stopwords.
§ Convert the 2 documents into 2 tf-idf vectors.
Summary
(Take-home messages)

§ Before entering a machine learning system, the data of a field must be collected and represented in a structured form with some desirable characteristics: completeness, integrity, homogeneity, and a well-defined structure.

§ The data collected for the learning process is a small set, but it should reflect all aspects of the problem to be solved.

§ Raw data, after collection and preprocessing, must retain the full range of semantic features – the features that affect problem solving.
