L2 Data Crawling Preprocessinge

Uploaded by

trinhtrinh5923

Introduction to Machine Learning and Data Mining

(Học máy và Khai phá dữ liệu)

Khoat Than
Le Minh Hoa, Nguyen Van Son
School of Information and Communication Technology
Hanoi University of Science and Technology
Content
• Introduction to Machine Learning & Data Mining
• Data crawling and pre-processing
• Supervised learning
• Unsupervised learning
• Practical advice
Time spent on data science tasks
• What do data scientists spend the most time doing?
  • Collecting data: 19%
  • Cleaning and organizing data: 60%
  • Building training datasets: 3%
  • Data mining: 9%
  • Refining algorithms: 4%
  • Others: 5%
Why?
• Why preprocess the data?
  • Preprocessed data is convenient to store and query
  • Machine learning models often work with structured data: matrices, vectors, arrays, etc.
  • Machine learning usually works well if there is a suitable representation of the data
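Input: Problems to be solved
Output: Numeric data (matrix, vector)

$$x_n = \begin{pmatrix} -0.0920 \\ 3.4931 \\ -1.8493 \\ \vdots \\ -0.2010 \\ -1.3079 \end{pmatrix}, \qquad \mathcal{D} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$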
How?
• Data collection
  • Sampling
  • Methods: crawling, logging, scraping
• Data processing
  • Noise filtering, cleaning, digitizing…
[Figure: data science methodology cycle: Business understanding → Analytic approach → Data requirements → Data collection → Data understanding → Data preparation → Modeling → Evaluation → Deployment → Feedback]
Data collection
Input: Problems to be solved
Output: Data samples
Fundamentals :: Sampling
• WHAT – Take a small set of samples to be representative of the whole data space
• WHY – can’t access the whole data space due to time and computational power limitations
• HOW – Collect samples from real life, or web data sources, databases…
“One or more small spoon(s) can be enough to assess whether the soup is good or not.”
https://www.coursera.org/learn/inferential-statistics-intro
Fundamentals :: Sampling :: How
• Variety – the sample set should be diverse enough to cover all contexts of the field/domain.
• Bias – data needs to be general, not biased towards a small part of the field.
“One or more small spoon(s) can be enough to assess whether the soup is good or not.” Remember to stir to avoid tasting biases.
https://www.coursera.org/learn/inferential-statistics-intro
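The "stir before tasting" advice can be illustrated with a small Python sketch of stratified sampling: the same fraction is drawn from every group, so a small group is not missed by chance. The function name `stratified_sample`, the `topic` field, and the toy records are assumptions made for illustration; the slides do not prescribe a particular sampling routine.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from every group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)

    sample = []
    for members in groups.values():
        # Keep at least one record per group so that no group disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical toy population: 90 sports articles and 10 science articles.
population = [{"id": i, "topic": "sports"} for i in range(90)]
population += [{"id": i, "topic": "science"} for i in range(90, 100)]

sample = stratified_sample(population, key=lambda r: r["topic"], fraction=0.1)
counts = defaultdict(int)
for record in sample:
    counts[record["topic"]] += 1
print(dict(counts))  # {'sports': 9, 'science': 1}
```

A plain `random.sample` over the whole population could return no science article at all; drawing per group keeps the sample's proportions close to those of the population.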
Fundamentals :: Sampling :: How
• Variety – do the samples vary enough to reflect reality?
[Figure: election forecast vs. actual results]
https://projects.fivethirtyeight.com/2016-election-forecast/
https://www.coursera.org/learn/inferential-statistics-intro
http://edition.cnn.com/election/results/president
Image credit: Wikipedia, FiveThirtyEight
Techniques
• Crowd-sourcing: conduct surveys
• Logging: record user interaction history, product access...
• Scraping: search data sources on websites, then download, extract, and filter data
• Synthesize by a generative model?
Techniques :: Scraping :: DEMO
• Objective: collect data for a text classification problem (newspaper articles).
• DEMO: Newspaper crawling system
DEMO
Input: Problem: classifying newspaper articles
Output: Sample data: newspaper articles and labels
DEMO :: Steps
RSS → Item → Content
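A minimal, offline Python sketch of the RSS → Item → Content steps, using only the standard library. The feed text, the example.com links, the helper names (`parse_items`, `extract_content`), and the idea of taking the label from the feed's topic are all assumptions for illustration; the demo system itself is not shown in the slides. A real crawler would download the feed and each linked page over HTTP instead of using inline strings.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical feed; a real crawler would download this XML from a newspaper's RSS URL.
RSS_XML = """<rss version="2.0">
  <channel>
    <title>Example News - Sports</title>
    <item>
      <title>Team wins final</title>
      <link>https://example.com/sports/1</link>
    </item>
    <item>
      <title>Coach resigns</title>
      <link>https://example.com/sports/2</link>
    </item>
  </channel>
</rss>"""


def parse_items(rss_xml, label):
    """Step RSS -> Item: list the articles announced by one feed."""
    root = ET.fromstring(rss_xml)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link"), "label": label}
        for item in root.iter("item")
    ]


class ParagraphText(HTMLParser):
    """Step Item -> Content: keep only the text found inside <p> tags."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and data.strip():
            self.chunks.append(data.strip())


def extract_content(html):
    parser = ParagraphText()
    parser.feed(html)
    return " ".join(parser.chunks)


items = parse_items(RSS_XML, label="sports")
# Stand-in for the page that would be downloaded from items[0]["link"].
page = "<html><body><h1>Team wins final</h1><p>The home team won.</p><p>Fans celebrated.</p></body></html>"
content = extract_content(page)
print(items[0]["title"], "|", content)
```

Each resulting (content, label) pair is one training sample for the text classification problem.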

DEMO :: Sample
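Data preprocessing
Input: Raw data samples (text, image, audio...)
Output: Preprocessed data for ML/AI model(s)

$$x_n = \begin{pmatrix} -0.0920 \\ 3.4931 \\ -1.8493 \\ \vdots \\ -0.2010 \\ -1.3079 \end{pmatrix}, \qquad \mathcal{D} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$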
Fundamentals :: Data “rawness”
• Completeness (đầy đủ)
  • Each collected sample should have all the required attributes/features
• Integrity (trung thực)
  • Ensure that the samples correctly reflect reality
  • Jan. 1 as everyone’s birthday? – intentional (systematic) noise
• Homogeneity (đồng nhất)
  • Rating “1, 2, 3” & “A, B, C”; or Age = “42” & Birthday = “03/07/2010” (inconsistency)
• Structures (cấu trúc)
  • Heterogeneous data sources / schemas

Techniques

Cleaning
Integrating
Transforming
Techniques :: Cleaning
• Completeness and integrity
  • Data samples should be collected from reliable sources and reflect the problem to be solved.
  • Eliminate noise (outliers): remove some data samples that are significantly different from other samples?
  • A data sample may be empty (missing, incomplete). A suitable strategy is needed:
    • Remove the sample?
    • Fill in a value for the missing fields of a sample?
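One possible way to flag samples that are "significantly different" is sketched below. The slide does not name a method; the modified z-score (based on the median and the median absolute deviation) and the cutoff of 3.5 are a common rule of thumb chosen here as an assumption, and the readings are made-up values.

```python
from statistics import median


def filter_outliers(values, threshold=3.5):
    """Split values into (kept, removed) using the modified z-score."""
    center = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return list(values), []  # no spread: nothing can be flagged

    kept, removed = [], []
    for v in values:
        score = 0.6745 * abs(v - center) / mad
        (removed if score > threshold else kept).append(v)
    return kept, removed


readings = [3.1, 2.9, 3.0, 3.2, 2.8, 9.7]  # hypothetical measurements
kept, removed = filter_outliers(readings)
print(kept, removed)  # [3.1, 2.9, 3.0, 3.2, 2.8] [9.7]
```

Whether a flagged sample is noise or a rare but valid case depends on the problem, which is why the slide phrases removal as a question.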
Techniques :: Cleaning
• Fill in the missing value
  1. Fill in the missing value manually
  2. Use a global constant
  3. Use an “average” value
  4. Use the average value of all samples belonging to the same class/group
  5. Use the most probable value (regression, Bayesian inference)

A1     A2     A3  A4      A5  A6     A7  A8   y
?      3.683  ?   -0.634  1   0.409  7   30   5
?      3.096  A   1.573   1   0.639  7   30   5
?      ?      A   0.249   0   0.089  ?   80   3
2.887  3.870  C   -1.347  ?   1.276  ?   60   5
2.731  3.945  D   1.967   1   2.487  ?   100  4
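Strategies 3 and 4 can be sketched in Python on column A1 of the table, with `None` standing for "?". The function names are illustrative, and falling back to the global mean when a class has no observed value is an assumption added here, since the slide does not say what to do in that case.

```python
from collections import defaultdict
from statistics import mean

# Column A1 and label y from the table above; None marks a missing value ("?").
a1 = [None, None, None, 2.887, 2.731]
y = [5, 5, 3, 5, 4]


def fill_with_mean(column):
    """Strategy 3: replace missing entries with the mean of the observed ones."""
    observed = [v for v in column if v is not None]
    fallback = mean(observed)
    return [fallback if v is None else v for v in column]


def fill_with_class_mean(column, labels):
    """Strategy 4: use the mean of samples sharing the label, else the global mean."""
    by_class = defaultdict(list)
    for value, label in zip(column, labels):
        if value is not None:
            by_class[label].append(value)
    global_mean = mean([v for v in column if v is not None])

    filled = []
    for value, label in zip(column, labels):
        if value is not None:
            filled.append(value)
        elif by_class[label]:
            filled.append(mean(by_class[label]))
        else:
            filled.append(global_mean)  # assumed fallback: class has no observed value
    return filled


global_filled = fill_with_mean(a1)
class_filled = fill_with_class_mean(a1, y)
print([round(v, 3) for v in global_filled])
print([round(v, 3) for v in class_filled])
```

With the global mean, all three gaps receive about 2.809. With the class mean, the two samples with y = 5 receive 2.887 (the only observed A1 value in that class), while the sample with y = 3 has no classmate with an observed value and falls back to the global mean.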
Techniques :: Cleaning (cont.)
• Data uniformity
  • Different data representations, units of measure, metrics, etc.
  ⇒ Normalization is needed
  • Examples:
    • Rating “1, 2, 3” & “A, B, C”
    • Age = 42 & Birthday = 03/08/2020
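The two examples can be sketched in Python: ratings from both scales are mapped onto one numeric scale, and an age is recomputed from the birthday so that it can be compared with the stored age. The A→1, B→2, C→3 mapping, the day/month/year reading of the date, and the reference date are assumptions, since the slides do not specify them.

```python
from datetime import date, datetime

# Assumed mapping between the two rating scales; the slides do not specify one.
LETTER_TO_NUMBER = {"A": 1, "B": 2, "C": 3}


def normalize_rating(raw):
    """Map ratings written as '1'/'2'/'3' or 'A'/'B'/'C' onto integers 1..3."""
    text = str(raw).strip().upper()
    if text in LETTER_TO_NUMBER:
        return LETTER_TO_NUMBER[text]
    return int(text)


def age_on(birthday_text, today, fmt="%d/%m/%Y"):
    """Compute age in whole years; the day/month/year format is an assumption."""
    born = datetime.strptime(birthday_text, fmt).date()
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


ratings = [normalize_rating(r) for r in ["1", "B", "c", 3]]
print(ratings)  # [1, 2, 3, 3]

reference = date(2024, 1, 1)  # hypothetical date on which the record is checked
computed_age = age_on("03/08/2020", reference)
print(computed_age, computed_age == 42)  # 3 False -> the record is inconsistent
```

Once both attributes are expressed in the same representation, the inconsistency between Age = 42 and the birthday becomes a simple comparison.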

Techniques :: Integrating w/ some Transforming
Unstructured data:
• texts in websites, emails, articles, tweets
• 2D/3D images, videos + meta
• spectrograms, DNAs, …
image credits: wikipedia, shutterstock, CNN

Techniques :: Transforming
Semantics?
Extract semantic features, normalize
Semantics example: visual data
• Low-level semantics (raw pixels)
• Mid-/High-level semantics (e.g. human-interpretable features)

cat 0.28
human 0.17
car 0.08
ground 0.25
building 0.22

cat → not on → car

people ← behind ← building
car → is → red

Minimum semantic level to understand:

- Text classification
- Emotional analysis
- AI Chatbot (various semantic levels)
Image credits: CS231n, Stanford University; Lee et al, 2009; Socher et al, 2011
Techniques :: Transforming (cont.)
• Objective: to extract semantic features.
  • For a specific field and type of data (text data, images, ...): use different methods for extracting semantic features
• … and standardize
  • Feature discretization (rời rạc hoá): some attributes are more effective when their values are grouped.
  • Feature normalization (chuẩn hóa): normalize attribute values to the same domain.
One-hot encoding: 1 = 10000, 3 = 00100, …
Normalization: $\dfrac{x - \bar{x}}{s}$
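The three transformations on this slide can be sketched in a few lines of Python. Reading s as the sample standard deviation, as well as the age bin edges used for discretization, are assumptions; the slide only gives the one-hot codes and the (x − x̄)/s formula.

```python
from statistics import mean, stdev


def one_hot(value, categories):
    """Encode `value` as a 0/1 list with a single 1 at its position in `categories`."""
    return [1 if value == c else 0 for c in categories]


def standardize(values):
    """Apply (x - mean) / s, with s taken as the sample standard deviation."""
    x_bar, s = mean(values), stdev(values)
    return [(x - x_bar) / s for x in values]


def discretize(value, edges):
    """Return the index of the bin that `value` falls into, given sorted bin edges."""
    return sum(value >= edge for edge in edges)


print(one_hot(1, [1, 2, 3, 4, 5]))  # [1, 0, 0, 0, 0]
print(one_hot(3, [1, 2, 3, 4, 5]))  # [0, 0, 1, 0, 0]
print(standardize([2.0, 4.0, 6.0]))  # [-1.0, 0.0, 1.0]
print(discretize(42, edges=[18, 40, 65]))  # 2, i.e. the hypothetical 40-64 age group
```

One-hot encoding avoids implying an order between categories, while standardization puts attributes measured on different scales into a comparable range.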
Techniques :: Transforming (cont.)
• Data reduction:
  • Helps reduce the size of the data while preserving its core semantics.
  • Helps speed up the process of learning or knowledge discovery.
• Some strategies:
  • Feature selection: redundant attributes or dimensions can be eliminated
  • Dimensionality reduction: use algorithms (e.g. PCA, ICA, t-SNE) to transform the original data into a low-dimensional space.
  • Abstraction: raw data values are replaced by abstract concepts.
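As a small illustration of feature selection, the sketch below drops attributes whose values never change, since a constant column carries no information for learning. The variance threshold criterion and the toy matrix are assumptions chosen for simplicity; the slides do not name a specific selection rule, and methods such as PCA would normally rely on a numerical library that is not shown here.

```python
from statistics import pvariance


def select_features(rows, min_variance=0.0):
    """Return the column indices whose variance exceeds `min_variance`."""
    columns = list(zip(*rows))  # transpose: one tuple per attribute
    return [j for j, col in enumerate(columns) if pvariance(col) > min_variance]


def project(rows, keep):
    """Keep only the selected columns of every sample."""
    return [[row[j] for j in keep] for row in rows]


# Hypothetical data: the first attribute is constant and therefore redundant.
rows = [
    [1.0, 7.0, 0.2],
    [1.0, 3.0, 0.9],
    [1.0, 5.0, 0.4],
]
keep = select_features(rows)
reduced = project(rows, keep)
print(keep, reduced)  # [1, 2] [[7.0, 0.2], [3.0, 0.9], [5.0, 0.4]]
```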

Techniques :: Transforming
example & demo

Transforming text data

DEMO
Input: Raw data samples: JSON text
Output: Numeric data suited to each ML/AI model(s)
DEMO :: Steps
Data → Tokenize → Dictionary → Input (tf-idf vector)
DEMO :: Exercise
• Exercise: compute the vector representation of texts from a small dataset.
• Data: 2 articles from Dantri.com.vn
• Requirements:
  • Use a word segmentation module.
  • Build a dictionary from the 2 documents.
  • Use a list of stopwords to filter out stopwords.
  • Convert the 2 documents into 2 tf-idf vectors.
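The exercise steps can be sketched in Python on two toy English sentences. This is not a solution for the Dantri.com.vn articles: Vietnamese text needs a dedicated word segmentation module, which the regular expression below merely stands in for. The stopword list, the sentences, and the tf-idf variant (tf = count / document length, idf = log(N / df)) are assumptions; other variants add smoothing.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "in", "on"}  # hypothetical, tiny stopword list


def tokenize(text):
    """Stand-in for a word segmentation module: lowercase, split, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(documents):
    """Return (dictionary, vectors) with tf = count / doc length, idf = log(N / df)."""
    tokenized = [tokenize(d) for d in documents]
    dictionary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for tokens in tokenized for t in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = max(len(tokens), 1)  # guard against empty documents
        vectors.append(
            [counts[t] / length * math.log(n_docs / df[t]) for t in dictionary]
        )
    return dictionary, vectors


docs = ["The team won the match", "The match is on Sunday"]
dictionary, vectors = tfidf_vectors(docs)
print(dictionary)  # ['match', 'sunday', 'team', 'won']
for vec in vectors:
    print([round(v, 3) for v in vec])
```

With only two documents, any word that appears in both (here "match") gets idf = log(2/2) = 0, so it contributes nothing to either vector; only the words that distinguish the documents carry weight.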
Summary (Take-home messages)
• Before entering a machine learning system, the data of a field must be collected and represented in a structured form with some desirable characteristics: completeness, integrity, homogeneity, well-defined structure.
• The data collected for the learning process is a small set, but it should reflect all aspects of the problem to be solved.
• Raw data, after collection and preprocessing, must retain the full range of semantic features, i.e. the features that affect problem solving.
