Data Science Unit I (LN and QB)

The document outlines the syllabus for a unit on Data Science and Big Data, focusing on data objects and attributes, including their types and characteristics. It discusses the importance of data quality, preprocessing tasks such as data cleaning, integration, reduction, transformation, and the handling of missing and noisy data. Additionally, it covers dimensionality reduction techniques like Principal Component Analysis (PCA) and introduces matrix factorization in the context of recommender systems.

Uploaded by

Ritesh Borse

Data Science and Big Data

Unit I Syllabus

Reference Book
Data Mining:
Concepts and Techniques
(3rd ed.)

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Lecture Notes

Data Objects

 Data sets are made up of data objects.

 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, or tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
Attributes

 Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
 E.g., customer_ID, name, address
 Types:
 Nominal
 Binary
 Ordinal
 Numeric: quantitative
 Interval-scaled
 Ratio-scaled

Attribute Types
 Nominal: categories, states, or “names of things”
 Hair_color = {auburn, black, blond, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV
positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude between
successive values is not known.
 Size = {small, medium, large}, grades, army rankings
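
The attribute type determines which encodings make sense. The following is a minimal sketch, not part of the original slides: the function names are hypothetical, and the category lists are taken from the examples above.

```python
# Category lists taken from the slide examples.
HAIR_COLORS = ["auburn", "black", "blond", "brown", "grey", "red", "white"]
SIZE_ORDER = ["small", "medium", "large"]


def one_hot(value, categories):
    """Nominal: no order exists, so each category gets its own 0/1 indicator."""
    return [1 if value == c else 0 for c in categories]


def ordinal_rank(value, ordered_categories):
    """Ordinal: keep the order; the gap between ranks carries no magnitude."""
    return ordered_categories.index(value)


def asymmetric_binary(test_result):
    """Asymmetric binary: by convention the more important outcome is coded 1."""
    return 1 if test_result == "positive" else 0


blond_code = one_hot("blond", HAIR_COLORS)
medium_rank = ordinal_rank("medium", SIZE_ORDER)
```

Here `blond_code` has a single 1 in the third position, and `medium_rank` only says that medium sits between small and large, not by how much.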

Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval
 Measured on a scale of equal-sized units
 Values have order
 E.g., temperature in °C or °F, calendar dates
 No true zero-point
 Ratio
 Inherent zero-point
 We can speak of one value as being a multiple (or ratio) of another (10 K is twice as high as 5 K)
 E.g., temperature in Kelvin, length, counts, monetary quantities
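
A small worked example may help separate interval from ratio scales. It is only an illustration; the helper name `celsius_to_kelvin` is an assumption introduced here.

```python
def celsius_to_kelvin(c):
    """Convert an interval-scaled Celsius reading to ratio-scaled Kelvin."""
    return c + 273.15


# On the Celsius (interval) scale, differences are meaningful ...
difference = 10.0 - 5.0

# ... but ratios are not, because 0 °C is an arbitrary zero point.
naive_ratio = 10.0 / 5.0

# On the Kelvin (ratio) scale the ratio is physically meaningful.
true_ratio = celsius_to_kelvin(10.0) / celsius_to_kelvin(5.0)
```

The naive ratio is 2.0, while the Kelvin ratio is only about 1.018, so 10 °C is not "twice as hot" as 5 °C.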
Discrete vs. Continuous Attributes
 Discrete Attribute
 Has only a finite or countably infinite set of values
 E.g., zip codes, profession, or the set of words in a collection of documents
 Sometimes represented as integer variables
 Note: Binary attributes are a special case of discrete attributes
 Continuous Attribute
 Has real numbers as attribute values
 E.g., temperature, height, or weight
 Practically, real values can only be measured and represented using a finite number of digits
 Continuous attributes are typically represented as floating-point variables
Data Quality: Why Preprocess the Data?

 Measures for data quality: a multidimensional view

 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling, …
 Timeliness: timely update?
 Believability: how much are the data trusted to be correct?
 Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation

Data Cleaning
 Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
 Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
 e.g., Occupation=“ ” (missing data)
 Noisy: containing noise, errors, or outliers
 e.g., Salary=“−10” (an error)
 Inconsistent: containing discrepancies in codes or names, e.g.,
 Age=“42”, Birthday=“03/07/2010”
 Was rating “1, 2, 3”, now rating “A, B, C”
 discrepancy between duplicate records
 Intentional (e.g., disguised missing data)
 Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data

 Data is not always available

 E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 data that was inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data not being considered important at the time of entry
 failure to record history or changes of the data
 Missing data may need to be inferred
How to Handle Missing Data?
 Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
 Fill in the missing value manually: tedious + infeasible?
 Fill it in automatically with
 a global constant: e.g., “unknown”, a new class?!
 the attribute mean
 the attribute mean for all samples belonging to the same class: smarter
 the most probable value: inference-based, such as a Bayesian formula or decision tree
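
The automatic fill-in strategies can be sketched as follows. This is a minimal illustration using only the standard library; the toy records, the attribute names `cls` and `income`, and the function names are all hypothetical. The inference-based strategy is not shown.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing income value.
records = [
    {"cls": "low_risk", "income": 40000},
    {"cls": "low_risk", "income": 60000},
    {"cls": "low_risk", "income": None},
    {"cls": "high_risk", "income": 20000},
    {"cls": "high_risk", "income": None},
]


def fill_with_constant(rows, attr, constant="unknown"):
    """Replace missing values with one global constant."""
    return [dict(r, **{attr: constant}) if r[attr] is None else dict(r)
            for r in rows]


def fill_with_mean(rows, attr):
    """Replace missing values with the mean of all observed values."""
    m = mean(r[attr] for r in rows if r[attr] is not None)
    return [dict(r, **{attr: m}) if r[attr] is None else dict(r) for r in rows]


def fill_with_class_mean(rows, attr, class_attr):
    """Replace missing values with the mean observed within the same class."""
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[class_attr], []).append(r[attr])
    class_means = {c: mean(vals) for c, vals in by_class.items()}
    return [dict(r, **{attr: class_means[r[class_attr]]})
            if r[attr] is None else dict(r) for r in rows]
```

On this toy data the overall mean fills both gaps with 40000, while the class mean fills the low-risk gap with 50000 and the high-risk gap with 20000, which is why the slide calls it "smarter".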
Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments

 data entry problems

 data transmission problems

 technology limitation

 inconsistency in naming convention

 Other data problems which require data cleaning

 duplicate records

 incomplete data

 inconsistent data

How to Handle Noisy Data?

 Binning
 first sort the data and partition it into (equal-frequency) bins
 then smooth by bin means, bin medians, bin boundaries, etc.
 Regression
 smooth by fitting the data to regression functions
 Clustering
 detect and remove outliers
 Combined computer and human inspection
 detect suspicious values and have a human check them (e.g., deal with possible outliers)
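
Binning can be sketched in a few lines. The price list below is illustrative data, not taken from these slides, and the function names are assumptions; the sketch also assumes the number of values is divisible by the number of bins.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into bins of equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    return [ordered[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
            for b in bins]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With three bins, smoothing by means yields 9, 22, and 29 for the three bins, while smoothing by boundaries yields [4, 4, 15], [21, 21, 24], and [25, 25, 34].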
Data Reduction Strategies
 Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
 Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
 Data reduction strategies
 Dimensionality reduction, e.g., remove unimportant attributes

 Wavelet transforms

 Principal Components Analysis (PCA)

 Feature subset selection, feature creation

 Numerosity reduction (some simply call it: Data Reduction)

 Regression and Log-Linear Models

 Histograms, clustering, sampling

 Data cube aggregation

 Data compression

Data Reduction 1: Dimensionality Reduction
 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
 The number of possible subspace combinations grows exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)
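
One way to see distances becoming less meaningful is to compare the nearest and farthest neighbour of a query point as the dimensionality grows. The sketch below is an illustration under stated assumptions: uniformly random points in the unit hypercube, a fixed seed, and a hypothetical helper named `distance_contrast`.

```python
import math
import random


def distance_contrast(n_dims, n_points=200, seed=0):
    """Return (max - min) / min over distances from a query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


low_dim_contrast = distance_contrast(2)
high_dim_contrast = distance_contrast(1000)
```

In two dimensions the farthest point is many times farther away than the nearest one; in 1000 dimensions the relative gap shrinks sharply, so "nearest" carries much less information.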
Principal Component Analysis (PCA)
 Find a projection that captures the largest amount of variation in the data
 The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space
Principal Component Analysis (Steps)
 Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
 Normalize input data: each attribute falls within the same range
 Compute k orthonormal (unit) vectors, i.e., principal components
 Each input data vector is a linear combination of the k principal component vectors
 The principal components are sorted in order of decreasing “significance” or strength
 Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
 Works for numeric data only
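
The steps can be traced by hand in two dimensions, where the eigenvectors of the 2x2 covariance matrix have a closed form. This is a minimal sketch, not a general implementation: the function name `pca_2d` and the toy data are assumptions, and the data are only mean-centred (the range-normalization step is omitted for brevity).

```python
import math


def pca_2d(xs, ys):
    """Return (first principal component, eigenvalues, 1-D scores) for 2-D data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]  # mean-centred x
    cy = [y - my for y in ys]  # mean-centred y

    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(v * v for v in cx) / (n - 1)
    b = sum(u * v for u, v in zip(cx, cy)) / (n - 1)
    c = sum(v * v for v in cy) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    mid = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = mid + spread, mid - spread

    # Eigenvector belonging to the largest eigenvalue.
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    pc1 = (vx / norm, vy / norm)

    # Project the centred data onto the first component (2-D -> 1-D).
    scores = [u * pc1[0] + v * pc1[1] for u, v in zip(cx, cy)]
    return pc1, (lam1, lam2), scores


# Toy data lying exactly on the line y = 2x.
pc1, eigenvalues, scores = pca_2d([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
explained = eigenvalues[0] / (eigenvalues[0] + eigenvalues[1])
```

Because the toy points lie on a line, the first component points along (1, 2) and explains all of the variance, so the second component can be dropped without losing information.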
Numerical Example

Watch this:

https://www.youtube.com/watch?v=MLaJbA82nzk

Attribute Subset Selection
 Another way to reduce dimensionality of data
 Redundant attributes
 Duplicate much or all of the information contained in
one or more other attributes
 E.g., purchase price of a product and the amount of
sales tax paid
 Irrelevant attributes
 Contain no information that is useful for the data
mining task at hand
 E.g., students' ID is often irrelevant to the task of
predicting students' GPA
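
A quick check for redundancy between two numeric attributes is their correlation. The sketch below is illustrative only: the prices and the flat 8% tax rate are hypothetical, and `pearson` is a hand-written helper rather than a library call.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


purchase_price = [100.0, 250.0, 400.0, 80.0]      # hypothetical prices
sales_tax = [p * 0.08 for p in purchase_price]    # hypothetical flat 8% tax
redundancy = pearson(purchase_price, sales_tax)
```

A correlation of (almost exactly) 1 signals that one of the two attributes can be dropped with little or no loss of information.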

Heuristic Search in Attribute Selection

 There are 2^d possible attribute combinations of d attributes
 Typical heuristic attribute selection methods:
 Best single attribute under the attribute independence assumption: choose by significance tests
 Best step-wise feature selection:
 The best single attribute is picked first
 Then the next best attribute conditioned on the first, ...
 Step-wise attribute elimination:
 Repeatedly eliminate the worst attribute
 Best combined attribute selection and elimination
 Optimal branch and bound:
 Use attribute elimination and backtracking
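
Step-wise forward selection can be sketched as a greedy loop around any subset-scoring function. Everything specific below is hypothetical: the attribute names, the merit values, the redundancy penalty, and the `score` function stand in for whatever evaluation measure is actually used.

```python
from itertools import combinations

# Hypothetical per-attribute merits and a pairwise redundancy penalty.
MERIT = {"income": 0.6, "age": 0.5, "tax_paid": 0.55, "student_id": 0.0}
REDUNDANCY = {frozenset(("income", "tax_paid")): 0.5}


def score(subset):
    """Toy subset score: summed merit minus penalties for redundant pairs."""
    total = sum(MERIT[a] for a in subset)
    for pair in combinations(subset, 2):
        total -= REDUNDANCY.get(frozenset(pair), 0.0)
    return total


def forward_select(attributes, score_fn, k):
    """Greedily add the attribute that most improves the score, up to k."""
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score_fn(selected + [a]))
        if score_fn(selected + [best]) <= score_fn(selected):
            break  # no remaining attribute improves the score
        selected.append(best)
        remaining.remove(best)
    return selected


n_subsets = 2 ** len(MERIT)  # exhaustive search would score all 2^d subsets
chosen = forward_select(list(MERIT), score, k=2)
```

With four attributes there are 16 subsets, but the greedy search scores only a handful. It picks `income` first and then `age`, skipping `tax_paid` because it is largely redundant given `income`.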
Attribute Creation (Feature Generation)
 Create new attributes (features) that can capture the important information in a data set more effectively than the original ones
 Three general methodologies
 Attribute extraction
 Domain-specific
 Mapping data to new space (see: data reduction)
 E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)
 Attribute construction
 Combining features (see: discriminative frequent patterns in Chapter 7)
 Data discretization
Data Transformation
 A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
 Methods
 Smoothing: Remove noise from data
 Attribute/feature construction
 New attributes constructed from the given ones
 Aggregation: Summarization, data cube construction
 Normalization: Scaled to fall within a smaller, specified range
 min-max normalization
 z-score normalization
 normalization by decimal scaling
 Discretization: Concept hierarchy climbing
Normalization
 Min-max normalization: to [new_min_A, new_max_A]
$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$
 Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$
 Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):
$$v' = \frac{v - \mu_A}{\sigma_A}$$
 Ex. Let μ_A = 54,000, σ_A = 16,000. Then $73,600 is mapped to
$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$
 Normalization by decimal scaling
$$v' = \frac{v}{10^{j}}$$
where j is the smallest integer such that max(|v'|) < 1
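
The three normalization methods translate directly into code. This is a minimal sketch with hypothetical function names; the income figures come from the examples above, and the values passed to `decimal_scaling` are illustrative.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every |v'| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


income_min_max = min_max(73600, 12000, 98000)
income_z = z_score(73600, 54000, 16000)
scaled, j = decimal_scaling([-986, 917])  # illustrative values
```

The first two calls reproduce the worked examples (about 0.716 and 1.225). For the illustrative values, j is 3, so -986 becomes -0.986 and 917 becomes 0.917.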
Summary
 Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
 Data cleaning: e.g. missing/noisy values, outliers
 Data integration from multiple sources:
 Entity identification problem

 Remove redundancies

 Detect inconsistencies

 Data reduction
 Dimensionality reduction

 Numerosity reduction

 Data compression

 Data transformation and data discretization

 Normalization

 Concept hierarchy generation

Multidimensional Scaling

2/16/2025 Data Mining: Concepts and Techniques 28

Local Binary Pattern


Matrix Factorization

Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two lower-dimensionality rectangular matrices. Multiplying these factors approximately reproduces the original matrix.

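
The decomposition can be illustrated with a small stochastic gradient descent sketch. It is one common way to fit the factors, not necessarily the method the lecture has in mind; the ratings matrix, the rank, the learning rate, and the regularization weight are all hypothetical choices, and `None` marks an unrated item.

```python
import math
import random


def predict(P, Q, u, i):
    """Predicted rating: dot product of user factors and item factors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def factorize(ratings, rank=2, lr=0.005, reg=0.02, epochs=2000, seed=0):
    """Fit user factors P and item factors Q on the observed ratings by SGD."""
    rng = random.Random(seed)
    P = [[rng.random() for _ in range(rank)] for _ in ratings]
    Q = [[rng.random() for _ in range(rank)] for _ in ratings[0]]
    observed = [(u, i, r) for u, row in enumerate(ratings)
                for i, r in enumerate(row) if r is not None]
    for _ in range(epochs):
        for u, i, r in observed:
            err = r - predict(P, Q, u, i)
            for f in range(rank):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def rmse(ratings, P, Q):
    """Root-mean-square error over the observed ratings only."""
    errs = [(r - predict(P, Q, u, i)) ** 2
            for u, row in enumerate(ratings)
            for i, r in enumerate(row) if r is not None]
    return math.sqrt(sum(errs) / len(errs))


# Hypothetical 5 users x 4 movies rating matrix; None = not rated.
ratings = [
    [5, 3, None, 1],
    [4, None, None, 1],
    [1, 1, None, 5],
    [1, None, None, 4],
    [None, 1, 5, 4],
]
P0, Q0 = factorize(ratings, epochs=0)   # random starting factors
P, Q = factorize(ratings)               # trained factors
guess = predict(P, Q, 0, 2)             # estimate for an unrated movie
```

After training, the product of the two factor matrices fits the observed ratings far better than the random starting point, and the same product supplies estimates such as `guess` for the entries that were never rated.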

Watch this:

 https://www.youtube.com/watch?v=ZspR5PZemcs


Sample Unit I questions

1. What are features? Explain the different types with examples.
2. What is data preprocessing?
3. Why is data preprocessing necessary?
4. What are outliers? Explain the mechanism for outlier detection.
5. How are null values handled?
6. Explain IQR using a box plot.
7. Explain the PCA algorithm.
8. What is feature engineering?
Continued….
 9. What is the difference between feature selection and feature extraction?
 10. Explain the filter, wrapper, and embedded methods. Explain forward selection and backward elimination.
 11. What is meant by the curse of dimensionality? How can it be handled?
 12. Mention dimensionality reduction techniques.
 13. What is the stress function in multidimensional scaling? Explain its uses.
 14. Write the formula for the local binary pattern.
 15. What is matrix factorization? Explain its use in movie rating prediction.
Continued…..
 16. Find the mean of the following data:
 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70
 17. Find the standard deviation of the following data:
 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70
 18. Consider the following data (in increasing order) for the attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. Answer the following:
 (a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
 (b) Use z-score normalization to transform the value 35 for age, where the standard deviation of age is 12.94 years.
 (c) Use normalization by decimal scaling to transform the value 35 for age.
Continued

 19. Calculate the z-scores and then determine the outlier:
 4, 5, 7, 8, 9, 10, 12, 15, 16, 17, 100
 20. Draw the box plot for the data of Q.5
 21. Use LBP to find the transformed center pixel value.
Continued….

 22. Use the methods below to normalize the following group of data:
 200, 300, 400, 600, 700
 (a) min-max normalization by setting min = 0 and max = 1
 (b) z-score normalization
 (c) normalization by decimal scaling
 23. Calculate the IQR and find the outlier for the following data:
 4, 5, 7, 8, 9, 10, 12, 15, 16, 17, 100
Also study the lecture notes given in class.
