Course Code: CSA3002
MACHINE LEARNING ALGORITHMS
Course Type: LPC – 2-2-3
Course Objectives
• The objective of the course is to familiarize learners with the concepts of machine learning algorithms and to develop practical skills through experiential learning techniques.
Course Outcomes
At the end of the course, students should be able to
1. Understand training and testing of datasets using machine learning techniques.
2. Apply optimization and parameter tuning techniques for machine learning algorithms.
3. Apply machine learning models to solve a variety of problems.
4. Apply machine learning algorithms to create models.
DATA PRE-PROCESSING
Table of Contents:
• Why preprocess the data?
• Data cleaning
• Data integration and transformation
Why Is Data Dirty?
• Incomplete data may come from
• “Not applicable” data value when collected
• Different considerations between the time when the data was collected
and when it is analyzed.
• Human/hardware/software problems
• Noisy data (incorrect values) may come from
• Faulty data collection instruments
• Human or computer error at data entry
• Errors in data transmission
• Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• e.g., duplicate or missing data may cause incorrect or even
misleading statistics.
• Data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality
• A well-accepted multidimensional view:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility
• Broad categories:
• Intrinsic, contextual, representational, and accessibility
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for
numerical data
Forms of Data Preprocessing
Data Cleaning
• Importance
• “Data cleaning is one of the biggest problems in data
warehousing”—Ralph Kimball
• “Data cleaning is the number one problem in data
warehousing”—DCI survey
• Data cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
Missing Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
• equipment malfunction
• data deleted because it was inconsistent with other recorded data
• data not entered due to misunderstanding
• certain data not considered important at the time of entry
• failure to register the history or changes of the data
• Missing data may need to be inferred.
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (assuming the
tasks in classification—not effective when the percentage of missing values
per attribute varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
• a global constant : e.g., “unknown”, a new class?!
• the attribute mean
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based such as Bayesian formula or
decision tree
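A minimal pandas sketch (not part of the original slides) of the automatic fill-in strategies above; the column names income and class are hypothetical:

```python
import pandas as pd

# Hypothetical toy data with missing income values
df = pd.DataFrame({
    "income": [52000.0, None, 61000.0, None, 47000.0, 58000.0],
    "class":  ["A", "A", "B", "B", "A", "B"],
})

# Fill with a global constant (a sentinel meaning "unknown")
df["income_const"] = df["income"].fillna(-1)

# Fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Fill with the attribute mean of samples belonging to the same class (smarter)
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)

print(df)
```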
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems that require data cleaning
• duplicate records
• incomplete data
• inconsistent data
How to Handle Noisy Data?
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with
possible outliers)
Binning
• Binning methods smooth a sorted data value by consulting
its “neighborhood,” that is, the values around it. The sorted
values are distributed into a number of “buckets,” or bins.
Because binning methods consult the neighborhood of
values, they perform local smoothing.
• In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the mean of the values 4, 8, 9, and 15 in Bin 1 is 9; therefore, each original value in this bin is replaced by the value 9.
• Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the bin median.
• In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
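A short NumPy sketch (an illustration, not from the slides) that reproduces the example above, assuming equal-frequency bins of four values each:

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition the sorted data into equal-frequency bins of 4 values each
bins = np.sort(prices).reshape(-1, 4)

# Smoothing by bin means: replace every value with its bin's (rounded) mean
means = np.round(bins.mean(axis=1)).astype(int)
by_means = np.repeat(means, 4).reshape(-1, 4)

# Smoothing by bin boundaries: replace every value with the nearer of the
# bin's minimum and maximum
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)   # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
print(by_bounds)  # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]
```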
Regression
• Data smoothing can also be done by regression, a technique that conforms data values to a function. Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other (see the sketch below).
[Figure: a fitted regression line y = x + 1; for a given value X1, the smoothed value Y1’ on the line replaces the observed value Y1.]
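As a hedged illustration (synthetic data, not from the slides), a least-squares line can be fitted with NumPy and used to smooth the observed values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic noisy observations roughly following y = x + 1
x = np.linspace(0, 10, 20)
y = x + 1 + rng.normal(scale=0.5, size=x.size)

# Fit the "best" (least-squares) line and replace noisy y-values
# with the fitted predictions
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept

print(f"fitted line: y = {slope:.2f}x + {intercept:.2f}")
```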
Cluster Analysis
Outliers may be detected by
clustering, for example, where
similar values are organized into
groups, or “clusters.” Intuitively,
values that fall outside of the set of clusters may be considered outliers (see the sketch below).
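A small scikit-learn sketch of this idea; the data, the number of clusters, and the distance threshold are all illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated groups of values plus one obvious outlier
values = np.concatenate([
    rng.normal(10, 1, 50),
    rng.normal(50, 1, 50),
    [200.0],
]).reshape(-1, 1)

# Organize the values into clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)

# Distance of each value to its assigned cluster centroid
dist = np.abs(values - kmeans.cluster_centers_[kmeans.labels_]).ravel()

# Flag values that fall far outside their cluster (heuristic threshold)
outliers = values.ravel()[dist > 3 * dist.std()]
print(outliers)  # expected to contain the value 200.0
```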
Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
• Integrate metadata from different sources
• 1. Entity identification problem:
• Identify real world entities from multiple data sources, e.g.,
Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different
sources are different
• Possible reasons: different representations, different scales,
e.g., metric vs. British units
2. Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
• Object identification: The same attribute or object may
have different names in different databases
• Derivable data: One attribute may be a “derived” attribute
in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis (see the sketch below)
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
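A hypothetical pandas sketch of correlation analysis for redundancy detection; the column names and the 0.95 threshold are assumptions for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

monthly = rng.normal(10_000, 2_000, 200)
df = pd.DataFrame({
    "monthly_revenue": monthly,
    "annual_revenue":  monthly * 12,               # derivable, hence redundant
    "num_employees":   rng.integers(5, 500, 200),
})

# Absolute pairwise (Pearson) correlations between numeric attributes
corr = df.corr().abs()

# Keep each pair once (upper triangle) and report highly correlated pairs
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs > 0.95])
```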
3. Tuple Duplication
• At the time data integration, duplication should also be
detected at the tuple level (e.g., where there are two or
more identical tuples for a given unique data entry
case).
• The use of denormalized tables (often done to improve
performance by avoiding joins) is another source of data
redundancy.
• Inconsistencies often arise between various duplicates,
due to inaccurate data entry or updating some but not
all data occurrences.
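A minimal pandas sketch (hypothetical table) for detecting and removing duplicate tuples:

```python
import pandas as pd

# Hypothetical integrated customer table containing a duplicated tuple
df = pd.DataFrame({
    "cust_id": [101, 102, 102, 103],
    "name":    ["Ann", "Bob", "Bob", "Eve"],
    "city":    ["Pune", "Delhi", "Delhi", "Goa"],
})

# Detect duplicate tuples (both copies are shown) ...
print(df[df.duplicated(keep=False)])

# ... and keep a single copy of each
deduplicated = df.drop_duplicates()
print(deduplicated)
```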
4. Data Value Conflict Detection and Resolution
• Data integration also involves the detection and resolution of data value conflicts.
• For example, for the same real-world entity, attribute values from different sources may
differ. This may be due to differences in representation, scaling, or encoding. For
instance, a weight attribute may be stored in metric units in one system and British
imperial units in another.
• For a hotel chain, the price of rooms in different cities may involve not only different
currencies but also different services (e.g., free breakfast) and taxes.
Data Transformation
• Strategies for data transformation include the following:
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Attribute/feature construction
• New attributes constructed from the given ones
1. Data Transformation: Normalization
• Min-max normalization: to [new_min_A, new_max_A]
  v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
• Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
  ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
• Z-score normalization (μ_A: mean, σ_A: standard deviation of A):
  v' = (v − μ_A) / σ_A
• Ex. Let μ_A = 54,000 and σ_A = 16,000. Then 73,600 is mapped to (73,600 − 54,000) / 16,000 = 1.225
• Normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
• Ex. Suppose the recorded values of A range from −986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3), so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
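A small NumPy sketch reproducing the income example above under the stated assumptions (mean 54,000 and standard deviation 16,000 for the z-score case):

```python
import numpy as np

income = np.array([12_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0.0, 1.0]
mn, mx = income.min(), income.max()
minmax = (income - mn) / (mx - mn) * (1.0 - 0.0) + 0.0

# Z-score normalization, using the slide's mean and standard deviation
zscore = (income - 54_000) / 16_000

# Decimal scaling: divide by 10^j so that the maximum absolute value is below 1
j = int(np.ceil(np.log10(np.abs(income).max())))
decimal = income / 10 ** j

print(round(minmax[2], 3))  # 0.716 -> $73,600 mapped into [0, 1]
print(round(zscore[2], 3))  # 1.225 -> z-score of $73,600
print(decimal)
```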
Pre-Processing
Table of Contents:
• Data reduction
• Discretization and concept hierarchy generation
Module Outcome:
• Apply various pre-processing techniques to a dataset
Data Reduction Strategies
• Why data reduction?
• A database/data warehouse may store terabytes of data
• Complex data analysis/mining may take a very long time to run on the
complete data set
• Data reduction
• Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Data reduction strategies
1) Dimensionality reduction
2) Numerosity reduction
3) Data compression
Dimensionality reduction
• It is the process of reducing the number of random variables or attributes
under consideration.
• Attribute subset selection is a method of dimensionality reduction in which
irrelevant, weakly relevant, or redundant attributes or dimensions are
detected and removed
• Dimensionality reduction methods include wavelet transforms and principal components analysis, which transform or project the original data onto a smaller space.
Attribute Subset Selection
• Feature selection (i.e., attribute subset selection):
• It is the process of selecting a subset of relevant features for use in model
construction.
Heuristic methods (due to exponential # of choices):
• Step-wise forward selection
• Step-wise backward elimination
• Combining forward selection and backward elimination
• Decision-tree induction
Heuristic Feature Selection Methods
• Several heuristic feature selection methods:
• Best step-wise feature selection:
• The best single feature is picked first
• Then the next best feature conditioned on the first, ...
• Step-wise feature elimination:
• Repeatedly eliminate the worst feature
• Best combined feature selection and elimination (see the sketch below)
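As a hedged sketch of step-wise forward selection (the dataset, classifier, and target of five features are illustrative choices), scikit-learn's SequentialFeatureSelector can be used:

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True, as_frame=True)

# Step-wise forward selection: greedily add the single feature that most
# improves cross-validated accuracy until 5 features have been chosen
selector = SequentialFeatureSelector(
    KNeighborsClassifier(),
    n_features_to_select=5,
    direction="forward",   # "backward" gives step-wise elimination instead
)
selector.fit(X, y)

print(list(X.columns[selector.get_support()]))
```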
Dimensionality Reduction: Principal
Component Analysis (PCA)
• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data.
• PCA is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the large set (see the sketch below).
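A minimal scikit-learn sketch of PCA; the Iris data and k = 2 are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize, then project the 4-dimensional data onto k = 2 orthogonal
# principal components that retain most of the variance
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # fraction of variance kept per component
```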
Image Compression
Image compression is the process of encoding or
converting an image file in such a way that it consumes
less space than the original file.
Numerosity reduction
• Numerosity reduction techniques replace the
original data volume by alternative, smaller forms of
data representation. These techniques may be
parametric or nonparametric.
• For parametric methods, a model is used to
estimate the data, so that typically only the data
parameters need to be stored, instead of the actual
data. (Outliers may also be stored.) Regression and
log-linear models are examples.
• Nonparametric methods for storing reduced representations of the data include histograms, clustering, sampling, and data cube aggregation.
Data Reduction Method: Histograms
• A histogram represents data in terms of frequency. It uses binning to approximate the data distribution and is a popular form of data reduction.
[Figure: histogram of price values with bins at 10,000, 30,000, 50,000, 70,000, and 90,000; frequencies on a 0–40 scale.]
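A short NumPy sketch of histogram-based reduction; the synthetic prices and the choice of five equal-width bins are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.uniform(0, 100_000, 10_000)

# Replace 10,000 raw values by 5 equal-width bins and their frequencies
counts, edges = np.histogram(prices, bins=5, range=(0, 100_000))

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:>9,.0f}, {hi:>9,.0f}): {c}")
```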
Data Reduction Method: Clustering
• Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
• Can be very effective if data is clustered
• Can have hierarchical clustering and be stored in multi-dimensional index
tree structures
• There are many choices of clustering definitions and clustering algorithms
Data compression
• In data compression, transformations are applied so as to obtain a
reduced or “compressed” representation of the original data.
• If the original data can be reconstructed from the compressed data
without any information loss, the data reduction is called lossless.
• If, instead, we can reconstruct only an approximation of the original
data, then the data reduction is called lossy.
• Dimensionality reduction and numerosity reduction techniques can
also be considered forms of data compression.
Data Reduction Method: Sampling
• Sampling: obtaining a small sample s to represent the whole
data set N.
• Techniques:
• 1) Simple random sample without replacement (SRSWOR) of size s
• 2) Simple random sample with replacement (SRSWR) of size s
• 3) Cluster sample (see the sketch below)
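A pandas/NumPy sketch of the three techniques; the toy table and the use of the group column as "clusters" are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.normal(size=1_000),
    "group": rng.choice(["A", "B", "C", "D"], size=1_000),
})
s = 100

# 1) Simple random sample without replacement (SRSWOR) of size s
srswor = df.sample(n=s, replace=False, random_state=0)

# 2) Simple random sample with replacement (SRSWR) of size s
srswr = df.sample(n=s, replace=True, random_state=0)

# 3) Cluster sample: randomly pick whole clusters (here, the groups)
#    and keep all of their tuples
chosen = rng.choice(df["group"].unique(), size=2, replace=False)
cluster_sample = df[df["group"].isin(chosen)]

print(len(srswor), len(srswr), len(cluster_sample))
```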
Data Discretization
• Data discretization is defined as a process of converting
continuous data attribute values into a finite set of intervals
and associating with each interval some
specific data value.
Discretization
• Three types of attributes:
• Nominal — values from an unordered set, e.g., color, profession
• Ordinal — values from an ordered set, e.g., military or academic rank
• Continuous (numeric) — integer or real numbers
• Discretization:
• Divide the range of a continuous attribute into intervals
• Some classification algorithms only accept categorical attributes.
• Reduce data size by discretization
Discretization
• Discretization
• Reduce the number of values for a given continuous attribute by dividing
the range of the attribute into intervals
• Interval labels can then be used to replace actual data values
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
Discretization for Numeric Data
• Typical methods (all of them can be applied recursively):
• Binning
• Top-down split, unsupervised
• Histogram analysis
• Clustering analysis
• Either top-down split or bottom-up merge, unsupervised
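A small pandas sketch of unsupervised discretization; the ages and the interval labels are hypothetical:

```python
import pandas as pd

ages = pd.Series([6, 12, 15, 21, 23, 30, 34, 41, 55, 62, 70, 81])

# Equal-width binning (top-down split, unsupervised)
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency binning
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages,
                    "equal_width": equal_width,
                    "equal_freq": equal_freq}))
```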