1. Objective
3. 2D Mapping of ILOs with Knowledge Dimension and Cognitive Learning Levels of RBT
4. Teaching Methodology
PowerPoint Presentation, Chalk and Talk, Visual Presentation
5. Evocation
Fig. 1: Evocation
6. Deliverables
Lecture Notes-17:
Data Preprocessing:
Data preprocessing is an important step in data science. It refers to cleaning, transforming, and
integrating data in order to make it ready for analysis. The goal of data preprocessing is to
improve the quality of the data and to make it more suitable for the specific data mining task. The
main preprocessing steps are listed below; a short illustrative sketch follows the list.
• Data cleaning: this step involves identifying and removing missing, inconsistent, or
irrelevant data. This can include removing duplicate records, filling in missing values, and
handling outliers.
• Data integration: this step involves combining data from multiple sources, such as
databases, spreadsheets, and text files. The goal of integration is to create a single, consistent
view of the data.
• Data transformation: this step involves converting the data into a format that is more suitable
for the data science tasks. This can include normalizing numerical data, creating dummy
variables, and encoding categorical data.
• Data reduction: this step is used to select a subset of the data that is relevant to the data
science task. This can include feature selection (selecting a subset of the variables) or feature
extraction (extracting new variables from the data).
• Data discretization: this step is used to convert continuous numerical data into categorical
data, which can be used for decision tree and other categorical data mining techniques.
Noisy Data: “What is noise?” Noise is a random error or variance in a measured variable.
• Some basic statistical description techniques (e.g., boxplots and scatter plots) and methods of
data visualization can be used to identify outliers, which may represent noise.
• Noise can be handled using the following techniques:
• Binning Method
• Regression
• Outlier analysis or Clustering
Lecture Notes-18:
Binning Method:
• This method works on sorted data in order to smooth it.
• The data are divided into segments (bins) of equal size, and smoothing is then applied to each
segment.
• Each segment is handled separately.
• All values in a segment can be replaced by the segment mean, or the segment's boundary values
can be used to complete the task.
• The example below illustrates some binning techniques (a short code sketch reproducing the
results follows the example). The data for price are first sorted and then partitioned into
equal-frequency bins of size 3 (i.e., each bin contains three values).
• Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34.
• Partition into (equal frequency) bins:
• Bin 1: 4, 8, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 28, 34
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example,
the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced
by the value 9.
• Smoothing by bin means:
• Bin 1: 9, 9, 9
• Bin 2: 22, 22, 22
• Bin 3: 29, 29, 29
Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the
bin median.
• In smoothing by bin boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
• Smoothing by bin boundaries:
• Bin 1: 4, 4, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 25, 34
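The smoothing results above can be reproduced with a short NumPy sketch (a rough illustration of
equal-frequency binning; the code is not part of the original example):

    # Equal-frequency binning of the sorted price data, with two smoothing schemes.
    import numpy as np

    prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])  # already sorted
    bins = prices.reshape(3, 3)                             # 3 bins of 3 values each

    # Smoothing by bin means: replace every value by its bin's mean (9, 22, 29).
    by_means = np.repeat(bins.mean(axis=1, keepdims=True), 3, axis=1)

    # Smoothing by bin boundaries: replace every value by the closer of its bin's min/max.
    lo, hi = bins.min(axis=1, keepdims=True), bins.max(axis=1, keepdims=True)
    by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

    print(by_means)   # rows: 9 9 9 / 22 22 22 / 29 29 29
    print(by_bounds)  # rows: 4 4 15 / 21 21 24 / 25 25 34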
Regression: Data smoothing can also be done by regression, a technique that conforms data values
to a function. Linear regression involves finding the “best” line to fit two attributes (or variables) so
that one attribute can be used to predict the other. Multiple linear regression is an extension of linear
regression, where more than two attributes are involved, and the data are fit to a multidimensional
surface.
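A minimal sketch of smoothing by simple linear regression is shown below, using a made-up attribute
pair (x, y); the noisy y values are replaced by the values predicted by the fitted line:

    # Smoothing by regression: fit a least-squares line and use its predictions.
    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
    y = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)  # noisy attribute

    slope, intercept = np.polyfit(x, y, deg=1)   # coefficients of the "best" line
    y_smoothed = slope * x + intercept           # smoothed values lie on the line
    print(np.round(y_smoothed, 1))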
Outlier analysis: Outliers may be detected by clustering, for example, where similar values are
organized into groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be
considered outliers.
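One rough way to put this idea into code is sketched below: the values are clustered with k-means
and points that end up in very small clusters are treated as potential outliers (the data, the
number of clusters, and the singleton-cluster rule are illustrative assumptions):

    # Clustering-based outlier detection sketch using scikit-learn.
    import numpy as np
    from sklearn.cluster import KMeans

    values = np.array([[4], [8], [15], [21], [21], [24], [25], [28], [34], [210]])
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)

    sizes = np.bincount(km.labels_)            # number of points in each cluster
    outliers = values[sizes[km.labels_] <= 1]  # singleton clusters look like outliers
    print(outliers)                            # expected: [[210]]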
Data Reduction:
• Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data. That
is, mining on the reduced data set should be more efficient yet produce the same (or almost
the same) analytical results.
• Overview of Data Reduction Strategies
• Wavelet Transforms
• Principal Component Analysis
• Attribute Subset Selection
• Regression and Log-Linear Models: Parametric Data Reduction
• Histograms
• Clustering
• Sampling
• Data Cube Aggregation
Lecture Notes-19:
PCA:
Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of
large datasets while preserving most of the important variation in the data. PCA involves
transforming a set of correlated variables into a new set of uncorrelated variables called principal
components.
The first principal component accounts for the maximum amount of variation in the data, while each
subsequent component accounts for as much of the remaining variation as possible.
• Suppose we have a dataset of 2-dimensional points (x, y) with the following values:
• (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)
• We want to reduce the dimensionality of this dataset using PCA.
• We first standardize the data by subtracting the mean of each variable and dividing it by the
standard deviation. This ensures that the variables are on the same scale.
• (𝜇𝑥, 𝜇𝑦) = (3, 3)
• (𝜎𝑥, 𝜎𝑦) = (1.414, 1.414)
• Standardized dataset:
• (-1.414, -1.414), (-0.707, -0.707), (0, 0), (0.707, 0.707), (1.414, 1.414)
• We then compute the principal components by eigen-decomposition of the covariance matrix. The
eigenvectors of the covariance matrix are the principal components, while the corresponding
eigenvalues represent the amount of variance explained by each component.
• The eigenvectors of the covariance matrix are:
• [0.707, 0.707]
• [0.707, -0.707]
• The corresponding eigenvalues are:
• [2.0, 0.0]
• The first principal component accounts for 100% of the variation in the data, while the second
principal component accounts for 0%. This means that we can reduce the dimensionality of
the dataset from 2 to 1 by projecting the data onto the first principal component.
• The PCA class from sklearn.decomposition is used to perform PCA on the dataset. We first
standardize the dataset using the mean and standard deviation of each feature, and then apply
PCA with n_components=1 to obtain a one-dimensional representation of the data.
• The explained_variance_ratio_ attribute of the PCA object returns the proportion of the total
variance in the data that is explained by each principal component. In this case, there is only
one principal component and it explains 100% of the variance in the data.
• The fit_transform() method of the PCA object is used to compute the principal components
of the dataset and return the projected dataset. The resulting principal_components variable
contains the projected dataset in one dimension.
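A minimal code sketch of the steps described above is given below (standardization is done manually
with NumPy, and PCA is applied with scikit-learn):

    # PCA on the five 2-D points from the worked example.
    import numpy as np
    from sklearn.decomposition import PCA

    data = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]], dtype=float)

    # Standardize: subtract the mean and divide by the standard deviation of each feature.
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)

    # Project onto the first principal component.
    pca = PCA(n_components=1)
    principal_components = pca.fit_transform(standardized)

    print(pca.explained_variance_ratio_)  # ~[1.0]: the first component explains all the variance
    print(principal_components.ravel())   # one-dimensional representation of the five points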
Lecture Notes-20:
Numerosity reduction
• Numerosity reduction is the process of simplifying or reducing a large set of data or
information to a more manageable and understandable form, while still retaining the important
features and characteristics of the original data.
• One example of numerosity reduction is summarizing a long text into a brief paragraph or
bullet points, which capture the main points of the text.
• These techniques replace the original data volume with alternative, smaller forms of data
representation. These techniques may be parametric or nonparametric. For parametric
methods, a model is used to estimate the data, so that typically only the data parameters need
to be stored, instead of the actual data. (Outliers may also be stored.) Regression and log-
linear models are examples.
• Nonparametric methods for storing reduced representations of the data include histograms,
clustering, sampling, and data cube aggregation; a small histogram sketch is given below.
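As a small illustration of nonparametric numerosity reduction, the sketch below replaces the raw
price values used earlier by a handful of (bin, count) pairs (the choice of three equal-width bins
is an arbitrary assumption):

    # Histogram-based numerosity reduction: store bin boundaries and counts instead of raw values.
    import numpy as np

    prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
    counts, edges = np.histogram(prices, bins=3)   # 3 equal-width bins

    for count, low, high in zip(counts, edges[:-1], edges[1:]):
        print(f"[{low:.0f}, {high:.0f}): {count} values")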
Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes
(or dimensions). The goal of attribute subset selection is to find a minimum set of attributes such that
the resulting probability distribution of the data classes is as close as possible to the original
distribution obtained using all attributes.
Basic heuristic methods of attribute subset selection include the techniques that follow, some of
which are illustrated in the figure below.
Attribute subset selection follows the approaches below:
• Stepwise forward selection:
• The procedure starts with an empty set of attributes as the reduced set.
• The best of the original attributes is determined and added to the reduced set.
• At each subsequent iteration or step, the best of the remaining original attributes is added to
the set.
• Stepwise backward elimination:
• The procedure starts with the full set of attributes.
• At each step, it removes the worst attribute remaining in the set.
• Combination of forward selection and backward elimination:
• The stepwise forward selection and backward elimination methods can be combined so that,
at each step, the procedure selects the best attribute and removes the worst from among the
remaining attributes.
• Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were
originally intended for classification. Decision tree induction constructs a flowchart-like
structure where each internal (non-leaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node denotes a class
prediction.
• At each node, the algorithm chooses the “best” attribute to partition the data into individual
classes.
• When decision tree induction is used for attribute subset selection, a tree is constructed from
the given data.
• All attributes that do not appear in the tree are assumed to be irrelevant.
• The set of attributes appearing in the tree forms the reduced subset of attributes (a short code
sketch of this idea follows the list below).
• The stopping criteria for the methods may vary. The procedure may employ a threshold on
the measure used to determine when to stop the attribute selection process.
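A hedged sketch of attribute subset selection by decision tree induction is given below: a tree is
fit on synthetic data and only the attributes that actually appear in the tree are kept (the data,
the depth limit, and the attribute names A1–A6 are illustrative assumptions):

    # Decision-tree-based attribute subset selection sketch using scikit-learn.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))            # six candidate attributes A1..A6
    y = (X[:, 0] + X[:, 2] > 0).astype(int)  # only A1 and A3 influence the class here

    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    used = sorted(set(tree.tree_.feature[tree.tree_.feature >= 0]))  # leaf nodes are marked -2
    print("attributes appearing in the tree:", [f"A{i + 1}" for i in used])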
Forward Selection:
• Let's say we have a dataset with 10 attributes (A1, A2, ..., A10) and we want to select a subset
of attributes that will yield the best performance on a classification task. We will use forward
selection to iteratively add attributes to the subset:
• Start with an empty set of attributes.
• Train a model using each attribute individually and select the attribute that yields the best
performance (e.g., highest accuracy or lowest error rate).
• Add the selected attribute to the attribute subset.
• Train a model using the attribute subset and each remaining attribute individually. Select the
attribute that, when combined with the attribute subset, yields the best performance.
• Add the selected attribute to the attribute subset.
• Repeat the previous two steps until a stopping criterion is met (e.g., a maximum number of
attributes in the subset is reached or the performance no longer improves).
• For example, suppose we have the performance metrics for each attribute shown in the table at the
end of this example (a code sketch of the same greedy procedure follows the table).
• Using forward selection, we would start by selecting A5 as the first attribute in the subset,
since it has the highest accuracy (A3 ties at 0.80; assume the tie is broken in favour of A5).
• Then we would train models using the attribute subset {A5} and each of the remaining
attributes and select the attribute that yields the best performance.
• In this case, A3 would be selected since it yields the highest accuracy when combined with
A5. We would then add A3 to the subset and repeat the process. The next best attribute to add
to the subset would be A2, and so on.
• The final attribute subset selected by forward selection would depend on the stopping criterion
chosen. For example, if we set a maximum subset size of three, the final subset selected by
forward selection would be {A5, A3, A2}.
Attribute   Accuracy      Attribute   Accuracy      Attribute   Accuracy
A1          0.70          A5          0.80          A9          0.70
A2          0.75          A6          0.75          A10         0.60
A3          0.80          A7          0.65
A4          0.70          A8          0.75
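The greedy procedure above can be sketched with scikit-learn's SequentialFeatureSelector (the
dataset here is synthetic, generated with make_classification rather than taken from the A1–A10
table, and the estimator and subset size of three are illustrative choices):

    # Stepwise forward selection sketch using scikit-learn.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)

    selector = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=3,   # stopping criterion: a maximum subset size of three
        direction="forward",      # "backward" gives stepwise backward elimination
        scoring="accuracy",
        cv=5,
    )
    selector.fit(X, y)
    print("selected attributes:", [f"A{i + 1}" for i in np.flatnonzero(selector.get_support())])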
Lecture Notes-21:
Lecture Notes-22:
• Ex: Min-max normalization. Suppose that the minimum and maximum values for the
attribute income are $12,000 and $98,000, respectively.
• We would like to map income to the range [0.0, 1.0]. In general, min-max normalization maps a
value 𝑣 of A to 𝑣′ = (𝑣 − min_A) / (max_A − min_A) × (new_max − new_min) + new_min. By min-max
normalization, a value of $73,600 for income is therefore transformed to
• 𝑣′ = (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
• In z-score normalization (or zero-mean normalization), the values for an attribute, A, are
normalized based on the mean (i.e., average) and standard deviation of A. A value, 𝑣𝑖, of A
is normalized to 𝑣i′ by computing.
• 𝑣𝑖′ = (𝑣𝑖 − 𝐴̅) / 𝜎𝐴
• where 𝐴̅ and 𝜎𝐴 are the mean and standard deviation, respectively, of attribute A.
• Ex: z-score normalization. Suppose that the mean and standard deviation of the values for
the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a
value of $73,600 for income is transformed to
• 𝑣𝑖′ = (73,600 − 54,000) / 16,000 = 1.225
• Ex: Decimal scaling. Suppose that the recorded values of A range from −986 to 917.
• The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore
divide each value by 1000 (i.e., j =3) so that −986 normalizes to −0.986 and 917 normalizes
to 0.917.
• Note that normalization can change the original data quite a bit, especially when using z-
score normalization or decimal scaling.
• It is also necessary to save the normalization parameters (e.g., the mean and standard
deviation if using z-score normalization) so that future data can be normalized in a uniform
manner.
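The three normalization examples above can be reproduced with a few lines of NumPy (the numbers are
taken directly from the examples):

    # Min-max, z-score, and decimal-scaling normalization of the income examples.
    import numpy as np

    income = 73600.0

    # Min-max normalization to [0.0, 1.0] with min = 12,000 and max = 98,000.
    minmax = (income - 12000) / (98000 - 12000) * (1.0 - 0.0) + 0.0   # -> 0.716

    # Z-score normalization with mean = 54,000 and standard deviation = 16,000.
    zscore = (income - 54000) / 16000                                 # -> 1.225

    # Decimal scaling: divide by 10**j with j = 3, since max(|v|) = 986 < 1000.
    decimal = np.array([-986.0, 917.0]) / 10 ** 3                     # -> [-0.986, 0.917]

    print(round(minmax, 3), round(zscore, 3), decimal)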
Data Integration: The merging of data from multiple data stores is known as data integration.
Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data
set. This can help improve the accuracy and speed of the subsequent data science tasks.
• Below are the challenges in integrating data from different sources:
• Entity identification problem
• Redundancy and Correlation Analysis
• Tuple Duplication.
• Data Value Conflict Detection and Resolution
Lecture Notes-23:
Tuple Duplication:
• In addition to detecting redundancies between attributes, duplication should also be detected
at the tuple level (e.g., where there are two or more identical tuples for a given unique data
entry case). The use of denormalized tables (often done to improve performance by avoiding
joins) is another source of data redundancy.
• Inconsistencies often arise between various duplicates, due to inaccurate data entry or
updating some but not all data occurrences.
• For example, if a purchase order database contains attributes for the purchaser’s name and
address instead of a key to this information in a purchaser database, discrepancies can occur,
such as the same purchaser’s name appearing with different addresses within the purchase
order database.
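A small pandas sketch of tuple-level duplicate detection is given below (the purchase-order columns
and values are hypothetical):

    # Detecting exact duplicate tuples and inconsistent duplicates with pandas.
    import pandas as pd

    orders = pd.DataFrame({
        "purchaser": ["A. Rao", "A. Rao", "B. Devi", "A. Rao"],
        "address":   ["Rajam", "Rajam", "Vizag", "Srikakulam"],
        "amount":    [120, 120, 250, 120],
    })

    # Exact duplicate tuples (identical in every attribute).
    print(orders[orders.duplicated(keep=False)])

    # Inconsistent duplicates: the same purchaser appearing with different addresses.
    conflicts = orders.groupby("purchaser")["address"].nunique()
    print(conflicts[conflicts > 1])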
7. Keywords
Mean
Median
Mode
Point and Interval Estimate
8. Sample Questions
Remember:
1. List different statistics.
2. Define confidence interval.
3. Define the interquartile range (IQR).
Understand:
1. List and describe the methods used in descriptive statistics.
2. Explain the operations on data.
9. Stimulating Question (s)
1. What is the need for statistics?
10. Mind Map
11. Student Summary
At the end of this session, the facilitator (teacher) shall randomly pick a few students to
summarize the deliverables.
1. Stephen Marsland, "Machine Learning: An Algorithmic Perspective", CRC Press, 2009.
2. Tom M. Mitchell, "Machine Learning", Tata McGraw-Hill, 1997.
NIL
---------------