
GMR Institute of Technology GMRIT/ADM/F-44

Rajam, AP REV.: 00
(An Autonomous Institution Affiliated to JNTUGV, AP)

Cohesive Teaching – Learning Practices (CTLP)

Class 4th Sem. – B. Tech Department: CSE-AI&ML


Course Fundamentals of Machine Learning Course Code 21ML405
Prepared by Dr. S. Akila Agnes, Ms Manisha Das
Lecture Topic Preprocessing Pipeline, Forms of Preprocessing, Data Cleaning, Data
Integration, Data Reduction, Data Transformation and Discretization.
Course Outcome (s) CO3 Program Outcome (s) PO1, PO2, PSO1, PSO2
Duration 50 Min Lecture 17-23 Unit – II
Pre-requisite (s) Fundamentals of Python

1. Objective

 Understand the need for preprocessing raw data.


 Gain knowledge of different preprocessing techniques.

2. Intended Learning Outcomes (ILOs)

At the end of this session the students will be able to:

1. Understand various data preprocessing techniques.

3. 2D Mapping of ILOs with Knowledge Dimension and Cognitive Learning Levels of RBT

Cognitive Learning Levels


Knowledge
Remember Understand Apply Analyze Evaluate Create
Dimension
Factual  
Conceptual 
Procedural
Meta Cognitive

4. Teaching Methodology
 Power Point Presentation, Chalk Talk, visual presentation

5. Evocation
Fig. 1: Evocation
6. Deliverables

Lecture Notes-17:

Data Preprocessing:

Data preprocessing is an important step in data science. It refers to cleaning, transforming,
and integrating data in order to make it ready for analysis. The goal of data preprocessing is to
improve the quality of the data and to make it more suitable for the specific data mining task.

Some common steps in data preprocessing include:

• Data cleaning: this step involves identifying and removing missing, inconsistent, or
irrelevant data. This can include removing duplicate records, filling in missing values, and
handling outliers.
• Data integration: this step involves combining data from multiple sources, such as
databases, spreadsheets, and text files. The goal of integration is to create a single, consistent
view of the data.
• Data transformation: this step involves converting the data into a format that is more suitable
for the data science tasks. This can include normalizing numerical data, creating dummy
variables, and encoding categorical data.
• Data reduction: this step is used to select a subset of the data that is relevant to the data
science task. This can include feature selection (selecting a subset of the variables) or feature
extraction (extracting new variables from the data).
• Data discretization: this step is used to convert continuous numerical data into categorical
data, which can be used for decision tree and other categorical data mining techniques.
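
As a rough illustration of how these steps can be chained in Python (using pandas and scikit-learn; the data frame and column names below are hypothetical, not taken from any particular dataset):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# hypothetical raw data with a missing value and a categorical column
df = pd.DataFrame({
    "age": [25, 32, None, 45, 29],
    "income": [30000, 52000, 41000, 98000, 36000],
    "city": ["Rajam", "Vizag", "Rajam", "Vizag", "Rajam"],
})
# data cleaning: fill the missing age with the column mean
df["age"] = df["age"].fillna(df["age"].mean())
# data transformation: scale the numeric columns to [0, 1]
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])
# data transformation: encode the categorical column as dummy variables
df = pd.get_dummies(df, columns=["city"])
# data discretization: bin the scaled age into three labeled intervals
df["age_group"] = pd.cut(df["age"], bins=3, labels=["low", "mid", "high"])
print(df)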

Data Cleaning: Real-world data tends to be incomplete, noisy, and inconsistent.


Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data. It can be done in the following three ways:
• Handling Missing values
• Noisy Data
• Data Cleaning as Process

Handling Missing values:


• Ignore the tuple (row): This approach is suitable only when the dataset is quite large and
multiple values are missing within a tuple. It is an option only if the tuples containing missing
values are about 2% or less of the data, and it works best when values are Missing
Completely At Random (MCAR).
• Fill in the missing value manually: In general, this approach is time consuming and may not
be feasible given a large data set with many missing values.
• Use a global constant to fill in the missing value: Replace all missing attribute values by
the same constant.
• Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the
missing value
• Use the attribute mean or median for all samples belonging to the same class as the given
tuple
• Use the most probable value to fill in the missing value
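
A short pandas sketch of these options (the values and column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({"income": [50.0, None, 70.0, None, 90.0],
                   "class":  ["A", "A", "B", "B", "B"]})
# ignore (drop) tuples with missing values
dropped = df.dropna()
# fill with a global constant
constant_filled = df.fillna({"income": -1})
# fill with a measure of central tendency (mean or median)
median_filled = df.fillna({"income": df["income"].median()})
# fill with the mean of the samples belonging to the same class as the tuple
class_filled = df.copy()
class_filled["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))
print(class_filled)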

Noisy Data: “What is noise?” Noise is a random error or variance in a measured variable.
• some basic statistical description techniques (e.g., boxplots and scatter plots), and methods of
data visualization can be used to identify outliers, which may represent noise.
• Noise can be handled through the following techniques:
• Binning Method
• Regression
• Outlier analysis or Clustering

Lecture Notes-18:

Methods to handle Noisy Data:

Binning Method:
• This method works on sorted data in order to smooth it.
• The whole data is divided into segments of equal size and then various methods are performed
to complete the task.
• Each segment is handled separately.
• One can replace all data in a segment by its mean, or the boundary values can be used to complete
the task.
• The example below illustrates some binning techniques. In this example, the data for price are
first sorted and then partitioned into equal-frequency bins of size 3 (i.e., each bin contains
three values).
• Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34.
• Partition into (equal frequency) bins:
• Bin 1: 4, 8, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 28, 34
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example,
the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced
by the value 9.
• Smoothing by bin means:
• Bin 1: 9, 9, 9
• Bin 2: 22, 22, 22
• Bin 3: 29, 29, 29

Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the
bin median.
• In smoothing by bin boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
• Smoothing by bin boundaries:
• Bin 1: 4, 4, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 25, 34
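
The bin means and bin boundaries above can be reproduced with a short NumPy sketch:

import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])   # already sorted
bins = prices.reshape(3, 3)                              # equal-frequency bins of size 3
# smoothing by bin means: replace every value with its bin mean
by_means = np.repeat(bins.mean(axis=1, keepdims=True), 3, axis=1)
# smoothing by bin boundaries: replace each value with the closest boundary
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)
print(by_means)    # [[ 9.  9.  9.] [22. 22. 22.] [29. 29. 29.]]
print(by_bounds)   # [[ 4  4 15] [21 21 24] [25 25 34]]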

Regression: Data smoothing can also be done by regression, a technique that conforms data values
to a function. Linear regression involves finding the “best” line to fit two attributes (or variables) so
that one attribute can be used to predict the other. Multiple linear regression is an extension of linear
regression, where more than two attributes are involved, and the data are fit to a multidimensional
surface.
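
A minimal sketch of smoothing by simple linear regression with scikit-learn (the x and y values here are hypothetical): a line is fitted and each observed y is replaced by its fitted value.

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)   # predictor attribute
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])        # noisy response attribute
model = LinearRegression().fit(x, y)
y_smoothed = model.predict(x)                  # values conformed to the fitted line
print(model.coef_[0], model.intercept_)        # slope and intercept of the "best" line
print(y_smoothed)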

Outlier analysis: Outliers may be detected by clustering, for example, where similar values are
organized into groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be
considered outliers.

Data Reduction:
• Data reduction techniques can be applied to obtain a reduced representation of the data set
that is much smaller in volume, yet closely maintains the integrity of the original data. That
is, mining on the reduced data set should be more efficient yet produce the same (or almost
the same) analytical results.
• Overview of Data Reduction Strategies
• Wavelet Transforms
• Principal Component Analysis
• Attribute Subset Selection
• Regression and Log-Linear Models: Parametric Data Reduction
• Histograms
• Clustering
• Sampling
• Data Cube Aggregation

Overview of Data Reduction Strategies:


• Data reduction strategies include dimensionality reduction, numerosity reduction, and
data compression.
• Dimensionality reduction methods include wavelet transforms and principal components
analysis, which transform or project the original data onto a smaller space.
• Attribute subset selection is a method of dimensionality reduction in which irrelevant,
weakly relevant, or redundant attributes or dimensions are detected and removed.

Lecture Notes-19:

PCA:

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of
large datasets while preserving most of the important variation in the data. PCA involves
transforming a set of correlated variables into a new set of uncorrelated variables called principal
components.

The first principal component accounts for the maximum amount of variation in the data, while each
subsequent component accounts for as much of the remaining variation as possible.

Here's a simple example of PCA:

• Suppose we have a dataset of 2-dimensional points (x, y) with the following values:
• (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)
• We want to reduce the dimensionality of this dataset using PCA.

Step 1: Standardize the data.

• We first standardize the data by subtracting the mean of each variable and dividing it by the
standard deviation. This ensures that the variables are on the same scale.
• (𝜇𝑥, 𝜇𝑦) = (3, 3)
• (𝜎𝑥, 𝜎𝑦) = (1.414, 1.414)
• Standardized dataset:
• (-1.414, -1.414), (-0.707, -0.707), (0, 0), (0.707, 0.707), (1.414, 1.414)

Mean and Standard deviation calculation:


• The values of the first column of the data array are [1, 2, 3, 4, 5].
• mean = (1 + 2 + 3 + 4 + 5) / 5
• = 15 / 5
• =3
• deviations = [1 - 3, 2 - 3, 3 - 3, 4 - 3, 5 - 3]
• = [-2, -1, 0, 1, 2]
• deviations_squared = [(-2)², (-1)², 0², 1², 2²]
• = [4, 1, 0, 1, 4]
• variance = (4 + 1 + 0 + 1 + 4) / 5
• = 10 / 5 = 2
• standard_deviation = sqrt(2) = 1.41

Step 2: Compute the covariance matrix.


• Next, we compute the covariance matrix of the standardized dataset.
• Covariance matrix:
• [[1.0, 1.0],
• [1.0, 1.0]]

Step 3: Compute the principal components.

• We can compute the principal components of the covariance matrix using eigen-
decomposition. The eigenvectors of the covariance matrix represent the principal components,
while the corresponding eigenvalues represent the amount of variance explained by each
component.
• The eigenvectors of the covariance matrix are:
• [0.707, 0.707] (first principal component)
• [0.707, -0.707] (second principal component)
• The corresponding eigenvalues are:
• [2.0, 0.0]
• The first principal component accounts for 100% of the variation in the data, while the second
principal component accounts for 0%. This means that we can reduce the dimensionality of
the dataset from 2 to 1 by projecting the data onto the first principal component.

Step 4: Project the data onto the first principal component.


• To project the data onto the first principal component, we take the dot product of each point
in the dataset with the first eigenvector.
• Projected dataset:
• [-2.0, -1.0, 0.0, 1.0, 2.0]
• The projected dataset represents the original dataset in one dimension, while preserving most
of the important variation in the data.
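Steps 1–4 above can also be verified with a plain NumPy sketch (the eigenvector signs returned by the solver may flip, so the projections are correct up to sign):

import numpy as np

data = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]], dtype=float)
# Step 1: standardize (population standard deviation, as in the hand calculation)
z = (data - data.mean(axis=0)) / data.std(axis=0)
# Step 2: covariance matrix of the standardized data
cov = np.cov(z, rowvar=False, bias=True)       # [[1. 1.] [1. 1.]]
# Step 3: eigen-decomposition (eigenvalues returned in ascending order)
eigvals, eigvecs = np.linalg.eigh(cov)
first_pc = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue
# Step 4: project each standardized point onto the first principal component
projected = z @ first_pc
print(eigvals)     # [0. 2.]
print(first_pc)    # [0.707 0.707] (up to sign)
print(projected)   # [-2. -1.  0.  1.  2.] (up to sign)
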
The Python code to perform PCA on the example dataset we used earlier:
import numpy as np
from sklearn.decomposition import PCA
# create dataset
data = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]])
# standardize the data
data_std = (data - np.mean(data, axis=0)) / np.std(data, axis=0)
# perform PCA
pca = PCA(n_components=1)
principal_components = pca.fit_transform(data_std)
# print explained variance ratio and principal components
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Principal components:\n", principal_components)

• The PCA class from sklearn.decomposition is used to perform PCA on the dataset. We first
standardize the dataset using the mean and standard deviation of each feature, and then apply
PCA with n_components=1 to obtain a one-dimensional representation of the data.
• The explained_variance_ratio_ attribute of the PCA object returns the proportion of the total
variance in the data that is explained by each principal component. In this case, there is only
one principal component and it explains 100% of the variance in the data.
• The fit_transform() method of the PCA object is used to compute the principal components
of the dataset and return the projected dataset. The resulting principal_components variable
contains the projected dataset in one dimension.

Lecture Notes-20:

Numerosity reduction
• Numerosity reduction is the process of simplifying or reducing a large set of data or
information to a more manageable and understandable form, while still retaining the important
features and characteristics of the original data.
• One example of numerosity reduction is summarizing a long text into a brief paragraph or
bullet points, which capture the main points of the text.
• These techniques replace the original data volume with alternative, smaller forms of data
representation. These techniques may be parametric or nonparametric. For parametric
methods, a model is used to estimate the data, so that typically only the data parameters need
to be stored, instead of the actual data. (Outliers may also be stored.) Regression and log-
linear models are examples.
• Nonparametric methods for storing reduced representations of the data include histograms,
clustering, sampling, and data cube aggregation.

In data compression, transformations are applied to obtain a reduced or “compressed”
representation of the original data.
• If the original data can be reconstructed from the compressed data without any information
loss, the data reduction is called lossless.
• If, instead, we can reconstruct only an approximation of the original data, then the data
reduction is called lossy. There are several lossless algorithms for string compression;
however, they typically allow only limited data manipulation.
• Dimensionality reduction and numerosity reduction techniques can also be considered forms
of data compression.

• Here's an example of numerosity reduction:


• Original text:
• A recent study found that people who consume more than 3 cups of coffee per day are at a
higher risk of developing heart disease. The study involved over 10,000 participants and
followed their coffee consumption habits and health outcomes for 10 years. The researchers
found that people who drank 3 or more cups of coffee per day had a 21% higher risk of
developing heart disease than those who drank less than 1 cup per day.
• Reduced version:
• A study with over 10,000 participants found that consuming more than 3 cups of coffee per
day increases the risk of heart disease by 21%.
• In this example, the original text was reduced to a single sentence that captures the main
findings of the study, while still conveying the essential information.

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes
(or dimensions). The goal of attribute subset selection is to find a minimum set of attributes such that
the resulting probability distribution of the data classes is as close as possible to the original
distribution obtained using all attributes.

Basic heuristic methods of attribute subset selection include the techniques that follow, some of
which are illustrated in the figure below.
Attribute subset selection follows the approaches below:
• Stepwise forward selection:
• The procedure starts with an empty set of attributes as the reduced set.
• The best of the original attributes is determined and added to the reduced set.
• At each subsequent iteration or step, the best of the remaining original attributes is added to
the set.
• Stepwise backward elimination:
• The procedure starts with the full set of attributes.
• At each step, it removes the worst attribute remaining in the set.
• Combination of forward selection and backward elimination:
• The stepwise forward selection and backward elimination methods can be combined so that,
at each step, the procedure selects the best attribute and removes the worst from among the
remaining attributes.
• Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were
originally intended for classification. Decision tree induction constructs a flowchart like
structure where each internal (non-leaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node denotes a class
prediction.
• At each node, the algorithm chooses the “best” attribute to partition the data into individual
classes.
• When decision tree induction is used for attribute subset selection, a tree is constructed from
the given data.
• All attributes that do not appear in the tree are assumed to be irrelevant.
• The set of attributes appearing in the tree form the reduced subset of attributes.
• The stopping criteria for the methods may vary. The procedure may employ a threshold on
the measure used to determine when to stop the attribute selection process.
Forward Selection:
• Let's say we have a dataset with 10 attributes (A1, A2, ..., A10) and we want to select a subset
of attributes that will yield the best performance on a classification task. We will use forward
selection to iteratively add attributes to the subset:
• Start with an empty set of attributes.
• Train a model using each attribute individually and select the attribute that yields the best
performance (e.g., highest accuracy or lowest error rate).
• Add the selected attribute to the attribute subset.
• Train a model using the attribute subset and each remaining attribute individually. Select the
attribute that, when combined with the attribute subset, yields the best performance.
• Add the selected attribute to the attribute subset.
• Repeat steps 4 and 5 until a stopping criterion is met (e.g., a maximum number of attributes
in the subset is reached or the performance no longer improves).
• For example, let's say we have the following performance metrics for each attribute:
• Using forward selection, we would start by selecting A5 as the first attribute in the subset,
since it has the highest accuracy.
• Then we would train models using the attribute subset {A5} and each of the remaining
attributes and select the attribute that yields the best performance.
• In this case, A3 would be selected since it yields the highest accuracy when combined with
A5. We would then add A3 to the subset and repeat the process. The next best attribute to add
to the subset would be A2, and so on.
• The final attribute subset selected by forward selection would depend on the stopping criterion
chosen. For example, if we set a maximum subset size of three, the final subset selected by
forward selection would be {A5, A3, A2}.
  Attribute   Accuracy      Attribute   Accuracy      Attribute   Accuracy
  A1          0.70          A5          0.80          A9          0.70
  A2          0.75          A6          0.75          A10         0.60
  A3          0.80          A7          0.65
  A4          0.70          A8          0.75
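
A minimal sketch of the greedy forward-selection loop. The evaluate() function is a placeholder assumption: in practice it would train a classifier on the candidate subset and return a validation accuracy, while here a toy score is used only to illustrate the control flow.

def forward_selection(all_attributes, evaluate, max_size=3):
    # greedily add the attribute that most improves the evaluation score
    selected, best_score = [], float("-inf")
    while len(selected) < max_size:
        candidates = [a for a in all_attributes if a not in selected]
        score, attr = max((evaluate(selected + [a]), a) for a in candidates)
        if score <= best_score:        # stop when performance no longer improves
            break
        selected.append(attr)
        best_score = score
    return selected

# toy score: sum of the individual accuracies of the chosen attributes
toy_accuracy = {"A5": 0.80, "A3": 0.80, "A2": 0.75, "A1": 0.70}
subset = forward_selection(list(toy_accuracy),
                           evaluate=lambda s: sum(toy_accuracy[a] for a in s))
print(subset)      # ['A5', 'A3', 'A2'] with a maximum subset size of three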
Lecture Notes-21:

Regression and Log-Linear Models: Parametric Data Reduction


• Regression and log-linear models are statistical methods used in parametric data reduction.
• These methods aim to identify the underlying relationships between variables and simplify
the complexity of the data by fitting a mathematical model to the data.
• Regression models are used to describe the relationship between a dependent variable and
one or more independent variables.
• For example, a simple linear regression model could be used to predict the weight of a
person based on their height. In this case, height would be the independent variable and
weight would be the dependent variable.
• Regression and log-linear models can be used to approximate the given data.
• In (simple) linear regression, the data are modeled to fit a straight line.
• For example, a random variable, y (called a response/dependent variable), can be modeled
as a linear function of another random variable, x (called a predictor/independent variable),
with the equation
• 𝑦 = 𝑚𝑥 + 𝑐 or 𝑦 = 𝑤𝑥 + 𝑏
• where the variance of y is assumed to be constant. In the context of data mining, x and y are
numeric database attributes.
• The coefficients, w and b (called regression coefficients) specify the slope of the line and the
y-intercept, respectively.
• Multiple linear regression is an extension of (simple) linear regression, which allows a
response variable, y, to be modeled as a linear function of two or more predictor variables
• Both regression and log-linear models are parametric, meaning that they make assumptions
about the distribution of the data and the functional form of the relationship between
variables. These assumptions allow the models to estimate parameters that can be used to
predict outcomes or describe the relationship between variables.
• Parametric data reduction using regression and log-linear models can be useful in many
applications, such as data mining, marketing research, and predictive analytics. These
models can help identify key factors that influence a particular outcome and can be used to
develop models that can be used to make predictions or inform decision making.
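
A brief sketch of the parametric idea with simple linear regression (hypothetical height/weight data): only the two fitted coefficients w and b need to be stored instead of the raw tuples.

import numpy as np

x = np.array([150, 160, 170, 180, 190], dtype=float)   # height (cm), predictor
y = np.array([52, 58, 66, 74, 80], dtype=float)        # weight (kg), response
# least-squares estimates of slope w and intercept b for y = w*x + b
w, b = np.polyfit(x, y, deg=1)
print(w, b)                    # the stored model parameters
# the data can later be approximated from the model when needed
y_hat = w * x + b
print(y_hat)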
Clustering
• Clustering techniques consider data tuples as objects.
• They partition the objects into groups, or clusters, so that objects within a cluster are
“similar” to one another and “dissimilar” to objects in other clusters.
• Similarity is commonly defined in terms of how “close” the objects are in space, based on a
distance function. The “quality” of a cluster may be represented by its diameter, the
maximum distance between any two objects in the cluster.
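
A short scikit-learn sketch (with made-up one-dimensional values): the cluster centroids serve as the reduced representation, and each cluster's diameter is computed as a quality measure.

import numpy as np
from sklearn.cluster import KMeans

values = np.array([[2], [3], [4], [20], [21], [22], [50], [51], [52]])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(values)
print(kmeans.cluster_centers_.ravel())     # reduced representation: three centroids
# diameter of a cluster = maximum distance between any two of its members
for label in range(3):
    members = values[kmeans.labels_ == label].ravel()
    print(label, members.max() - members.min())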
Data Cube Aggregation
• Imagine that you have collected the data for your analysis. These data consist of the All
Electronics sales per quarter, for the years 2008 to 2010.
• You are, however, interested in the annual sales (total per year), rather than the total per
quarter. Thus, the data can be aggregated so that the resulting data summarizes the total
sales per year instead of per quarter.
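
A minimal pandas sketch of this aggregation, using made-up quarterly sales figures:

import pandas as pd

sales = pd.DataFrame({
    "year":    [2008] * 4 + [2009] * 4 + [2010] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "sales":   [224, 408, 350, 586, 230, 420, 360, 600, 240, 430, 370, 610],
})
# aggregate the quarterly data so it summarizes total sales per year
annual = sales.groupby("year", as_index=False)["sales"].sum()
print(annual)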

Data Transformation and Data Discretization:


• In data transformation, the data are transformed or consolidated into forms appropriate for
mining. Strategies for data transformation include the following:
• Smoothing, which works to remove noise from the data. Techniques include binning,
regression, and clustering.
• Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
• Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual total
amounts. This step is typically used in constructing a data cube for data analysis at multiple
abstraction levels.
• Normalization, where the attribute data are scaled so as to fall within a smaller range, such
as −1.0 to 1.0, or 0.0 to 1.0.
• Discretization, where the raw values of a numeric attribute (e.g., age) are replaced by
interval labels (e.g., 0–10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).
• The figure below shows a concept hierarchy for the attribute price. More than one concept
hierarchy can be defined for the same attribute to accommodate the needs of various users.
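
A small pandas sketch of discretization, replacing raw ages by interval labels and conceptual labels (the cut points used here are illustrative assumptions):

import pandas as pd

ages = pd.Series([5, 13, 22, 37, 45, 68, 81])
# interval labels such as (0, 10], (10, 20], ...
interval_labels = pd.cut(ages, bins=[0, 10, 20, 60, 100])
# conceptual labels
concept_labels = pd.cut(ages, bins=[0, 20, 60, 100],
                        labels=["youth", "adult", "senior"])
print(pd.DataFrame({"age": ages, "interval": interval_labels,
                    "concept": concept_labels}))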

Lecture Notes-22:

• Data Transformation by Normalization:


• This involves transforming the data to fall within a smaller or common range such as [−1,1]
or [0.0, 1.0].
• Normalizing the data attempts to give all attributes an equal weight.
• Normalization is particularly useful for classification and clustering.
• There are many methods for data normalization. We study min-max normalization, z-score
normalization, and normalization by decimal scaling. For our discussion, let A be a numeric
attribute with n observed values, 𝑣1, 𝑣2, . . . , 𝑣𝑛.
• Min-max normalization performs a linear transformation on the original data. Suppose that
minA and maxA are the minimum and maximum values of an attribute, A.
• Min-max normalization maps a value, 𝑣𝑖, of A to 𝑣′𝑖 in the range [𝑛𝑒𝑤_𝑚𝑖𝑛𝐴, 𝑛𝑒𝑤_𝑚𝑎𝑥𝐴] by computing
• 𝑣′𝑖 = ((𝑣𝑖 − 𝑚𝑖𝑛𝐴) / (𝑚𝑎𝑥𝐴 − 𝑚𝑖𝑛𝐴)) × (𝑛𝑒𝑤_𝑚𝑎𝑥𝐴 − 𝑛𝑒𝑤_𝑚𝑖𝑛𝐴) + 𝑛𝑒𝑤_𝑚𝑖𝑛𝐴

• Ex: Min-max normalization. Suppose that the minimum and maximum values for the
attribute income are $12,000 and $98,000, respectively.
• We would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of
$73,600 for income is transformed to
• 𝑣′𝑖 = ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
• In z-score normalization (or zero-mean normalization), the values for an attribute, A, are
normalized based on the mean (i.e., average) and standard deviation of A. A value, 𝑣𝑖, of A
is normalized to 𝑣′𝑖 by computing
• 𝑣′𝑖 = (𝑣𝑖 − 𝐴̅) / 𝜎𝐴
• where 𝐴̅ and 𝜎𝐴 are the mean and standard deviation, respectively, of attribute A.
• Ex: z-score normalization. Suppose that the mean and standard deviation of the values for
the attribute income are $54,000 and $16,000, respectively. With z-score normalization, a
value of $73,600 for income is transformed to
• 𝑣′𝑖 = (73,600 − 54,000) / 16,000 = 1.225
• Normalization by decimal scaling: Suppose that the recorded values of A range from −986 to 917.
• The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore
divide each value by 1000 (i.e., j =3) so that −986 normalizes to −0.986 and 917 normalizes
to 0.917.
• Note that normalization can change the original data quite a bit, especially when using z-
score normalization or decimal scaling.
• It is also necessary to save the normalization parameters (e.g., the mean and standard
deviation if using z-score normalization) so that future data can be normalized in a uniform
manner.
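
The three methods can be checked with a short NumPy sketch; the income values reuse the figures from the examples above, and the z-score line deliberately plugs in the stated mean of $54,000 and standard deviation of $16,000.

import numpy as np

income = np.array([12000, 54000, 73600, 98000], dtype=float)
# min-max normalization to [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())
# z-score normalization using the example mean and standard deviation
z_score = (income - 54000) / 16000
# decimal scaling: divide by 10^j, the smallest power of 10 making all |values| < 1
j = int(np.ceil(np.log10(np.abs(income).max())))
decimal_scaled = income / 10 ** j
print(min_max)          # 73,600 maps to about 0.716
print(z_score)          # 73,600 maps to 1.225
print(decimal_scaled)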
Data Integration: The merging of data from multiple data stores is known as data integration.
Careful integration can help reduce and avoid redundancies and inconsistencies in the resulting data
set. This can help improve the accuracy and speed of the subsequent data science tasks.
• Below are the challenges in integrating data from different sources:
• Entity identification problem
• Redundancy and correlation analysis
• Tuple duplication
• Data value conflict detection and resolution

Lecture Notes-23:

Entity identification problem:


• It is likely that your data analysis task will involve data integration, which combines data
from multiple sources into a coherent data store. These sources may include multiple
databases, data cubes, or flat files.
• There are several issues to consider during data integration. Schema integration and object
matching can be tricky.
• For example, how can the data analyst or the computer be sure that customer id in one
database and cust_number in another refer to the same attribute?
• Examples of metadata for each attribute include the name, meaning, data type, and range of
values permitted for the attribute, and null rules for handling blank, zero, or null values.
• Such metadata can be used to help avoid errors in schema integration.
• When matching attributes from one database to another during integration, special attention
must be paid to the structure of the data. This is to ensure that any attribute functional
dependencies and referential constraints in the source system match those in the target system.
• For example, in one system, a discount may be applied to the order, whereas in another
system it is applied to each individual item within the order. If this is not caught before
integration, items in the target system may be improperly discounted.

Redundancy and Correlation Analysis:


• Redundancy is another important issue in data integration. An attribute (such as annual
revenue, for instance) may be redundant if it can be “derived” from another attribute or set of
attributes.
• Some redundancies can be detected by correlation analysis. Given two attributes, such
analysis can measure how strongly one attribute implies the other, based on the available data.
• For nominal data, we use the χ2 (chi-square) test. For numeric attributes, we can use the
correlation coefficient and covariance, both of which assess how one attribute’s values vary
from those of another.
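
A brief sketch of both checks with SciPy and NumPy; the contingency table and the numeric columns are made-up values, used only to show the function calls.

import numpy as np
from scipy.stats import chi2_contingency

# chi-square test for two nominal attributes (rows and columns of a contingency table)
contingency = np.array([[250, 200],
                        [50, 1000]])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(chi2, p_value)            # a very small p-value suggests the attributes are correlated
# correlation coefficient and covariance for two numeric attributes
a = np.array([6, 5, 4, 3, 2], dtype=float)
b = np.array([20, 10, 14, 5, 5], dtype=float)
print(np.corrcoef(a, b)[0, 1])  # Pearson correlation coefficient
print(np.cov(a, b)[0, 1])       # sample covariance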

Tuple Duplication:
• In addition to detecting redundancies between attributes, duplication should also be detected
at the tuple level (e.g., where there are two or more identical tuples for a given unique data
entry case). The use of denormalized tables (often done to improve performance by avoiding
joins) is another source of data redundancy.
• Inconsistencies often arise between various duplicates, due to inaccurate data entry or
updating some but not all data occurrences.
• For example, if a purchase order database contains attributes for the purchaser’s name and
address instead of a key to this information in a purchaser database, discrepancies can occur,
such as the same purchaser’s name appearing with different addresses within the purchase
order database.
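
A quick pandas sketch for detecting and removing duplicate tuples (the purchase-order rows are hypothetical):

import pandas as pd

orders = pd.DataFrame({
    "purchaser": ["Asha Rao", "Asha Rao", "R. Kumar"],
    "address":   ["12 MG Road", "12 MG Road", "4 Beach Rd"],
    "amount":    [1200, 1200, 560],
})
print(orders.duplicated())          # flags the second, identical tuple
print(orders.drop_duplicates())     # keeps one copy of each unique tuple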

Data Value Conflict Detection and Resolution:


• Data integration also involves the detection and resolution of data value conflicts.
• For example, for the same real-world entity, attribute values from different sources may differ.
This may be due to differences in representation, scaling, or encoding. For instance, a weight
attribute may be stored in metric units in one system and British imperial units in another.
• For a hotel chain, the price of rooms in different cities may involve not only different
currencies but also different services (e.g., free breakfast) and taxes.
• When exchanging information between schools, for example, each school may have its own
curriculum and grading scheme.
• One university may adopt a quarter system, offer three courses on database systems, and
assign grades from A+ to F, whereas another may adopt a semester system, offer two courses
on databases, and assign grades from 1 to 10.

7. Keywords
 Data Cleaning
 Data Integration
 Data Reduction (PCA, Attribute Subset Selection)
 Data Transformation and Discretization

8. Sample Questions

Remember:
1. List the different forms of data preprocessing.
2. Define data discretization.
3. Define min-max normalization.
Understand:
1. Describe the methods for handling missing values and noisy data.
2. Explain how PCA reduces the dimensionality of a dataset.
9. Stimulating Question (s)
1. Why does raw data need to be preprocessed before analysis?
10. Mind Map
11. Student Summary

At the end of this session, the facilitator (Teacher) shall randomly pick-up few students to
summarize the deliverables.

12. Reading Materials

1. Stephen Marsland, "Machine Learning – An Algorithmic Perspective", CRC Press, 2009.
2. Tom M. Mitchell, "Machine Learning", Tata McGraw Hill, 1997.

13. Scope for Mini Project

NIL

---------------
