Data Mining Unit-II

The document discusses data preprocessing as a crucial step in data mining, focusing on cleaning, transforming, and integrating data for analysis. It outlines the benefits of data mining, common preprocessing steps, and various techniques such as feature extraction, selection, and transformation. Additionally, it covers dimensionality reduction, its applications, and CUR decomposition as a method for identifying significant attributes in data matrices.

DATA MINING (UNIT-II)

1. What is data preprocessing?

Ans. Data preprocessing is an important step in the data mining process. It
refers to cleaning, transforming, and integrating data in order to make it
ready for analysis. The goal of data preprocessing is to improve the quality
of the data and to make it more suitable for the specific data mining task.

2. What are the benefits of data mining?

Ans. The benefits of data mining are as follows:

 Data mining helps businesses acquire knowledge-based information.
 It is applicable to both new and existing systems.
 Businesses can use data mining to make profitable changes to their
operations and production.
 It aids in the prediction of trends and behaviors, as well as the
automated discovery of hidden patterns.

3. Explain some common steps in data preprocessing.


Ans. Data preprocessing is an important step in the data mining process that
involves cleaning and transforming raw data to make it suitable for analysis.
Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or
inconsistencies in the data, such as missing values, outliers, and duplicates.
Various techniques can be used for data cleaning, such as imputation,
removal, and transformation.
Data Integration: This involves combining data from multiple sources to
create a unified dataset. Data integration can be challenging as it requires
handling data with different formats, structures, and semantics. Techniques
such as record linkage and data fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable
format for analysis. Common techniques used in data transformation include
normalization, standardization, and discretization. Normalization is used to
scale the data to a common range, while standardization is used to transform
the data to have zero mean and unit variance. Discretization is used to
convert continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while
preserving the important information. Data reduction can be achieved
through techniques such as feature selection and feature extraction. Feature
selection involves selecting a subset of relevant features from the dataset,
while feature extraction involves transforming the data into a lower-
dimensional space while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete
categories or intervals. Discretization is often used in data mining and
machine learning algorithms that require categorical data. Discretization can
be achieved through techniques such as equal width binning, equal
frequency binning, and clustering.
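The cleaning, transformation, and discretization steps above can be sketched on a small toy list of values. The data, the missing-value sentinel, and the bin count below are illustrative assumptions, not from the text:

```python
def impute_missing(values, sentinel=None):
    """Data cleaning: replace missing entries with the mean of the rest."""
    present = [v for v in values if v is not sentinel]
    mean = sum(present) / len(present)
    return [mean if v is sentinel else v for v in values]

def min_max_normalize(values):
    """Data transformation: scale values to the common range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Data transformation: zero mean and unit variance (z-scores)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / var ** 0.5 for v in values]

def equal_width_bins(values, k):
    """Data discretization: map each value to one of k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

raw = [10, None, 30, 40, 50]        # None marks a missing value
clean = impute_missing(raw)          # the missing entry becomes the mean, 32.5
scaled = min_max_normalize(clean)    # all values now lie in [0, 1]
zscores = standardize(clean)         # zero mean, unit variance
bins = equal_width_bins(clean, 2)    # two equal-width categories
```

Each function corresponds to one named step; in practice these operations are usually applied column by column over a whole dataset.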

4. What is data summarization? State its types.

Ans. Data summarization is the presentation of a summary or report of the
data in a comprehensible and informative manner. The summary is computed
over the entire dataset and conveys its trends and patterns in a simplified
form.

The different types of data summarization in data mining are:

 Tabular Summarization: Tables instantly convey patterns such as
frequency distributions and cumulative frequencies.
 Data Visualization: Charts in a chosen style, such as histograms,
time-series line graphs, and column/bar graphs, help to spot trends
immediately in a visually appealing way.
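A small sketch of tabular summarization with the standard library: a frequency distribution with cumulative counts. The grades data is an invented example:

```python
from collections import Counter
from itertools import accumulate

grades = ["A", "B", "B", "C", "A", "B", "C", "C", "C"]

freq = Counter(grades)                              # frequency distribution
rows = sorted(freq.items())                         # tabulate in category order
cum = list(accumulate(count for _, count in rows))  # cumulative frequency

for (grade, count), running in zip(rows, cum):
    print(f"{grade}  freq={count}  cum={running}")
```

The printed table is the "tabular summary": each row gives a category, its frequency, and the running cumulative frequency.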

5. What is a concept hierarchy?

Ans. A concept hierarchy represents a series of mappings from a set of low-level
concepts to higher-level, more general concepts. It organizes information or
concepts in a hierarchical structure or a specific partial order, which is used
for expressing knowledge in brief, high-level terms and for mining knowledge at
several levels of abstraction.
There are several types of concept hierarchies, which are as follows −
Schema Hierarchy − A schema hierarchy represents the total or partial order
between attributes in the database. It can express existing semantic
relationships between attributes.
Set-Grouping Hierarchy − A set-grouping hierarchy organizes the values of a
given attribute or dimension into groups or ranges. It is also known as an
instance hierarchy because the partial order of the hierarchy is defined on the
set of instances or values of an attribute.
Operation-Derived Hierarchy − An operation-derived hierarchy is generated by a
set of operations on the data. These operations are defined by users, experts,
or the data mining system. Such hierarchies are usually built for numerical
attributes.
Rule-based Hierarchy − In a rule-based hierarchy, either the whole concept
hierarchy or a portion of it is defined by a set of rules and is computed
dynamically based on the current data and the rule definitions.
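A set-grouping hierarchy can be sketched as a mapping from raw values to range groups that roll up into more general concepts. The cut points and group names below are assumptions for illustration:

```python
import bisect

cut_points = [13, 20, 65]                       # upper bounds of each range
groups = ["child", "teen", "adult", "senior"]   # low-level range concepts
rollup = {"child": "minor", "teen": "minor",    # mapping to higher-level concepts
          "adult": "major", "senior": "major"}

def age_group(age):
    """Map a raw age value to its low-level range concept."""
    return groups[bisect.bisect_right(cut_points, age)]

ages = [8, 16, 42, 70]
low_level = [age_group(a) for a in ages]    # range groups per age
high_level = [rollup[g] for g in low_level] # one level more general
```

The two lists show the same instances described at two levels of abstraction, which is exactly what mining at several abstraction levels relies on.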

6. Define: i) Feature Extraction ii) Feature Selection iii) Feature Transformation

Ans. i) Feature Extraction: Feature extraction is a dimensionality-reduction
process in which the raw data is transformed into manageable groups of
related features. Large datasets typically contain a large number of
variables, and processing these variables demands substantial computing
resources. Feature extraction helps by deriving new variables, often by
combining related ones, which reduces the amount of data while preserving
the relevant information.
ii) Feature Selection: Feature selection refers to the process of reducing
the inputs for processing and analysis, i.e., finding the most meaningful
subset of inputs. A related term, feature engineering (or feature
extraction), refers to the process of deriving useful information or
features from existing data.
iii) Feature Transformation: Feature transformation is a mathematical
transformation in which we apply a formula to a particular column (feature)
and transform its values into a form that is more useful for further
analysis. It is a technique for boosting model performance and is closely
related to feature engineering, which creates new features from existing
ones to help improve the model.
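The three ideas can be sketched on a single toy record. All column names (income, height_cm, weight_kg, id) and the derived features are invented for illustration:

```python
import math

record = {"income": 54000, "height_cm": 172, "weight_kg": 68, "id": 9913}

# Feature selection: keep only the inputs judged meaningful for the task
# (the "id" column carries no predictive information and is dropped).
selected = {k: record[k] for k in ("income", "height_cm", "weight_kg")}

# Feature extraction: combine related variables into one derived feature.
selected["bmi"] = record["weight_kg"] / (record["height_cm"] / 100) ** 2

# Feature transformation: apply a formula (here a log) to a column to
# compress a skewed scale into one more useful for analysis.
selected["log_income"] = math.log(record["income"])
```

Selection removes columns, extraction creates a new column from several existing ones, and transformation rewrites the values of a single column in place.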

7. What is dimensionality reduction?

Ans. Dimensionality reduction is the process of reducing the number of
input variables or attributes in a dataset while preserving as much of the
important information as possible. It is a very important stage of data
preprocessing and is considered a significant task in data mining
applications.

8. What are the applications of dimensionality reduction?

Ans. The applications of dimensionality reduction include:
 Text mining
 Image retrieval
 Microarray data analysis
 Protein classification
 Face and image recognition
 Intrusion detection
 Customer relationship management
 Handwritten digit recognition

The different methods used for dimensionality reduction are:
 Principal Component Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Generalized Discriminant Analysis (GDA)
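PCA, the first method listed, can be sketched in a few lines of NumPy: center the data, eigendecompose the covariance matrix, and project onto the top-k components. The toy matrix and the choice k = 1 are assumptions for illustration:

```python
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

Xc = X - X.mean(axis=0)                 # center each feature at zero
cov = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]       # sort components by variance
components = eigvecs[:, order]

k = 1
reduced = Xc @ components[:, :k]        # project onto the top-k directions
```

Here two correlated features are reduced to one derived feature that captures most of the variance; LDA and GDA follow the same project-onto-few-directions pattern but choose the directions using class labels.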

9. What is CUR decomposition?

Ans. CUR matrix decomposition is a low-rank matrix decomposition that is
expressed explicitly in terms of a small number of actual columns and/or
actual rows of the data matrix. It was developed as an alternative
to Singular Value Decomposition (SVD) and Principal Component
Analysis (PCA). CUR decomposition selects the columns and rows that
exhibit high statistical leverage or large influence on the data matrix.
By applying the CUR decomposition algorithm, a small number of the most
important attributes (columns) and/or rows can be identified from the
original data matrix, which makes CUR decomposition an important tool for
exploratory data analysis. It can be applied in a variety of areas and
facilitates regression, classification, and clustering.
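A rough sketch of the idea with NumPy: pick the column and row with the largest squared norm (a simple stand-in for the statistical-leverage scores real CUR implementations use), then solve for the middle matrix U so that C U R approximates A. The matrix below is an assumption chosen to be rank one, so a single actual column and row reconstruct it exactly:

```python
import numpy as np

A = np.outer([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])  # rank-1 toy data matrix
k = 1                                            # columns/rows to keep

col_scores = (A ** 2).sum(axis=0)       # simple proxy for column importance
row_scores = (A ** 2).sum(axis=1)
cols = np.argsort(col_scores)[::-1][:k] # indices of the chosen actual columns
rows = np.argsort(row_scores)[::-1][:k] # indices of the chosen actual rows

C = A[:, cols]                          # actual columns of A
R = A[rows, :]                          # actual rows of A
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # best middle matrix

error = np.linalg.norm(A - C @ U @ R)   # reconstruction error, near zero here
```

Because C and R are literal columns and rows of A, the chosen indices directly name the most important attributes and records, which is the interpretability advantage CUR has over SVD/PCA.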
