DATA MINING (UNIT-II)
1. What is data preprocessing?
Ans. Data pre-processing is an important step in the data mining process. It refers to cleaning, transforming, and integrating data in order to make it ready for analysis. The goal of data pre-processing is to improve the quality of the data and to make it more suitable for the specific data mining task.
2. What are the benefits of data mining?
Ans. The benefits of data mining are as follows:
- It helps businesses acquire knowledge-based information.
- It is applicable to both new and existing systems.
- Businesses can use it to make profitable changes to their operations and production.
- It aids in the prediction of trends and behaviors, as well as the automated discovery of hidden patterns.
3. Explain some common steps in data pre-processing.
Ans. Data preprocessing is an important step in the data mining process that involves cleaning and transforming raw data to make it suitable for analysis. Some common steps are:
- Data Cleaning: identifying and correcting errors or inconsistencies in the data, such as missing values, outliers, and duplicates. Techniques such as imputation, removal, and transformation can be used.
- Data Integration: combining data from multiple sources to create a unified dataset. This can be challenging because it requires handling data with different formats, structures, and semantics. Techniques such as record linkage and data fusion can be used.
- Data Transformation: converting the data into a suitable format for analysis. Common techniques include normalization (scaling the data to a common range), standardization (transforming the data to have zero mean and unit variance), and discretization (converting continuous data into discrete categories).
- Data Reduction: reducing the size of the dataset while preserving the important information, either by feature selection (choosing a subset of relevant features) or feature extraction (transforming the data into a lower-dimensional space).
- Data Discretization: dividing continuous data into discrete categories or intervals, as required by many data mining and machine learning algorithms that expect categorical data. Common techniques are equal-width binning, equal-frequency binning, and clustering.
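For concreteness, here is a minimal Python sketch of cleaning, transformation, and discretization using pandas and scikit-learn. The tiny DataFrame, its columns, and the bin count are hypothetical illustrations, not part of the original notes.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, KBinsDiscretizer

# Hypothetical data: one missing value and one duplicate row.
df = pd.DataFrame({
    "age":    [25, 32, None, 51, 32],
    "income": [40_000, 52_000, 61_000, 150_000, 52_000],
})

# Data cleaning: impute the missing value, then drop duplicate rows.
df["age"] = df["age"].fillna(df["age"].median())
df = df.drop_duplicates()

# Data transformation: normalization scales to [0, 1];
# standardization gives zero mean and unit variance.
df["income_norm"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()
df["income_std"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Data discretization: equal-width binning of a continuous column.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
df["age_bin"] = binner.fit_transform(df[["age"]]).ravel()

print(df)
```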
4. What is Data Summarization? State its types.
Ans. Data Summarization is the presentation of a summary or report of generated data in a comprehensible and informative manner. The summary is derived from the entire dataset and, when carefully performed, conveys its trends and patterns in a simplified manner.
The different types of Data Summarization in Data Mining are:
- Tabular Summarization: instantly conveys patterns such as frequency distributions and cumulative frequencies.
- Data Visualization: a chosen graph style such as a histogram, time-series line graph, or column/bar graph helps to spot trends immediately in a visually appealing way.
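As an illustration, the short Python sketch below produces both kinds of summary with pandas and matplotlib; the small sales dataset is hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset of sales by region.
sales = pd.DataFrame({"region": ["N", "S", "N", "E", "S", "N"],
                      "amount": [120, 90, 150, 80, 110, 95]})

# Tabular summarization: frequency distribution and summary statistics.
print(sales["region"].value_counts())   # frequency per region
print(sales["amount"].describe())       # count, mean, std, quartiles

# Data visualization: a histogram shows the shape of the distribution.
sales["amount"].plot(kind="hist", bins=5, title="Sales amounts")
plt.show()
```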
5. What is a concept hierarchy?
Ans. A concept hierarchy represents a series of mappings from a set of low-level concepts to higher-level, more general concepts. It organizes information or concepts in a hierarchical structure or a specific partial order, which is used for expressing knowledge in brief, high-level terms and for mining knowledge at several levels of abstraction. There are several types of concept hierarchies:
- Schema Hierarchy: represents the total or partial order between attributes in the database and can capture existing semantic relationships between attributes.
- Set-Grouping Hierarchy: organizes the values of a given attribute or dimension into groups or constant ranges of values. It is also known as an instance hierarchy because the partial order of the hierarchy is defined on the set of instances or values of an attribute.
- Operation-Derived Hierarchy: defined by a set of operations on the data. These operations are specified by users, professionals, or the data mining system, and such hierarchies are usually defined for mathematical attributes.
- Rule-based Hierarchy: either the whole concept hierarchy or a portion of it is defined by a set of rules and is computed dynamically based on the current data and the rule definitions.
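As an illustration, here is a small Python sketch of a set-grouping hierarchy (raw ages mapped to more general groups) next to a schema-style location hierarchy; the cut-off values and level names are hypothetical.

```python
# Set-grouping hierarchy: map low-level values (ages) to general concepts.
def age_group(age: int) -> str:
    if age < 18:
        return "minor"      # hypothetical cut-offs for illustration
    elif age < 65:
        return "adult"
    return "senior"

# Schema-style hierarchy: a total order over location attributes.
location_hierarchy = ["street", "city", "state", "country"]

print([age_group(a) for a in (12, 34, 70)])  # ['minor', 'adult', 'senior']
print(" < ".join(location_hierarchy))        # street < city < state < country
```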
6. Explain Feature Extraction, Feature Selection, and Feature Transformation.
Ans.
i) Feature Extraction: a process of dimensionality reduction in which the raw data is condensed into related, manageable groups. Large datasets typically contain a large number of variables, and these variables require a lot of computing resources to process. Feature extraction helps by selecting particular variables and combining related ones, which reduces the amount of data.
ii) Feature Selection: the process of reducing the inputs for processing and analysis, or of finding the most meaningful inputs. A related term, feature engineering (or feature extraction), refers to the process of extracting useful information or features from existing data.
iii) Feature Transformation: a mathematical transformation in which we apply a formula to a particular column (feature) and transform its values, which is useful for further analysis. It is a technique that can boost model performance and is also known as Feature Engineering, which creates new features from existing features that may help improve the model.
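A minimal scikit-learn sketch contrasting the three techniques on the built-in iris dataset; the component and feature counts are hypothetical choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Feature extraction: combine the original variables into 2 new components.
X_extracted = PCA(n_components=2).fit_transform(X)

# Feature selection: keep the 2 most informative original features.
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature transformation: apply a formula (here a log transform,
# often used to reduce skew) to a single column.
X_logged = np.log1p(X[:, 0])

print(X_extracted.shape, X_selected.shape, X_logged.shape)
```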
7. What is Dimensionality reduction?
Ans. Dimensionality reduction is the process of reducing the number of unwanted variables, attributes, or dimensions under consideration. It is a very important stage of data pre-processing and is considered a significant task in data mining applications.
8. What are the applications of dimensionality reduction?
Ans. The applications of dimensionality reduction include:
- Text mining
- Image retrieval
- Microarray data analysis
- Protein classification
- Face and image recognition
- Intrusion detection
- Customer relationship management
- Handwritten digit recognition
The different methods used for dimensionality reduction are:
- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Generalized Discriminant Analysis (GDA)
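A minimal sketch of the first two methods with scikit-learn; the digits dataset and the chosen dimensionalities are hypothetical stand-ins for any high-dimensional data.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_digits(return_X_y=True)  # 1797 samples, 64 features

# PCA: unsupervised; keeps the directions of maximum variance.
pca = PCA(n_components=0.95)         # retain 95% of the variance
X_pca = pca.fit_transform(X)
print("PCA:", X.shape[1], "->", X_pca.shape[1], "features")

# LDA: supervised; maximizes class separability (at most classes-1 dims).
lda = LinearDiscriminantAnalysis(n_components=9)
X_lda = lda.fit_transform(X, y)
print("LDA:", X.shape[1], "->", X_lda.shape[1], "features")
```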
9. What is CUR decomposition?
Ans. CUR matrix decomposition is a low-rank matrix decomposition algorithm whose factors are explicitly expressed in terms of a small number of actual columns and/or actual rows of the data matrix. It was developed as an alternative to Singular Value Decomposition (SVD) and Principal Component Analysis (PCA). CUR decomposition selects columns and rows that exhibit high statistical leverage or exert a large influence on the data matrix. By applying the algorithm, a small number of the most important attributes and/or rows can be identified from the original data matrix, which makes CUR decomposition an important tool for exploratory data analysis. It can be applied to a variety of areas and facilitates regression, classification, and clustering.
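A NumPy sketch of the leverage-score idea behind CUR: compute leverage from the top-k singular vectors, keep the highest-leverage columns and rows, and form the middle factor via pseudo-inverses. The random matrix and the values of k, c, and r are hypothetical, and full CUR algorithms typically sample columns randomly in proportion to these scores rather than taking a deterministic top-k.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))   # hypothetical data matrix

def leverage(V_k):
    # Leverage score = mean squared entry across the top-k singular vectors.
    return (V_k ** 2).sum(axis=0) / V_k.shape[0]

k, c, r = 5, 8, 8                    # target rank, #columns, #rows to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)

col_scores = leverage(Vt[:k, :])     # one score per column of A
row_scores = leverage(U[:, :k].T)    # one score per row of A

cols = np.argsort(col_scores)[-c:]   # highest-leverage columns
rows = np.argsort(row_scores)[-r:]   # highest-leverage rows

C, R = A[:, cols], A[rows, :]
U_mid = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # middle factor

print("approximation error:", np.linalg.norm(A - C @ U_mid @ R))
```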