
Advanced Data Mining 221ECS001 Module -1 Part -2

Module - 1

Data Mining and Knowledge Discovery

Desirable properties of Discovered Knowledge - Knowledge representation, Data Mining Functionalities, Motivation and Importance of Data Mining, Classification of Data Mining Systems, Integration of a data mining system with a Database or Data Warehouse System, Classification, Clustering, Regression, Data Pre-Processing: Data Cleaning, Data Integration and Transformation, normalization, standardization, Data Reduction, Feature vector representation, importance of feature engineering in machine learning; forward selection and backward selection for feature selection; curse of dimensionality; data imputation techniques; No Free Lunch theorem in the context of machine learning, Data Discretization and Concept Hierarchy Generation.

Major Issues In Data Mining


Mining Methodology and User Interaction Issues

● Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.
● Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive because it allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.
● Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but also at multiple levels of abstraction.
● Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
● Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
● Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. If data cleaning methods are not available, the accuracy of the discovered patterns will be poor.
● Pattern evaluation − The patterns discovered should be interesting; patterns may be regarded as uninteresting if they represent common knowledge or lack novelty.

Performance Issues

● Efficiency and scalability of data mining algorithms − In order to effectively extract the information from huge amounts of data in databases, data mining algorithms must be efficient and scalable.
● Parallel, distributed, and incremental mining algorithms − The factors such as huge
size of databases, wide distribution of data, and complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions which are further processed in a parallel fashion.
Then the results from the partitions are merged. Incremental algorithms incorporate database updates without mining the entire data again from scratch.


Diverse Data Types Issues

● Handling of relational and complex types of data − The database may contain complex
data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
● Mining information from heterogeneous databases and global information systems −
The data is available at different data sources on LAN or WAN. These data sources may
be structured, semi-structured, or unstructured. Therefore, mining the knowledge from them adds challenges to data mining.

Classification, Clustering and Regression

Classification
Classification in data mining is a technique used to assign labels or classify each instance,
record, or data object in a dataset based on their features or attributes.
It is a supervised learning technique that uses labeled data to build a model that can predict the
class of new, unseen data. It is an important task in data mining because it enables organizations
to make informed decisions based on their data.

Classification techniques can be divided into two categories - binary classification and multi-class
classification. Binary classification assigns labels to instances into two classes, such as
fraudulent or non-fraudulent. Multi-class classification assigns labels into more than two classes,
such as happy, neutral, or sad.


Classification algorithms can also be grouped into two main categories:

● Generative Classification: This Data Mining Classification Algorithm models the distribution of the individual classes and learns a model of how the data is generated, through estimations and assumptions. The learned model is then used to predict unseen data. An example of a Generative Classification Algorithm is the Naive Bayes Classifier, e.g., detecting spam emails by learning from previously labeled emails.
● Discriminative Classification: The Discriminative Classification algorithm learns to assign a class directly to each row of the data from the observed attribute values, rather than modeling how the data was generated. An example of a Discriminative Classifier is Logistic Regression, e.g., predicting acceptance into a university based on student grades and test results.

Steps Involved in Data Mining Classification

Step 1: Learning Phase

This phase of Data Mining Classification mainly deals with the construction of the classification model using one of the different algorithms available. This step requires a training set from which the model learns. Test data is later applied to the trained model to estimate the accuracy of the classification model that was created.


Step 2: Classification Phase

This phase of Data Mining Classification deals with testing the model that was created by
predicting the class labels. This also helps in determining the accuracy of the model in real test
cases.

Steps to Build a Classification Model


There are several steps involved in building a classification model, as shown below -

● Data preparation - The first step in building a classification model is to prepare the data.
This involves collecting, cleaning, and transforming the data into a suitable format for
further analysis.
● Feature selection - The next step is to select the most important and relevant features that
will be used to build the classification model. This can be done using various techniques,
such as correlation, feature importance analysis, or domain knowledge.
● Prepare train and test data - Once the data is prepared and relevant features are selected,
the dataset is divided into two parts - training and test datasets. The training set is used to
build the model, while the testing set is used to evaluate the model's performance.
● Model selection - Many algorithms can be used to build a classification model, such as
decision trees, logistic regression, k-nearest neighbors, and neural networks. The choice
of algorithm depends on the type of data, the number of features, and the desired
accuracy.


● Model training - Once the algorithm is selected, the model is trained on the training
dataset. This involves adjusting the model parameters to minimize the error between the
predicted and actual class labels.
● Model evaluation - The model's performance is evaluated using the test dataset. The
accuracy, precision, recall, and F1 score are commonly used metrics to evaluate the
model performance.
● Model tuning - If the model's performance is not satisfactory, the model can be tuned by
adjusting the parameters or selecting a different algorithm. This process is repeated until
the desired performance is achieved.
● Model deployment - Once the model is built and evaluated, it can be deployed in
production to classify new data. The model should be monitored regularly to ensure its
accuracy and effectiveness over time.
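The steps above can be illustrated with a minimal end-to-end sketch. The code below is only a hedged example: scikit-learn, the built-in breast-cancer dataset, and the chosen parameter values are assumptions, not part of these notes.

```python
# A minimal end-to-end classification sketch following the steps above
# (library, dataset, and parameters are illustrative assumptions).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Data preparation: load a labeled dataset (features X, class labels y).
X, y = load_breast_cancer(return_X_y=True)

# Prepare train and test data: hold out 30% of the records for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Model selection and training: fit a decision tree on the training set.
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

# Model evaluation: predict labels for the unseen test set and report
# the metrics named in the text.
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
```

In practice, the model tuning step would repeat this train/evaluate loop with different parameters or algorithms until the desired performance is reached.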

Attributes – Represent the different features of an object. The different types of attributes are:

Binary: Possesses only two values i.e. True or False. Example: Suppose there is a survey
evaluating some products. We need to check whether it’s useful or not. So, the Customer has to
answer it in Yes or No. Product usefulness: Yes / No

● Symmetric: Both values are equally important in all aspects.
● Asymmetric: The two values are not equally important.

Nominal: When more than two outcomes are possible. It is in Alphabet form rather than being in
Integer form. Example: One needs to choose some material but of different colors. So, the color
might be Yellow, Green, Black, Red. Different Colors: Red, Green, Black, Yellow

● Ordinal: Values that must have some meaningful order. Example: Suppose there are
grade sheets of few students which might contain different grades as per their
performance such as A, B, C, D Grades: A, B, C, D
● Continuous: May have an infinite number of values, typically stored as floating-point numbers. Example: Measuring the weight of a few students in an ordered manner, i.e. 50, 51, 52, 53. Weight: 50, 51, 52, 53
● Discrete: Finite number of values. Example: Marks of a Student in a few subjects: 65,
70, 75, 80, 90. Marks: 65, 70, 75, 80, 90


Classifiers for Data Mining

1. Logistic Regression

Logistic Regression is a statistical method that performs binomial classification for a particular event or class. The model estimates the probability of each outcome and decides which side of the binary classification an instance falls on. Logistic Regression also helps in determining how multiple independent parameters impact a single outcome. Logistic Regression is only viable when the predicted variable is binary and there are no missing values in the target dataset. It also requires all the predictors to be independent of each other.
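As a hedged illustration of the probability output described above, the short sketch below fits a logistic regression on a made-up binary problem; scikit-learn and the toy numbers are assumptions.

```python
# Binary logistic regression: the model outputs a probability per record,
# which is then thresholded into one of the two classes.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy problem: [hours studied, test score] -> admitted (1) or not (0).
X = np.array([[2, 50], [4, 55], [6, 70], [8, 80], [10, 90], [12, 95]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Probability of each class for a new applicant, then the predicted label.
new_applicant = [[7, 75]]
print(clf.predict_proba(new_applicant))  # [[P(class 0), P(class 1)]]
print(clf.predict(new_applicant))        # 0 or 1
```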

2. Linear Regression

Linear Regression is a Supervised Learning algorithm that performs simple regression to predict values based on the independent variables; it finds the value of the dependent variable from its relationship with the independent variables. The main issues with the model are that it can be prone to overfitting and that it is not always feasible to model the data in a linear manner.

3. Decision Trees

This is one of the most robust Classification techniques for Data Mining. It follows a flowchart similar to the structure of a tree. The leaf nodes hold the classes and their labels, while the internal nodes hold decision conditions that route each record toward the appropriate leaf node; there can be multiple internal nodes along the way. The horizontal and vertical splits on the attributes act as the prediction boundaries. The only challenge is that the tree can become complex and requires expertise to create and to ingest data into.

4. Random Forest

As the name suggests, this model employs multiple Decision Trees, each trained on a subset of the data. The predictions of all the trees are then combined (by voting or averaging) to predict the class. Each subset is the same size as the original dataset, but its samples are drawn with replacement (bootstrap sampling). Random Forest is effective at reducing overfitting and increasing accuracy. The drawback is that it can be slow for real-time applications and is comparatively complex to implement.

5. Naive Bayes

The Naive Bayes Algorithm assumes that every independent parameter affects the outcome equally and has roughly the same importance. It calculates the probability of an event occurring given that another event has already occurred (Bayes' theorem). Naive Bayes requires smaller training sets to learn and is faster at prediction than many other models. Its weakness is poor probability estimation: because all parameters are treated as equally important and independent, its estimates often do not hold true in the real world.


6. Support Vector Machine

The Support vector machine algorithm, also known as SVM, represents the training data in space
differentiated into categories by large gaps. New data points are then mapped into the same
space, and their categories are predicted according to the side of the gap they fall into. This
algorithm is especially useful in high dimensional spaces and is quite memory efficient because
it only employs a subset of training points in its decision function.

7. K-Nearest Neighbors

The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric, supervised
learning classifier, which uses proximity to make classifications or predictions about the
grouping of an individual data point.
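The classifiers listed above can be compared side by side on the same dataset because they share a common fit/predict workflow. The sketch below is illustrative only; scikit-learn, the built-in iris dataset, and the chosen hyperparameters are assumptions.

```python
# Fit several of the classifiers described above on the same data and
# compare their test-set accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name:15s} test accuracy = {clf.score(X_test, y_test):.3f}")
```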

Real-Life Examples
There are many real-life examples and applications of classification in data mining. Some of the most common examples of applications include -
● Email spam classification - This involves classifying emails as spam or non-spam based
on their content and metadata.
● Image classification - This involves classifying images into different categories, such as
animals, plants, buildings, and people.
● Medical diagnosis - This involves classifying patients into different categories based on
their symptoms, medical history, and test results.
● Credit risk analysis - This involves classifying loan applications into different categories,
such as low-risk, medium-risk, and high-risk, based on the applicant's credit score,
income, and other factors.
● Sentiment analysis - This involves classifying text data, such as reviews or social media
posts, into positive, negative, or neutral categories based on the language used.
● Customer segmentation - This involves classifying customers into different segments
based on their demographic information, purchasing behavior, and other factors.
● Fraud detection - This involves classifying transactions as fraudulent or non-fraudulent
based on various features such as transaction amount, location, and frequency.


Clustering
Clustering in data mining is a technique that groups similar data points together based on their
features and characteristics. It can also be referred to as a process of grouping a set of objects so
that objects in the same group (called a cluster) are more similar to each other than those in other
groups (clusters). It is an unsupervised learning technique that aims to identify similarities and
patterns in a dataset. Clustering algorithms typically require defining the number of clusters, similarity measures, and clustering methods. Clustering techniques in data mining can be used in
various applications, such as image segmentation, document clustering, and customer
segmentation. The goal is to obtain meaningful insights from the data and improve
decision-making processes.
Clustering Methods in Data Mining
There are several clustering techniques in data mining, each with its own strengths and weaknesses. Some of the most commonly used clustering techniques are listed below (a short code sketch follows the list) -

● K-means Clustering
K-means clustering is a partitioning method that divides the data points into k clusters,
where k is a pre-defined number. It works by iteratively moving the centroid of each
cluster to the mean of the data points assigned to it until convergence. K-means aims to
minimize the sum of squared distances between each data point and its assigned cluster
centroid.
● Hierarchical Clustering
Hierarchical clustering in data mining is a method that builds a tree-like hierarchy of
clusters, either by merging smaller clusters into larger ones (agglomerative or bottom-up)
or by splitting larger clusters into smaller ones (divisive or top-down). It does not require
a pre-defined number of clusters.
● Density-Based Clustering
Density-based clustering is a method that identifies clusters based on regions of high
density in the data space. Points that are not in any high-density region are considered
noise or outliers. The most commonly used density-based clustering algorithm is
DBSCAN.
● Model-Based Clustering
Model-based clustering is a method that assumes that a probabilistic model, such as a mixture of Gaussian distributions, generates the data points. It seeks to identify the model
parameters that best fit the data and assigns data points to clusters based on their
likelihood under the model.
● Fuzzy Clustering
Fuzzy clustering is a method that assigns data points to clusters based on their degree of
membership in each cluster. This allows a data point to belong to multiple clusters with
different degrees of membership.
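As referenced above, here is a brief sketch of two of these methods: k-means (a partitioning method with k fixed in advance) and DBSCAN (a density-based method that marks noise with the label -1). scikit-learn and the synthetic data are assumptions, not part of the original notes.

```python
# Compare a partitioning method (k-means) and a density-based method (DBSCAN)
# on the same synthetic 2-D data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

# Synthetic data with three natural groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=1)

# K-means: the number of clusters k must be chosen in advance.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
print("k-means cluster labels:", set(kmeans.labels_))

# DBSCAN: clusters follow dense regions; label -1 marks noise/outliers,
# and no number of clusters is specified up front.
dbscan = DBSCAN(eps=0.7, min_samples=5).fit(X)
print("DBSCAN cluster labels :", set(dbscan.labels_))
```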


Applications of Clustering in Data Mining

● Customer Segmentation
● Image Segmentation
● Anomaly Detection
● Text Mining
● Biological Data Analysis
● Recommender Systems

Regression
Regression refers to a type of supervised machine learning technique that is used to predict
any continuous-valued attribute. In other words, Regression is a statistical technique used in
various fields to identify the strength and nature of a connection between one dependent variable
(typically indicated by Y) and a set of other variables called independent variables. The goal is to
predict the value of the dependent variable based on the values of the independent variables. The
dependent variable is also called the response variable, while the independent variable(s) is also
known as the predictor(s).

● For example, let's say we want to predict the price of a house based on its size, number of bedrooms, and location. In this case, price is the dependent variable, while size, number of bedrooms, and location are the independent variables. By analyzing historical data of houses with similar characteristics, we can build a regression model that predicts the price of a new house based on its size, number of bedrooms, and location (see the code sketch after this list).
● Regression analysis is a predictive modeling technique
● Used to forecast numerical outcomes such as sales, prices, etc.
● Helps identify important factors affecting the dependent variable
● Develops a mathematical equation to make predictions based on input variables.
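The house-price example above can be sketched as follows. This is only an illustrative example: scikit-learn, the made-up historical data, and the omission of the categorical location attribute (to keep the inputs numeric) are assumptions.

```python
# Fit a linear regression of price on size and number of bedrooms,
# then predict the price of a new house.
import numpy as np
from sklearn.linear_model import LinearRegression

# Historical data: [size in sq. ft, number of bedrooms] -> price.
X = np.array([[1000, 2], [1500, 3], [1800, 3], [2400, 4], [3000, 4]])
y = np.array([200_000, 290_000, 340_000, 445_000, 540_000])

reg = LinearRegression().fit(X, y)

# Predict the price of a new 2000 sq. ft, 3-bedroom house.
print(reg.predict([[2000, 3]]))
print("coefficients:", reg.coef_, "intercept:", reg.intercept_)
```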


Difference Between Regression, Classification and Clustering in Data Mining

● Classification − a supervised technique that predicts a discrete class label for each record from labeled training data.
● Regression − a supervised technique that predicts a continuous numeric value for each record.
● Clustering − an unsupervised technique that groups unlabeled records so that objects in the same cluster are more similar to each other than to those in other clusters.

Data Pre-Processing

Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific
data mining task.

Some common steps in data preprocessing include:

Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data,
such as missing values, outliers, and duplicates. Various techniques can be used for data
cleaning, such as imputation, removal, and transformation.


Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different formats,
structures, and semantics. Techniques such as record linkage and data fusion can be used for data
integration.

Data Transformation: This involves converting the data into a suitable format for analysis.
Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while standardization
is used to transform the data to have zero mean and unit variance. Discretization is used to
convert continuous data into discrete categories.

Data Reduction: This involves reducing the size of the dataset while preserving the important
information. Data reduction can be achieved through techniques such as feature selection and
feature extraction. Feature selection involves selecting a subset of relevant features from the
dataset, while feature extraction involves transforming the data into a lower-dimensional space
while preserving the important information.

Data Discretization: This involves dividing continuous data into discrete categories or intervals.
Discretization is often used in data mining and machine learning algorithms that require
categorical data. Discretization can be achieved through techniques such as equal width binning,
equal frequency binning, and clustering.
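A small sketch of equal-width and equal-frequency binning is shown below; pandas and the sample ages are assumptions.

```python
# Discretize a numeric attribute into categorical intervals.
import pandas as pd

ages = pd.Series([22, 25, 27, 31, 35, 38, 42, 45, 51, 55, 63, 70])

# Equal-width binning: each interval spans the same range of values.
equal_width = pd.cut(ages, bins=4)

# Equal-frequency binning: each interval holds roughly the same number of records.
equal_freq = pd.qcut(ages, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```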

Data Normalization: This involves scaling the data to a common range, such as between 0 and
1 or -1 and 1. Normalization is often used to handle data with different units and scales.
Common normalization techniques include min-max normalization, z-score normalization, and
decimal scaling.
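The three normalization techniques named above can be sketched as follows; NumPy and the toy values are assumptions.

```python
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: rescale to the range [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization (standardization): zero mean and unit variance.
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer
# such that the largest absolute scaled value is below 1.
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal_scaled = x / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")
```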


1. Data Cleaning:

The data can have many irrelevant and missing parts. To handle this part, data cleaning is done.
It involves handling of missing data, noisy data etc.

(a). Missing Data:

This situation arises when some values are missing from the data. It can be handled in various ways.
Some of them are:

Ignore the tuples: This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.

Fill the Missing values: There are various ways to do this task. You can choose to fill the missing
values manually, by attribute mean or the most probable value.
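A minimal sketch of filling missing values by the attribute mean (numeric) and the most probable value (categorical) is shown below; pandas and the sample records are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40, np.nan],
                   "city": ["Kochi", "Kochi", None, "Chennai", "Kochi"]})

# Numeric attribute: fill missing values with the attribute mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical attribute: fill missing values with the most probable
# (most frequent) value.
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```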

(b). Noisy Data:

Noisy data is meaningless data that can't be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:

Binning Method: This method works on sorted data in order to smooth it. The whole data is
divided into segments of equal size and then various methods are performed to complete the task.
Each segment is handled separately. One can replace all data in a segment by its mean or
boundary values can be used to complete the task.
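A minimal sketch of smoothing by bin means (replacing every value in an equal-size bin by the bin mean) is shown below; NumPy and the sample values are assumptions.

```python
# Smooth sorted data by equal-size bins, replacing each value by its bin mean.
import numpy as np

data = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float))

# Split the sorted data into three equal-size bins.
bins = np.split(data, 3)

# Replace every value in a bin with the mean of that bin.
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```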

Regression: Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).

Clustering: This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.

2. Data Transformation:

This step is taken in order to transform the data into forms suitable for the mining process. It involves the following:

Normalization:

It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)

Attribute Selection:

In this strategy, new attributes are constructed from the given set of attributes to help the mining
process.


Discretization:

This is done to replace the raw values of numeric attributes by interval levels or conceptual
levels.

Concept Hierarchy Generation:

Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".
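A tiny sketch of climbing such a concept hierarchy with a lookup table is shown below; pandas, the city names, and the city-to-country mapping are assumptions.

```python
# Replace a lower-level attribute (city) with its higher-level concept (country).
import pandas as pd

df = pd.DataFrame({"city": ["Kochi", "Chennai", "Paris", "Lyon"]})
city_to_country = {"Kochi": "India", "Chennai": "India",
                   "Paris": "France", "Lyon": "France"}

df["country"] = df["city"].map(city_to_country)
print(df)
```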

3. Data Reduction:

Data reduction is a crucial step in the data mining process that involves reducing the size of the
dataset while preserving the important information. This is done to improve the efficiency of data
analysis and to avoid overfitting of the model. Some common steps involved in data reduction
are:

Feature Selection: This involves selecting a subset of relevant features from the dataset. Feature
selection is often performed to remove irrelevant or redundant features from the dataset. It can be
done using various techniques such as correlation analysis, mutual information, and principal
component analysis (PCA).

Feature Extraction: This involves transforming the data into a lower-dimensional space while
preserving the important information. Feature extraction is often used when the original features
are high-dimensional and complex. It can be done using techniques such as PCA, linear
discriminant analysis (LDA), and non-negative matrix factorization (NMF).
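A hedged sketch contrasting feature selection (keeping a subset of the original features) with feature extraction via PCA is shown below; scikit-learn and the built-in iris dataset are assumptions.

```python
# Reduce 4 original features either by selecting 2 of them or by
# projecting onto 2 principal components.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)          # 150 samples, 4 original features

# Feature selection: keep the 2 features most associated with the class.
X_selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Feature extraction: project onto 2 principal components.
X_pca = PCA(n_components=2).fit_transform(X)

print(X.shape, X_selected.shape, X_pca.shape)  # (150, 4) (150, 2) (150, 2)
```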

Sampling: This involves selecting a subset of data points from the dataset. Sampling is often
used to reduce the size of the dataset while preserving the important information. It can be done
using techniques such as random sampling, stratified sampling, and systematic sampling.
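A short sketch of random and stratified sampling is shown below; pandas, scikit-learn, and the iris dataset are assumptions.

```python
# Draw a 20% sample of the data, with and without preserving class proportions.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)

# Simple random sampling: keep 20% of the rows at random.
random_sample = X.sample(frac=0.2, random_state=0)

# Stratified sampling: keep 20% of the rows while preserving the
# class proportions in y.
_, X_strat, _, y_strat = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(len(random_sample), len(X_strat), y_strat.value_counts().to_dict())
```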

Clustering: This involves grouping similar data points together into clusters. Clustering is often
used to reduce the size of the dataset by replacing similar data points with a representative
centroid. It can be done using techniques such as k-means, hierarchical clustering, and
density-based clustering.

Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage and
transmission purposes. It can be done using techniques such as wavelet compression, JPEG
compression, and gzip compression.

Prepared by : Sandra Raju , Asst. Professor, Department of CSE, MLMCE
