
Module 4

Introduction to Data mining


Introduction and concept of Data
mining
 Data mining is the process of automatically discovering useful
information in large data repositories.
 Data mining techniques are deployed to scour large databases in order
to find novel and useful patterns that might otherwise remain unknown.
 They also provide capabilities to predict the outcome of a future
observation, such as predicting whether a newly arrived customer will
spend more than $100 at a department store.
Data mining as a process of
Knowledge discovery
 Knowledge discovery in databases (KDD) is the process of discovering useful
knowledge from a collection of data.
 This widely used process includes data preparation and selection, data cleansing, incorporating prior knowledge on data sets, and interpreting accurate solutions from the observed results.
 Data Cleaning − In this step, noise and inconsistent data are removed.
 Data Integration − In this step, multiple data sources are combined.
 Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
 Data Transformation − In this step, data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
 Data Mining − In this step, intelligent methods are applied in order to extract data
patterns.
 Pattern Evaluation − In this step, data patterns are evaluated.
 Knowledge Presentation − In this step, knowledge is represented.
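The cleaning, selection, and transformation steps above can be sketched on a toy dataset; the field names and values here are invented purely for illustration.

```python
# Sketch of three KDD steps on a toy list of sales records.
# Field names (region, amount) are illustrative, not from the source.

records = [
    {"region": "north", "amount": 120.0},
    {"region": "north", "amount": None},   # inconsistent/missing value
    {"region": "south", "amount": 80.0},
    {"region": "south", "amount": 95.0},
]

# Data cleaning: drop records with missing values.
cleaned = [r for r in records if r["amount"] is not None]

# Data selection: keep only the attributes relevant to the task.
selected = [(r["region"], r["amount"]) for r in cleaned]

# Data transformation: consolidate by aggregation (total per region).
totals = {}
for region, amount in selected:
    totals[region] = totals.get(region, 0.0) + amount

print(totals)  # {'north': 120.0, 'south': 175.0}
```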
Figure 1: Data mining as a process of knowledge discovery
Data mining tasks
 Data mining tasks are generally divided into two major categories:
 Predictive tasks. The objective of these tasks is to predict the value of a
particular attribute based on the values of other attributes. The attribute to
be predicted is commonly known as the target or dependent variable, while
the attributes used for making the prediction are known as the explanatory or
independent variables.
 Descriptive tasks. Here, the objective is to derive patterns (correlations,
trends, clusters, trajectories, and anomalies) that summarize the underlying
relationships in data. Descriptive data mining tasks are often exploratory in
nature and frequently require postprocessing techniques to validate and
explain the results.
Data mining parameters
 Data mining uses processes, based on parameters and rules, to pull out critical information
from vast amounts of data. We have a choice between parametric and non-parametric
models. In parametric models, the distribution is known or assumed.
 The model may have hyperparameters and parameters. Some examples of
hyperparameters are train-test split ratio, learning rate and choice of optimization
algorithm. Usually, the parameters are weights that are estimated in the model.
 Data mining parameters include:

 Sequence or path analysis: This technique identifies those patterns where one event
leads to a subsequent event. For example, consumers may demand a backpack/carry bag
depending on the items and the quantity of items they are buying.
 Classification: This technique identifies new groups in the stored data and uncovers previously unknown facts. For example, a restaurant could mine its customer data to identify when the maximum number of customers visit and what they order. Based on this information, special daily offers can be introduced to increase customers and revenue.
 Forecasting: This technique is used for discovering patterns in data that can lead to
practicable predictions about the future. For example, life insurance companies frame
policies on the basis of prediction on human life.
Architecture of data mining
 Data Source:
 The actual source of data is the Database, data warehouse, World Wide Web (WWW), text
files, and other documents. You need a huge amount of historical data for data mining to be
successful. Organizations typically store data in databases or data warehouses. Data
warehouses may comprise one or more databases, text files, spreadsheets, or other
repositories of data. Sometimes, even plain text files or spreadsheets may contain
information. Another primary source of data is the World Wide Web or the internet.

 Different processes:
 Before passing the data to the database or data warehouse server, the data must be
cleaned, integrated, and selected. As the information comes from various sources and in
different formats, it can't be used directly for the data mining procedure because the data
may not be complete and accurate. So the data first needs to be cleaned and unified.
More information than needed will be collected from various data sources, and only the
data of interest will have to be selected and passed to the server. These procedures are not
as easy as we think. Several methods may be performed on the data as part of selection,
integration, and cleaning.
 Database or Data Warehouse Server:
 The database or data warehouse server holds the actual data that is ready to be processed. Hence, the server is responsible for retrieving the relevant data, based on the user's data mining request.
 Data Mining Engine:
 The data mining engine is a major component of any data mining system. It contains several modules for performing data mining tasks, including association, characterization, classification, clustering, prediction, time-series analysis, etc. In other words, the data mining engine is the core of the data mining architecture. It comprises the instruments and software used to obtain insights and knowledge from data collected from various data sources and stored within the data warehouse.

 Pattern Evaluation Module:


 The pattern evaluation module is primarily responsible for measuring the interestingness of discovered patterns using a threshold value. It collaborates with the data mining engine to focus the search on interesting patterns, typically employing interestingness measures, and it may use an interestingness threshold to filter out discovered patterns. Alternatively, the pattern evaluation module might be integrated with the mining module, depending on the implementation of the data mining techniques used. For efficient data mining, it is generally recommended to push the evaluation of pattern interestingness as deep as possible into the mining procedure, so as to confine the search to only interesting patterns.
 Graphical User Interface:
 The graphical user interface (GUI) module communicates between the data mining
system and the user. This module helps the user to easily and efficiently use the
system without knowing the complexity of the process. This module cooperates
with the data mining system when the user specifies a query or a task and displays
the results.

 Knowledge Base:
 The knowledge base supports the entire data mining process. It can be used to guide the search or to evaluate the interestingness of the resulting patterns. The
knowledge base may even contain user views and data from user experiences that
might be helpful in the data mining process. The data mining engine may receive
inputs from the knowledge base to make the result more accurate and reliable. The
pattern assessment module regularly interacts with the knowledge base to get
inputs, and also update it.
Functionalities of data mining
 Class/Concept Descriptions

 A class or concept implies there is a data set or set of features that define the
class or a concept. A class can be a category of items on a shop floor, and a
concept could be the abstract idea on which data may be categorized, such as products to be put on clearance sale versus non-sale products. There are two
concepts here, one that helps with grouping and the other that helps in
differentiating.

 Data Characterization: This refers to the summary of general characteristics or features of the class, resulting in specific rules that define a target class. A data analysis technique called Attribute-Oriented Induction is employed on the data set to achieve characterization.
 Data Discrimination: Discrimination is used to separate distinct data sets based on the disparity in attribute values. It compares features of a class with features of one or more contrasting classes. The output can be presented using, e.g., bar charts, curves and pie charts.
 Mining Frequent Patterns
 One of the functions of data mining is finding data patterns. Frequent patterns
are things that are discovered to be most common in data. Various types of
frequency can be found in the dataset.

 Frequent item set: This term refers to a group of items that are commonly
found together, such as milk and sugar.
 Frequent substructure: It refers to the various types of data structures that can
be combined with an item set or subsequences, such as trees and graphs.
 Frequent Subsequence: A regular pattern series, such as buying a phone
followed by a cover.
 Association Analysis

 It analyses the set of items that generally occur together in a transactional dataset. It
is also known as Market Basket Analysis for its wide use in retail sales. Two parameters
are used for determining the association rules:
 Support identifies how frequently the item set occurs in the database.
 Confidence is the conditional probability that an item occurs when another item occurs
in a transaction.
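A minimal sketch of the two measures, support and confidence, computed over an invented transaction list:

```python
# Support and confidence for a candidate rule {milk} -> {sugar}
# over a toy set of transactions (the items are illustrative).

transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"milk", "bread"},
    {"sugar", "tea"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional probability of the consequent given the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"milk", "sugar"}, transactions))       # 0.5  (2 of 4 transactions)
print(confidence({"milk"}, {"sugar"}, transactions))  # 2/3  (2 of the 3 milk transactions)
```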

 Classification

 Classification is a data mining technique that categorizes items in a collection based on some predefined properties. It uses methods like if-then rules, decision trees or neural networks to predict a class, essentially classifying a collection of items. A training set containing items whose properties are known is used to train the system to predict the category of items in an unknown collection.
 Prediction

 Prediction is used to estimate unavailable data values or data trends. An outcome can be anticipated based on the attribute values of the object and the attribute values of the classes. It
can be a prediction of missing numerical values or increase or decrease trends in time-related
information. There are primarily two types of predictions in data mining: numeric and class
predictions.
 Numeric predictions are made by creating a linear regression model that is based on historical
data. Prediction of numeric values helps businesses ramp up for a future event that might
impact the business positively or negatively.
 Class predictions are used to fill in missing class information for products using a training data
set where the class for products is known.
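A numeric prediction of the kind described can be sketched with a least-squares linear fit on historical data; the data points below are invented for illustration.

```python
# Fit y = a*x + b by least squares on historical data,
# then predict a future value. The numbers are made up.

xs = [1.0, 2.0, 3.0, 4.0]      # e.g. month index
ys = [10.0, 12.0, 14.0, 16.0]  # e.g. sales

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope and intercept from the normal equations.
a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - a * mean_x

def predict(x):
    return a * x + b

print(predict(5.0))  # 18.0
```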
 Cluster Analysis

 In image processing, pattern recognition and bioinformatics, clustering is a popular data mining
functionality. It is similar to classification, but the classes are not predefined. Data attributes
represent the classes. Similar data are grouped together, with the difference being that a class
label is not known. Clustering algorithms group data based on similar features and
dissimilarities.
 Outlier analysis
 Outlier analysis is important to understand the quality of data. If there are too many outliers, you
cannot trust the data or draw patterns. An outlier analysis determines if there is something out of
turn in the data and whether it indicates a situation that a business needs to consider and take
measures to mitigate. An outlier analysis of the data that cannot be grouped into any classes by the
algorithms is pulled up.

 Evolution and Deviation Analysis

 Evolution Analysis pertains to the study of data sets that change over time. Evolution analysis
models are designed to capture evolutionary trends in data helping to characterize, classify, cluster
or discriminate time-related data.

 Correlation Analysis

 Correlation is a mathematical technique for determining whether and how strongly two attributes are related to one another. It determines how well two numerically measured continuous variables are linked.
Types of data
Kinds of data to be mined
 Basically, there are two types of data set:
1. Record data set
a) Transactional data
b) Data Matrix
c) Sparse data matrix
d) Graph-based data
e) Ordered data
i) Sequential data
ii) Sequence data
iii) Time-series data
iv) Spatial data

2. Non-Record data set


 Transaction or Market Basket Data: Transaction data is a special type of record
data, where each record (transaction) involves a set of items. Consider a grocery store.
The set of products purchased by a customer during one shopping trip constitutes a
transaction, while the individual products that were purchased are the items. This type
of data is called market basket data because the items in each record are the products
in a person’s “market basket.” Transaction data is a collection of sets of items, but it
can be viewed as a set of records whose fields are asymmetric attributes.
 The Data Matrix: If the data objects in a collection of data all have the same fixed set
of numeric attributes, then the data objects can be thought of as points (vectors) in a
multidimensional space, where each dimension represents a distinct attribute
describing the object.
 The Sparse Data Matrix: A sparse data matrix is a special case of a data matrix in
which the attributes are of the same type and are asymmetric; i.e., only non-zero
values are important. Transaction data is an example of a sparse data matrix that has
only 0–1 entries. Another common example is document data. In particular, if the order
of the terms (words) in a document is ignored, then a document can be represented as
a term vector, where each term is a component (attribute) of the vector and the value
of each component is the number of times the corresponding term occurs in the
document.
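The term-vector representation of documents can be sketched in a few lines; the documents below are invented for illustration.

```python
# Documents as term vectors (a sparse data matrix): each component
# counts occurrences of a term, and word order is ignored.

from collections import Counter

docs = [
    "data mining finds patterns in data",
    "mining patterns needs data",
]

# Vocabulary = union of all terms; each document becomes a count vector.
vocab = sorted({w for d in docs for w in d.split()})
vectors = [[Counter(d.split())[term] for term in vocab] for d in docs]

print(vocab)    # ['data', 'finds', 'in', 'mining', 'needs', 'patterns']
print(vectors)  # [[2, 1, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1]]
```

Most entries are zero for real vocabularies, which is why such matrices are stored sparsely.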
 Graph-Based Data: A graph can sometimes be a convenient and powerful
representation for data. We consider two specific cases: (1) the graph captures
relationships among data objects and (2) the data objects themselves are
represented as graphs.
Data mining system classification
 A data mining system can be classified according to the following criteria −
 Database Technology
 Statistics
 Machine Learning
 Information Science
 Visualization
 Other Disciplines
Different types of Machine learning
 Supervised Classification
 This type of classification fits when you are looking for a specific target value with which to make classifications and predictions in data mining. The target can yield one or two outcomes, depending on the possibilities.

 Unsupervised Classification
 Here, no attention is drawn to predetermined attributes, and there is little to do with a target value. It is used only to map out hidden relations and data structures.

 Semi-Supervised Classification
 This classification of data mining is the middle ground between the supervised and unsupervised
classification of data mining. Here, the fusion of unlabeled and labelled datasets becomes functional at
the time of the training period.

 Reinforcement Classification
 This data mining classification type involves trial and error to determine how best to react to situations. It allows the software agent to adapt its behaviour depending on feedback from the environment.
Classification Based on the kind of Knowledge Mined

 We can classify a data mining system according to the kind of knowledge mined. It means the data mining system is classified on the basis of functionalities such as −
 Characterization
 Discrimination
 Association and Correlation Analysis
 Classification
 Prediction
 Outlier Analysis
 Evolution Analysis
Classification Based on the Techniques
Utilized
Decision Trees
 This data mining classification is robust. Its flowchart resembles the structure of a tree, with the labelled classes hanging on the leaf nodes.
 The internal nodes contain the decision conditions of the classification algorithm, which route each record to a neighbouring leaf node.
Naive Bayes
 This algorithm assumes that the features are mutually independent and have equal effects on the result, which is why they are also equally important. Naive Bayes estimates the probability that an event will occur.
K-nearest Neighbors
 It produces non-linear prediction boundaries, since K-nearest Neighbors is a non-linear classifier. The classifier uses the classes of the k nearest neighbours to make classifications and predictions in data mining for a new test data point.
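A bare-bones sketch of this classifier; the training points, labels, and the choice k=3 are invented for illustration.

```python
# Minimal k-nearest-neighbours classifier using Euclidean distance.
# Training data and labels are illustrative.

import math
from collections import Counter

train = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
         ((4.0, 4.2), "high"), ((4.5, 4.0), "high"), ((4.2, 4.4), "high")]

def knn_predict(point, train, k=3):
    # Sort training items by distance to the query point,
    # then take a majority vote among the k nearest labels.
    nearest = sorted(train, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((4.1, 4.1), train))  # 'high'
```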
Support Vector Machines
 Also called SVM, it represents the training data as points in space, segregated into distinct categories by gaps that are as wide as possible. New data points are then mapped into the same space, and their category is predicted according to which side of the gap they fall on.
Random Forests
 This one is compatible with numerous decision trees on diverse subsamples of
databases. Here, the average is implemented to improve the accuracy of its predictions
and administer overfitting.
Neural Networks
 This data mining classification method takes the input and learns to identify patterns in it. This enables neural networks to predict outputs for similar new inputs.
Ensemble Method
 It combines diverse models to strengthen machine learning outcomes, yielding better predictive performance than any single model.
Advantages of Data mining
 Marketing/advertising
 Data mining supports advertising and marketing professionals by imparting them useful, valuable
and accurate information of the trends about their clients’ purchasing behaviour. Based on these
trends, marketers can focus their attention on their customers with more precision. Moreover, data
mining may also help these professionals in identifying products and services that are less liked by
their customers. On this basis, marketers can give suggestions or recommendations to their
customers and enhance their shopping experience.

 Banking/crediting
 Data mining aids financial institutions in areas, such as credit default and loan delivery. Data mining
can also support credit card issuers in detecting potentially fraudulent credit card transactions.

 Law enforcement
 Data mining assists law enforcement agencies in identifying criminal suspects, as well as in
catching them by investigating trends in location, habits, crime type and other behaviour patterns.
 Researchers
 Data mining supports researchers by increasing the pace of their data analysis
process; thus, providing them more time to work on other projects.

 Manufacturing
 Data mining is applied widely to determine the range of control parameters in
the manufacturing sector. These optimal control parameters are then used to
manufacture products with the desired quality.

 Government
 Data mining supports government agencies by extracting and analysing records
of financial transactions, for example, it helps banks to discover patterns that
can identify money laundering or criminal activities.
Challenges of Data mining
 Security and social issues
 User interface issues
 Mining methodology issues
 Performance issues
 Data source issues
Ethical considerations
 Suitability and validity
Guard against the possibility that a predisposition by investigators or data
providers might predetermine the analytic result. Employ data selection or
sampling methods and analytic approaches that are designed to assure valid
analyses in either frequentist or Bayesian approaches.
 Privacy and Confidentiality
 The aims of the data mining effort
 Moral imperatives
Global issues
 Challenges + Ethical issues
Classification
 Classification is used to classify each item in a set of data into one of a predefined set of classes or groups.
 Classification is the data analysis task in which a model or classifier is constructed to predict categorical labels (the class label attribute).
 Classification is a data mining function that assigns items in a collection to target
categories or classes.
 The goal of classification is to accurately predict the target class for each case in the
data.
 For example, a classification model could be used to identify loan applicants as low,
medium, or high credit risks.
 A classification task begins with a data set in which the class assignments are known.
For example, a classification model that predicts credit risk could be developed based
on observed data for many loan applicants over a period of time.
 In addition to the historical credit rating, the data might track employment history,
home ownership or rental, years of residence, number and type of investments, and
so on.
 Classifications are discrete and do not imply order.
 Continuous, floating-point values would indicate a numerical, rather than a categorical,
target.
 A predictive model with a numerical target uses a regression algorithm, not a classification
algorithm.
 The simplest type of classification problem is binary classification.
 In binary classification, the target attribute has only two possible values: for example, high
credit rating or low credit rating.
 Multiclass targets have more than two values: for example, low, medium, high, or unknown
credit rating.
 In the model build (training) process, a classification algorithm finds relationships between
the values of the predictors and the values of the target.
 Different classification algorithms use different techniques for finding relationships.
 These relationships are summarized in a model, which can then be applied to a different data
set in which the class assignments are unknown.
 Classification has many applications in customer segmentation, business modeling,
marketing, credit analysis, and biomedical and drug response modeling.
Figure: Classification model illustration
 Step 1: A classifier is built describing a predetermined set of data classes or
concepts. (This is also known as supervised learning).
 Step 2: Here, the model is used for classification. First, the predictive accuracy of the classifier is estimated; if it is acceptable, the classifier is used to classify new data.
 The commonly used methods for data mining classification tasks can be
classified into the following groups.
1. Decision tree induction methods,
2. Rule-based methods,
3. Memory-based learning,
4. Neural networks,
5. Bayesian networks,
6. Support vector machines.
Clustering
 The task of grouping data points based on their similarity with each other is called
Clustering or Cluster Analysis.
 This method is defined under the branch of Unsupervised Learning, which aims at
gaining insights from unlabelled data points, that is, unlike supervised learning we
don’t have a target variable.
 Clustering aims at forming groups of homogeneous data points from a
heterogeneous dataset.
 It evaluates similarity based on a metric like Euclidean distance, cosine similarity, Manhattan distance, etc., and then groups the points with the highest similarity scores together.
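The grouping idea can be sketched with a minimal k-means-style loop in one dimension; the points and the choice k=2 are illustrative, not from the source.

```python
# Minimal k-means sketch: assign 1-D points to the nearest centroid,
# recompute centroids as cluster means, repeat until stable.
# The data and k=2 are invented for illustration.

points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.9]
centroids = [points[0], points[3]]  # naive initialisation

for _ in range(10):
    clusters = {0: [], 1: []}
    for p in points:
        # Assignment step: nearest centroid by absolute distance.
        nearest = min((0, 1), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: each centroid becomes its cluster's mean.
    new_centroids = [sum(c) / len(c) for c in clusters.values()]
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(centroids)  # converges near [1.0, 8.03]
```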
Example: Clustering
Types of clustering

 Hard Clustering: In this type of clustering, each data point either belongs to a cluster completely or not at all. For example, say there are 4 data points and we have to cluster them into 2 clusters. Each data point will then belong either to cluster 1 or to cluster 2.
 Soft Clustering: In this type of clustering, instead of assigning each data point to a single cluster, the probability or likelihood of that point belonging to each cluster is evaluated. For example, say there are 4 data points and we have to cluster them into 2 clusters. We will then evaluate, for every data point, a probability of it belonging to each of the two clusters.
Uses of clustering
 Market Segmentation – Businesses use clustering to group their customers and use
targeted advertisements to attract more audience.
 Market Basket Analysis – Shop owners analyze their sales and figure out which items
are majorly bought together by the customers. For example, In USA, according to a study
diapers and beers were usually bought together by fathers.
 Social Network Analysis – Social media sites use your data to understand your
browsing behaviour and provide you with targeted friend recommendations or content
recommendations.
 Medical Imaging – Doctors use Clustering to find out diseased areas in diagnostic
images like X-rays.
 Anomaly Detection – To find outliers in a stream of real-time dataset or forecasting
fraudulent transactions we can use clustering to identify them.
 Simplify working with large datasets – Each cluster is given a cluster ID after clustering is complete. You can then reduce a whole feature set to its cluster ID. Clustering is effective when it can represent a complicated case with a straightforward cluster ID. Using the same principle, clustering can make complex datasets simpler.
Types of clustering algorithms
 Centroid-based Clustering (Partitioning methods)
 Density-based Clustering
 Connectivity-based Clustering (Hierarchical clustering)
 Distribution-based Clustering (Model-based methods)
Neural networks
 Neural Network is an information processing paradigm that is inspired by the
human nervous system.
 Just as the human nervous system has biological neurons, neural networks have artificial neurons: mathematical functions modelled on biological neurons.
 The human brain is estimated to have around 10 billion neurons each
connected on average to 10,000 other neurons.
 Each neuron receives signals through synapses that control the effects of the
signal on the neuron.
Figure: Neural network function
Uses of neural network
 Fraud Detection: Fraudsters have been exploiting businesses and banks for their own financial gain for many years, and the problem is set to grow in today's world because advancing technology makes fraud relatively easy to commit. On the other hand, technology also helps in fraud detection, and here neural networks help us a great deal.

 Healthcare: In healthcare, neural networks help us diagnose diseases. There are many diseases, and large datasets hold records of them. With neural networks and these records, diseases can be diagnosed at as early a stage as possible.
Different neural network method in data
mining
 Feed-Forward Neural Networks: In a feed-forward network, output values are not fed back to the input values; for every input, an output is computed in a forward flow of information with no feedback between the layers. In simple words, the information moves in only one direction (forward) from the input nodes, through the hidden nodes (if any), to the output nodes. Such a network is known as a feedforward network.
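The one-directional flow can be sketched as a single forward pass; the layer sizes, weights, and sigmoid activation below are arbitrary illustrative choices.

```python
# Forward pass of a tiny feed-forward network:
# 2 inputs -> 2 hidden units -> 1 output, information flowing
# in one direction only. The weights are invented.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, w_out):
    # Each hidden unit: sigmoid of the weighted sum of the inputs.
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs))) for ws in w_hidden]
    # Output unit: sigmoid of the weighted sum of the hidden values.
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))

w_hidden = [[0.5, -0.4], [0.3, 0.8]]  # weights into each hidden unit
w_out = [1.0, -1.0]                   # weights into the output unit
print(forward([1.0, 2.0], w_hidden, w_out))
```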
 Feedback Neural Network: Signals can travel in both directions in a feedback network.
Feedback neural networks are very powerful and can become very complex. Feedback networks are dynamic. The “states” in such a network are constantly changing until an equilibrium point is
reached. They stay at equilibrium until the input changes and a new equilibrium needs to be
found. Feedback neural network architectures are also known as interactive or recurrent.
Feedback loops are allowed in such networks. They are used for content addressable memory.
 Self Organization Neural Network: Self Organizing Neural Network (SONN) is a type of
artificial neural network but is trained using competitive learning rather than error-correction
learning (e.g., backpropagation with gradient descent) used by other artificial neural networks. A
Self Organizing Neural Network (SONN) is an unsupervised learning model in Artificial Neural
Network termed as Self-Organizing Feature Maps or Kohonen Maps. It is used to produce a low-
dimensional (typically two-dimensional) representation of a higher-dimensional data set while
preserving the topological structure of the data.
Decision trees
 A decision tree is a type of algorithm that classifies information so that a
tree-shaped model is generated.
 It is a schematic model of the information that represents the different
alternatives and the possible results for each chosen alternative.
 Decision trees are a widely used model because they greatly facilitate
understanding of the different options.
Example:

 The above example of a decision tree helps to determine whether one should play cricket or not.
 If the weather forecast suggests
that it is overcast then you
should definitely play cricket.
 If it is rainy, you should play only
if the wind is weak and if it is
sunny then you should play if the
humidity is normal or low.
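The rules in this example map directly onto nested if-then statements; the attribute names and values follow the slide's example.

```python
# The cricket decision tree expressed as if-then rules.

def play_cricket(outlook, wind=None, humidity=None):
    if outlook == "overcast":
        return True                         # overcast: always play
    if outlook == "rainy":
        return wind == "weak"               # rainy: play only if wind is weak
    if outlook == "sunny":
        return humidity in ("normal", "low")  # sunny: play if humidity normal/low
    return False

print(play_cricket("overcast"))                  # True
print(play_cricket("rainy", wind="strong"))      # False
print(play_cricket("sunny", humidity="normal"))  # True
```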
Decision tree components
 The decision tree is made up of nodes and branches.
 There are different types of nodes and branches depending on what you
want to represent.
 Decision nodes represent a decision to be made, probability nodes represent possible uncertain outcomes, and terminal nodes represent the final outcome.
 On the other hand, the branches are differentiated into alternative
branches, where each branch leads to a type of result and, the
“rejected” branches, which represent the results that are rejected.
 A characteristic of the model is that the same problem can be represented by different trees.
Types of decision trees
 Categorical Variable Decision Tree
A categorical variable decision tree has a categorical target variable, which is divided into categories such as Yes or No. The categories specify that the stages of the decision process are categorically divided.

 Continuous Variable Decision Tree
A continuous variable decision tree has a continuous target variable. One example to understand this: the unknown salary of an employee can be predicted based on the available profile information, such as his/her job role, age, experience, and other continuous variables.
Advantages of decision trees
 Decision trees in data mining provide us with various advantages to analyze and classify the data in your
information base. However, experts highlight the following –

1. Ease of Understanding
 Because data mining tools can visually capture this model in a very practical way, people can understand
how it works after a short explanation. It is not necessary to have extensive knowledge of data mining or
web programming languages.

2. Does Not Require Data Normalization


 Most data mining techniques require the data to be prepared for processing, that is, analysing and discarding data in poor condition. This is not the case for decision trees in data mining, as they can start working on the data directly.

3. Handling of Numbers and Categorized Data


 One of the main differences between neural networks and decision trees is that the latter analyze a large
number of variables.
 While neural networks simply focus on numerical variables, decision trees encompass both numerical
and nominal variables. Therefore, they will help you to analyze a large amount of information together.
4. “White Box” Model
 In web programming and data mining, the white box model brings together a type of
software test in which the variables are evaluated to determine what are the possible
scenarios or execution paths based on a decision.

5. Uses of Statistics
 Decision trees and statistics work hand in hand to provide greater reliability to the model
that is being developed. Since each result is supported by various statistical tests, the
probability of any of the options analyzed can be known exactly.

6. Handles Big Data


 Do you have large amounts of information to analyze? With decision trees, you can
process them seamlessly. This model works perfectly with big data, as it uses computer
and web programming resources to manipulate each point of information.
