
Data Mining - Tasks

Data mining deals with the kinds of patterns that can be mined. On the basis of the kind of
data to be mined, there are two categories of functions involved in data mining −

 Descriptive
 Classification and Prediction
Descriptive Function

The descriptive function deals with the general properties of data in the database. Here is the
list of descriptive functions −

 Class/Concept Description
 Mining of Frequent Patterns
 Mining of Associations
 Mining of Correlations
 Mining of Clusters
Class/Concept Description

Class/Concept refers to the data being associated with classes or concepts. For example, in
a company, the classes of items for sale include computers and printers, and the concepts of
customers include big spenders and budget spenders. Such descriptions of a class or a
concept are called class/concept descriptions. These descriptions can be derived in the
following two ways −

 Data Characterization − This refers to summarizing the data of the class under study.
The class under study is called the Target Class.
 Data Discrimination − This refers to comparing the target class with one or more
predefined contrasting classes or groups.
Mining of Frequent Patterns

Frequent patterns are those patterns that occur frequently in transactional data. Here are the
kinds of frequent patterns −

 Frequent Item Set − It refers to a set of items that frequently appear together, for
example, milk and bread.
 Frequent Subsequence − A sequence of patterns that occurs frequently, such as
purchasing a camera being followed by purchasing a memory card.
 Frequent Sub Structure − Substructure refers to different structural forms, such as
graphs, trees, or lattices, which may be combined with item sets or subsequences.
Mining of Associations

Associations are used in retail sales to identify items that are frequently purchased
together. Association mining refers to the process of uncovering the relationships among data
and determining association rules.
For example, a retailer may generate an association rule showing that 70% of the time milk is
sold with bread and only 30% of the time biscuits are sold with bread.

Mining of Correlations

Correlation mining is a kind of additional analysis performed to uncover interesting statistical
correlations between associated attribute-value pairs or between two item sets, in order to
analyze whether they have a positive, negative, or no effect on each other.

Mining of Clusters

A cluster refers to a group of similar kinds of objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but highly different from the objects in
other clusters.

Classification and Prediction

Classification is the process of finding a model that describes the data classes or concepts.
The purpose is to be able to use this model to predict the class of objects whose class label is
unknown. This derived model is based on the analysis of sets of training data. The derived
model can be presented in the following forms −

 Classification (IF-THEN) Rules
 Decision Trees
 Mathematical Formulae
 Neural Networks

The list of functions involved in these processes is as follows −

 Classification − It predicts the class of objects whose class label is unknown. Its
objective is to find a derived model that describes and distinguishes data classes or
concepts. The derived model is based on the analysis of a set of training data, i.e., data
objects whose class labels are well known (a minimal sketch follows this list).
 Prediction − It is used to predict missing or unavailable numerical data values rather
than class labels. Regression analysis is generally used for prediction. Prediction can
also be used to identify distribution trends based on the available data.
 Outlier Analysis − Outliers may be defined as the data objects that do not comply
with the general behavior or model of the available data.
 Evolution Analysis − Evolution analysis refers to the description and modeling of
regularities or trends for objects whose behavior changes over time.
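As an illustration of classification and prediction, here is a minimal sketch using scikit-learn; the toy ages, incomes, and class labels are made-up assumptions, not data from the text.

from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.linear_model import LinearRegression

# Training data: objects whose class labels are already known (made-up values).
X_train = [[25, 30000], [45, 80000], [35, 52000], [50, 95000]]   # [age, income]
y_class = ["budget", "big_spender", "budget", "big_spender"]

# Classification: derive a model (here a decision tree) and predict the class
# of an object whose class label is unknown.
clf = DecisionTreeClassifier().fit(X_train, y_class)
print(clf.predict([[40, 70000]]))   # predicted class label
print(export_text(clf))             # the derived model shown as IF-THEN style rules

# Prediction: regression estimates a missing numerical value instead of a class.
y_spend = [300, 1200, 500, 1500]    # e.g. annual spend, made-up numbers
reg = LinearRegression().fit(X_train, y_spend)
print(reg.predict([[40, 70000]]))   # predicted numeric value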
Data Mining Task Primitives
 We can specify a data mining task in the form of a data mining query.
 This query is input to the system.
 A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with the data
mining system. Here is the list of Data Mining Task Primitives −
 Set of task relevant data to be mined.
 Kind of knowledge to be mined.
 Background knowledge to be used in discovery process.
 Interestingness measures and thresholds for pattern evaluation.
 Representation for visualizing the discovered patterns.
Set of task relevant data to be mined

This is the portion of the database in which the user is interested. This portion includes the
following −

 Database Attributes
 Data Warehouse dimensions of interest
Kind of knowledge to be mined

It refers to the kinds of functions to be performed. These functions are −

 Characterization
 Discrimination
 Association and Correlation Analysis
 Classification
 Prediction
 Clustering
 Outlier Analysis
 Evolution Analysis
Background knowledge

Background knowledge allows data to be mined at multiple levels of abstraction. For
example, concept hierarchies are one kind of background knowledge that allows data to be
mined at multiple levels of abstraction.

Interestingness measures and thresholds for pattern evaluation

This is used to evaluate the patterns discovered by the knowledge discovery process. There
are different interestingness measures for different kinds of knowledge.

Representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed. These
representations may include the following −

 Rules
 Tables
 Charts
 Graphs
 Decision Trees
 Cubes

Data Mining vs. Knowledge Discovery in Databases


KDD vs Data Mining
KDD (Knowledge Discovery in Databases) is a field of computer
science, which includes the tools and theories to help humans in
extracting useful and previously unknown information (i.e.,
knowledge) from large collections of digitized data. KDD consists
of several steps, and Data Mining is one of them. Data Mining is
the application of a specific algorithm to extract patterns from
data. Nonetheless, the terms KDD and Data Mining are often used
interchangeably.

What is KDD?
KDD is a computer science field specializing in extracting
previously unknown and interesting information from raw data.
KDD is the whole process of trying to make sense of data by
developing appropriate methods or techniques. This process deals
with mapping low-level data into other forms that are more
compact, abstract, and useful. This is achieved by creating short
reports, modeling the process that generates the data, and developing
predictive models that can predict future cases.

Due to the exponential growth of data, especially in areas such as
business, KDD has become a very important process to convert
this large wealth of data into business intelligence, as manual
extraction of patterns has become seemingly impossible in the
past few decades.

For example, it is currently used for various applications such as
social network analysis, fraud detection, science, investment,
manufacturing, telecommunications, data cleaning, sports,
information retrieval, and marketing. KDD is usually used to
answer questions like: what are the main products that might help
to obtain high profit next year in V-Mart?

KDD Process Steps


The knowledge discovery in databases process includes the
following steps (a minimal end-to-end sketch in code follows the list):
1. Goal identification: Develop and understand the application domain and
the relevant prior knowledge and identify the KDD process's goal from the
customer perspective.
2. Creating a target data set: Selecting the data set or focusing on a set
of variables or data samples on which the discovery was made.
3. Data cleaning and preprocessing: Basic operations include removing
noise if appropriate, collecting the necessary information to model or
account for noise, deciding on strategies for handling missing data fields,
and accounting for time sequence information and known changes.
4. Data reduction and projection: Finding useful features to represent the
data depending on the purpose of the task. The effective number of
variables under consideration may be reduced through dimensionality
reduction methods or conversion, or invariant representations for the data
can be found.
5. Matching process objectives to a mining method: Matching the goals of the KDD
process identified in step 1 to a particular data mining method, for example,
summarization, classification, regression, clustering, and others.
6. Modeling and exploratory analysis and hypothesis
selection: Choosing the data mining algorithms and selecting the
method or methods to be used for searching for data patterns. This process includes
deciding which models and parameters may be appropriate (e.g., models for
categorical data differ from models for real-valued vectors) and matching a
particular data mining method with the overall criteria of the KDD
process (for example, the end user might be more interested in
understanding the model than in its predictive capabilities).
7. Data Mining: The search for patterns of interest in a particular
representational form or a set of such representations, including
classification rules or trees, regression, and clustering. The user can
significantly aid the data mining method by carrying out the preceding steps
properly.
8. Presentation and evaluation: Interpreting mined patterns, possibly
returning to some of the steps between steps 1 and 7 for additional
iterations. This step may also involve the visualization of the extracted
patterns and models or visualization of the data given the models drawn.
9. Taking action on the discovered knowledge: Using the knowledge
directly, incorporating the knowledge in another system for further action,
or simply documenting and reporting it to stakeholders. This process also
includes checking for and resolving potential conflicts with previously
believed (or extracted) knowledge.
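To make the steps concrete, here is a minimal, hypothetical sketch of the KDD pipeline in Python with pandas and scikit-learn. The synthetic customer table, column names, and model choices are illustrative assumptions, not part of any particular system.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 2: create a target data set (synthetic customer records).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "age": rng.integers(18, 70, 500),
    "income": rng.normal(50000, 15000, 500),
    "visits": rng.poisson(5, 500),
    "big_spender": rng.integers(0, 2, 500),   # the class label to be predicted
})
data.loc[rng.integers(0, 500, 20), "income"] = np.nan   # simulate missing fields

# Step 3: data cleaning and preprocessing (handle missing data fields).
data["income"] = data["income"].fillna(data["income"].median())

# Step 4: data reduction and projection (dimensionality reduction).
X = data[["age", "income", "visits"]]
y = data["big_spender"]
X_reduced = PCA(n_components=2).fit_transform(X)

# Steps 5-7: match the goal to a mining method (classification) and mine patterns.
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, random_state=0)
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# Step 8: present and evaluate the mined model.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))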

What is Data Mining?


Data mining, also known as Knowledge Discovery in Databases, refers to
the nontrivial extraction of implicit, previously unknown, and potentially
useful information from data stored in databases.

Data Mining is only a step within the overall KDD process. There are two
major data mining goals, defined by the intended use of the application: verification or
discovery. Verification verifies the user's hypothesis about the data, while
discovery automatically finds interesting patterns.

There are four major data mining tasks: clustering, classification,
regression, and association (summarization). Clustering is identifying
similar groups from unstructured data. Classification is learning rules that
can be applied to new data. Regression is finding functions with minimal
error to model data. And association looks for relationships between
variables. Then, the specific data mining algorithm needs to be selected.
Different algorithms like linear regression, logistic regression, decision
trees, and Naive Bayes can be selected depending on the goal. Then
patterns of interest in one or more representational forms are searched for.
Finally, models are evaluated using either predictive accuracy or
understandability.

Why do we need Data Mining?


The volume of information that we have to handle is increasing every day, coming
from business transactions, scientific data, sensor data, pictures, videos,
etc. So, we need a system capable of extracting the essence
of the information available, one that can automatically generate reports,
views, or summaries of data for better decision-making.

Why is Data Mining used in business?


Data mining is used in business to make better managerial decisions by:
o Automatic summarization of data.
o Discovering patterns in raw data.
o Extracting the essence of information stored.

Why KDD and Data Mining?


In an increasingly data-driven world, there can never be such a
thing as too much data. However, data is only valuable when you
can parse, sort, and sift through it to extract its actual value.

Most industries collect massive volumes of data, but without a
filtering mechanism that graphs, charts, and models trends in the data,
pure data itself has little use.

However, the sheer volume of data and the speed with which it is
collected makes sifting through it challenging. Thus, it has
become economically and scientifically necessary to scale up our
analysis capability to handle the vast amount of data that we now
obtain.

Since computers have allowed humans to collect more data than
we can process, we naturally turn to computational techniques to
help us extract meaningful patterns and structures from vast
amounts of data.

Difference between KDD and Data Mining


Although the two terms KDD and Data Mining are heavily used
interchangeably, they refer to two related yet slightly different
concepts.

KDD is the overall process of extracting knowledge from data,
while Data Mining is a step inside the KDD process, which deals
with identifying patterns in data.

And Data Mining is only the application of a specific algorithm
based on the overall goal of the KDD process.

KDD is an iterative process where evaluation measures can be
enhanced, mining can be refined, and new data can be integrated
and transformed to get different and more appropriate results.
Data Mining – Issues

Data mining is not an easy task, as the algorithms used can get very complex and data is
not always available in one place; it needs to be integrated from various heterogeneous data
sources. These factors also create some issues. Here in this tutorial, we will discuss the major
issues regarding −

 Mining Methodology and User Interaction
 Performance Issues
 Diverse Data Types Issues

These issues are discussed below.

Mining Methodology and User Interaction Issues

It refers to the following kinds of issues −


 Mining different kinds of knowledge in databases − Different users may be
interested in different kinds of knowledge. Therefore it is necessary for data mining to
cover a broad range of knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of abstraction − The data
mining process needs to be interactive because it allows users to focus the search for
patterns, providing and refining data mining requests based on the returned results.
 Incorporation of background knowledge − Background knowledge can be used to
guide the discovery process and to express the discovered patterns, not only in concise
terms but at multiple levels of abstraction.
 Data mining query languages and ad hoc data mining − A data mining query
language that allows the user to describe ad hoc mining tasks should be integrated
with a data warehouse query language and optimized for efficient and flexible data
mining.
 Presentation and visualization of data mining results − Once the patterns are
discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
 Handling noisy or incomplete data − Data cleaning methods are required to
handle noise and incomplete objects while mining the data regularities. If data
cleaning methods are not available, the accuracy of the discovered patterns will be
poor.
 Pattern evaluation − The patterns discovered should be interesting; patterns that
merely represent common knowledge or lack novelty are of little value.
Performance Issues

There can be performance-related issues such as the following −

 Efficiency and scalability of data mining algorithms − In order to effectively
extract information from the huge amounts of data in databases, data mining algorithms
must be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − Factors such as the
huge size of databases, the wide distribution of data, and the complexity of data mining
methods motivate the development of parallel and distributed data mining algorithms.
These algorithms divide the data into partitions, which are then processed in parallel,
and the results from the partitions are merged. Incremental algorithms incorporate
database updates without mining the data again from scratch.
Diverse Data Types Issues
 Handling of relational and complex types of data − The database may contain
complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not
possible for one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global information
systems − The data is available at different data sources on a LAN or WAN. These data
sources may be structured, semi-structured, or unstructured. Therefore mining
knowledge from them adds challenges to data mining.
What are Data Mining Metrics?

Data mining is one of the forms of artificial intelligence that uses perception models,
analytical models, and multiple algorithms to simulate the techniques of the human
brain. Data mining helps machines make human-like decisions and choices.

The user of data mining tools will have to supply the machine with rules, preferences, and
even experiences in order to obtain decision support. The main data mining metrics are as follows −

Usefulness − Usefulness involves several metrics that tell us whether the model
provides useful information. For instance, a data mining model that correlates store location
with sales can be both accurate and reliable, yet still not be useful, because one cannot
generalize that result by inserting more stores at the same location.

Furthermore, it does not answer the fundamental business question of why specific
locations have more sales. One may also find that a model that appears successful is
in fact meaningless because it depends on cross-correlations in the data.

Return on Investment (ROI) − Data mining tools will find interesting patterns buried
inside the data and develop predictive models. These models will have several measures
for denoting how well they fit the records. It is not clear how to create a decision based
on some of the measures reported as an element of data mining analyses.
Access to Financial Information during Data Mining − The simplest way to frame
decisions in financial terms is to augment the raw information that is generally mined so that it
also contains financial data. Some organizations are investing in and developing data
warehouses and data marts.

The design of a warehouse or mart involves considerations about the types of analyses
and data needed for expected queries. Designing warehouses in a way that allows
access to financial information alongside access to more typical data on product
attributes, user profiles, etc. can be useful.

Converting Data Mining Metrics into Financial Terms − A general data mining
metric is the measure of "Lift". Lift is a measure of what is achieved by using the specific
model or pattern relative to a base rate in which the model is not used. High values
mean much is achieved. It can seem then that one can simply create a decision based on
Lift.
Accuracy − Accuracy is a measure of how well the model correlates an outcome with the
attributes in the data that has been provided. There are several measures of accuracy,
but all of them depend on the data that is used. In reality,
values can be missing or approximate, or the data can have been changed by several
processes.

During the process of exploration and development, one may decide to accept a specific
amount of error in the data, especially if the data is fairly uniform in its characteristics.
For example, a model that predicts sales for a specific store based on past sales can be
strongly correlated and very accurate, even if that store consistently used the wrong
accounting techniques. Thus, measurements of accuracy should be balanced by
assessments of reliability.

What are the social implications of data mining?

Data mining is the process of finding useful new correlations, patterns, and trends by
sifting through large amounts of data saved in repositories, using pattern
recognition technologies including statistical and mathematical techniques. It is the
analysis of factual datasets to discover unsuspected relationships and to summarize the
records in novel ways that are both understandable and helpful to the data owner.

Data mining systems are designed to facilitate the identification and classification of
individuals into different groups or segments. From the perspective of a commercial firm,
and possibly for the industry as a whole, the use of data mining can be interpreted as a
discriminatory technology in the rational pursuit of profits.

There are various social implications of data mining which are as follows −

Privacy − Privacy is a loaded issue. In recent years privacy concerns have taken on a more
important role in American society as merchants, insurance companies, and government
agencies amass warehouses containing personal records.

The concerns that people have over the collection of this data will generally extend to the
analytic capabilities applied to that data. Users of data mining should start thinking about
how their use of this technology will be affected by legal issues associated with
privacy.

Profiling − Data mining and profiling is a developing field that attempts to organize,
understand, analyze, reason about, and use the explosion of data in this information age. The
process involves using algorithms and experience to extract patterns or anomalies that are
very complex, difficult, or time-consuming to identify.

The founder of Microsoft's Exploration Team used complex data mining algorithms to
solve an issue that had haunted astronomers for years: the problem of reviewing,
describing, and categorizing two billion sky objects recorded over three decades. The
algorithms were able to extract the features that characterize sky objects as stars or
galaxies. This developing field of data mining and profiling has several frontiers where it
can be applied.

Unauthorized Use − Trends obtained through data mining, intended to be used for
marketing or other ethical goals, can be misused. Unethical businesses or
people can use the data obtained through data mining to take advantage of vulnerable
people or to discriminate against a specific group of people. Furthermore, data mining
techniques are not 100 percent accurate; thus mistakes do occur, which can have serious
consequences.

Data Mining Techniques


Data mining includes the utilization of refined data analysis tools to find
previously unknown, valid patterns and relationships in huge data sets. These
tools can incorporate statistical models, machine learning techniques, and
mathematical algorithms, such as neural networks or decision trees. Thus, data
mining incorporates analysis and prediction.

Drawing on various methods and technologies from the intersection of
machine learning, database management, and statistics, professionals in data
mining have devoted their careers to better understanding how to process and
draw conclusions from huge amounts of data. But what are the methods they
use to make it happen?

In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.
1. Classification:
This technique is used to obtain important and relevant information about data
and metadata. This data mining technique helps to classify data into different
classes.

Data mining techniques can be classified by different criteria, as follows:

i. Classification of data mining frameworks as per the type of data
sources mined:
This classification is as per the type of data handled, for example,
multimedia, spatial data, text data, time-series data, World Wide Web, and
so on.
ii. Classification of data mining frameworks as per the database
involved:
This classification is based on the data model involved, for example,
object-oriented databases, transactional databases, relational databases, and so
on.
iii. Classification of data mining frameworks as per the kind of
knowledge discovered:
This classification depends on the types of knowledge discovered or the data
mining functionalities, for example, discrimination, classification,
clustering, characterization, etc. Some frameworks tend to be comprehensive
frameworks offering several data mining functionalities together.
iv. Classification of data mining frameworks according to data mining
techniques used:
This classification is as per the data analysis approach utilized, such as
neural networks, machine learning, genetic algorithms, visualization,
statistics, data warehouse-oriented or database-oriented, etc.
The classification can also take into account the level of user interaction
involved in the data mining procedure, such as query-driven systems,
autonomous systems, or interactive exploratory systems.

2. Clustering:
Clustering is the division of information into groups of connected objects.
Describing the data by a few clusters loses certain fine details but achieves
simplification; the data is modeled by its clusters. Historically, data modeling has placed
clustering in a framework rooted in statistics, mathematics, and numerical analysis.
From a machine learning point of view, clusters correspond to hidden patterns, the
search for clusters is unsupervised learning, and the resulting framework represents
a data concept. From a practical point of view, clustering plays an extraordinary role
in data mining applications such as scientific data exploration, text mining,
information retrieval, spatial database applications, CRM, web analysis,
computational biology, and medical diagnostics.

In other words, we can say that clustering analysis is a data mining technique used to
identify similar data. This technique helps to recognize the differences and
similarities between data. Clustering is very similar to classification, but it
involves grouping chunks of data together based on their similarities, as shown in the sketch below.
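A minimal clustering sketch with k-means from scikit-learn follows; the two-dimensional points and the choice of two clusters are illustrative assumptions.

from sklearn.cluster import KMeans

points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],    # one group of similar objects
          [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]    # another, very different group
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)            # cluster assignment for each object
print(kmeans.cluster_centers_)   # the centre of each cluster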

3. Regression:
Regression analysis is the data mining process used to identify and analyze
the relationship between variables in the presence of other factors. It
is used to estimate the value of a dependent variable from one or more independent
variables. Regression is primarily a form of planning and modeling. For example, we might
use it to project certain costs, depending on other factors such as availability,
consumer demand, and competition. Primarily, it gives the exact relationship between
two or more variables in the given data set.

4. Association Rules:
This data mining technique helps to discover a link between two or more items. It
finds a hidden pattern in the data set.
Association rules are if-then statements that help show the probability of
interactions between data items within large data sets in different types of
databases. Association rule mining has several applications and is commonly
used to help discover sales correlations in transactional data or in medical data sets.

The way the algorithm works is that you take a collection of data, for example, a list of
grocery items that you have been buying for the last six months, and it calculates the
percentage of items being purchased together.

These are the three major measurements:

o Lift:
This measurement compares the confidence of the rule with how often item B is
purchased on its own; values above 1 mean the items appear together more often
than expected by chance.
(Confidence) / ((Item B) / (Entire dataset))
o Support:
This measurement measures how often multiple items are purchased together,
compared to the entire dataset.
(Item A + Item B) / (Entire dataset)
o Confidence:
This measurement measures how often item B is purchased
when item A is purchased as well.
(Item A + Item B) / (Item A)
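The following sketch computes the three measurements for one rule, milk → bread, over a handful of made-up grocery transactions; the data and the choice of rule are assumptions for illustration only.

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "biscuits"},
    {"milk", "eggs"},
    {"milk", "bread", "biscuits"},
]
A, B = {"milk"}, {"bread"}
n = len(transactions)

count_A = sum(1 for t in transactions if A <= t)          # transactions containing A
count_B = sum(1 for t in transactions if B <= t)          # transactions containing B
count_AB = sum(1 for t in transactions if (A | B) <= t)   # containing both A and B

support = count_AB / n                # how often A and B appear together
confidence = count_AB / count_A       # how often B appears when A does
lift = confidence / (count_B / n)     # confidence relative to B's base rate

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")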

5. Outlier Detection:
This type of data mining technique relates to the observation of data items in the
data set that do not match an expected pattern or expected behavior. This
technique can be used in various domains like intrusion detection, fraud
detection, etc. It is also known as outlier analysis or outlier mining. An outlier
is a data point that diverges too much from the rest of the dataset. The majority
of real-world datasets have outliers. Outlier detection plays a significant
role in the data mining field and is valuable in numerous areas like
network intrusion identification, credit or debit card fraud detection, detecting
outlying readings in wireless sensor network data, etc.

6. Sequential Patterns:
The sequential pattern technique is a data mining technique specialized for evaluating
sequential data to discover sequential patterns. It comprises finding
interesting subsequences in a set of sequences, where the value of a sequence
can be measured in terms of different criteria like length, occurrence frequency,
etc.

In other words, this technique of data mining helps to discover or recognize
similar patterns in transaction data over some period of time.
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis,
clustering, classification, etc. It analyzes past events or instances in the right
sequence to predict a future event.

Statistical Methods in Data Mining


Data mining refers to extracting or mining knowledge from large amounts of data. In
other words, data mining is the science, art, and technology of exploring large and
complex bodies of data in order to discover useful patterns. Theoreticians and
practitioners are continually seeking improved techniques to make the process more
efficient, cost-effective, and accurate. Any situation can be analyzed in two ways in
data mining:
 Statistical Analysis: In statistics, data is collected, analyzed, explored, and
presented to identify patterns and trends. Alternatively, it is referred to as
quantitative analysis.
 Non-statistical Analysis: This analysis provides generalized information and
includes sound, still images, and moving images.
In statistics, there are two main categories:
 Descriptive Statistics: The purpose of descriptive statistics is to organize data and
identify the main characteristics of that data. Graphs or numbers summarize the
data. Mean, mode, standard deviation (SD), and correlation are some of the
commonly used descriptive statistical methods.
 Inferential Statistics: The process of drawing conclusions based on probability
theory and generalizing the data. By analyzing sample statistics, you can infer
parameters about populations and make models of relationships within data.
There are various statistical terms that one should be aware of while dealing with
statistics. Some of these are:
 Population
 Sample
 Variable
 Quantitative Variable
 Qualitative Variable
 Discrete Variable
 Continuous Variable
As a matter of fact, today’s statistical methods used in the data mining field are typically
derived from the vast statistical toolkit developed to answer problems arising in
other fields. These techniques are taught in science curricula. It is necessary to
check and test several hypotheses; such hypotheses help us assess
the validity of our data mining endeavor when attempting to draw inferences from
the data under study. When more complex and sophisticated statistical
estimators and tests are used, these issues become more pronounced.
For extracting knowledge from databases containing different types of observations, a
variety of statistical methods are available in Data Mining and some of these are:
 Logistic regression analysis
 Correlation analysis
 Regression analysis
 Discriminate analysis
 Linear discriminant analysis (LDA)
 Classification
 Clustering
 Outlier detection
 Classification and regression trees
 Correspondence analysis
 Nonparametric regression
 Statistical pattern recognition
 Categorical data analysis
 Time-series methods for trends and periodicity
 Artificial neural networks
Now, let’s try to understand some of the important statistical methods used
in data mining; a brief code sketch follows the list below:
 Linear Regression: The linear regression method uses the best linear relationship
between the independent and dependent variables to predict the target variable. In
order to achieve the best fit, the distances between the fitted line
and the actual observations at each point must be as small as possible. A good fit
is one for which no other choice of line would produce fewer errors.
Simple linear regression and multiple linear regression are
the two major types of linear regression. Simple linear regression predicts the
dependent variable by fitting a linear relationship to a single independent variable.
Multiple linear regression fits the best linear relationship between several
independent variables and the dependent variable.
 Classification: This is a method of data mining in which a collection of data is
categorized so that a greater degree of accuracy can be predicted and analyzed.
An effective way to analyze very large datasets is to classify them. Classification is
one of several methods aimed at improving the efficiency of the analysis process. A
Logistic Regression and a Discriminant Analysis stand out as two major
classification techniques.
 Logistic Regression: It can also be applied to machine learning
applications and predictive analytics. In this approach, the dependent
variable is either binary (binary logistic regression) or multinomial (multinomial
logistic regression), i.e., one of two options or one of several
options. With a logistic regression equation, one can estimate
probabilities describing the relationship between the independent variables
and the dependent variable.
 Discriminant Analysis: A Discriminant Analysis is a statistical method of
analyzing data based on the measurements of categories or clusters and
categorizing new observations into one or more populations that were
identified a priori. The discriminant analysis models each response class
independently then uses Bayes’s theorem to flip these projections around
to estimate the likelihood of each response category given the value of X.
These models can be either linear or quadratic.
 Linear Discriminant Analysis: In Linear
Discriminant Analysis, each observation is assigned a
discriminant score that classifies it into a response variable class.
These scores are obtained by combining the independent variables
in a linear fashion. Under this model, observations are assumed to be
drawn from a Gaussian distribution, and the predictor variables are
assumed to share a common covariance matrix across all k levels of the
response variable Y.
 Quadratic Discriminant Analysis: An alternative approach is
provided by Quadratic Discriminant Analysis. LDA and QDA
both assume Gaussian distributions for the observations of the
Y classes. Unlike LDA, QDA considers each class to have its
own covariance matrix. As a result, the predictor variables have
different variances across the k levels in Y.
 Correlation Analysis: In statistical terms, correlation analysis captures
the relationship between variables in a pair. The value of such variables
is usually stored in a column or rows of a database table and represents
a property of an object.
 Regression Analysis: Based on a set of numeric data, regression is a
data mining method that predicts a range of numerical values (also
known as continuous values). You could, for instance, use regression to
predict the cost of goods and services based on other variables. A
regression model is used across numerous industries for forecasting
financial data, modeling environmental conditions, and analyzing trends.
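As a brief sketch of a few of these methods, the snippet below fits a linear regression, a logistic regression, and a linear discriminant analysis with scikit-learn; the six data points are made-up numbers used purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])   # predictor variable
y_numeric = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])      # continuous response
y_class = np.array([0, 0, 0, 1, 1, 1])                     # categorical response

# Linear regression: best linear relationship between X and a numeric response.
lin = LinearRegression().fit(X, y_numeric)
print("slope:", lin.coef_[0], "intercept:", lin.intercept_)

# Logistic regression: estimates the probability of a binary outcome.
logit = LogisticRegression().fit(X, y_class)
print("P(class 1 | X=3.5):", logit.predict_proba([[3.5]])[0, 1])

# Linear discriminant analysis: assigns a discriminant score and a class.
lda = LinearDiscriminantAnalysis().fit(X, y_class)
print("LDA prediction for X=3.5:", lda.predict([[3.5]])[0])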
The first step in creating good statistics is having good data that was derived with an
aim in mind. There are two main types of variables: an input (independent or predictor)
variable, which we control or are able to measure, and an output (dependent or
response) variable, which is observed. A few will be quantitative measurements, but
others may be qualitative or categorical variables (called factors).

Decision Tree Induction


Decision Tree is a supervised learning method used in data mining for
classification and regression tasks. It is a tree that helps us in decision-making.
The decision tree creates classification or regression models in the form of a tree
structure. It separates a data set into smaller and smaller subsets while, at the same
time, the decision tree is incrementally developed. The final result is a tree with
decision nodes and leaf nodes. A decision node has at least two branches. The
leaf nodes show a classification or decision; we cannot split leaf nodes any further.
The uppermost decision node in a tree, which corresponds to the best
predictor, is called the root node. Decision trees can deal with both categorical and
numerical data.

Key factors:
Entropy:

Entropy refers to a common way to measure impurity. In the decision tree, it
measures the randomness or impurity in data sets.
Information Gain:

Information Gain refers to the decline in entropy after the dataset is split. It is
also called Entropy Reduction. Building a decision tree is all about discovering
attributes that return the highest information gain.

In short, a decision tree is just like a flow chart diagram with the terminal nodes showing
decisions. Starting with the whole dataset, we can measure the entropy to find ways to segment the
set until the data in each subset belongs to the same class.
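The short sketch below computes entropy and the information gain of one candidate split; the tiny yes/no label sets are made-up example data.

from collections import Counter
from math import log2

def entropy(labels):
    # Impurity of a set of class labels: -sum(p * log2(p)).
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

parent = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]
# One candidate split of the parent set into two subsets by some attribute value.
left = ["yes", "yes", "yes", "no"]
right = ["no", "no", "no", "yes"]

weighted_child_entropy = (len(left) / len(parent)) * entropy(left) \
                         + (len(right) / len(parent)) * entropy(right)
information_gain = entropy(parent) - weighted_child_entropy
print(f"entropy(parent)={entropy(parent):.3f}  information gain={information_gain:.3f}")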
Why are decision trees useful?
It enables us to analyze the possible consequences of a decision thoroughly.
It provides us a framework to measure the values of outcomes and the probability of accomplishing them.
It helps us to make the best decisions based on existing data and best speculations.

In other words, we can say that a decision tree is a hierarchical tree structure
that can be used to split an extensive collection of records into smaller sets of
classes by applying a sequence of simple decision rules. A decision tree
model comprises a set of rules for partitioning a huge heterogeneous population
into smaller, more homogeneous, mutually exclusive classes. The attributes of
the classes can be any type of variable, from nominal, ordinal, and binary to quantitative
values; in contrast, the classes must be of a qualitative type, such as categorical,
ordinal, or binary. In brief, given data described by attributes together with their classes, a
decision tree creates a set of rules that can be used to identify the class. One
rule is applied after another, resulting in a hierarchy of segments within a
segment. The hierarchy is known as the tree, and each segment is called
a node. With each progressive division, the members of the resulting sets
become more and more similar to each other. Hence, the algorithm used to build
a decision tree is referred to as recursive partitioning. The algorithm is known
as CART (Classification and Regression Trees).

Consider the given example of a factory where:

Expanding the factory costs $3 million; the probability of a good economy is 0.6
(60%), which leads to $8 million profit, and the probability of a bad
economy is 0.4 (40%), which leads to $6 million profit.

Not expanding the factory costs $0; the probability of a good economy is
0.6 (60%), which leads to $4 million profit, and the probability of a bad
economy is 0.4 (40%), which leads to $2 million profit.

The management team needs to make a data-driven decision on whether or not
to expand, based on the given data.

Net Expand = (0.6*8 + 0.4*6) - 3 = $4.2M
Net Not Expand = (0.6*4 + 0.4*2) - 0 = $3.2M
$4.2M > $3.2M, therefore the factory should be expanded.
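For reference, the same expected-value arithmetic can be written out as a tiny script; the probabilities, profits, and costs are the ones given in the example above.

def expected_net(cost, outcomes):
    # Expected profit minus cost, for (probability, profit) pairs in $ millions.
    return sum(p * profit for p, profit in outcomes) - cost

net_expand = expected_net(cost=3, outcomes=[(0.6, 8), (0.4, 6)])
net_no_expand = expected_net(cost=0, outcomes=[(0.6, 4), (0.4, 2)])

print(f"expand: ${net_expand:.1f}M, do not expand: ${net_no_expand:.1f}M")
print("decision:", "expand" if net_expand > net_no_expand else "do not expand")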

Decision tree Algorithm:


The decision tree algorithm may appear long, but it is quite simply the
basis algorithm techniques is as follows:

The algorithm is based on three parameters: D, attribute_list, and


Attribute _selection_method.

Generally, we refer to D as a data partition.

Initially, D is the entire set of training tuples and their related class
levels (input training data).

The parameter attribute_list is a set of attributes defining the tuples.

Attribute_selection_method specifies a heuristic process for


choosing the attribute that "best" discriminates the given tuples according
to class.

Attribute_selection_method process applies an attribute selection


measure.
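Here is a hedged sketch of that recursive partitioning idea in Python: D is a list of (attributes, class label) pairs, attribute_list holds the candidate attributes, and information gain stands in for Attribute_selection_method. It is an illustrative toy with made-up training tuples, not the full CART/ID3 algorithm.

from collections import Counter
from math import log2

def entropy(D):
    total = len(D)
    return -sum((c / total) * log2(c / total)
                for c in Counter(label for _, label in D).values())

def best_attribute(D, attribute_list):
    # Attribute_selection_method: pick the attribute with the highest information gain.
    def gain(attr):
        values = {row[attr] for row, _ in D}
        remainder = 0.0
        for v in values:
            subset = [(r, l) for r, l in D if r[attr] == v]
            remainder += (len(subset) / len(D)) * entropy(subset)
        return entropy(D) - remainder
    return max(attribute_list, key=gain)

def build_tree(D, attribute_list):
    labels = [label for _, label in D]
    if len(set(labels)) == 1 or not attribute_list:   # pure node, or nothing left to split on
        return Counter(labels).most_common(1)[0][0]   # leaf node: majority class
    attr = best_attribute(D, attribute_list)          # decision node on the "best" attribute
    remaining = [a for a in attribute_list if a != attr]
    return {attr: {v: build_tree([(r, l) for r, l in D if r[attr] == v], remaining)
                   for v in {row[attr] for row, _ in D}}}

# Toy training tuples (made-up data).
D = [({"outlook": "sunny", "windy": "no"}, "play"),
     ({"outlook": "sunny", "windy": "yes"}, "stay"),
     ({"outlook": "rainy", "windy": "no"}, "stay"),
     ({"outlook": "overcast", "windy": "no"}, "play")]
print(build_tree(D, ["outlook", "windy"]))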

Advantages of using decision trees:


A decision tree does not need scaling of information.
Missing values in data also do not influence the process of building a decision tree to any considerable extent.
A decision tree model is automatic and simple to explain to the technical team as well as stakeholders.
Compared to other algorithms, decision trees need less effort for data preparation during pre-processing.
A decision tree does not require standardization of data.

How Neural Networks Can Be Used for Data Mining?
As we are all aware, technology is growing day by day and a large amount of data
is produced every second. Analyzing this data is very important
because it helps us in fraud detection, identifying spam e-mail, etc. So data
mining comes into existence to help us find hidden patterns and discover knowledge from
large datasets.
In this article, we look at neural networks and how they are applied to data mining work.

Neural Network:

Neural Network is an information processing paradigm that is inspired by the human
nervous system. Just as the human nervous system has biological neurons, neural
networks have artificial neurons, which are mathematical functions modeled on
biological neurons. The human brain is estimated to have around 10 billion neurons,
each connected on average to 10,000 other neurons. Each neuron receives signals
through synapses that control the effects of the signal on the neuron.
How Does an Artificial Neural Network Work?

Let us suppose that there are n inputs, X1, X2, …, Xn, to a neuron.
=> The weights connecting the n inputs to the neuron are represented by [W] = [W1, W2, …, Wn].
=> The function of the summing junction of an artificial neuron is to collect the weighted
inputs and sum them up:
Yin = [X1*W1 + X2*W2 + … + Xn*Wn]
=> The output of the summing junction may sometimes become equal to zero, and to
prevent such a situation, a bias of fixed value Bo is added to it:
Yin = [X1*W1 + X2*W2 + … + Xn*Wn] + Bo
// Yin then moves toward the activation function.
=> The output Y of a neuron largely depends on its activation function (also known as the
transfer function).
=> There are different types of activation functions in use, such as:
1. Identity Function
2. Binary Step Function With Threshold
3. Bipolar Step Function With Threshold
4. Binary Sigmoid Function
5. Bipolar Sigmoid Function
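A minimal sketch of a single artificial neuron follows: a weighted sum plus a bias passed through a chosen activation function. The inputs, weights, and bias are made-up numbers for illustration.

import math

def neuron(x, w, bias, activation):
    y_in = sum(xi * wi for xi, wi in zip(x, w)) + bias   # Yin = X1*W1 + ... + Xn*Wn + Bo
    return activation(y_in)

identity = lambda y: y
binary_step = lambda y: 1 if y >= 0 else 0
binary_sigmoid = lambda y: 1 / (1 + math.exp(-y))
bipolar_sigmoid = lambda y: (1 - math.exp(-y)) / (1 + math.exp(-y))

x = [0.5, 0.3, 0.2]      # inputs X1..Xn
w = [0.4, -0.7, 0.2]     # weights W1..Wn
bias = 0.1               # Bo

for name, f in [("identity", identity), ("binary step", binary_step),
                ("binary sigmoid", binary_sigmoid), ("bipolar sigmoid", bipolar_sigmoid)]:
    print(name, "->", neuron(x, w, bias, f))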

Neural Network Architecture:

While researchers have created numerous different neural network architectures, the
most successful applications of neural networks in data mining have been multilayer
feedforward networks. These are networks in which there is an input layer consisting of
nodes that simply accept the input values, and successive layers of nodes that are
neurons as described above. The outputs of neurons in a layer are inputs to neurons in
the next layer. The last layer is called the output layer. Layers between the input and
output layers are known as hidden layers.
As you know, there are two types of supervised learning: regression and classification.
In a regression-type problem, where the neural network is used to predict a numerical
quantity, there is one neuron in the output layer and its output is the prediction. In a
classification-type problem, the output layer has as many nodes as the number of
classes, and the output-layer node with the largest output value gives the network’s
estimate of the class for a given input. In the special case of two classes, it is common
to have just one node in the output layer, the classification between the two classes
being made by applying a cut-off to the output value at that node. A small forward-pass
sketch follows.
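The snippet below sketches one forward pass through a small multilayer feedforward network for a three-class problem, using NumPy; the random weights are purely illustrative assumptions, since a real network would learn them during training.

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

x = np.array([0.5, 0.3, 0.2])                  # input layer: three input values

# Hidden layer: four neurons, each with one weight per input plus a bias.
W_hidden = rng.normal(size=(4, 3))
b_hidden = rng.normal(size=4)
hidden = sigmoid(W_hidden @ x + b_hidden)      # outputs of the hidden-layer neurons

# Output layer for a three-class classification problem: one node per class.
W_out = rng.normal(size=(3, 4))
b_out = rng.normal(size=3)
scores = W_out @ hidden + b_out

print("predicted class:", int(np.argmax(scores)))   # node with the largest output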

Why use Neural Network Method in Data Mining?

Neural networks help in mining large amounts of data in various sectors such as retail,
banking (fraud detection), bioinformatics (genome sequencing), etc. Finding useful
information hidden in large data sets is very challenging and also very necessary.
Data mining uses neural networks to harvest information from the large datasets of data
warehousing organizations, which helps users in decision making.
Some of the Applications of Neural Network In Data Mining are given below:
 Fraud Detection: As we know, fraudsters have been exploiting businesses and
banks for their own financial gain for many years, and the problem is growing in
today’s modern world because the advancement of technology makes fraud
relatively easy to commit. On the other hand, technology also helps in fraud
detection, and here neural networks help us a lot in detecting fraud.
 Healthcare: In healthcare, neural networks help us in diagnosing diseases. There
are many diseases, and there are large datasets holding records of these diseases.
With neural networks and these records, we can diagnose these diseases at an
early stage, as soon as possible.
Different Neural Network Methods in Data Mining

Neural network methods are used for classification, clustering, feature mining,
prediction, and pattern recognition. The McCulloch-Pitts model is considered to be the
first neural network, and the Hebbian learning rule is one of the earliest and simplest
learning rules for neural networks. Neural network models can be broadly divided
into the following three types:

 Feed-Forward Neural Networks: In a feed-forward network, the output values
cannot be traced back to the input values; for every input node an output
is calculated, so there is a forward flow of information and no feedback
between the layers. In simple words, the information moves in only one direction
(forward) from the input nodes, through the hidden nodes (if any), and to the output
nodes. Such a type of network is known as a feedforward network.
 Feedback Neural Networks: Signals can travel in both directions in a feedback
network. Feedback neural networks are very powerful and can become very
complex. Feedback networks are dynamic: the “states” in such a network are
constantly changing until an equilibrium point is reached, and they stay at equilibrium
until the input changes and a new equilibrium needs to be found. Feedback neural
network architectures are also known as interactive or recurrent. Feedback loops
are allowed in such networks, and they are used for content-addressable memory.
 Self-Organizing Neural Networks: A Self-Organizing Neural Network (SONN) is a
type of artificial neural network that is trained using competitive learning rather than
the error-correction learning (e.g., backpropagation with gradient descent) used by
other artificial neural networks. A SONN is an unsupervised learning model, also
termed a Self-Organizing Feature Map or Kohonen Map. It is used to produce a
low-dimensional (typically two-dimensional) representation of a higher-dimensional
data set while preserving the topological structure of the data.
