
What is Data?

According to the Oxford dictionary, “Data is distinct pieces of information, usually formatted in a special way”.
Data is measured, collected, reported, and analyzed, whereupon it is often visualized using graphs, images, or other analysis tools. Raw data (“unprocessed data”) may be a collection of numbers or characters before it has been “cleaned” and corrected by researchers. It must be corrected so that outliers and instrument or data-entry errors can be removed. Data processing commonly occurs in stages, and therefore the “processed data” from one stage may also be considered the “raw data” of the subsequent stage.
Field data is data that is collected in an uncontrolled, “in situ” environment. Experimental data is data that is generated in the course of scientific investigations.
Data can be generated by:
 Humans
 Machines
 Human-machine combinations
Data can be generated anywhere information is produced and stored, in structured or unstructured formats.
Why is data important?
 Data helps in making better decisions.
 Data helps in solving problems by finding the reasons for underperformance.
 Data helps one evaluate performance.
 Data helps one improve processes.
 Data helps one understand consumers and the market.
Types of Data:
Generally, data can be classified into two categories:
1. Categorical Data:
Categorical data consists of values that belong to a defined category, for example:
 Marital Status
 Political Party
 Eye color
2. Numerical Data:
Numerical data can further be classified into two categories:
 Discrete Data:
Discrete data consists of values that take distinct, countable numerical values, for example the number of children or defects per hour.
 Continuous Data:
Continuous data consists of values that can take any value within a range, for example weight or voltage.
A small pandas sketch illustrating these data types follows.
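As a minimal, assumed illustration (the column names and values below are hypothetical, and the pandas library is assumed to be available), the sketch shows how categorical, discrete, and continuous columns might be represented and inspected in Python:
# hypothetical records mixing categorical, discrete, and continuous values
import pandas as pd
df = pd.DataFrame({
    "marital_status": pd.Categorical(["single", "married", "single"]),  # categorical
    "num_children": [0, 2, 1],                                          # discrete numeric
    "weight_kg": [61.5, 82.3, 74.0],                                    # continuous numeric
})
# the dtypes reveal which columns are categorical and which are numerical
print(df.dtypes)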

What is Data Mining?


The process of extracting information from huge sets of data in order to identify patterns and trends that allow a business to make data-driven decisions is called Data Mining.
In other words, we can say that Data Mining is the process of investigating hidden patterns in data from various perspectives and categorizing them into useful information, which is collected and assembled in particular areas such as data warehouses, analyzed efficiently with data mining algorithms, and used to support decision making, cut costs, and generate revenue.
Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events. Data Mining is also called Knowledge Discovery of Data (KDD).
Data Mining is a process used by organizations to extract specific data from
huge databases to solve business problems. It primarily turns raw data into
useful information.
Data Mining is similar to Data Science: it is carried out by a person, in a specific situation, on a particular data set, with an objective. This process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is done through software that may be simple or highly specialized. By outsourcing data mining, all the work can be done faster and with lower operating costs. Specialized firms can also use new technologies to collect data that is impossible to locate manually.
There is a huge amount of information available on various platforms, but very little knowledge is accessible. The biggest challenge is to analyze the data to extract important information that can be used to solve a problem or support company development. There are many powerful instruments and techniques available for mining data and deriving better insight from it.
Types of Data Mining
Data mining can be performed on the following types of data:
Relational Database:
A relational database is a collection of multiple data sets formally organized into tables, records, and columns, from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.
Data warehouses:
A Data Warehouse is the technology that collects the data from various
sources within the organization to provide meaningful business insights. The
huge amount of data comes from multiple places such as Marketing and
Finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business organization. The data warehouse is designed for the analysis of data rather than transaction processing.
Data Repositories:
The term Data Repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure, for example, a group of databases where an organization has stored various kinds of information.
Object-Relational Database:
A combination of an object-oriented database model and relational database
model is called an object-relational model. It supports Classes, Objects,
Inheritance, etc.
One of the primary objectives of the Object-relational data model is to close
the gap between the Relational database and the object-oriented model
practices frequently utilized in many programming languages, for example,
C++, Java, C#, and so on.
Transactional Database:
A transactional database refers to a database management system (DBMS) that has the potential to undo a database transaction if it is not performed appropriately. Even though this was a unique capability a long time ago, today most relational database systems support transactional database activities, as the sketch below illustrates.
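As a minimal sketch of the "undo" capability described above (using Python's built-in sqlite3 module purely for illustration), a failed operation can be rolled back so the database returns to its previous consistent state:
# demonstrate a transaction rollback with sqlite3 (illustrative only)
import sqlite3
conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100.0)")
conn.commit()
try:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
    raise RuntimeError("simulated failure before the transfer completes")
except RuntimeError:
    conn.rollback()  # undo the partial transaction
print(conn.execute("SELECT balance FROM accounts").fetchone())  # (100.0,) - unchanged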

Data Quality: Why do we preprocess the data?


Data preprocessing is an essential step in data mining and machine learning
as it helps to ensure the quality of data used for analysis. There are several
factors that are used for data quality assessment, including:
1. Incompleteness:
This refers to missing data or information in the dataset. Missing data can
result from various factors, such as errors during data entry or data loss
during transmission. Preprocessing techniques, such as imputation, can be
used to fill in missing values to ensure the completeness of the dataset.
2. Inconsistency:
This refers to conflicting or contradictory data in the dataset. Inconsistent
data can result from errors in data entry, data integration, or data storage.
Preprocessing techniques, such as data cleaning and data integration, can be
used to detect and resolve inconsistencies in the dataset.
3. Noise:
This refers to random or irrelevant data in the dataset. Noise can result from
errors during data collection or data entry. Preprocessing techniques, such as
data smoothing and outlier detection, can be used to remove noise from the
dataset.
4. Outliers:
Outliers are data points that are significantly different from the other data
points in the dataset. Outliers can result from errors in data collection, data
entry, or data transmission. Preprocessing techniques, such as outlier
detection and removal, can be used to identify and remove outliers from the
dataset.
5. Redundancy:
Redundancy refers to the presence of duplicate or overlapping data in the
dataset. Redundant data can result from data integration or data storage.
Preprocessing techniques, such as data deduplication, can be used to
remove redundant data from the dataset.
6. Data format:
This refers to the structure and format of the data in the dataset. Data may be in different formats, such as text, numerical, or categorical. Preprocessing techniques, such as data transformation and normalization, can be used to convert data into a consistent format for analysis. A small preprocessing sketch follows this list.
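The sketch below is a minimal, assumed example (the column names and values are hypothetical, and the pandas library is assumed to be available) of how a few of these preprocessing steps - imputation, deduplication, simple outlier removal, and normalization - might look in practice:
import pandas as pd
# hypothetical raw dataset with a missing value, a duplicate row, and an outlier
raw = pd.DataFrame({
    "age":    [25, 32, None, 32, 41, 250],      # None = incomplete, 250 = outlier
    "income": [40000, 52000, 47000, 52000, 61000, 58000],
})
df = raw.drop_duplicates()                          # redundancy: remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())    # incompleteness: impute missing ages
df = df[df["age"].between(0, 120)].copy()           # outliers: keep plausible ages only
# data format: normalize income to the [0, 1] range
df["income"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
print(df)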

Statistical Methods in Data Mining


Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns. Theoreticians and practitioners are continually seeking improved techniques to make the process more efficient, cost-effective, and accurate. Any situation can be analyzed in two ways in data mining:
 Statistical Analysis: In statistics, data is collected, analyzed,
explored, and presented to identify patterns and trends. Alternatively,
it is referred to as quantitative analysis.
 Non-statistical Analysis: This analysis provides generalized
information and includes sound, still images, and moving images.
In statistics, there are two main categories:
 Descriptive Statistics: The purpose of descriptive statistics is to
organize data and identify the main characteristics of that data. Graphs
or numbers summarize the data. Average, Mode, SD(Standard
Deviation), and Correlation are some of the commonly used descriptive
statistical methods.
 Inferential Statistics: The process of drawing conclusions based on
probability theory and generalizing the data. By analyzing sample
statistics, you can infer parameters about populations and make
models of relationships within data.
There are various statistical terms that one should be aware of while dealing
with statistics. Some of these are:
 Population
 Sample
 Variable
 Quantitative Variable
 Qualitative Variable
 Discrete Variable
 Continuous Variable
Now, let’s start discussing statistical methods. This is the analysis of raw
data using mathematical formulas, models, and techniques. Through the use
of statistical methods, information is extracted from research data, and
different ways are available to judge the robustness of research outputs.
In fact, the statistical methods used in the data mining field today are typically derived from the vast statistical toolkit developed to answer problems arising in other fields, and these techniques are taught in science curricula. It is often necessary to formulate and test several hypotheses; such hypotheses help us assess the validity of our data mining endeavor when attempting to draw inferences from the data under study. When more complex and sophisticated statistical estimators and tests are used, these issues become more pronounced.
For extracting knowledge from databases containing different types of
observations, a variety of statistical methods are available in Data Mining
and some of these are:
 Logistic regression analysis
 Correlation analysis
 Regression analysis
 Discriminant analysis
 Linear discriminant analysis (LDA)
 Classification
 Clustering
 Outlier detection
 Classification and regression trees
 Correspondence analysis
 Nonparametric regression
 Statistical pattern recognition
 Categorical data analysis
 Time-series methods for trends and periodicity
 Artificial neural networks
Now, let’s try to understand some of the important statistical methods which
are used in data mining:
 Linear Regression: The linear regression method uses the best linear relationship between the independent and dependent variables to predict the target variable. To achieve the best fit, the distances between the fitted line and the actual observations at each point should be as small as possible; a good fit is one where no other position of the line would produce fewer errors. Simple linear regression and multiple linear regression are the two major types. Simple linear regression predicts the dependent variable by fitting a linear relationship to a single independent variable, while multiple linear regression fits the best linear relationship with the dependent variable using multiple independent variables. A short scikit-learn sketch follows this list.
 Classification: This is a method of data mining in which a collection of
data is categorized so that a greater degree of accuracy can be
predicted and analyzed. An effective way to analyze very large
datasets is to classify them. Classification is one of several methods
aimed at improving the efficiency of the analysis process. Logistic regression and discriminant analysis stand out as two major classification techniques.
 Logistic Regression: It can also be applied to machine learning applications and predictive analytics. In this approach, the dependent variable is either binary (binary logistic regression) or multinomial (multinomial logistic regression), i.e., it takes one of two or one of several possible categories. With a logistic regression equation, one can estimate probabilities regarding the relationship between the independent variables and the dependent variable. For understanding logistic regression analysis in detail, you can refer to logistic regression.
 Discriminant Analysis: Discriminant analysis is a statistical method of analyzing data based on the measurements of categories or clusters and categorizing new observations into one or more populations that were identified a priori. Discriminant analysis models each response class separately and then uses Bayes' theorem to flip these projections around to estimate the likelihood of each response category given the value of X. These models can be either linear or quadratic.
 Linear Discriminant Analysis: In Linear Discriminant Analysis, each observation is assigned a discriminant score that classifies it into a response variable class. These scores are obtained by combining the independent variables in a linear fashion. The model assumes that observations are drawn from a Gaussian distribution and that the predictor variables share a common covariance matrix across all k levels of the response variable Y. For further details, refer to linear discriminant analysis.
 Quadratic Discriminant Analysis: An alternative
approach is provided by Quadratic Discriminant Analysis.
LDA and QDA both assume Gaussian distributions for the
observations of the Y classes. Unlike LDA, QDA considers
each class to have its own covariance matrix. As a result,
the predictor variables have different variances across the
k levels in Y.
 Correlation Analysis: In statistical terms, correlation analysis
captures the relationship between variables in a pair. The value
of such variables is usually stored in a column or rows of a
database table and represents a property of an object.
 Regression Analysis: Based on a set of numeric data,
regression is a data mining method that predicts a range of
numerical values (also known as continuous values). You could,
for instance, use regression to predict the cost of goods and
services based on other variables. A regression model is used
across numerous industries for forecasting financial data,
modeling environmental conditions, and analyzing trends.
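As a minimal, illustrative sketch (the data points are made up and scikit-learn is assumed to be available), the following shows the simple linear regression and logistic regression methods described above:
# simple linear regression and logistic regression on tiny made-up datasets
from sklearn.linear_model import LinearRegression, LogisticRegression
# linear regression: predict a continuous target from one independent variable
X = [[1], [2], [3], [4], [5]]
y = [2.1, 4.0, 6.2, 7.9, 10.1]
lin = LinearRegression().fit(X, y)
print("slope:", lin.coef_[0], "intercept:", lin.intercept_)
print("prediction for x = 6:", lin.predict([[6]])[0])
# logistic regression: predict a binary class and estimate its probability
Xc = [[1], [2], [3], [6], [7], [8]]
yc = [0, 0, 0, 1, 1, 1]
log = LogisticRegression().fit(Xc, yc)
print("P(class = 1 | x = 4):", log.predict_proba([[4]])[0, 1])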

Introduction of Statistical Data Distributions


A distribution simply means a collection or gathering of data, or scores, on a variable. Generally, all these scores are arranged in a specific order, from smallest to largest, and can then be presented graphically. Much data complies with the rules of well-known and well-understood mathematical functions.
A function can usually fit the data with some modifications and changes to the parameters of the function. As soon as the distribution function is known and identified, it can be used as shorthand for describing and calculating related quantities, such as the likelihood of observations, and for plotting the relationship between observations in the domain.
Distributions are generally described in terms of their density functions. Density functions are simply functions that explain how the proportion of data, or the likelihood of a proportion of observations, changes over the range of the distribution. Density functions are of two types –
 Probability Density Function (PDF) –
It calculates the probability of observing a given value.
 Cumulative Density Function (CDF) –
It calculates the probability of an observation being equal to or less than a given value.
Both PDFs and CDFs are continuous functions. For discrete distributions, the equivalent of the PDF is called the Probability Mass Function (PMF).
Types of Statistical Data Distributions:
1. Gaussian distribution –
It is named after Carl Friedrich Gauss. The Gaussian distribution is the focus of much of the field of statistics and is also known as the Normal distribution. Data from many different fields of study can be described with the Gaussian distribution. Generally, a Gaussian distribution is described using two parameters:
 Mean:
It is denoted with the Greek lowercase letter “mu” (μ). It is the expected value of the distribution.
 Variance:
It is denoted with the Greek lowercase letter “sigma” raised to the second power (σ²), because the units of the variable are squared. It generally describes the spread of observations from the mean.
It is very common to use a normalized calculation of the variance called the Standard Deviation. The standard deviation is denoted with the Greek lowercase letter “sigma” (σ). It describes the normalized spread of observations from the mean.
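For reference, the Gaussian probability density function with mean μ and standard deviation σ is:
f(x) = (1 / (σ √(2π))) · exp(−(x − μ)² / (2σ²))
This is the function evaluated by norm.pdf in the example below.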
Example –
The example given below creates a Gaussian PDF with a sample space from -5 to 5, a mean of 0, and a standard deviation of 1. A Gaussian with these values of mean and standard deviation is called the Standard Gaussian.
Python Code for Line Plot of Gaussian Probability Density Function:
# plot the gaussian pdf
from numpy import arange
from matplotlib import pyplot
from scipy.stats import norm
# define the distribution parameters
sample_space = arange(-5, 5, 0.001)
mean = 0.0
stdev = 1.0
# calculate the pdf
pdf = norm.pdf(sample_space, mean, stdev)
# plot
pyplot.plot(sample_space, pdf)
pyplot.show()
When we run the above example, it creates a line plot that shows the sample space on the x-axis and the likelihood of each value on the y-axis. The line plot shows the familiar bell shape of the Gaussian distribution.
In this plot, the top of the bell marks the expected value or mean, which in this case is zero, as we specified while creating the distribution.

2. T-distribution –
It is named after William Sealy Gosset. The t-distribution generally arises when we attempt to estimate the mean of a normal distribution from samples of different sizes. It is very helpful when describing the uncertainty or error involved in estimating population statistics for data drawn from Gaussian distributions when the sample size must be taken into account.
The t-distribution can be described using a single parameter.
Number of Degrees of Freedom:
It is denoted with the Greek lowercase letter “nu” (ν). It simply denotes the number of degrees of freedom, which generally describes the number of pieces of information used to estimate the population quantity.
Example –
The example given below creates a t-distribution with a sample space from -5 to 5 and (10,000 − 1) degrees of freedom.
Python Code for Line Plot of Student’s t-distribution Probability Density
Function :
# plot the t-distribution pdf
from numpy import arange
from matplotlib import pyplot
from scipy.stats import t
# define the distribution parameters
sample_space = arange(-5, 5, 0.001)
dof = len(sample_space) - 1
# calculate the pdf
pdf = t.pdf(sample_space, dof)
# plot
pyplot.plot(sample_space, pdf)
pyplot.show()
When we run the above example, it creates and plots the t-distribution PDF. You can see a bell shape similar to that of the normal distribution. The main difference is the fatter tails of this distribution, highlighting the increased likelihood of observations in the tails compared to the Gaussian distribution.

Data Mining - Tasks


Data mining deals with the kind of patterns that can be mined. On the basis
of the kind of data to be mined, there are two categories of functions
involved in Data Mining −
 Descriptive
 Classification and Prediction
Descriptive Function
The descriptive function deals with the general properties of data in the
database. Here is the list of descriptive functions −
 Class/Concept Description
 Mining of Frequent Patterns
 Mining of Associations
 Mining of Correlations
 Mining of Clusters
Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For example, in a company, the classes of items for sale include computers and printers, and concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways −
 Data Characterization − This refers to summarizing the data of the class under study. The class under study is called the Target Class.
 Data Discrimination − It refers to the mapping or classification of a class with some predefined group or class.
Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional
data. Here is the list of kind of frequent patterns −
 Frequent Item Set − It refers to a set of items that frequently appear
together, for example, milk and bread.
 Frequent Subsequence − A sequence of patterns that occurs frequently, such as purchasing a camera followed by a memory card.
 Frequent Sub Structure − Substructure refers to different structural
forms, such as graphs, trees, or lattices, which may be combined with
item-sets or subsequences.
Mining of Association
Associations are used in retail sales to identify items that are frequently purchased together. This refers to the process of uncovering the relationships among data and determining association rules.
For example, a retailer generates an association rule showing that 70% of the time milk is sold with bread and only 30% of the time biscuits are sold with bread. (A small sketch of computing such support and confidence values follows.)
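As a minimal sketch (the transactions below are made up), the support and confidence of a rule such as "bread → milk" can be computed directly from transaction data:
# count support and confidence of the rule bread -> milk over made-up transactions
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "biscuits"},
    {"milk", "eggs"},
    {"bread", "milk"},
]
bread = [t for t in transactions if "bread" in t]
bread_and_milk = [t for t in bread if "milk" in t]
support = len(bread_and_milk) / len(transactions)   # fraction of all transactions
confidence = len(bread_and_milk) / len(bread)       # fraction of bread transactions
print(f"support={support:.2f}, confidence={confidence:.2f}")  # support=0.60, confidence=0.75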
Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two item sets, in order to analyze whether they have a positive, negative, or no effect on each other.
Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.
Classification and Prediction
Classification is the process of finding a model that describes the data
classes or concepts. The purpose is to be able to use this model to predict
the class of objects whose class label is unknown. This derived model is
based on the analysis of sets of training data. The derived model can be
presented in the following forms −
 Classification (IF-THEN) Rules
 Decision Trees
 Mathematical Formulae
 Neural Networks
The list of functions involved in these processes is as follows −
 Classification − It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e., data objects whose class labels are known. (A small decision-tree sketch follows this list.)
 Prediction − It is used to predict missing or unavailable numerical
data values rather than class labels. Regression Analysis is generally
used for prediction. Prediction can also be used for identification of
distribution trends based on available data.
 Outlier Analysis − Outliers may be defined as the data objects that
do not comply with the general behavior or model of the data
available.
 Evolution Analysis − Evolution analysis refers to the description and modeling of regularities or trends for objects whose behavior changes over time.
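As an assumed, minimal sketch (made-up training data, with scikit-learn assumed available), the following derives a classification model as a decision tree, prints it in IF-THEN style, and predicts the class of an object whose class label is unknown:
# learn a decision tree from labeled training data and classify a new object
from sklearn.tree import DecisionTreeClassifier, export_text
# training data whose class labels are known: [age, income] -> spender type
X = [[22, 20000], [25, 24000], [47, 85000], [52, 90000]]
y = ["budget", "budget", "big spender", "big spender"]
model = DecisionTreeClassifier(max_depth=2).fit(X, y)
# the derived model can be presented as a decision tree / IF-THEN style rules
print(export_text(model, feature_names=["age", "income"]))
# predict the class of an object whose class label is unknown
print(model.predict([[30, 30000]]))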
Data Mining Task Primitives
 We can specify a data mining task in the form of a data mining query.
 This query is input to the system.
 A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner
with the data mining system. Here is the list of Data Mining Task Primitives −
 Set of task relevant data to be mined.
 Kind of knowledge to be mined.
 Background knowledge to be used in discovery process.
 Interestingness measures and thresholds for pattern evaluation.
 Representation for visualizing the discovered patterns.
Set of task relevant data to be mined
This is the portion of database in which the user is interested. This portion
includes the following −
 Database Attributes
 Data Warehouse dimensions of interest
Kind of knowledge to be mined
It refers to the kind of functions to be performed. These functions are −
 Characterization
 Discrimination
 Association and Correlation Analysis
 Classification
 Prediction
 Clustering
 Outlier Analysis
 Evolution Analysis
Background knowledge
The background knowledge allows data to be mined at multiple levels of
abstraction. For example, the Concept hierarchies are one of the background
knowledge that allows data to be mined at multiple levels of abstraction.
Interestingness measures and thresholds for pattern evaluation
This is used to evaluate the patterns discovered by the process of knowledge discovery. There are different interestingness measures for different kinds of knowledge.
Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed. These representations may include the following −
 Rules
 Tables
 Charts
 Graphs
 Decision Trees
 Cubes

KDD vs Data Mining


KDD (Knowledge Discovery in Databases) is a field of computer science, which
includes the tools and theories to help humans in extracting useful and previously
unknown information (i.e., knowledge) from large collections of digitized data. KDD
consists of several steps, and Data Mining is one of them. Data Mining is the
application of a specific algorithm to extract patterns from data. Nonetheless, KDD
and Data Mining are used interchangeably.
What is KDD?
KDD is a computer science field specializing in extracting previously unknown and
interesting information from raw data. KDD is the whole process of trying to make
sense of data by developing appropriate methods or techniques. This process deals with mapping low-level data into other forms that are more compact, abstract, and useful. This is achieved by creating short reports, modeling the process of generating data, and developing predictive models that can predict future cases.

Due to the exponential growth of data, especially in areas such as business, KDD
has become a very important process to convert this large wealth of data into
business intelligence, as manual extraction of patterns has become seemingly
impossible in the past few decades.

For example, it is currently used for various applications such as social network
analysis, fraud detection, science, investment, manufacturing, telecommunications,
data cleaning, sports, information retrieval, and marketing. KDD is typically used to answer questions such as: what are the main products that might help obtain high profit next year in V-Mart?

KDD Process Steps


The knowledge discovery in databases process includes the following steps (a minimal end-to-end sketch follows the list):

1. Goal identification: Develop and understand the application domain and


the relevant prior knowledge and identify the KDD process's goal from the
customer perspective.
2. Creating a target data set: Selecting the data set or focusing on a set of
variables or data samples on which the discovery was made.
3. Data cleaning and preprocessing: Basic operations include removing noise
if appropriate, collecting the necessary information to model or account for
noise, deciding on strategies for handling missing data fields, and accounting
for time sequence information and known changes.
4. Data reduction and projection: Finding useful features to represent the
data depending on the purpose of the task. The effective number of variables
under consideration may be reduced through dimensionality reduction
methods or conversion, or invariant representations for the data can be
found.
5. Matching process objectives: Matching the goals of the KDD process (from step 1) to a particular data mining method, for example, summarization, classification, regression, clustering, and others.
6. Modeling, exploratory analysis, and hypothesis selection: Choosing the data mining algorithm(s) and selecting the method or methods to be used for searching for data patterns. This process includes deciding which models and parameters may be appropriate (e.g., models for categorical data differ from models on real-valued vectors) and matching a particular data mining method with the general approach of the KDD process (for example, the end user might be more interested in understanding the model than in its predictive capabilities).
7. Data Mining: Searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering. The user can significantly aid the data mining method by carrying out the preceding steps properly.
8. Presentation and evaluation: Interpreting mined patterns, possibly
returning to some of the steps between steps 1 and 7 for additional
iterations. This step may also involve the visualization of the extracted
patterns and models or visualization of the data given the models drawn.
9. Taking action on the discovered knowledge: Using the knowledge
directly, incorporating the knowledge in another system for further action, or
simply documenting it and reporting it to stakeholders. This process also includes checking and resolving potential conflicts with previously believed or previously extracted knowledge.
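To make the order of these steps concrete, here is a very small, assumed end-to-end sketch (made-up customer records, with pandas and scikit-learn assumed available) that selects a target data set, cleans it, mines it with a clustering algorithm, and evaluates the result:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# step 2: create a target data set (hypothetical customer records)
data = pd.DataFrame({
    "age":      [23, 25, 24, 51, 53, None],
    "spending": [200, 220, 210, 900, 950, 930],
})
# step 3: data cleaning and preprocessing (impute the missing age)
data["age"] = data["age"].fillna(data["age"].mean())
# steps 5-7: choose a mining method (clustering) and search for patterns
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
# step 8: evaluate and interpret the mined patterns
print("cluster labels:", model.labels_)
print("silhouette score:", silhouette_score(data, model.labels_))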

What is Data Mining?


Data mining, also known as Knowledge Discovery in Databases, refers to the
nontrivial extraction of implicit, previously unknown, and potentially useful
information from data stored in databases.

Data Mining is only a step within the overall KDD process. There are two major Data Mining goals, defined by the application's goal: verification or discovery. Verification verifies the user's hypothesis about the data, while discovery automatically finds interesting patterns.
There are four major data mining tasks: clustering, classification, regression, and
association (summarization). Clustering is identifying similar groups from
unstructured data. Classification is learning rules that can be applied to new data.
Regression is finding functions with minimal error to model data. And association looks for relationships between variables. Then, the specific data mining algorithm needs to be selected; different algorithms, such as linear regression, logistic regression, decision trees, and Naive Bayes, can be selected depending on the goal. Then, patterns of interest are searched for in one or more representational forms. Finally, models are evaluated on either predictive accuracy or understandability.

Why do we need Data Mining?


The volume of information that we have to handle is increasing every day, coming from business transactions, scientific data, sensor data, pictures, videos, etc. So, we need a system that is capable of extracting the essence of the information available and that can automatically generate reports, views, or summaries of the data for better decision-making.

Why is Data Mining used in business?


Data mining is used in business to make better managerial decisions by:

o Automatic summarization of data.


o Discovering patterns in raw data.
o Extracting the essence of information stored.

Why KDD and Data Mining?


In an increasingly data-driven world, there would never be such a thing as too much
data. However, data is only valuable when you can parse, sort, and sift through it to
extrapolate the actual value.

Most industries collect massive volumes of data, but without a mechanism that filters, graphs, charts, and models the data, pure data itself has little use.

However, the sheer volume of data and the speed with which it is collected makes
sifting through it challenging. Thus, it has become economically and scientifically
necessary to scale up our analysis capability to handle the vast amount of data that
we now obtain.

Since computers have allowed humans to collect more data than we can process,
we naturally turn to computational techniques to help us extract meaningful
patterns and structures from vast amounts of data.
Difference between KDD and Data Mining
Although the two terms KDD and Data Mining are heavily used interchangeably,
they refer to two related yet slightly different concepts.

KDD is the overall process of extracting knowledge from data, while Data Mining is a
step inside the KDD process, which deals with identifying patterns in data.

And Data Mining is only the application of a specific algorithm based on the overall
goal of the KDD process.

KDD is an iterative process where evaluation measures can be enhanced, mining


can be refined, and new data can be integrated and transformed to get different
and more appropriate results.

Data Mining - Issues

Data mining is not an easy task, as the algorithms used can get very complex, and data is not always available in one place; it needs to be integrated from various heterogeneous data sources. These factors create some issues. In this tutorial, we will discuss the major issues regarding −
 Mining Methodology and User Interaction
 Performance Issues
 Diverse Data Types Issues
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
 Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.
 Interactive mining of knowledge at multiple levels of
abstraction − The data mining process needs to be interactive
because it allows users to focus the search for patterns, providing and
refining data mining requests based on the returned results.
 Incorporation of background knowledge − To guide discovery
process and to express the discovered patterns, the background
knowledge can be used. Background knowledge may be used to
express the discovered patterns not only in concise terms but at
multiple levels of abstraction.
 Data mining query languages and ad hoc data mining − Data
Mining Query language that allows the user to describe ad hoc mining
tasks, should be integrated with a data warehouse query language and
optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
 Handling noisy or incomplete data − The data cleaning methods
are required to handle the noise and incomplete objects while mining
the data regularities. If the data cleaning methods are not there then
the accuracy of the discovered patterns will be poor.
 Pattern evaluation − The patterns discovered may not be interesting if they represent common knowledge or lack novelty.
Performance Issues
There can be performance-related issues such as follows −
 Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amount of data in databases, data mining algorithms must be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in a parallel fashion; the results from the partitions are then merged. Incremental algorithms update the mined knowledge without mining the data again from scratch.
Diverse Data Types Issues
 Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.

What is Fuzzy Logic?


The word 'fuzzy' refers to things that are not clear or are vague. Sometimes, in real life, we cannot decide whether a given problem or statement is true or false. At such times, this concept provides many values between true and false and gives the flexibility to find the best solution to the problem.
Example of Fuzzy Logic as compared to Boolean Logic

Fuzzy logic contains multiple logical values, and these values are the truth values of a variable or problem between 0 and 1. This concept was introduced by Lotfi Zadeh in 1965, based on Fuzzy Set Theory. It provides possibilities that are not offered by classical computing, but that are similar to the range of possibilities generated by humans.

In the Boolean system, only two possibilities (0 and 1) exist, where 1 denotes the absolute truth value and 0 denotes the absolute false value. In the fuzzy system, there are multiple possibilities between 0 and 1, which are partially false and partially true.

Fuzzy logic can be implemented in systems such as micro-controllers, workstation-based systems, or large network-based systems to achieve a definite output. It can be implemented in hardware, software, or a combination of both.

Characteristics of Fuzzy Logic


Following are the characteristics of fuzzy logic:

1. This concept is flexible, and we can easily understand and implement it.
2. It is used to help minimize the logic created by humans.
3. It is the best method for finding solutions to problems that are suitable for approximate or uncertain reasoning.
4. It offers many values between the two extremes (absolute truth and absolute falsehood) rather than only two possible answers to a problem or statement.
5. It allows users to build or create non-linear functions of arbitrary complexity.
6. In fuzzy logic, everything is a matter of degree.
7. In fuzzy logic, any logical system can be easily fuzzified.
8. It is based on natural language processing.
9. It is also used by quantitative analysts for improving the execution of their algorithms.
10. It also allows users to integrate it with programming.

Architecture of a Fuzzy Logic System


In the architecture of a Fuzzy Logic system, each component plays an important role. The architecture consists of the following four components:

1. Rule Base
2. Fuzzification
3. Inference Engine
4. Defuzzification


1. Rule Base
Rule Base is a component used for storing the set of rules and the If-Then conditions given by experts, which are used for controlling decision-making systems. Recently, there have been many updates in fuzzy theory that offer effective methods for designing and tuning fuzzy controllers. These updates or developments decrease the number of fuzzy rules required.

2. Fuzzification
Fuzzification is a module or component for transforming the system inputs, i.e., it converts crisp numbers into fuzzy sets. The crisp numbers are inputs measured by sensors, which fuzzification passes into the control system for further processing. This component divides the input signal into the following five states in any Fuzzy Logic system:

o Large Positive (LP)


o Medium Positive (MP)
o Small (S)
o Medium Negative (MN)
o Large negative (LN)
3. Inference Engine
This component is the main component in any Fuzzy Logic System (FLS), because all the information is processed in the Inference Engine. It allows users to find the matching degree between the current fuzzy input and the rules. Based on the matching degree, the system determines which rules are to be fired for the given input. When all rules are fired, they are combined to develop the control actions.

4. Defuzzification
Defuzzification is a module or component which takes the fuzzy set inputs generated by the Inference Engine and then transforms them into a crisp value. It is the last step in the process of a fuzzy logic system. The crisp value is the kind of value that is acceptable to the user. Various techniques are available to do this, and the user has to select the one that best reduces the errors; a small centroid-based sketch follows.
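As a minimal sketch (the sampled values below are assumed, not taken from any particular controller), one common defuzzification technique is the centroid (center-of-gravity) method, which converts a sampled fuzzy output set into a single crisp value:
# centroid (center-of-gravity) defuzzification over a sampled output universe
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]                        # candidate crisp outputs
mu = [0.0, 0.1, 0.3, 0.7, 1.0, 0.8, 0.5, 0.3, 0.1, 0.0, 0.0]   # aggregated membership degrees
crisp = sum(x * m for x, m in zip(xs, mu)) / sum(mu)  # membership-weighted average
print(round(crisp, 2))  # about 4.39 for these sample values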

Membership Function
The membership function is a function which represents the graph of a fuzzy set and allows users to quantify a linguistic term. It is a graph used for mapping each element of X to a value between 0 and 1.

This function is also known as indicator or characteristics function.

The membership function was introduced in the first papers on fuzzy sets by Zadeh. For a fuzzy set B, the membership function on X is defined as μB: X → [0,1]. In this function, each element of the set X is mapped to a value between 0 and 1, called the degree of membership or membership value.

Applications of Fuzzy Logic


Following are the different application areas where the Fuzzy Logic concept is
widely used:

1. It is used in Businesses for decision-making support system.


2. It is used in Automotive systems for controlling traffic and speed, and for improving the efficiency of automatic transmissions. Automotive systems also use the shift-scheduling method for automatic transmissions.
3. This concept is also used in Defence in various areas. Defence mainly uses fuzzy logic systems for underwater target recognition and the automatic target recognition of thermal infrared images.
4. It is also widely used in Pattern Recognition and Classification in the form of fuzzy-logic-based recognition and handwriting recognition. It is also used in fuzzy image searching.
5. Fuzzy logic systems are also used in Securities.
6. It is also used in microwave ovens for setting the power and cooking strategy.
7. This technique is also used in the area of modern control systems such as
expert systems.
8. Finance is also another application where this concept is used for predicting
the stock market, and for managing the funds.
9. It is also used for controlling the brakes.
10. It is also used in the chemical industry for controlling the pH and the chemical distillation process.
11.It is also used in the industries of manufacturing for the optimization of
milk and cheese production.
12.It is also used in the vacuum cleaners, and the timings of washing machines.
13.It is also used in heaters, air conditioners, and humidifiers.

Advantages of Fuzzy Logic


Fuzzy Logic has various advantages or benefits. Some of them are as follows:

1. The methodology of this concept works similarly to human reasoning.


2. Any user can easily understand the structure of Fuzzy Logic.
3. It does not need a large memory, because the algorithms can be easily
described with fewer data.
4. It is widely used in all fields of life and easily provides effective solutions to
the problems which have high complexity.
5. This concept is based on the set theory of mathematics, so that's why it is
simple.
6. It allows users to control machines and consumer products.
7. The development time of fuzzy logic is short as compared to conventional
methods.
8. Due to its flexibility, any user can easily add and delete rules in the FLS
system.
Disadvantages of Fuzzy Logic
Fuzzy Logic has various disadvantages or limitations. Some of them are as follows:

1. The run time of fuzzy logic systems is slow and takes a long time to produce
outputs.
2. Users can understand fuzzy logic systems easily only if the systems are simple.
3. The possibilities produced by the fuzzy logic system are not always accurate.
4. Many researchers give various ways for solving a given statement using this
technique which leads to ambiguity.
5. Fuzzy logics are not suitable for those problems that require high accuracy.
6. The systems of a Fuzzy logic need a lot of testing for verification and
validation.

Fuzzy Set
Classical set theory is a subset of fuzzy set theory. Fuzzy logic is based on this theory, which is a generalisation of the classical theory of sets (i.e., crisp sets) introduced by Zadeh in 1965.

A fuzzy set is a collection of values which exist between 0 and 1. Fuzzy sets are denoted or represented by the tilde (~) character. Fuzzy set theory was introduced in 1965 by Lotfi A. Zadeh and Dieter Klaua. In a fuzzy set, partial membership also exists. This theory was released as an extension of classical set theory.

Mathematically, a fuzzy set (Ã) is a pair (U, M), where U is the universe of discourse and M is the membership function, which takes on values in the interval [0, 1]. The universe of discourse (U) is also denoted by Ω or X.

Operations on Fuzzy Set


Given à and B are the two fuzzy sets, and X be the universe of discourse with the
following respective member functions:
The operations of Fuzzy set are as follows:

1. Union Operation: The union operation of a fuzzy set is defined by:

μA∪B(x) = max (μA(x), μB(x))

Example:

Let's suppose A is a set which contains following elements:

A = {( X1, 0.6 ), (X2, 0.2), (X3, 1), (X4, 0.4)}

And, B is a set which contains following elements:

B = {( X1, 0.1), (X2, 0.8), (X3, 0), (X4, 0.9)}

then,

AUB = {( X1, 0.6), (X2, 0.8), (X3, 1), (X4, 0.9)}

Because, according to this operation

For X1

μA∪B(X1)=max(μA(X1),μB(X1))
μA∪B(X1)=max(0.6,0.1)
μA∪B(X1) = 0.6

For X2

μA∪B(X2)=max(μA(X2),μB(X2))
μA∪B(X2)=max(0.2,0.8)
μA∪B(X2) = 0.8

For X3
μA∪B(X3)=max(μA(X3),μB(X3))
μA∪B(X3)=max(1,0)
μA∪B(X3) = 1

For X4

μA∪B(X4)=max(μA(X4),μB(X4))
μA∪B(X4)=max(0.4,0.9)
μA∪B(X4) = 0.9

2. Intersection Operation: The intersection operation of a fuzzy set is defined by:

μA∩B(x) = min (μA(x), μB(x))

Example:

Let's suppose A is a set which contains following elements:

A = {( X1, 0.3 ), (X2, 0.7), (X3, 0.5), (X4, 0.1)}

And, B is a set which contains following elements:

B = {( X1, 0.8), (X2, 0.2), (X3, 0.4), (X4, 0.9)}

then,

A∩B = {( X1, 0.3), (X2, 0.2), (X3, 0.4), (X4, 0.1)}

Because, according to this operation

For X1

μA∩B(X1)=min(μA(X1),μB(X1))
μA∩B(X1)=min(0.3,0.8)
μA∩B(X1) = 0.3

For X2
μA∩B(X2)=min(μA(X2),μB(X2))
μA∩B(X2)=min(0.7,0.2)
μA∩B(X2) = 0.2

For X3

μA∩B(X3)=min(μA(X3),μB(X3))
μA∩B(X3)=min(0.5,0.4)
μA∩B(X3) = 0.4

For X4

μA∩B(X4)=min(μA(X4),μB(X4))
μA∩B(X4)=min(0.1,0.9)
μA∩B(X4) = 0.1

3. Complement Operation: The complement operation of a fuzzy set is defined by:

μĀ(x) = 1-μA(x),

Example:

Let's suppose A is a set which contains following elements:

A = {( X1, 0.3 ), (X2, 0.8), (X3, 0.5), (X4, 0.1)}

then,

Ā= {( X1, 0.7 ), (X2, 0.2), (X3, 0.5), (X4, 0.9)}

Because, according to this operation

For X1

μĀ(X1)=1-μA(X1)
μĀ(X1)=1-0.3
μĀ(X1) = 0.7

For X2
μĀ(X2)=1-μA(X2)
μĀ(X2)=1-0.8
μĀ(X2) = 0.2

For X3

μĀ(X3)=1-μA(X3)
μĀ(X3)=1-0.5
μĀ(X3) = 0.5

For X4

μĀ(X4)=1-μA(X4)
μĀ(X4)=1-0.1
μĀ(X4) = 0.9
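As a minimal sketch (using plain Python dictionaries over the same universe of discourse to hold membership values), the three operations above can be implemented and checked against worked examples like these:
# fuzzy sets as dicts mapping elements to membership degrees in [0, 1]
A = {"X1": 0.6, "X2": 0.2, "X3": 1.0, "X4": 0.4}
B = {"X1": 0.1, "X2": 0.8, "X3": 0.0, "X4": 0.9}
def fuzzy_union(a, b):
    return {x: max(a[x], b[x]) for x in a}          # element-wise maximum
def fuzzy_intersection(a, b):
    return {x: min(a[x], b[x]) for x in a}          # element-wise minimum
def fuzzy_complement(a):
    return {x: round(1 - a[x], 2) for x in a}       # 1 minus each membership degree
print(fuzzy_union(A, B))         # {'X1': 0.6, 'X2': 0.8, 'X3': 1.0, 'X4': 0.9}
print(fuzzy_intersection(A, B))  # {'X1': 0.1, 'X2': 0.2, 'X3': 0.0, 'X4': 0.4}
print(fuzzy_complement(A))       # {'X1': 0.4, 'X2': 0.8, 'X3': 0.0, 'X4': 0.6}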
