Data mining 3
19. In this plot, the top of the bell marks the expected value (mean), which in
this case is zero, as we specified when creating the distribution.
20. T-Distribution –
It is named after William Sealy Gosset. The t-distribution generally arises
when we attempt to estimate the mean of a normal distribution from
different-sized samples. It is very helpful for describing the uncertainty or
error involved in estimating population statistics for data drawn from
Gaussian distributions when the sample size must be taken into account.
The t-distribution can be described using a single parameter.
Number of Degrees of Freedom:
It is denoted with the Greek lowercase letter nu (ν). It simply denotes the
number of degrees of freedom, which generally describes the number of
independent pieces of information used to estimate a population quantity.
Example –
The example below creates a t-distribution with a sample space from -5 to
5 and (10,000 − 1) degrees of freedom.
Python code for a line plot of the Student's t-distribution probability density
function:
# plot the t-distribution pdf
from numpy import arange
from matplotlib import pyplot
from scipy.stats import t
# define the distribution parameters
sample_space = arange(-5, 5, 0.001)
dof = len(sample_space) - 1
# calculate the pdf
pdf = t.pdf(sample_space, dof)
# plot
pyplot.plot(sample_space, pdf)
pyplot.show()
When we run the above example, it creates and plots the t-distribution PDF.
You can see a bell shape similar to the normal distribution. The main
difference is the fatter tails, which reflect the increased likelihood of
observations far out in the tails compared with a Gaussian distribution.
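The heavier tails can be checked numerically. The sketch below (the choice of 5 degrees of freedom is an illustrative assumption, not from the text) compares the PDF of a t-distribution against the standard normal at a point well out in the tail:

```python
# compare tail density of a t-distribution (few degrees of freedom)
# against the standard normal
from scipy.stats import t, norm

x = 3.0                    # a point three standard deviations out
t_tail = t.pdf(x, df=5)    # t-distribution with 5 degrees of freedom
normal_tail = norm.pdf(x)  # standard normal

# with few degrees of freedom the t-distribution puts noticeably
# more probability density in the tails than the Gaussian
print(t_tail > normal_tail)  # True
```

As the degrees of freedom grow, the t-distribution approaches the normal and this gap shrinks.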
Due to the exponential growth of data, especially in areas such as business, KDD
has become a very important process for converting this wealth of data into
business intelligence, as manual extraction of patterns has become practically
impossible over the past few decades.
For example, it is currently used for various applications such as social network
analysis, fraud detection, science, investment, manufacturing, telecommunications,
data cleaning, sports, information retrieval, and marketing. KDD is usually used to
answer questions such as: which main products might help obtain high profit next
year in V-Mart?
Data Mining is only a step within the overall KDD process. There are two major data
mining goals, defined by the application's purpose: verification and discovery.
Verification verifies the user's hypothesis about the data, while discovery
automatically finds interesting patterns.
There are four major data mining tasks: clustering, classification, regression, and
association (summarization). Clustering identifies similar groups in unstructured
data. Classification learns rules that can be applied to new data. Regression finds
functions that model the data with minimal error. Association looks for
relationships between variables. Next, a specific data mining algorithm is selected:
depending on the goal, this may be linear regression, logistic regression, decision
trees, Naive Bayes, and so on. Patterns of interest are then searched for in one or
more symbolic forms. Finally, models are evaluated on either predictive accuracy
or understandability.
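As a small illustration of the regression task, "finding functions with minimal error to model data", the sketch below fits a least-squares line with NumPy (toy data chosen for illustration, not from the text):

```python
# regression: fit a line with minimal squared error to the data
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0  # data generated by y = 2x + 1 (no noise)

# least-squares fit of a degree-1 polynomial: returns slope, intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 3), round(intercept, 3))  # 2.0 1.0
```

Model evaluation would then compare predictions against held-out data, which is the final KDD step described above.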
Most industries collect massive volumes of data, but without a mechanism that
filters and models it into graphs, charts, and trends, raw data itself has little use.
However, the sheer volume of data and the speed with which it is collected makes
sifting through it challenging. Thus, it has become economically and scientifically
necessary to scale up our analysis capability to handle the vast amount of data that
we now obtain.
Since computers have allowed humans to collect more data than we can process,
we naturally turn to computational techniques to help us extract meaningful
patterns and structures from vast amounts of data.
Difference between KDD and Data Mining
Although the two terms KDD and Data Mining are heavily used interchangeably,
they refer to two related yet slightly different concepts.
KDD is the overall process of extracting knowledge from data, while Data Mining is a
step inside the KDD process, which deals with identifying patterns in data.
Put differently, Data Mining is the application of a specific algorithm in service of
the overall goal of the KDD process.
Data mining is not an easy task: the algorithms involved can be very complex,
and the data is not always available in one place; it may need to be integrated
from various heterogeneous data sources. These factors create a number of
issues. In this tutorial, we will discuss the major issues regarding −
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues
The following diagram describes the major issues.
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
Mining different kinds of knowledge in databases − Different
users may be interested in different kinds of knowledge. Therefore, data
mining needs to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of
abstraction − The data mining process needs to be interactive
because it allows users to focus the search for patterns, providing and
refining data mining requests based on the returned results.
Incorporation of background knowledge − Background knowledge
can be used to guide the discovery process and to express the
discovered patterns, not only in concise terms but at multiple levels
of abstraction.
Data mining query languages and ad hoc data mining − A data
mining query language that allows the user to describe ad hoc mining
tasks should be integrated with a data warehouse query language and
optimized for efficient and flexible data mining.
Presentation and visualization of data mining results − Once
the patterns are discovered, they need to be expressed in high-level
languages and visual representations that are easy to understand.
Handling noisy or incomplete data − Data cleaning methods are
required to handle noise and incomplete objects while mining data
regularities. Without data cleaning, the accuracy of the discovered
patterns will be poor.
Pattern evaluation − The patterns discovered should be interesting;
patterns that merely represent common knowledge or lack novelty are of
little value, so interestingness must be evaluated.
Performance Issues
There can be performance-related issues such as the following −
Efficiency and scalability of data mining algorithms − To
effectively extract information from the huge amounts of data in
databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − Factors
such as the huge size of databases, wide distribution of data, and
complexity of data mining methods motivate the development of
parallel and distributed data mining algorithms. These algorithms
divide the data into partitions that are processed in parallel;
the results from the partitions are then merged. Incremental
algorithms update the mined knowledge as the database changes,
without mining all the data again from scratch.
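The partition-and-merge idea can be sketched with the standard library. In this minimal sketch, item counting stands in for a real mining step, and threads stand in for the processes or distributed workers a real system would use:

```python
# sketch of partitioned mining: "mine" each partition concurrently,
# then merge the partial results (item counting stands in for a real
# mining step; real systems would use processes or distributed workers)
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def mine_partition(partition):
    # mine one partition independently of the others
    return Counter(partition)

def mine(data, n_partitions=4):
    size = max(1, len(data) // n_partitions)
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    merged = Counter()
    with ThreadPoolExecutor() as pool:
        for partial in pool.map(mine_partition, partitions):
            merged += partial  # merge the results from the partitions
    return merged

transactions = ["milk", "bread", "milk", "eggs", "bread", "milk"]
print(mine(transactions))  # Counter({'milk': 3, 'bread': 2, 'eggs': 1})
```

Because each partition is mined independently, the same structure scales out to distributed workers; only the merge step needs to see all partial results.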
Diverse Data Types Issues
Handling of relational and complex types of data − A database
may contain complex data objects, multimedia data objects, spatial
data, temporal data, etc. It is not possible for one system to mine all
these kinds of data.
Mining information from heterogeneous databases and global
information systems − Data is available from different data sources
on a LAN or WAN. These sources may be structured, semi-structured,
or unstructured, so mining knowledge from them adds challenges to
data mining.
Fuzzy logic uses multiple logical values: the truth value of a variable or
proposition can be any number between 0 and 1. The concept was introduced
by Lotfi Zadeh in 1965, based on fuzzy set theory. It captures possibilities
that conventional computers do not offer but that resemble the range of
possibilities humans reason with.
In a Boolean system, only two possibilities (0 and 1) exist, where 1 denotes the
absolute truth value and 0 denotes the absolute false value. In a fuzzy
system, there are multiple possibilities between 0 and 1, which are
partially false and partially true.
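The contrast fits in a few lines of Python. The min/max operators below are the standard Zadeh convention for fuzzy AND/OR (one common choice among several):

```python
# Boolean truth is 0 or 1; fuzzy truth is any degree in [0, 1].
# min and max are the standard Zadeh operators for fuzzy AND / OR.
def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

# "the glass is fairly full" (0.7) AND "the water is warm" (0.4)
print(fuzzy_and(0.7, 0.4))  # 0.4 -> partially true
print(fuzzy_or(0.7, 0.4))   # 0.7
```

With inputs restricted to 0 and 1, these operators reduce exactly to Boolean AND and OR, which is why Boolean logic is a special case of fuzzy logic.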
1. The concept is flexible, and we can easily understand and implement it.
2. It helps to minimize the logic created by humans.
3. It is a good method for solving problems that call for approximate or
uncertain reasoning.
4. It always offers two values, which denote the two possible solutions for a
problem or statement.
5. It allows users to build nonlinear functions of arbitrary complexity.
6. In fuzzy logic, everything is a matter of degree.
7. Any logical system can easily be fuzzified.
8. It is based on natural language.
9. It is also used by quantitative analysts to improve the execution of their
algorithms.
10. It also allows users to integrate it with programming.
1. Rule Base
2. Fuzzification
3. Inference Engine
4. Defuzzification
2. Fuzzification
Fuzzification is the module or component that transforms the system inputs, i.e.,
it converts crisp numbers into fuzzy sets. The crisp numbers are the inputs
measured by sensors, which fuzzification then passes to the control system for
further processing. In a typical fuzzy logic system, this component divides the
input signal into the following five states: Large Positive (LP), Medium
Positive (MP), Small (S), Medium Negative (MN), and Large Negative (LN).
4. Defuzzification
Defuzzification is the module or component that takes the fuzzy set inputs
generated by the inference engine and transforms them into a crisp value.
It is the last step in a fuzzy logic system. The crisp value is the kind of output
that is usable by the user. Various techniques exist for this step, and the user
has to select the one that best reduces the error.
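One widely used technique, assumed here as an illustration (the text leaves the choice to the user), is the centroid method: the crisp output is the membership-weighted average of the sampled values.

```python
# centroid defuzzification over a sampled fuzzy set:
# crisp value = sum(x * mu(x)) / sum(mu(x))
def defuzzify_centroid(xs, memberships):
    num = sum(x * m for x, m in zip(xs, memberships))
    den = sum(memberships)
    return num / den

# a symmetric fuzzy set peaking at x = 5
xs = [3, 4, 5, 6, 7]
mu = [0.2, 0.6, 1.0, 0.6, 0.2]
print(round(defuzzify_centroid(xs, mu), 6))  # 5.0
```

For a symmetric fuzzy set, the centroid lands on the peak; skewed membership values pull the crisp output toward the heavier side.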
Membership Function
The membership function is a function which represents the graph of a fuzzy set
and allows users to quantify linguistic terms. It is a graph used for mapping
each element x to a value between 0 and 1.
The membership function was introduced in Zadeh's first papers on fuzzy sets.
For a fuzzy set B, the membership function on X is defined as: μB: X →
[0,1]. This function maps each element of X to a value between 0 and 1,
called its degree of membership or membership value.
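Any mapping into [0, 1] can serve as a membership function; the triangular shape below is a common choice (an assumed example, not taken from the text):

```python
# a triangular membership function mapping each x to a degree in [0, 1]
def triangular(x, a, b, c):
    """mu(x) rises linearly from a to the peak at b, then falls to c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# degrees of membership in the fuzzy set "around 5" (a=2, b=5, c=8)
for x in [2, 3.5, 5, 6.5, 8]:
    print(x, triangular(x, 2, 5, 8))
```

The peak b has membership 1 (full membership), the endpoints a and c have membership 0, and values in between belong to the set only partially.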
1. The run time of fuzzy logic systems is slow; they can take a long time to
produce outputs.
2. Users can understand them easily only if they are simple.
3. The possibilities produced by a fuzzy logic system are not always accurate.
4. Different researchers may propose different ways of solving the same
statement with this technique, which leads to ambiguity.
5. Fuzzy logic is not suitable for problems that require high accuracy.
6. Fuzzy logic systems need a lot of testing for verification and
validation.
Fuzzy Set
Classical set theory is a subset of fuzzy set theory. Fuzzy logic is based on
fuzzy set theory, which is a generalisation of the classical theory of sets (i.e.,
crisp sets) introduced by Zadeh in 1965.
A fuzzy set is a collection of elements whose membership values lie between 0
and 1. Fuzzy sets are denoted or represented with the tilde (~) character. Fuzzy
set theory was introduced in 1965 by Lotfi A. Zadeh and Dieter Klaua. In a fuzzy
set, partial membership exists. The theory was released as an extension of
classical set theory.
Mathematically, a fuzzy set (Ã) is a pair of U and M, where U is the universe
of discourse and M is the membership function, which takes values in the
interval [0, 1]. The universe of discourse (U) is also denoted by Ω or X.
Example: let A and B be fuzzy sets over {X1, X2, X3, X4} with membership values
μA = (0.6, 0.2, 1, 0.4) and μB = (0.1, 0.8, 0, 0.9). The union A ∪ B takes the
element-wise maximum; then,
For X1
μA∪B(X1)=max(μA(X1),μB(X1))
μA∪B(X1)=max(0.6,0.1)
μA∪B(X1) = 0.6
For X2
μA∪B(X2)=max(μA(X2),μB(X2))
μA∪B(X2)=max(0.2,0.8)
μA∪B(X2) = 0.8
For X3
μA∪B(X3)=max(μA(X3),μB(X3))
μA∪B(X3)=max(1,0)
μA∪B(X3) = 1
For X4
μA∪B(X4)=max(μA(X4),μB(X4))
μA∪B(X4)=max(0.4,0.9)
μA∪B(X4) = 0.9
Example: let A and B have membership values μA = (0.3, 0.7, 0.5, 0.1) and
μB = (0.8, 0.2, 0.4, 0.9). The intersection A ∩ B takes the element-wise
minimum; then,
For X1
μA∩B(X1)=min(μA(X1),μB(X1))
μA∩B(X1)=min(0.3,0.8)
μA∩B(X1) = 0.3
For X2
μA∩B(X2)=min(μA(X2),μB(X2))
μA∩B(X2)=min(0.7,0.2)
μA∩B(X2) = 0.2
For X3
μA∩B(X3)=min(μA(X3),μB(X3))
μA∩B(X3)=min(0.5,0.4)
μA∩B(X3) = 0.4
For X4
μA∩B(X4)=min(μA(X4),μB(X4))
μA∩B(X4)=min(0.1,0.9)
μA∩B(X4) = 0.1
The complement Ā of a fuzzy set A is defined by μĀ(x) = 1-μA(x).
Example: let A have membership values μA = (0.3, 0.8, 0.5, 0.1); then,
For X1
μĀ(X1)=1-μA(X1)
μĀ(X1)=1-0.3
μĀ(X1) = 0.7
For X2
μĀ(X2)=1-μA(X2)
μĀ(X2)=1-0.8
μĀ(X2) = 0.2
For X3
μĀ(X3)=1-μA(X3)
μĀ(X3)=1-0.5
μĀ(X3) = 0.5
For X4
μĀ(X4)=1-μA(X4)
μĀ(X4)=1-0.1
μĀ(X4) = 0.9
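The worked union, intersection, and complement examples above can be reproduced with a short sketch; the dictionaries restate the membership values from those examples.

```python
# fuzzy set operations on membership values, matching the worked
# examples above: union = max, intersection = min, complement = 1 - mu
def fuzzy_union(a, b):
    return {x: max(a[x], b[x]) for x in a}

def fuzzy_intersection(a, b):
    return {x: min(a[x], b[x]) for x in a}

def fuzzy_complement(a):
    return {x: round(1 - a[x], 10) for x in a}  # round away float noise

# membership values from the union example
A = {"X1": 0.6, "X2": 0.2, "X3": 1.0, "X4": 0.4}
B = {"X1": 0.1, "X2": 0.8, "X3": 0.0, "X4": 0.9}
print(fuzzy_union(A, B))         # {'X1': 0.6, 'X2': 0.8, 'X3': 1.0, 'X4': 0.9}

# membership values from the intersection example
C = {"X1": 0.3, "X2": 0.7, "X3": 0.5, "X4": 0.1}
D = {"X1": 0.8, "X2": 0.2, "X3": 0.4, "X4": 0.9}
print(fuzzy_intersection(C, D))  # {'X1': 0.3, 'X2': 0.2, 'X3': 0.4, 'X4': 0.1}

# membership values from the complement example
E = {"X1": 0.3, "X2": 0.8, "X3": 0.5, "X4": 0.1}
print(fuzzy_complement(E))       # {'X1': 0.7, 'X2': 0.2, 'X3': 0.5, 'X4': 0.9}
```

Each printed dictionary matches the element-by-element results computed by hand above.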