Data mining module - New
Required Texts:
1. Text Book: Data Mining: Concepts and Techniques, Jiawei Han and Micheline
Kamber, Second Edition, Morgan Kaufmann Publishers, Elsevier
2. M. H. Dunham, 2003, Data Mining: Introductory and Advanced Topics, Pearson
Education, Delhi.
Summary of Teaching Learning Methods:
Policies
Attendance: It is compulsory to come to class on time, every time. If you are going to
miss more than a couple of classes during the term, you should not take this course.
Assignments: You should submit individual and group assignments by the due date; late
assignments will not be accepted.
Tests/Quizzes: You should take all quizzes and assignments as scheduled. If you miss a quiz
or assignment without a valid reason, no make-up will be arranged for you.
Cheating/plagiarism: You must do your own work; do not copy or obtain answers from
someone else.
Student Workload: Taking into consideration that 1 ECTS accounts for 27 hours of student
work, the course Introduction to Data Mining and Warehousing requires 5 * 27 hr = 135 hours
of student workload.
Grading policies
Student grade and performance will be evaluated based on the following activities:
Tests (30%) + Lab exam (10%) + Quizzes (10%) + Attendance (5%) + Assignment (5%)
+ Final Exam (40%) = Total (100%).
Data mining refers to extracting or mining knowledge from large amounts of data. The term
is actually a misnomer; data mining should more appropriately have been named
knowledge mining, which emphasizes mining knowledge from large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods at
the intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and
transform it into an understandable structure for further use.
Data mining derives its name from the similarities between searching for valuable business
information in a large database — for example, finding linked products in gigabytes of store
scanner data — and mining a mountain for a vein of valuable ore. Both processes require
either sifting through an immense amount of material, or intelligently probing it to find
exactly where the value resides. Given databases of sufficient size and quality, data mining
technology can generate new business opportunities by providing these capabilities:
Automated prediction of trends and behaviours: Data mining automates the process of
finding predictive information in large databases. Questions that traditionally required
extensive hands-on analysis can now be answered directly from the data — quickly. A typical
example of a predictive problem is targeted marketing. Data mining uses data on past
promotional mailings to identify the targets most likely to maximize return on investment in
future mailings. Other predictive problems include forecasting bankruptcy and other forms of
default, and identifying segments of a population likely to respond similarly to given events.
Automated discovery of previously unknown patterns: Data mining tools sweep through
databases and identify previously hidden patterns in one step. An example of pattern
discovery is the analysis of retail sales data to identify seemingly unrelated products that are
often purchased together. Other pattern discovery problems include detecting fraudulent
credit card transactions and identifying anomalous data that could represent data entry keying
errors.
Clustering – is the task of discovering groups and structures in the data that are in some way
or another "similar", without using known structures in the data.
Classification – is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
Regression – is a data mining function that is used to determine the relationships between the
dependent variable (target field) and one or more independent variables. The dependent
variable is the one whose values you want to predict, whereas the independent variables are
the variables on which your prediction is based.
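To make these three tasks concrete, here is a minimal sketch using scikit-learn; the tiny customer dataset, attribute values, and model choices (k-means for clustering, Gaussian naive Bayes for classification, linear regression for regression) are illustrative assumptions, not part of the module text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB

# Toy data: each row is a customer described by two numeric attributes
# (age, monthly spend). The values are invented for illustration.
X = np.array([[25, 1200], [30, 1500], [45, 300],
              [50, 250], [28, 1100], [47, 280]], dtype=float)
y_class = np.array([1, 1, 0, 0, 1, 0])   # e.g. 1 = "responds to mailing", 0 = "does not"
y_value = np.array([310.0, 420.0, 80.0, 60.0, 290.0, 75.0])  # e.g. yearly spend to predict

# Clustering: discover groups without using the known labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)

# Classification: generalize known labels to new data.
clf = GaussianNB().fit(X, y_class)
print(clf.predict([[33.0, 1000.0]]))     # predicted class for a new customer

# Regression: predict the dependent variable from the independent variables.
reg = LinearRegression().fit(X, y_value)
print(reg.predict([[33.0, 1000.0]]))     # predicted yearly spend
```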
1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the interestingness
of resulting patterns. Such knowledge can include concept hierarchies, used to organize
attributes or attribute values into different levels of abstraction.
2. Data Mining Engine:
This is essential to the data mining system and ideally consists of a set of functional modules
for tasks such as characterization, association and correlation analysis, classification,
prediction, cluster analysis, outlier analysis, and evolution analysis.
3. Pattern Evaluation Module:
[Figure: Data mining as a confluence of multiple disciplines – Statistics, Machine Learning,
Visualization, and other disciplines.]
Tier-1: The bottom tier is a data warehouse database server that is almost always a relational
database system. Back-end tools and utilities are used to feed data into the bottom tier from
operational databases or other external sources (such as customer profile information).

Tier-2: The middle tier is an OLAP server, for example a multidimensional OLAP (MOLAP)
model, that is, a special-purpose server that directly implements multidimensional data and
operations.

Tier-3: The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

1.9.3 Data Warehouse Models:
There are three data warehouse models.
1. Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning the entire
organization. It provides corporate-wide data integration.
2.5.1 Approaches for Mining Multilevel Association Rules:
1. Uniform Minimum Support: The same minimum support threshold is used when mining at
each level of abstraction.
2. Reduced Minimum Support: Each level of abstraction has its own minimum support
threshold. The deeper the level of abstraction, the smaller the corresponding threshold is.
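As a small illustration of the two approaches, the plain-Python sketch below counts item supports at two levels of an assumed concept hierarchy (specific milks generalized to "milk"); the transactions, hierarchy, and thresholds are invented for illustration.

```python
from collections import Counter

# Invented market-basket transactions at the most specific (deepest) level.
transactions = [
    {"skim milk", "bread"},
    {"2% milk", "bread"},
    {"whole milk", "butter"},
    {"bread", "butter"},
]

# Assumed concept hierarchy: specific items generalize to a parent concept.
hierarchy = {"skim milk": "milk", "2% milk": "milk", "whole milk": "milk",
             "bread": "bread", "butter": "butter"}
level1 = [{hierarchy[item] for item in t} for t in transactions]  # higher level of abstraction

def frequent_items(baskets, min_support):
    """Return items whose support (fraction of baskets) meets min_support."""
    counts = Counter(item for b in baskets for item in b)
    n = len(baskets)
    return {item: c / n for item, c in counts.items() if c / n >= min_support}

# Uniform minimum support: the same threshold (0.5) at both levels.
print(frequent_items(level1, 0.5))        # "milk" and "bread" are frequent at the high level
print(frequent_items(transactions, 0.5))  # no specific milk reaches 0.5 at the deep level

# Reduced minimum support: a smaller threshold (0.25) at the deeper level.
print(frequent_items(transactions, 0.25)) # the specific milks now qualify as frequent
```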
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The
classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 <= j <= m, j != i.
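A minimal sketch of this decision rule, assuming categorical attributes, an invented training set, and the usual naive (conditional-independence) estimate of P(X|Ci):

```python
from collections import defaultdict

# Invented training data: (attribute tuple, class label).
training = [
    (("young", "high"), "buys"),
    (("young", "low"), "buys"),
    (("old", "low"), "no"),
    (("old", "high"), "no"),
    (("young", "high"), "buys"),
]

# Estimate the priors P(Ci) and the per-attribute conditionals P(xk|Ci) by counting.
class_counts = defaultdict(int)
attr_counts = defaultdict(int)   # key: (class, attribute index, attribute value)
for attrs, label in training:
    class_counts[label] += 1
    for k, v in enumerate(attrs):
        attr_counts[(label, k, v)] += 1

def predict(x):
    """Return the class Ci that maximizes P(X|Ci)P(Ci) for tuple x."""
    n = len(training)
    best_class, best_score = None, -1.0
    for ci, c_count in class_counts.items():
        score = c_count / n                    # P(Ci)
        for k, v in enumerate(x):              # P(X|Ci) = product of P(xk|Ci)
            score *= attr_counts[(ci, k, v)] / c_count
        if score > best_score:
            best_class, best_score = ci, score
    return best_class

print(predict(("young", "low")))   # expected: "buys"
```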
3.4 A Multilayer Feed-Forward Neural Network:
The back propagation algorithm performs learning on a multilayer feed-forward neural
network. It iteratively learns a set of weights for the prediction of the class label of tuples. A
multilayer feed-forward neural network consists of an input layer, one or more hidden layers,
and an output layer.

The inputs to the network correspond to the attributes measured for each training tuple. The
inputs are fed simultaneously into the units making up the input layer. These inputs pass
through the input layer and are then weighted and fed simultaneously to a second layer known
as a hidden layer. The outputs of the hidden layer units can be input to another hidden layer,
and so on. The number of hidden layers is arbitrary. The weighted outputs of the last hidden
layer are input to units making up the output layer, which emits the network's prediction for
given tuples.
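A minimal numpy sketch of the forward pass just described, assuming one hidden layer, sigmoid units, and randomly initialized weights; the layer sizes and activation function are illustrative choices, not specified in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.2, 0.7, 0.1])                    # one training tuple: 3 attribute values

W1 = rng.normal(size=(3, 4)); b1 = np.zeros(4)   # weights: input layer -> hidden layer (4 units)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # weights: hidden layer -> output layer (1 unit)

hidden = sigmoid(x @ W1 + b1)    # inputs are weighted and fed into the hidden layer
output = sigmoid(hidden @ W2 + b2)  # weighted hidden outputs feed the output layer
print(output)                    # the network's prediction for the given tuple
```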
3.4.1 Classification by Backpropagation:
Backpropagation is a neural network learning algorithm. A neural network is a set of
connected input/output units in which each connection has a weight associated with it.
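A minimal sketch of how such connection weights could be learned by backpropagation, assuming a single training tuple, sigmoid units, and squared-error loss; the learning rate, layer sizes, and number of iterations are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x, target = np.array([0.2, 0.7, 0.1]), np.array([1.0])   # invented tuple and target label
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)             # input -> hidden weights and biases
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)             # hidden -> output weights and biases
lr = 0.5                                                   # illustrative learning rate

for step in range(100):
    # Forward pass: compute the network's prediction.
    hidden = sigmoid(x @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Backward pass: propagate the error from the output layer to the hidden layer.
    err_out = (output - target) * output * (1 - output)
    err_hid = (err_out @ W2.T) * hidden * (1 - hidden)

    # Adjust each connection weight (and bias) along the negative gradient.
    W2 -= lr * np.outer(hidden, err_out); b2 -= lr * err_out
    W1 -= lr * np.outer(x, err_hid);      b1 -= lr * err_hid

print(output)   # the prediction moves toward the target as the weights are updated
```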