Lecture - 2 - Data Mining Concepts
LECTURE 2
Representation of Data
Semi-structured Data
Unstructured Data
Road Map
2. Data to be mined
3. Knowledge to be discovered
Descriptive tasks present the general properties of the data stored in a database. They are used to find
patterns in the data, such as clusters, correlations, trends, and anomalies.
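As a minimal sketch of a descriptive task, the Python snippet below computes a correlation between two attributes and flags an anomalous value. The data values, the function names `pearson` and `anomalies`, and the z-score threshold of 2 are illustrative assumptions, not part of the lecture.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical data: hours studied and exam scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 73]

# Hypothetical daily purchase amounts containing one unusual value.
purchases = [20, 22, 19, 21, 23, 20, 95]

print("correlation:", round(pearson(hours, scores), 3))
print("anomalies:", anomalies(purchases))
```

Neither function predicts anything: both only describe what is already in the data, which is what separates a descriptive task from a predictive one.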
2. Predictive tasks: Predict the value of a specific attribute (the target or dependent variable) based on
the values of other attributes (the explanatory variables).
A predictive data mining task predicts the value of one attribute, known as the target or dependent
variable, from the values of other attributes. The attributes used for making the prediction are
known as independent variables.
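One way to make the roles of the variables concrete is a least-squares line fit, sketched below in Python. Linear regression is only one of many predictive techniques and is chosen here as an assumption for illustration; the house sizes, prices, and the helper names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

def predict(model, x):
    """Predict the target value for a new explanatory value x."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training data.
# Independent (explanatory) variable: house size in square metres.
sizes = [50, 60, 80, 100, 120]
# Dependent (target) variable: price in thousands.
prices = [150, 180, 240, 300, 360]

model = fit_line(sizes, prices)
print("predicted price for size 90:", predict(model, 90))
```

The model is learned from records where the target is known, then applied to a record where only the independent variable is available.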
Types of Variables/Attributes
Data can help us solve specific problems.
How should these pictures be placed into 3 groups?
How should these pictures be placed into groups? How many groups should there be?
Which genes are associated with a disease? How can expression values be used to predict survival?
What items should Amazon display for me?
Is it likely that this stock was traded based on illegal insider information?
Where are the faces in this picture?
Is this spam?
What techniques do people apply to data?
They apply data mining algorithms to discover useful knowledge.
Motif Discovery
Prediction
Visualization
Classification
Summary
Types of Data:
Transactional Data
Sequence Data
Interval Data
Time Series Data
Spatial Data
Spatio-Temporal Data
Data Set with Multiple Kinds of Data
…
Data Mining Algorithms
Methods:
Frequent Pattern Discovery
Clustering
Outlier Detection
Classification
Statistical Analysis
…
Activity 1 (complete by the 2nd class of this week)
Find the top 3 recent research activities around the world that analyze data. Write a short
summary of each research activity. The first three lines of each summary
must follow this format:
Line 1: The problem they are trying to solve, along with the dataset
they are using
Line 2: How they are solving the problem
Line 3: Your justification for rating this work as a top
activity
Remaining lines: anything else you consider relevant.
Related Fields
Statistics:
more theory-based
more focused on testing hypotheses
Machine learning:
more heuristic
focused on improving performance of a learning agent
also looks at real-time learning and robotics, areas not part of data mining
Clustering
Find “natural” groupings of instances given unlabeled data
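The slide does not name a specific clustering algorithm. As one common choice, a minimal k-means sketch on one-dimensional points is shown below in Python. The data points, the starting centers, and the function name `kmeans_1d` are assumptions made for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Basic k-means on 1-D points, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled data with three visible groups.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0, 20.0, 21.0]

# Starting centers are chosen by hand here to keep the run deterministic.
centers, clusters = kmeans_1d(points, centers=[0.0, 10.0, 25.0])
print("centers:", centers)
print("clusters:", clusters)
```

No labels are supplied at any point: the groups emerge only from the distances between instances. The number of clusters is still an input here, which is why "How many groups should there be?" is a separate question.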
Association Rules & Frequent Itemsets
Transactions:
TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL
Frequent Itemsets:
Milk, Bread (4)
Bread, Cereal (4)
Milk, Bread, Cereal (2)
…
Rules:
Milk => Bread (66%)
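The support counts and the rule confidence can be checked directly against the nine transactions. The Python sketch below counts itemsets by brute force, which is practical only for tiny data sets and is not a frequent-itemset algorithm such as Apriori. The function names `support_count`, `confidence`, and `frequent_itemsets` are hypothetical helpers.

```python
from collections import Counter
from itertools import combinations

# The nine transactions from the table, one set of items per TID.
transactions = [
    {"MILK", "BREAD", "EGGS"},
    {"BREAD", "SUGAR"},
    {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"},
    {"MILK", "CEREAL"},
    {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"},
    {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent."""
    both = support_count(set(antecedent) | set(consequent), transactions)
    return both / support_count(antecedent, transactions)

def frequent_itemsets(transactions, min_count, size):
    """All itemsets of the given size appearing in at least min_count transactions."""
    counts = Counter()
    for t in transactions:
        for combo in combinations(sorted(t), size):
            counts[combo] += 1
    return {combo: c for combo, c in counts.items() if c >= min_count}

print(support_count({"MILK", "BREAD"}, transactions))
print(support_count({"MILK", "BREAD", "CEREAL"}, transactions))
print(confidence({"MILK"}, {"BREAD"}, transactions))
print(frequent_itemsets(transactions, min_count=3, size=2))
```

Confidence for Milk => Bread is the number of transactions containing both Milk and Bread divided by the number containing Milk.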
Visualization & Data Mining
Visualizing the data to facilitate human discovery
Presenting the discovered results in a visually "nice" way
Summarization