2020 - UNIT 2 Chapter 1

The document provides an introduction to data mining. It discusses why data mining is needed due to the vast amounts of data being generated daily from various sources. It defines data mining as the process of extracting useful patterns from large data sets. The document outlines different chapters that will explore the data mining process in more detail, including understanding data, statistical analysis, visualization, and measuring similarity. It provides examples of common data sources for mining like databases, data warehouses, and transactions. The document also discusses the types of patterns that can be discovered through descriptive and predictive mining tasks.
UNIT - II

Introduction to Data Mining


Outline
 Chapter 1: Introduction to Data Mining
– 1.1 Why Data Mining,
– 1.2 What is Data Mining,
– 1.3 What Kinds of data can be Mined,
– 1.4 What Kinds of patterns can be Mined,
– 1.5 Which Technologies are used,
– 1.6 Which kinds of Applications are Targeted
– 1.7 Major issues in Data Mining.
 Chapter 2: Getting to Know Your Data
– 2.1 Data Objects and Attribute Types,
– 2.2 Basic Statistical Description of Data,
– 2.3 Data Visualization,
– 2.4 Measuring Data Similarity and Dissimilarity.

2
1.1 Why Data Mining

 We live in a world where vast amounts of data are collected daily.

 Terabytes or petabytes of data pour into our computer networks, the WWW, and various data storage devices every day from
– business, society, science and engineering, medicine, and almost every other aspect of daily life.

 This explosive growth of data volume is a result of computerization and the fast development of powerful data collection and storage tools.

3
1.1 Why Data Mining

 Businesses worldwide generate gigantic data sets, including
– sales transactions,
– stock trading records,
– product descriptions,
– sales promotions,
– performance, and customer feedback.

4
1.1 Why Data Mining

 Scientific and engineering practices generate high orders of petabytes of data in a continuous manner, from
– remote sensing,
– process measuring,
– scientific experiments,
– system performance,
– environment surveillance.

5
1.1 Why Data Mining

 The medical and health industry generates tremendous amounts of data from medical records, patient monitoring, and medical imaging.
 Billions of Web searches supported by search engines process tens of petabytes of data daily.
 Social media have become important data sources, producing digital pictures and videos, blogs, Web communities, and various kinds of social networks.
 The list of sources that generate huge amounts of data is endless.

6
1.1 Why Data Mining

 Analyzing such data is an important need.
 Powerful and versatile tools are needed to automatically uncover valuable information from the tremendous amount of data and to transform such data into organized knowledge.
 This necessity has led to the birth of data mining.

7
1.1 Why Data Mining
 Extracting useful information is extremely challenging.

 Traditional data analysis tools and techniques cannot be used because of the
massive size of a data set

 Additional data analysis tools are required for in-depth analysis, such as data
classification, clustering, and the characterization of data that changes over
time

 Data mining is a technology that blends traditional data analysis methods with
sophisticated algorithms for processing large volumes of data.

 Note: Data Mining as Evolution of Information Technology (Ignore)

8
1.2 What Is Data Mining?

 Data mining is an interdisciplinary subject and can be defined in many different ways.
 Simply stated, data mining refers to extracting or “mining” knowledge from large amounts of data.
 Data mining is the process of uncovering/discovering hidden and potentially useful information/patterns from a database (data warehouse).

9
1.2 What Is Data Mining?: Data Mining Process

 Data mining turns a large collection of data into information and then into knowledge

10
1.2 What Is Data Mining? : What is Data ?

11
1.2 What Is Data Mining?: What is Information ?

12
 What is Knowledge?
– Patterns of relationships in data and information that exhibit a high degree of certainty

13
1.2 What Is Data Mining?: Knowledge Discovery from Data(KDD)

 Many treat data mining as a synonym for another popularly used term, knowledge discovery from data (KDD)
– KDD is the process of discovering useful knowledge from a collection of data.
 Alternatively, KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process.

14
1.2 What Is Data Mining?: KDD Process
The knowledge discovery process is an iterative sequence of steps:
1. Data Cleaning: Remove noise and inconsistent data
2. Data Integration: Integrate (compile) multiple data sources
3. Data Selection: Data relevant to the analysis is selected
4. Data Transformation: Summary, normalization, and aggregation operations are performed
5. Data Mining: Intelligent methods are applied to data to discover knowledge or patterns
6. Pattern Evaluation: Interesting patterns representing knowledge are identified based on a threshold
7. Knowledge Presentation: Visualization and knowledge representation techniques are used to present mined knowledge to users
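The first five steps can be sketched as a chain of small functions. This is only an illustration on made-up records: the names `source_a`, `source_b`, and every function below are hypothetical, and a fixed threshold stands in for a real mining method. Pattern evaluation and presentation are left out.

```python
# Toy records from two hypothetical sources; None marks a missing value.
source_a = [{"id": 1, "amount": 120.0}, {"id": 2, "amount": None}]
source_b = [{"id": 3, "amount": 80.0}, {"id": 4, "amount": 200.0}]

def clean(records):
    """Step 1: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Step 2: combine several sources into one list."""
    return [r for src in sources for r in src]

def select(records):
    """Step 3: keep only the attribute relevant to the analysis."""
    return [r["amount"] for r in records]

def transform(values):
    """Step 4: min-max normalise the values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def mine(values, threshold=0.5):
    """Step 5: stand-in for a mining method; keeps values above a threshold."""
    return [v for v in values if v > threshold]

def run_kdd():
    records = integrate(clean(source_a), clean(source_b))
    return mine(transform(select(records)))
```

In practice the steps are revisited iteratively rather than run once in a straight line.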

15
1.2 What Is Data Mining?: KDD Process

16
1.3 What Kinds of Data can be Mined?

 DM can be applied to any kind of data as long as the data are meaningful for a target
application.

 However, algorithms and approaches may differ when applied to different types of data.

 Most basic forms of data for mining applications are


– Database Data
– Data Warehouse Data
– Transactional Data
– Other Kinds of Data

17
1.3 What Kinds of Data can be Mined?: Database Data

 Data that consists of a collection of records, each of which consists of a fixed set of attributes
 Traditional queries: List of books
 DM queries: Find credit risk, loan profile

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

18
1.3 What Kinds of Data can be Mined?:Data Warehouse Data

 A company has branches all over the world, and each branch has its own database.
 A DWH (data warehouse) is a repository of information from multiple sources.
 OLAP queries use drill-down and roll-up
 Ex: Drill down on sales data summarized by quarter to see data summarized by month
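The quarter and month example can be illustrated with a small sketch. The sales figures and the function names `roll_up_to_quarters` and `drill_down` are made up for illustration; a real OLAP system would run such operations over a data cube rather than a dictionary.

```python
from collections import defaultdict

# Hypothetical monthly sales figures (month number -> amount).
monthly_sales = {1: 100, 2: 120, 3: 90, 4: 150, 5: 130, 6: 110}

def quarter_of(month):
    """Map a month number (1-12) to its quarter (1-4)."""
    return (month - 1) // 3 + 1

def roll_up_to_quarters(monthly):
    """Roll-up: summarise monthly figures at the coarser quarter level."""
    quarterly = defaultdict(int)
    for month, amount in monthly.items():
        quarterly[quarter_of(month)] += amount
    return dict(quarterly)

def drill_down(monthly, quarter):
    """Drill-down: show the monthly figures behind one quarter."""
    return {m: a for m, a in monthly.items() if quarter_of(m) == quarter}
```

Rolling up moves to a coarser level of the time hierarchy; drilling down returns to the finer one.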

19
1.3 What Kinds of Data can be Mined?: Transactional Data

The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

Transaction  Items
T1           Bread, Jelly, Jam
T2           Bread, Jam
T3           Bread, Milk, Jam
T4           Coffee, Bread
T5           Milk, Coffee

 Queries: Which items were sold together?

20
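A query such as "which items were sold together" can be approximated by counting how often each pair of items appears in the same transaction. The sketch below uses the five transactions from this slide with item names lower-cased; `pair_counts` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations

transactions = {
    "T1": {"bread", "jelly", "jam"},
    "T2": {"bread", "jam"},
    "T3": {"bread", "milk", "jam"},
    "T4": {"coffee", "bread"},
    "T5": {"milk", "coffee"},
}

def pair_counts(transactions):
    """Count, for every pair of items, the transactions containing both."""
    counts = Counter()
    for items in transactions.values():
        # sorted() gives each pair a single canonical ordering
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
most_common_pair, frequency = counts.most_common(1)[0]
```

Here bread and jam come out as the most frequent pair, appearing together in T1, T2, and T3.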


1.3 What Kinds of Data can be Mined?: (Other) Time-related or sequence data
Database that contains data for each point in time, e.g., weather data, stock market data

Queries: Next month rain prediction in Karnataka

21


1.3 What Kinds of Data can be Mined? (Other)

 Data mining can also be applied to other forms of data
– Time-series data, temporal data (recorded at regular times)
– Data streams (video surveillance and sensor data)
– Spatial data (maps)
– Hypertext and multimedia data (including text, image, video, and audio)
– Graph and networked data (social and information networks)

22
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

 Kinds of patterns that can be mined are
– descriptive (what happened: effective visualization, comprehensive, accurate)
• characterize properties of the data in a target data set.
– predictive (what will happen: future)
• perform induction on the current data in order to make predictions.

23
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

– Classification [Predictive]
– Regression [Predictive]
– Outlier Analysis [Predictive]
– Clustering [Descriptive]
– Association Rule Discovery [Descriptive]
– Sequential Pattern Discovery [Descriptive]

24
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

25
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

 Classification [Predictive]
– Classification is the process of
• finding a model (or function)
• that describes and distinguishes data classes or concepts,
• maps data into predefined classes or groups
– The derived model is based on the analysis of a set of training
data (i.e., data objects whose class label is known).

26
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Classification

 The derived model may be represented in various forms, such as
– IF – THEN rules
– Decision trees
– Neural networks

27
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Classification

 IF – THEN
– IF (Attendance = 75) AND (IA=10) THEN class= ‘F’
– IF (Attendance = 85) AND (IA=25) THEN class = ‘D’
– IF (Attendance = 75) AND (IA=45) THEN class = ‘S’

28
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks
Classification
 Decision Tree
– A decision tree is a flow-chart-like tree structure,
– where each node denotes a test on an attribute value,
– each branch represents an outcome of the test, and
– tree leaves represent classes or class distributions.
– Decision trees can easily be converted to classification rules.

Attendance
– > 75: test IA
• IA 45 - 50: class S
• IA 38 - 44: class A
• IA 0 - 24: class F
– < 75: class F

29
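A decision tree converts directly into nested tests. The sketch below assumes the tree on this slide reads as follows: Attendance below 75 gives class F; Attendance above 75 leads to a test on IA, where 45-50 gives S, 38-44 gives A, and 0-24 gives F. The function name `classify` is hypothetical. The slide does not show a branch for Attendance equal to 75 or for IA between 25 and 37, so those cases return `None` rather than a guessed class.

```python
def classify(attendance, ia):
    """Follow the tree from the root (Attendance) down to a leaf class."""
    if attendance < 75:
        return "F"
    if attendance > 75:
        if 45 <= ia <= 50:
            return "S"
        if 38 <= ia <= 44:
            return "A"
        if 0 <= ia <= 24:
            return "F"
    # Branch not shown on the slide: no class is assigned.
    return None
```

Each path from the root to a leaf corresponds to one IF – THEN rule, for example IF (Attendance > 75) AND (IA between 45 and 50) THEN class = 'S'.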
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Outlier Analysis (Deviation Detection)

 The goal of deviation detection is to detect significant deviations from normal behavior
 Some data objects do not comply with the general behavior or model of the data. Data objects that are different from or inconsistent with the remaining set are called outliers

 Application
– Credit card fraud detection
– Telecom fraud detection
– Medical analysis
30
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Outlier Analysis (Deviation Detection)

 Goal : To detect fraudulent credit card transaction


 Approach
– Based on past usage patterns, develop model for authorized credit
card transactions
– Check for deviation from model, before authenticating new credit
transactions
– Hold payment and verify authenticity of doubtful transaction by
other means (phone, sms, email)
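One simple way to realise "check for deviation from the model" is a z-score test against a cardholder's past transaction amounts. This is a swapped-in illustrative technique, not a method stated on the slide; the function name `is_suspicious`, the threshold of 3 standard deviations, and the sample history are all assumptions.

```python
from statistics import mean, stdev

def is_suspicious(past_amounts, new_amount, z_threshold=3.0):
    """Flag a transaction whose amount lies far from past behaviour."""
    mu = mean(past_amounts)
    sigma = stdev(past_amounts)
    if sigma == 0:
        # All past amounts are identical: any different amount deviates.
        return new_amount != mu
    return abs(new_amount - mu) / sigma > z_threshold

# Hypothetical usage history for one card.
history = [42.0, 55.0, 38.0, 61.0, 47.0, 52.0]
```

A flagged transaction would then be held and verified by other means, as the approach above describes. Real fraud models use many more attributes than the amount alone.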

31
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks
Clustering
 Clustering is a process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.
 Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are similar to one another and collectively should be treated as a group
– As a collection, they are sufficiently different from other groups
 Clustering is unsupervised classification: there are no predefined classes.

32
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Clustering

 Similarity measures
– Euclidean distance
 Types of clustering
– Group based clustering
– Hierarchical clustering
 Application
– Market segmentation
– Document clustering

33
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Clustering - Market segmentation


 Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers where any subset may
conceivably be selected as a market target to be reached with a distinct marketing
mix.
– Approach:
• Collect different attributes of customers based on their geographical and lifestyle
related information.

• Find clusters of similar customers.

• Measure the clustering quality by observing buying patterns of customers in same


cluster vs. those from different clusters.

34
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks
Association Rule Discovery
 Given a set of transactions, each of which contains some number of items from a given collection
– Produce dependency rules which will predict the occurrence of an item based on the occurrences of other items in the transaction

Transaction  Items
T1           Bread, Jelly, Jam
T2           Bread, Jam
T3           Bread, Milk, Jam
T4           Coffee, Bread
T5           Milk, Coffee

Rules Discovered:
{bread} --> {jam}
{jelly} --> {bread}
{jelly} --> {jam}
{jelly} --> {milk}
35
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Association Rule Discovery


 Applications
– Market basket analysis (marketing strategy : items to put on sale
at reduced prices)

36
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Association Rule Discovery

 Rule form:
Body --> Head [support, confidence]
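The two measures in the rule form are commonly defined as follows. The symbols are assumptions introduced here: D is the set of all transactions, A is the rule body, and B is the rule head.

```latex
\[
\mathrm{support}(A \Rightarrow B)
  = \frac{\left|\{\, T \in D : A \cup B \subseteq T \,\}\right|}{|D|},
\qquad
\mathrm{confidence}(A \Rightarrow B)
  = \frac{\left|\{\, T \in D : A \cup B \subseteq T \,\}\right|}
         {\left|\{\, T \in D : A \subseteq T \,\}\right|}
\]
```

Support is the fraction of all transactions that contain both body and head; confidence is the fraction of transactions containing the body that also contain the head.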

37
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Association Rule Discovery: Market Basket Analysis

38
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Association Rule Discovery

 Example
– When a customer buys a shirt, in 70% of cases he or she will also buy a tie.
– This happens in 13.5% of all purchases.

39
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Association Rule Discovery

 Shirt --> Tie (support = 13.5% and confidence = 70%)


40
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Association Rule Discovery

41
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Association Rule Discovery


Let minimum support be 50% and minimum confidence be 50%

Transaction  Items
T1           Bread, Jelly, Jam
T2           Bread, Jam
T3           Bread, Milk, Jam
T4           Coffee, Bread
T5           Milk, Coffee

 Rules Discovered:
{bread} --> {jam}: support= , confidence =
{jelly} --> {bread}: support= , confidence =
{jelly} --> {jam}: support= , confidence =
{jelly} --> {milk}: support= , confidence =

42
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Association Rule Discovery


Let Minimum support 50% and minimum confidence 50%

Transaction  Items
T1           Bread, Jelly, Jam
T2           Bread, Jam
T3           Bread, Milk, Jam
T4           Coffee, Bread
T5           Milk, Coffee

 Rules Discovered:
{bread} --> {jam}: support = 60%, confidence = 75%
{jelly} --> {bread}: support = 20%, confidence = 100%
{jelly} --> {jam}: support = 20%, confidence = 100%
{jelly} --> {milk}: support = 0%

Only {bread} --> {jam} meets both the minimum support and the minimum confidence.

43
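The support and confidence figures for these rules can be checked with a few lines of code. This is a minimal sketch over the five transactions shown, with item names lower-cased; the function names `support` and `confidence` are chosen here for illustration.

```python
transactions = [
    {"bread", "jelly", "jam"},   # T1
    {"bread", "jam"},            # T2
    {"bread", "milk", "jam"},    # T3
    {"coffee", "bread"},         # T4
    {"milk", "coffee"},          # T5
]

def support(body, head, transactions):
    """Fraction of all transactions that contain both body and head."""
    both = body | head
    return sum(1 for t in transactions if both <= t) / len(transactions)

def confidence(body, head, transactions):
    """Fraction of transactions containing body that also contain head."""
    body_count = sum(1 for t in transactions if body <= t)
    if body_count == 0:
        return 0.0
    both_count = sum(1 for t in transactions if (body | head) <= t)
    return both_count / body_count

bread_jam = (support({"bread"}, {"jam"}, transactions),
             confidence({"bread"}, {"jam"}, transactions))
```

For {bread} --> {jam}, bread appears in four transactions and bread with jam in three, giving support 3/5 and confidence 3/4.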
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Association Rule Discovery: Application

 Supermarket shelf management.
– Goal: To identify items that are bought together by sufficiently many customers.
– Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
– A classic rule
• If a customer buys bread, then he or she is likely to buy jam
• {Bread} --> {jam}; support = 60%, confidence = 75%

44
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Association Rule Discovery: Application

45
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Association Rule Discovery: Application

46
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Sequential Pattern Discovery

 Sequence in which customers purchase items
 Given a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

47
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks

Sequential Pattern Discovery


 Point-of-sale transaction sequences,
– Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
60% of customers who buy Intro to Visual C and C++ Primer also buy Perl for Dummies and Tcl/Tk within a month
– Athletic Apparel Store:
(Shoes) (Racket, Racketball) --> (Sports_Jacket)
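Whether a customer's purchase history follows a sequential pattern can be tested by matching the pattern's itemsets, in order, against the customer's baskets. The sketch below makes two simplifying assumptions: each history is a time-ordered list of baskets, and the "within a month" time window is ignored. The customer data and the names `contains_sequence` and `sequence_support` are hypothetical.

```python
def contains_sequence(history, pattern):
    """True if the itemsets of `pattern` occur, in order, within `history`."""
    pos = 0
    for basket in history:
        # Advance only when the next expected itemset is inside this basket.
        if pos < len(pattern) and pattern[pos] <= basket:
            pos += 1
    return pos == len(pattern)

def sequence_support(histories, pattern):
    """Fraction of customers whose history contains the pattern."""
    return sum(contains_sequence(h, pattern) for h in histories) / len(histories)

pattern = [{"Intro_To_Visual_C"}, {"C++_Primer"}, {"Perl_for_dummies", "Tcl_Tk"}]
customers = [
    [{"Intro_To_Visual_C"}, {"C++_Primer"}, {"Perl_for_dummies", "Tcl_Tk"}],
    [{"Intro_To_Visual_C"}, {"C++_Primer"}],
    [{"C++_Primer"}, {"Intro_To_Visual_C"}, {"Perl_for_dummies", "Tcl_Tk"}],
]
```

Only the first customer follows the full pattern: the second never buys the last itemset, and the third buys the first two books in the opposite order.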

48
1.5 Which Technologies are used
 As a highly application-driven domain, data mining has incorporated
many techniques

49
Data Mining Applications

 Data mining is being used for a wide variety of applications. Applications are categorized into the following six groups
– Prediction and Description

– Relationship Marketing

– Customer Profiling

– Customer Segmentation

– Outliers Identification and Detecting Fraud

– Website Design and Promotion

50
Data Mining Applications

 Prediction and Description


– Data mining may be used to answer questions like “would this customer buy a product?” or “is this customer likely to leave?”
– Usually prediction involves selecting some or all of the attributes available in the database to predict variables of interest

51
Data Mining Applications

 Relationship Marketing
– Usually customers have a lifetime value, not just the value of a single sale.
– This may include customer identification, customer value, and customer retention as well as customer development
– This also includes analyzing customer profiles and improving direct marketing plans
– It may be possible to use cluster analysis to identify customers suitable for cross-selling other products.

52
Data Mining Applications

 Customer Profiling
– It is the process of using relevant and available information to describe
the characteristics of a group of customers
– Profiling can help an enterprise identify its most valuable customers so
that the enterprise may differentiate their needs and values
– Customer profiles may include information on how customers spend
money, where and what they tend to buy, who are the most profitable
customers and so on

53
Data Mining Applications

 Customer Segmentation
– Essentially, it is a process of finding sub-groups of similar
people within a data set and can be useful in marketing.
– Furthermore, data mining may be used to understand and
predict customer behavior and profitability, to develop
new products and services, and to effectively market new
offerings.

54
Data Mining Applications

 Outliers Identification and Detecting Fraud


– There are many applications of data mining in identifying outliers
including fraud or unusual cases.
– Identifying unusual expense by member, and identifying anomalies
in expenditure patterns
– Data mining techniques are being used in a variety of fraud
detection.
– For example: Credit card fraud, Insurance fraud, Medical fraud, Tax
fraud, Customs fraud and smuggling, Telecommunication fraud

55
Data Mining Applications

 Website Design and Promotion


– Web mining may be used to discover how users navigate a
website and the result can help in improving the site design
and making it more visible on the web
– Data mining may also be used in cross selling by suggesting,
with a database of items that other customers have ordered
previously

56
Potential Application

 Other Applications of Data Mining
– Stock Market Trends
– Text and Multimedia Data Mining
– Sports Scouting
– Web Advertising
– Recommendation Systems
– Sports
– Weather forecasting

57
Major Issues in Data Mining
 Data mining is not an easy task, as the algorithms used are very complex
and data is not always available at one place. It needs to be integrated
from various heterogeneous data sources.
 The major issues in data mining can be divided into following categories
1. Mining methodology and User interaction,

2. Performance Issues

3. Diverse Data Types Issues

4. Data Mining and Society

58
Major Issues in Data Mining

 Mining Methodology and User Interaction Issues refer to:
– Mining different kinds of knowledge in databases: Different users may be interested in different kinds of knowledge. Data mining should cover a broad range of knowledge discovery tasks.
– Interactive mining of knowledge at multiple levels of abstraction: By mining knowledge at multiple levels, one may find not only high-level knowledge, such as “milk and bread are likely to be purchased together”, but also lower-level knowledge, such as “a particular brand of milk and bread are purchased together”. Discovering knowledge at multiple levels extends the scope of knowledge discovery.

59
Major Issues in Data Mining

  Mining Methodology and User Interaction Issues – contd..


– Incorporation of background knowledge: Background knowledge is necessary to express the discovered patterns. Ex.: how customers move inside the supermarket, concept hierarchies
– Data mining query languages and ad hoc data mining: SQL provides flexible searching and allows users to pose ad hoc queries. Similarly, a high-level data mining query language or flexible user interface lets users define ad hoc data mining tasks. It should facilitate specification of the relevant sets of data for analysis, the domain knowledge, the kinds of knowledge to be mined, and the conditions and constraints to be enforced on the discovered patterns.
60
Major Issues in Data Mining

 Mining Methodology and User Interaction Issues – contd..


– Presentation and visualization of data mining results: Once patterns are discovered, they need to be expressed using visual representations. These representations should be easily understandable.
– Handling noisy or incomplete data: Data cleaning methods are required to handle noise and incomplete objects while mining data regularities. Otherwise the accuracy of the discovered patterns will be poor.

61
Major Issues in Data Mining

 Performance Issues
– Efficiency and scalability of data mining algorithms: In order to effectively extract information from the huge amount of data in databases, data mining algorithms must be efficient and scalable.

62
Major Issues in Data Mining

Performance Issues – contd..


 Parallel, distributed and incremental mining algorithms:
– The huge size and complexity of the data motivate the development of parallel and distributed data-intensive mining algorithms.
– Such algorithms divide the data into partitions, which are then processed in parallel by searching for patterns.
– The patterns from each partition are then merged.
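The partition, process, and merge idea can be sketched with item counting as the stand-in "pattern search". This is only an illustration: the function names are hypothetical, threads are used for brevity, and a real system would distribute the partitions across processes or machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Search one partition: count how often each item occurs."""
    counts = Counter()
    for transaction in partition:
        counts.update(transaction)
    return counts

def parallel_item_counts(transactions, n_partitions=2):
    """Split the data, count each partition in parallel, merge the results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    merged = Counter()
    for partial in partial_counts:
        merged.update(partial)  # adds the per-partition counts together
    return merged

sample = [["bread", "jam"], ["bread"], ["milk"]]
totals = parallel_item_counts(sample)
```

Counting merges cleanly because partial counts simply add up; merging richer patterns across partitions generally needs an extra verification pass over the data.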

63
Major Issues in Data Mining

Performance Issues – contd..


 Parallel, distributed and incremental mining algorithms:
– Cloud computing and cluster computing,
• Which use computers in a distributed and collaborative way to
tackle very large-scale computational tasks, are also active
research themes in parallel data mining

64
Major Issues in Data Mining

 Diverse Data Types Issues


– Handling of relational and complex types of data: The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
– Mining information from heterogeneous databases and global information systems: The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.

65
Major Issues in Data Mining
 Data Mining and Society
– How does data mining impact society?

– What steps can data mining take to preserve the privacy of individuals?
– Do we use data mining in our daily lives without even
knowing that we do?
– These questions raise the following issues:

66
Major Issues in Data Mining

Data Mining and Society

 Social impacts of data mining:


– With data mining penetrating our everyday lives, it is
important to study the impact of data mining on society.
– How can we use data mining technology to benefit
society?
– How can we guard against its misuse?

67
Major Issues in Data Mining

Data Mining and Society


 Privacy-preserving data mining
– Data mining will help scientific discovery, business
management, economy recovery, and security protection
– However, it poses the risk of disclosing an individual’s
personal information.
– Need to observe data sensitivity and preserve people’s
privacy while performing successful data mining

68
 Discuss whether or not each of the following activities is a data mining task.
a) Dividing the customers of a company according to their gender.

b) Dividing the customers of a company according to their profitability.

c) Computing the total sales of a company.

d) Sorting a student database based on student identification numbers.

e) Predicting the outcomes of tossing a (fair) pair of dice.

f) Predicting the future stock price of a company using historical records.

g) Monitoring the heart rate of a patient for abnormalities.

h) Monitoring seismic waves for earthquake activities.

i) Extracting the frequencies of a sound wave.

69
a) Dividing the customers of a company according to their gender.

No. This is a simple database query.

b) Dividing the customers of a company according to their profitability.

No. This is an accounting calculation, followed by the application of a threshold. However, predicting the profitability of a new customer would be data mining.

c) Computing the total sales of a company.

No. Again, this is simple accounting.

d) Sorting a student database based on student identification numbers.

No. Again, this is a simple database query.

70
e) Predicting the outcomes of tossing a (fair) pair of dice.
No. Since the dice are fair, this is a probability calculation. If the dice were not fair, and
we needed to estimate the probabilities of each outcome from the data, then this would be
more like the problems considered by data mining. However, in this specific case,
solutions to this problem were developed by mathematicians a long time ago, and
thus, we wouldn’t consider it to be data mining.
f) Predicting the future stock price of a company using historical records.
Yes. We would attempt to create a model that can predict the continuous value of
the stock price. This is an example of the area of data mining known as predictive
modelling. We could use regression for this modelling, although researchers in
many fields have developed a wide variety of techniques for predicting time series.

71
g) Monitoring the heart rate of a patient for abnormalities.
Yes. We would build a model of the normal behavior of heart rate and
raise an alarm when an unusual heart behavior occurred. This would
involve the area of data mining known as anomaly detection. This could
also be considered as a classification problem if we had examples of both
normal and abnormal heart behavior.
h) Monitoring seismic waves for earthquake activities.
Yes. In this case, we would build a model of different types of seismic
wave behavior associated with earthquake activities and raise an alarm
when one of these different types of seismic activity was observed. This
is an example of the area of data mining known as classification.
i) Extracting the frequencies of a sound wave.
No. This is signal processing.
72
UNIT – I

Data Warehouse: Basic Concepts : What Is a Data Warehouse?, Differences between Operational Database Systems and Data Warehouses, Comparison of OLTP and OLAP Systems, Data Warehousing - A Multitiered Architecture, Extraction, Transformation, and Loading, Metadata Repository

Data Warehouse Modeling: Data Cube and OLAP : Data Cube: A Multidimensional Data Model, Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models, The Role of Concept Hierarchies, Typical OLAP Operations, OLAP Systems versus Statistical Databases

Data Warehouse Implementation : Efficient Data Cube Computation, DMQL

Chapter 4 : Data Warehousing and Online Analytical Processing

Text book : Data Mining - Concepts and Techniques, 3rd Edition
Authors : Jiawei Han and Micheline Kamber

73
