Data Mining CSE-443: Ayesha Aziz Prova Lecturer, Dept. of CSE CWU

This document provides an overview of the Data Mining course (CSE-443) taught by Ayesha Aziz Prova at CWU. It outlines the course contents, assessment breakdown, and recommended books, and introduces data mining: extracting useful patterns from large amounts of data, which can help organizations address the "data rich but information poor" problem.

DATA MINING

CSE-443

Ayesha Aziz Prova

Lecturer,
Dept. of CSE
CWU
CONTENTS

 Course Outline
 Recommended Book

THEORY EXAM

Number of class tests: 2

Number of presentations: 1 (equal to a class test)
Class test: 10
Assignment: 10
Midterm: 30
Final exam: 40
Attendance: 5
Class performance: 5

PRESENTATION

 Members in a presentation: 2 (maximum)

 Number of presentations: 1
(* The presentation counts as a mandatory class test)

BOOK
 Data Mining: Concepts and Techniques
 J. Han and M. Kamber
 Introduction to Data Mining
 Tan, Steinbach, Kumar

WHAT IS DATA MINING?

 After years of data mining there is still no unique answer to this question.

 A tentative definition:
Data mining is the use of efficient techniques for
the analysis of very large collections of data and the
extraction of useful and possibly unexpected
patterns in data.
DATA MINING

 Data explosion problem:

 Automated data collection tools and mature database technology.
 Leading to tremendous amounts of data stored in databases, data
warehouses and other information repositories.
 We are drowning in data, but starving for knowledge!

DATA RICH BUT INFORMATION POOR

Databases are too big

Data Mining can help

discover knowledge

Terrorbytes
WHAT IS DATA MINING?

 Data mining is also called knowledge discovery in databases (KDD)
 Data mining is
 extraction of useful patterns from data sources, e.g., databases, texts,
the web, images.
 Patterns must be:
 valid, novel, potentially useful, understandable

KNOWLEDGE DISCOVERY

EXAMPLE OF DISCOVERED PATTERNS

 Association rules:
“80% of customers who buy cheese and milk also buy bread, and 5% of
customers buy all of them together”
Cheese, Milk → Bread [support = 5%, confidence = 80%]

ORIGINS OF DATA MINING

 Draws ideas from machine learning/AI, pattern recognition, statistics,
and database systems
 Traditional techniques may be unsuitable due to
 Enormity of data
 High dimensionality of data
 Heterogeneous, distributed nature of data

[Figure: Data mining at the intersection of statistics/AI, machine learning/pattern recognition, and database systems]
WHY DO WE NEED DATA MINING?

 Really, really huge amounts of raw data!

 In the digital age, terabytes of data are generated every second
 Mobile devices, digital photographs, web documents.
 Facebook updates, Tweets, Blogs, User-generated content
 Transactions, sensor data, surveillance data
 Queries, clicks, browsing
 Cheap storage has made it possible to maintain this data
 Need to analyze the raw data to extract knowledge

WHY DO WE NEED DATA MINING?
 Data is power!
 Today, the collected data is one of the biggest assets of an online
company
 Query logs of Google
 The friendship and updates of Facebook
 Tweets and follows of Twitter
 Amazon transactions
 We need a way to harness the collective intelligence

THE DATA IS ALSO VERY COMPLEX

 Multiple types of data: tables, images, graphs, etc.

 Interconnected data of different types:
 From a mobile phone we can collect the location of the user, friendship
information, check-ins to venues, opinions through Twitter, images through
cameras, queries to search engines

EXAMPLE: TRANSACTION DATA

 Billions of real-life customers:

 Credit card companies: billions of transactions per day.

 Point cards allow companies to collect information about specific users

EXAMPLE: DOCUMENT DATA

 Web as a document repository: an estimated 50 billion web pages

 Wikipedia: 4 million articles (and counting)
 Online news portals: a steady stream of hundreds of new articles every
day

EXAMPLE: NETWORK DATA

 Web: 50 billion pages linked via hyperlinks

 Facebook: 500 million users
 Twitter: 300 million users
 Instant messenger: ~1 billion users
 Blogs: 250 million blogs worldwide; presidential candidates run blogs

EXAMPLE: GENOMIC SEQUENCES

 http://www.1000genomes.org/page.php
 Full sequence of 1000 individuals
 3 × 10^9 nucleotides per person → 3 × 10^12 nucleotides
 Lots more data in fact: medical history of the persons, gene expression data

EXAMPLE: ENVIRONMENTAL DATA
 Climate data (just an example)
http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php

 “a database of temperature, precipitation and pressure records managed by the
National Climatic Data Center, Arizona State University and the Carbon
Dioxide Information Analysis Center”

 “6000 temperature stations, 7500 precipitation stations, 2000 pressure
stations”
 Spatiotemporal data

SO, WHAT IS DATA?
 Collection of data objects and their attributes
 An attribute is a property or characteristic of an object
 Examples: eye color of a person, temperature, etc.
 Attribute is also known as variable, field, characteristic, or feature
 A collection of attributes describes an object
 Object is also known as record, point, case, sample, entity, or instance

Example (rows are objects, columns are attributes):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Size: Number of objects
Dimensionality: Number of attributes
Sparsity: Number of populated object-attribute pairs
TYPES OF ATTRIBUTES
 There are different types of attributes
 Categorical
 Examples: eye color, zip codes, words, rankings (e.g., good, fair, bad), height
in {tall, medium, short}
 Numeric
 Examples: dates, temperature, time, length, value, count.
 Discrete (counts) vs Continuous (temperature)
 Special case: Binary attributes (yes/no, exists/not exists)

NUMERIC RECORD DATA

 If data objects have the same fixed set of numeric attributes, then the
data objects can be thought of as points in a multi-dimensional
space, where each dimension represents a distinct attribute

 Such a data set can be represented by an n-by-d data matrix, where
there are n rows, one for each object, and d columns, one for each
attribute

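The n-by-d data matrix described above can be sketched as a plain list of rows. This is only an illustration: the three attributes (temperature, humidity, wind speed) and their values are hypothetical and do not come from the slides.

```python
# Hypothetical numeric records: (temperature in C, humidity in %, wind speed in km/h).
# Each row is one object; each column is one attribute.
data_matrix = [
    [21.5, 60.0, 12.0],
    [19.0, 72.0, 8.5],
    [25.2, 55.0, 20.1],
    [23.3, 64.0, 15.4],
]

n = len(data_matrix)       # number of objects (rows)
d = len(data_matrix[0])    # number of attributes (columns)

# Every object must carry the same fixed set of d attributes.
if any(len(row) != d for row in data_matrix):
    raise ValueError("all objects must have the same number of attributes")


def column(matrix, j):
    """Return attribute j across all objects (one column of the matrix)."""
    return [row[j] for row in matrix]


print(f"{n}-by-{d} data matrix")
print("first object as a point in d-dimensional space:", tuple(data_matrix[0]))
print("attribute 0 across all objects:", column(data_matrix, 0))
```

Reading a row gives one object as a point in d-dimensional space; reading a column gives one attribute across all objects.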
CATEGORICAL DATA
 Data that consists of a collection of records, each of which consists of a
fixed set of categorical attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          High            No
2    No      Married         Medium          No
3    No      Single          Low             No
4    Yes     Married         High            No
5    No      Divorced        Medium          Yes
6    No      Married         Low             No
7    Yes     Divorced        High            No
8    No      Single          Medium          Yes
9    No      Married         Medium          No
10   No      Single          Medium          Yes

DOCUMENT DATA
 Each document becomes a `term' vector,
 each term is a component (attribute) of the vector,
 the value of each component is the number of times the
corresponding term occurs in the document.
 Bag-of-words representation: no ordering

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

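A minimal sketch of how a term vector like the ones above could be built. The vocabulary is taken from the slide's table; the sample sentence and the whitespace tokenizer are assumptions made for illustration, and real text would usually need better tokenization.

```python
from collections import Counter

# Vocabulary (one vector component per term), in the column order of the table.
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]


def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text.

    Word order is discarded (bag of words); tokenization is a naive
    lowercase whitespace split, which is an assumption of this sketch.
    """
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical document, not taken from the slides.
document = "the team lost the game but the coach said the team will win the next game"
vector = term_vector(document, vocabulary)

for term, count in zip(vocabulary, vector):
    print(f"{term:>8}: {count}")
```

Words outside the vocabulary (such as "the") are simply ignored, and shuffling the words of the document would give the same vector.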
TRANSACTION DATA

 Each record (transaction) is a set of items.

TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
 A set of items can also be represented as a binary vector, where
each attribute is an item.
 A document can also be represented as a set of words (no counts)

Sparsity: average number of products bought by a customer
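As a rough sketch, the five transactions above can be turned into binary vectors with one position per item. The alphabetical item order is an arbitrary choice made here, not something the slides prescribe.

```python
# Transactions from the table above: TID -> set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# One attribute per distinct item; sorted order is an arbitrary convention.
items = sorted(set().union(*transactions.values()))

# 1 if the item is in the basket, 0 otherwise.
binary = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

# Average number of items per transaction (the slide's notion of sparsity).
avg_items = sum(len(basket) for basket in transactions.values()) / len(transactions)

print("items:", items)
for tid, vec in binary.items():
    print(tid, vec)
print("average items per transaction:", avg_items)
```

With many distinct products and small baskets, most entries of these vectors are 0, which is why the average basket size is a useful measure of sparsity.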

ORDERED DATA

 Genomic sequence data

 Data is a long ordered string

GRAPH DATA

 Examples: Web graph and HTML Links

TYPES OF DATA
 Numeric data: Each object is a point in a multidimensional space
 Categorical data: Each object is a vector of categorical values
 Set data: Each object is a set of values (with or without counts)
 Sets can also be represented as binary vectors, or vectors of counts
 Ordered sequences: Each object is an ordered sequence of values.
 Graph data

WHAT CAN YOU DO WITH THE DATA?

 Suppose that you are the owner of a supermarket and you have collected billions of
market-basket records. What information would you extract from them, and how would you use it?

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Possible uses:
 Product placement
 Catalog creation
 Recommendations

 What if this was an online store?

WHAT CAN YOU DO WITH THE DATA?
 Suppose you are a biologist who has microarray expression data: thousands of genes, and their expression
values over thousands of different settings (e.g., tissues). What information would you like to get out of
your data?

Groups of genes and tissues

WHY MINE DATA? COMMERCIAL VIEWPOINT
 Lots of data is being collected and warehoused
 Web data, e-commerce
 Purchases at department/grocery stores
 Bank/credit card transactions
 Computers have become cheaper and more powerful
 Competitive pressure is strong
 Provide better, customized services for an edge (e.g., in Customer Relationship Management)
WHY MINE DATA? SCIENTIFIC VIEWPOINT

 Data collected and stored at enormous speeds (GB/hour)
 remote sensors on a satellite
 telescopes scanning the skies
 microarrays generating gene expression data
 scientific simulations generating terabytes of data
 Traditional techniques infeasible for raw data
 Data mining may help scientists
 in classifying and segmenting data
 in hypothesis formation
WHAT IS DATA MINING AGAIN?
 “Data mining is the analysis of (often large) observational data sets to find unsuspected
relationships and to summarize the data in novel ways that are both understandable and
useful to the data analyst” (Hand, Mannila, Smyth)

 “Data mining is the discovery of models for data” (Rajaraman, Ullman)

 We can have the following types of models
 Models that explain the data (e.g., a single function)
 Models that predict the future data instances.
 Models that summarize the data
 Models that extract the most prominent features of the data.

DATA MINING TASKS...
 Classification [Predictive]
 Clustering [Descriptive]
 Association Rule Discovery [Descriptive]
 Sequential Pattern Discovery [Descriptive]
 Regression [Predictive]
 Deviation Detection [Predictive]

CLASSIFICATION: DEFINITION

 Given a collection of records (training set)
 Each record contains a set of attributes; one of the attributes is the class.
 Find a model for the class attribute as a function of the values of the other
attributes.
 Goal: previously unseen records should be assigned a class as
accurately as possible.
 A test set is used to determine the accuracy of the model. Usually, the given
data set is divided into training and test sets, with the training set used to build
the model and the test set used to validate it.

CLASSIFICATION EXAMPLE
Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Cheat (class)

Training set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

Training set → Learn classifier → Model, which is then applied to the test set
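The train/test workflow can be sketched on the ten labelled records. Everything beyond the data is an assumption made for illustration: the 7/3 split is arbitrary, and the "model" is a deliberately naive majority-vote rule on a single attribute (Marital Status), not the classifier the slides have in mind.

```python
from collections import Counter, defaultdict

# (Refund, Marital Status, Taxable Income in K, Cheat) from the labelled table.
records = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Hold-out split (an arbitrary choice): first 7 records train, last 3 test.
train, test = records[:7], records[7:]

MARITAL = 1  # index of the single attribute the toy rule looks at


def learn_rule(training, attr_index):
    """For each attribute value, remember the most common class in training."""
    votes = defaultdict(Counter)
    for rec in training:
        votes[rec[attr_index]][rec[-1]] += 1
    default = Counter(rec[-1] for rec in training).most_common(1)[0][0]
    rule = {value: c.most_common(1)[0][0] for value, c in votes.items()}
    return rule, default


def predict(model, rec, attr_index):
    """Predict the class of a record; unseen values fall back to the default."""
    rule, default = model
    return rule.get(rec[attr_index], default)


model = learn_rule(train, MARITAL)
correct = sum(predict(model, rec, MARITAL) == rec[-1] for rec in test)
accuracy = correct / len(test)
print(f"test accuracy: {correct}/{len(test)} = {accuracy:.2f}")
```

On this tiny sample the rule gets only 1 of the 3 held-out records right. The model never saw those records during training, so the test set exposes weaknesses that the training data alone would hide.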
CLASSIFICATION: APPLICATION 1

CLASSIFICATION: APPLICATION 2

 Fraud Detection
 Goal: Predict fraudulent cases in credit card transactions.
 Approach:
 Use credit card transactions and the information on its account-holder as
attributes.
 When a customer buys, what they buy, how often they pay on
time, etc.
 Label past transactions as fraud or fair transactions. This forms the class
attribute.
 Learn a model for the class of the transactions.
 Use this model to detect fraud by observing credit card transactions on an
account.

CLASSIFYING GALAXIES

Class:
• Stages of formation: Early, Intermediate, Late

Attributes:
• Image features
• Characteristics of light waves received, etc.

Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
CLUSTERING DEFINITION

 Given a set of data points, each having a set of attributes, and a
similarity measure among them, find clusters such that
 Data points in one cluster are more similar to one another.
 Data points in separate clusters are less similar to one another.
 Similarity Measures:
 Euclidean Distance if attributes are continuous.
 Other Problem-specific Measures.

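To make the two conditions concrete, the sketch below compares average Euclidean distances within and between two clusters. The points and the cluster assignment are hypothetical; the code only measures a given clustering and does not find one.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Hypothetical 2-D points, already assigned to two clusters.
clusters = {
    "A": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)],
}


def mean_intracluster_distance(clusters):
    """Average distance over all pairs of points that share a cluster."""
    pairs = [dist(p, q)
             for points in clusters.values()
             for p, q in combinations(points, 2)]
    return sum(pairs) / len(pairs)


def mean_intercluster_distance(clusters):
    """Average distance over all pairs of points from different clusters."""
    pairs = [dist(p, q)
             for (_, a), (_, b) in combinations(clusters.items(), 2)
             for p in a
             for q in b]
    return sum(pairs) / len(pairs)


intra = mean_intracluster_distance(clusters)
inter = mean_intercluster_distance(clusters)
print(f"mean intracluster distance: {intra:.3f}")
print(f"mean intercluster distance: {inter:.3f}")
```

A good clustering keeps the first number small relative to the second, which is the case for these well-separated points.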
CLUSTERING DEFINITION

[Figure: Intracluster distances are minimized; intercluster distances are maximized]

CLUSTERING: APPLICATION 1

 Market Segmentation:
 Goal: subdivide a market into distinct subsets of customers where any subset
may conceivably be selected as a market target to be reached with a distinct
marketing mix.
 Approach:
 Collect different attributes of customers based on their geographical and
lifestyle related information.
 Find clusters of similar customers.
 Measure the clustering quality by observing buying patterns of customers
in same cluster vs. those from different clusters.

CLUSTERING: APPLICATION 2

 Bioinformatics applications:
 Goal: Group genes and tissues together such that genes are co-expressed on the same tissues

CLUSTERING: APPLICATION 3

 Document Clustering:
 Goal: To find groups of documents that are similar to each other based on
the important terms appearing in them.
 Approach:
 To identify frequently occurring terms in each document.
 Form a similarity measure based on the frequencies of different terms.
 Use it to cluster.
 Gain:
 Information Retrieval can utilize the clusters to relate a new document or search
term to clustered documents.

ILLUSTRATING DOCUMENT CLUSTERING

 Clustering Points: 3204 articles of the Los Angeles Times.

 Similarity Measure: How many words are common in these documents (after
some word filtering).

Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278

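The table can be summarized as the fraction of articles correctly placed, per category and overall. The short sketch below does only that arithmetic on the figures from the slide; the variable names are illustrative.

```python
# Category -> (total articles, correctly placed), copied from the table above.
results = {
    "Financial": (555, 364),
    "Foreign": (341, 260),
    "National": (273, 36),
    "Metro": (943, 746),
    "Sports": (738, 573),
    "Entertainment": (354, 278),
}

total_articles = sum(total for total, _ in results.values())
total_correct = sum(correct for _, correct in results.values())

for category, (total, correct) in results.items():
    print(f"{category:<14} {correct / total:6.1%}")
print(f"{'Overall':<14} {total_correct / total_articles:6.1%}")
```

The totals add up to the 3204 articles mentioned on the slide, with 2257 of them correctly placed (about 70%). National stands out with only 36 of 273 articles placed correctly.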
FREQUENT ITEMSETS AND ASSOCIATION RULES
 Given a set of records, each of which contains some number of items from a
given collection:
 Identify sets of items (itemsets) occurring frequently together
 Produce dependency rules which will predict the occurrence of
an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Itemsets discovered:
{Milk, Coke}
{Diaper, Milk}

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
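The support and confidence of the itemsets and rules above can be checked directly against the five transactions. This is a brute-force sketch using the usual definitions (support = fraction of transactions containing the itemset; confidence = fraction of transactions containing the antecedent that also contain the consequent). It is not an efficient mining algorithm such as Apriori.

```python
# The five transactions from the table above.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return (count(antecedent | consequent, transactions)
            / count(antecedent, transactions))


print("support {Milk, Coke}:", support({"Milk", "Coke"}, transactions))
print("support {Diaper, Milk}:", support({"Diaper", "Milk"}, transactions))
print("confidence {Milk} --> {Coke}:",
      confidence({"Milk"}, {"Coke"}, transactions))
print("confidence {Diaper, Milk} --> {Beer}:",
      round(confidence({"Diaper", "Milk"}, {"Beer"}, transactions), 3))
```

Both itemsets appear in 3 of the 5 transactions. The rule {Milk} --> {Coke} holds in 3 of the 4 transactions containing Milk, and {Diaper, Milk} --> {Beer} in 2 of the 3 containing both Diaper and Milk.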
FREQUENT ITEMSETS: APPLICATIONS

 Text mining: finding associated phrases in text
 There are lots of documents that contain the phrases “association rules”,
“data mining” and “efficient algorithm”
 Recommendations:
 Users who buy this item often buy that item as well
 Users who watched James Bond movies also watched Jason Bourne
movies.
 Recommendations make use of item and user similarity

ASSOCIATION RULE DISCOVERY: APPLICATION

 Supermarket shelf management.

 Goal: To identify items that are bought together by sufficiently many
customers.
 Approach: Process the point-of-sale data collected with barcode
scanners to find dependencies among items.
 A classic rule --
 If a customer buys diapers and milk, then he is very likely to buy beer.
 So, don’t be surprised if you find six-packs stacked next to diapers!

SEQUENTIAL PATTERN MINING

 Sequential pattern mining:

A sequential rule: A → B, says that event A will be immediately
followed by event B with a certain confidence.

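One way to estimate the confidence of a sequential rule A → B is to take the fraction of occurrences of A that are immediately followed by B. This is a sketch under that assumed definition; the event names and the event log are hypothetical.

```python
def sequential_confidence(sequence, a, b):
    """Fraction of occurrences of event `a` (excluding a final one with no
    successor) that are immediately followed by event `b`."""
    followers = [sequence[i + 1]
                 for i in range(len(sequence) - 1)
                 if sequence[i] == a]
    if not followers:
        return 0.0
    return sum(event == b for event in followers) / len(followers)


# Hypothetical event log of one user session stream.
events = ["search", "view", "search", "buy",
          "search", "view", "search", "view"]

conf = sequential_confidence(events, "search", "view")
print(f"confidence(search -> view) = {conf:.2f}")
```

Here "search" occurs four times and is immediately followed by "view" three times, so the estimated confidence is 0.75.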
REGRESSION

 Predict a value of a given continuous-valued variable based on the values of
other variables, assuming a linear or nonlinear model of dependency.
 Extensively studied in statistics and in the neural network field.
 Examples:
 Predicting sales amounts of a new product based on advertising
expenditure.
 Predicting wind velocities as a function of temperature, humidity, air
pressure, etc.
 Time series prediction of stock market indices.

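For the linear case, a minimal sketch is ordinary least squares with a single predictor. The advertising and sales figures below are made up for illustration and chosen to lie exactly on a line, so the fit is easy to verify by hand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure vs. sales (arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, sales)
predicted = slope * 5.0 + intercept  # predicted sales for expenditure 5.0

print(f"sales = {slope:.2f} * advertising + {intercept:.2f}")
print(f"predicted sales at advertising = 5.0: {predicted:.2f}")
```

Real data would not fall exactly on a line. The fitted line then minimizes the squared prediction error, and a nonlinear model may be needed if the dependency is not linear.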
DEVIATION/ANOMALY DETECTION

 Detect significant deviations from normal behavior
 Discovering the most significant changes in data
 Applications:
 Credit Card Fraud Detection
 Network Intrusion Detection

Typical network traffic at university level may reach over 100 million connections
per day
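One simple way to flag deviations from normal behavior is a z-score test: mark values that lie more than a chosen number of standard deviations from the mean. This is only one of many possible techniques and is not necessarily the one intended by the slides. The connection counts and the threshold of 2.0 are assumptions made for this sketch.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []           # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical connection counts (in millions) for six consecutive days.
daily_connections = [100, 102, 98, 101, 99, 500]

anomalies = find_anomalies(daily_connections)
print("anomalous days:", anomalies)
```

The single extreme value inflates both the mean and the standard deviation, which is a known weakness of this test on small samples. Median-based measures are more robust in practice.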
CHALLENGES OF DATA MINING

 Scalability
 Dimensionality
 Complex and Heterogeneous Data
 Data Quality
 Data Ownership and Distribution
 Privacy Preservation
 Streaming Data

THANKS

Any questions?
