0% found this document useful (0 votes)
21 views44 pages

Datamining-Lect1 2

Uploaded by

ahmed.sherif3400
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views44 pages

Datamining-Lect1 2

Uploaded by

ahmed.sherif3400
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 44

DATA MINING

LECTURE 1
Introduction
2

What is data mining?


Data mining is the use of efficient techniques for
the analysis of very large collections of data and the
extraction of useful and possibly unexpected
patterns in data.
• Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amount of data
3

Why do we need data mining?


• Really, huge amounts of raw data!!
• In the digital age, TB of data is generated by the second
• Mobile devices, digital photographs, web documents.
• Facebook updates, Tweets, Blogs
• Transactions, sensor data, Queries, clicks, browsing
• Cheap storage has made possible to maintain this data
• Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube

• We are drowning in data, but starving for knowledge!


• “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets

• Need to analyze the raw data to extract knowledge


4

The data is also very complex


• Multiple types of data: tables, time series,
images, graphs, etc

• Spatial and temporal aspects


• spatial data provides the information that
identifies the location of features and boundaries
on Earth.
5

Example: transaction data


• Billions of real-life customers:
• WALMART: 20M transactions per day
• AT&T 300 M calls per day
• Credit card companies: billions of transactions per day.
6

Example: document data


• Web as a document repository: estimated 50
billions of web pages

• Wikipedia: 4 million articles (and counting)

• Online news portals: steady stream )‫)تدفق مستمر‬of


100’s of new articles every day

• Twitter: ~300 million tweets every day


7

Example: network data


• Web: 50 billion pages linked via hyperlinks

• Facebook: 500 million users

• Twitter: 300 million users

• Instant messenger: ~1billion users

• Blogs: 250 million blogs worldwide, presidential


candidates run blogs
8

Example: environmental data


• Climate data (just an example)
http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php

• “a database of temperature, precipitation and


pressure records managed by the National Climatic
Data Center, Arizona State University and the Carbon
Dioxide Information Analysis Center”

• “6000 temperature stations, 7500 precipitation


stations, 2000 pressure stations”
• Spatiotemporal data
9

Behavioral data
• Mobile phones today record a large amount of information
about the user behavior
• GPS records position
• Camera produces images
• Communication via phone and SMS
• Text via facebook updates

• Amazon collects all the items that you browsed, placed into
your basket, read reviews about, purchased.

• Google record all your browsing activity via toolbar plugins.


They also record the queries you asked, the pages you saw
and the clicks you did.

• Data collected for millions of users on a daily basis


10

Types of Attributes
• There are different types of attributes
• Categorical
• Examples: eye color, id number, rankings (e.g, good, fair,
bad), height in {tall, medium, short}
• Nominal (no order) vs Ordinal (order)
• Nominal and Ordinal are collectively referred to as Categorical
or qualitative attributes

• Numeric(quantitative)
• Examples: dates, temperature, time, length, value.
• Interval
• ratio
11

Types of Attributes

An independent way of distinguishing between attributes is by the


number of values they can take:
• Discrete A discrete attribute has a finite or countably infinite set
of values.
• Such attributes can be categorical, such as zip codes or ID
numbers, or numeric, such as counts.
• Discrete attributes are often represented using integer variables.
• Binary attributes are a special case of discrete attributes and assume
only two values, e.g., true/false, yes/no,male/female, or 0/1.
• Continuous A continuous attribute is one whose values are real numbers.
• Examples include attributes such as temperature, height, or weight.
• Continuous attributes are typically represented as floating-point variables.
12
13

Record Data
• Much data mining work assumes that the data set is a
collection of records(data objects), each of which consists
of a fixed set of data fields (attributes).
14

Categorical Data
• Data that consists of a collection of records, each
of which consists of a fixed set of categorical
attributes
Tid Refund Marital Taxable
Status Income Cheat

1 Yes Single High No


2 No Married Medium No
3 No Single Low No
4 Yes Married High No
5 No Divorced Medium Yes
6 No Married Low No
7 Yes Divorced High No
8 No Single Medium Yes
9 No Married Medium No
10 No Single Medium Yes
10
15

Document Data
• Each document becomes a `term' vector,
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the
corresponding term occurs in the document.
• Bag-of-words representation – no ordering

timeout

season
coach

game
score
team

ball

lost
pla

wi
n
y

Document 1 3 0 5 0 2 6 0 2 0 2

Document 2 0 7 0 2 1 0 0 3 0 0

Document 3 0 1 0 0 1 2 2 0 3 0
16

Transaction Data
• Each record (transaction) is a set of items.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk

• A document can also be represented as a set of


words (no counts)
17

Ordered Data
• Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
18

Ordered Data
• Time series
• Sequence of ordered (over “time”) numeric values.
19

Graph Data
• Examples: Web graph and HTML Links
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
2 <a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
5 1 <a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
2 <a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
5
Attributes
So, what is Data?
Tid Refund Marital Taxable
• Collection of data objects and Status Income Cheat

their attributes 1 Yes Single 125K No


2 No Married 100K No
• An attribute is a property or 3 No Single 70K No
characteristic of an object 4 Yes Married 120K No
• Examples: eye color of a person, 5 No Divorced 95K Yes
temperature, etc.
Objects
6 No Married 60K No
• Attribute is also known as 7 Yes Divorced 220K No
variable, field, characteristic, or 8 No Single 85K Yes
feature, Dimension 9 No Married 75K No
• A collection of attributes describe 10 No Single 90K Yes
an object
10

• Object is also known as record,


Size: Number of objects
point, case, sample, entity, or
Dimensionality: Number of attributes
instance
21

Data Mining tasks


• Data Mining Tasks
Data mining tasks are generally divided into two major categories:
• Predictive tasks. The objective of these tasks is to predict the value of a particular
attribute based on the values of other attributes. The attribute
to be predicted is commonly known as the target or dependent variable,
while the attributes used for making the prediction are known as
the explanatory or independent variables.

• Descriptive tasks. Here, the objective is to derive patterns (correlations,


trends, clusters, and anomalies) that summarize the underlying
relationships in data.
22

Data Mining tasks


23

Data Mining tasks


24

Frequent Itemsets and Association Rules


• Given a set of records each of which contain some
number of items from a given collection;
• Identify sets of items (itemsets) occurring frequently
together
• Produce dependency rules which will predict
occurrence of an item based on occurrences of other
items.
Itemsets Discovered:
TID Items {Milk,Coke}
1 Bread, Coke, Milk {Diaper, Milk}
2 Beer, Bread
3 Beer, Coke, Diaper, Milk Rules Discovered:
4 Beer, Bread, Diaper, Milk {Milk} --> {Coke}
5 Coke, Diaper, Milk {Diaper, Milk} --> {Beer}

Tan, M. Steinbach and V. Kumar, Introduction to Data Mining


25

Clustering Definition
• Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
• Data points in one cluster are more similar to one
another.
• Data points in separate clusters are less similar to
one another.

Tan, M. Steinbach and V. Kumar, Introduction to Data Mining


26

Clustering: Application
• Document Clustering:
• Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
• Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
• Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.

Tan, M. Steinbach and V. Kumar, Introduction to Data Mining


27

Classification: Definition
• Given a collection of records (training set )
• Each record contains a set of attributes, one of the
attributes is the class.
• Find a model for class attribute as a function
of the values of other attributes.

• Goal: previously unseen records should be


assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it.
28

Classification Example
Tid Refund Marital Taxable Refund Marital Taxable
Status Income Cheat Status Income Cheat

1 Yes Single 125K No No Single 75K ?


2 No Married 100K No Yes Married 50K ?
3 No Single 70K No No Married 150K ?
4 Yes Married 120K No Yes Divorced 90K ?
5 No Divorced 95K Yes No Single 40K ?
6 No Married 60K No No Married 80K ? Test
10

Set
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No Learn
Training
10 No Single 90K Yes Model
10

Set Classifier

Tan, M. Steinbach and V. Kumar, Introduction to Data Mining


29

Classification: Application 1
• Ad Click Prediction
• Goal: Predict if a user that visits a web page will click
on a displayed ad. Use it to target users with high
click probability.
• Approach:
• Collect data for users over a period of time and record who
clicks and who does not. The {click, no click} information
forms the class attribute.
• Use the history of the user (web pages browsed, queries
issued) as the features.
• Learn a classifier model and test on new users.
30

Connections of Data Mining with other


areas
• Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems
• Traditional Techniques
may be unsuitable due to
Statistics/ Machine Learning/
• Enormity of data AI Pattern
• High dimensionality Recognition
of data
Data Mining
• Heterogeneous,
distributed nature
of data Database
systems

Tan, M. Steinbach and V. Kumar, Introduction to Data Mining


31

Data Mining: Confluence of Multiple Disciplines

Database
Technology Statistics

Machine Visualization
Data Mining
Learning

Pattern
Recognition Other
Disciplines
32

Knowledge Discovery (KDD) Process


33

Knowledge Discovery (KDD) Process


• Data cleaning
• Data integration from multiple sources
• Data selection for data mining
• Data transformation
• Data mining
• Patterns evaluation
• Presentation of the mining results
34

Knowledge Discovery (KDD) Process


35

Knowledge Discovery (KDD) Process


36

Knowledge Discovery (KDD) Process


37

Knowledge Discovery (KDD) Process

7
38

The data analysis pipeline


• Mining is not the only step in the analysis process

Data Result
Preprocessing Data Mining Post-processing

• Preprocessing: real data is noisy, incomplete and inconsistent. Data cleaning


is required to make sense of the data
• Techniques: Sampling, Dimensionality Reduction, Feature selection.
• A dirty work, but it is often the
most important step for the analysis.
• Post-Processing: Make the data actionable and useful to the user
• Statistical analysis of importance
• Visualization.
• Pre- and Post-processing are often data mining tasks as well
• Because of the many ways data can be collected and stored, data
preprocessing is perhaps the most laborious and time-
consuming step in the overall knowledge discovery process.
Post-processing
• Visualization
• The human eye is a powerful analytical tool
• If we visualize the data properly, we can discover
patterns
• Visualization is the way to present the data so that
patterns can be seen
• E.g., histograms and plots are a form of visualization
• There are multiple techniques (a field on its own)
Data Quality
• Examples of data quality problems:
• Noise and outliers
Tid Refund Marital Taxable
• Missing values Status Income Cheat

• Duplicate data 1 Yes Single 125K No


2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No

A mistake or a millionaire? 5 No Divorced 10000K Yes


6 No NULL 60K No
Missing values 7 Yes Divorced 220K NULL
8 No Single 85K Yes
9 No Married 90K No
Inconsistent duplicate entries
9 No Single 90K No
10
41

Sampling
• Sampling is the main technique employed for data
selection.
• It is often used for both the preliminary investigation of the data and
the final data analysis.

• Statisticians sample because obtaining the entire set of


data of interest is too expensive or time consuming.

• Sampling is used in data mining because processing the


entire set of data of interest is too expensive or time
consuming.
42

Sampling …
• The key principle for effective sampling is the
following:
• using a sample will work almost as well as using the
entire data sets, if the sample is representative

• A sample is representative if it has approximately the


same property (of interest) as the original set of data
43

Types of Sampling
• Simple Random Sampling
• There is an equal probability of selecting any particular item

• Sampling without replacement


• As each item is selected, it is removed from the population

• Sampling with replacement


• Objects are not removed from the population as they are selected
for the sample.
• In sampling with replacement, the same object can be picked up more
than once

• Stratified sampling
• Split the data into several partitions; then draw random samples
from each partition
44

Sample Size

8000 points 2000 Points 500 Points

You might also like