0% found this document useful (0 votes)
10 views54 pages

Week 1

CP610 Data Analysis introduces the course content which includes data analysis functions and evaluation methods. The course has no required textbook but provides references. Students will be evaluated based on assignments, a midterm, paper critique, and project. The paper critique and project can be completed individually or in groups of two. The introduction outline covers why data analysis is important due to explosive data growth, defines what data analysis is, discusses common data types, and introduces major data analysis functions like classification, clustering, association rule mining, and outlier analysis. Evaluation methods include generalization, correlation analysis, and trend analysis.

Uploaded by

veceki2439
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views54 pages

Week 1

CP610 Data Analysis introduces the course content which includes data analysis functions and evaluation methods. The course has no required textbook but provides references. Students will be evaluated based on assignments, a midterm, paper critique, and project. The paper critique and project can be completed individually or in groups of two. The introduction outline covers why data analysis is important due to explosive data growth, defines what data analysis is, discusses common data types, and introduces major data analysis functions like classification, clustering, association rule mining, and outlier analysis. Evaluation methods include generalization, correlation analysis, and trend analysis.

Uploaded by

veceki2439
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 54

CP610 Data Analysis

-Introduction & Get to know your data

Jiashu (Jessie) Zhao


At the Beginning of this course
• 9:00 – 11:50 Friday
• Location: DAWB 2-105
• Office Hour: Friday 11:50-12:30
• Office Location: N2076G

If you are contacting my or the TA by email, please


include [CP610B] in the email subject line.

2
• No Required Textbook
• References
• Data Mining: Concepts and Techniques, 2nd ed., Jiawei
Han and Micheline Kamber, Morgan Kaufmann, 2006.
• Jure Leskovec, Anand Rajaraman, and Jeffrey David
Ullman, Mining of Massive Datasets (2nd Edition),
Cambridge University Press, 2014.
• Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin
Kumar, Introduction to Data Mining (2nd Edition),
Pearson, 2018.
• D. J. Hand, H. Mannila, and P. Smyth, Principles of Data
Mining, MIT Press, 2001.

3
• Evaluation:
Assessment Weighting
Assignment 30%
Midterm 30%
Paper Critique 10%
Project 30%
Total 100%

• Paper Critique and project: You can work in a group


of 2, or work on it by yourself. You will not receive
extra marks if you work on it by yourself.

4
Paper critique
• Select a paper from recent 2 year’s papers on
related conferences and journals
• Submit a written critique with
• Introduction: paper’s information & preview of your
analysis
• Summary: paper’s main point & findings
• Critique: Strength and weakness of the paper (2-3) &
your opinion (supported by evidence from the paper)
• Conclusion: key points of your analysis & significance of
the research

5
Conferences and Journals

• KDD Conferences n Other related conferences


• ACM SIGKDD Int. Conf. on Knowledge
n DB conferences: ACM SIGMOD,
Discovery in Databases and Data
VLDB, ICDE, EDBT, ICDT, …
Mining (KDD)
n Web and IR conferences: WWW,
• SIAM Data Mining Conf. (SDM)
SIGIR, WSDM
• (IEEE) Int. Conf. on Data Mining
(ICDM) n ML conferences: ICML, NIPS
• European Conf. on Machine Learning n PR conferences: CVPR,
and Principles and practices of n Journals
Knowledge Discovery and Data n Data Mining and Knowledge
Mining (ECML-PKDD)
Discovery (DAMI or DMKD)
• Pacific-Asia Conf. on Knowledge
Discovery and Data Mining (PAKDD) n IEEE Trans. On Knowledge and
Data Eng. (TKDE)
• Int. Conf. on Web Search and Data
Mining (WSDM) n KDD Explorations
n ACM Trans. on KDD

6
Project
• Identify a research topic closely related to this
course
• Propose a novel approach – clarify how it is
different from existing approaches
• Experiments
• Datasets: Public Accessible (or upon approval)
• Experimental Results
• Validate your argument
• Prove the effectiveness of your approach
• Presentation (10 min)
• Submit a report
7
Report Format
• Template:
• https://www.ieee.org/conferences/publishing/template
s.html
• Paper Critique: 1-2 pages
• Project: 5 pages
Introduction Outline
• Why data analysis?
• What is data analysis?
• What kind of data?
• Data analysis functions and evaluation
• Applications of Data Analysis
• Major Challenges in Data Analysis

9
Why Data Analysis?

• The Explosive Growth of Data: from terabytes to petabytes


• Data collection and data availability
• Automated data collection tools, database systems, Web,
computerized society
• Major sources of abundant data
• Business: E-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube

• We are drowning in data, but starving for knowledge!

• “Necessity is the mother of invention”— Data analysis

10
What is data analysis
• Data analysis is a process of
inspecting, cleansing, transforming and modeling
data with the goal of discovering useful
information, informing conclusions and supporting
decision-making.
Data Science

Data
Analysis
Data
Mining

11
Example: A Web Mining Framework

• Web mining usually involves


• Data cleaning
• Data integration from multiple sources
• Warehousing the data
• Data selection for data mining
• Data mining
• Presentation of the mining results
• Patterns and knowledge to be used or stored into
knowledge-base

12
Example: Medical Data Mining
• Health care & medical data mining – often adopted
such a view in statistics and machine learning
• Preprocessing of the data (including feature
extraction and dimension reduction)
• Classification or/and clustering processes
• Post-processing for presentation

13
What Kinds of Data?
• Database-oriented data sets and applications
• Relational database, data warehouse, transactional database

• Advanced data sets and advanced applications


• Data streams and sensor data
• Time-series data, temporal data, sequence data (incl. bio-sequences)
• Structure data, graphs, social networks and multi-linked data
• Object-relational databases
• Heterogeneous databases and legacy databases
• Spatial data and spatiotemporal data
• Multimedia database
• Text databases
• The World-Wide Web

• Big Data
14
Data Analysis Function: Generalization

• Information integration and data warehouse construction


• Data cleaning, transformation, integration, and
multidimensional data model
• Data cube technology
• Scalable methods for computing (i.e., materializing)
multidimensional aggregates
• OLAP (online analytical processing)
• Multidimensional concept description: Characterization and
discrimination
• Generalize, summarize, and contrast data characteristics,
e.g., dry vs. wet region
15
Data Analysis Function: Association and Correlation
Analysis
• Frequent patterns (or frequent itemsets)
• What items are frequently purchased together in your
Walmart?
• Association, correlation vs. causality
• A typical association rule
• Computer à Software [1%, 50%] (support, confidence)

• How to mine such patterns and rules efficiently in large


datasets?

16
Data Analysis Function: Classification

• Classification and label prediction


• Construct models (functions) based on some training examples
• Describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on
(gas mileage)
• Predict some unknown class labels
• Typical methods
• Decision trees, naïve Bayesian classification, support vector machines,
neural networks, rule-based classification, pattern-based classification,
logistic regression, …
• Typical applications:
• Credit card fraud detection, direct marketing, classifying stars, diseases,
web-pages, …
17
Classification Example
• Given existing data about customers and payments,
predict new applicant’s loan eligibility.

18
Data Analysis Function: Cluster Analysis

• Unsupervised learning (i.e., Class label is unknown)


• Group data to form new categories (i.e., clusters), e.g., cluster
houses to find distribution patterns
• Principle: Maximizing intra-class similarity & minimizing interclass
similarity
• Many methods and applications

19
Data Analysis Function: Outlier Analysis

• Outlier analysis
• Outlier: A data object that does not comply with the general behavior of
the data
• Noise or exception? ― One person’s garbage could be another person’s
treasure
• Methods: statistical tests, distance measures, density-based methods
• Useful in fraud detection, rare events analysis

20
Time and Ordering: Sequential Pattern, Trend and Evolution
Analysis

• Sequence, trend and evolution analysis


• Trend, time-series, and deviation analysis: e.g., regression and
value prediction
• Sequential pattern mining
• e.g., first buy digital camera, then buy large SD memory cards
• Periodicity analysis
• Biological sequence analysis
• Approximate and consecutive motifs
• Similarity-based analysis
• Mining data streams
• Ordered, time-varying, potentially infinite, data streams
• Conversational Analysis

21
Structure and Network Analysis

• Graph mining
• Finding frequent subgraphs (e.g., chemical compounds), substructures
(web fragments)
• Information network analysis
• Social networks: actors (objects, nodes) and relationships (edges)
• e.g., author networks in CS, terrorist networks
• Multiple networks
• A person could be multiple information networks: friends, family,
classmates, …
• Links carry a lot of semantic information: Link mining
• Web mining
• Web is a big information network: from PageRank to Google
• Analysis of Web information networks
• Web community discovery, opinion mining, usage mining, …

22
Evaluation of Knowledge
• Are all mined knowledge interesting?
• One can mine tremendous amount of “patterns” and knowledge
• Some may fit only certain dimension space (time, location, …)
• Some may not be representative, may be transient, …

• Evaluation of mined knowledge


• Descriptive vs. predictive
• Typicality vs. novelty
• Accuracy
• Timeliness
• …

24
Application Areas
• Industry • Application
Finance Credit Card Analysis
Insurance Claims, Fraud Analysis
Telecommunication Call record analysis
Transport promotion analysis
Consumer goods Value added data
Data Service providers Power usage analysis
… …

25
Data Analysis works with
Warehouse Data
• Data Warehousing
provides the Enterprise
with a memory

• Data Analysis provides


the Enterprise with
intelligence

26
Major Challenges in Data Analysis (1)

• Mining Methodology
• Mining various and new kinds of knowledge
• Data Analysis: An interdisciplinary effort
• Boosting the power of discovery in a networked environment
• Handling noise, uncertainty, and incompleteness of data
• Pattern evaluation and pattern- or constraint-guided mining
• User Interaction
• Interactive analysis
• Incorporation of background knowledge
• Presentation and visualization of data analysis results

27
Major Challenges in Data Analysis (2)

• Efficiency and Scalability


• Efficiency and scalability of data analysis algorithms
• Parallel, distributed, stream, and incremental analysis methods
• Diversity of data types
• Handling complex types of data
• Analyzing dynamic, networked, and global data repositories
• Data analysis and society
• Social impacts of data analysis
• Privacy-preserving data analysis

28
Getting to Know Your Data

• Data Objects and Attribute Types

• Basic Statistical Descriptions of Data

• Measuring Data Similarity and Dissimilarity

• Summary

29
Data Objects

• Data sets are made up of data objects.


• A data object represents an entity.
• Examples:
• sales database: customers, store items, sales
• medical database: patients, treatments
• university database: students, professors, courses
• Also called samples , examples, instances, data points, objects,
tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns ->attributes.

30
Attributes
• Attribute (or dimensions, features, variables): a data
field, representing a characteristic or feature of a data
object.
• E.g., customer _ID, name, address
• Types:
• Nominal
• Binary
• Ordinal
• Numeric: quantitative
• Interval-scaled
• Ratio-scaled

31
Attribute Types
• Nominal: categories, states, or “names of things”
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV
positive)
• Ordinal
• Values have a meaningful order (ranking) but magnitude between
successive values is not known.
• Size = {small, medium, large}, grades, army rankings

32
• Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
• Measured on a scale of equal-sized units
• Values have order
• E.g., temperature in C˚, calendar dates
• No true zero-point
• Ratio
• A ratio-scaled attribute is a numeric attribute with an inherent zero-
point.
• We can speak of values as being an order of magnitude larger than
the unit of measurement.
• e.g., $10 is twice as high as $5

33
Discrete vs. Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of
documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete
attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented
using a finite number of digits
• Continuous attributes are typically represented as floating-
point variables
34
Getting to Know Your Data

• Data Objects and Attribute Types

• Basic Statistical Descriptions of Data

• Measuring Data Similarity and Dissimilarity

• Summary

35
Measuring the Central Tendency
1 n
• Mean (algebraic measure) (sample vs. population): x = å xi (1)
Note: n is sample size and N is population size. n i =1
n
Weighted arithmetic mean:
åw x

i i
• Trimmed mean: chopping extreme values
x= i =1
n
(2)
å wi
• Median: i =1

• Middle value if odd number of values, or average of the middle two


values otherwise
• Estimated by interpolation (for grouped data):

(3)
n / 2 - (å freq)l
Mode
median = L1 + ( ) width

freqmedian
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal

37
Symmetric vs.
Skewed Data
• Median, mean and mode of symmetric
symmetric, positively and
negatively skewed data

positively skewed negatively skewed

38
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and
plot outliers individually
• Outlier: usually, a value higher/lower than 1.5 x IQR

• Variance and standard deviation (sample: s, population: σ)


• Variance: (algebraic, scalable computation)

1 n 1 n 2 1 n 1 n 1 n
s = 2
å
n - 1 i =1
( xi - x ) =
2
[å xi - (å xi ) 2 ]
n - 1 i =1 n i =1
s = å ( xi - µ ) 2 =
2

N i =1 N
å xi - µ 2
i =1
2

• Standard deviation s (or σ) is the square root of variance s2 (or σ2)

39
Boxplot Analysis
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to
Minimum and Maximum
• Outliers: points beyond a specified outlier
threshold, plotted individually

40
Histogram Analysis

40
• Histogram: Graph display of tabulated
35
frequencies, shown as bars
30
• It shows what proportion of cases fall
into each of several intervals 25
20
15
10
5
0
10000 30000 50000 70000 90000

41
Histograms Often Tell More than Boxplots

n The two histograms


shown in the left may
have the same boxplot
representation
n The same values
for: min, Q1,
median, Q3, max
n But they have rather
different data
distributions

42
Getting to Know Your Data

• Data Objects and Attribute Types

• Basic Statistical Descriptions of Data

• Measuring Data Similarity and Dissimilarity

• Summary

43
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity

44
Data Matrix and Dissimilarity Matrix
• Data matrix
• n data points with p é x11 ... x1f ... x1p ù
dimensions ê ú
ê ... ... ... ... ... ú
êx ... xif ... xip ú
ê i1 ú
ê ... ... ... ... ... ú
êx ... xnf ... xnp úú
• Dissimilarity matrix êë n1 û
• n data points, but
registers only the é 0 ù
distance ê d(2,1) 0 ú
ê ú
• A triangular matrix ê d(3,1) d ( 3,2) 0 ú
ê ú
ê : : : ú
êëd ( n,1) d ( n,2) ... ... 0úû

45
Proximity Measure for Nominal Attributes

• Method 1: Simple matching


• m: # of matches, p: total # of variables

d (i, j) = p -
p
m

• Method 2: Binary Encoding


• Creating a new binary attribute for each of the M nominal
states

46
Proximity Measure for Binary Attributes
A contingency table for binary data
Object j

Object i

• Distance measure for symmetric binary variables:

• Distance measure for asymmetric binary variables:

• Jaccard coefficient (similarity measure for asymmetric binary


variables): 47
Distance on Numeric Data: Minkowski Distance

• Minkowski distance: A popular distance measure

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-
dimensional data objects, and h is the order (the distance
so defined is also called L-h norm)
• Properties
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• d(i, j) = d(j, i) (Symmetry)
• d(i, j) £ d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric
49
Special Cases of Minkowski Distance

• h = 1: Manhattan (city block, L1 norm) distance


• E.g., the Hamming distance: the number of bits that are different
between two binary vectors

d (i, j) =| x - x | + | x - x | +...+ | x - x |
i1 j1 i2 j 2 ip jp
• h = 2: (L2 norm) Euclidean distance
d (i, j) = (| x - x |2 + | x - x |2 +...+ | x - x |2 )
i1 j1 i2 j 2 ip jp
• h ® ¥. “supremum” (Lmax norm, L¥ norm) distance.
• This is the maximum difference between any component
(attribute) of the vectors

50
Example: Minkowski Distance
Dissimilarity Matrices
point attribute 1 attribute 2 Manhattan (L1)
x1 1 2
L x1 x2 x3 x4
x2 3 5 x1 0
x3 2 0 x2 5 0
x4 4 5 x3 3 6 0
x4 6 1 7 0
Euclidean (L2)
x2 x4
L2 x1 x2 x3 x4
4 x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0

2 x1
Supremum
L¥ x1 x2 x3 x4
x1 0
x2 3 0
x3 x3 2 5 0
0 2 4 x4 3 1 5 0
Proximity for Ordinal Variables

• Order is important, e.g., rank


• Can be treated like numeric data
• replace xif by their rank rif Î{1,..., M f }
• map the range of each variable onto [0, 1] by replacing i-th
object in the f-th variable by
rif -1
zif =
M f -1

• compute the dissimilarity using methods for numeric


variables

Example: {small, median, large} -> rank {1, 2, 3} ->numerical {0, 0.5, 1}
52
Attributes of Mixed Type

• A database may contain all attribute types


• Nominal, symmetric binary, asymmetric binary, numeric, ordinal
• One may use a weighted formula to combine their effects

S pf = 1d ij( f ) dij( f )
d (i, j) =
S pf = 1d ij( f )
if xif = xjf =0 and f is asymmetric binary, otherwise
• f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
• f is numeric: use the normalized distance
• f is ordinal
• Compute ranks rif and
• Treat zif as numeric zif =
r -1
if

M -1 f
53
Textual Data: Cosine Similarity
• A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.

• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(A, B) = (A• B) /||A|| ||B|| =

where • indicates vector dot product, ||d||: the length of vector

54
Getting to Know Your Data

• Data Objects and Attribute Types

• Basic Statistical Descriptions of Data

• Measuring Data Similarity and Dissimilarity

• Summary

55
Summary
• Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
• Gain insight into the data by:
• Basic statistical data description: central tendency, dispersion, graphical
displays
• Data visualization: map data onto graphical primitives
• Measure data similarity
• Above steps are the beginning of data preprocessing.
• Many methods have been developed but still an active area of research.

56
References
• Part of the slides are based on
• the slides provided by Jiawei Han, Micheline Kamber,
and Jian Pei. © 2012 Han, Kamber & Pei.
• the slides provide by Richong Zhang @
http://act.buaa.edu.cn

57

You might also like