Lesson 17: Data Analysis of Production Systems
[email protected]
Dataset

A dataset is a collection of data objects and their attributes. An attribute is a property or characteristic of an object; a collection of attributes describes an object. Each row below is an object, each column an attribute.

ID  COLOR   AGE  WEIGHT  ON
1   Orange  24   70.24   Yes
2   Blue    12   56.45   Yes
3   Red     58   67.23   Yes
4   Orange  43   62.50   Yes
5   Orange  18   37.47   No
6   Blue    19   81.35   No
7   Green   62   44.45   Yes
8   Orange  33   23.34   No
9   Green   20   26.35   No
10  Red     47   57.89   Yes
11  Orange  39   52.98   No
12  Green   30   87.43   Yes
13  Blue    29   77.79   No
Types of data

Each type supports the properties of the previous ones plus a new one:

● NOMINAL (distinctness: =, ≠): ID numbers, eye color, gender
● ORDINAL (adds order: <, >): rankings, hardness of minerals, grades
● INTERVAL (adds differences: +, −): calendar dates, temperatures in Celsius, ...
● RATIO (adds ratios: ∗, /): temperature in Kelvin, length, counts, ...
Record Data

● Data matrix: points in a multi-dimensional space, where each dimension represents a distinct attribute
● Document data: each document becomes a term vector; each term (e.g., team, coach, play, ball, score, game, win, lost, timeout, season) is a component of the vector, and its value is the number of times the term appears in the document (e.g., Document 1 → 3 0 5 0 2 6 0 2 0 2)
● Sequence data, e.g., genomic sequence data:

  GGTTCCGCCTTCAGCCCCGCGCC
  CGCAGGGCCCGCCCCGCGCCGTC
  GAGAAGGGCCCGCCTGGCGGGCG
  GGGGGAGGCGGGGCCGCCCGAGC
  CCAACCGAGTCCGACCAGGTGCC
  CCCTCTGCTCGGCCTAGACCTGA
  GCTCATTAGGCGGCAGCGGACAG
  GCCAAGTAGAACACGCGAAGCGC
  TGGGCTGCCTGCTGCGACCAGGG
● Sequences of transactions
● Spatio-temporal data, e.g., the average monthly temperature of land and ocean
Important characteristics of data

● Sparsity: only presence counts
● Resolution: patterns depend on the scale
● Size: the type of analysis may depend on the size of the data
From data to knowledge (KDD)

Data preprocessing/cleaning

● Similarity can be obtained from a normalized distance: s = 1 − d
● Wrong data, e.g., records where the Italian fiscal code and the declared birth date disagree:
  CSTGPP73A25I452U  25/01/1973
  NDDPRI82E30A859Z  31/05/1982
  CSTGPP00A01G732I  01/01/2000
Data Transformation

● Purpose of aggregation
  Data reduction
  • Reduce the number of attributes or objects
  Change of scale
  • Cities aggregated into regions, states, countries, etc.
  • Days aggregated into weeks, months, or years
  More "stable" data
  • Aggregated data tends to have less variability
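The "more stable data" point can be checked numerically; a minimal sketch with hypothetical daily figures (plain Python, values invented for illustration):

```python
# Aggregating days into weeks reduces relative variability.
from statistics import mean, stdev

daily = [12, 15, 9, 20, 14, 11, 17,   # week 1 (hypothetical counts)
         13, 16, 8, 21, 15, 10, 18]   # week 2

# Change of scale: days aggregated into weekly totals
weekly = [sum(daily[i:i + 7]) for i in range(0, len(daily), 7)]

# Coefficient of variation (stdev / mean) drops after aggregation
cv_daily = stdev(daily) / mean(daily)
cv_weekly = stdev(weekly) / mean(weekly)
assert cv_weekly < cv_daily
```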
Sampling

Selecting a representative subset of the data objects, often used because processing the entire data set is too expensive or time-consuming.

● Techniques
  Feature selection: remove redundant features (e.g., the purchase price of a product carries the same information as the amount of sales tax paid) or irrelevant features (e.g., students' IDs are irrelevant to the task of predicting students' marks)
  Feature creation: create new attributes that can capture the important information in a data set much more efficiently than the original attributes
  • Feature extraction (e.g., extracting edges from images)
  • Feature construction (e.g., dividing mass by volume to get density)
  • Mapping data to a new space (e.g., Fourier and wavelet analysis)
Feature selection vs. Feature extraction

FEATURE SELECTION
X_1, ..., X_p → X_{k_1}, ..., X_{k_m}   (keep a subset of the original features)

FEATURE EXTRACTION
X_1, ..., X_p → Z_1, ..., Z_m
where Z_j = f_j(X_1, ..., X_p) for j = 1, ..., m   (each new feature is a function of all the original ones)
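The difference between the two mappings can be sketched in a few lines of NumPy (illustrative random data; PCA via SVD is used here only as one possible choice of the functions f_j):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # n = 100 objects, p = 5 features

# Feature selection: keep a subset of the ORIGINAL columns (here k1=0, k2=3)
X_sel = X[:, [0, 3]]                   # X_k1, ..., X_km

# Feature extraction: each new feature Z_j is a FUNCTION of all inputs.
# Here f_j is the linear map given by PCA (top m = 2 principal directions).
Xc = X - X.mean(axis=0)                # center the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                      # Z_1, Z_2 = f_1(X), f_2(X)

assert X_sel.shape == (100, 2) and Z.shape == (100, 2)
```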
Mapping data to a new space

Clustering

Find groups of objects such that intra-cluster distances are minimized and inter-cluster distances are maximized.
Association Rule

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
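A minimal Python sketch, using the five transactions above, of how the support and confidence of such rules are computed:

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(A ∪ C) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

print(confidence({"Milk"}, {"Coke"}))            # 0.75
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 2/3
```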
Deviation/Anomaly/Change Detection

• Applications:
  • Credit card fraud detection
  • Network intrusion detection
  • Identifying anomalous behavior from sensor networks for monitoring and surveillance
  • Detecting changes in the global forest cover
Data mining vs. Machine learning
In the 1960s, statisticians and economists used terms like "data fishing" or "data dredging" to refer to what they considered the bad practice of analyzing data without an a priori hypothesis. The term "data mining" was used in a similarly critical way by economist Michael Lovell in an article published in The Review of Economics and Statistics in 1983.

Lovell, Michael C., "Data Mining" (1983). The Review of Economics and Statistics.
1. Meaning
   ● Data mining: extracting knowledge from a large amount of data
   ● Machine learning: extracting new algorithms from data as well as experience
2. History
   ● Data mining: introduced in the 1930s, initially referred to as knowledge discovery in databases
   ● Machine learning: introduced around 1950; the first program was Samuel's checker-playing program
3. Responsibility
   ● Data mining is used to get the rules from the existing data
   ● Machine learning teaches the computer to learn and understand the given rules
4. Origin
   ● Data mining: traditional databases with unstructured data
   ● Machine learning: existing data as well as algorithms
5. Implementation
   ● Data mining: we can develop our own models that use data mining techniques
   ● Machine learning: we can use machine learning algorithms in decision trees, neural networks, and other areas of artificial intelligence
6. Nature
   ● Data mining involves more human interference and is largely manual
   ● Machine learning is automated: once designed, it is self-implemented with no human effort
7. Techniques involved
   ● Data mining is more of a research activity using methods like machine learning
   ● Machine learning is self-learned and trains the system to do the intelligent task
8. Scope
   ● Data mining is applied in a limited area
   ● Machine learning can be used in a vast area
Machine learning

Focus on supervised learning

LINEAR REGRESSION
Y = β_0 + β_1 X_1 + ... + β_p X_p + ε

POLYNOMIAL REGRESSION
Y = β_0 + β_1 X_1 + β_11 X_1^2 + ... + β_2 X_2 + β_22 X_2^2 + ... + ε
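As a sketch of how the coefficients β of the linear model can be estimated, an ordinary least-squares fit on synthetic data (the data and the "true" coefficients are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.uniform(-1, 1, size=(n, 2))
eps = rng.normal(scale=0.1, size=n)              # noise term ε
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + eps    # true β = (3, 2, -1)

# Estimate β by least squares: prepend an intercept column of ones
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# With low noise, the estimates land close to the true coefficients
assert np.allclose(beta_hat, [3.0, 2.0, -1.0], atol=0.1)
```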
Performance indicators for regression

The lower the values of MAE and MSE, the better the model:

MAE = (1/n) Σ |y_i − ŷ_i|     (Mean Absolute Error)
MSE = (1/n) Σ (y_i − ŷ_i)^2   (Mean Squared Error)
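A minimal Python sketch of both indicators on hypothetical predictions:

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average of |y - y_hat|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average of (y - y_hat)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical observed values and model estimates
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 6.0]
print(mae(y_true, y_pred))  # 0.625
print(mse(y_true, y_pred))  # 0.5625
```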
• Task: learn a model that maps each attribute set x into one of the predefined class labels y

Examples of Classification Task

REGRESSION: the predictive model produces as output a numerical estimate
CLASSIFICATION: the predictive model produces as output a class or a category

[Figure: the model is learned from a Training Set whose class labels are known, then applied to a Test Set whose class labels are unknown]
Example of a Decision Tree

Training data:

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

One tree that fits these data (splitting attributes: Marital Status, Home Owner, Annual Income):

Marital Status?
  Married → NO
  Single, Divorced → Home Owner?
      Yes → NO
      No  → Annual Income?
          < 80K → NO
          ≥ 80K → YES

There could be more than one tree that fits the same data!
Apply Model to Test Data

Test data: Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?

Starting from the root, the record is routed down the tree:

Home Owner?
  Yes → NO
  No  → Marital Status?
      Married → NO
      Single, Divorced → Annual Income?
          < 80K → NO
          ≥ 80K → YES

Home Owner = No, then Marital Status = Married: the record reaches a leaf, and Defaulted Borrower is assigned to "No".
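The walk down the tree can be sketched as a plain function (a hand-coded rendering of the example tree, not a learned model):

```python
def defaulted(home_owner: bool, marital_status: str, annual_income: float) -> str:
    """Route a record down the tree: Home Owner -> Marital Status -> Income."""
    if home_owner:
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on annual income at 80K
    return "No" if annual_income < 80_000 else "Yes"

# The test record above: Home Owner = No, Married, Income = 80K
print(defaulted(False, "Married", 80_000))  # No
```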
Decision Tree Classification Task

[Figure: a decision tree is induced from the Training Set, then the tree model is applied to the Test Set to predict its class labels]
Decision tree
• Two categories
trees used for regression problems
trees used for classification problems
• This means that this algorithm can be used both when the
dependent variable is continuous and when it is categorical
• Many Algorithms
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
Decision Tree Based Classification
Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Robust to noise
Can easily handle redundant or irrelevant attributes (unless the
attributes are interacting)
Disadvantages:
Space of possible decision trees is exponentially large. Greedy
approaches are often unable to find the best tree.
Does not take into account interactions between attributes
Each decision boundary involves only a single attribute
Other classification techniques
• Logistic regression
Uses a logistic function to model a binary dependent variable
• Random forest
Ensemble method that constructs a multitude of decision trees
at training time and outputs the class that is the mode of the
classes of the individual trees
• Support vector machine (SVM)
Finds a hyperplane in an N-dimensional space that separates data belonging to the different classes
• Nearest-Neighbor (K-NN)
Uses the class labels of the K nearest neighbors to determine the class label of an unknown record (e.g., by taking a majority vote)
• Neural networks
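As an illustration of the K-NN idea described above, a minimal majority-vote sketch on invented 2-D data:

```python
from collections import Counter
from math import dist

# Hypothetical labeled records: ((x1, x2), class label)
train = [((1.0, 1.0), "No"), ((1.5, 2.0), "No"), ((3.0, 4.0), "Yes"),
         ((5.0, 7.0), "Yes"), ((3.5, 4.5), "Yes"), ((2.0, 1.5), "No")]

def knn_predict(x, k=3):
    """Label the unknown record x by majority vote of its k nearest neighbors."""
    neighbors = sorted(train, key=lambda rec: dist(rec[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict((3.0, 3.5)))  # Yes
```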
Performance indicators for classification

ACCURACY = correct values / all values = (TP + TN) / (TP + FP + FN + TN)

SENSITIVITY = correct positives / real positives = TP / (TP + FN)

PRECISION = correct positives / predicted positives = TP / (TP + FP)
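A minimal sketch computing the three indicators from hypothetical confusion-matrix counts:

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity (recall), and precision from TP/FP/FN/TN counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return accuracy, sensitivity, precision

# Invented counts: 40 true positives, 10 false positives,
# 5 false negatives, 45 true negatives
acc, sens, prec = metrics(tp=40, fp=10, fn=5, tn=45)
print(acc, prec)  # 0.85 0.8
```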
Prediction evaluation

● Train-test split: the data is divided into a training set, used to fit the model, and a test set, used to assess the performance of the model. The test set is assumed to follow the same probability distribution as the training dataset.

● Train-validation-test split: used when tuning a model (e.g., the architecture parameters). When comparing several algorithms, the validation dataset is used to compare their performances and decide which model to keep, while the test set provides the final assessment.

● Cross-validation: the data is split into several folds; each fold is used in turn as the test (or validation) set while the remaining folds are used for training, so every object is used for evaluation exactly once.
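A minimal sketch of how k-fold cross-validation partitions the object indices (index bookkeeping only, no model):

```python
def k_fold_splits(n, k):
    """Yield (train_idx, test_idx) pairs; each object is tested exactly once."""
    # Distribute n objects into k folds as evenly as possible
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

for train_idx, test_idx in k_fold_splits(n=10, k=5):
    print(test_idx)  # [0, 1] then [2, 3] then ... [8, 9]
```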
Model underfitting and overfitting
● Underfitting occurs when a model can't capture the dependencies among data, usually as a consequence of its own simplicity.

● Overfitting happens when a model learns both the dependencies among data and the random fluctuations (i.e., it learns the existing data too well). Complex models, which have many features or terms, are often prone to overfitting.
[Figure: decision trees of increasing complexity, starting from a tree with 4 nodes, fit to the same data]
• As the model becomes more and more complex, test errors can start increasing even though the training error may be decreasing:
  Underfitting: when the model is too simple, both training and test errors are large
  Overfitting: when the model is too complex, the training error is small but the test error is large