
Analysis and Management of Production System
Lesson 17: Data analysis

Prof. Giulia Bruno
Department of Management and Production Engineering
[email protected]
Dataset

A dataset is a collection of data objects and their attributes.
An attribute is a property or characteristic of an object.
A collection of attributes describes an object.

Rows are the objects; columns are the attributes:

ID  COLOR   AGE  WEIGHT  ON
1   Orange  24   70.24   Yes
2   Blue    12   56.45   Yes
3   Red     58   67.23   Yes
4   Orange  43   62.50   Yes
5   Orange  18   37.47   No
6   Blue    19   81.35   No
7   Green   62   44.45   Yes
8   Orange  33   23.34   No
9   Green   20   26.35   No
10  Red     47   57.89   Yes
11  Orange  39   52.98   No
12  Green   30   87.43   Yes
13  Blue    29   77.79   No
Types of data

Each type supports the operations of the previous ones plus one more:

NOMINAL   (distinctness: =, ≠)       e.g., ID numbers, eye color, gender
ORDINAL   (+ order: <, >)            e.g., rankings, hardness of minerals, grades
INTERVAL  (+ differences: +, −)      e.g., calendar dates, temperatures in Celsius
RATIO     (+ ratios: ∗, /)           e.g., temperature in Kelvin, length, counts
Record Data

● Data matrix
   same fixed set of numeric attributes
   data objects can be thought of as points in a multi-dimensional
    space, where each dimension represents a distinct attribute

  Projection of x Load  Projection of y load  Distance  Load  Thickness
  10.23                 5.27                  15.22     2.7   1.2
  12.65                 6.25                  16.22     2.2   1.1

● Document data
   each term is a component (attribute) of the vector
   the value of each component is the number of times the
    corresponding term occurs in the document

              team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1    3     0     5     0     2      6     0    2      0        2
  Document 2    0     7     0     2     1      0     0    3      0        0
  Document 3    0     1     0     0     1      2     2    0      3        0

● Transaction data
   each transaction involves a set of items (e.g., the set of products
    purchased by a customer during a shopping session)

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk
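As a concrete illustration of document data, here is a minimal sketch that builds a small document-term matrix. It assumes scikit-learn is installed; the three example sentences are invented.

```python
# Minimal sketch: building a document-term matrix (document data).
# Assumes scikit-learn; the example sentences are invented.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the team played the game and the coach called a timeout",
    "the coach lost the ball twice this season",
    "a win in the last game of the season",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # rows = documents, columns = terms
print(vectorizer.get_feature_names_out())
print(X.toarray())                   # each entry counts term occurrences in a document
```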
Graph Data

Examples: generic graphs, molecules, webpages

[Figures: a small graph with weighted edges; the benzene molecule C6H6]
Ordered Data

● Sequences of data (e.g., a genomic sequence):
  GGTTCCGCCTTCAGCCCCGCGCC
  CGCAGGGCCCGCCCCGCGCCGTC
  GAGAAGGGCCCGCCTGGCGGGCG
  GGGGGAGGCGGGGCCGCCCGAGC
  CCAACCGAGTCCGACCAGGTGCC
  CCCTCTGCTCGGCCTAGACCTGA
  GCTCATTAGGCGGCAGCGGACAG
  GCCAAGTAGAACACGCGAAGCGC
  TGGGCTGCCTGCTGCGACCAGGG

● Sequences of transactions

● Spatio-temporal data (e.g., the average monthly temperature of land and ocean)
Important characteristics of data

● Dimensionality (number of attributes)
   High-dimensional data brings a number of challenges
● Sparsity
   Only presence counts
● Resolution
   Patterns depend on the scale
● Size
   The type of analysis may depend on the size of the data
From data to knowledge (KDD)
Data preprocessing/cleaning

The representation and quality of the data must be addressed before
running any analysis.

If there is much irrelevant and redundant information, or noisy and
unreliable data, then knowledge discovery is more difficult.

EXAMPLES OF DATA QUALITY PROBLEMS
● Noise and outliers
● Missing values
● Duplicate data
● Wrong data
Noise and outliers

Outliers are data objects with characteristics that are considerably
different from those of most of the other data objects in the data set.
● Case 1: outliers are noise that interferes with data analysis
● Case 2: outliers are the goal of the analysis

The box plot is a useful graphical method for describing the behavior
of the data and identifying outliers.

Examples of applications: detection of credit card fraud (an anomalous
set of purchases), detection of unusual patterns in medical diagnosis.
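The box plot flags as outliers the points that fall outside its whiskers. Below is a minimal sketch of the same 1.5 × IQR rule, assuming NumPy; the sample values are invented.

```python
# Minimal sketch: flag outliers with the box-plot (1.5 * IQR) rule.
import numpy as np

values = np.array([70.24, 56.45, 67.23, 62.50, 37.47, 81.35, 44.45, 23.34, 250.0])
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(values[(values < lower) | (values > upper)])   # 250.0 lies far outside the whiskers
```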
Missing values

Reasons for missing values
● Information is not collected
● Attributes may not be applicable to all cases

Handling missing values
● Eliminate data objects or attributes
● Estimate missing values
● Ignore the missing values during analysis
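A minimal sketch of the three handling strategies above, assuming pandas; the small table is invented.

```python
# Minimal sketch: three ways of handling missing values with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [24, 12, np.nan, 43], "weight": [70.24, np.nan, 67.23, 62.50]})

dropped = df.dropna()              # eliminate data objects with missing values
estimated = df.fillna(df.mean())   # estimate missing values (here: the column mean)
print(df["age"].mean())            # many analyses simply ignore missing values
```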
Duplicate data

A data set may contain data objects that are duplicates, or almost
duplicates, of one another (for example, the same person or object
appearing with multiple email addresses).

This can happen when data from heterogeneous sources are merged.

When is a record considered a duplicate?

When the similarity between the two records is equal to 1.
Similarity
● Numerical measure of how alike two data objects are
● It is higher when objects are more alike
● Often falls in the range [0, 1]

It is possible to use different distance definitions to express the
similarity:

ATTRIBUTE TYPE      DISTANCE (DISSIMILARITY)
Nominal             d(x, y) = 0 if x = y, 1 if x ≠ y
Ordinal             d(x, y) = |x − y| / (n − 1)
                    (values mapped to integers 0 to n−1, where n is
                    the number of values)
Interval or Ratio   d(x, y) = |x − y| for a single attribute, or the
                    Euclidean distance d(x, y) = sqrt( Σ_k (x_k − y_k)² )
                    over attributes k = 1, …, n

Similarity: s = 1 − d
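A minimal sketch of the three dissimilarity definitions above, in plain Python:

```python
# Minimal sketch of the dissimilarities in the table above.
import math

def d_nominal(x, y):
    return 0 if x == y else 1

def d_ordinal(x, y, n):
    # x and y are already mapped to integers 0 .. n-1
    return abs(x - y) / (n - 1)

def d_euclidean(x, y):
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

print(d_nominal("Orange", "Blue"))               # 1
print(d_ordinal(0, 2, n=4))                      # 0.666... on a 4-value scale
print(d_euclidean([10.23, 5.27], [12.65, 6.25])) # distance between two records
# Similarity: s = 1 - d (after scaling d into [0, 1] where needed)
```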
Wrong data

The value of a certain attribute can depend on the values of some
others, so it is possible to check whether the dependency is satisfied.
Example: the Italian fiscal code, which depends on the personal data.
Using the date of birth (encoded in the code as the last two digits of
the birth year, a letter representing the month, and two digits for the
day), we can easily find out which of these three records are correct:

CSTGPP73A25I452U   25/01/1973
NDDPRI82E30A859Z   31/05/1982
CSTGPP00A01G732I   01/01/2000
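A minimal sketch of this consistency check. It assumes the standard month-letter table of the Italian fiscal code (A=01, B=02, C=03, D=04, E=05, H=06, L=07, M=08, P=09, R=10, S=11, T=12) and ignores the rule that female codes add 40 to the day digits.

```python
# Minimal sketch: check that a fiscal code agrees with the declared birth date.
# Assumes the standard month-letter table; female codes add 40 to the day,
# which this sketch ignores.
MONTHS = {"A": "01", "B": "02", "C": "03", "D": "04", "E": "05", "H": "06",
          "L": "07", "M": "08", "P": "09", "R": "10", "S": "11", "T": "12"}

def birth_date_matches(code, date_ddmmyyyy):
    day, month, year = date_ddmmyyyy.split("/")
    return (code[6:8] == year[2:] and          # last two digits of the birth year
            MONTHS.get(code[8]) == month and   # letter representing the month
            code[9:11] == day)                 # two digits for the day

print(birth_date_matches("CSTGPP73A25I452U", "25/01/1973"))  # True
print(birth_date_matches("NDDPRI82E30A859Z", "31/05/1982"))  # False: the code says day 30
print(birth_date_matches("CSTGPP00A01G732I", "01/01/2000"))  # True
```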
Data Transformation

Techniques for data transformation:
● Aggregation
● Sampling
● Dimensionality Reduction
   Feature selection
   Feature creation
   Mapping data to new space
● Discretization and Binarization
● Attribute Transformation
Aggregation

● Combining two or more attributes (or objects) into a single
  attribute (or object)

● Purpose
   Data reduction
    • Reduce the number of attributes or objects
   Change of scale
    • Cities aggregated into regions, states, countries, etc.
    • Days aggregated into weeks, months, or years
   More “stable” data
    • Aggregated data tends to have less variability
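A minimal sketch of aggregation as a change of scale (days into months), assuming pandas and NumPy; the daily values are invented.

```python
# Minimal sketch: aggregating daily data into months.
import numpy as np
import pandas as pd

days = pd.date_range("2024-01-01", periods=90, freq="D")
daily = pd.Series(np.random.default_rng(0).normal(100, 10, size=90), index=days)

monthly = daily.resample("MS").mean()   # days aggregated into months ("MS" = month start)
print(monthly)
print(daily.std(), monthly.std())       # the aggregated series has less variability
```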
Sampling

● Main technique used for data reduction
   considering the entire set of data of interest can be too
    expensive or time consuming

● Key principle: using a sample will work almost as well as using the
  entire data set, if the sample is representative
   a sample is representative if it has approximately the same
    properties as the original set of data

[Figure: the same point set drawn with 8000, 2000, and 500 points]
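A minimal sketch of simple random sampling, assuming pandas and NumPy; the 8000 points are generated at random.

```python
# Minimal sketch: drawing a random sample for data reduction.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=8000), "y": rng.normal(size=8000)})

sample = df.sample(n=500, random_state=0)   # representative if drawn at random
print(df["x"].mean(), sample["x"].mean())   # sample statistics approximate the full data
```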


Dimensionality Reduction

Reduction of the dimensionality of the dataset, i.e., of the number of
attributes:
 reduce the amount of time and memory required by algorithms
 allow data to be more easily visualized
 may help to eliminate irrelevant features or reduce noise

● Techniques
   Feature selection: remove redundant features (e.g., the purchase
    price of a product carries the same information as the amount of
    sales tax paid) or irrelevant features (e.g., students' IDs are
    irrelevant to the task of predicting students' marks)
   Feature creation: create new attributes that can capture the
    important information in a data set much more efficiently than the
    original attributes
    • Feature extraction (e.g., extracting edges from images)
    • Feature construction (e.g., dividing mass by volume to get density)
    • Mapping data to new space (e.g., Fourier and wavelet analysis)
Feature selection vs. Feature extraction

FEATURE SELECTION (keep a subset of the original attributes):
(X_1, …, X_p) → (X_k1, …, X_km)

FEATURE EXTRACTION (build new attributes as functions of the original ones):
(X_1, …, X_p) → (Z_1, …, Z_m),
where (Z_1, …, Z_m) = (f_1(X_1, …, X_p), …, f_m(X_1, …, X_p))
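A minimal sketch contrasting the two mappings, assuming scikit-learn; VarianceThreshold stands in for feature selection and PCA for feature extraction, and the data is invented.

```python
# Minimal sketch: feature selection (keep columns) vs. feature extraction
# (build new features as functions of the original ones).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # p = 5 original features
X[:, 4] = 0.0                   # a constant, hence irrelevant, feature

X_sel = VarianceThreshold(threshold=1e-6).fit_transform(X)   # drops the constant column
X_ext = PCA(n_components=2).fit_transform(X)                 # new features Z_1, Z_2

print(X_sel.shape, X_ext.shape)   # (100, 4) (100, 2)
```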
Mapping data to a new space

[Figure: the same signal shown in the time domain and, after a Fourier
transform, in the frequency domain]
Discretization

● Process of converting a continuous attribute into an ordinal
  attribute
   a potentially infinite number of values is mapped into a small
    number of categories
   discretization is commonly used in classification
   many classification algorithms work best if both the independent
    and dependent variables have only a few values
Binarization

● Binarization maps a continuous or categorical attribute into one or
  more binary variables

● Typically used for association analysis

● A common approach is to convert a continuous attribute to a
  categorical attribute first, and then convert the categorical
  attribute to a set of binary attributes
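A minimal sketch of both steps, assuming pandas; the ages are invented.

```python
# Minimal sketch: discretization with pandas.cut, then binarization via
# one-hot encoding with pandas.get_dummies.
import pandas as pd

ages = pd.Series([24, 12, 58, 43, 18, 19, 62, 33])
# Discretization: continuous attribute -> a few ordered categories
age_bands = pd.cut(ages, bins=[0, 18, 40, 100], labels=["young", "adult", "senior"])
# Binarization: categorical attribute -> a set of binary attributes
print(pd.get_dummies(age_bands))
```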
Attribute Transformation

● An attribute transformation is a function that maps the entire set
  of values of a given attribute to a new set of replacement values,
  such that each old value can be identified with one of the new
  values
   Simple functions: x^k, log(x), e^x, |x|
   Normalization
    • Refers to various techniques to adjust for differences among
      attributes in terms of frequency of occurrence, mean, variance,
      range
    • Takes out unwanted, common signal, e.g., seasonality
   In statistics, standardization refers to subtracting off the mean
    and dividing by the standard deviation
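A minimal sketch of standardization, assuming NumPy; the values are invented.

```python
# Minimal sketch: standardization (subtract the mean, divide by the
# standard deviation).
import numpy as np

x = np.array([70.24, 56.45, 67.23, 62.50, 37.47, 81.35])
z = (x - x.mean()) / x.std()
print(z.mean().round(6), z.std())   # approximately 0 and exactly 1
```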
Data mining
Predictive modeling – supervised learning

● Regression: predict the value of a given continuous-valued
  variable based on the values of other variables, assuming a linear
  or nonlinear model of dependency

● Classification: find a model for the class attribute as a function
  of the values of the other attributes
Clustering – unsupervised learning

● Finding groups of objects such that the objects in a group are
  similar (or related) to one another and different from (or unrelated
  to) the objects in other groups
   intra-cluster distances are minimized
   inter-cluster distances are maximized
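A minimal sketch of clustering with k-means, assuming scikit-learn; the points are generated around three invented centers.

```python
# Minimal sketch: k-means finds groups with small intra-cluster distances.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))   # roughly 50 points per cluster
```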
Association Rule

● Given a set of records, each of which contains some number of items
  from a given collection
   produce dependency rules that predict the occurrence of an item
    based on the occurrences of other items

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
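A minimal sketch, in plain Python, of how support and confidence are computed for the rule {Diaper, Milk} --> {Beer} on the transactions above:

```python
# Minimal sketch: support and confidence of {Diaper, Milk} --> {Beer}.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]
antecedent, consequent = {"Diaper", "Milk"}, {"Beer"}

n_ante = sum(antecedent <= t for t in transactions)                 # 3
n_both = sum((antecedent | consequent) <= t for t in transactions)  # 2

print("support =", n_both / len(transactions))   # 2/5 = 0.4
print("confidence =", n_both / n_ante)           # 2/3 = 0.67
```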
Deviation/Anomaly/Change Detection

• Detect significant deviations from normal behavior

• Applications:
  • Credit card fraud detection
  • Network intrusion detection
  • Identifying anomalous behavior from sensor networks for
    monitoring and surveillance
  • Detecting changes in the global forest cover
Data mining vs. Machine learning

In the 1960s, statisticians and economists used terms like data fishing or data
dredging to refer to what they considered the bad practice of analyzing data
without an a-priori hypothesis. The term "data mining" was used in a similarly
critical way by economist Michael Lovell in an article published in The Review
of Economics and Statistics in 1983.
Lovell, Michael C., Data Mining (1983). The Review of Economics and Statistics.

Data mining, also called knowledge discovery in databases, is, in
computer science, the process of discovering interesting and useful
patterns and relationships in large volumes of data. The field
combines tools from statistics and artificial intelligence (such as
neural networks and machine learning) with database management to
analyze large digital collections, known as data sets.

Christopher Clifton, Data mining (2019). Encyclopædia Britannica, inc.


Data mining vs. Machine learning

Machine learning algorithms build a mathematical model based on
sample data, known as "training data", in order to make predictions or
decisions without being explicitly programmed to perform the task.
Bishop, C. M., Pattern Recognition and Machine Learning (2006). Springer.

The definition "without being explicitly programmed" is often
attributed to Arthur Samuel, who coined the term "machine learning"
in 1959.
Data mining vs. Machine learning

1. Meaning
● Data mining: extracting knowledge from a large amount of data
● Machine learning: extracting new algorithms from data as well as
  from experience

2. History
● Data mining: introduced around 1930, initially referred to as
  knowledge discovery in databases
● Machine learning: introduced around 1950; the first program was
  Samuel's checker-playing program

3. Responsibility
● Data mining is used to extract the rules from the existing data
● Machine learning teaches the computer to learn and understand the
  given rules

4. Origin
● Data mining: traditional databases with unstructured data
● Machine learning: existing data as well as algorithms
Data mining vs. Machine learning

5. Implementation
● Data mining: we can develop our own models and apply data mining
  techniques within them
● Machine learning: we can use machine learning algorithms in decision
  trees, neural networks, and other areas of artificial intelligence

6. Nature
● Data mining involves more human interference and leans towards
  manual work
● Machine learning is automated: once designed, it is self-implemented
  and requires no human effort

7. Techniques involved
● Data mining is more of a research activity using methods like
  machine learning
● Machine learning is self-learned and trains the system to perform
  the intelligent task

8. Scope
● Data mining is applied in a limited area
● Machine learning can be used in a vast area
Machine learning
Focus on supervised learning

● Regression: predict the value of a given continuous-valued
  variable based on the values of other variables, assuming a linear
  or nonlinear model of dependency

● Classification: find a model for the class attribute as a function
  of the values of the other attributes
Regression

Given a few predictors X = (X_1, …, X_|X|) referring to the values of
the variables under analysis (|X| is the number of features), the goal
is to choose a model f for the relation Y = f(X) + ε and use it to
estimate unknown values Ŷ = f(X).
Y could be a vector Y = (Y_1, …, Y_|Y|).
Types of regressions

LINEAR REGRESSION
Y = β_0 + β_1·X_1 + … + β_|X|·X_|X| + ε

REGRESSION WITH INTERACTIONS
Y = β_0 + β_1·X_1 + β_2·X_2 + β_12·X_1·X_2 + …

POLYNOMIAL REGRESSION
Y = β_0 + β_1·X_1 + β_12·X_1² + … + β_2·X_2 + β_22·X_2² + …
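A minimal sketch of linear and polynomial regression, assuming scikit-learn; the data is generated from an invented quadratic relationship.

```python
# Minimal sketch: fitting a linear and a polynomial regression model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 + 1.5 * X[:, 0] + 0.3 * X[:, 0] ** 2 + rng.normal(scale=1.0, size=100)

linear = LinearRegression().fit(X, y)                   # Y = b0 + b1*X1 + eps
X_poly = PolynomialFeatures(degree=2).fit_transform(X)  # adds the X1^2 term
poly = LinearRegression().fit(X_poly, y)

print(linear.score(X, y), poly.score(X_poly, y))        # R^2: the polynomial fits better
```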
Performance indicators for regression

MAE (Mean Absolute Error): the average absolute distance between each
real value and the predicted one.

MSE (Mean Squared Error): the average squared difference between the
estimated values and the actual values.

The lower the values of MAE and MSE, the better the model.

R² (R squared): the proportion of the variance in the dependent
variable that is predictable from the independent variables.
R² close to 1 means most predicted values are close to the original ones.
R² negative or close to 0 means poor predictions.
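A minimal sketch computing the three indicators, assuming scikit-learn; the values are invented.

```python
# Minimal sketch: MAE, MSE and R^2 for a handful of predictions.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.8, 5.4, 2.9, 6.6]

print("MAE =", mean_absolute_error(y_true, y_pred))
print("MSE =", mean_squared_error(y_true, y_pred))
print("R^2 =", r2_score(y_true, y_pred))
```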
Classification

• Given a collection of records
• Each record is characterized by a tuple (x, y), where x is the
  attribute set and y is the class label
   x: attribute, predictor, independent variable, input
   y: class, response, dependent variable, output

• Task:
   Learn a model that maps each attribute set x into one of the
    predefined class labels y
Examples of Classification Task

Task                      Attribute set, x                  Class label, y
Categorizing email        Features extracted from email     spam or non-spam
messages                  message header and content
Identifying tumor cells   Features extracted from x-rays    malignant or benign cells
                          or MRI scans
Cataloging galaxies       Features extracted from           elliptical, spiral, or
                          telescope images                  irregular-shaped galaxies
Regression vs. Classification

REGRESSION
● The predictive model produces as output a numerical estimate
● For example: the model analyzes some characteristics of houses and
  gives an estimate of their market price

CLASSIFICATION
● The predictive model produces as output a class or a category
● For example: the model analyzes some characteristics of flowers and
  labels them with the species
Approach for Building Classification Model

Training Set (Induction: a learning algorithm learns the model):

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set (Deduction: the learned model is applied to predict the
missing class labels):

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Example of a Decision Tree

Training data:

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Model (a decision tree; the splitting attributes are Home Owner,
Marital Status, and Income):

Home Owner?
├─ Yes → NO
└─ No → Marital Status?
        ├─ Married → NO
        └─ Single, Divorced → Income?
                              ├─ < 80K → NO
                              └─ > 80K → YES
Another Example of Decision Tree

Using the same training data:

Marital Status?
├─ Married → NO
└─ Single, Divorced → Home Owner?
                      ├─ Yes → NO
                      └─ No → Income?
                              ├─ < 80K → NO
                              └─ > 80K → YES

There could be more than one tree that fits the same data!
Apply Model to Test Data

Start from the root of the tree and, at each node, follow the branch
that matches the test record.

Test data:

Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Home Owner?
├─ Yes → NO
└─ No → Marital Status?
        ├─ Married → NO
        └─ Single, Divorced → Income?
                              ├─ < 80K → NO
                              └─ > 80K → YES

Home Owner = No, so follow the No branch; Marital Status = Married, so
follow the Married branch and reach a leaf: assign Defaulted = “No”.
Decision Tree Classification Task

The same induction/deduction framework, specialized to decision trees:
a tree induction algorithm learns a decision tree from the training
set (Tid 1–10 above), and the tree is then applied to the test set
(Tid 11–15) to deduce the missing class labels.
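A minimal sketch of this task on the borrower table from the earlier slides, assuming scikit-learn and pandas; the categorical attributes are encoded as 0/1 for simplicity.

```python
# Minimal sketch: inducing a decision tree from the borrower training set
# and applying it to a new record.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "HomeOwner": [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],   # Yes = 1, No = 0
    "Married":   [0, 1, 0, 1, 0, 1, 0, 0, 1, 0],   # Single/Divorced = 0
    "Income":    [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Defaulted": ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
X, y = data[["HomeOwner", "Married", "Income"]], data["Defaulted"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))

test = pd.DataFrame([[0, 1, 80]], columns=X.columns)   # No / Married / 80K
print(tree.predict(test))   # expected to match the walkthrough: 'No'
```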
Decision tree

• Two categories
 trees used for regression problems
 trees used for classification problems
• This means that this algorithm can be used both when the
dependent variable is continuous and when it is categorical
• Many Algorithms
 Hunt’s Algorithm (one of the earliest)
 CART
 ID3, C4.5
 SLIQ, SPRINT
Decision Tree Based Classification

Advantages:
 Inexpensive to construct
 Extremely fast at classifying unknown records
 Easy to interpret for small-sized trees
 Robust to noise
 Can easily handle redundant or irrelevant attributes (unless the
attributes are interacting)
Disadvantages:
 Space of possible decision trees is exponentially large. Greedy
approaches are often unable to find the best tree.
 Does not take into account interactions between attributes
 Each decision boundary involves only a single attribute
Other classification techniques

• Logistic regression
  Uses a logistic function to model a binary dependent variable
• Random forest
  Ensemble method that constructs a multitude of decision trees at
  training time and outputs the class that is the mode of the classes
  of the individual trees
• Support vector machine (SVM)
  Finds a hyperplane in an N-dimensional space that separates the data
  belonging to the different classes
• Nearest neighbors (K-NN)
  Uses the class labels of the K nearest neighbors to determine the
  class label of an unknown record (e.g., by taking a majority vote)
• Neural networks
Performance indicators for classification

ACCURACY = correct values / all values = (TP + TN) / (TP + FP + FN + TN)

SENSITIVITY = correct positives / real positives = TP / (TP + FN)

PRECISION = correct positives / predicted positives = TP / (TP + FP)
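A minimal sketch computing the three indicators from the four confusion-matrix counts; the counts are invented.

```python
# Minimal sketch: accuracy, sensitivity and precision from TP, FP, FN, TN.
TP, FP, FN, TN = 40, 10, 5, 45

accuracy = (TP + TN) / (TP + FP + FN + TN)
sensitivity = TP / (TP + FN)
precision = TP / (TP + FP)
print(accuracy, sensitivity, precision)   # 0.85 0.888... 0.8
```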
Prediction evaluation

[Figure: a train-test split divides the data into a training set and a
test set]

A training dataset is a dataset of examples used for learning, that
is, to fit the parameters of the model.

Most approaches that search through training data for empirical
relationships tend to overfit the data, meaning that they can identify
apparent rules in the training data that do not hold in general.
Prediction evaluation

A test dataset is independent of the training dataset, but follows the
same probability distribution as the training dataset.

In theory, a model fit to the training dataset should also fit the
test dataset well; otherwise, it is a case of overfitting.

A test set is a set used only to assess the performance of a model.
Prediction evaluation

The validation dataset is used to tune the hyperparameters (i.e., the
architecture parameters) of a model. It is also used to avoid
overfitting.

In this sense, the training dataset is used to train the candidate
algorithms, the validation dataset is used to compare their
performances and decide which one to take, and, finally, the test
dataset is used to obtain the performance of the chosen model.

[Figure: the train-test split separates out the test set; the training
portion is further split into a training set and a validation set V]

Cross validation

[Figure: after the train-test split, the training portion is divided
into folds; in each round a different fold serves as the validation
set V while the remaining folds are used for training]
Model underfitting and overfitting

● Underfitting occurs
when a model can’t
capture the
dependencies among
data, usually as a
consequence of its own
simplicity
● Overfitting happens
when a model learns both
dependencies among
data and random
fluctuations (i.e., learns
the existing data too well)
 Complex models, which
have many features or
terms, are often prone to
overfitting
Which tree is better?

[Figures: decision boundaries on the training data for a decision tree
with 4 nodes and for a decision tree with 50 nodes]

● Training error does not provide a good estimate of how well the
  tree will perform on previously unseen records
● We need ways of estimating generalization errors
Model Overfitting

• As the model becomes more and more complex, the test error can
  start increasing even though the training error may be decreasing

Underfitting: when the model is too simple, both training and test
errors are large.
Overfitting: when the model is too complex, the training error is
small but the test error is large.
Model Overfitting

[Figure: training and test error curves when using twice the number of
data instances]

• Increasing the size of the training data reduces the difference
  between training and testing errors at a given model size