DA Unit I
Data Management
Design Data Architecture and manage the
Data for analysis
• Data architecture is composed of the models, policies, rules,
or standards that govern which data is collected and how it is
stored, arranged, integrated, and put to use in data systems
and in organizations.
• Data is usually one of several architecture domains that
form the pillars of an enterprise architecture or solution
architecture.
• Various constraints and influences will have an
effect on data architecture design. These
include
– Enterprise requirements
– Technology drivers
– Economics
– Business policies
– Data processing needs.
• Enterprise requirements
These will generally include such elements as economical
and effective system expansion, acceptable performance
levels (especially system access speed), transaction
reliability, and transparent data management.
In addition, converting raw data such as transaction records
and image files into more useful forms of information, for
example through a data warehouse, is also a common
organizational requirement, since this supports managerial
decision making and other organizational processes.
One of the architecture techniques is the split between
managing transaction data and (master) reference data.
Another one is splitting data capture systems from data
retrieval systems (as done in a data warehouse).
• Technology drivers
These are usually suggested by the completed data
architecture and database architecture designs.
In addition, some technology drivers will be
derived from existing organizational integration
frameworks and standards, organizational
economics, and existing site resources (e.g.
previously purchased software licensing).
• Economics
The only disadvantage of the above sources is that the data may be biased; the providers are
likely to gloss over their own negative points.
Syndicate Services- These services are provided by organizations that collect and tabulate
marketing information on a regular basis for a number of clients who subscribe to them. The
services are designed so that the information suits the subscriber, and they are useful in areas
such as television viewership and the movement of consumer goods. Syndicate services
provide data from both households and institutions.
In collecting data from households, they use three approaches:
• Survey- They conduct surveys regarding - lifestyle, sociographic, general topics.
• Mail Diary Panel- It may be related to 2 fields - Purchase and Media.
• Electronic Scanner Services- These are used to generate data on volume.
Various syndicate services are Operations Research Group (ORG) and The Indian
Marketing Research Bureau (IMRB).
• Duplicate data
– Example: same person with multiple email addresses
• Data cleaning
– Process of dealing with duplicate data issues
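As a rough illustration of this cleaning step, the sketch below uses Python with pandas; the column names, the sample records, and the rule for deciding which rows describe the same person are assumptions made purely for the example.

import pandas as pd

# Hypothetical contact records: "Asha Rao" appears twice with
# different email addresses.
records = pd.DataFrame({
    "name":  ["Asha Rao", "Asha Rao", "Ben Lee"],
    "email": ["asha@home.example", "asha@work.example", "ben@example.com"],
    "city":  ["Pune", "Pune", "Delhi"],
})

# Drop rows that are exact duplicates across every column.
records = records.drop_duplicates()

# Treat rows sharing the same name and city as the same person and
# keep the first email seen for each (an assumed resolution policy).
deduplicated = records.drop_duplicates(subset=["name", "city"], keep="first")
print(deduplicated)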
Aggregation
• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc.
• Days aggregated into weeks, months, or years
– More “stable” data
• Aggregated data tends to have less variability
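To make the change-of-scale and stability points concrete, here is a minimal pandas sketch on synthetic daily rainfall; the data, the three-year span, and the use of the ratio of standard deviation to mean are illustrative assumptions only. The Australia example that follows makes the same point with real data.

import numpy as np
import pandas as pd

# Synthetic daily rainfall for three years (all values are made up).
rng = np.random.default_rng(0)
days = pd.date_range("1991-01-01", "1993-12-31", freq="D")
daily = pd.Series(rng.gamma(shape=2.0, scale=3.0, size=len(days)), index=days)

# Change of scale: aggregate days into months and into years.
monthly = daily.groupby(daily.index.to_period("M")).sum()
yearly = daily.groupby(daily.index.to_period("Y")).sum()

# Aggregated data tends to be more "stable": relative to its mean,
# the coarser series shows less variability.
print("monthly std/mean:", monthly.std() / monthly.mean())
print("yearly  std/mean:", yearly.std() / yearly.mean())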
Example: Precipitation in Australia
• This example is based on precipitation in Australia from the
period 1982 to 1993.
The next slide shows
– A histogram for the standard deviation of average monthly
precipitation for 3,030 0.5° by 0.5° grid cells in Australia, and
– A histogram for the standard deviation of the average yearly
precipitation for the same locations.
• The average yearly precipitation has less variability than the
average monthly precipitation.
• All precipitation measurements (and their standard
deviations) are in centimeters.
Example: Precipitation in Australia …
[Figure: histograms of the standard deviation of average monthly and of average yearly precipitation; the average yearly precipitation shows less variability than the average monthly precipitation.]
• Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
There are two components of dimensionality reduction:
• Feature selection: In this, we try to find a subset of the
original set of variables, or features, to get a smaller
subset which can be used to model the problem. It usually
involves three ways:
– Filter
– Wrapper
– Embedded
• Feature extraction: This reduces the data in a high-
dimensional space to a lower-dimensional space, i.e. a space
with a smaller number of dimensions.
• Filter methods (illustrated in the sketch after these lists):
– information gain
– chi-square test
– fisher score
– correlation coefficient
– variance threshold
• Wrapper methods:
– recursive feature elimination
– sequential feature selection algorithms
– genetic algorithms
• Embedded methods:
– L1 (LASSO) regularization
– decision tree
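As a hedged scikit-learn sketch of the filter and wrapper categories above: the iris data, the choice of the chi-square test and recursive feature elimination, and keeping 2 features are all assumptions made for illustration.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)            # 4 features, 3 classes

# Filter method: rank features with the chi-square test and keep the
# best 2; no predictive model is involved in the ranking.
X_filtered = SelectKBest(chi2, k=2).fit_transform(X, y)
print("filter kept shape:", X_filtered.shape)

# Wrapper method: recursive feature elimination repeatedly fits a
# model and discards the weakest feature until 2 remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print("wrapper kept mask:", rfe.support_)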
Dimensionality Reduction: PCA
• PCA works on the condition that when data in a higher-dimensional space
is mapped to a lower-dimensional space, the variance of the data in
the lower-dimensional space should be maximized.
• Goal is to find a projection that captures the largest amount of variation
in data
It involves the following steps:
• Construct the covariance matrix of the data.
• Compute the eigenvectors of this matrix.
• Eigenvectors corresponding to the largest eigenvalues
are used to reconstruct a large fraction of the variance of
the original data.
Hence, we are left with a smaller number of eigenvectors,
and there may have been some loss of information in the process.
However, the most important variance should be retained by
the remaining eigenvectors. A minimal sketch of these steps
is shown below.
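The NumPy sketch below follows the three steps on synthetic data; the data, the choice of two components, and every variable name are assumptions for illustration, not code from the slides.

import numpy as np

# Synthetic data: 200 points in 3 correlated dimensions (made up).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.5, 0.2, 0.1]])

# Step 1: centre the data and construct its covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Step 2: compute eigenvalues and eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh since cov is symmetric

# Step 3: project onto the eigenvectors with the largest eigenvalues
# (here the top 2), retaining most of the original variance.
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]
X_reduced = Xc @ components

print("retained variance fraction:", eigvals[order[:2]].sum() / eigvals.sum())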
Feature Subset Selection
Another way to reduce the dimensionality of the data; a short
correlation-based sketch follows this list.
• Redundant features
– Duplicate much or all of the information contained in one or
more other attributes
– Example: purchase price of a product and the amount of
sales tax paid
• Irrelevant features
– Contain no information that is useful for the data mining
task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA
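The redundancy in the price / sales-tax example can be detected automatically. Below is a hedged pandas sketch with made-up column names, made-up values, and an assumed correlation threshold of 0.95.

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
price = rng.uniform(10, 100, size=500)

# Hypothetical product records: sales tax is (nearly) a fixed fraction
# of the purchase price, so the two attributes are redundant, while the
# ID column is irrelevant to any prediction task.
data = pd.DataFrame({
    "purchase_price": price,
    "sales_tax": 0.08 * price + rng.normal(0, 0.01, size=500),
    "record_id": np.arange(500),
})

# Simple redundancy check: flag attribute pairs whose absolute
# correlation exceeds the chosen threshold.
corr = data.corr().abs()
redundant = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > 0.95]
print(redundant)        # expected: [('purchase_price', 'sales_tax')]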
Feature Creation
• Create new attributes that can capture the
important information in a data set much more
efficiently than the original attributes
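A commonly used illustration of feature creation is constructing density from mass and volume; the pandas sketch below does this with invented measurements and hypothetical column names.

import pandas as pd

# Hypothetical measurements: mass and volume are the original
# attributes; density is a constructed feature that captures the
# underlying material more directly than either attribute alone.
objects = pd.DataFrame({
    "mass_g": [19.3, 38.6, 2.7, 8.1],
    "volume_cm3": [1.0, 2.0, 1.0, 3.0],
})
objects["density_g_cm3"] = objects["mass_g"] / objects["volume_cm3"]
print(objects)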
[Figure: histogram with counts (0 to 30) on the y-axis and petal length (0 to 8) on the x-axis.]
Data consists of four groups of points and two outliers. Data is one-
dimensional, but a random y component is added to reduce overlap.
Discretization Without Using Class Labels
Net Primary Production (NPP) is a measure of plant growth used by ecosystem scientists.
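Because no class labels are used, the binning here is unsupervised. The NumPy sketch below shows equal-width and equal-frequency discretization on synthetic data resembling the four-group example above; the number of bins and all values are assumptions for illustration.

import numpy as np

# Synthetic one-dimensional data with four groups and two outliers
# (all values are invented).
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(loc, 0.3, size=50) for loc in (2, 4, 6, 8)]
                   + [np.array([0.0, 10.0])])

# Unsupervised (label-free) discretization into 4 bins.
# Equal width: split the range of x into intervals of equal length.
width_edges = np.linspace(x.min(), x.max(), 5)
equal_width = np.digitize(x, width_edges[1:-1])

# Equal frequency: choose edges so each bin holds roughly the same count.
freq_edges = np.quantile(x, [0.25, 0.5, 0.75])
equal_freq = np.digitize(x, freq_edges)

print("equal width counts:    ", np.bincount(equal_width))
print("equal frequency counts:", np.bincount(equal_freq))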