Data Visualization of Multidimensional Data
What is data modeling?
• Data modeling (data modelling) is the process of
creating a data model for the data to be stored
in a database.
• This data model is a conceptual representation of
data objects, the associations between
different data objects, and the rules.
• Data modeling helps in the visual representation
of data and enforces business rules, regulatory
compliance, and government policies on the
data.
• Data models ensure consistency in naming
conventions, default values, semantics, and
security, while ensuring data quality.
Data Model
• The Data Model is defined as an abstract
model that organizes data description,
data semantics, and consistency
constraints of the data.
• The data model emphasizes what
data is needed and how it should be
organized, rather than what operations
will be performed on the data.
• Data Model is like an architect's building
plan, which helps to build conceptual
models and set a relationship between
data items.
Data Models: Types
• The two types of Data Modeling
Techniques are
– Entity Relationship (E-R) Model
– UML (Unified Modelling Language)
Why data modeling?
• Ensures that all data objects required by
the database are accurately
represented. Omission of data leads
to faulty reports and incorrect results.
• A data model helps design the database
at the conceptual, logical, and physical
levels.
• Data Model structure helps to define
the relational tables, primary and
foreign keys and stored procedures.
Why data modeling?
• It provides a clear picture of the base
data and can be used by database
developers to create a physical
database.
• It is also helpful to identify
missing and redundant data.
• Though the initial creation of a data
model is labor- and time-intensive,
in the long run it makes your IT
infrastructure upgrades and
maintenance cheaper and faster.
Data model types
• Types of Data Models: There are mainly
three different types of data models:
conceptual data models, logical data
models, and physical data models, and
each one has a specific purpose.
• The data models are used to represent
the data and how it is stored in the
database and to set the relationship
between data items.
Data model types
• Conceptual Data Model: This Data
Model defines WHAT the system
contains.
• This model is typically created by
Business stakeholders and Data
Architects.
• The purpose is to organize, scope and
define business concepts and rules.
Data model types
• Logical Data Model: Defines HOW the
system should be implemented
regardless of the DBMS.
• This model is typically created by
Data Architects and Business
Analysts.
• The purpose is to develop a technical
map of rules and data structures.
Data model types
• Physical Data Model: This Data Model
describes HOW the system will be
implemented using a specific DBMS
system.
• This model is typically created by
DBAs and developers.
• The purpose is actual implementation
of the database.
Data model types
[Figure: overview of the conceptual, logical, and physical data model levels]
Conceptual data model
• A Conceptual Data Model is an organized view of database
concepts and their relationships. The purpose of creating a
conceptual data model is to establish entities, their attributes,
and relationships.
• In this data modeling level, there is hardly any detail available on
the actual database structure. Business stakeholders and data
architects typically create a conceptual data model.
• The three basic tenets of the Conceptual Data Model are
– Entity: A real-world thing
– Attribute: Characteristics or properties of an entity
– Relationship: Dependency or association between two
entities
Conceptual data model
• Data model example:
– Customer and Product are two entities.
Customer number and name are
attributes of the Customer entity
– Product name and price are attributes of
product entity
– Sale is the relationship between the
customer and product
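A conceptual model is normally a diagram rather than code, but as a rough, hedged sketch the example above could be written with Python dataclasses (the entity and attribute names come from the example; the field types are illustrative assumptions):

```python
from dataclasses import dataclass

# Entities with their attributes, mirroring the example above.
@dataclass
class Customer:
    customer_number: int  # attribute
    name: str             # attribute

@dataclass
class Product:
    product_name: str     # attribute
    price: float          # attribute

# "Sale" is the relationship associating a Customer with a Product.
@dataclass
class Sale:
    customer: Customer
    product: Product
```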
Characteristics of conceptual data model
• Offers Organisation-wide coverage of the
business concepts.
• This type of data model is designed and
developed for a business audience.
• The conceptual model is developed
independently of hardware specifications like
data storage capacity, location or software
specifications like DBMS vendor and technology.
The focus is to represent data as a user will see
it in the "real world."
• Conceptual data models, known as domain
models, create a common vocabulary for all
stakeholders by establishing basic concepts and
scope.
Logical Data Model
• The Logical Data Model is used to define the
structure of data elements and to set
relationships between them.
• The logical data model adds further
information to the conceptual data model
elements.
• The advantage of using a logical data model is
that it provides a foundation for the physical
model. However, the modeling structure
remains generic.
Characteristics of a Logical data model
• Describes data needs for a single
project but could integrate with other
logical data models based on the scope
of the project.
• Designed and developed independently
from the DBMS.
• Data attributes will have datatypes
with exact precisions and lengths.
• Normalization is typically applied to the
model, up to third normal form (3NF).
Physical Data Model
• A Physical Data Model describes a database-
specific implementation of the data model. It
offers database abstraction and helps generate
the schema. This is because of the richness of
metadata offered by a Physical Data Model.
• The physical data model also helps in visualizing
database structure by replicating database
column keys, constraints, indexes, triggers, and
other RDBMS features.
Characteristics
• The physical data model describes data needs for
a single project or application, though it may be
integrated with other physical data models
based on project scope.
• The data model contains relationships between
tables, addressing cardinality and nullability
of the relationships.
• Developed for a specific version of a DBMS,
location, data storage or technology to be used
in the project.
• Columns have exact datatypes, lengths, and
default values assigned.
• Primary and foreign keys, views, indexes, access
profiles, authorizations, etc. are defined.
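A minimal sketch of what such a physical model can look like, using SQLite through Python's standard library; every table, column, type, and index name here is an illustrative assumption, not part of the material above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Physical model: exact datatypes, default values, primary/foreign keys, index.
conn.executescript("""
CREATE TABLE customer (
    customer_number INTEGER PRIMARY KEY,
    name            VARCHAR(100) NOT NULL,
    created_on      DATE DEFAULT CURRENT_DATE
);
CREATE TABLE product (
    product_id   INTEGER PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL,
    price        DECIMAL(10, 2) DEFAULT 0.00
);
CREATE TABLE sale (
    sale_id         INTEGER PRIMARY KEY,
    customer_number INTEGER REFERENCES customer(customer_number),
    product_id      INTEGER REFERENCES product(product_id)
);
CREATE INDEX idx_sale_customer ON sale(customer_number);
""")
```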
Advantages of Data model
• The main goal of designing a data model is to make
certain that data objects offered by the functional
team are represented accurately.
• The data model should be detailed enough to be
used for building the physical database.
• The information in the data model can be used for
defining the relationship between tables, primary
and foreign keys, and stored procedures.
• The data model helps the business communicate within
and across organizations.
• The data model helps to document data mappings in
the ETL process.
• It helps to recognize the correct sources of data to
populate the model.
Disadvantages of Data model
• To develop a data model, one should
know the characteristics of the physical
data storage.
• A navigational data model makes
application development and management
complex, and thus requires knowledge of
how the data is physically represented.
• Even a small change made in the structure
requires modification of the entire
application.
• There is no set data manipulation language.
Multidimensional Data Model
• The Multidimensional Data Model can be defined as a
method for arranging the data in the database,
with better structuring and organization of the
contents in the database.
• Unlike a system with one dimension such as a
list, the Multidimensional Data Model can
have two or three dimensions of items from
the database system.
• It is typically used in organizations for
drawing out analytical results and generating
reports, which can be used as the main source for
imperative decision-making processes.
• This model is typically applied to systems that
operate with OLAP techniques (Online Analytical
Processing).
Multidimensional Data Model
• The multidimensional data model stores
data in the form of a data cube. Mostly,
data warehousing supports two- or
three-dimensional cubes.
• A data cube allows data to be viewed in
multiple dimensions.
• Dimensions are entities with respect to
which an organization wants to keep
records.
• For example, in a store's sales records,
dimensions allow the store to keep track of
things like monthly sales of items across
branches and locations, as in the sketch below.
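A small, hedged illustration of that sales cube using pandas (the toy records and column names are invented for this sketch): a pivot table materializes one two-dimensional slice of the cube, monthly sales per branch.

```python
import pandas as pd

# Toy sales records: each row is one fact with three dimensions.
sales = pd.DataFrame({
    "month":  ["Jan", "Jan", "Feb", "Feb"],
    "branch": ["B1", "B2", "B1", "B2"],
    "item":   ["TV", "TV", "Phone", "TV"],
    "amount": [500, 700, 300, 650],
})

# One 2-D slice of the cube: total monthly sales per branch.
cube = sales.pivot_table(index="month", columns="branch",
                         values="amount", aggfunc="sum")
print(cube)
```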
Multidimensional Data Model
• A multidimensional database helps
provide data-related answers to
complex business queries quickly and
accurately.
• Data warehouses and Online
Analytical Processing (OLAP) tools
are based on a multidimensional
data model.
• OLAP in data warehousing enables
users to view data from different
angles and dimensions.
Multidimensional Data Model
• Schemas for the Multidimensional Data
Model are:
– Star Schema
– Snowflake Schema
– Fact Constellation Schema
Dimensional Modeling
• Dimensional Modeling (DM) is a data
structure technique optimized for data
storage in a data warehouse.
• The purpose of dimensional modeling
is to optimize the database for faster
retrieval of data.
• The concept of Dimensional
Modelling was developed by Ralph
Kimball and consists of “fact” and
“dimension” tables.
Dimensional Modeling
• A dimensional model in a data warehouse
is designed to read, summarize, and analyze
numeric information like values,
balances, counts, weights, etc.
• In contrast, relational models are optimized
for the addition, updating, and deletion of
data in a real-time online transaction
processing (OLTP) system.
• These dimensional and relational
models each have their own way of
storing data, with specific advantages.
Dimensional Modeling
• For instance, in the relational model,
normalization and ER models reduce
redundancy in data.
• On the contrary, a dimensional model in a
data warehouse arranges data in such a
way that it is easier to retrieve
information and generate reports.
• Hence, dimensional models are used in data
warehouse systems and are not a good fit for
relational systems.
Elements of multidimensional modeling
• Fact
• Dimension
• Attributes
• Fact table
• Dimension table
Elements of multidimensional modeling
• Fact
– Facts are the measurements/metrics from your
business process. For a Sales business process, a
measurement would be the quarterly sales number.
• Dimension
– Dimension provides the context surrounding a business
process event. In simple terms, they give who, what,
where of a fact. In the Sales business process, for the
fact quarterly sales number, dimensions would be
• Who – Customer Names
• Where – Location
• What – Product Name
• In other words, a dimension is a window to view
information in the facts.
Elements of multidimensional modeling
• Attributes
– The Attributes are the various
characteristics of the dimension in
dimensional data modeling.
• In the Location dimension, the attributes
can be
– State
– Country
– Zipcode etc.
• Attributes are used to search, filter, or
classify facts. Dimension tables contain attributes.
Elements of multidimensional modeling
• Fact Table
– A fact table is the primary table in
dimensional modeling.
• A Fact Table contains
– Measurements/facts
– Foreign keys to dimension tables
Elements of multidimensional modeling
• A dimension table contains dimensions of a
fact.
• They are joined to the fact table via a foreign
key.
• Dimension tables are de-normalized tables.
• The dimension attributes are the various
columns in a dimension table.
• Dimensions offer descriptive characteristics
of the facts with the help of their
attributes.
• There is no set limit on the number of
dimensions.
Star Schema
• The simplest data warehouse schema is
the star schema, because its structure
resembles a star.
• Star schema consists of data in the form
of facts and dimensions.
• The fact table sits at the center of the
star, and the points of the star are the
dimension tables.
• In a star schema, the fact table contains a large
amount of data, with no redundancy.
• Each dimension table is joined to the fact
table through a primary key–foreign key
relationship, as in the sketch below.
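A minimal star-schema sketch in Python with SQLite; the particular fact and dimension tables (fact_sales, dim_date, dim_product, dim_location) are hypothetical examples chosen for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per point of the star.
CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, product_name TEXT, price REAL);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);

-- Fact table at the center: measures plus a foreign key to each dimension.
CREATE TABLE fact_sales (
    date_id     INTEGER REFERENCES dim_date(date_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    units_sold  INTEGER,
    revenue     REAL
);
""")
```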
Star Schema
[Figure: star schema diagram]
Snowflake Schema
• The snowflake schema is more complex
than the star schema because the dimension
tables of the snowflake are normalized.
• The snowflake schema is represented by a
centralized fact table connected to
multiple dimension tables, and these dimension
tables can be normalized into additional
dimension tables.
• The major difference between the
snowflake and star schema models is that
the dimension tables of the snowflake
model are normalized to reduce
redundancies.
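Continuing the same illustrative assumptions as the star-schema sketch above, a snowflake version might normalize the location dimension into a chain of smaller tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Snowflake variant: the location dimension is normalized into a chain
# of smaller dimension tables instead of one wide, denormalized table.
conn.executescript("""
CREATE TABLE dim_country (country_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_state   (state_id   INTEGER PRIMARY KEY, state TEXT,
                          country_id INTEGER REFERENCES dim_country(country_id));
CREATE TABLE dim_city    (city_id    INTEGER PRIMARY KEY, city TEXT,
                          state_id   INTEGER REFERENCES dim_state(state_id));
""")
```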
Snowflake Schema
[Figure: snowflake schema diagram]
Fact Constellation Schema
• A fact constellation can have
multiple fact tables that share
many dimension tables.
• This type of schema can be viewed as a
collection of stars and snowflakes, and
hence is called a galaxy schema or a
fact constellation.
• The main disadvantage of fact
constellation schemas is its more
complicated design.
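Under the same illustrative assumptions, a tiny fact-constellation sketch: two fact tables (fact_sales and fact_shipping, both hypothetical) sharing the same dimension tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY, city TEXT);

-- Two fact tables ("stars") sharing the dimension tables above.
CREATE TABLE fact_sales    (product_id  INTEGER REFERENCES dim_product(product_id),
                            location_id INTEGER REFERENCES dim_location(location_id),
                            revenue REAL);
CREATE TABLE fact_shipping (product_id  INTEGER REFERENCES dim_product(product_id),
                            location_id INTEGER REFERENCES dim_location(location_id),
                            shipping_cost REAL);
""")
```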
Fact Constellation Schema
[Figure: fact constellation schema diagram]
Visualize Multidimensional Data
[Figures: example visualizations of multidimensional data]
Principal Component Analysis
• Large datasets are increasingly common and
are often difficult to interpret.
• Principal component analysis (PCA) is a technique for
reducing the dimensionality of such datasets,
increasing interpretability but at the same time
minimizing information loss.
• It does so by creating new uncorrelated
variables that successively maximize variance.
• Finding such new variables, the principal components,
reduces to solving an eigenvalue/eigenvector
problem, and the new variables are defined by the
dataset at hand, not a priori, hence making PCA an
adaptive data analysis technique.
Dimensionality Reduction
• Dimensionality reduction or dimension reduction is the
process of reducing the number of random variables under
consideration by obtaining a set of principal variables.
• It can be divided into feature selection and feature
extraction.
– Feature selection approaches try to find a subset of the
original variables (also called features or attributes).
– Feature projection or feature extraction transforms the
data in the high-dimensional space to a space of fewer
dimensions. (Both approaches are sketched below.)
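A brief scikit-learn sketch contrasting the two approaches; the library, its bundled Iris data, and k = 2 components are assumptions made purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature selection: keep 2 of the original 4 columns, scored by ANOVA F-value.
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: project onto 2 new variables (principal components).
X_projected = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_projected.shape)  # (150, 2) (150, 2)
```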
Large Dimensions
• A large number of features in a dataset is one of the factors that
affect both the training time and the accuracy of machine learning
models. There are different options to deal with a huge number of
features in a dataset:
– Try to train the models on the original number of features, which
may take days or weeks if the number of features is too high.
– Reduce the number of variables by merging correlated
variables.
– Extract the most important features from the dataset that are
responsible for maximum variance in the output. Different
statistical techniques are used for this purpose, e.g. linear
discriminant analysis, factor analysis, and principal component
analysis.
Principal Component Analysis
• Principal component analysis, or PCA, is a statistical
technique to convert high dimensional data to
low dimensional data by selecting the most
important features that capture maximum
information about the dataset.
• The features are selected on the basis of variance
that they cause in the output.
• The feature that causes the highest variance is the
first principal component. The feature responsible
for the second-highest variance is considered
the second principal component, and so on.
• It is important to mention that principal
components do not have any correlation with
each other.
Advantages of PCA
• The training time of the algorithms
reduces significantly with a smaller
number of features.
• It is not always possible to analyze data
in high dimensions. For instance, if there
are 100 features in a dataset, the total
number of scatter plots required to
visualize all pairwise relationships would be
100(100-1)/2 = 4950. Practically, it is not possible to
analyze data this way; the snippet below verifies the pair count.
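The figure is just the number of unordered feature pairs, which two lines of Python confirm:

```python
from math import comb

# Number of distinct feature pairs (hence pairwise scatter plots) for n features.
n = 100
print(n * (n - 1) // 2, comb(n, 2))  # 4950 4950
```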
Normalization of features
• It is imperative to mention that a feature set must
be normalized before applying PCA. For instance, if a
feature set has data expressed in units of kilograms,
light years, or millions, the variance scale is huge in
the training set. If PCA is applied to such a feature
set, the resulting loadings for features with high
variance will also be large. Hence, principal
components will be biased towards features with
high variance, leading to false results.
• Finally, the last point to remember before we start
coding is that PCA is a statistical technique and can
only be applied to numeric data. Therefore,
categorical features are required to be converted
into numerical features before PCA can be applied.
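A minimal sketch of scaling before projection with scikit-learn (an assumption of this example; the bundled wine dataset is used only because its features sit on very different scales):

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)   # 13 numeric features on very different scales

# Standardize first so no single large-variance feature dominates the components.
pca = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = pca.fit_transform(X)
print(X_2d.shape)  # (178, 2)
```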
Steps in PCA
• Standardization
• Covariance Matrix Computation
• Compute eigenvectors and eigenvalues
• Feature vector
• Recast the data along the principal
component axes (each step is sketched below)
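A hedged end-to-end sketch of these five steps in NumPy; the data matrix is synthetic and p = 2 retained components is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))           # toy data: 200 samples, 5 variables

# 1. Standardization: zero mean, unit variance per variable.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data.
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvectors and eigenvalues (eigh: the covariance matrix is symmetric).
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]       # sort descending by eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Feature vector: keep the top p eigenvectors as columns.
p = 2
feature_vector = eigvecs[:, :p]

# 5. Recast the data along the principal component axes.
X_pca = X_std @ feature_vector
print(X_pca.shape)  # (200, 2)
```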
Standardization
• The aim of this step is to standardize the
range of the continuous initial variables so
that each one of them contributes equally
to the analysis.
• More specifically, the reason why it is
critical to perform standardization prior
to PCA, is that the latter is quite sensitive
regarding the variances of the initial
variables.
• That is, if there are large differences
between the ranges of the initial variables,
those variables with larger ranges will
dominate over those with small ranges,
which will lead to biased results.
Standardization
• Mathematically, this can be done by
subtracting the mean and dividing by
the standard deviation for each value of
each variable.
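In symbols, each value x of a variable with mean μ and standard deviation σ becomes:

```latex
z = \frac{x - \mu}{\sigma}
```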
Computing Eigenvectors and Eigenvalues
• If we rank the eigenvalues in descending order, we
get λ1 > λ2, which means that the eigenvector
that corresponds to the first principal component
(PC1) is v1 and the one that corresponds to the
second principal component (PC2) is v2.
Feature Vector
• As we saw in the previous step, computing the
eigenvectors and ordering them by their eigenvalues in
descending order, allow us to find the principal
components in order of significance.
• In this step, what we do is choose whether to keep
all these components or discard those of lesser
significance (those with low eigenvalues), and form
with the remaining ones a matrix of vectors that we
call the feature vector.
• So, the feature vector is simply a matrix that has as
columns the eigenvectors of the components that we
decide to keep.
• This makes it the first step towards dimensionality
reduction, because if we choose to keep only p
eigenvectors (components) out of n, the final data set
will have only p dimensions.
Example:
• Continuing with the example from the previous step,
we can either form a feature vector with both of
the eigenvectors v1 and v2: