Unit - 2 Data Warehouse
PREPARED BY:
KUMAR MUTTURAJ
BCA DEPT
Fundamentals of Data Science
Stage 1 : Assembling data from the client : In the first stage, the correct data for the Multi-Dimensional Data Model is gathered from the client. Software professionals usually explain to the client the range of data that can be captured with the selected technology and then collect the complete data requirements in detail.
Stage 2 : Grouping different segments of the system : In the second stage, the Multi-Dimensional Data Model recognizes and classifies all the data into the respective sections they belong to, which also makes the model easier to build and apply step by step.
Stage 3 : Noticing the different proportions : The third stage forms the basis of the system design. In this stage, the main factors are identified from the user’s point of view. These factors are also known as “Dimensions”.
Stage 4 : Preparing the actual-time factors and their respective qualities : In the fourth stage, the factors identified in the previous step are used to identify their related qualities. These qualities are also known as “attributes” in the database.
Stage 5 : Finding the actuality of factors which are listed previously and their qualities : In the fifth stage, the Multi-Dimensional Data Model separates the facts from the factors and qualities collected so far. These facts play a significant role in the arrangement of a Multi-Dimensional Data Model.
Stage 6 : Building the Schema to place the data, with respect to the information collected
from the steps above : In the sixth stage, on the basis of the data which was collected
previously, a Schema is built.
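As a rough illustration of the schema built in Stage 6, the following sketch (using pandas, with hypothetical sales data and table names) lays out a simple star-style arrangement: one fact table holding the measures and two dimension tables describing them. It is only an example of the idea, not a prescribed design.

```python
import pandas as pd

# Hypothetical dimension tables (attributes describing the measures)
dim_time = pd.DataFrame({
    "time_id": [1, 2],
    "year": [2023, 2024],
    "quarter": ["Q1", "Q1"],
})
dim_product = pd.DataFrame({
    "product_id": [10, 20],
    "product_name": ["Laptop", "Phone"],
    "category": ["Electronics", "Electronics"],
})

# Fact table holding the measures, keyed by the dimension tables
fact_sales = pd.DataFrame({
    "time_id": [1, 1, 2],
    "product_id": [10, 20, 10],
    "units_sold": [5, 12, 7],
    "revenue": [5000, 3600, 7200],
})

# Joining the fact table with its dimensions gives the view used for analysis
cube = (fact_sales
        .merge(dim_time, on="time_id")
        .merge(dim_product, on="product_id"))
print(cube)
```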
Dimensions: Dimensions are attributes that describe the measures, such as time, location,
or product. They are typically stored in dimension tables in a multidimensional data model.
Cubes: Cubes are structures that represent the multidimensional relationships between
measures and dimensions in a data model. They provide a fast and efficient way to retrieve
and analyze data.
Aggregation: Aggregation is the process of summarizing data across dimensions and levels
of detail. This is a key feature of multidimensional data models, as it enables users to quickly
analyze data at different levels of granularity.
Drill-down and roll-up: Drill-down is the process of moving from a higher-level summary
of data to a lower level of detail, while roll-up is the opposite process of moving from a
lower-level detail to a higher-level summary. These features enable users to explore data in
greater detail and gain insights into the underlying patterns.
Hierarchies: Hierarchies are a way of organizing dimensions into levels of detail. For
example, a time dimension might be organized into years, quarters, months, and days.
Hierarchies provide a way to navigate the data and perform drill-down and roll-up
operations.
OLAP (Online Analytical Processing): OLAP is a type of multidimensional data model that
supports fast and efficient querying of large datasets. OLAP systems are designed to handle
complex queries and provide fast response times.
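A minimal sketch, assuming pandas and a small hypothetical sales table, of how aggregation, roll-up, drill-down along a time hierarchy, and a simple cube slice might look in practice:

```python
import pandas as pd

# Hypothetical sales data with a time hierarchy (year -> quarter -> month)
sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q1"],
    "month":   ["Jan", "Feb", "Apr", "May", "Jan", "Feb"],
    "region":  ["North", "South", "North", "South", "North", "South"],
    "revenue": [100, 150, 120, 130, 170, 160],
})

# Roll-up: aggregate monthly detail up to the year level
rollup = sales.groupby("year")["revenue"].sum()

# Drill-down: move back to a finer level of the time hierarchy
drilldown = sales.groupby(["year", "quarter", "month"])["revenue"].sum()

# A simple two-dimensional "cube" slice: region vs. quarter
cube_slice = sales.pivot_table(index="region", columns="quarter",
                               values="revenue", aggfunc="sum")
print(rollup, drilldown, cube_slice, sep="\n\n")
```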
Data Cleaning:
Data cleaning is one of the important parts of machine learning and plays a significant role in building a model. In this section, we will look at data cleaning, its significance, and how it can be carried out in Python.
Data cleaning, also known as data cleansing or data preprocessing, is a crucial step in the
data science pipeline that involves identifying and correcting or removing errors,
inconsistencies, and inaccuracies in the data to improve its quality and usability. Data
cleaning is essential because raw data is often noisy, incomplete, and inconsistent, which
can negatively impact the accuracy and reliability of the insights derived from it.
Managing unwanted outliers: Identify and manage outliers, which are data points
significantly deviating from the norm. Depending on the context, decide whether to
remove outliers or transform them to minimize their impact on analysis. Managing
outliers is crucial for obtaining more accurate and reliable insights from the data.
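As one possible way of managing outliers (not the only one), the sketch below applies the common 1.5 * IQR rule to a hypothetical salary column and shows both removal and capping:

```python
import pandas as pd

# Hypothetical numeric column with one obvious outlier
df = pd.DataFrame({"salary": [42000, 45000, 47000, 44000, 46000, 400000]})

# Flag values outside 1.5 * IQR of the quartiles (a common rule of thumb)
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["salary"] < lower) | (df["salary"] > upper)]

# Option 1: remove the outliers
cleaned = df[(df["salary"] >= lower) & (df["salary"] <= upper)]

# Option 2: transform (cap) them instead of removing them
capped = df["salary"].clip(lower=lower, upper=upper)
print(outliers, cleaned, capped, sep="\n\n")
```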
Handling Missing Data: Devise strategies to handle missing data effectively. This
may involve imputing missing values based on statistical methods, removing records
with missing values, or employing advanced imputation techniques. Handling
missing data ensures a more complete dataset, preventing biases and maintaining the
integrity of analyses.
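A small example of the two simplest strategies mentioned above, using pandas on a hypothetical dataset: dropping incomplete records and imputing with the column median. More advanced imputation techniques are not shown here.

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan],
    "income": [50000, 62000, np.nan, 58000, 61000],
})

# Strategy 1: drop records that contain any missing value
dropped = df.dropna()

# Strategy 2: impute with a simple statistic (here, the column median)
imputed = df.fillna(df.median(numeric_only=True))
print(dropped, imputed, sep="\n\n")
```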
Data Integration:
Data integration in data mining refers to the process of combining data from multiple
sources into a single, unified view. This can involve cleaning and transforming the data, as
well as resolving any inconsistencies or conflicts that may exist between the different
sources. The goal of data integration is to make the data more useful and meaningful for the
purposes of analysis and decision making. Techniques used in data integration include data
warehousing, ETL (extract, transform, load) processes, and data federation.
Data Integration is a data preprocessing technique that combines data from multiple
heterogeneous data sources into a coherent data store and provides a unified view of the
data. These sources may include multiple data cubes, databases, or flat files.
The data integration approaches are formally defined as a triple <G, S, M>, where:
G stands for the global schema,
S stands for the heterogeneous source schemas,
M stands for the mappings between the queries over the source schemas and the global schema.
Data integration can be challenging due to the variety of data formats, structures, and
semantics used by different data sources. Different data sources may use different data
types, naming conventions, and schemas, making it difficult to combine the data into a
single view. Data integration typically involves a combination of manual and automated
processes, including data profiling, data mapping, data transformation, and data
reconciliation.
Data integration is used in a wide range of applications, such as business intelligence, data
warehousing, master data management, and analytics. Data integration can be critical to the
success of these applications, as it enables organizations to access and analyze data that is
spread across different systems, departments, and lines of business, in order to make better
decisions, improve operational efficiency, and gain a competitive advantage.
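The sketch below (with hypothetical source names and column mappings) mirrors the <G, S, M> idea: each source schema S is mapped by M into the global schema G to give a unified view of the data.

```python
import pandas as pd

# Two hypothetical heterogeneous sources with different naming conventions
crm = pd.DataFrame({"cust_id": [1, 2], "full_name": ["Asha", "Ravi"]})
billing = pd.DataFrame({"CustomerID": [1, 2], "Name": ["Asha", "Ravi"],
                        "amount_due": [120.0, 80.0]})

# G: the global schema; M: mappings from each source schema S into G
global_columns = ["customer_id", "customer_name", "amount_due"]
mappings = {
    "crm":     {"cust_id": "customer_id", "full_name": "customer_name"},
    "billing": {"CustomerID": "customer_id", "Name": "customer_name"},
}

# Apply the mappings and combine the sources into a single unified view
unified = (crm.rename(columns=mappings["crm"])
           .merge(billing.rename(columns=mappings["billing"]),
                  on=["customer_id", "customer_name"], how="outer")
           .reindex(columns=global_columns))
print(unified)
```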
There are two major approaches to data integration: the “tight coupling” approach and the “loose coupling” approach.
Tight Coupling:
This approach involves creating a centralized repository or data warehouse to store the
integrated data. The data is extracted from various sources, transformed and loaded into a
data warehouse. Data is integrated in a tightly coupled manner, meaning that the data is
integrated at a high level, such as at the level of the entire dataset or schema. This approach
is also known as data warehousing, and it enables data consistency and integrity, but it can
be inflexible and difficult to change or update.
Here, a data warehouse is treated as an information retrieval component.
In this coupling, data is combined from different sources into a single physical
location through the process of ETL – Extraction, Transformation, and Loading.
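A minimal ETL sketch of the tight coupling approach, assuming pandas and SQLite as a stand-in warehouse; the source tables here are hypothetical in-memory data rather than real operational systems.

```python
import sqlite3
import pandas as pd

# Extract: pull data from hypothetical source systems
orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 11],
                       "order_date": ["2024-01-05", "2024-01-09"],
                       "amount": [250.0, 99.0]})
customers = pd.DataFrame({"customer_id": [10, 11],
                          "customer_name": ["Asha", "Ravi"]})

# Transform: clean the types and join into the warehouse schema
orders["order_date"] = pd.to_datetime(orders["order_date"])
fact_orders = orders.merge(customers, on="customer_id")

# Load: write the integrated table into a central repository (SQLite here)
with sqlite3.connect(":memory:") as warehouse:
    fact_orders.to_sql("fact_orders", warehouse, index=False)
    print(pd.read_sql("SELECT * FROM fact_orders", warehouse))
```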
Loose Coupling:
This approach involves integrating data at the lowest level, such as at the level of individual
data elements or records. Data is integrated in a loosely coupled manner, meaning that the
data is integrated at a low level, and it allows data to be integrated without having to create
a central repository or data warehouse. This approach is also known as data federation, and
it enables data flexibility and easy updates, but it can be difficult to maintain consistency
and integrity across multiple data sources.
Here, an interface is provided that takes the query from the user, transforms it in a
way the source database can understand, and then sends the query directly to the
source databases to obtain the result.
The data itself remains only in the actual source databases.
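A simplified sketch of the loose coupling (data federation) idea: the data stays in the hypothetical source databases, and a small interface sends the user’s query to each source and combines the results. A real federated system would also translate the query into each source’s own dialect, which is omitted here.

```python
import sqlite3
import pandas as pd

# Two hypothetical source databases that keep their own data
east = sqlite3.connect(":memory:")
west = sqlite3.connect(":memory:")
pd.DataFrame({"product": ["Laptop"], "revenue": [5000]}).to_sql(
    "sales", east, index=False)
pd.DataFrame({"product": ["Phone"], "revenue": [3600]}).to_sql(
    "sales", west, index=False)

def federated_query(sql, sources):
    """Send the query to each source database and combine the results;
    the data itself stays in the sources."""
    return pd.concat([pd.read_sql(sql, con) for con in sources],
                     ignore_index=True)

print(federated_query("SELECT product, revenue FROM sales", [east, west]))
```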
Data transformation: In data mining, data transformation refers to the process of converting raw data into a format that is suitable for analysis and modeling. The goal of data transformation is to prepare the data for data mining so that it can be used to extract useful insights and knowledge. Data transformation typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing values
in the data.
2. Data integration: Combining data from multiple sources, such as databases and
spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such as
between 0 and 1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset of
relevant features or attributes.
5. Data discretization: Converting continuous data into discrete categories or bins.
6. Data aggregation: Combining data at different levels of granularity, such as by
summing or averaging, to create new features or attributes.
Data transformation is an important step in the data mining process, as it helps to ensure that the data is in a format that is suitable for analysis and modeling, and that it is free of errors and inconsistencies. Data transformation can also help to improve the performance of data mining algorithms by reducing the dimensionality of the data and by scaling the data to a common range of values.
The data are transformed into forms that are appropriate for mining. Data transformation involves the following steps:
1. Smoothing: Smoothing is a process used to remove noise from the dataset using algorithms. It allows the important features present in the dataset to be highlighted and helps in predicting patterns. When collecting data, it can be manipulated to eliminate or reduce variance or any other form of noise. The idea behind data smoothing is that it can identify simple changes that help predict different trends and patterns. This helps analysts or traders who need to look at a lot of data, which can often be difficult to digest, to find patterns they would not otherwise see.
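As an illustration of one possible smoothing method (a moving average, chosen here for simplicity), the sketch below smooths a hypothetical noisy sales series with pandas:

```python
import pandas as pd

# Hypothetical noisy daily sales figures
sales = pd.Series([20, 22, 85, 21, 23, 19, 24, 90, 22, 21])

# Moving-average smoothing: each value is replaced by the mean of a window,
# which dampens the noise and makes the underlying trend easier to see
smoothed = sales.rolling(window=3, center=True, min_periods=1).mean()
print(pd.DataFrame({"raw": sales, "smoothed": smoothed}))
```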
4. Attribute Construction: New attributes are created from the given set of attributes and applied to assist the mining process. This simplifies the original data and makes the mining more efficient.
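A tiny sketch of attribute construction with hypothetical room data: a new area attribute is derived from the existing length and width attributes.

```python
import pandas as pd

# Hypothetical attributes describing rooms
rooms = pd.DataFrame({"length_m": [4.0, 5.5, 3.2], "width_m": [3.0, 4.0, 2.5]})

# Construct a new attribute from the existing ones to assist the mining step
rooms["area_m2"] = rooms["length_m"] * rooms["width_m"]
print(rooms)
```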
6. Normalization: Data normalization involves converting all data variables into a given
range.
Min-Max Normalization:
o This transforms the original data linearly.
o Suppose that min_A is the minimum and max_A is the maximum value of an attribute A, and that the new range is [new_min_A, new_max_A].
o A value v of A is mapped to v’ = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
o Where v is the value you want to map into the new range and v’ is the new value you get after normalizing the old value.
Data Reduction:
(The method of data reduction may achieve a condensed description of the original data which is
much smaller in quantity but keeps the quality of the original data.)
Data reduction is a technique used in data mining to reduce the size of a dataset while still
preserving the most important information. This can be beneficial in situations where the
dataset is too large to be processed efficiently, or where the dataset contains a large amount
of irrelevant or redundant information.
There are several different data reduction techniques that can be used in data mining,
including:
1. Data Sampling: This technique involves selecting a subset of the data to work with,
rather than using the entire dataset. This can be useful for reducing the size of a
dataset while still preserving the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features
in the dataset, either by removing features that are not relevant or by combining
multiple features into a single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete
data by partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the
dataset that are most relevant to the task at hand.
It is important to note that data reduction involves a trade-off between the size of the data and the accuracy of the results: the more the data is reduced, the less accurate and the less generalizable the resulting model may be.
In conclusion, data reduction is an important step in data mining, as it can help to improve
the efficiency and performance of machine learning algorithms by reducing the size of the
dataset. However, it is important to be aware of the trade-off between the size and accuracy
of the data, and carefully assess the risks and benefits before implementing it.
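A minimal sketch of two of the reduction techniques listed above, data sampling and feature selection, applied to a hypothetical dataset of 1,000 records and 5 attributes (the choice of “relevant” features here is purely illustrative):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 1,000 records, 5 attributes
df = pd.DataFrame(rng.normal(size=(1000, 5)),
                  columns=["f1", "f2", "f3", "f4", "f5"])

# Data sampling: keep a 10% random subset while preserving overall trends
sample = df.sample(frac=0.10, random_state=0)

# Feature selection: keep only the attributes judged relevant to the task
selected = df[["f1", "f3"]]

print(sample.shape, selected.shape)
```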
Data Discretization:
Data discretization refers to a method of converting a huge number of data values into
smaller ones so that the evaluation and management of data become easy.
In other words, data discretization is a method of converting attributes values of
continuous data into a finite set of intervals with minimum data loss.
There are two forms of data discretization: the first is supervised discretization, and the second is unsupervised discretization.
Supervised discretization refers to a method in which the class information is used.
Unsupervised discretization refers to a method that does not use class information and depends on the way in which the operation proceeds: it works either with a top-down splitting strategy or a bottom-up merging strategy.
Now, we can understand this concept with the help of an example. Suppose we have an attribute Age with the given values:
Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77
Table before Discretization
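One possible (unsupervised) discretization of the Age values above, using pandas to partition them into a small set of intervals; the bin boundaries and labels here are only illustrative, not the ones intended by the original table:

```python
import pandas as pd

# The Age values from the table above
age = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
                 31, 33, 36, 42, 44, 46, 70, 74, 78, 77])

# Discretize the continuous values into a few labelled intervals
binned = pd.cut(age, bins=[0, 18, 45, 80], labels=["Child", "Adult", "Senior"])
print(pd.DataFrame({"Age": age, "Age group": binned}))
```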
Importance of Discretization:
Discretization is important because it is useful:
1. To generate concept hierarchies.
2. To transform numeric data.
3. To ease the evaluation and management of data.
4. To minimize data loss.
5. To produce better results.
6. To generate a more understandable structure, such as a decision tree.