

SMD COLLEGE, HOSAPETE

FUNDAMENTALS OF DATA SCIENCE


UNIT 2: DATA WAREHOUSE

PREPARED BY:
KUMAR MUTTURAJ
BCA DEPT

Introduction to Data Warehousing


In today’s data-driven world, organizations face the challenge of managing and utilizing
vast amounts of data effectively. A data warehouse addresses this challenge by serving as a
centralized repository where data from multiple sources is integrated, stored, and made
available for analysis and reporting purposes.
Key Components of a Data Warehouse:
1. ETL (Extract, Transform, Load): Processes for extracting data from various sources,
transforming it to fit into a standardized format, and loading it into the data
warehouse.
2. Data Warehouse Database: A structured database optimized for querying and
analysis, using schemas such as star schema or snowflake schema to organize data.
3. Metadata: Descriptive information about the data stored in the warehouse, including
its source, meaning, relationships, and usage.
4. Query and Reporting Tools: Software tools and interfaces that enable users to query,
analyze, and visualize data stored in the warehouse.
Benefits of Using a Data Warehouse:
 Improved Decision Making: Provides a single, consistent source of truth for data
analysis, enabling informed decision-making across the organization.
 Historical Analysis: Allows for the analysis of trends and patterns over time,
facilitating predictive analytics and strategic planning.
 Enhanced Data Quality: Integrates data from disparate sources, improving data
consistency and reliability.
 Scalability: Can handle large volumes of data and support the growing analytical
needs of the organization.
Data Warehouse: A data warehouse is kept separate from the operational DBMS. It stores a huge
amount of data, which is typically collected from multiple heterogeneous sources such as files,
DBMSs, etc. The goal is to produce statistical results that may help in decision-making.
For example, a college might want to quickly see different results, such as how the placement of
CS students has improved over the last 10 years in terms of salaries, counts, etc.
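As a small illustration of the ETL process mentioned earlier, the sketch below extracts placement
records from a CSV file, transforms them into a summarized form, and loads the result into a SQLite
warehouse table. The file name, column names, and table name are hypothetical, and pandas is used
only as one convenient option.

    import sqlite3
    import pandas as pd

    # Extract: read raw data from a (hypothetical) source file
    raw = pd.read_csv("placements_raw.csv")   # assumed columns: year, branch, salary

    # Transform: standardize text formats and drop rows with no salary recorded
    raw["branch"] = raw["branch"].str.upper().str.strip()
    raw = raw.dropna(subset=["salary"])
    summary = (raw.groupby(["year", "branch"], as_index=False)
                  .agg(placed_count=("salary", "size"),
                       avg_salary=("salary", "mean")))

    # Load: write the transformed summary into the warehouse database
    with sqlite3.connect("warehouse.db") as conn:
        summary.to_sql("placement_facts", conn, if_exists="replace", index=False)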

Multi-Dimensional Data Model:


The multi-dimensional data model is a method used for organizing data in the database, with a
good arrangement and assembly of the database contents.
The multi-dimensional data model allows users to ask analytical questions related to market or
business trends, unlike relational databases, which allow users to access data only in the form of
queries. It lets users rapidly receive answers to their requests, since the data can be created and
examined comparatively fast.
OLAP (online analytical processing) and data warehousing use multi-dimensional databases. The
model is used to show multiple dimensions of the data to users.
It represents data in the form of data cubes. Data cubes allow the data to be modelled and viewed
from many dimensions and perspectives. A cube is defined by dimensions and facts and is
represented by a fact table. Facts are numerical measures, and fact tables contain the measures of
the related dimension tables or the names of the facts.
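As a rough illustration (not part of the original notes), a small data cube can be approximated in
pandas with a pivot table over a made-up fact table; the column names and values below are
assumptions chosen only for this sketch.

    import pandas as pd

    # A tiny fact table: each row is one fact with two dimensions (year, product)
    # and one numerical measure (sales)
    facts = pd.DataFrame({
        "year":    [2022, 2022, 2023, 2023],
        "product": ["A", "B", "A", "B"],
        "sales":   [100, 150, 120, 180],
    })

    # Build a two-dimensional cube view: year x product, measure = total sales
    cube = facts.pivot_table(index="year", columns="product",
                             values="sales", aggfunc="sum")
    print(cube)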

Working on a Multidimensional Data Model


A multi-dimensional data model works on the basis of pre-decided steps.
Every project should follow the stages below when building a multi-dimensional data model:

Stage 1: Assembling data from the client: In the first stage, a multi-dimensional data model
collects the correct data from the client. Mostly, software professionals explain clearly to the
client the range of data that can be obtained with the selected technology and collect the complete
data in detail.

Stage 2: Grouping different segments of the system: In the second stage, the multi-dimensional
data model recognizes and classifies all the data into the respective sections they belong to,
which also makes the later steps problem-free to apply one by one.

Stage 3: Noticing the different proportions: The third stage forms the basis on which the design
of the system rests. In this stage, the main factors are recognized according to the user's point
of view. These factors are also known as "Dimensions".

Stage 4 : Preparing the actual-time factors and their respective qualities : In the fourth
stage, the factors which are recognized in the previous step are used further for identifying
the related qualities. These qualities are also known as “attributes” in the database.

Stage 5: Finding the facts for the factors listed previously and their qualities: In the fifth
stage, a multi-dimensional data model separates and differentiates the facts from the factors and
qualities collected earlier. These facts play a significant role in the arrangement of a
multi-dimensional data model.

Stage 6 : Building the Schema to place the data, with respect to the information collected
from the steps above : In the sixth stage, on the basis of the data which was collected
previously, a Schema is built.

Features of multidimensional data models:


Measures: Measures are numerical data that can be analyzed and compared, such as sales
or revenue. They are typically stored in fact tables in a multidimensional data model.

Dimensions: Dimensions are attributes that describe the measures, such as time, location,
or product. They are typically stored in dimension tables in a multidimensional data model.

Cubes: Cubes are structures that represent the multidimensional relationships between
measures and dimensions in a data model. They provide a fast and efficient way to retrieve
and analyze data.

Aggregation: Aggregation is the process of summarizing data across dimensions and levels
of detail. This is a key feature of multidimensional data models, as it enables users to quickly
analyze data at different levels of granularity.

Drill-down and roll-up: Drill-down is the process of moving from a higher-level summary
of data to a lower level of detail, while roll-up is the opposite process of moving from a
lower-level detail to a higher-level summary. These features enable users to explore data in
greater detail and gain insights into the underlying patterns.

Hierarchies: Hierarchies are a way of organizing dimensions into levels of detail. For
example, a time dimension might be organized into years, quarters, months, and days.
Hierarchies provide a way to navigate the data and perform drill-down and roll-up
operations.

OLAP (Online Analytical Processing): OLAP is a type of multidimensional data model that
supports fast and efficient querying of large datasets. OLAP systems are designed to handle
complex queries and provide fast response times.
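Roll-up and drill-down along a time hierarchy can be illustrated with grouping at different levels
of granularity. This sketch is not from the notes; the data and column names are invented for the
example.

    import pandas as pd

    sales = pd.DataFrame({
        "year":    [2023, 2023, 2023, 2023],
        "quarter": ["Q1", "Q1", "Q2", "Q2"],
        "amount":  [500, 300, 450, 250],
    })

    # Roll-up: summarize at the coarser level of the hierarchy (per year)
    per_year = sales.groupby("year")["amount"].sum()

    # Drill-down: move to the finer level of detail (per year and quarter)
    per_quarter = sales.groupby(["year", "quarter"])["amount"].sum()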

Data Cleaning:
Data cleaning is one of the important parts of machine learning and plays a significant role in
building a model. In this unit, we will understand data cleaning, its significance, and its Python
implementation.
Data cleaning, also known as data cleansing or data preprocessing, is a crucial step in the
data science pipeline that involves identifying and correcting or removing errors,
inconsistencies, and inaccuracies in the data to improve its quality and usability. Data
cleaning is essential because raw data is often noisy, incomplete, and inconsistent, which
can negatively impact the accuracy and reliability of the insights derived from it.

Steps to Perform Data Cleaning


Performing data cleaning involves a systematic process to identify and rectify errors,
inconsistencies, and inaccuracies in a dataset. The following are essential steps to perform
data cleaning.

 Removal of Unwanted Observations: Identify and eliminate irrelevant or redundant
observations from the dataset. The step involves scrutinizing data entries for duplicate records,
irrelevant information, or data points that do not contribute meaningfully to the analysis.
Removing unwanted observations streamlines the dataset, reducing noise and improving the
overall quality.

 Fixing Structural Errors: Address structural issues in the dataset, such as inconsistencies in
data formats, naming conventions, or variable types. Standardize formats, correct naming
discrepancies, and ensure uniformity in data representation. Fixing structural errors enhances
data consistency and facilitates accurate analysis and interpretation.

 Managing Unwanted outliers: Identify and manage outliers, which are data points
significantly deviating from the norm. Depending on the context, decide whether to
remove outliers or transform them to minimize their impact on analysis. Managing
outliers is crucial for obtaining more accurate and reliable insights from the data.

 Handling Missing Data: Devise strategies to handle missing data effectively. This
may involve imputing missing values based on statistical methods, removing records
with missing values, or employing advanced imputation techniques. Handling
missing data ensures a more complete dataset, preventing biases and maintaining the
integrity of analyses.
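A minimal pandas sketch of the four steps above is shown below; the file name and the "city" and
"salary" columns are hypothetical, and the thresholds and imputation choices are just one
reasonable option.

    import pandas as pd

    df = pd.read_csv("students_raw.csv")   # hypothetical input file

    # 1. Removal of unwanted observations: drop exact duplicate rows
    df = df.drop_duplicates()

    # 2. Fix structural errors: standardize the formatting of a text column
    df["city"] = df["city"].str.strip().str.title()

    # 3. Manage unwanted outliers: keep salaries within 3 standard deviations of the
    #    mean (rows with a missing salary are kept for the next step)
    mean, std = df["salary"].mean(), df["salary"].std()
    df = df[df["salary"].isna() | ((df["salary"] - mean).abs() <= 3 * std)]

    # 4. Handle missing data: impute the remaining missing salaries with the median
    df["salary"] = df["salary"].fillna(df["salary"].median())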

Data Integration:
Data integration in data mining refers to the process of combining data from multiple
sources into a single, unified view. This can involve cleaning and transforming the data, as
well as resolving any inconsistencies or conflicts that may exist between the different
sources. The goal of data integration is to make the data more useful and meaningful for the
purposes of analysis and decision making. Techniques used in data integration include data
warehousing, ETL (extract, transform, load) processes, and data federation.

Data Integration is a data preprocessing technique that combines data from multiple
heterogeneous data sources into a coherent data store and provides a unified view of the
data. These sources may include multiple data cubes, databases, or flat files.

The data integration approaches are formally defined as a triple <G, S, M>, where
G stands for the global schema,
S stands for the heterogeneous source schemas, and
M stands for the mappings between the queries of the source schemas and the global schema.

What is Data Integration?


Data integration is the process of combining data from multiple sources into a cohesive
and consistent view. This process involves identifying and accessing the different data
sources, mapping the data to a common format, and reconciling any inconsistencies or
discrepancies between the sources. The goal of data integration is to make it easier to access
and analyze data that is spread across multiple systems or platforms, in order to gain a more
complete and accurate understanding of the data.

Data integration can be challenging due to the variety of data formats, structures, and
semantics used by different data sources. Different data sources may use different data
types, naming conventions, and schemas, making it difficult to combine the data into a
single view. Data integration typically involves a combination of manual and automated
processes, including data profiling, data mapping, data transformation, and data
reconciliation.

Data integration is used in a wide range of applications, such as business intelligence, data
warehousing, master data management, and analytics. Data integration can be critical to the
success of these applications, as it enables organizations to access and analyze data that is
spread across different systems, departments, and lines of business, in order to make better
decisions, improve operational efficiency, and gain a competitive advantage.

There are two major approaches for data integration: the "tight coupling approach" and the
"loose coupling approach".

Tight Coupling:
This approach involves creating a centralized repository or data warehouse to store the
integrated data. The data is extracted from various sources, transformed and loaded into a
data warehouse. Data is integrated in a tightly coupled manner, meaning that the data is
integrated at a high level, such as at the level of the entire dataset or schema. This approach
is also known as data warehousing, and it enables data consistency and integrity, but it can
be inflexible and difficult to change or update.
 Here, a data warehouse is treated as an information retrieval component.
 In this coupling, data is combined from different sources into a single physical
location through the process of ETL – Extraction, Transformation, and Loading.

Loose Coupling:
This approach involves integrating data at the lowest level, such as at the level of individual
data elements or records. Data is integrated in a loosely coupled manner, meaning that the
data is integrated at a low level, and it allows data to be integrated without having to create
a central repository or data warehouse. This approach is also known as data federation, and
it enables data flexibility and easy updates, but it can be difficult to maintain consistency
and integrity across multiple data sources.
 Here, an interface is provided that takes the query from the user, transforms it in a
way the source database can understand, and then sends the query directly to the
source databases to obtain the result.
 And the data only remains in the actual source databases.
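As a small illustration of bringing two heterogeneous sources into one unified view (the kind of
consolidation done before loading data into a central store), the DataFrames and column names
below are invented for this sketch.

    import pandas as pd

    # Two hypothetical sources describing the same students, with different key names
    crm   = pd.DataFrame({"student_id": [1, 2], "name": ["Asha", "Ravi"]})
    marks = pd.DataFrame({"id": [1, 2], "score": [78, 85]})

    # Reconcile the schema difference, then merge into a single unified view
    marks = marks.rename(columns={"id": "student_id"})
    unified = crm.merge(marks, on="student_id", how="outer")
    print(unified)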

Issues in Data Integration:


There are several issues that can arise when integrating data from multiple sources,
including:
1. Data Quality: Inconsistencies and errors in the data can make it difficult to combine
and analyze.
2. Data Semantics: Different sources may use different terms or definitions for the
same data, making it difficult to combine and understand the data.
3. Data Heterogeneity: Different sources may use different data formats, structures, or
schemas, making it difficult to combine and analyze the data.
4. Data Privacy and Security: Protecting sensitive information and maintaining
security can be difficult when integrating data from multiple sources.
5. Scalability: Integrating large amounts of data from multiple sources can be
computationally expensive and time-consuming.
6. Data Governance: Managing and maintaining the integration of data from multiple
sources can be difficult, especially when it comes to ensuring data accuracy,
consistency, and timeliness.
7. Performance: Integrating data from multiple sources can also affect the performance
of the system.
8. Integration with existing systems: Integrating new data sources with existing
systems can be a complex task, requiring significant effort and resources.
9. Complexity: The complexity of integrating data from multiple sources can be high,
requiring specialized skills and knowledge.

Data Transformation: Data transformation in data mining refers to the process of converting raw
data into a format that is suitable for analysis and modeling. The goal of data transformation is
to prepare the data for data mining so that it can be used to extract useful insights and
knowledge. Data transformation typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies, and missing values
in the data.
2. Data integration: Combining data from multiple sources, such as databases and
spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of values, such as
between 0 and 1, to facilitate comparison and analysis.
4. Data reduction: Reducing the dimensionality of the data by selecting a subset of
relevant features or attributes.
5. Data discretization: Converting continuous data into discrete categories or bins.
6. Data aggregation: Combining data at different levels of granularity, such as by
summing or averaging, to create new features or attributes.
Data transformation is an important step in the data mining process, as it helps to ensure that
the data is in a format suitable for analysis and modeling and that it is free of errors and
inconsistencies. Data transformation can also help improve the performance of data mining
algorithms by reducing the dimensionality of the data and by scaling the data to a common range
of values.

The data is transformed in ways that are ideal for mining it. Data transformation involves the
following steps:

1. Smoothing: Smoothing is a process used to remove noise from the dataset using certain
algorithms. It allows important features present in the dataset to be highlighted and helps in
predicting patterns. When collecting data, it can be manipulated to eliminate or reduce any
variance or any other form of noise. The concept behind data smoothing is that it can identify
simple changes that help predict different trends and patterns. This serves as a help to analysts
or traders who need to look at a lot of data, which can often be difficult to digest, for finding
patterns that they would not otherwise see.
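One common smoothing technique is a simple moving average; the sketch below uses pandas with an
arbitrarily chosen window size and made-up values.

    import pandas as pd

    prices = pd.Series([10, 12, 11, 15, 14, 18, 17, 20])

    # Smooth the series with a 3-point centered moving average to reduce noise
    smoothed = prices.rolling(window=3, center=True).mean()
    print(smoothed)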

2. Aggregation: Data collection or aggregation is the method of storing and presenting data in a
summary format. The data may be obtained from multiple data sources and integrated into a data
analysis description. This is a crucial step, since the accuracy of data analysis insights is
highly dependent on the quantity and quality of the data used. Gathering accurate data of high
quality and in a large enough quantity is necessary to produce relevant results. The aggregated
data is useful for everything from decisions concerning the financing or business strategy of the
product to pricing, operations, and marketing strategies. For example, sales data may be
aggregated to compute monthly and annual total amounts.
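The monthly and annual totals from this example can be computed with pandas resampling over a
hypothetical daily sales series (the dates and amounts are invented).

    import pandas as pd

    daily_sales = pd.Series(
        [200, 150, 300, 250],
        index=pd.to_datetime(["2023-01-05", "2023-01-20", "2023-02-10", "2023-03-15"]),
    )

    # Aggregate daily sales into monthly and annual totals
    monthly_total = daily_sales.resample("M").sum()   # newer pandas prefers "ME"
    annual_total  = daily_sales.resample("Y").sum()   # newer pandas prefers "YE"
    print(monthly_total, annual_total, sep="\n")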

3. Discretization: It is a process of transforming continuous data into a set of small intervals.
Most real-world data mining activities involve continuous attributes, yet many of the existing
data mining frameworks are unable to handle these attributes. Also, even if a data mining task can
manage a continuous attribute, it can significantly improve its efficiency by replacing a
continuous-valued attribute with its discrete values. For example, the numeric intervals (1-10,
11-20, ...) of an Age attribute can be replaced by labels such as young, middle-aged, and senior.

4. Attribute Construction: New attributes are constructed from the given set of attributes and
applied to assist the mining process. This simplifies the original data and makes the mining more
efficient.

5. Generalization: It converts low-level data attributes to high-level data attributes using a
concept hierarchy. For example, Age initially in numerical form (22, 25) is converted into a
categorical value (young, old). Similarly, categorical attributes, such as house addresses, may be
generalized to higher-level definitions, such as town or country.

6. Normalization: Data normalization involves converting all data variables into a given
range.
 Min-Max Normalization:
o This transforms the original data linearly.
o Suppose that min_A is the minima and max_A is the maxima of an attribute A, and that
[new_min_A, new_max_A] is the target range.
o The normalized value is computed as:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
o Where v is the value you want to map into the new range.
o v' is the new value you get after normalizing the old value.
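A short Python sketch of min-max normalization into the range [0, 1]; the sample values are
hypothetical.

    def min_max_normalize(values, new_min=0.0, new_max=1.0):
        # Linearly rescale values into the range [new_min, new_max]
        old_min, old_max = min(values), max(values)
        scale = (new_max - new_min) / (old_max - old_min)
        return [(v - old_min) * scale + new_min for v in values]

    ages = [22, 25, 31, 40, 60]
    print(min_max_normalize(ages))   # 22 maps to 0.0, 60 maps to 1.0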

Data Reduction:
(The method of data reduction may achieve a condensed description of the original data which is
much smaller in quantity but keeps the quality of the original data.)

Data reduction is a technique used in data mining to reduce the size of a dataset while still
preserving the most important information. This can be beneficial in situations where the
dataset is too large to be processed efficiently, or where the dataset contains a large amount
of irrelevant or redundant information.

There are several different data reduction techniques that can be used in data mining,
including:

1. Data Sampling: This technique involves selecting a subset of the data to work with,
rather than using the entire dataset. This can be useful for reducing the size of a
dataset while still preserving the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features
in the dataset, either by removing features that are not relevant or by combining
multiple features into a single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete
data by partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the
dataset that are most relevant to the task at hand.
It is important to note that data reduction involves a trade-off between the accuracy and the
size of the data: the more the data is reduced, the less accurate and the less generalizable the
resulting model may be.
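Two of these techniques, data sampling and dimensionality reduction, can be sketched briefly as
follows. Scikit-learn's PCA is used here as one common option, and the random data is purely
illustrative.

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    # Hypothetical dataset: 1000 rows and 10 numeric features
    df = pd.DataFrame(np.random.rand(1000, 10))

    # Data sampling: keep a random 10% of the rows
    sample = df.sample(frac=0.1, random_state=42)

    # Dimensionality reduction: project the 10 features onto 3 principal components
    reduced = PCA(n_components=3).fit_transform(df)
    print(sample.shape, reduced.shape)   # (100, 10) (1000, 3)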

In conclusion, data reduction is an important step in data mining, as it can help to improve
the efficiency and performance of machine learning algorithms by reducing the size of the
dataset. However, it is important to be aware of the trade-off between the size and accuracy
of the data, and carefully assess the risks and benefits before implementing it.

Data Discretization:
 Data discretization refers to a method of converting a huge number of data values into
smaller ones so that the evaluation and management of data become easy.
 In other words, data discretization is a method of converting attributes values of
continuous data into a finite set of intervals with minimum data loss.
 There are two forms of data discretization first is supervised discretization, and the
second is unsupervised discretization.
 Supervised discretization refers to a method in which the class data is used.
 Unsupervised discretization refers to a method in which the class data is not used; it depends
on the way in which the operation proceeds.
 It means it works on a top-down splitting strategy or a bottom-up merging strategy.

There are different techniques of discretization:


1. Discretization by binning: It is an unsupervised method of partitioning the data into equal
partitions, either by equal width or by equal frequency.
2. Discretization by clustering: Clustering can be applied to discretize numeric attributes. It
partitions the values into different clusters or groups by following a top-down or bottom-up
strategy.
3. Discretization by decision tree: It employs a top-down splitting strategy. It is a supervised
technique that uses class information.
4. Discretization by correlation analysis: ChiMerge employs a bottom-up approach by finding the
best neighboring intervals and then merging them recursively to form larger intervals.
5. Discretization by histogram: Histogram analysis is unsupervised learning because, like binning,
it does not use any class information. There are various partitioning rules used to define
histograms.

Now, we can understand this concept with the help of an example. Suppose we have an attribute Age
with the given values.

Table before Discretization:
Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

Table after Discretization:
Age values             1, 5, 4, 9, 7   11, 14, 17, 13, 18, 19   31, 33, 36, 42, 44, 46   70, 74, 77, 78
After Discretization   Child           Young                    Mature                   Old
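The same binning can be reproduced with pandas; the cut points below are one reasonable choice
implied by the table, not a fixed rule.

    import pandas as pd

    ages = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
                      31, 33, 36, 42, 44, 46, 70, 74, 78, 77])

    # Discretize the continuous ages into four labelled intervals
    categories = pd.cut(ages, bins=[0, 10, 30, 60, 100],
                        labels=["Child", "Young", "Mature", "Old"])
    print(categories.value_counts())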

Importance of Discretization:
Discretization is important because it is useful:
1. To generate concept hierarchies.
2. To transform numeric data.
3. To ease the evaluation and management of data.
4. To minimize data loss.
5. To produce better results.
6. To generate a more understandable structure, e.g., a decision tree.
