
Unit 2: Data Warehouse

Introduction
Data Warehouse

A Data Warehouse (DW) is a repository of large amounts of organized data consolidated from multiple sources. DWs are relational databases designed for analytical reporting and timely decision-making in organizations. The data used for this purpose is isolated from the source transaction data and optimized for analysis, so analytical workloads do not affect the main business systems. When an organization introduces any business change, the DW can be used to analyse the effects of that change; thus a DW can also be used to monitor processes beyond decision-making.

Data Warehouses are mainly read-only systems, since operational data is kept separate from the data warehouse. This provides a good query-writing environment for retrieving large volumes of data. The DW therefore acts as the backend engine for Business Intelligence tools that display reports and dashboards to business users. It is widely used in the banking, financial, and retail sectors, among others.
A data warehouse is a central repository that stores integrated and
organized data from various sources within an organization. It is
designed for querying, analysis, and reporting, providing a
consolidated view of the organization's data for decision-making
purposes.

Key characteristics of a data warehouse include:

1. Integration: Data from multiple sources, such as transactional databases, spreadsheets, and external systems, are integrated into the data warehouse using Extract, Transform, Load (ETL) processes. This ensures consistency and uniformity in the stored data.

2. Consolidation: Data warehouses consolidate historical and current data from different operational systems into a single, unified repository. This allows users to analyze trends, patterns, and relationships across the organization's data.

3. Subject-Oriented: Data warehouses are organized around subject areas or business processes, rather than specific applications or transactions. This enables users to focus on analyzing data from a business perspective, such as sales, marketing, finance, or operations.

4. Time-Variant: Data warehouses maintain historical data over time, allowing users to perform time-based analysis and track changes in business metrics and performance indicators.

5. Non-Volatile: Data in a data warehouse is typically read-only and non-volatile, meaning that data is not frequently updated or changed. Instead, new data is periodically loaded into the warehouse through scheduled ETL processes.

6. Query and Analysis: Data warehouses provide powerful query and analysis capabilities, allowing users to perform complex queries, generate reports, and create visualizations to gain insights into the organization's data.
HISTORY BEHIND DATA WAREHOUSING

The data warehouse is a core repository that aggregates data, collecting and grouping it from various sources into a central integrated unit. The data in the warehouse can be retrieved and analysed to generate reports or to find relationships between the datasets, which supports the growth of many industries. Data warehousing falls under Business Intelligence. The data warehouse is designed to provide timely information. Storage of data has evolved from simple magnetic tapes to integrated data warehouses. This article gives an overview of the history of data warehousing.

Early mechanisms to store data:

Early methods of holding data started with punched cards and paper tapes. Then magnetic tape was developed. Though data can be written and rewritten on magnetic tape, it is not a stable medium for holding data. Disk storage then came into existence, where large amounts of data can be stored and accessed.
DBMS in disk storage:

Later, DBMS (Database Management Systems) were integrated with disk storage to store data directly on the disk itself. The main advantage of using a DBMS is that data can be located quickly. Its features include locating and deleting data and resolving collisions when two different records map to the same location. The physical storage can be extended when the data exceeds the storage limit.

Online Applications:

Online applications arrived after the use of DBMS on disk storage. Online applications are the products of online processing and have many uses in commercial industry, for example retail and sales processing, ticket reservation systems, and Automated Teller Machine processing. Online applications play an important role today because they are so intertwined with business operations. But they have a drawback that end-users of the applications point out: since there is an enormous amount of data, end-users find it difficult to retrieve the data they need, and even when they obtain it, they are not sure whether it is correct or accurate because of the constant growth of the data.

Fourth Generation Technologies (4GL) and Personal computers:


The motive of 4GL technology was to give end-users direct access to data, programming languages, and system development without the involvement of the IT department. The same happened with personal computers: individuals could bring their own personalized systems into the business firm and access the specific data available to them. This reduced the need for a centralized technology department to provide requested data to users. Spreadsheets are a good example. But this approach has its drawbacks: the data retrieved may be incomplete, misleading, or wrong, and the results lack rigour because of missing documentation and the existence of multiple versions of the same data.

Spider web environment:

The spider web environment ended up as a dilemma for end-users and IT professionals because of its unwieldy nature and complexity. It is called a spider web environment because the many connecting lines between systems resemble the threads of a spider web. Though data can be retrieved, efficiency and accuracy are very low. These severe drawbacks created the need to build an information architecture centred on a data warehouse.

Evolution of Data warehouse environment:

As corporations shifted from the spider web to the data warehouse environment, there was a major change in the usual techniques for storing data. Before the introduction of the data warehouse, it was thought that a single database should serve all purposes for data. After the advent of the data warehouse, it became evident that different types of databases serve different purposes.

A data warehouse is a place where information is processed into both integrated and granular forms of data, along with history. Though not all warehouses are integrated, an integrated data warehouse has the benefit of providing an enterprise-wide view of a company. Granular data has the benefit of allowing the same data to be looked at in different ways: a set of data can be viewed from a marketing perspective, a finance perspective, or an accounting perspective. Data warehouses are used to store many years of historical data.

Challenges of data warehouse:

➢ First is the integration of data, which is the most difficult and time-consuming step, as one needs to dig into the corporation's old legacy systems to derive useful integrated data. It is a painful step, but it is worthwhile.
➢ Data warehousing produces a high volume of data, which makes the process tedious. So there is a temptation to get rid of old data; but for analysis that data is so valuable that it can't be ignored.
➢ Data warehouses can't be created all at once like other operational applications. They must be developed iteratively, one step at a time.
Reasons for the development of Data Warehouse 2.0
Environment (DW 2.0):

The earlier techniques evolved considerably and culminated in DW 2.0. We need to look back to understand the forces that initiated the DW 2.0 architecture. Some of them are given below.
• End-users' demand for a new system or architecture.
• Economic affordability.
• Online processing techniques.
• High storage capacity.
• The need for integrated data.
• The need to include unstructured data in the analytics mix.

Data Warehouse evolution (from business perspective):


• The output of the earlier techniques was in an unrefined format; for example, it is a hectic process to read hexadecimal dumps just to find a small piece of information among cryptic codes.
• End-users have become more demanding, so they want more sophisticated output and more immediate sources of output.
• For online processing techniques to work, the data needs to be integrated; historical data is also needed for analysis.
• The first-generation data warehouse came into existence due to end-users' thirst for corporate data.

Mutated forms of data warehouse:

Due to the appealing features of the data warehouse, business consultants have mutated the concept of the data warehouse to fit their corporate needs. Some variations of the data warehouse are:

• The Active data warehouse: Online processing and updates take place in this warehouse. Its major feature is very high transaction performance. The demerits of this mutated warehouse are that the integrity of transactions is questionable, statistical processing is heavy, and large capacities are wasted, which in turn increases the operational cost.
• The Federated data warehouse: In this approach, because integrating the data is highly complex, that step is skipped. Technically, a warehouse does not exist in this approach. The scheme is to construct a data warehouse "magically" by merging the corporation's old legacy systems to fetch and process data simultaneously. This approach seems attractive because it is less work, but it is more a delusion than a solution. It has numerous pitfalls: bad performance, limited history, absence of data integration, complexity, and inherited granularity, which gives the end-user poor results when requesting data at different levels of granularity from the federated warehouse.
• The Star schema data warehouse: This approach requires the construction of dimension tables and fact tables. It provides many of the benefits of a data warehouse but has its limitations. It is designed only for limited requirements, and when the requirements change the data warehouse becomes brittle. The level of granularity keeps changing due to multiple schema formations, which puts the integrity of the data in question. It cannot be extended beyond a certain limit and is designed for only one type of audience.
• Data Mart data warehouse: Online analytical processing consultants first build a data mart, which gives the chance to see, for example, product sales without the complications of building an actual data warehouse. The demerits include non-extensibility, a high rate of errors, the impossibility of reconciling data, and extract proliferation, which makes extracting legacy data difficult. Another fact about this approach is that there is no way a data mart can be converted into a data warehouse: the core of each is different, and one cannot be mutated into the other.

Why is Data Warehousing Crucial?


The main reasons why data warehousing is crucial are as follows:

Data warehouses combine all operational data from several heterogeneous sources of "different formats" and, through the process of extract, transform, and load (ETL), load the data into the DW in a "standardized dimensional format" across an organization.

A data warehouse maintains both current and historical data for analytical reporting and fact-based decision making.

Improve your business decisions. Successful business leaders develop data-driven strategies and rarely make decisions without considering the facts. Data warehousing makes it easier for corporate decision-makers to access different data sets faster and more efficiently, and to derive insights that will guide their business and marketing strategies.

Data warehouse platforms enable business leaders to access their organization's historical activities and evaluate initiatives that have been successful or unsuccessful in the past. It enables executives to see where they can reduce costs, maximize efficiency, and increase sales to boost profit.

Data Warehouse Applications


In data warehousing, Business Intelligence (BI) is used for decision-making. BI plays a major role once the data has been loaded into the DW, by analysing it and presenting it to business users. The term "data warehouse applications" effectively describes how the data can be manipulated and utilized.

Data warehouse applications fall under three categories: information processing, analytical processing, and data mining.

Information Processing: A data warehouse makes it possible to process the information it stores. Data can be processed through querying, basic statistical analysis, and reporting.

Analytical Processing: The information stored in a data warehouse can be processed analytically. The data can be analysed with the help of basic OLAP (Online Analytical Processing) operations, such as slice-and-dice, drill down, drill up, and pivoting.

Data Mining: Through data mining, knowledge can be discovered by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. Results from data mining can be presented visually.
1. Information Processing: This is a type of application in which the data warehouse enables direct interaction with the data it stores, using direct queries on the data together with basic statistical analysis.

The tools which DW supports for information processing are:

1.1) Query Tools: Using query tools, the user can explore the data and generate reports or graphics in accordance with the business requirements.

1.2) Reporting Tools: Reporting tools are used when the business
wants to see the results in a certain format on a regular basis, such as
daily, weekly, or monthly. This type of report can be saved and
retrieved at any time.

1.3) Statistics Tools: Statistics tools are used when the business wants to examine data from a broader perspective. By understanding these strategic results, businesses can make predictions and draw conclusions.

2. Analytical Processing: This is an application that allows the analysis of data stored in a data warehouse. Slice-and-Dice, Drill Down, Roll Up, and Pivoting are some of the operations that can be used to evaluate the data.

2.1) Slice-and-Dice: A data warehouse enables slice-and-dice operations to evaluate data at several levels and from a variety of perspectives. Internally, the drill-down mechanism is used for the slice-and-dice action. Slicing is a technique for manipulating dimensional data: if we focus on a single area as part of the business requirement, slicing evaluates the dimensions of that specific region according to the criteria and returns the findings. Dicing is an analytic operation that provides a variety of viewpoints by zooming in on a selected set of attribute values across all dimensions; one or more consecutive slices are used to determine the dimensions.

2.2) Drill Down: Drill down is an operation for traversing from a summary number down to more detailed levels when the business wishes to see the detail behind a summary figure. This gives a good indication of what is going on and where the company should concentrate its efforts.

2.3) Roll up: Roll up is the polar opposite of drill-down. Roll up comes
into play if the business needs any summary data. By advancing up the
dimensional structure, it aggregates the detail level data. Roll-ups are
used to examine a system’s development and performance.
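As an illustration of these operations, here is a minimal Python sketch (not from the source; the sales table, column names, and figures are hypothetical) showing drill-down, roll-up, and a slice on a small data set:

```python
# A minimal sketch of drill-down, roll-up, and slice operations
# on a small, made-up sales table.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2023, 2023, 2023],
    "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "region":  ["North", "South", "North", "North", "South", "North"],
    "amount":  [100, 150, 120, 130, 160, 170],
})

# Drill down: view detail at the (year, quarter, region) level.
detail = sales.groupby(["year", "quarter", "region"])["amount"].sum()

# Roll up: aggregate the detail upward to the year level.
rollup = sales.groupby("year")["amount"].sum()

# Slice: fix one dimension (year = 2022) and look at the remaining dimensions.
slice_2022 = sales[sales["year"] == 2022].groupby(["quarter", "region"])["amount"].sum()

print(detail, rollup, slice_2022, sep="\n\n")
```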

3. Data Mining: This is a type of application in which the data warehouse facilitates knowledge discovery, and the findings are visualized using visualization tools. As the amount of data grows in various industries, it becomes difficult to query and drill down through the data warehouse to acquire all potential insights. Data mining then enters the scene to help with knowledge discovery: it examines the data together with all the previous associations and results, and forecasts the future. Hidden patterns, correlations, classifications, and predictions can be found in the data.

Figure 3: Applications of data warehouses

Benefits and Disadvantages of Data Warehouses

Benefits:
When a data warehouse system is operational, a business gains the
following advantages:

• Business Intelligence Enhancement

• System and query performance improvement.

• Multiple Sources of Business Intelligence

• Data Access in Real-Time

• Intelligence from the past

• Exceptional Return on Investment

Disadvantages:

Despite the fact that it is a very successful system, it is useful to be aware of some of its flaws:

• Creating a Data Warehouse is an extremely time-consuming and difficult task.

• The cost of maintenance is high because the system requires constant improvements.

• Developers, testers, and users should have adequate training in order to comprehend the DW system.

• It is possible that sensitive data won't be able to be fed into the DW for decision-making.

• Any restructuring of a source business process or system has a significant impact on the DW.
Introduction to Multidimensional Data Model
Before analysing data, we must first treat it and store it somewhere. However, depending on how that data is stored, it can become a long-term problem, with high maintenance and processing costs, or even useless data that just wastes money on storage and processing.
For this reason, we must organize our data before passing it on to our data warehouse. One of the solutions is the multidimensional data model; however, there are some concepts you will need to know beforehand.

OLAP
"Online Analytical Processing (OLAP)" is a tool for analysing and processing data interactively, widely used to handle massive amounts of information across the various dimensions of a data warehouse.
Example: when analysing temporal data, we may want views by year, quarter, semester, or day; this interaction with the user is handled by OLAP.
OLTP
"Online Transaction Processing (OLTP)" refers to systems that record all the operational actions of a business, guaranteeing their success. This type of data is generated massively every day.
Example: a bank transaction; if it fails, the whole action must be reversed, and if it succeeds, it must be recorded and made immutable.

Data warehouse

This is a database for analysis only; think of it as a copy of the operational database but optimized (let's talk about it soon) for business intelligence.
Data granularity

We must pay close attention to data granularity, as it directly affects the volume of data storage, the speed of search, and the level of detail of the information. In brief, when we have high granularity of data it means we have fewer details of the data; when we have low granularity, we have more details of the data.

Example: imagine we have a sales table where salespeople's names appear repeatedly. This table has low granularity because it holds a lot of detail, so much that it can make it difficult to find which salesperson had the highest number of sales; note that the more detailed the data in a table, the larger the data volume and therefore the longer the analysis time. If instead we have a table of sellers with each one's total sales, we have data with fewer details, higher granularity, and lower resource consumption.
We can also understand this condensation and detailing as Drill
Down and Drill UP.

Drill Down — the granularity is reduced and the level of detail is increased.

Drill Up — the granularity is increased and the level of detail is decreased.

Therefore, it is clear that we must condense some data to avoid redundancy and to save processing and space. We should think of Resources x Data = Information; if our data is poorly modeled, we will throw away precious resources, so our equation becomes Resources x Data = waste of time and money.

Multidimensional data Analysis

In this analysis we use data structured in the form of a cube (each side of the cube is a dimension). The multidimensional model is the standard in analysis tools, for example when we run analytical queries with OLAP. This model gives higher query performance and also makes it easier to create complex queries.

When the scope of a project is small, this model allows a more agile implementation.
Structure

To visualize this model we use a cube, where tables are associated, summarized, or aggregated to return some metrics (sales per year, for example). Each table is seen as a dimension; together they form a cube that can have low or high granularity, always depending on the requirements of each project.

Data cube is a data structure for storing and analysing large amounts
of multidimensional data (Pedersen, 2009b).

Fact Tables
Fact tables hold the objects to be analyzed. They are composed of measures, the context of each dimension, and foreign keys used to link the dimensions to the fact table.

Example: in our data warehouse we need to create a sales fact table; a minimal sketch of one possible structure follows.
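The sketch below (hypothetical table and column names, not from the source) shows a sales fact table holding measures and foreign keys, joined to two dimension tables so the facts can be aggregated by any dimension attribute:

```python
# Minimal sketch of a star-style sales fact table with two dimensions,
# using pandas and hypothetical column names.
import pandas as pd

dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Laptop", "Phone"],
    "category": ["Computers", "Mobile"],
})

dim_date = pd.DataFrame({
    "date_id": [20230101, 20230102],
    "year": [2023, 2023],
    "quarter": ["Q1", "Q1"],
})

# Fact table: measures (quantity, revenue) plus foreign keys to the dimensions.
fact_sales = pd.DataFrame({
    "date_id": [20230101, 20230101, 20230102],
    "product_id": [1, 2, 1],
    "quantity": [3, 5, 2],
    "revenue": [3000.0, 2500.0, 2000.0],
})

# Joining facts to dimensions lets us aggregate by any dimension attribute.
sales_by_category = (
    fact_sales.merge(dim_product, on="product_id")
              .merge(dim_date, on="date_id")
              .groupby(["year", "category"])["revenue"].sum()
)
print(sales_by_category)
```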

Types of Multidimensional Models


Star layout
It is the simplest model, where the fact table is centralized and surrounded by dimensions holding a large amount of data without redundancy; these dimensions are linked directly to the fact table through its foreign keys.
Snowflake layout
It is an extension of the star schema, with the dimensions further normalized to reduce redundancy; in the end, we have a greater number of dimension tables, which leads to more complex queries and lower performance. Models such as the snowflake are arranged so that at each end of the star, a dimension becomes the center of another star (MACHADO, 2013).
Constellation scheme
The fact constellation is a grouping of dimensions shared by multiple fact tables; its main disadvantage is its complexity.

A Multidimensional Data Model can be defined as a method for arranging the data in the database, with better structuring and organization of the contents in the database. Unlike a one-dimensional structure such as a list, the Multidimensional Data Model can have two or more dimensions of items from the database system. It is typically used in organizations for deriving analytical results and generating reports, which serve as a main source for important decision-making processes. This model is typically applied to systems that operate with OLAP (Online Analytical Processing) techniques.
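As a minimal illustration (assumed example data, not from the source), a two-dimensional view of such a model can be built with a pivot table, where rows and columns act as dimensions and each cell is an aggregated measure:

```python
# Two-dimensional view of a data cube: rows and columns are dimensions,
# cell values are an aggregated measure. Data values are made up.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["Laptop", "Phone", "Laptop", "Phone"],
    "revenue": [3000, 2500, 1800, 2200],
})

cube_view = sales.pivot_table(
    index="region", columns="product", values="revenue", aggfunc="sum"
)
print(cube_view)
```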
How does Multidimensional Data Model work?
Like any other system, the Multidimensional Data Model works based on predetermined steps, in order to keep the pattern consistent throughout the industry and to enable reuse of already designed database systems. To create a Multidimensional Data Model, every project should go through the following phases:
• Congregating the requirements from the client: Similar to the
other software applications, a Data Model also requires the
precise requirement from the client. Most of the time, the client
might not know what could be accomplished with the selected
technology. It is the software professional’s duty to provide
clarity on to what extent a requirement can be achieved with the
selected technology, and elaborately collect the complete
requirement.
• Categorizing the various modules of the system: After the
process of collecting the entire requirement, the next step is to
identify and categorize each of the requirements under the module
where they belong. Modularity helps in better management, and
also makes it trouble-free to implement, one at a time.
• Spotting the various dimensions based on which the system
needs to be designed: Once the separation of various
requirements and moving them to the matching modules are
completed, the next step is to identify the main factors, from the
user’s point of view. These factors can be termed as the
dimensions, based on which the multidimensional data model can
be created.
• Drafting the real-time dimensions and the corresponding
properties: As a part of next step, in the process of the Multi-
Dimensional Data Model, the dimensions identified in the
previous step can be further used for recognizing the related
properties. These properties are termed as the ‘attributes’ in the
database systems.
• Discovering the facts from the already listed dimensions and
their properties: From the initial requirement gathering, the
dimensions can be a mix of dimensions and facts. It is a
significant step to distinguish and segregate the facts from the
dimensions. These facts play a great role in the structure of the
Multi-Dimensional Data Models.
• Constructing the Schema to place the data, with respect to the
information gathered from the above steps: Based on the
information collected so far, the elaborate requirements, the
dimensions, the facts, and their respective attributes, a Schema
can be constructed. There are many types of Schemas, from
which the most suitable type of schema can be chosen.
Advantages and Disadvantages of Multidimensional Data Model
Below are the advantages and disadvantages:
Advantages
• Multi-Dimensional Data Models are workable on complex
systems and applications, unlike the simple one-dimensional
database systems.
• The Modularity in this type of Database is an encouragement for
projects with lower bandwidth for maintenance staff.
• Overall, organizational capacity and structural definition of the
Multi-Dimensional Data Models aids in holding cleaner and
reliable data in the database.
• A clearly defined construction of the data placements makes it uncomplicated in situations where one team constructs the database, another team works on it, and some other team handles maintenance. It serves as a self-learning system if and when required.
• As the system is fresh and free of junk, the efficiency of the data
and performance of the database system is found to be advanced
& elevated.
Disadvantages
• As the Multi-Dimensional Data Model handles complex systems,
these types of databases are typically complex in nature.
• Being a complex system means the contents of the database are
huge in the amount as well. This makes the system to be highly
risky when there is a security breach.
• When the system is bogged down by heavy operations on the Multi-Dimensional Data Model, the performance of the system is affected greatly.
• Though the end product in a Multi-Dimensional Data Model is
advantageous, the path to achieving it is intricate most of the time.
In today's digital world, a massive amount of data is generated every second: roughly 9,000 tweets, 900 Instagram photos, 80,000 Google searches, and 3 million emails, all happening within the blink of an eye. Not all of this data is neat and ready to use. This is where data scientists come in. Their job is to sort through this data mess and clean it up, like tidying up a cluttered room. Data cleaning is like removing the dust and making everything neat and organized. Clean data is essential for accurate analysis and for getting meaningful insights. Let us learn more about data cleaning in data mining.
Data Cleaning in Data Mining

A number of surveys conducted with data scientists suggest that around 80% of their work time is focused on obtaining, cleaning, and organizing data, while only around 3% of their time is dedicated to building machine learning or data science models.
What is Data Cleaning in Data Mining?

Data cleaning is the detailed process of removing any incomplete, incorrect, or inconsistent details from a data set. There is no single defined way to clean such data, and the process differs from data set to data set. Usually, data scientists establish and follow a set of data cleaning steps that have historically worked for them and obtain correct results by removing corrupted, incorrectly formatted, duplicate, or mislabeled data.
Stages of Data Cleaning in Data Mining
The goal of data cleansing is to have better organization of the data of a company or business, so that the information can be used efficiently for planning strategies. Below are some of the different data cleaning stages in data mining.
Analyze Existing Data
The first thing to do in data cleansing is to analyze the existing data and determine the faults that need to be eliminated. This stage should combine a manual and an automatic process to ensure accuracy; in other words, in addition to making an exhaustive manual review of the data, it is important to use specialized programs to detect erroneous metadata or data quality problems.
Clean Data in A Separate Spreadsheet
Make a copy of your data set on a spreadsheet before you make any
final changes. This is a preventive step in case your data set gets
corrupted by any chance.
Remove Any Whitespaces from the Data
Whitespace or extra spaces often lead to miscalculations, which is a very common issue when handling huge databases. For example, "This is a Dog" and "This is a  Dog " (with extra spaces) will be treated as different values. You can use the TRIM function to get rid of such undesired spaces.
Highlight Data Errors
It is possible that you don’t get an error-free data set considering the
huge volumes. Values like #N/A, #VALUE, etc. appear often in raw
data. Using the IFERROR operator and assigning a default value to the
field in case of any errors in calculation can be a useful step in your
data cleaning process.
Remove Duplicates
Duplicate entries are very common. In MS Excel, you can use "Conditional Formatting" to highlight duplicate values and the "Remove Duplicates" command to remove any duplicate entries.
Use Data Cleansing Tools
Data Cleansing Tools can be very helpful if you are not confident of
cleaning the data on your own or have no time to clean up all your data
sets. You might need to invest in those tools, but it is worth the
expenditure!
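For readers working outside a spreadsheet, the same cleaning steps can be sketched in Python with pandas. This is an illustrative sketch only (the data and column names are made up), mirroring the TRIM, error-handling, and duplicate-removal steps above:

```python
# Illustrative data cleaning sketch with pandas: work on a copy, strip
# whitespace, handle error values, and drop duplicates.
import pandas as pd

df = pd.DataFrame({
    "name":  [" Alice ", "Bob", "Bob", "Carol"],
    "sales": ["100", "200", "200", "#N/A"],
})

clean = df.copy()                                    # keep the original data set intact
clean["name"] = clean["name"].str.strip()            # remove leading/trailing whitespace (TRIM)
clean["sales"] = pd.to_numeric(clean["sales"], errors="coerce")  # error values become NaN
clean["sales"] = clean["sales"].fillna(0)            # assign a default value (like IFERROR)
clean = clean.drop_duplicates()                      # remove duplicate entries

print(clean)
```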
Usage of Data Cleaning in Data Mining
Let’s understand what is the use of data cleaning in data mining.
Data Integration
Since it is difficult to ensure data quality in low-quality data, data
integration has an important role to play to solve this problem. Data
Integration is the process of combining data from different data sets
into a single one. This process uses data cleansing tools to ensure that
the embedded data set is standardized and formatted before it moves to
the final destination.
Data Migration
Data migration is the process of moving data from one system to another, one format to another, or one application to another. While the data is on the move, it is important to maintain its quality, security, and consistency, to ensure that the resulting data has the correct format and structure, without any discrepancies, at the destination.
Data Transformation
Before the data is uploaded to a destination, it needs to be transformed.
This is only possible through data cleaning, which considers the system
criteria of formatting, structuring, etc. Data transformation processes
usually include using rules and filters before further analysis. Data
transformation is integral to most data integration and management
processes. Data cleansing tools help to clean the data using the built-in
transformations of the systems.
Data Debugging in ETL Processes
Data cleansing is crucial in preparing data during extract, transform,
and load (ETL) for reporting and analysis. Data cleansing ensures that
only high-quality data is used for decision-making and analysis. For
example, a retail company receives data from various sources, such as
a CRM or ERP system, that contain misinformation or duplicate data.
A good data cleansing or debugging tool would detect inconsistencies in the data and rectify them. The purged data is then converted to a standard format and uploaded to a target database or data warehouse.
Data Integration
Data integration is the process of bringing data from disparate sources
together to provide users with a unified view. The premise of data
integration is to make data more freely available and easier to consume
and process by systems and users. Data integration done right can
reduce IT costs, free-up resources, improve data quality, and foster
innovation all without sweeping changes to existing applications or
data structures. And though IT organizations have always had to
integrate, the payoff for doing so has potentially never been as great as
it is right now.

Companies with mature data integration capabilities have significant advantages over their competition, which include:
• Increased operational efficiency by reducing the need to manually transform and combine data sets.
• Better data quality through automated data transformations that apply business rules to data.
• More valuable insight development through a holistic view of data that can be more easily analyzed.
A digital business is built around data and the algorithms that process
it, and it extracts maximum value from its information assets—from
everywhere across the business ecosystem, at any time it is needed.
Within a digital business, data and related services flow unimpeded, yet
securely, across the IT landscape. Data integration enables a full view
of all the information flowing through an organization and gets your
data ready for analysis.
The evolution of data integration
The scope and importance of data integration has completely changed.
Today, we augment business capabilities by leveraging standard SaaS
applications, all while continuing to develop custom applications. With
a rich ecosystem of partners ready to leverage an organization’s
information, the information about an organization’s services that gets
exposed to customers is now as important as the services themselves.
Today, integrating SaaS, custom, and partner applications and the data
contained within them, is a requirement. These days, an organization
differentiates by combining business capabilities in a unique way. For
example, many companies are analyzing data in-motion and at-rest,
using their findings to create business rules, and then applying those
rules to respond even faster to new data. Typical goals for this type of
innovation are stickier user experiences and improved business
operations.
How does data integration work?
One of the biggest challenges organizations face is trying to access and make sense of the data that describes the environment in which they operate. Every day, organizations capture more and more data, in a
variety of formats, from a larger number of data sources. Organizations
need a way for employees, users, and customers to capture value from
that data. This means that organizations have to be able to bring
relevant data together wherever it resides for the purposes of supporting
organization reporting and business processes.
But, required data is often distributed across applications, databases,
and other data sources hosted on-premises, in the cloud, on IoT devices,
or provided via 3rd parties. Organizations no longer maintain data
simply in one database, instead maintaining traditional master and
transactional data, as well as new types of structured and unstructured
data, across multiple sources. For instance, an organization could have
data in a flat-file or it might want to access data from a web service.
The traditional approach of data integration is known as the physical
data integration approach. And that involves the physical movement of
data from its source system to a staging area where cleansing, mapping,
and transformation takes place before the data is physically moved to a
target system, for example, a data warehouse or a data mart. The other
option is the data virtualization approach. This approach involves the
use of a virtualization layer to connect to physical data stores. Unlike
physical data integration, data virtualization involves the creation of
virtualized views of the underlying physical environment without the
need for the physical movement of data.
A common data integration technique is Extract Transform and Load
(ETL) where data is physically extracted from multiple source systems,
transformed into a different format, and loaded into a centralized data
store.
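As a rough illustration of the ETL pattern just described (not from the source; the file name, column names, and table name are hypothetical), an ETL job can be sketched as three small functions:

```python
# Minimal ETL sketch: extract from a CSV source, transform into a standard
# format, and load into a SQLite table standing in for a data warehouse.
import csv
import sqlite3

def extract(path):
    # Read raw rows from the source system (a CSV file here).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Standardize: trim names, cast amounts to float, drop incomplete rows.
    out = []
    for r in rows:
        if r.get("customer") and r.get("amount"):
            out.append({"customer": r["customer"].strip(),
                        "amount": float(r["amount"])})
    return out

def load(rows, conn):
    # Load the cleaned rows into the centralized store.
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (:customer, :amount)", rows)
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")
    load(transform(extract("source_sales.csv")), conn)
```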
Considerations for improving simple integration
The value gained from implementing data integration technology is, first and foremost, the saving from no longer having to manually integrate data. There are other benefits as well, including avoiding custom coding for the integration. Whenever they can, organizations should look to use an integration tool provided by a vendor rather than write custom integration code. The reasons for doing this are (a) improved data quality, (b) optimal performance, and (c) time savings.
Organizations could derive much greater value by adding the following
additional goals to their integration maturity roadmaps:
Streamline development
Choose a solution that lets you create a catalog of formats and sub-
processes for reuse, especially non-functional processes such as
logging, retries, etc. The ability to test any integration logic on-the-fly
will also dramatically reduce the time needed for implementation and
maintenance.
Configuration
Data integration processes are configured to connect applications and
systems. These configurations need to reflect any change immediately,
ensure the right systems are being used, and propagate changes across
various environments (development, test, quality assurance, and
production). Most organizations report that they are still changing
configuration parameters manually within their integrated development
environment (IDE), a costly human process that may also involve
tampering with integration logic. The better alternative, accessing and
managing the variables in scripts or deployment interfaces, allows fully
automated deployments that reduce project duration.
Testing
Testing is at the core of data integration development. It verifies the
data integration technology and target systems, so it should be
performed immediately, as soon as the developer creates or updates
logic. However, it’s clear that most organizations have to deploy
processes before they can test, which causes delays. An IDE allowing
immediate debugging dramatically shortens integration process
development. Moreover, because certain data integration processes are
so critical, they need to be tested in environments very much like the
production environment, and updates to them need to be tested for non-
regression. This testing requires test scenarios to be written. Many
organizations have to develop this logic on top of the integration
process logic, as well as the probes to capture results. This increases
development duration and costs. Using an API to inject data and record
test scenarios, or an integration testing solution, can dramatically
reduce project duration.
Establish a common data model
In addition to limiting technologies, building a common data model
eases future integrations because all integration processes will speak
the same language. The business will also be helped because services
and events involving business objects can be easily created, and
subscribing to the right events provides increased business visibility.
Savings from leveraging past investments
Many legacy applications are still a vital part of business processes and
hold important data that needs to be integrated with all the other
systems in your environment. Though their core business
functionalities provide great assets for reuse in other services, many of
their components and capabilities have since been replaced by other
applications. Data integration can help you infuse the data in your
legacy systems into your more modern environments.
Typically, data integration is used as a prerequisite for further processing of the data, most notably analytics. You need to bring data together to facilitate analytical reporting and to give users a full, unified view of all the information flowing through their organization. A good way to think of data integration is "create once and use many times". For instance, you don't want to have to enter an order into one system manually; you want to enter it once and have one system pass it to another. That is the main value of data integration.
Why is Data Integration Important?
Overall, data integration helps to transfer and sync data from different
systems, types and formats between systems and applications. It’s not
a one-and-done event, but a continuous process that keeps evolving as
business requirements, technologies and frameworks change. As
organizations generate more data, it provides an opportunity for better
business insights. Your data integration strategy determines how much
value you can get out of your data, and which data integration type will
work best for your use cases and initiatives.
Data integration challenges
Common challenges that data management teams encounter on data
integration include the following:
• Keeping up with growing data volumes.
• Unifying inconsistent data silos.
• Dealing with the increasingly broad array of databases and other
data platforms in IT infrastructures.
• Integrating cloud and on-premises data.
• Resolving data quality issues.
• Handling the number of systems that need to be integrated and
their distributed nature, especially in large organizations with
global operations.
The amount of data being generated and collected by organizations
creates particularly big integration challenges. Data volumes continue
to grow quickly, and the rate of that growth is only likely to increase
as big data applications expand, the use of low-cost cloud object
storage services rises and IoT develops further. With so much data
involved, successfully planning and managing the required data
integration work is a complicated process.
Data integration tools and techniques
Developers can hand-code data integration jobs, typically in the form
of scripts written in Structured Query Language (SQL), the standard
programming language used in relational databases. For many years,
that was the most common approach to integration. But packaged data
integration tools that automate, streamline and document the
development process have become available from various IT vendors.
Open source integration tools are also available, some free and others
in commercial versions.
Prominent data integration vendors include the following companies,
as well as others:
• AWS.
• Boomi.
• Cloud Software Group's IBI and Tibco Software units.
• Google Cloud.
• Hitachi Vantara.
• IBM.
• Informatica.
• Microsoft.
• Oracle.
• Precisely.
• Qlik.
• SAP.
• SAS Institute.
• Software AG.
• Talend.
ETL tools were among the first data integration software products,
reflecting the ETL method's central role in the data warehouse systems
that emerged in the mid-1990s. Now, many vendors offer more
expansive data integration platforms that also support ELT, CDC, data
replication, big data integration and other integration methods; in
addition, associated data quality, data catalog and data
governance software is often included as part of the platforms.
Some of the integration platform vendors provide data virtualization
tools, too. They're also available from data virtualization specialists and
other data management vendors, including AtScale, Data Virtuality,
Denodo Technologies, IBM's Red Hat unit and Stone Bond
Technologies.
The growth of cloud computing has created new needs for
organizations to integrate data in different cloud applications and
between cloud and on-premises systems. That led to the development
of integration platform as a service (iPaaS), a product category that
provides cloud-based integration tools. Most of the major data
integration platform vendors now also offer iPaaS technologies; other
companies in the iPaaS market include Jitterbit, Salesforce's MuleSoft
unit, SnapLogic and Workato.
Data Transformation
Data transformation in data mining refers to the process of converting
raw data into a format that is suitable for analysis and modeling. The
goal of data transformation is to prepare the data for data mining so that
it can be used to extract useful insights and knowledge.
Data transformation typically involves several steps, including:
1. Data cleaning: Removing or correcting errors, inconsistencies,
and missing values in the data.
2. Data integration: Combining data from multiple sources, such
as databases and spreadsheets, into a single format.
3. Data normalization: Scaling the data to a common range of
values, such as between 0 and 1, to facilitate comparison and
analysis.
4. Data reduction: Reducing the dimensionality of the data by
selecting a subset of relevant features or attributes.
5. Data discretization: Converting continuous data into discrete
categories or bins.
6. Data aggregation: Combining data at different levels of
granularity, such as by summing or averaging, to create new
features or attributes.
Data transformation is an important step in the data mining process as it helps to ensure that the data is in a format suitable for analysis and modeling, and that it is free of errors and inconsistencies. Data transformation can also help to improve the performance of data mining algorithms by reducing the dimensionality of the data and by scaling the data to a common range of values.
The data are transformed in ways that are ideal for mining the data. The
data transformation involves steps that are:
1. Smoothing: Smoothing is a process used to remove noise from a dataset using algorithms. It allows important features of the dataset to stand out and helps in predicting patterns. When collecting data, it can be manipulated to eliminate or reduce variance or other forms of noise. The concept behind data smoothing is that it can identify simple changes to help predict trends and patterns. This helps analysts or traders who need to look at a lot of data, which can often be difficult to digest, to find patterns they would not otherwise see.
2. Aggregation: Data aggregation is the method of collecting, storing, and presenting data in a summary format. The data may be obtained from multiple data sources and integrated into a single data analysis description. This is a crucial step, since the accuracy of data analysis insights is highly dependent on the quantity and quality of the data used; gathering accurate data of high quality and in large enough quantity is necessary to produce relevant results. Aggregated data is useful for everything from decisions about financing or the business strategy of a product to pricing, operations, and marketing strategies. For example, sales data may be aggregated to compute monthly and annual totals.
3. Discretization: Discretization is the process of transforming continuous data into a set of small intervals. Most real-world data mining activities involve continuous attributes, yet many existing data mining frameworks are unable to handle them. Even when a data mining task can manage a continuous attribute, its efficiency can be significantly improved by replacing the continuous attribute with its discrete values. For example, the values can be grouped as (1-10, 11-20, ...) or age can be grouped as young, middle-aged, senior.
4. Attribute Construction: Where new attributes are created &
applied to assist the mining process from the given set of attributes.
This simplifies the original data & makes the mining more efficient.
5. Generalization: It converts low-level data attributes to high-level
data attributes using concept hierarchy. For Example Age initially in
Numerical form (22, 25) is converted into categorical value (young,
old). For example, Categorical attributes, such as house addresses, may
be generalized to higher-level definitions, such as town or country.
6. Normalization: Data normalization involves converting all data variables into a given range. Techniques that are used for normalization are:
• Min-Max Normalization: This transforms the original data linearly. Suppose min_A is the minimum and max_A is the maximum value of an attribute A, v is the old value, and v' is the value obtained after normalizing v into the new range [new_min_A, new_max_A]. Then
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
• Z-Score Normalization: In z-score normalization (or zero-mean normalization), the values of an attribute A are normalized based on the mean of A and its standard deviation. A value v of attribute A is normalized to v' by computing
v' = (v - mean_A) / std_A
• Decimal Scaling: This normalizes the values of an attribute by changing the position of their decimal points. The number of positions by which the decimal point is moved is determined by the maximum absolute value of attribute A. A value v of attribute A is normalized to v' by computing
v' = v / 10^j
where j is the smallest integer such that Max(|v'|) < 1. Suppose the values of an attribute P vary from -99 to 99. The maximum absolute value of P is 99, so to normalize the values we divide the numbers by 100 (i.e., j = 2), and the values come out as 0.98, 0.97, and so on.
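The three techniques can be illustrated with a small Python sketch (the sample values and the [0, 1] target range are assumptions, not from the source):

```python
# Illustrative sketch of min-max, z-score, and decimal-scaling normalization
# on a made-up attribute.
import numpy as np

v = np.array([-99.0, -50.0, 0.0, 45.0, 99.0])

# Min-max normalization into the new range [0, 1].
new_min, new_max = 0.0, 1.0
min_max = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation.
z_score = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer
# such that the maximum absolute scaled value is below 1.
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal_scaled = v / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")
```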

Data Reduction

Data reduction techniques ensure the integrity of data while reducing the data. Data reduction is a process that reduces the volume of original data and represents it in a much smaller volume. Data reduction techniques are used to obtain a reduced representation of the dataset that is much smaller in volume while maintaining the integrity of the original data. By reducing the data, the efficiency of the data mining process is improved while producing the same analytical results.

Data reduction does not affect the result obtained from data mining.
That means the result obtained from data mining before and after data
reduction is the same or almost the same.

Data reduction aims to represent the data more compactly. When the data size is smaller, it is simpler to apply sophisticated and computationally expensive algorithms. The reduction of the data may be in terms of the number of rows (records) or the number of columns (dimensions).
Techniques of Data Reduction

The following are techniques or methods of data reduction in data mining:

1. Dimensionality Reduction
Whenever we encounter data of weak importance, we keep only the attributes required for our analysis. Dimensionality reduction eliminates attributes from the data set under consideration, thereby reducing the volume of the original data. It reduces data size by eliminating outdated or redundant features. Three methods of dimensionality reduction are described below.
i. Wavelet Transform: In the wavelet transform, a data vector A is transformed into a numerically different data vector A' such that both A and A' are of the same length. It is useful for reducing data because the data obtained from the wavelet transform can be truncated: the compressed data is obtained by retaining only the smallest fragment of the strongest wavelet coefficients. The wavelet transform can be applied to data cubes, sparse data, or skewed data.
ii. Principal Component Analysis: Suppose we have a data set to be analyzed that has tuples with n attributes. Principal component analysis searches for k n-dimensional orthogonal vectors (the principal components, with k ≤ n) that can best be used to represent the data. In this way, the original data can be projected onto a much smaller space, and dimensionality reduction is achieved. Principal component analysis can be applied to sparse and skewed data. (A small code sketch follows this list.)
iii. Attribute Subset Selection: A large data set has many attributes, some of which are irrelevant to data mining and some of which are redundant. Attribute subset selection reduces the data volume and dimensionality by eliminating such redundant and irrelevant attributes. It ensures that we still get a good subset of the original attributes after eliminating the unwanted ones: the resulting probability distribution of the data is as close as possible to the original distribution obtained using all the attributes.
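A small sketch of the principal component analysis method from item (ii) above, using scikit-learn on made-up data (an illustration, not the textbook algorithm):

```python
# Sketch of dimensionality reduction with PCA: project 4-attribute tuples
# onto k = 2 principal components. Data values are made up for illustration.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([
    [2.5, 2.4, 0.5, 1.0],
    [0.5, 0.7, 2.2, 0.3],
    [2.2, 2.9, 0.9, 1.1],
    [1.9, 2.2, 1.5, 0.9],
    [3.1, 3.0, 0.4, 1.2],
])

pca = PCA(n_components=2)         # keep k = 2 orthogonal components
X_reduced = pca.fit_transform(X)  # original tuples cast onto a smaller space

print(X_reduced.shape)                  # (5, 2): same tuples, fewer dimensions
print(pca.explained_variance_ratio_)    # how much variance each component keeps
```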
2. Numerosity Reduction
Numerosity reduction reduces the original data volume and represents it in a much smaller form. This technique includes two types: parametric and non-parametric numerosity reduction.
i. Parametric: Parametric numerosity reduction incorporates
storing only data parameters instead of the original data. One
method of parametric numerosity reduction is the regression and
log-linear method.
o Regression and Log-Linear: Linear regression models the relationship between two attributes by fitting a linear equation to the data set. Suppose we need to model a linear function between two attributes:
y = wx +b
Here, y is the response attribute, and x is the predictor
attribute. If we discuss in terms of data mining, attribute x
and attribute y are the numeric database attributes, whereas
w and b are regression coefficients.
Multiple linear regression lets the response variable y be modeled as a linear function of two or more predictor variables.
The log-linear model discovers the relationship between two or more discrete attributes in the database. Suppose we have a set of tuples presented in n-dimensional space; the log-linear model is then used to study the probability of each tuple in that multidimensional space.
Regression and log-linear methods can be used for sparse data and skewed data.
ii. Non-Parametric: A non-parametric numerosity reduction technique does not assume any model. Non-parametric techniques result in a more uniform reduction, irrespective of data size, but may not achieve as high a degree of reduction as the parametric techniques. Common non-parametric data reduction techniques include histograms, clustering, and sampling (data cube aggregation and data compression are covered separately below).
o Histogram: A histogram is a graph that represents a frequency distribution, describing how often a value appears in the data. A histogram uses the binning method to represent an attribute's data distribution; it uses disjoint subsets which we call bins or buckets.
A histogram can represent dense, sparse, uniform, or skewed data. Instead of only one attribute, a histogram can be implemented for multiple attributes; it can effectively represent up to five attributes.
o Clustering: Clustering techniques group similar objects from the data so that the objects in a cluster are similar to each other but dissimilar to objects in other clusters.
How similar the objects inside a cluster are can be measured using a distance function; the more similar the objects in a cluster, the closer they appear within the cluster.
The quality of a cluster depends on its diameter, i.e., the maximum distance between any two objects in the cluster.
The cluster representation replaces the original data. This technique is more effective if the data can be classified into distinct clusters.
o Sampling: One of the methods used for data reduction is
sampling, as it can reduce the large data set into a much
smaller data sample. Below we will discuss the different
methods in which we can sample a large data set D
containing N tuples:
a. Simple random sample without replacement (SRSWOR) of size s: here, s tuples are drawn from the N tuples in data set D (s < N). The probability of drawing any tuple from the data set D is 1/N, which means all tuples have an equal probability of being sampled.
b. Simple random sample with replacement (SRSWR) of size s: this is similar to SRSWOR, but each tuple drawn from data set D is recorded and then replaced into D so that it can be drawn again.

c. Cluster sample: The tuples in data set D are grouped into M mutually disjoint subsets (clusters). Data reduction can be applied by implementing SRSWOR on these clusters; a simple random sample of s clusters can be generated, where s < M.
d. Stratified sample: The large data set D is partitioned
into mutually disjoint sets called 'strata'. A simple
random sample is taken from each stratum to get
stratified data. This method is effective for skewed
data.
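The sampling schemes above can be sketched with pandas as follows (an illustrative example; the data set and the 'stratum' column are assumptions):

```python
# Sketch of SRSWOR, SRSWR, and stratified sampling with pandas.
# The data set and the 'stratum' column are made up for illustration.
import pandas as pd

D = pd.DataFrame({
    "value":   range(1, 11),
    "stratum": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B"],
})
s = 4

srswor = D.sample(n=s, replace=False, random_state=0)   # without replacement
srswr  = D.sample(n=s, replace=True,  random_state=0)   # with replacement

# Stratified sample: draw 2 tuples from each stratum.
stratified = D.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(n=2, random_state=0)
)

print(srswor, srswr, stratified, sep="\n\n")
```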
3. Data Cube Aggregation
This technique is used to aggregate data in a simpler form. Data Cube
Aggregation is a multidimensional aggregation that uses aggregation at
various levels of a data cube to represent the original data set, thus
achieving data reduction.
For example, suppose you have the data of All Electronics sales per
quarter for the year 2018 to the year 2022. If you want to get the annual
sale per year, you just have to aggregate the sales per quarter for each
year. In this way, aggregation provides you with the required data,
which is much smaller in size, and thereby we achieve data reduction
even without losing any data.

The data cube aggregation is a multidimensional aggregation that eases multidimensional analysis. The data cube presents precomputed and summarized data, which gives data mining fast access to the figures it needs.
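A tiny sketch of the quarterly-to-annual aggregation described above (the figures are made up):

```python
# Aggregating quarterly sales up to annual totals, as in the example above.
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [200, 250, 300, 350, 220, 270, 320, 380],
})

# One row per year: a much smaller representation, with nothing lost
# for the annual-sales question.
annual = quarterly.groupby("year")["sales"].sum()
print(annual)
```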
4. Data Compression
Data compression employs modification, encoding, or converting the
structure of data in a way that consumes less space. Data compression
involves building a compact representation of information by removing
redundancy and representing data in binary form. Data that can be
restored successfully from its compressed form is called Lossless
compression. In contrast, the opposite where it is not possible to restore
the original form from the compressed form is Lossy compression.
Dimensionality and numerosity reduction methods are also used for data compression.

This technique reduces the size of the files using different encoding
mechanisms, such as Huffman Encoding and run-length Encoding. We
can divide it into two types based on their compression techniques.
i. Lossless Compression: Encoding techniques (such as Run-Length Encoding) allow a simple and minimal data size reduction. Lossless data compression uses algorithms to restore the precise original data from the compressed data.
ii. Lossy Compression: In lossy data compression, the decompressed data may differ from the original data but is still useful enough to retrieve information from. For example, the JPEG image format uses lossy compression, but we can still recover the meaning of the original image. Methods such as the Discrete Wavelet Transform and PCA (principal component analysis) are examples of this kind of compression.
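A minimal sketch of the run-length encoding idea mentioned above (illustrative only; not tied to any particular tool in the source):

```python
# Minimal run-length encoding and decoding sketch (lossless compression):
# consecutive repeats are stored as (value, count) pairs.
from itertools import groupby

def rle_encode(s):
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(s)]

def rle_decode(pairs):
    return "".join(ch * count for ch, count in pairs)

encoded = rle_encode("AAAABBBCCDAA")
print(encoded)              # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
print(rle_decode(encoded))  # exact original restored: lossless
```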
5. Discretization Operation
The data discretization technique is used to divide the attributes of the
continuous nature into data with intervals. We replace many constant
values of the attributes with labels of small intervals. This means that
mining results are shown in a concise and easily understandable way.
i. Top-down discretization: If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole set of attribute values and repeat this method until the end, the process is known as top-down discretization, also known as splitting.
ii. Bottom-up discretization: If you first consider all the continuous values as potential split points and then discard some by merging neighbouring values into intervals, the process is known as bottom-up discretization, also known as merging.
Benefits of Data Reduction
The main benefit of data reduction is simple: the more data you can fit
into a terabyte of disk space, the less capacity you will need to purchase.
Here are some benefits of data reduction, such as:
o Data reduction can save energy.
o Data reduction can reduce your physical storage costs.
o Data reduction can decrease your data center footprint.
Data reduction greatly increases the efficiency of a storage system and
directly impacts your total spending on capacity.
Data discretization

Data discretization refers to a method of converting a huge number of data values into smaller ones, so that the evaluation and management of the data becomes easy. In other words, data discretization is a method of converting the values of continuous attributes into a finite set of intervals with minimal loss of information. There are two forms of data discretization: supervised discretization and unsupervised discretization. Supervised discretization refers to a method in which the class data is used. Unsupervised discretization refers to a method that depends only on the way the operation proceeds, i.e., whether it uses a top-down splitting strategy or a bottom-up merging strategy.

Now, we can understand this concept with the help of an example.

Suppose we have an attribute Age with the given values:

Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

Table: Age values before and after discretization

Age values 1, 5, 4, 9, 7 → Child
Age values 11, 14, 17, 13, 18, 19 → Young
Age values 31, 33, 36, 42, 44, 46 → Mature
Age values 70, 74, 77, 78 → Old
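The same discretization can be sketched in Python with pandas.cut; the bin edges below are chosen to reproduce the table above and are otherwise an assumption:

```python
# Discretizing the Age attribute into the labelled intervals from the table
# above, using pandas.cut. Bin edges are chosen to match the example.
import pandas as pd

ages = pd.Series([1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
                  31, 33, 36, 42, 44, 46, 70, 74, 78, 77])

labels = pd.cut(ages,
                bins=[0, 10, 30, 60, 120],
                labels=["Child", "Young", "Mature", "Old"])

print(pd.DataFrame({"age": ages, "group": labels}))
```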
