UNIT II
Concept of Data Warehousing
• A data warehouse is a centralized system used for storing and
managing large volumes of data from various sources. It is designed to
help businesses analyze historical data and make informed decisions.
Data from different operational systems is collected, cleaned, and
stored in a structured way, enabling efficient querying and reporting.
• The goal is to produce statistical results that can support decision-making.
• End-User Access Tools: These are reporting and analysis tools, such
as dashboards or Business Intelligence (BI) tools, that enable business
users to query the data warehouse and generate reports.
Characteristics of Data Warehousing
Data warehousing is essential for modern data management, providing a strong
foundation for organizations to consolidate and analyze data strategically. Its
distinguishing features empower businesses with the tools to make informed
decisions and extract valuable insights from their data.
• Centralized Data Repository: Data warehousing provides a centralized repository
for all enterprise data from various sources, such as transactional databases,
operational systems, and external sources. This enables organizations to have a
comprehensive view of their data, which can help in making informed business
decisions.
• Data Integration: Data warehousing integrates data from different sources into a
single, unified view, which can help in eliminating data silos and reducing data
inconsistencies.
• Data Security: Data warehousing provides robust data security features, such as
access controls, data encryption, and data backups, which ensure that the data is
secure and protected from unauthorized access.
• Historical Data Storage: Data warehousing stores historical data, which enables
organizations to analyze data trends over time. This can help in identifying patterns
and anomalies in the data, which can be used to improve business performance.
• Query and Analysis: Data warehousing provides powerful query and analysis
capabilities that enable users to explore and analyze data in different ways. This
can help in identifying patterns and trends, and can also help in making informed
business decisions.
• Data Mining: Data warehousing provides data mining capabilities, which enable
organizations to discover hidden patterns and relationships in their data. This can
help in identifying new opportunities, predicting future trends, and mitigating risks.
Types of Data Warehouses
1. Enterprise Data Warehouse (EDW): A centralized warehouse that stores data from across the organization for
analysis and reporting.
2. Operational Data Store (ODS): Stores real-time operational data used for day-to-day operations, not for deep
analytics.
3. Data Mart: A subset of a data warehouse, focusing on a specific business area or department.
4. Cloud Data Warehouse: A data warehouse hosted in the cloud, offering scalability and flexibility.
5. Big Data Warehouse: Designed to store vast amounts of unstructured and structured data for big data analysis.
6. Virtual Data Warehouse: Provides access to data from multiple sources without physically storing it.
7. Hybrid Data Warehouse: Combines on-premises and cloud-based storage to offer flexibility.
8. Real-time Data Warehouse: Designed to handle real-time data streaming and analysis for immediate insights.
Data Warehouse vs DBMS
Example Applications of Data Warehousing
• Data warehousing can be applied wherever there is a huge amount of
data and a need for statistical results that support decision-making.
• Social Media Websites: Social networking sites such as Facebook,
Twitter, and LinkedIn are built on analyzing large data sets. These sites
gather data about members, groups, locations, and so on, and store it in a
single central repository. Given the sheer volume of this data, a data
warehouse is needed to manage it.
• Banking: Most banks use warehouses to analyze the spending patterns
of account and card holders, and use these patterns to target special
offers, deals, etc.
• Data Quality: A warehouse enforces data quality and consistency, enabling trustworthy reporting.
• Historical Insight: The warehouse stores all historical data about the
business, so it can be analyzed at any time to extract insights.
Disadvantages of Data Warehousing
• Cost: Building a data warehouse can be expensive, requiring significant
investments in hardware, software, and personnel.
• Complexity: Data warehousing can be complex, and businesses may need to hire
specialized personnel to manage the system.
• Data security: Data warehousing can pose data security risks, and businesses must
take measures to protect sensitive data from unauthorized access or breaches.
ETL (Extract, Transform, Load)
ETL stands for Extract, Transform, Load and it is a process used in data warehousing to extract
data from various sources, transform it into a format suitable for loading into a data warehouse,
and then load it into the warehouse. The process of ETL can be broken down into the following
three stages:
1. Extract: The first stage in the ETL process is to extract data from various sources such as
transactional systems, spreadsheets, and flat files. This step involves reading data from the
source systems and storing it in a staging area.
2. Transform: In this stage, the extracted data is transformed into a format that is suitable for
loading into the data warehouse. This may involve cleaning and validating the data, converting
data types, combining data from multiple sources, and creating new data fields.
3. Load: After the data is transformed, it is loaded into the data warehouse. This step involves
creating the physical data structures and loading the data into the warehouse.
The ETL process is an iterative process that is repeated as new data is added to the warehouse.
The process is important because it ensures that the data in the data warehouse is accurate,
complete, and up-to-date. It also helps to ensure that the data is in the format required for data
mining and reporting.
Each step of the ETL process in-depth
1. Extraction:
The first step of the ETL process is extraction. In this step, data is
extracted from various source systems, which may come in different
formats such as relational databases, NoSQL stores, XML, and flat files,
into the staging area. It is important to extract the data into the staging
area first, rather than directly into the data warehouse, because the
extracted data arrives in varied formats and may be corrupted. Loading it
directly into the warehouse could damage the warehouse, and rolling back
would be much more difficult. This makes extraction one of the most
important steps of the ETL process.
2. Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or
functions are applied on the extracted data to convert it into a single standard format. It
may involve following processes/tasks:
1. Filtering – loading only certain attributes into the data warehouse.
2. Cleaning – filling up the NULL values with some default values, mapping U.S.A, United States, and
America into USA, etc.
3. Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is
finally loaded into the data warehouse. Sometimes the data is updated by loading into the
data warehouse very frequently and sometimes it is done after longer but regular intervals.
The rate and period of loading solely depends on the requirements and varies from system
to system.
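The three ETL stages above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the table name, field names, and country mapping are made up for the example, and an in-memory list stands in for a real staging area fed by database or flat-file extracts.

```python
# Minimal ETL sketch: extract into a staging structure, transform, load.
import sqlite3

# --- Extract: raw rows pulled from source systems into a staging area
# (here, an in-memory list standing in for staged CSV/database extracts).
raw_rows = [
    {"customer": "alice", "country": "U.S.A", "amount": "120.50"},
    {"customer": "bob",   "country": "United States", "amount": None},
    {"customer": "carol", "country": "America", "amount": "75.00"},
]

# --- Transform: standardize country names, fill NULLs, convert types.
COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

def transform(row):
    return {
        "customer": row["customer"],
        "country": COUNTRY_MAP.get(row["country"], row["country"]),
        # Fill NULL amounts with a default value, as in the Cleaning step.
        "amount": float(row["amount"]) if row["amount"] is not None else 0.0,
    }

clean_rows = [transform(r) for r in raw_rows]

# --- Load: create the physical structure and load the transformed rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, country TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (:customer, :country, :amount)",
                 clean_rows)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 195.5
```

In a real pipeline the transform step would also validate data and merge records from multiple sources, and the load step would run on the schedule the requirements dictate.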
Star Schema
• A star schema is a type of data modeling technique used in data
warehousing to represent data in a structured and intuitive way. In a
star schema, data is organized into a central fact table that contains the
measures of interest, surrounded by dimension tables that describe the
attributes of the measures.
• The fact table in a star schema contains the measures or metrics that
are of interest to the user or organization. For example, in a sales data
warehouse, the fact table might contain sales revenue, units sold, and
profit margins. Each record in the fact table represents a specific event
or transaction, such as a sale or order.
• The dimension tables in a star schema contain the descriptive
attributes of the measures in the fact table. These attributes are used to
slice and dice the data in the fact table, allowing users to analyze the
data from different perspectives. For example, in a sales data
warehouse, the dimension tables might include product, customer,
time, and location.
• In a star schema, each dimension table is joined to the fact table
through a foreign key relationship. This allows users to query the data
in the fact table using attributes from the dimension tables. For
example, a user might want to see sales revenue by product category,
or by region and time period.
It is called a star schema because its physical model resembles a star,
with a fact table at its center and the dimension tables at its periphery
representing the star’s points.
Types of Star Schema
1. Simple Star Schema
2. Complex Star Schema
Introduction to data mining
• Data mining is the process of extracting knowledge or insights from large
amounts of data using various statistical and computational techniques. The
data can be structured, semi-structured or unstructured, and can be stored in
various forms such as databases, data warehouses, and data lakes.
• The primary goal of data mining is to discover hidden patterns and
relationships in the data that can be used to make informed decisions or
predictions. This involves exploring the data using various techniques such
as clustering, classification, regression analysis, association rule mining, and
anomaly detection.
• For example, in marketing, data mining can be used to identify customer
segments and target marketing campaigns, while in healthcare, it can be
used to identify risk factors for diseases and develop personalized treatment
plans.
• Data mining uses mathematical analysis to derive patterns and trends that exist in
data. Typically, these patterns cannot be discovered by traditional data exploration
because the relationships are too complex or because there is too much data.
• These patterns and trends can be collected and defined as a data mining model.
Mining models can be applied to specific scenarios, such as:
• Forecasting: Estimating sales, predicting server loads or server downtime
• Risk and probability: Choosing the best customers for targeted mailings, determining
the probable break-even point for risk scenarios, assigning probabilities to diagnoses
or other outcomes
• Recommendations: Determining which products are likely to be sold together,
generating recommendations
• Finding sequences: Analyzing customer selections in a shopping cart, predicting next
likely events
• Grouping: Separating customers or events into clusters of related items, analyzing and
predicting affinities
Origin of Data Mining
The term "data mining" was introduced in the 1990s, but data mining is the evolution of a field with
a long history.
Early techniques for identifying patterns in data include Bayes' theorem (1700s) and the evolution of
regression (1800s). The growing power of computing has boosted data collection, storage, and
manipulation as data sets have grown in size and complexity. Explicit, hands-on data investigation
has progressively been augmented by automatic data processing and other computer science advances
such as neural networks, clustering, and genetic algorithms (1950s), decision trees (1960s), and
support vector machines (1990s).
Data mining traces its origins to three family lines: classical statistics, artificial intelligence, and
machine learning.
• Data mining, also known as knowledge discovery in data (KDD), is the process of
uncovering patterns and other valuable information from large data sets. Given the
evolution of data warehousing technology and the growth of big data, adoption of
data mining techniques has rapidly accelerated over the last couple of decades,
assisting companies by transforming their raw data into useful knowledge.
4) Association Analysis: Association analysis discovers links between data attributes and the rules that bind
them, relating attributes that are regularly transacted together. The resulting association rules are commonly
used in market basket analysis. Two measures qualify a rule: support, which reports how often the associated
items have occurred together in past transactions, and confidence, which indicates the probability that the
items occur together.
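Support and confidence for a rule A → B can be computed directly from a list of transactions. The grocery items below are made-up toy data; only the standard library is used.

```python
# Support and confidence for an association rule A -> B.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, txns):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in txns) / len(txns)

def confidence(antecedent, consequent, txns):
    """P(consequent | antecedent) = support(A union B) / support(A)."""
    return support(antecedent | consequent, txns) / support(antecedent, txns)

# Rule {bread} -> {milk}: bought together in 2 of 4 baskets,
# and in 2 of the 3 baskets that contain bread.
print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # ~0.667
```

Algorithms such as Apriori automate finding all itemsets whose support and confidence exceed chosen thresholds; the measures themselves are just these two ratios.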
5) Outlier Analysis: Data elements that cannot be grouped into a given class or cluster are outliers. They
are often referred to as anomalies or surprises, and they are important to note. Although in some contexts
outliers are treated as noise and discarded, in other areas they can disclose useful information, and hence
their study can be very important and beneficial.
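One simple way to flag outliers, sketched below, is the z-score rule: points more than k standard deviations from the mean are anomalies. The sensor-style readings and the threshold k are illustrative choices, not part of the original notes.

```python
# Z-score outlier detection: flag points far from the mean.
from statistics import mean, stdev

def outliers(data, k=2.0):
    """Return values more than k standard deviations from the mean."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) > k * s]

readings = [10, 11, 9, 10, 12, 11, 10, 95]  # 95 is an injected anomaly
print(outliers(readings))  # [95]
```

Whether a flagged point is noise to discard or a surprise worth investigating depends on the domain, which is exactly the point made above.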
6) Cluster Analysis: Clustering is the arrangement of data into groups. Unlike
classification, class labels are not predefined in clustering; it is up to the
clustering algorithm to find suitable classes. Clustering is often called
unsupervised classification because the grouping is performed without provided
class labels. Most clustering methods are based on maximizing the similarity
between objects of the same class (intra-class similarity) and minimizing the
similarity between objects in different classes (inter-class similarity).
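A minimal 1-D k-means sketch makes the idea concrete: no labels are supplied, and the algorithm groups points around centroids by itself. The data points and starting centroids below are arbitrary illustrative values.

```python
# Minimal 1-D k-means: unsupervised grouping around centroids.
def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        # Assign each point to its nearest centroid (intra-class similarity).
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [sum(ps) / len(ps) if ps else c
                     for c, ps in clusters.items()]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
print(kmeans_1d(data, centroids=[0.0, 5.0]))  # [1.0, 10.0]
```

The two returned centroids sit at the centers of the two natural groups in the data, even though no class labels were ever given.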
7) Evolution & Deviation Analysis: This analysis uncovers patterns and shifts in
behavior over time, revealing features such as time-series trends, periodicity,
and similarities between patterns. It is applied across many domains, from space
science to retail marketing.
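A toy form of deviation analysis on a time series is to flag points that stray from a trailing moving average by more than a tolerance. The weekly-sales figures, window size, and tolerance below are all illustrative assumptions.

```python
# Deviation analysis sketch: flag points far from a trailing moving average.
def deviations(series, window=3, tol=15.0):
    """Return indices whose value deviates from the trailing mean by > tol."""
    flagged = []
    for i in range(window, len(series)):
        baseline = sum(series[i - window:i]) / window
        if abs(series[i] - baseline) > tol:
            flagged.append(i)
    return flagged

weekly_sales = [100, 102, 101, 99, 130, 100, 101]
print(deviations(weekly_sales))  # [4]
```

Index 4 (the jump to 130) deviates sharply from the preceding weeks' average; real deviation analysis would also model trend and periodicity before flagging.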
Data Mining Applications
• Data mining is a young discipline with wide and diverse
applications
• There is still a nontrivial gap between general principles
of data mining and domain-specific, effective data
mining tools for particular applications
• Some application domains (covered in this chapter)
• Biomedical and DNA data analysis
• Financial data analysis
• Retail industry
• Telecommunication industry
Biomedical Data Mining and DNA Analysis
Data Mining for Retail Industry