
Data warehousing and data mining: Concept of data warehousing, ETL, star schema, introduction to data mining, the origin of data mining, data mining tasks, applications and trends in data mining; data mining for the retail industry, health industry, and the insurance and telecommunication sectors.

UNIT II
Concept of Data Warehousing
• A data warehouse is a centralized system used for storing and
managing large volumes of data from various sources. It is designed to
help businesses analyze historical data and make informed decisions.
Data from different operational systems is collected, cleaned, and
stored in a structured way, enabling efficient querying and reporting.
• Its goal is to produce statistical results that can help in decision-making.

• It ensures fast data retrieval even with vast datasets.


Need for Data Warehousing
1. Handling Large Volumes of Data: Traditional operational databases typically hold a limited volume of current data (gigabytes), whereas a data warehouse is designed to handle much larger historical datasets (terabytes and beyond), allowing businesses to store and manage massive amounts of historical data.
2. Enhanced Analytics: Transactional databases are not optimized for analytical purposes. A
data warehouse is built specifically for data analysis, enabling businesses to perform complex
queries and gain insights from historical data.
3. Centralized Data Storage: A data warehouse acts as a central repository for all
organizational data, helping businesses to integrate data from multiple sources and have a
unified view of their operations for better decision-making.
4. Trend Analysis: By storing historical data, a data warehouse allows businesses to analyze
trends over time, enabling them to make strategic decisions based on past performance and
predict future outcomes.
5. Support for Business Intelligence: Data warehouses support business intelligence tools
and reporting systems, providing decision-makers with easy access to critical information,
which enhances operational efficiency and supports data-driven strategies.
Components of Data Warehouse
• Data Sources: These are the various operational systems, databases, and
external data feeds that provide raw data to be stored in the warehouse.
• ETL (Extract, Transform, Load) Process: The ETL process is responsible
for extracting data from different sources, transforming it into a suitable
format, and loading it into the data warehouse.
• Data Warehouse Database: This is the central repository where cleaned and transformed data is stored. It is typically organized in a multidimensional format for efficient querying and reporting.
• Metadata: Metadata describes the structure, source, and usage of data
within the warehouse, making it easier for users and systems to understand
and work with the data.
• Data Marts: These are smaller, more focused data repositories
derived from the data warehouse, designed to meet the needs of
specific business departments or functions.

• OLAP (Online Analytical Processing) Tools: OLAP tools allow users to analyze data in multiple dimensions, providing deeper insights and supporting complex analytical queries.

• End-User Access Tools: These are reporting and analysis tools, such
as dashboards or Business Intelligence (BI) tools, that enable business
users to query the data warehouse and generate reports.
Characteristics of Data Warehousing
Data warehousing is essential for modern data management, providing a strong
foundation for organizations to consolidate and analyze data strategically. Its
distinguishing features empower businesses with the tools to make informed
decisions and extract valuable insights from their data.
• Centralized Data Repository: Data warehousing provides a centralized repository
for all enterprise data from various sources, such as transactional databases,
operational systems, and external sources. This enables organizations to have a
comprehensive view of their data, which can help in making informed business
decisions.

• Data Integration: Data warehousing integrates data from different sources into a
single, unified view, which can help in eliminating data silos and reducing data
inconsistencies.

• Data Security: Data warehousing provides robust data security features, such as
access controls, data encryption, and data backups, which ensure that the data is
secure and protected from unauthorized access.
• Historical Data Storage: Data warehousing stores historical data, which enables
organizations to analyze data trends over time. This can help in identifying patterns
and anomalies in the data, which can be used to improve business performance.

• Query and Analysis: Data warehousing provides powerful query and analysis
capabilities that enable users to explore and analyze data in different ways. This
can help in identifying patterns and trends, and can also help in making informed
business decisions.

• Data Transformation: Data warehousing includes a process of data transformation, which involves cleaning, filtering, and formatting data from various sources to make it consistent and usable. This can help in improving data quality and reducing data inconsistencies.

• Data Mining: Data warehousing provides data mining capabilities, which enable
organizations to discover hidden patterns and relationships in their data. This can
help in identifying new opportunities, predicting future trends, and mitigating risks.
Types of Data Warehouses
1. Enterprise Data Warehouse (EDW): A centralized warehouse that stores data from across the organization for
analysis and reporting.

2. Operational Data Store (ODS): Stores real-time operational data used for day-to-day operations, not for deep
analytics.

3. Data Mart: A subset of a data warehouse, focusing on a specific business area or department.

4. Cloud Data Warehouse: A data warehouse hosted in the cloud, offering scalability and flexibility.

5. Big Data Warehouse: Designed to store vast amounts of unstructured and structured data for big data analysis.

6. Virtual Data Warehouse: Provides access to data from multiple sources without physically storing it.

7. Hybrid Data Warehouse: Combines on-premises and cloud-based storage to offer flexibility.

8. Real-time Data Warehouse: Designed to handle real-time data streaming and analysis for immediate insights.
Data Warehouse vs DBMS
• Workload: a DBMS (OLTP) handles many short read/write transactions; a data warehouse (OLAP) handles long, complex analytical queries.
• Data: a DBMS stores current, application-oriented data; a warehouse stores historical, subject-oriented, integrated data.
• Design: DBMS schemas are normalized for update efficiency; warehouse schemas (e.g., the star schema) are denormalized for query speed.
• Users: a DBMS serves clerks and operational applications; a warehouse serves analysts and decision-makers.
Example Applications of Data Warehousing
• Data warehousing can be applied anywhere we have a huge amount of data and want to see statistical results that help in decision-making.
• Social Media Websites: Social networking websites like Facebook, Twitter, LinkedIn, etc. are based on analyzing large data sets. These sites gather data related to members, groups, locations, and more, and store it in a single central repository. Because of the sheer volume of data, a data warehouse is needed to implement this.

• Banking: Most of the banks these days use warehouses to see the spending
patterns of account/cardholders. They use this to provide them with special
offers, deals, etc.

• Government: Governments use data warehouses to store and analyze tax payments, which helps detect tax evasion.
Advantages of Data Warehousing

• Intelligent Decision-Making: With centralized data in warehouses, decisions can be made more quickly and intelligently.

• Business Intelligence: Provides strong operational insights through business intelligence.

• Data Quality: Guarantees data quality and consistency for trustworthy reporting.

• Scalability: Capable of managing massive data volumes and expanding to meet changing requirements.

• Effective Queries: Fast and effective data retrieval is made possible by an optimized structure.

• Cost Reductions: Data warehousing can result in cost savings over time by streamlining data management procedures and increasing overall efficiency, even though there are initial setup costs.

• Data Security: Data warehouses employ security protocols to safeguard confidential information, guaranteeing that only authorized personnel are granted access to certain data.

• Faster Queries: The data warehouse is designed to handle large queries, which is why it runs them faster than an operational database.

• Historical Insight: The warehouse stores all historical data about the business, so it can be analyzed at any time to extract insights.
Disadvantages of Data Warehousing

• Cost: Building a data warehouse can be expensive, requiring significant investments in hardware, software, and personnel.

• Complexity: Data warehousing can be complex, and businesses may need to hire specialized personnel to manage the system.

• Time-consuming: Building a data warehouse can take a significant amount of time, requiring businesses to be patient and committed to the process.

• Data integration challenges: Data from different sources can be challenging to integrate, requiring significant effort to ensure consistency and accuracy.

• Data security: Data warehousing can pose data security risks, and businesses must take measures to protect sensitive data from unauthorized access or breaches.
ETL (Extraction, Transformation and Loading)
ETL stands for Extract, Transform, Load and it is a process used in data warehousing to extract
data from various sources, transform it into a format suitable for loading into a data warehouse,
and then load it into the warehouse. The process of ETL can be broken down into the following
three stages:

1. Extract: The first stage in the ETL process is to extract data from various sources such as
transactional systems, spreadsheets, and flat files. This step involves reading data from the
source systems and storing it in a staging area.

2. Transform: In this stage, the extracted data is transformed into a format that is suitable for
loading into the data warehouse. This may involve cleaning and validating the data, converting
data types, combining data from multiple sources, and creating new data fields.

3. Load: After the data is transformed, it is loaded into the data warehouse. This step involves
creating the physical data structures and loading the data into the warehouse.

The ETL process is an iterative process that is repeated as new data is added to the warehouse.
The process is important because it ensures that the data in the data warehouse is accurate,
complete, and up-to-date. It also helps to ensure that the data is in the format required for data
mining and reporting.
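To make the three stages concrete, here is a minimal end-to-end sketch in Python with pandas and SQLite. The file name, column names, and warehouse table name are illustrative assumptions, not part of the original text.

```python
# Minimal ETL sketch (illustrative): extract a CSV, clean it, load it into SQLite.
# sales.csv, the column names, and fact_sales are assumptions for this example.
import sqlite3
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # Extract: read raw data from a source file into a staging DataFrame.
    return pd.read_csv(path)

def transform(staged: pd.DataFrame) -> pd.DataFrame:
    # Transform: clean, standardize, and derive fields.
    df = staged.dropna(subset=["order_id"])              # drop rows missing the key
    df["country"] = df["country"].replace(
        {"U.S.A": "USA", "United States": "USA", "America": "USA"}
    )                                                    # map variants to one value
    df["revenue"] = df["units_sold"] * df["unit_price"]  # create a new data field
    return df.sort_values("order_date")                  # sort on an attribute

def load(df: pd.DataFrame, db_path: str) -> None:
    # Load: write the transformed data into the warehouse table.
    with sqlite3.connect(db_path) as conn:
        df.to_sql("fact_sales", conn, if_exists="append", index=False)

if __name__ == "__main__":
    load(transform(extract("sales.csv")), "warehouse.db")
```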
Each step of the ETL process in-depth
• Extraction:
The first step of the ETL process is extraction. In this step, data is extracted from various source systems, which may be in various formats such as relational databases, NoSQL stores, XML, and flat files, into the staging area. It is important to extract the data into the staging area first, and not directly into the data warehouse, because the extracted data arrives in varied formats and may also be corrupted. Loading it directly into the warehouse could therefore damage it, and rollback would be much more difficult. This makes extraction one of the most important steps of the ETL process.
• Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or functions is applied to the extracted data to convert it into a single standard format. This may involve the following tasks (a small pandas sketch follows the list):
1. Filtering – loading only certain attributes into the data warehouse.

2. Cleaning – filling up the NULL values with some default values, mapping U.S.A, United States, and
America into USA, etc.

3. Joining – joining multiple attributes into one.

4. Splitting – splitting a single attribute into multiple attributes.

5. Sorting – sorting tuples on the basis of some attribute (generally key-attribute).
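A small pandas sketch of the five tasks above, run on an invented staging DataFrame; all column names and values are assumptions for illustration:

```python
# Illustrative pandas versions of the five transformation tasks listed above.
import pandas as pd

staged = pd.DataFrame({
    "order_id": [3, 1, 2],
    "first_name": ["Ann", "Raj", "Mei"],
    "last_name": ["Lee", "Roy", "Chen"],
    "country": ["U.S.A", "United States", None],
    "full_address": ["12 Elm St, Springfield", "9 Oak Ave, Dover", "4 Pine Rd, Salem"],
    "internal_note": ["x", "y", "z"],
})

# Filtering: load only certain attributes (internal_note is dropped).
df = staged[["order_id", "first_name", "last_name", "country", "full_address"]].copy()
# Cleaning: fill NULLs with a default and map spelling variants onto one value.
df["country"] = df["country"].fillna("Unknown").replace(
    {"U.S.A": "USA", "United States": "USA", "America": "USA"})
# Joining: combine multiple attributes into one.
df["customer"] = df["first_name"] + " " + df["last_name"]
# Splitting: split a single attribute into multiple attributes.
df[["street", "city"]] = df["full_address"].str.split(", ", expand=True)
# Sorting: sort tuples on the key attribute.
df = df.sort_values("order_id")
print(df)
```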

• Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is finally loaded into the data warehouse. Sometimes the data is loaded into the warehouse very frequently, and sometimes at longer but regular intervals. The rate and period of loading depend solely on the requirements and vary from system to system.
Star Schema
• A star schema is a type of data modeling technique used in data
warehousing to represent data in a structured and intuitive way. In a
star schema, data is organized into a central fact table that contains the
measures of interest, surrounded by dimension tables that describe the
attributes of the measures.
• The fact table in a star schema contains the measures or metrics that
are of interest to the user or organization. For example, in a sales data
warehouse, the fact table might contain sales revenue, units sold, and
profit margins. Each record in the fact table represents a specific event
or transaction, such as a sale or order.
• The dimension tables in a star schema contain the descriptive
attributes of the measures in the fact table. These attributes are used to
slice and dice the data in the fact table, allowing users to analyze the
data from different perspectives. For example, in a sales data
warehouse, the dimension tables might include product, customer,
time, and location.
• In a star schema, each dimension table is joined to the fact table
through a foreign key relationship. This allows users to query the data
in the fact table using attributes from the dimension tables. For
example, a user might want to see sales revenue by product category,
or by region and time period.
It is called a star schema because its physical model resembles a star shape, with a fact table at its center and the dimension tables at its periphery representing the star's points. Below is an example to demonstrate the star schema:
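A minimal sketch of a sales star schema, expressed as SQL DDL run through Python's sqlite3; the table and column names are illustrative assumptions, not from the original text.

```python
# Illustrative star schema: one fact table joined to dimension tables by foreign keys.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE dim_time     (time_id     INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);

-- Central fact table: measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_id  INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    time_id     INTEGER REFERENCES dim_time(time_id),
    units_sold  INTEGER,
    revenue     REAL
);
""")

# Slice the facts by a dimension attribute: revenue per product category.
rows = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    GROUP BY p.category
""").fetchall()
print(rows)
```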
Types of Star Schema
1. Simple Star Schema: a single fact table joined to its dimension tables.
2. Complex Star Schema: multiple fact tables sharing one or more dimension tables.
Introduction to data mining
• Data mining is the process of extracting knowledge or insights from large
amounts of data using various statistical and computational techniques. The
data can be structured, semi-structured or unstructured, and can be stored in
various forms such as databases, data warehouses, and data lakes.
• The primary goal of data mining is to discover hidden patterns and
relationships in the data that can be used to make informed decisions or
predictions. This involves exploring the data using various techniques such
as clustering, classification, regression analysis, association rule mining, and
anomaly detection.
• For example, in marketing, data mining can be used to identify customer
segments and target marketing campaigns, while in healthcare, it can be
used to identify risk factors for diseases and develop personalized treatment
plans.
• Data mining uses mathematical analysis to derive patterns and trends that exist in
data. Typically, these patterns cannot be discovered by traditional data exploration
because the relationships are too complex or because there is too much data.
• These patterns and trends can be collected and defined as a data mining model.
Mining models can be applied to specific scenarios, such as:
• Forecasting: Estimating sales, predicting server loads or server downtime
• Risk and probability: Choosing the best customers for targeted mailings, determining
the probable break-even point for risk scenarios, assigning probabilities to diagnoses
or other outcomes
• Recommendations: Determining which products are likely to be sold together,
generating recommendations
• Finding sequences: Analyzing customer selections in a shopping cart, predicting next
likely events
• Grouping: Separating customers or events into clusters of related items, analyzing and predicting affinities
Origin of Data Mining
The term "data mining" was introduced in the 1990s, but data mining is the evolution of a field with an extensive history.
Early techniques for identifying patterns in data include Bayes' theorem (1700s) and the evolution of regression (1800s). The generation and growing power of computer science have boosted data collection, storage, and manipulation as data sets have grown in size and complexity. Explicit hands-on data investigation has progressively been augmented with indirect, automatic data processing and other computer science discoveries such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s), and support vector machines (1990s).
Data mining origins are traced back to three family lines: Classical statistics, Artificial intelligence, and
Machine learning.

• Classical statistics: Statistics are the basis of most of the technology on which data mining is built, such as regression analysis, standard deviation, standard distribution, standard variance, discriminant analysis, cluster analysis, and confidence intervals. All of these are used to analyze data and data connections.

• Artificial intelligence: AI is based on heuristics as opposed to statistics. It tries to apply human-thought-like processing to statistical problems. A specific AI concept was adopted by some high-end commercial products, such as query optimization modules for Relational Database Management Systems (RDBMS).

• Machine learning: Machine learning is a combination of statistics and AI. It might be considered an evolution of AI because it mixes AI heuristics with complex statistical analysis. Machine learning tries to enable computer programs to learn about the data they are studying, so that programs make decisions based on the characteristics of the data examined. It uses statistics for basic concepts and adds more AI heuristics and algorithms to accomplish its target.
Data Mining Tasks:

• Data mining, also known as knowledge discovery in data (KDD), is the process of
uncovering patterns and other valuable information from large data sets. Given the
evolution of data warehousing technology and the growth of big data, adoption of
data mining techniques has rapidly accelerated over the last couple of decades,
assisting companies by transforming their raw data into useful knowledge.

• Data mining functionalities are to perceive the various forms of patterns to be


identified in data mining activities. To define the type of patterns to be discovered
in data mining activities, data mining features are used. Data mining has a wide
application for forecasting and characterizing data in big data.
Data mining tasks are majorly categorized into two categories: descriptive and predictive.

❑ Descriptive data mining:
• Descriptive data mining offers a detailed description of the data; it gives insight into what is going on inside the data without any prior idea, and demonstrates the common characteristics in the data.

❑ Predictive data mining:
• Predictive data mining allows users to infer features that are not directly available, for example, projecting market performance for the next quarter from the output of previous quarters. In general, predictive analysis forecasts or infers characteristics from the data already available, for instance, predicting a diagnosis from the outcomes recorded in a patient's medical records.
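A tiny sketch of the contrast, assuming an invented quarterly-sales DataFrame: descriptive mining summarizes what the data already shows, while predictive mining fits a model to forecast the next quarter.

```python
# Descriptive vs. predictive mining on a toy quarterly-sales series (illustrative).
import pandas as pd
from sklearn.linear_model import LinearRegression

sales = pd.DataFrame({"quarter": [1, 2, 3, 4], "revenue": [100.0, 110.0, 125.0, 138.0]})

# Descriptive: summarize what is already in the data.
print(sales["revenue"].describe())

# Predictive: fit a simple trend and forecast the next quarter.
model = LinearRegression().fit(sales[["quarter"]], sales["revenue"])
print("Forecast for quarter 5:", model.predict(pd.DataFrame({"quarter": [5]}))[0])
```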
Key Data Mining Tasks

1) Characterization and Discrimination
• Data Characterization: Data characterization is a description of the general characteristics of objects in a target class, producing what are called characteristic rules. A database query typically computes the data relevant to a user-specified class and runs it through a description module to summarize the data at various levels of abstraction, often presented as bar charts, curves, and pie charts.
• Data Discrimination: Data discrimination produces a set of rules, called discriminant rules, that distinguish the general characteristics of objects in the target class from those of a contrasting class.
2) Prediction: Prediction uses regression analysis to detect inaccessible data and to estimate missing numeric values. If the class label is absent, classification is used to make the prediction. Prediction is common because of its relevance in business intelligence. There are two forms of prediction: predicting a class label using a previously built classification model, and predicting missing or incomplete numeric data using regression analysis.
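A minimal sketch of the second form, filling a missing numeric value with a fitted regression; the employee data and column names are invented for illustration:

```python
# Illustrative: predict a missing numeric value (salary) from another attribute (experience).
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "experience": [1, 3, 5, 7, 4],
    "salary": [30000.0, 42000.0, 55000.0, 68000.0, None],  # one missing value
})

known = df.dropna(subset=["salary"])
model = LinearRegression().fit(known[["experience"]], known["salary"])

missing = df["salary"].isna()
df.loc[missing, "salary"] = model.predict(df.loc[missing, ["experience"]])
print(df)
```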
3) Classification: Classification builds a model over predefined classes; the model is then used to classify new instances whose class is not known. The instances used to build the model are known as training data. Such a classification process can yield a decision tree or a set of classification rules, which can then be applied to label future data, for example, classifying the likely compensation of an employee based on the salary classifications of related employees in the company.
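A minimal sketch of that salary example with a decision tree; the features, labels, and values are invented for illustration:

```python
# Illustrative: a decision tree classifying salary bands from employee attributes.
from sklearn.tree import DecisionTreeClassifier

# Training data (invented): [years_experience, education_level], salary band label.
X_train = [[1, 1], [2, 2], [5, 2], [8, 3], [10, 3], [3, 1]]
y_train = ["low", "low", "medium", "high", "high", "low"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(tree.predict([[6, 2]]))  # classify a new, unseen employee
```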

4) Association Analysis: Association analysis discovers the links between data items and the rules that bind them; two or more data attributes are associated. It associates attributes that are frequently transacted together, producing what are called association rules, which are commonly used in market basket analysis. Two measures link the attributes: confidence, which indicates the probability of the associated items occurring together, and support, which reports how frequently the association has occurred in the past.
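A small sketch computing both measures for an invented rule {bread} → {butter} over toy shopping baskets:

```python
# Illustrative: support and confidence for the rule {bread} -> {butter}.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

n = len(baskets)
both = sum(1 for b in baskets if {"bread", "butter"} <= b)
bread = sum(1 for b in baskets if "bread" in b)

support = both / n          # fraction of all baskets containing bread AND butter
confidence = both / bread   # fraction of bread baskets that also contain butter
print(f"support={support:.2f}, confidence={confidence:.2f}")  # 0.50, 0.67
```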

5) Outlier Analysis: Data objects that cannot be grouped into any given class or cluster are outliers. They are often referred to as anomalies or surprises, and they are important to examine. Although in some contexts outliers are treated as noise and discarded, in other domains they can disclose useful information, so their study can be very important and beneficial.
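A minimal sketch of one common approach, flagging points far from the mean in standard-deviation units; the data and the threshold of 2 are illustrative choices:

```python
# Illustrative: flag outliers as points more than 2 standard deviations from the mean
# (thresholds of 2-3 are common rules of thumb; larger samples often use 3).
import numpy as np

values = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 42.0, 10.3])  # 42.0 is the anomaly
z_scores = (values - values.mean()) / values.std()
print(values[np.abs(z_scores) > 2])  # -> [42.]
```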
6) Cluster Analysis: Clustering is the arrangement of data into groups. Unlike classification, class labels are unknown in clustering, and it is up to the clustering algorithm to discover suitable classes. Clustering is often called unsupervised classification, since the grouping is not driven by given class labels. Most clustering methods are based on maximizing the similarity between objects of the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).
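A minimal sketch using k-means, one common clustering algorithm; the points and cluster count are invented for illustration:

```python
# Illustrative: k-means grouping of 2-D points into two clusters without any labels.
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # one natural group
          [8.0, 8.2], [7.9, 8.1], [8.1, 7.8]]   # another natural group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster id assigned to each point
print(km.cluster_centers_)  # learned cluster centers
```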

7) Evolution and Deviation Analysis: This analysis uncovers patterns and shifts in behavior over time; with it we can find features such as time-series trends, periodicity, and similarities in patterns. It is applied across many domains, from space science to retail marketing.
Data Mining Applications
• Data mining is a young discipline with wide and diverse applications.
• There is still a nontrivial gap between general principles of data mining and domain-specific, effective data mining tools for particular applications.
• Some application domains covered in this unit:
• Biomedical and DNA data analysis
• Financial data analysis
• Retail industry
• Telecommunication industry
Biomedical Data Mining and DNA Analysis
• DNA sequences are built from 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T).
• Gene: a sequence of hundreds of individual nucleotides arranged in a particular order.
• Humans have roughly 20,000–25,000 protein-coding genes (early estimates put the figure near 100,000).
• There is a tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes.
• Semantic integration of heterogeneous, distributed genome databases is needed: currently there is highly distributed, uncontrolled generation and use of a wide variety of DNA data.
• Data cleaning and data integration methods developed in data mining will help.
DNA Analysis: Examples
• Similarity search and comparison among DNA sequences (a small k-mer sketch follows this list):
• Compare the frequently occurring patterns of each class (e.g., diseased and healthy)
• Identify gene sequence patterns that play roles in various diseases
• Association analysis: identification of co-occurring gene sequences
• Most diseases are not triggered by a single gene but by a combination of genes acting together
• Association analysis may help determine the kinds of genes that are likely to co-occur in target samples
• Path analysis: linking genes to different disease development stages
• Different genes may become active at different stages of the disease
• This can help develop pharmaceutical interventions that target the different stages separately
• Visualization tools and genetic data analysis
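A toy sketch of sequence similarity via k-mer (length-k substring) frequency profiles; the sequences, the value of k, and the scoring are invented for illustration:

```python
# Illustrative: compare DNA sequences by the overlap of their k-mer frequency profiles.
from collections import Counter

def kmer_profile(seq: str, k: int = 3) -> Counter:
    # Count every overlapping substring of length k in the sequence.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def similarity(a: str, b: str, k: int = 3) -> float:
    # Shared k-mer count divided by the size of the smaller profile (range 0..1).
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    shared = sum((pa & pb).values())  # multiset intersection of the two profiles
    return shared / min(sum(pa.values()), sum(pb.values()))

healthy = "ATCGGCTAATCG"
diseased = "ATCGGCTTATCG"
print(f"similarity = {similarity(healthy, diseased):.2f}")
```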
Data Mining for Financial Data Analysis
• Financial data collected in banks and financial institutions is often relatively complete, reliable, and of high quality.
• Design and construction of data warehouses for multidimensional data analysis and data mining:
• View debt and revenue changes by month, by region, by sector, and by other factors
• Access statistical information such as max, min, total, average, trend, etc.
• Loan payment prediction / consumer credit policy analysis:
• Feature selection and attribute relevance ranking
• Loan payment performance
• Consumer credit rating
Financial Data Mining
• Classification and clustering of customers for targeted marketing:
• Multidimensional segmentation by nearest-neighbor methods, classification, decision trees, etc., to identify customer groups or to associate a new customer with an appropriate customer group
• Detection of money laundering and other financial crimes:
• Integration of data from multiple databases (e.g., bank transactions, federal/state crime history databases)
• Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (to find unusual access sequences)
Data Mining for Retail Industry
• The retail industry collects huge amounts of data on sales, customer shopping history, etc.
• Applications of retail data mining:
• Identify customer buying behaviors
• Discover customer shopping patterns and trends
• Improve the quality of customer service
• Achieve better customer retention and satisfaction
• Enhance goods consumption ratios
• Design more effective goods transportation and distribution policies
Data Mining in Retail Industry: Examples
• Design and construction of data warehouses based on the benefits of data mining
• Multidimensional analysis of sales, customers, products, time, and region
• Analysis of the effectiveness of sales campaigns
• Customer retention: analysis of customer loyalty
• Use customer loyalty card information to register sequences of purchases of particular customers
• Use sequential pattern mining to investigate changes in customer consumption or loyalty
• Suggest adjustments to the pricing and variety of goods
• Purchase recommendation and cross-referencing of items
Data Mining for the Telecommunication Industry (1)
• A rapidly expanding and highly competitive industry with a great demand for data mining:
• Understand the business involved
• Identify telecommunication patterns
• Catch fraudulent activities
• Make better use of resources
• Improve the quality of service
• Multidimensional analysis of telecommunication data:
• Intrinsically multidimensional: calling time, duration, location of caller, location of callee, type of call, etc.
Data Mining for the Telecommunication Industry (2)
• Fraudulent pattern analysis and the identification of unusual patterns:
• Identify potentially fraudulent users and their atypical usage patterns
• Detect attempts to gain fraudulent entry to customer accounts
• Discover unusual patterns that may need special attention
• Multidimensional association and sequential pattern analysis:
• Find usage patterns for a set of communication services by customer group, by month, etc.
• Promote the sales of specific services
• Improve the availability of particular services in a region
• Use of visualization tools in telecommunication data analysis
