Unit 2 Updated

Data warehousing is the process of integrating data from multiple sources to support analytical reporting and decision-making, characterized by being subject-oriented, integrated, nonvolatile, and time-variant. It involves various architectures, including enterprise data warehouses, operational data stores, and data marts, and utilizes ETL processes for data extraction, transformation, and loading. Additionally, data mining is highlighted as a method for discovering patterns and relationships in large datasets, utilizing techniques such as cluster analysis and predictive modeling.

Uploaded by

Anhad Singh

Unit 2

DATA WAREHOUSING
• A data warehouse is constructed by integrating data from multiple
heterogeneous sources to support analytical reporting, structured
and/or ad hoc queries, and decision making. Data warehousing
involves data cleaning, data integration, and data consolidation.
• A data warehouse is a relational database that is designed for query
and analysis rather than for transaction processing. It usually contains
historical data derived from transaction data, but it can include data
from other sources. It separates analysis workload from transaction
workload and enables an organization to consolidate data from
several sources.
Characteristics of a Data Warehouse
• A common way of introducing data warehousing is to refer to the
characteristics of a data warehouse as set forth by William Inmon:
• Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more
about your company's sales data, you can build a warehouse that concentrates on
sales. Using this warehouse, you can answer questions like "Who was our best
customer for this item last year?" This ability to define a data warehouse by subject
matter, sales in this case, makes the data warehouse subject oriented.
• Integrated
Integration is closely related to subject orientation. Data warehouses must put data
from different sources into a consistent format. They must resolve such problems as
naming conflicts and inconsistencies among units of measure. When they achieve
this, they are said to be integrated.
• Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should
not change. This is logical because the purpose of a warehouse is to
enable you to analyze what has occurred.
• Time Variant
In order to discover trends in business, analysts need large amounts of
data. This is very much in contrast to online transaction processing
(OLTP) systems, where performance requirements demand that
historical data be moved to an archive. A data warehouse's focus on
change over time is what is meant by the term time variant.
Need for a Data Warehouse
1. Business users: Business users require a data warehouse to view
summarized data from the past. Since these users are often non-technical,
the data may be presented to them in an elementary form.
2. Storing historical data: A data warehouse is required to store time-variant
data from the past. This data is then used for various purposes.
3. Making strategic decisions: Some strategies depend on the data in the
data warehouse, so the warehouse contributes to strategic
decision-making.
4. Data consistency and quality: By bringing data from different sources
into one common place, an organization can effectively establish
uniformity and consistency in its data.
5. High response time: A data warehouse has to be ready for somewhat
unexpected loads and types of queries, which demands a significant
degree of flexibility and quick response time.
History of Data Warehouse
• The idea of data warehousing dates to the late 1980s, when IBM researchers Barry
Devlin and Paul Murphy established the "Business Data Warehouse."

• In essence, the data warehousing idea was planned to support an architectural
model for the flow of information from operational systems to decision-support
environments. The concept attempted to address the various problems
associated with this flow, mainly its high costs.

• In the absence of a data warehousing architecture, a vast amount of space was
required to support multiple decision-support environments. In large corporations,
it was common for various decision-support environments to operate
independently.
Goals of Data Warehousing
• To help reporting as well as analysis
• Maintain the organization's historical information
• Be the foundation for decision making.
Benefits of Data Warehouse
• Understand business trends and make better forecasting decisions.
• Data warehouses are designed to perform well with enormous amounts of data.
• The structure of data warehouses makes it easier for end users to
navigate, understand, and query the data.
• Queries that would be complex in many normalized databases can be
easier to build and maintain in data warehouses.
• Data warehousing is an efficient method for managing demand for lots of
information from lots of users.
• Data warehousing provides the capability to analyze a large amount of
historical data.
TYPES OF DATA WAREHOUSE
• Data Marts
• Operational Data Store
• Enterprise data warehouse
• Enterprise Data Warehouse (EDW)
This type of warehouse serves as a key or central database that facilitates
decision-support services throughout the enterprise. The advantage to this type
of warehouse is that it provides access to cross-organizational information, offers
a unified approach to data representation, and allows running complex queries.
• Operational Data Store (ODS)
This type of data warehouse refreshes in real-time. It is often preferred for
routine activities like storing employee records. It is required when data
warehouse systems do not support reporting needs of the business.
• Data Mart
A data mart is a subset of a data warehouse built to maintain a particular
department, region, or business unit. Every department of a business has a
central repository or data mart to store data. The data from the data mart is
stored in the ODS periodically. The ODS then sends the data to the EDW, where it
is stored and used.
What is Metadata
• Metadata is simply defined as data about data. The data that is used
to represent other data is known as metadata.
• For example, the index of a book serves as a metadata for the
contents in the book. In other words, we can say that metadata is the
summarized data that leads us to detailed data. In terms of data
warehouse, we can define metadata as follows.

• Metadata is the road-map to a data warehouse.


• Metadata in a data warehouse defines the warehouse objects.
• Metadata acts as a directory. This directory helps the decision
support system to locate the contents of a data warehouse.
• Metadata helps the decision support system in mapping data when it is
transformed from the operational environment to the data warehouse environment.
• Metadata helps in summarization between current detailed data and highly
summarized data.
• Metadata also helps in summarization between lightly summarized data and highly
summarized data.
• Metadata is used for query tools.
• Metadata is used in extraction and cleansing tools.
• Metadata is used in reporting tools.
• Metadata is used in transformation tools.
• Metadata plays an important role in loading functions.
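The "directory" role of metadata can be sketched as a simple lookup structure. The following is a minimal illustration only; the table names, source systems, and fields are all invented examples, not part of any real warehouse:

```python
# A minimal sketch of a metadata "directory" for a data warehouse.
# Every table name, source, and field below is a hypothetical example.
metadata = {
    "sales_fact": {
        "source": "orders (OLTP database)",
        "transformation": "currency converted to USD; one row per order line",
        "refreshed": "nightly batch",
    },
    "customer_dim": {
        "source": "CRM customer table",
        "transformation": "names standardized; duplicates merged",
        "refreshed": "nightly batch",
    },
}

def locate(table_name):
    """Let a decision-support tool look up where a table's data came from."""
    entry = metadata.get(table_name)
    return entry["source"] if entry else None
```

A query tool could call `locate("sales_fact")` to trace a warehouse table back to its operational source before trusting its contents.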
STAR SCHEMA
• A star schema is a type of data modeling technique used in data warehousing
to represent data in a structured and intuitive way. In a star schema, data is
organized into a central fact table that contains the measures of interest,
surrounded by dimension tables that describe the attributes of the
measures.
• The fact table in a star schema contains the measures or metrics that are of
interest to the user or organization. For example, in a sales data warehouse,
the fact table might contain sales revenue, units sold, and profit margins.
Each record in the fact table represents a specific event or transaction, such
as a sale or order.
• The dimension tables in a star schema contain the descriptive attributes of
the measures in the fact table. These attributes are used to slice and dice the
data in the fact table, allowing users to analyze the data from different
perspectives. For example, in a sales data warehouse, the dimension tables
might include product, customer, time, and location.
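A star schema of this kind can be sketched with Python's built-in sqlite3 module. The tables, columns, and figures below are invented for illustration, following the sales example above: one fact table with foreign keys into two dimension tables.

```python
import sqlite3

# A toy star schema: sales_fact holds the measures, product_dim and
# time_dim hold the descriptive attributes. All names/values are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE time_dim    (time_id INTEGER PRIMARY KEY, year INTEGER, month TEXT);
CREATE TABLE sales_fact  (product_id INTEGER, time_id INTEGER,
                          units_sold INTEGER, revenue REAL);
""")
con.executemany("INSERT INTO product_dim VALUES (?,?,?)",
                [(1, "Shampoo", "Hair Care"), (2, "Soap", "Skin Care")])
con.executemany("INSERT INTO time_dim VALUES (?,?,?)",
                [(1, 2023, "Jan"), (2, 2023, "Feb")])
con.executemany("INSERT INTO sales_fact VALUES (?,?,?,?)",
                [(1, 1, 10, 50.0), (1, 2, 20, 100.0), (2, 1, 5, 10.0)])

# Slice the measures by a dimension attribute: total revenue per category.
rows = con.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM sales_fact f JOIN product_dim p ON f.product_id = p.product_id
    GROUP BY p.category ORDER BY p.category
""").fetchall()
```

A single join from fact to dimension is enough to "slice and dice" here, which is the defining convenience of the star shape.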
SNOWFLAKE SCHEMA
• A snowflake schema is a multi-dimensional data model that is an
extension of a star schema, where dimension tables are broken down
into sub-dimensions.
• Snowflake schemas are commonly used for business intelligence and
reporting in OLAP data warehouses, data marts, and relational
databases.
• In a snowflake schema, engineers break down individual dimension
tables into logical sub-dimensions. This makes the data model more
complex, but it can be easier for analysts to work with, especially for
certain data types.
• It's called a snowflake schema because its entity-relationship diagram
(ERD) looks like a snowflake, as seen below.
Snowflake schemas vs. star schemas

• Like star schemas, snowflake schemas have a central fact table which
is connected to multiple dimension tables via foreign keys. However,
the main difference is that they are more normalized than star
schemas.
• Snowflake schemas offer more storage efficiency, due to their tighter
adherence to high normalization standards, but query performance is
not as good as with more denormalized data models.
• Denormalized data models like star schemas have more data
redundancy (duplication of data), which makes query performance
faster at the cost of duplicated data.
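The normalization trade-off can be sketched in sqlite3. In this invented example the product dimension is snowflaked: its category attribute is moved into a separate sub-dimension table, so the same category query needs one extra join compared with a star schema, where category would sit directly on the product dimension.

```python
import sqlite3

# A toy snowflake schema: the category attribute lives in its own
# sub-dimension table, removing duplication from product_dim.
# All table names and values are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE category_dim (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE product_dim  (product_id INTEGER PRIMARY KEY, name TEXT,
                           category_id INTEGER);
CREATE TABLE sales_fact   (product_id INTEGER, revenue REAL);
""")
con.executemany("INSERT INTO category_dim VALUES (?,?)",
                [(1, "Hair Care"), (2, "Skin Care")])
con.executemany("INSERT INTO product_dim VALUES (?,?,?)",
                [(1, "Shampoo", 1), (2, "Soap", 2)])
con.executemany("INSERT INTO sales_fact VALUES (?,?)",
                [(1, 150.0), (2, 10.0)])

# Revenue per category now takes two joins instead of one.
rows = con.execute("""
    SELECT c.category, SUM(f.revenue)
    FROM sales_fact f
    JOIN product_dim p  ON f.product_id  = p.product_id
    JOIN category_dim c ON p.category_id = c.category_id
    GROUP BY c.category ORDER BY c.category
""").fetchall()
```

The extra join is the query-performance cost; the payoff is that each category string is stored exactly once.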
ETL
• As databases grew in popularity in the 1970s, ETL was introduced
as a process for integrating and loading data for computation and
analysis, eventually becoming the primary method of processing data
for data warehousing projects.
Extract

• During data extraction, raw data is copied or exported from source locations to a
staging area. Data management teams can extract data from a variety of data
sources, which can be structured or unstructured. Those sources include but are
not limited to:

• SQL or NoSQL servers


• CRM and ERP systems
• Flat files
• Email
• Web pages
Transform
• In the staging area, the raw data undergoes data processing. Here, the data is
transformed and consolidated for its intended analytical use case. This phase can involve
the following tasks:

• Filtering, cleansing, de-duplicating, validating, and authenticating the data.


• Performing calculations, translations, or summarizations based on the raw
data. This can include changing row and column headers for consistency,
converting currencies or other units of measurement, editing text strings, and
more.
• Conducting audits to ensure data quality and compliance
• Removing, encrypting, or protecting data governed by industry or
governmental regulators
• Formatting the data into tables or joined tables to match the schema of the
target data warehouse.
Load

• In this last step, the transformed data is moved from the staging area
into a target data warehouse. Typically, this involves an initial loading
of all data, followed by periodic loading of incremental data changes
and, less often, full refreshes to erase and replace data in the
warehouse.
• For most organizations that use ETL, the process is automated, well-
defined, continuous and batch-driven. Typically, ETL takes place
during off-hours when traffic on the source systems and the data
warehouse is at its lowest.
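The three ETL phases described above can be sketched end to end in plain Python. This is a toy run only: the source records, the EUR-to-USD rate, and the field names are all invented for illustration, and a real pipeline would read from and write to actual systems.

```python
# Extract: raw rows copied from a source system into a staging structure.
source = [
    {"id": 1, "amount": "100", "currency": "EUR"},
    {"id": 2, "amount": "250", "currency": "USD"},
    {"id": 2, "amount": "250", "currency": "USD"},  # duplicate row
    {"id": 3, "amount": "",    "currency": "USD"},  # invalid row (no amount)
]

def transform(rows, eur_to_usd=1.1):
    """Cleanse, validate, de-duplicate, and convert units to one currency."""
    seen, out = set(), []
    for r in rows:
        if not r["amount"]:            # validation: drop rows with no amount
            continue
        if r["id"] in seen:            # de-duplication on the business key
            continue
        seen.add(r["id"])
        amount = float(r["amount"])
        if r["currency"] == "EUR":     # unit conversion to a common currency
            amount *= eur_to_usd
        out.append({"id": r["id"], "amount_usd": round(amount, 2)})
    return out

warehouse = []                          # load target (stand-in for a table)
warehouse.extend(transform(source))     # initial full load
```

Incremental loads would call `transform` on only the changed source rows and append them, matching the periodic loading pattern described above.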
Data Warehouse Architectures
Source Data Component:

• In the data warehouse, the source data comes from different places. It is grouped into four
categories:

1. External data: Most executives and data analysts rely on external sources for a large
amount of the information they use, such as industry statistics relevant to their organization
that are produced by external agencies and departments.
2. Internal data: In every organization, users keep their own "private" spreadsheets, reports,
client profiles, and often even departmental databases. This is internal data, parts of which
can be helpful in a data warehouse.
3. Operational system data: Operational systems are principally meant to run the business. In each
operational system, we periodically take the old data and store it in archived files.
4. Flat files: A flat file is a text database that stores data in a plain text format. Flat files
generally are text files that have all data-processing and structure markup removed. A flat file
contains a table with a single record per line.
Data Staging:
• After the data is extracted from various sources, now it’s time to
prepare the data files for storing in the data warehouse. The extracted
data collected from various sources must be transformed and made
ready in a format that is suitable to be saved in the data warehouse
for querying and analysis. The data staging contains three primary
functions that take place in this part:
1. Data Extraction: This stage handles various data sources. Data analysts should
employ suitable techniques for every data source.
2. Data Transformation: Data for a data warehouse comes from many different
sources. If data extraction poses huge challenges, data transformation presents even
greater ones. We perform many individual tasks as part of transformation. First, we
clean the data extracted from each source. Standardization of data elements forms a
large part of data transformation. Transformation also involves combining pieces of
data from different sources, purging source data that is not useful, and separating
source records into new combinations. Once data transformation ends, we have a
collection of integrated data that is clean, standardized, and summarized.
3. Data Loading: When we complete the structure and construction of the data
warehouse and go live for the first time, we do the initial loading of the data into
the data warehouse storage. The initial load moves high volumes of data
consuming a considerable amount of time.
Data Storage in Warehouse:

• Data storage for data warehousing is split into multiple repositories. These data
repositories contain structured data in a very highly normalized form for fast and efficient
processing.

• Metadata: Metadata means data about data, i.e., it summarizes basic details regarding
the data, making it easier to find and work with particular instances of data. Metadata can
be created manually or generated automatically and contains basic information about the
data.
• Raw Data: Raw data is data that has been delivered from a particular data source but has
not yet been processed by machine or human. Such data can be gathered from online
sources to deliver deep insight into users' online behavior.
• Summary Data: Summary data is a brief condensation of a large body of detailed data.
Analysts write the code and, at the end, declare the final result in the form of summarized
data. Data summaries are essential in data mining and processing.
Data Marts

• Data marts are also part of the storage component in a data
warehouse. A data mart can store the information of a specific function of an
organization that is handled by a single authority. There may be any number of
data marts in a particular organization, depending upon the functions. In short,
data marts contain subsets of the data stored in data warehouses.
What is Data Mining
⚫ Data mining is the process of discovering
knowledge in form of interesting
patterns and relationships from large
amounts of data.
⚫ “Data mining,” writes Joseph P. Bigus in his
book Data Mining with Neural Networks, “is
the efficient discovery of valuable,
non-obvious information from a large
collection of data.”
⚫ Data mining centers around the automated
discovery of new facts and relationships in
data.
⚫ It is a multi-disciplinary skill that uses
machine learning, statistics, AI and
database technology.

⚫ With traditional query tools, you search for


known information. Data mining tools enable
you to uncover hidden information.

What kinds of data can be mined?
⚫ As a general technology, data mining can be
applied to any kind of data as long as the
data are meaningful for a target application.
The most basic forms of data for mining
applications are :
⚫ database data
⚫ data warehouse data
⚫ transactional data

Database Data
⚫ Relational data can be accessed by database queries written in a
relational query language (e.g., SQL) or with the assistance of
graphical user interfaces.

⚫ Suppose that your job is to analyze the AllElectronics store data.


• Through the use of relational queries, you can ask things like:

⚫ “Show me a list of all items that were sold in the last quarter.”
⚫ “Show me the total sales of the last month, grouped by branch.”
⚫ “How many sales transactions occurred in the month of December?”
⚫ “Which salesperson had the highest sales?”
⚫ When mining relational databases, we can
go further by searching for trends or data
patterns. For example, data mining systems
can analyse customer data to predict the
credit risk of new customers based on their
income, age, and previous credit information.

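Queries of the kind listed above can be sketched with Python's built-in sqlite3 module. The AllElectronics-style sales table and its figures below are invented for illustration:

```python
import sqlite3

# A tiny invented sales table to run the example relational queries on.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (branch TEXT, month TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?,?,?)", [
    ("Downtown", "Dec", 120.0), ("Downtown", "Dec", 80.0),
    ("Uptown",   "Dec", 50.0),  ("Uptown",   "Nov", 75.0),
])

# "Show me the total sales of the last month, grouped by branch."
by_branch = con.execute("""
    SELECT branch, SUM(amount) FROM sales
    WHERE month = 'Dec' GROUP BY branch ORDER BY branch
""").fetchall()

# "How many sales transactions occurred in the month of December?"
dec_count = con.execute(
    "SELECT COUNT(*) FROM sales WHERE month = 'Dec'").fetchone()[0]
```

These queries retrieve known information; mining, as the next slide notes, goes further by searching such tables for trends and predictive patterns.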
Trends in Data Mining
• Time Series Data Mining
• Phenomenal Data Mining
• Ubiquitous Data Mining
• Multimedia Data mining
• Spatial and Geographic Data mining
Functionalities of Data Mining
• Cluster Analysis
• Association Analysis
• Predictive modeling
• Anomaly Detection
• Clustering is similar to classification. However, clustering identifies
similarities between objects, then groups those items based on what
makes them different from other items. While classification may result
in groups such as "shampoo," "conditioner," "soap," and "toothpaste,"
clustering may identify groups such as "hair care" and "dental health."
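A minimal clustering sketch, using a one-dimensional k-means (Lloyd's algorithm) written with only the standard library. The customer-spend numbers and the choice of two clusters are invented for illustration; real clustering would use a library implementation over many attributes.

```python
def kmeans_1d(points, centers, iters=10):
    """Tiny 1-D k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # Recompute centers; keep the old center if a cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Invented monthly spend per customer: two natural groups, low and high.
spend = [10, 12, 11, 95, 100, 98]
centers, clusters = kmeans_1d(spend, centers=[0.0, 50.0])
```

No labels were given in advance; the algorithm discovers the low-spend and high-spend groups from similarity alone, which is what distinguishes clustering from classification.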
• Association rules, also referred to as market basket analysis, search for
relationships between variables. This relationship in itself creates
additional value within the data set as it strives to link pieces of data.
For example, association rules would search a company's sales history
to see which products are most commonly purchased together; with
this information, stores can plan, promote, and forecast.
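The co-occurrence counting at the heart of market basket analysis can be sketched with the standard library. The baskets below are invented; support here is the fraction of baskets containing both items of a pair.

```python
from itertools import combinations
from collections import Counter

# Invented purchase baskets; each set is one customer transaction.
baskets = [
    {"tea", "honey", "milk"},
    {"tea", "honey"},
    {"bread", "milk"},
    {"tea", "milk"},
]

# Count how often each pair of products appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# support(honey, tea) = fraction of baskets containing both items.
support = pair_counts[("honey", "tea")] / len(baskets)
```

Pairs with high support (here, tea with honey in half of all baskets) are the candidates for shelf placement and bundled promotions described above.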
• Predictive analysis strives to leverage historical information to build
graphical or mathematical models to forecast future outcomes.
Overlapping with regression analysis, this technique aims to support
an unknown figure in the future based on current data on hand.
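A minimal predictive-modeling sketch in the regression style mentioned above: fit a least-squares line to past sales and extrapolate one period ahead. The monthly sales figures are invented and deliberately perfectly linear for clarity.

```python
from statistics import mean

# Invented history: month number vs. sales, perfectly linear for clarity.
months = [1, 2, 3, 4, 5]
sales  = [100, 120, 140, 160, 180]

# Ordinary least squares for a single predictor:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).
mx, my = mean(months), mean(sales)
slope = (sum((x - mx) * (y - my) for x, y in zip(months, sales))
         / sum((x - mx) ** 2 for x in months))
intercept = my - slope * mx

# Use the fitted model to forecast the unknown future value.
forecast_month_6 = slope * 6 + intercept
```

Real predictive models validate on held-out data before forecasting; this sketch only shows the fit-then-extrapolate shape of the technique.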
• Anomaly detection is the identification of rare events, items, or
observations which are suspicious because they differ significantly
from standard behaviors or patterns. Anomalies in data are also called
outliers, noise, novelties, deviations, and exceptions.
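A classic anomaly-detection sketch is the z-score rule: flag any observation more than two standard deviations from the mean. The sensor-style readings below are invented, with one obvious outlier planted.

```python
from statistics import mean, stdev

# Invented readings; 25.0 is the planted anomaly.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]

mu, sigma = mean(readings), stdev(readings)

# Flag observations whose z-score exceeds 2.
anomalies = [x for x in readings if abs(x - mu) / sigma > 2]
```

The threshold of 2 is a common rule of thumb, not a universal constant; production systems tune it (or use robust statistics) because a large outlier inflates the mean and standard deviation it is measured against.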
Some of the Data mining challenges are given as
under:
1. Security and Social Challenges
2. Noisy and Incomplete Data
3. Distributed Data
4. Complex Data
5. Performance
6. Scalability and Efficiency of the Algorithms
7. Improvement of Mining Algorithms
8. Incorporation of Background Knowledge
9. Data Visualization
10. Data Privacy and Security
11. User Interface
12. Mining dependent on Level of Abstraction
13. Integration of Background Knowledge
14. Mining Methodology Challenges
APPLICATION OF DATA MINING
• HEALTHCARE
• Effective management of hospital resources: Data mining supports constructing a
model for managing hospital resources, which is an important task in healthcare. Using data
mining, it is possible to detect chronic diseases and, based on the complications of each
patient's condition, prioritize patients so that they receive effective treatment in a timely
and accurate manner. Fitness reports and demographic details of patients are also useful for
utilizing the available hospital resources effectively.
• Hospital ranking: Different data mining approaches are used to analyze various hospital
details in order to determine their ranks. Hospitals are ranked on the basis of their
capability to handle high-risk patients: a higher-ranked hospital handles high-risk
patients as its top priority, while a lower-ranked hospital does not consider the risk factor.
• Better customer relations: Data mining helps healthcare institutes understand the needs,
preferences, behavior, and patterns of their customers in order to build better relations
with them. Using data mining, Customer Potential Management Corp. developed an index
representing consumer healthcare utilization. This index helps detect a customer's
affinity toward a particular healthcare service.
• Hospital infection control: A system for inspection is constructed using data mining techniques to
discover unknown or irregular patterns in infection control data. Using data mining, physicians and
patients can easily compare different treatment techniques. They can analyze the effectiveness of
available treatments and find out which technique is better and more cost-effective. Data mining also
helps them identify the side effects of a particular treatment, make appropriate decisions to reduce
hazards, and develop smart methodologies for treatment.
• Improved patient care: Data mining helps healthcare providers identify the present and future
requirements of patients and their preferences to enhance their satisfaction levels. Large amounts of
data are collected with the advancement of electronic health records. Patient data available in
digitized form improves the quality of the healthcare system. In order to analyze this massive data, a
predictive model is constructed using data mining that discovers interesting information and supports
decisions about improving healthcare quality.
• Decreased insurance fraud: Healthcare insurers develop models to detect fraud and abuse in
medical claims using data mining techniques. Such models help identify improper
prescriptions and irregular or fake patterns in medical claims made by physicians, patients,
hospitals, etc. Doctors' prescriptions and treatment records produce large amounts of data.
APPLICATION OF DATA MINING IN RETAIL
1. Customer Segmentation: Those in the retail industry can use data mining software to categorize customers based
on shared characteristics. These subgroups may be divided by demographics, shopping behaviors, past purchase
history, etc. Retailers can use the data collected to reach their target customers more effectively: with customer
segmentation analysis, businesses can adjust their strategy to reach current and future patrons. You might also use
this analysis to develop your brand’s message, build a marketing persona, and determine what new products and
services you want to invest in.
2. Churn Analysis: Customer churn, also called customer attrition, is a measure of customer loss. Churn analysis
involves analyzing past customer data like demographics and transaction history. This will help you understand
who is leaving, why they are leaving, who is more likely to leave soonest, and how you can improve customer
retention. The data analysis seeks to uncover the cause of attrition. Some churn is inevitable, but analyzing this
retail data will allow you to be proactive by identifying patterns. For example, retailers can send incentives,
coupons, or special offers to customers with the highest risk of leaving.
3. Market Basket Analysis: Market basket analysis is a data mining technique used to anticipate the customer’s next
moves based on purchasing patterns. The data collected reveals commonly grouped products, which can predict
which items will most likely be purchased together. Market basket analysis reveals previously unrecognized
associations and establishes relationships between products. For example, someone who buys tea is also likely to
purchase honey. The store could then display them on the shelf together, offer bundled discounts, or advertise tea
on the product page for honey and vice versa. Implementing market basket analysis is a fantastic strategy for
increasing sales.
