
UNIT 1

Data Warehouse: A data warehouse is a specialized database optimized for analysis and reporting rather than transaction processing. It is designed to store historical data, allowing organizations to analyze trends over time and make informed business decisions.

The primary purpose of a data warehouse is to enable businesses to consolidate data from multiple, often disparate sources into a single, coherent structure. This consolidation supports advanced querying, reporting, and data analysis, providing valuable insights that drive strategic decision-making.

Database System vs. Data Warehouse
- Database system: supports operational processes. Data warehouse: supports analysis and performance reporting.
- Database system: captures and maintains the data. Data warehouse: explores the data.
- Database system: holds current data. Data warehouse: holds multiple years of history.
- Database system: data is balanced within the scope of this one system. Data warehouse: data must be integrated and balanced from multiple systems.
- Database system: data is updated when a transaction occurs. Data warehouse: data is updated by scheduled processes.
- Database system: data verification occurs when entry is done. Data warehouse: data verification occurs after the fact.
- Database system: 100 MB to GB of data. Data warehouse: 100 GB to TB of data.
- Database system: ER-based design. Data warehouse: star/snowflake design.
- Database system: application-oriented. Data warehouse: subject-oriented.
- Database system: primitive and highly detailed data. Data warehouse: summarized and consolidated data.
- Database system: flat relational view. Data warehouse: multidimensional view.


NEED of DATA WAREHOUSING
1) Business users: Business users require a data warehouse to view summarized data from the past. Since these users are non-technical, the data may need to be presented to them in an elementary, easy-to-read form.
2) Store historical data: A data warehouse is required to store time-variant data from the past, which is then used for various analytical purposes.
3) Make strategic decisions: Many strategic decisions depend upon the data in the data warehouse, so the data warehouse contributes directly to strategic decision-making.
4) Data consistency and quality: By bringing data from different sources to a common place, the organization can effectively enforce uniformity and consistency in the data.
5) High response time: A data warehouse has to be ready for somewhat unexpected loads and types of queries, which demands a significant degree of flexibility and a quick response time.

Features of Data Warehousing


• Centralized Data Repository: Data warehousing provides a centralized repository for all
enterprise data from various sources, such as transactional databases, operational systems,
and external sources. This enables organizations to have a comprehensive view of their data,
which can help in making informed business decisions.
• Data Integration: Data warehousing integrates data from different sources into a single, unified view, which can help in eliminating data silos and reducing data inconsistencies.
• Historical Data Storage: Data warehousing stores historical data, which enables organizations
to analyze data trends over time. This can help in identifying patterns and anomalies in the
data, which can be used to improve business performance.
• Query and Analysis: Data warehousing provides powerful query and analysis capabilities that
enable users to explore and analyze data in different ways. This can help in identifying patterns
and trends, and can also help in making informed business decisions.
• Data Transformation: Data warehousing includes a process of data transformation, which
involves cleaning, filtering, and formatting data from various sources to make it consistent and
usable. This can help in improving data quality and reducing data inconsistencies.
• Data Mining: Data warehousing provides data mining capabilities, which enable organizations
to discover hidden patterns and relationships in their data. This can help in identifying new
opportunities, predicting future trends, and mitigating risks.
• Data Security: Data warehousing provides robust data security features, such as access
controls, data encryption, and data backups, which ensure that the data is secure and
protected from unauthorized access.
Difference between Data Warehouse and Data Mart
1. A data warehouse is a centralised system. A data mart is a decentralised system.
2. In a data warehouse, light denormalization takes place. In a data mart, heavy denormalization takes place.
3. A data warehouse follows a top-down model. A data mart follows a bottom-up model.
4. Building a warehouse is difficult. Building a mart is easy.
5. In a data warehouse, the fact constellation schema is used. In a data mart, star and snowflake schemas are used.
6. A data warehouse is flexible. A data mart is not flexible.
7. A data warehouse is data-oriented in nature. A data mart is project-oriented in nature.
8. A data warehouse has a long life. A data mart has a shorter life than a warehouse.
9. In a data warehouse, data are contained in detailed form. In a data mart, data are contained in summarized form.
10. A data warehouse is vast in size. A data mart is smaller than a warehouse.
11. A data warehouse might be somewhere between 100 GB and 1 TB+ in size. The size of a data mart is less than 100 GB.
12. The time it takes to implement a data warehouse might range from months to years. The data mart deployment procedure is time-limited to a few months.
13. A data warehouse uses a lot of data and has comprehensive operational data. Operational data are not present in a data mart.
14. A data warehouse collects data from various data sources. A data mart generally stores data drawn from a data warehouse.
15. A data warehouse takes a long time to process data because of the large data volume. A data mart takes less time because it handles only a small amount of data.
16. A data warehouse has a complicated design process for creating schemas and views. A data mart has an easy design process for creating schemas and views.

Data warehouses have a three-tier architecture, which consists of:

Bottom tier: The bottom tier consists of a data warehouse server, usually a relational database system, which collects, cleanses, and transforms data from multiple data sources through a process known as Extract, Transform, and Load (ETL) or a process known as Extract, Load, and Transform (ELT). For most organizations that use ETL, the process relies on automation and is efficient, well-defined, continuous, and batch-driven.

Middle tier: The middle tier consists of an OLAP (online analytical processing) server which enables fast query speeds. Three types of OLAP models can be used in this tier, known as ROLAP, MOLAP and HOLAP. The type of OLAP model used depends on the type of database system that exists.

Top tier: The top tier is represented by some kind of front-end user interface or reporting tool, which enables end users to conduct ad-hoc data analysis on their business data.

Metadata is essentially "data about data." It provides details and descriptions about the data stored in your data warehouse. Here's what metadata typically includes:
Data Structure: Describes how data is organized within the warehouse
(tables, columns, data types).
Data Meaning: Defines the meaning of each data element (e.g.,
"CustomerID" refers to a unique customer identifier).
Data Origin: Tracks where the data came from (e.g., sales database,
marketing system).
Data Transformation Rules: Documents how the data was transformed
before being loaded into the warehouse (e.g., currency conversion).
Data Retention Policies: Specifies how long the data will be stored in the
warehouse.
Data Access Controls: Defines who can access specific data elements within
the warehouse.
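
As a rough illustration (not taken from the original notes), one metadata catalogue entry for a single warehouse column could be recorded as a small Python dictionary; every field name and value below is hypothetical.

# A minimal, hypothetical metadata entry for one column in the warehouse.
customer_id_metadata = {
    "table": "dim_customer",             # data structure: where the column lives
    "column": "CustomerID",
    "data_type": "INTEGER",
    "meaning": "Unique identifier assigned to each customer",   # data meaning
    "source_system": "sales_database",   # data origin
    "transformation": "Mapped from a legacy customer number via lookup",  # transformation rule
    "retention_years": 7,                # retention policy
    "access_roles": ["analyst", "dba"],  # access controls
}

print(customer_id_metadata["meaning"])
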
Information Package: An information package is a document or a set of
documents that outline the data requirements, key metrics, dimensions, and
hierarchies needed for analysis in a specific business area. It acts as a
communication tool between business users and data warehouse developers,
ensuring that the data warehouse is designed to meet specific analytical needs.

The primary purpose of an information package is to clearly define what data is needed, how it should be structured, and how it will be used. This helps in designing the data warehouse and its components in a way that aligns with business objectives and user requirements.

What is ETL
Extraction, transformation, and load help an organization make its data accessible, meaningful, and usable across different data systems. An ETL tool is software used to extract, transform, and load the data.
An ETL tool is a set of libraries, written in some programming language, that simplifies data integration and transformation work for any need. For example, each time we browse the web on a mobile phone, some amount of data is generated; a commercial plane can produce up to 500 GB of data per hour. Data at this scale is often called Big Data, but it is of little use until ETL operations are performed on it.

How ETL Works?


ETL consists of three separate phases:

Extraction

o Extraction is the operation of extracting information from a source system for further use in a data warehouse environment. This is the first stage of the ETL process.
o The extraction process is often one of the most time-consuming tasks in ETL.
o The source systems might be complicated and poorly documented, and thus determining which data needs to be extracted can be difficult.
o The data has to be extracted several times in a periodic manner to supply all changed data to the warehouse and keep it up to date.

Cleansing
The cleansing stage is crucial in a data warehouse environment because it is meant to improve data quality. The primary data cleansing features found in ETL tools are rectification and homogenization.

Transformation
Transformation is the core of the reconciliation phase. It converts records from their operational source format into a particular data warehouse format. If we implement a three-layer architecture, this phase outputs our reconciled data layer.

Loading
Loading is the process of writing the data into the target database. During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible.
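
A minimal ETL sketch in Python using pandas, assuming a hypothetical sales export and a local SQLite database as the target; the file name, table name, and columns are all illustrative.

import pandas as pd
import sqlite3

# Extract: read raw records from the operational export
# (a tiny stand-in file is written first so the sketch runs end to end).
with open("sales.csv", "w") as f:
    f.write("order_id,amount,order_date\n1,100.0,2024-01-05\n1,100.0,2024-01-05\n2,,2024-02-10\n")
raw = pd.read_csv("sales.csv")

# Transform: cleanse and reshape into the warehouse format.
raw = raw.drop_duplicates(subset="order_id")            # remove duplicate transactions
raw["order_date"] = pd.to_datetime(raw["order_date"])   # standardize dates
raw["amount"] = raw["amount"].fillna(0.0)               # handle missing measures

# Load: write the transformed data into the target warehouse table.
conn = sqlite3.connect("warehouse.db")
raw.to_sql("fact_sales", conn, if_exists="append", index=False)
conn.close()
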
UNIT 2

Dimensional Modelling
Dimensional modeling is a data modeling technique used in data warehouses to organize and categorize data into dimensions and facts. It is part of the Business Dimensional Lifecycle methodology, which was developed by Ralph Kimball.
This method involves organizing data into dimensions and facts, where dimensions are used to describe the data, and facts are used to quantify the data.
For example, a sale transaction can be broken down into facts such as the number of products ordered and the price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to locations, and the salesperson responsible for receiving the order.

Objectives of Dimensional Modeling

- To produce a database architecture that is easy for end users to understand and write queries against.
- To maximize the efficiency of queries. It achieves these goals by minimizing the number of tables and the relationships between them.

PRINCIPLES OF DIMENSIONAL MODELLING


1. Granularity: defines the level of detail at which facts are stored. This is called
the "grain" of the data.
For example, a fact table might store sales data at the daily grain, meaning
each record represents a single sale on a specific day.
2. Surrogate Keys:
Each dimension table has a unique identifier called a surrogate key, typically
an integer with no inherent meaning.
This helps avoid issues with duplicate values in natural keys (e.g., customer
names).
3. Slowly Changing Dimensions (SCDs):
These define how to handle dimensions that change over time (e.g., customer address).
Different SCD types exist, such as Type 1 (overwrite old values), Type 2 (create
new records), etc.; a minimal Type 2 sketch appears after this list.
4. Conformed Dimensions:
Reusable dimension tables shared across multiple fact tables, ensuring
consistency and reducing redundancy.
5. Star Schema:
The most common type of dimensional model, resembling a star with a
central fact table surrounded by dimension tables connected through
foreign keys.
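
The SCD handling mentioned in point 3 can be sketched roughly as follows: a simplified Type 2 example in Python/pandas where the old row is closed and a new row gets a fresh surrogate key. The table and column names are made up for illustration.

import pandas as pd

# Existing dim_customer rows, each with a surrogate key and a "current" flag.
dim_customer = pd.DataFrame([
    {"customer_sk": 1, "customer_id": "C001", "address": "Old Street 5", "is_current": True},
])

# A changed address arrives from the source system.
new_address = "New Avenue 12"

# SCD Type 2: mark the old row as no longer current and insert a new row
# with the next surrogate key, preserving history.
dim_customer.loc[dim_customer["customer_id"] == "C001", "is_current"] = False
new_row = {
    "customer_sk": dim_customer["customer_sk"].max() + 1,  # next surrogate key
    "customer_id": "C001",
    "address": new_address,
    "is_current": True,
}
dim_customer = pd.concat([dim_customer, pd.DataFrame([new_row])], ignore_index=True)

print(dim_customer)
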
MULTI-DIMENSIONAL DATA MODEL
A multi-dimensional data model is a way of organizing data specifically designed for analyzing
information from various perspectives. It's a popular approach in data warehousing and OLAP
(Online Analytical Processing) applications.
Data Cube: This is the central structure in the model, resembling a cube with each side representing
a different dimension (e.g., time, product, location). Imagine a three-dimensional cube where each
side allows you to analyze data from a specific angle.
Dimensions: These are the categories that define the different perspectives you can analyze the
data from. For example, in a sales data cube, dimensions might include time (year, month, day),
product (category, brand), and location (city, state). Each dimension has its own table called a
dimensional table, which stores the attributes related to that dimension (e.g., product name, brand
name).
Measures: These are the quantitative values associated with the data, like sales amount, product
quantity sold, or customer count. They are stored in a central table called the fact table.
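
A rough way to see the cube idea in Python with pandas: a pivot table aggregates a sales measure across two dimensions, which corresponds to looking at one face of the cube. The data below is invented purely for illustration.

import pandas as pd

# Tiny fact-like dataset: each row is a sale with dimension values and a measure.
sales = pd.DataFrame({
    "year":     [2023, 2023, 2024, 2024],
    "product":  ["Laptop", "Phone", "Laptop", "Phone"],
    "location": ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "amount":   [1200, 800, 1500, 900],
})

# Aggregate the measure across the year and product dimensions (one "face" of the cube).
cube_slice = sales.pivot_table(values="amount", index="year",
                               columns="product", aggfunc="sum")
print(cube_slice)
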

STAR SCHEMA

A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions. A fact is an event that is counted or measured,
such as a sale or log in. A dimension includes reference data about the fact, such as
date, item, or customer.

A star schema is a relational schema whose design represents a multidimensional data model. The star schema is the simplest data warehouse schema. It is known as a star schema because the entity-relationship diagram of this schema resembles a star, with points diverging from a central table. The center of the schema consists of a large fact table, and the points of the star are the dimension tables.

Characteristics
o It creates a denormalized database that can quickly provide query responses.
o It provides a flexible design that can be changed easily or added to throughout the development cycle, and as the database grows.
o Its design parallels how end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.
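
As an illustration only (not from the notes), a star schema query in Python/pandas joins a central fact table to its dimension tables through foreign keys and then aggregates a measure; the tables below are invented.

import pandas as pd

# Dimension tables: descriptive reference data.
dim_date = pd.DataFrame({"date_key": [1, 2], "month": ["Jan", "Feb"]})
dim_product = pd.DataFrame({"product_key": [10, 11], "category": ["Laptop", "Phone"]})

# Fact table: measures plus foreign keys pointing at the dimensions.
fact_sales = pd.DataFrame({
    "date_key": [1, 1, 2],
    "product_key": [10, 11, 10],
    "amount": [1200, 800, 1500],
})

# "Star join": attach dimension attributes to the facts, then aggregate.
report = (fact_sales
          .merge(dim_date, on="date_key")
          .merge(dim_product, on="product_key")
          .groupby(["month", "category"])["amount"].sum())
print(report)
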
SNOWFLAKE SCHEMA
A snowflake schema is an expansion of the star schema where each point of the star explodes into more points. It is called a snowflake schema because the diagram of the schema resembles a snowflake.
Snowflaking is a method of normalizing the dimension tables in a star schema. When we normalize all the dimension tables entirely, the resultant structure resembles a snowflake with the fact table in the middle.
Snowflaking is used to improve the performance of specific queries. The schema is diagrammed with each fact surrounded by its associated dimensions, and those dimensions are related to other dimensions, branching out into a snowflake pattern.
The snowflake schema consists of one fact table which is linked to many dimension tables, which can in turn be linked to other dimension tables through many-to-one relationships.

FACT CONSTELLATION SCHEMA
A fact constellation means two or more fact tables sharing one or more dimensions. It is also called a galaxy schema.
The fact constellation schema describes a logical structure of a data warehouse or data mart. It can be designed with a collection of denormalized fact tables and shared, conformed dimension tables.
The fact constellation schema is a sophisticated database design in which it is difficult to summarize information. It can be implemented between aggregate fact tables, or by decomposing a complex fact table into independent, simpler fact tables.
OLAP, or Online Analytical Processing, plays a crucial role within data warehouses. It's a set
of software tools and technologies specifically designed for analyzing multidimensional data
stored in data warehouses.
Benefits of OLAP:
Fast Analysis: Pre-calculated data and optimized structures enable quick response times for
complex queries.
Multidimensional View: Allows users to analyze data from various perspectives, leading to
deeper understanding.
Flexibility: Supports diverse analytical tasks, from simple aggregations to complex
calculations.
User-Friendly: OLAP tools provide intuitive interfaces for data exploration and visualization.
Types of OLAP Servers:
MOLAP (Multidimensional OLAP): Stores data in multidimensional arrays for fast retrieval.
ROLAP (Relational OLAP): Stores data in relational databases but uses OLAP functionalities
for analysis.
HOLAP (Hybrid OLAP): Combines MOLAP and ROLAP approaches for flexibility and
performance.

The main characteristics of OLAP are as follows:


Multidimensional conceptual view: OLAP systems let business users have a dimensional and logical view of the data in the data warehouse. This supports slice and dice operations.
Multi-user support: Since OLAP systems are shared, OLAP operations should provide normal database functions, including retrieval, update, concurrency control, integrity, and security.
Accessibility: OLAP acts as a mediator between data warehouses and the front-end. OLAP operations should sit between data sources (e.g., data warehouses) and an OLAP front-end.
Storing OLAP results: OLAP results are kept separate from data sources.
Uniform reporting performance: Increasing the number of dimensions or the database size should not significantly degrade the reporting performance of the OLAP system.
OLAP distinguishes between zero values and missing values so that aggregates are computed correctly; an OLAP system should ignore missing values and compute correct aggregate values.
OLAP facilitates interactive querying and complex analysis for users.
OLAP allows users to drill down for greater detail or roll up for aggregations of metrics along a single business dimension or across multiple dimensions.
OLAP provides the ability to perform intricate calculations and comparisons.
OLAP presents results in a number of meaningful ways, including charts and graphs.
Difference between OLAP and OLTP
Definition: OLAP is well-known as an online database query management system. OLTP is well-known as an online database modifying system.
Data source: OLAP consists of historical data from various databases. OLTP consists of only operational, current data.
Method used: OLAP makes use of a data warehouse. OLTP makes use of a standard database management system (DBMS).
Application: OLAP is subject-oriented; it is used for data mining, analytics, decision making, etc. OLTP is application-oriented; it is used for business tasks.
Normalization: In an OLAP database, tables are not normalized. In an OLTP database, tables are normalized (3NF).
Usage of data: In OLAP, the data is used in planning, problem-solving, and decision-making. In OLTP, the data is used to perform day-to-day fundamental operations.
Task: OLAP provides a multi-dimensional view of different business tasks. OLTP reveals a snapshot of present business tasks.
Purpose: OLAP serves the purpose of extracting information for analysis and decision-making. OLTP serves the purpose of inserting, updating, and deleting information in the database.
Volume of data: In OLAP, a large amount of data is stored, typically in TB or PB. In OLTP, the size of the data is relatively small, as historical data is archived, typically in MB and GB.
1. Drill Down:
Imagine zooming in on a specific detail within your data. This is what drilldown
does.
You start with a high-level overview of the data (e.g., total sales) and then
progressively move down to more granular levels within a specific dimension.
For example, you might start by looking at total sales for the year, then drill
down to see sales by quarter, then by month, and finally by product category
within a specific month.

2. Roll Up:
This is the opposite of drill-down, where you move from a detailed view to a
more summarized one.
You start with a specific detail (e.g., sales for a particular product in a specific
month) and gradually move up to higher levels of aggregation within a
dimension.
For instance, you could roll up sales figures from individual months to see
quarterly sales, then annual sales, providing a broader perspective.

3. Slice:
Imagine cutting a specific layer out of your data cube. This is what slicing does.
You select a subset of data based on one or more dimensions, focusing on a
particular aspect of the overall data.
For example, you might slice your data cube to see sales only for a specific
product category or only for a specific region.

4. Dice:
Think of dicing as creating a smaller cube from your main cube. You select
specific combinations of values from two or more dimensions, creating a sub-
cube that focuses on a particular combination of factors. For instance, you
could dice your data cube to see sales only for a specific product category
within a specific region and time period.

5. Pivot (Rotate):
This operation involves rearranging the data within your view to gain a
different perspective.
You essentially rotate the way the data is presented, often swapping
dimensions to see trends from a new angle.
For example, you could pivot a table showing sales by product category to see
sales by customer segment instead, revealing different insights.
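
The five operations can be imitated on a small pandas DataFrame; this is only a toy illustration with made-up data, not an OLAP server.

import pandas as pd

sales = pd.DataFrame({
    "year": [2023, 2023, 2023, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q1"],
    "region": ["North", "South", "North", "North"],
    "product": ["Laptop", "Laptop", "Phone", "Phone"],
    "amount": [100, 150, 200, 250],
})

# Roll up: summarize from quarter level to year level.
by_year = sales.groupby("year")["amount"].sum()

# Drill down: break the yearly totals back into quarters.
by_quarter = sales.groupby(["year", "quarter"])["amount"].sum()

# Slice: fix one dimension (region = North).
north_only = sales[sales["region"] == "North"]

# Dice: pick value combinations on two or more dimensions.
sub_cube = sales[(sales["region"] == "North") & (sales["product"] == "Phone")]

# Pivot (rotate): swap which dimensions appear on rows and columns.
pivoted = sales.pivot_table(values="amount", index="region",
                            columns="product", aggfunc="sum")

print(by_year, by_quarter, north_only, sub_cube, pivoted, sep="\n\n")
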
EXECUTIVE INFORMATION SYSTEMS (EIS)
Executive Information Systems (EIS), also known as Executive Support Systems
(ESS), are specialized tools designed to assist senior executives in making
informed decisions. They act as a type of management support system,
providing easy access to critical data and insights that are crucial for strategic
planning and goal achievement.
Benefits of Using EIS:
Improved Decision-Making: Data-driven insights lead to more informed and
strategic decisions.
Enhanced Visibility: Executives gain a clear understanding of overall
organizational performance.
Increased Efficiency: EIS saves time by providing quick access to relevant
information.
Improved Communication: EIS facilitates better communication and
collaboration between executives.

DATA WAREHOUSE AND BUSINESS STRATEGY


A successful data warehouse strategy should be aligned with the overall
business strategy. This means identifying the specific data needed to support
strategic goals and ensuring the data warehouse is designed to capture and
analyze that data effectively.
Key performance indicators (KPIs) used to measure strategic goals should be
readily accessible and analyzed through the data warehouse.
Data analysis should be an ongoing process, continuously informing and
refining business strategies based on emerging insights.
Resource Optimization: Data insights can help businesses optimize resource
allocation, streamline processes, and improve overall operational efficiency.
UNIT 3
TECHNOLOGIES USED IN DATA MINING
Data mining leverages a variety of powerful technologies to uncover hidden patterns and
extract valuable insights from data. Here's a glimpse into some of the key tools used:

Statistical Techniques:
Descriptive Statistics: Summarize and describe the data using measures like mean, median,
standard deviation.
Hypothesis Testing: Test hypotheses about the data to draw statistically significant
conclusions.
Correlation Analysis: Measure the strength and direction of relationships between
variables.
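
A small pandas/SciPy sketch of these three ideas, with invented numbers used only for illustration:

import pandas as pd
from scipy import stats

data = pd.DataFrame({
    "age":   [25, 32, 47, 51, 38, 29],
    "spend": [200, 350, 500, 620, 410, 260],
})

# Descriptive statistics: mean, standard deviation, quartiles, etc.
print(data.describe())

# Correlation analysis: strength and direction of the age-spend relationship.
r, p_value = stats.pearsonr(data["age"], data["spend"])
print("Pearson r:", r)

# Hypothesis testing: compare the spend of two (hypothetical) customer groups.
group_a = [200, 350, 260]
group_b = [500, 620, 410]
t_stat, p = stats.ttest_ind(group_a, group_b)
print("t-test p-value:", p)
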

Machine Learning Algorithms:


Classification Algorithms: Categorize data points into predefined classes (e.g., decision
trees, support vector machines).
Clustering Algorithms: Group data points with similar characteristics together (e.g., K-
means, hierarchical clustering).
Association Rule Learning: Discover frequent patterns and relationships within the data
(e.g., Apriori algorithm).
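
A compact scikit-learn sketch of one classification and one clustering algorithm on toy data (association rule mining such as Apriori usually needs a separate library, so it is omitted here); the feature values and labels are invented.

from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Classification: predict a class label from two numeric features.
X = [[25, 200], [32, 350], [47, 500], [51, 620]]   # e.g., [age, spend]
y = ["low", "low", "high", "high"]                 # predefined classes
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[40, 450]]))

# Clustering: group similar records without predefined labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
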

Data Mining Software:


Specialized software tools provide user-friendly interfaces and functionalities for
data preparation, model building, and visualization of results.
Popular examples include RapidMiner, SAS Enterprise Miner, KNIME.

Database Management Systems (DBMS): Provide the foundation for storing, managing, and
retrieving data efficiently for analysis.
Modern DBMS often have built-in data mining functionalities.

Data Warehouses:
Centralized repositories of historical data provide a rich source of information for data
mining tasks.
Data warehouses are often optimized for large-scale data analysis.

Cloud Computing:
Cloud platforms offer scalable and cost-effective solutions for data storage, processing, and
data mining tasks.
Cloud-based data mining tools are becoming increasingly popular.
Major Issues in Data Mining
Data Quality
The quality of data used in data mining is one of the most significant challenges. The
accuracy, completeness, and consistency of the data affect the accuracy of the results
obtained. The data may contain errors, omissions, duplications, or inconsistencies, which
may lead to inaccurate results. Moreover, the data may be incomplete, meaning that
some attributes or values are missing, making it challenging to obtain a complete
understanding of the data.

Data Complexity
Data complexity refers to the vast amounts of data generated by various sources, such as
sensors, social media, and the internet of things (IoT). The complexity of the data may
make it challenging to process, analyze, and understand. In addition, the data may be in
different formats, making it challenging to integrate into a single dataset.

Data Privacy and Security


Data privacy and security is another significant challenge in data mining. As more data is
collected, stored, and analyzed, the risk of data breaches and cyber-attacks increases. The
data may contain personal, sensitive, or confidential information that must be protected.
Moreover, data privacy regulations such as GDPR, CCPA, and HIPAA impose strict rules on
how data can be collected, used, and shared.

Scalability
Data mining algorithms must be scalable to handle large datasets efficiently. As the size of
the dataset increases, the time and computational resources required to perform data
mining operations also increase.

Interpretability
Data mining algorithms can produce complex models that are difficult to interpret. This is
because the algorithms use a combination of statistical and mathematical techniques to
identify patterns and relationships in the data.

Ethics
Data mining raises ethical concerns related to the collection, use, and dissemination of
data. The data may be used to discriminate against certain groups, violate privacy rights,
or perpetuate existing biases.
DATA PRE-PROCESSING OVERVIEW
It's the process of cleaning, transforming, and organizing raw data into a format suitable for
analysis. Raw data often contains inconsistencies, errors, and missing values, making it
unusable for analysis directly. Data pre-processing ensures the data is accurate, consistent,
and ready for the chosen analysis techniques.
Data Cleaning:
Identifying and correcting errors, inconsistencies, and missing values within the data.
This may involve:
Removing duplicate records.
Correcting typos and formatting inconsistencies.
Handling missing data (e.g., imputing values, removing rows/columns).
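
For example, a few of these cleaning steps might look like this in pandas; the column names and values are invented.

import pandas as pd

df = pd.DataFrame({
    "customer": ["Asha", "Asha", "Ravi", None],
    "city":     ["delhi", "delhi", "Mumbai", "Pune"],
    "spend":    [200.0, 200.0, None, 150.0],
})

df = df.drop_duplicates()                              # remove duplicate records
df["city"] = df["city"].str.title()                    # fix formatting inconsistencies
df["spend"] = df["spend"].fillna(df["spend"].mean())   # impute a missing value
df = df.dropna(subset=["customer"])                    # drop rows missing a key field
print(df)
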

Data Integration:
Combining data from multiple sources into a unified format within the data
warehouse.
This may involve:
Identifying and resolving conflicts between different data sources.
Standardizing data formats and units.

Data Transformation:
Converting data into a format suitable for analysis, such as:
-Scaling numerical data (normalization, standardization).
-Encoding categorical variables (e.g., one-hot encoding).
-Feature engineering (creating new features from existing ones).
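
A brief sketch of scaling and one-hot encoding with scikit-learn, using toy data for illustration only:

import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

ages = np.array([[25.0], [32.0], [47.0], [51.0]])
scaled_ages = StandardScaler().fit_transform(ages)          # zero mean, unit variance

cities = np.array([["Delhi"], ["Mumbai"], ["Delhi"]])
encoded = OneHotEncoder().fit_transform(cities).toarray()   # one column per category

print(scaled_ages)
print(encoded)
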

Data Reduction:
Selecting relevant data and removing redundant or irrelevant information.
This may involve:
Feature selection (choosing the most informative features).
Dimensionality reduction techniques (e.g., principal component
analysis).
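
For instance, principal component analysis can compress several partly redundant features into fewer components; a scikit-learn sketch on invented numbers:

import numpy as np
from sklearn.decomposition import PCA

# Four records described by three (partly redundant) features.
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.1, 6.2],
    [3.0, 6.0, 9.1],
    [4.0, 8.2, 12.0],
])

# Keep only the two most informative directions in the data.
reduced = PCA(n_components=2).fit_transform(X)
print(reduced.shape)   # (4, 2)
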

Data Transformation:
The process of converting raw data into a format suitable for analysis. Involves cleaning,
structuring, and manipulating data to:
Improve data quality (removing errors, inconsistencies, missing values).
Facilitate data integration (combining data from multiple sources).
Enhance analysis (normalization, scaling, feature engineering).
Prepare for visualization (categorizing data).

Data Discretization:
A specific data transformation technique that focuses on continuous numerical data.
Converts continuous data into a smaller number of discrete categories or intervals
(bins).
Benefits:
Simplifies analysis (easier to understand and analyze complex data).
Improves algorithm performance (some algorithms work better with discrete data).
Reduces storage requirements (fewer categories take less space).
Prepares for visualization (easier to visualize categorical data).
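
A short example of binning continuous values with pandas; the ages, bin edges, and labels are illustrative.

import pandas as pd

ages = pd.Series([5, 17, 25, 41, 67, 80])

# Convert the continuous ages into three discrete categories (bins).
age_groups = pd.cut(ages, bins=[0, 18, 60, 120],
                    labels=["child", "adult", "senior"])
print(age_groups)
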
