All Unit
Database System vs Data Warehouse
o In a database system, data is balanced within the scope of that one system; in a data warehouse, data must be integrated and balanced from multiple systems.
o A database system is ER based; a data warehouse uses a Star/Snowflake schema.
Data Warehouse vs Data Mart
o In a Data Warehouse, data is contained in detail form; in a Data Mart, data is contained in summarized form.
o A Data Warehouse uses a lot of data and has comprehensive operational data; operational data are not present in a Data Mart.
Bottom tier: The bottom tier consists of a data warehouse server, usually a
relational database system, which collects, cleanses, and transforms data from
multiple data sources through a process known as Extract, Transform, and
Load (ETL) or Extract, Load, and Transform (ELT). For most organizations that
use ETL, the process relies on automation and is efficient, well-defined,
continuous, and batch-driven.
Middle tier: The middle tier consists of an OLAP server, which provides fast
query performance over the warehouse data; it is typically implemented using
the MOLAP, ROLAP, or HOLAP model described later in these notes.
Top tier: The top tier is represented by some kind of front-end user interface
or reporting tool, which enables end users to conduct ad-hoc data analysis on
their business data.
What is ETL
Extraction, transformation, and loading help an organization make its data
accessible, meaningful, and usable across different data systems. An ETL tool is
software used to extract, transform, and load the data.
An ETL tool is essentially a set of libraries, written in some programming
language, that simplifies data integration and transformation for any need. For
example, every time we browse the web on a mobile phone, some amount of
data is generated, and a commercial plane can produce up to 500 GB of data
per hour. Data at this scale is known as Big Data, but it is of little use until we
perform the ETL operations on it.
Extraction
Extraction is the first phase, in which data is read from the operational source
systems so that it can be cleansed, transformed, and loaded into the warehouse.
Cleansing
The cleansing stage is crucial in a data warehouse pipeline because it is meant to
improve data quality. The primary data cleansing features found in ETL tools are
rectification and homogenization.
Transformation
Transformation is the core of the reconciliation phase. It converts records from their
operational source format into a particular data warehouse format. If we implement a
three-layer architecture, this phase outputs our reconciled data layer.
Loading
The load is the process of writing the data into the target database. During the load
step, it is necessary to ensure that the load is performed correctly and with as few
resources as possible.
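As a rough illustration of these steps, the Python sketch below extracts rows from a hypothetical sales.csv file, transforms them, and loads them into a SQLite table named sales_fact. The file, table, and column names are illustrative assumptions, not part of any specific ETL tool.

# A minimal ETL sketch in Python (file/table/column names are hypothetical).
import csv
import sqlite3

def extract(path):
    # Extract: read raw records from an operational source (here, a CSV file).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: cleanse records and convert them into the warehouse format.
    cleaned = []
    for row in rows:
        if not row.get("amount"):          # drop incomplete records
            continue
        cleaned.append((row["order_id"], row["product"], float(row["amount"])))
    return cleaned

def load(records, db_path="warehouse.db"):
    # Load: write the transformed records into the target database.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales_fact "
                "(order_id TEXT, product TEXT, amount REAL)")
    con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", records)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("sales.csv")))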
UNIT 2
Dimensional Modelling
Dimensional modeling is a data modeling technique used in data warehouses
to organize and categorize data into dimensions and facts. It's part of the
Business Dimensional Lifecycle methodology, which was developed by Ralph
Kimball.
This method involves organizing data into dimensions and facts, where
dimensions are used to describe the data, and facts are used to quantify the
data.
For example, a sale transaction can be broken down into facts, such as the number
of products ordered and the price paid for the products, and into dimensions,
such as the order date, user name, product number, order ship-to and bill-to
locations, and the salesman responsible for receiving the order.
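A rough sketch of this split in Python (all keys and values below are hypothetical) might look like:

# Hypothetical representation of one sale split into a fact and its dimensions.
date_dim     = {"date_key": 20240115, "order_date": "2024-01-15", "quarter": "Q1"}
product_dim  = {"product_key": 42, "product_number": "P-1001", "category": "Laptops"}
customer_dim = {"customer_key": 7, "user_name": "alice", "bill_to": "Pune", "ship_to": "Mumbai"}

# The fact row holds the measurable values plus foreign keys to the dimensions.
sales_fact = {
    "date_key": 20240115,
    "product_key": 42,
    "customer_key": 7,
    "quantity_ordered": 3,      # fact: number of products ordered
    "price_paid": 149_997.0,    # fact: price paid
}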
STAR SCHEMA
A star schema is the elementary form of a dimensional model, in which data are
organized into facts and dimensions. A fact is an event that is counted or measured,
such as a sale or log in. A dimension includes reference data about the fact, such as
date, item, or customer.
Characteristics
o It creates a de-normalized database that can quickly provide query responses.
o It provides a flexible design that can be changed easily or added to throughout
the development cycle, and as the database grows.
o Its design parallels how end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.
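As a minimal sketch of how a star schema is queried, assuming pandas and small, hypothetical fact and dimension tables:

# Star-schema query sketch using pandas (table/column names are hypothetical).
import pandas as pd

# Dimension tables: descriptive reference data.
date_dim = pd.DataFrame({"date_key": [1, 2], "month": ["Jan", "Feb"], "year": [2024, 2024]})
product_dim = pd.DataFrame({"product_key": [10, 20], "category": ["Laptop", "Phone"]})

# Fact table: measured events, keyed to the dimensions.
sales_fact = pd.DataFrame({
    "date_key": [1, 1, 2],
    "product_key": [10, 20, 10],
    "units_sold": [5, 3, 7],
    "revenue": [5000.0, 1800.0, 7000.0],
})

# A typical query: join the fact table to its dimensions, then aggregate.
report = (sales_fact
          .merge(date_dim, on="date_key")
          .merge(product_dim, on="product_key")
          .groupby(["month", "category"])["revenue"].sum())
print(report)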
SNOWFLAKE SCHEMA
A snowflake schema is an expansion of the star schema where each point of the
star explodes into more points. It is called a snowflake schema because its
diagram resembles a snowflake.
Snowflaking is a method of normalizing the dimension tables in a star
schema. When we normalize all the dimension tables entirely, the resultant
structure resembles a snowflake with the fact table in the middle.
Snowflaking is used to improve the performance of specific queries. The
schema is diagrammed with each fact surrounded by its associated dimensions,
and those dimensions are related to other dimensions, branching out into a
snowflake pattern.
The snowflake schema consists of one fact table which is linked to many
dimension tables, which can be linked to other dimension tables through a
many-to-one relationship.
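A small sketch of snowflaking, using pandas and a hypothetical product dimension that is normalized into separate product and category tables:

# Sketch: "snowflaking" a product dimension (all names are hypothetical).
import pandas as pd

# De-normalized star-schema dimension: category attributes repeated per product.
product_dim = pd.DataFrame({
    "product_key": [10, 20, 30],
    "product_name": ["Laptop A", "Phone B", "Laptop C"],
    "category_name": ["Laptop", "Phone", "Laptop"],
    "category_dept": ["Electronics", "Electronics", "Electronics"],
})

# Snowflaked form: category attributes move to their own table, linked by a
# many-to-one foreign key from product to category.
category_dim = (product_dim[["category_name", "category_dept"]]
                .drop_duplicates()
                .reset_index(drop=True))
category_dim["category_key"] = category_dim.index + 1

product_dim_sf = (product_dim
                  .merge(category_dim, on=["category_name", "category_dept"])
                  [["product_key", "product_name", "category_key"]])
print(category_dim)
print(product_dim_sf)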
FACT CONSTELLATION SCHEMA
Fact constellation means two or more fact tables sharing one or more
dimensions. It is also called a Galaxy schema.
A fact constellation schema describes a logical structure of a data warehouse or
data mart. It can be designed with a collection of de-normalized fact tables and
shared, conformed dimension tables.
A fact constellation schema is a sophisticated database design in which it is
difficult to summarize information. It can be implemented by combining
aggregate fact tables or by decomposing a complex fact table into independent,
simpler fact tables.
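A minimal sketch of a fact constellation, assuming two hypothetical fact tables that share (conform to) the same date dimension:

# Fact constellation sketch: two fact tables sharing one dimension (names are hypothetical).
import pandas as pd

date_dim = pd.DataFrame({"date_key": [1, 2], "month": ["Jan", "Feb"]})

# Two separate business processes, each with its own fact table...
sales_fact    = pd.DataFrame({"date_key": [1, 2], "revenue": [5000.0, 7200.0]})
shipping_fact = pd.DataFrame({"date_key": [1, 2], "units_shipped": [40, 55]})

# ...but both can be analysed against the shared, conformed dimension.
sales_by_month    = sales_fact.merge(date_dim, on="date_key")
shipping_by_month = shipping_fact.merge(date_dim, on="date_key")
print(sales_by_month)
print(shipping_by_month)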
OLAP, or Online Analytical Processing, plays a crucial role within data warehouses. It's a set
of software tools and technologies specifically designed for analyzing multidimensional data
stored in data warehouses.
Benefits of OLAP:
Fast Analysis: Pre-calculated data and optimized structures enable quick response times for
complex queries.
Multidimensional View: Allows users to analyze data from various perspectives, leading to
deeper understanding.
Flexibility: Supports diverse analytical tasks, from simple aggregations to complex
calculations.
User-Friendly: OLAP tools provide intuitive interfaces for data exploration and visualization.
Types of OLAP Servers:
MOLAP (Multidimensional OLAP): Stores data in multidimensional arrays for fast retrieval.
ROLAP (Relational OLAP): Stores data in relational databases but uses OLAP functionalities
for analysis.
HOLAP (Hybrid OLAP): Combines MOLAP and ROLAP approaches for flexibility and
performance.
OLAP vs OLTP
o Definition: OLAP is well-known as an online database query management system, while OLTP is well-known as an online database modifying system.
o Application: OLAP is subject-oriented and is used for data mining, analytics, decision making, etc., while OLTP is application-oriented and is used for business tasks.
o Task: OLAP provides a multi-dimensional view of different business tasks, while OLTP reveals a snapshot of present business tasks.
OLAP Operations
1. Drill Down:
This operation moves from a summarized view of the data to a more detailed
one, stepping down a dimension hierarchy (for example, from yearly sales to
quarterly, monthly, or daily figures).
2. Roll Up:
This is the opposite of drill-down, where you move from a detailed view to a
more summarized one.
You start with a specific detail (e.g., sales for a particular product in a specific
month) and gradually move up to higher levels of aggregation within a
dimension.
For instance, you could roll up sales figures from individual months to see
quarterly sales, then annual sales, providing a broader perspective.
3. Slice:
Imagine cutting a specific layer out of your data cube. This is what slicing does.
You select a subset of data based on one or more dimensions, focusing on a
particular aspect of the overall data.
For example, you might slice your data cube to see sales only for a specific
product category or only for a specific region.
4. Dice:
Think of dicing as creating a smaller cube from your main cube. You select
specific combinations of values from two or more dimensions, creating a sub-
cube that focuses on a particular combination of factors. For instance, you
could dice your data cube to see sales only for a specific product category
within a specific region and time period.
5. Pivot (Rotate):
This operation involves rearranging the data within your view to gain a
different perspective.
You essentially rotate the way the data is presented, often swapping
dimensions to see trends from a new angle.
For example, you could pivot a table showing sales by product category to see
sales by customer segment instead, revealing different insights.
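The following pandas sketch illustrates these operations on a small, made-up sales cube; all column names and figures are hypothetical.

# Sketch of OLAP-style operations on a small cube using pandas.
import pandas as pd

cube = pd.DataFrame({
    "year":     [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter":  ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "region":   ["East", "West", "East", "West", "East", "West"],
    "category": ["Laptop", "Phone", "Laptop", "Laptop", "Phone", "Phone"],
    "sales":    [100, 80, 120, 90, 70, 110],
})

# Roll up: aggregate from quarterly detail up to yearly totals.
roll_up = cube.groupby("year")["sales"].sum()

# Slice: fix a single value on one dimension (only the "Laptop" category).
slice_ = cube[cube["category"] == "Laptop"]

# Dice: fix a combination of values on two or more dimensions.
dice = cube[(cube["region"] == "East") & (cube["year"] == 2023)]

# Pivot: rotate the presentation, e.g. regions as rows vs. categories as columns.
pivot = cube.pivot_table(values="sales", index="region", columns="category", aggfunc="sum")

print(roll_up, slice_, dice, pivot, sep="\n\n")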
EXECUTIVE INFORMATION SYSTEMS (EIS)
Executive Information Systems (EIS), also known as Executive Support Systems
(ESS), are specialized tools designed to assist senior executives in making
informed decisions. They act as a type of management support system,
providing easy access to critical data and insights that are crucial for strategic
planning and goal achievement.
Benefits of Using EIS:
Improved Decision-Making: Data-driven insights lead to more informed and
strategic decisions.
Enhanced Visibility: Executives gain a clear understanding of overall
organizational performance.
Increased Efficiency: EIS saves time by providing quick access to relevant
information.
Improved Communication: EIS facilitates better communication and
collaboration between executives.
Statistical Techniques:
Descriptive Statistics: Summarize and describe the data using measures such as the mean,
median, and standard deviation.
Hypothesis Testing: Test hypotheses about the data to draw statistically significant
conclusions.
Correlation Analysis: Measure the strength and direction of relationships between
variables.
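A short sketch of these techniques using numpy and scipy; the sample data below is made up purely for illustration.

# Descriptive statistics, a hypothesis test, and correlation on synthetic data.
import numpy as np
from scipy import stats

x = np.array([12.1, 13.4, 11.8, 14.2, 12.9, 13.7, 12.5])
y = np.array([24.0, 26.9, 23.1, 28.5, 25.4, 27.8, 24.9])

# Descriptive statistics: summarize the data.
print("mean:", x.mean(), "median:", np.median(x), "std dev:", x.std(ddof=1))

# Hypothesis testing: a one-sample t-test against a hypothesized mean of 12.
t_stat, p_value = stats.ttest_1samp(x, popmean=12.0)
print("t:", t_stat, "p:", p_value)

# Correlation analysis: strength and direction of the relationship between x and y.
r, p = stats.pearsonr(x, y)
print("Pearson r:", r, "p:", p)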
Database Management Systems (DBMS): Provide the foundation for storing, managing, and
retrieving data efficiently for analysis.
Modern DBMS often have built-in data mining functionalities.
Data Warehouses:
Centralized repositories of historical data provide a rich source of information for data
mining tasks.
Data warehouses are often optimized for large-scale data analysis.
Cloud Computing:
Cloud platforms offer scalable and cost-effective solutions for data storage, processing, and
data mining tasks.
Cloud-based data mining tools are becoming increasingly popular.
Major issues in data mining
Data Quality
The quality of data used in data mining is one of the most significant challenges. The
accuracy, completeness, and consistency of the data affect the accuracy of the results
obtained. The data may contain errors, omissions, duplications, or inconsistencies, which
may lead to inaccurate results. Moreover, the data may be incomplete, meaning that
some attributes or values are missing, making it challenging to obtain a complete
understanding of the data.
Data Complexity
Data complexity refers to the vast amounts of data generated by various sources, such as
sensors, social media, and the internet of things (IoT). The complexity of the data may
make it challenging to process, analyze, and understand. In addition, the data may be in
different formats, making it challenging to integrate into a single dataset.
Scalability
Data mining algorithms must be scalable to handle large datasets efficiently. As the size of
the dataset increases, the time and computational resources required to perform data
mining operations also increase.
Interpretability
Data mining algorithms can produce complex models that are difficult to interpret. This is
because the algorithms use a combination of statistical and mathematical techniques to
identify patterns and relationships in the data.
Ethics
Data mining raises ethical concerns related to the collection, use, and dissemination of
data. The data may be used to discriminate against certain groups, violate privacy rights,
or perpetuate existing biases.
DATA PRE-PROCESSING OVERVIEW
It's the process of cleaning, transforming, and organizing raw data into a format suitable for
analysis. Raw data often contains inconsistencies, errors, and missing values, making it
unusable for analysis directly. Data pre-processing ensures the data is accurate, consistent,
and ready for the chosen analysis techniques.
Data Cleaning:
Identifying and correcting errors, inconsistencies, and missing values within the data.
This may involve:
Removing duplicate records.
Correcting typos and formatting inconsistencies.
Handling missing data (e.g., imputing values, removing rows/columns).
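A minimal cleaning sketch with pandas, applied to a small made-up table (the column names are hypothetical):

# Basic data cleaning steps with pandas.
import pandas as pd

df = pd.DataFrame({
    "customer": ["Alice", "alice ", "Bob", "Carol", "Carol"],
    "age":      [34, 34, None, 29, 29],
    "city":     ["Pune", "Pune", "Delhi", None, None],
})

# Correct formatting inconsistencies (case, stray whitespace).
df["customer"] = df["customer"].str.strip().str.title()

# Remove duplicate records.
df = df.drop_duplicates()

# Handle missing data: impute a numeric column, drop rows still missing values.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["city"])
print(df)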
Data Integration:
Combining data from multiple sources into a unified format within the data
warehouse.
This may involve:
Identifying and resolving conflicts between different data sources.
Standardizing data formats and units.
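A small integration sketch with pandas, assuming two hypothetical sources (a CRM export and an ERP export) that use different column names and currency units:

# Integrating two hypothetical sources into one unified view.
import pandas as pd

crm = pd.DataFrame({"cust_id": [1, 2], "name": ["Alice", "Bob"], "revenue_usd": [1200.0, 800.0]})
erp = pd.DataFrame({"customer_id": [1, 2], "revenue_inr": [250000.0, 99000.0]})

# Standardize column names and units (assuming a fixed rate of 83 INR per USD).
erp = erp.rename(columns={"customer_id": "cust_id"})
erp["revenue_usd"] = erp["revenue_inr"] / 83.0

# Combine the standardized sources; conflicting columns get source suffixes.
combined = crm.merge(erp[["cust_id", "revenue_usd"]], on="cust_id", suffixes=("_crm", "_erp"))
print(combined)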
Data Transformation:
Converting data into a format suitable for analysis, such as:
- Scaling numerical data (normalization, standardization).
- Encoding categorical variables (e.g., one-hot encoding).
- Feature engineering (creating new features from existing ones).
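A brief sketch of these transformations with pandas and scikit-learn; the feature names and the feature-engineering rule are hypothetical:

# Scaling, one-hot encoding, and simple feature engineering.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "income": [30000, 52000, 41000, 78000],
    "city":   ["Pune", "Delhi", "Pune", "Mumbai"],
})

# Scaling numerical data (standardization: zero mean, unit variance).
df["income_scaled"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Encoding a categorical variable with one-hot encoding.
df = pd.get_dummies(df, columns=["city"])

# Feature engineering: derive a new feature from an existing one (illustrative rule).
df["high_income"] = (df["income"] > 50000).astype(int)
print(df)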
Data Reduction:
Selecting relevant data and removing redundant or irrelevant information.
This may involve:
Feature selection (choosing the most informative features).
Dimensionality reduction techniques (e.g., principal component
analysis).
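A short sketch of dimensionality reduction with principal component analysis (PCA), run on synthetic data:

# Reduce 10 features to the components explaining ~90% of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 samples, 10 features (synthetic)

pca = PCA(n_components=0.90)            # keep ~90% of the variance
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)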
Data Transformation:
The process of converting raw data into a format suitable for analysis. Involves cleaning,
structuring, and manipulating data to:
Improve data quality (removing errors, inconsistencies, missing values).
Facilitate data integration (combining data from multiple sources).
Enhance analysis (normalization, scaling, feature engineering).
Prepare for visualization (categorizing data).
Data Discretization:
A specific data transformation technique that focuses on continuous numerical data.
Converts continuous data into a smaller number of discrete categories or intervals
(bins).
Benefits:
Simplifies analysis (easier to understand and analyze complex data).
Improves algorithm performance (some algorithms work better with discrete data).
Reduces storage requirements (fewer categories take less space).
Prepares for visualization (easier to visualize categorical data).
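A small discretization sketch with pandas, binning a made-up age column into equal-width and labeled intervals:

# Converting continuous values into discrete bins.
import pandas as pd

ages = pd.Series([3, 17, 25, 34, 48, 62, 71, 15, 29, 55])

# Equal-width bins.
equal_width = pd.cut(ages, bins=4)

# Custom, named intervals (bin edges here are illustrative).
labeled = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                 labels=["child", "young adult", "adult", "senior"])

print(equal_width.value_counts().sort_index())
print(labeled.value_counts())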