Data Warehousing and Data Mining Original Notes

Data warehousing organizes and compiles data into a single database, while data mining extracts valuable information from these databases. Data warehouses provide benefits such as improved business analytics, faster queries, and historical insights, and are utilized across various sectors including airlines, banking, and telecommunications. The document also discusses different types of data warehouses, architectures, and the implementation process, highlighting the distinction between databases and data warehouses.


DATA WAREHOUSING AND DATA MINING

Sachin Raj Saxena (Assistant Professor in C.S Dept.)


Unit - 1
Data warehousing and Data mining
Data warehousing is a method of organizing and compiling data into one
database, whereas data mining deals with fetching important data from
databases. A data warehouse is kept separate from the operational DBMS; it
stores a huge amount of data, typically collected from multiple heterogeneous
sources such as files, DBMSs, etc.

Benefits of Data Warehouse:


1. Better business analytics: A data warehouse plays an important role in
every business by storing and enabling analysis of all the past data and
records of the company.
2. Faster queries: A data warehouse is designed to handle large queries, so it
typically runs queries faster than an operational database.
3. Historical insight: The data warehouse stores all your historical data,
which contains details about the business, so that it can be analyzed at any
time.

What Is a Data Warehouse Used For?


Here are the most common sectors where a data warehouse is used:

 Airline: In the airline industry, it is used for operational purposes such as
crew assignment and analysis of route profitability.
 Banking: It is widely used in the banking sector to manage resources.
Some banks also use it for market research and for performance analysis of
products and operations.
 Public sector: In the public sector, a data warehouse is used for intelligence
gathering. It helps government agencies maintain and analyze tax
records and health policy records for every individual.
 Telecommunication: A data warehouse is used in this sector for product
promotions.

Benefits of data mining


 In healthcare, data mining plays a huge role in developing a patient
profile, compiling information from their medical history, past treatment
options, medications, and more. This profile gives doctors the ability to
develop more accurate diagnoses and treatment options based on past
results.

 The benefits of data mining in marketing are huge. Effective marketing


campaigns understand their target audience, including their needs and
spending habits. Data mining can provide valuable information on age,
gender, interests, location, and income, which all influence a person’s
interest in a product or service.

 It helps companies gather reliable information.


 Fraud and malware are among the most dangerous threats on the internet,
and they are increasing day by day; credit card services and
telecommunications are among the sectors most affected. With the help of
data mining techniques, professionals can gather fraud-related data such as
caller ID, location, duration of the call, and exact date and time, which can
help find the person or group responsible for the fraud.

Types of Data Warehouse


Three main types of Data Warehouses (DWH) are:

1. Enterprise Data Warehouse (EDW): Enterprise Data Warehouse


(EDW) is a centralized warehouse. It provides decision support service
across the business. It offers a unified approach for organizing and
representing data.
2. Operational Data Store (ODS): An ODS is refreshed in real time.
Hence, it is widely preferred for routine activities such as storing employee
records.
3. Data Mart: A data mart is a subset of the data warehouse. It is specially
designed for a particular line of business, such as sales or finance.

Data Warehouse Architecture

A data warehouse architecture is a method of defining the overall architecture of
data communication, processing, and presentation that exists for end-client
computing within the enterprise. Each data warehouse is different, but all are
characterized by standard vital components.

Production applications such as payroll, accounts payable, product purchasing,
and inventory control are designed for online transaction processing (OLTP).
Such applications gather detailed data from day-to-day operations.

Data Warehouse applications are designed to support the user ad-hoc data
requirements, an activity recently dubbed online analytical processing (OLAP).
These include applications such as forecasting, profiling, summary reporting,
and trend analysis.

Production databases are updated continuously, either by hand or via OLTP
applications. In contrast, a warehouse database is updated from operational
systems periodically, usually during off-hours. As OLTP data accumulates in
production databases, it is regularly extracted, filtered, and then loaded into a
dedicated warehouse server that is accessible to users. As the warehouse is
populated, it must be restructured: tables de-normalized, data cleansed of errors
and redundancies, and new fields and keys added to reflect the users' needs
for sorting, combining, and summarizing data.

Data warehouses and their architectures vary depending upon the elements of an
organization's situation.

Three common architectures are:

o Data Warehouse Architecture: Basic


o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts
Data Warehouse Architecture: Basic

Operational System

In data warehousing, an operational system refers to a system that is used to
process the day-to-day transactions of an organization.

Flat Files

A Flat file system is a system of files in which transactional data is stored, and
every file in the system must have a different name.

Meta Data

A set of data that defines and gives information about other data.

Metadata is used in a data warehouse for a variety of purposes, including:

Metadata summarizes necessary information about data, which can make
finding and working with particular instances of data easier. For
example, author, date built, date changed, and file size are examples of
very basic document metadata.

Metadata is used to direct a query to the most appropriate data source.
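As a toy illustration of this routing idea, a metadata catalog can be modeled as a simple lookup table. The table names, fields, and the `route_query` helper below are invented for the sketch, not part of any real warehouse product.

```python
# A minimal sketch (assumed names) of how a metadata catalog can direct
# a query to the most appropriate data source.
metadata_catalog = {
    # table name -> descriptive metadata about where and how it is stored
    "daily_sales_summary": {"source": "summary_area", "grain": "day"},
    "sales_transactions":  {"source": "warehouse",    "grain": "transaction"},
}

def route_query(table: str) -> str:
    """Return the data source that should serve a query on `table`."""
    entry = metadata_catalog.get(table)
    if entry is None:
        raise KeyError(f"no metadata for table {table!r}")
    return entry["source"]

# A query over daily totals is routed to the summarized area, not the
# detailed warehouse tables.
print(route_query("daily_sales_summary"))  # summary_area
```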


Lightly and highly summarized data

This area of the data warehouse stores all the predefined lightly and highly
summarized (aggregated) data generated by the warehouse manager.

The goal of the summarized information is to speed up query performance.
The summarized records are updated continuously as new information is
loaded into the warehouse.

End-User access Tools

The principal purpose of a data warehouse is to provide information to
business managers for strategic decision-making. These users interact with
the warehouse using end-client access tools.

The examples of some of the end-user access tools can be:

o Reporting and Query Tools


o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools

Data Warehouse Architecture: With Staging Area

We must clean and process operational data before putting it into the
warehouse.

We can do this programmatically, although most data warehouses use a staging
area (a place where data is processed before entering the warehouse).

A staging area simplifies data cleansing and consolidation of operational
data coming from multiple source systems, especially for enterprise data
warehouses, where all relevant data of an enterprise is consolidated.
The staging area is a temporary location where records from source systems
are copied.
Data Warehouse Architecture: With Staging Area and Data Marts

We may want to customize our warehouse's architecture for multiple groups


within our organization.

We can do this by adding data marts. A data mart is a segment of a data
warehouse that provides information for reporting and analysis on a
section, unit, department, or operation of the company, e.g., sales, payroll,
production, etc.

The figure illustrates an example where purchasing, sales, and stocks are
separated. In this example, a financial analyst wants to analyze historical data
for purchases and sales or mine historical information to make predictions about
customer behavior.

Properties of Data Warehouse Architectures

The following architecture properties are necessary for a data warehouse


system:
1. Separation: Analytical and transactional processing should be kept apart as
much as possible.

2. Scalability: Hardware and software architectures should be simple to
upgrade as the data volume that has to be managed and processed, and the
number of user requirements that have to be met, progressively increase.

3. Extensibility: The architecture should be able to host new applications and
technologies without redesigning the whole system.

4. Security: Monitoring accesses is necessary because of the strategic data
stored in the data warehouse.

5. Administerability: Data Warehouse management should not be complicated.


Types of Data Warehouse Architectures

Single-Tier Architecture

Single-tier architecture is rarely used in practice. Its purpose is to
minimize the amount of data stored; to reach this goal, it removes data
redundancies.

The figure shows that the only layer physically available is the source layer. In
this approach, data warehouses are virtual: the data warehouse is implemented
as a multidimensional view of operational data created by specific middleware,
i.e., an intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the requirement
for separation between analytical and transactional processing. Analysis queries
are issued against operational data after the middleware interprets them. In this
way, queries affect transactional workloads.

Two-Tier Architecture

The requirement for separation plays an essential role in defining the two-tier
architecture for a data warehouse system, as shown in fig:
Although it is typically called two-layer architecture to highlight the separation
between physically available sources and the data warehouse, it in fact consists
of four subsequent data flow stages:

1. Source layer: A data warehouse system uses heterogeneous sources of
data. That data is initially stored in corporate relational databases or
legacy databases, or it may come from an information system outside the
corporate walls.
2. Data staging: The data stored in the sources should be extracted, cleansed
to remove inconsistencies and fill gaps, and integrated to merge
heterogeneous sources into one standard schema. The so-called
Extraction, Transformation, and Loading (ETL) tools can combine
heterogeneous schemata and extract, transform, cleanse, validate,
filter, and load source data into the data warehouse.
3. Data warehouse layer: Information is stored in one logically centralized
repository: the data warehouse. The data warehouse can be accessed
directly, but it can also be used as a source for creating data
marts, which partially replicate data warehouse contents and are designed
for specific enterprise departments. Meta-data repositories store
information on sources, access procedures, data staging, users, data mart
schemas, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed
to issue reports, dynamically analyze information, and simulate
hypothetical business scenarios. It should feature aggregate information
navigators, complex query optimizers, and customer-friendly GUIs.
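The data staging step above can be sketched in a few lines. The source records, field names, and cleansing rules here are invented for illustration; a real ETL tool handles far more cases.

```python
# A hedged sketch of Extraction, Transformation, and Loading (ETL):
# two heterogeneous sources with different schemata are merged into
# one standard schema and loaded into the warehouse.
source_a = [{"cust": "Alice", "amt": "100"}, {"cust": "Bob", "amt": ""}]
source_b = [{"customer_name": "Carol", "amount": 250}]

def transform(record, mapping):
    """Rename fields to one standard schema, cleanse, and fill gaps."""
    out = {std: record.get(src) for std, src in mapping.items()}
    out["customer"] = str(out["customer"]).strip().lower()  # cleanse
    out["amount"] = float(out["amount"]) if out["amount"] not in ("", None) else 0.0
    return out

warehouse = []  # load: the logically centralized repository
warehouse += [transform(r, {"customer": "cust", "amount": "amt"}) for r in source_a]
warehouse += [transform(r, {"customer": "customer_name", "amount": "amount"}) for r in source_b]

print(warehouse[0])  # {'customer': 'alice', 'amount': 100.0}
```

Note how Bob's missing amount is filled with a default and Carol's differently named fields are mapped onto the standard schema, which is exactly the "cleanse, fill gaps, merge schemata" job of the staging layer.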

Three-Tier Architecture

The three-tier architecture consists of the source layer (containing multiple
source systems), the reconciled layer, and the data warehouse layer (containing
both data warehouses and data marts). The reconciled layer sits between the
source data and the data warehouse.

The main advantage of the reconciled layer is that it creates a standard
reference data model for the whole enterprise. At the same time, it separates the
problems of source data extraction and integration from those of data warehouse
population. In some cases, the reconciled layer is also used directly to better
accomplish some operational tasks, such as producing daily reports that
cannot be satisfactorily prepared using the corporate applications, or generating
data flows to feed external processes periodically so as to benefit from cleaning
and integration.

This architecture is especially useful for extensive, enterprise-wide systems.
A disadvantage of this structure is the extra storage space used by the
redundant reconciled layer. It also moves the analytical tools a little further
away from being real-time.
MultiDimensional Data Model
The multidimensional data model is a method for organizing data in the
database, with good arrangement and assembly of the database's contents.
OLAP (online analytical processing) and data warehousing use
multidimensional databases. The model is used to present multiple dimensions
of the data to users. A multidimensional database is created from multiple
relational databases. In relational databases, users access data in the form of
queries; in multidimensional databases, users can ask analytical questions
related to business or market trends.

Advantages of Multidimensional Databases


Some advantages of multidimensional databases are:

Increased performance: The performance of multidimensional databases is
much better than that of normal databases such as relational databases.

Easy maintenance: A multidimensional database is easy to handle and
maintain.

Better data presentation: The data in a multidimensional database is many-
sided and captures many different factors.



What is a Data Cube?
When data is grouped or combined into multidimensional matrices, the result
is called a data cube. The data cube method has a few alternative names or
variants, such as "multidimensional databases," "materialized views," and
"OLAP (On-Line Analytical Processing)."
In data warehousing, we generally deal with multidimensional data models, as
the data is represented by multiple dimensions and multiple attributes. This
multidimensional data is represented in the data cube. Below is the diagram of
a general data cube.

The example above is a 3D cube with attributes such as branch (A, B, C, D),
item type (home, entertainment, computer, phone, security), and year (1997,
1998, 1999).

Data cube classification:


The data cube can be classified into two categories:
 Multidimensional data cube: It helps in storing large amounts of data by
making use of a multi-dimensional array. It increases efficiency by
keeping an index of each dimension, and is thus able to retrieve data fast.
 Relational data cube: It helps in storing large amounts of data by
making use of relational tables. Each relational table displays the
dimensions of the data cube. It is slower than a multidimensional data
cube.
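The cube idea can be sketched with an in-memory mapping from dimension coordinates to a measure. The sales figures below are made up, and `roll_up` is a hypothetical helper name for the aggregation step.

```python
# A small illustrative data cube over the dimensions mentioned above
# (branch, item type, year); the unit counts are invented.
from collections import defaultdict

cube = {
    # (branch, item_type, year) -> units sold
    ("A", "home",  1997): 10, ("A", "phone", 1997): 5,
    ("B", "home",  1998): 7,  ("B", "phone", 1999): 3,
}

def roll_up(cube, dim_index):
    """Aggregate the cube along one dimension (0=branch, 1=item, 2=year)."""
    out = defaultdict(int)
    for key, units in cube.items():
        reduced = tuple(v for i, v in enumerate(key) if i != dim_index)
        out[reduced] += units
    return dict(out)

by_branch_item = roll_up(cube, 2)     # sum over all years
print(by_branch_item[("A", "home")])  # 10
```

Each call to `roll_up` collapses one dimension, which is the basic operation an OLAP engine performs when answering "total per branch" or "total per year" style questions over a cube.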

What is a Star Schema?


A star schema is a multi-dimensional data model used to organize data in a
database. In a star schema, the center of the star has one fact table and a
number of associated dimension tables. It is known as a star schema because
its structure resembles a star. The star schema is the simplest type of data
warehouse schema. It is also known as the star join schema.
In the following star schema example, the fact table is at the center and
contains keys to every dimension table, like Dealer_ID, Model_ID, Date_ID,
Product_ID, and Branch_ID, along with other attributes like units sold and
revenue.
Example of Star Schema Diagram
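The star layout can be sketched with in-memory tables. The dimension contents and figures below are invented, and `denormalize` is a hypothetical helper that performs the star join for one fact row.

```python
# A hedged sketch of a star schema: one fact table whose rows hold
# foreign keys into small dimension tables.
dim_product = {1: {"name": "Sedan"},  2: {"name": "SUV"}}
dim_branch  = {10: {"city": "Agra"}, 11: {"city": "Delhi"}}

fact_sales = [
    {"Product_ID": 1, "Branch_ID": 10, "units_sold": 4, "revenue": 80000},
    {"Product_ID": 2, "Branch_ID": 10, "units_sold": 1, "revenue": 35000},
]

def denormalize(fact_row):
    """Join one fact row to its dimensions, as a star-schema query would."""
    return {
        "product": dim_product[fact_row["Product_ID"]]["name"],
        "city":    dim_branch[fact_row["Branch_ID"]]["city"],
        "revenue": fact_row["revenue"],
    }

print(denormalize(fact_sales[0]))  # {'product': 'Sedan', 'city': 'Agra', 'revenue': 80000}
```

The fact table stays narrow (keys plus measures) while descriptive text lives once in each dimension table, which is the design choice a star schema makes.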

What is a Snowflake Schema?


A snowflake schema is an extension of a star schema in which the dimension
tables are normalized into additional related tables.

Example of Snowflake Schema

A fact constellation is a schema for representing a multidimensional model. It is
a collection of multiple fact tables sharing some common dimension tables. It
can be viewed as a collection of several star schemas and is hence also known
as a galaxy schema. It is one of the widely used schemas for data warehouse
design, and it is much more complex than the star and snowflake schemas. For
complex systems, we require fact constellations.

Figure – General structure of Fact Constellation



Unit - 2

Database vs Data Warehouse

 A database is a collection of related data that represents some
elements of the real world, whereas a data warehouse is an
information system that stores historical and cumulative data from
single or multiple sources.
 A database is designed to record data, whereas a data warehouse is
designed to analyze data.
 Databases use Online Transaction Processing (OLTP), whereas data
warehouses use Online Analytical Processing (OLAP).
 ER modeling techniques are used for designing databases, whereas
dimensional modeling techniques are used for designing data warehouses.
 Databases use OLTP to delete, insert, replace, and update large numbers
of online transactions quickly. This type of processing immediately
responds to user requests, and so it is used to process the day-to-day
operations of a business in real time. For example, if a user wants to
reserve a hotel room using an online booking form, the process is
executed with OLTP.
 Data warehouses use OLAP to analyze huge amounts of data. This
process gives analysts the power to look at data from different points of
view. For example, if your database records sales data for every minute of
every day, you may just want to know the total amount sold each day. To
do this, you need to collect and sum the sales data for each day. OLAP is
specifically designed for this type of query.
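The daily-total example above can be sketched as a roll-up from per-minute records to per-day sums. The sales records are invented for illustration.

```python
# An OLAP-style aggregation: per-minute sales rolled up to one total
# per day by grouping on the date part of each timestamp.
from collections import defaultdict

sales = [
    ("2023-02-20 09:01", 120), ("2023-02-20 17:45", 80),
    ("2023-02-21 11:30", 200),
]

daily_total = defaultdict(int)
for timestamp, amount in sales:
    day = timestamp.split(" ")[0]   # group key: the date part only
    daily_total[day] += amount

print(dict(daily_total))  # {'2023-02-20': 200, '2023-02-21': 200}
```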



What is Data Warehouse Implementation

The various phases of data warehouse implementation are planning, data
gathering, data analysis, and business actions. The process of establishing
and implementing a data warehouse system in an organization is known as data
warehouse implementation. Data warehousing is one of the most important
components of the business intelligence process for an organization.
1. Planning: Planning is one of the most important steps of the process. It
provides the road map that we have to follow to achieve our stated goals
and objectives. In the absence of sound planning, there is a high chance
that the project will fail.
2. Data gathering: Data is available everywhere, but not all of it is helpful
for an organization. Data gathering is a process that involves collecting
data from various sources so that it can be used for data analysis and
reporting. It involves a wide range of steps and is a time-consuming
process. We first need to identify the data that is going to be helpful for
the organization.
3. Data analysis: Once the data is collected, the next step is data analysis:
the process of extracting meaningful information from the data.
4. Business actions: Information obtained from data analysis is used for the
organization's decision making.
Client-Server Model
The client-server model is a distributed application structure that divides the
workload between the providers of a service (servers) and the requesters of the
service (clients). In the client-server architecture, when a client computer sends
a request for data to the server through the internet, the server accepts the
request, processes it, and delivers the requested data packets back to the client.
Clients do not share any of their resources.
Examples of the client-server model are email, the World Wide Web, etc.
How does the Client-Server Model work?
Client: In the digital world, a client is a computer or host capable of receiving
information or using a particular service from the service providers (servers).
Server: The word server means something that serves. Similarly, in the digital
world a server is a remote computer which provides information (data).

Advantages of Client-Server model:


 Centralized system with all data in a single place.
 Lower maintenance cost, and data recovery is possible.
 The capacity of the clients and servers can be changed separately.

Disadvantages of Client-Server model:


 Servers are prone to Denial of Service (DoS) attacks.
 If the server fails for any reason, then none of the requests of the clients can
be fulfilled. This leads to failure of the client-server network.
 Extra staff, such as a network administrator, is needed to manage and look
after the server.
Parallel Processing

Processing of multiple tasks simultaneously on multiple processors is


called parallel processing. A given task is divided into multiple subtasks using a
divide-and-conquer technique, and each subtask is processed on a
different central processing unit (CPU). Programming on a multiprocessor
system using the divide-and-conquer technique is called parallel programming.
Parallel processing is basically used to minimize the computation
time of a process.

What is a real life example of parallel computing?


A real-life example is people standing in a queue waiting for a movie ticket
when there is only one cashier. In such a situation, people have to wait for their
turn to get movie tickets. But if there are two or more cashiers, then people will
get their movie tickets much more quickly.
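The divide-and-conquer idea can be sketched with Python's standard thread pool. This is a simplification: true parallel processing runs the subtasks on separate CPUs (e.g. with separate processes), while a thread pool only runs them concurrently, but the split/compute/combine structure is the same.

```python
# Divide-and-conquer sketch: sum 1..100 by splitting the list into
# four subtasks and combining the partial results.
from concurrent.futures import ThreadPoolExecutor

numbers = list(range(1, 101))                             # the full task
chunks = [numbers[i:i + 25] for i in range(0, 100, 25)]   # divide into subtasks

with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))            # process subtasks concurrently

total = sum(partial_sums)                                 # combine the results
print(total)  # 5050
```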

Clustered systems

Clustered systems are similar to parallel systems, as both have multiple
CPUs. However, a major difference is that clustered systems are created from
two or more individual computer systems merged together. Basically, they are
independent computer systems with common storage, and the systems work
together.

Suppose a program or task needs to be completed by the clustered system. The


clustered system may have 3 computers attached. The program is divided into 3
small processes. Each process is assigned to each computer. Process 1 is
assigned to computer 1, process 2 is assigned to computer 2 and process 3 runs
on computer 3. When all the processes are completed then the output of the
program is delivered.
Types of clustered system

There are three main types of clustered systems.

Asymmetric clustering system

In this system, node X is idle (standby mode) and monitors other nodes in the
network. All other nodes work together. If any node fails then node X will take
the task of the failed node.

Symmetric clustering system

In this system, no node is idle in the network. All nodes work together and they
also monitor other nodes. If any node fails then the nearest node will take its
task.

Parallel clustering system

In a parallel clustering system, multiple users give tasks to the system, and all
the tasks are completed in parallel, as in the asymmetric and symmetric systems.

Benefits of clustered system


Some advantages of the clustered system are:-

 High performance: The nodes work together on large tasks and the overall
performance of the system is improved. Large tasks are completed in less time.

 Reliability: The tasks are completed without errors, and if any problem occurs
in the system it is easy to fix.

 Easy configuration: These systems have high data transfer speed, and all
computers are connected to a local area network (LAN). As all the computers
are placed near each other, they are easy to configure.

 Problem recovery: If any problem occurs in the system, it is self-recoverable
without user intervention.


Distributed Database

A distributed database represents multiple interconnected databases spread
across several sites connected by a network. Since the databases are all
connected, they appear as a single database to the users. In a DDBMS, a single
query may run on multiple local databases.
How is data stored in a Distributed Database?
There are mainly 3 ways of storing the data in a distributed database.

1. Data Replication
2. Data Fragmentation
3. Hybrid

1. Data Replication
The same data is stored at more than one site. This improves the availability of
the data as even if one site goes down, the data will still be available on the other
sites.
It can also improve performance by providing faster access.
However, replication does have the disadvantage of requiring more space to
store duplicate data and when one table is updated all the copies of it must also
be updated to maintain consistency.

2. Data Fragmentation

The process of dividing the database into multiple smaller parts is called
fragmentation. These fragments may be stored at different locations (sites).

3. Hybrid Storage
Hybrid data storage combines both data replication and fragmentation to get the
benefits of both models.
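Both strategies can be sketched with in-memory tables. The site names and the rule of fragmenting by region are assumptions made for this illustration.

```python
# Horizontal fragmentation by region, with each fragment also
# replicated to a backup copy for availability.
customers = [
    {"id": 1, "region": "north"}, {"id": 2, "region": "south"},
    {"id": 3, "region": "north"},
]

# Fragmentation: split the rows across sites by region.
fragments = {
    "site_north": [c for c in customers if c["region"] == "north"],
    "site_south": [c for c in customers if c["region"] == "south"],
}

# Replication: keep a full copy of each fragment at a second location,
# so the data survives if one site goes down.
replicas = {site: list(rows) for site, rows in fragments.items()}

print(len(fragments["site_north"]))  # 2
```

Using both together (fragment, then replicate each fragment) is exactly the hybrid storage model described above.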


Types of distributed databases

The two types of distributed systems are as follows:

1. Homogeneous distributed database system:

 A homogeneous distributed database system is a network of two or more
databases (with the same type of DBMS software) which can be stored on
one or more machines.
 In this system, data can be accessed and modified simultaneously on
several databases.
 Homogeneous distributed systems are easy to handle.

Example: Consider three departments using Oracle-9i for their DBMS. If
changes are made in one department's database, they are propagated to the
other departments as well.

2. Heterogeneous distributed database system:

A heterogeneous distributed database system is a network of two or more
databases with different types of DBMS software, which can be stored on
one or more machines.
Example: Consider three departments using different DBMS software.

HARDWARE AND OPERATING SYSTEMS

Hardware and operating systems create the computing environment for your
data warehouse. All the data extraction, transformation, and integration jobs
run on the selected hardware and operating system.

Some general guidelines for hardware selection:

1. Scalability: When your data warehouse grows in terms of the number of
users, the number of queries, and the complexity of the queries, ensure that
your selected hardware can be scaled up.
2. Support: Vendor support is important for hardware maintenance. (The
vendor is the supplier of the hardware.)


What is data processing in data mining?

Data processing is the collection of raw data and its translation into usable
information. The raw data is collected, filtered, sorted, processed, analyzed,
stored, and then presented in a readable format. It is usually performed in a
step-by-step process by a team of data scientists and data engineers in an
organization.

Data processing is carried out automatically or manually. Nowadays, most data
is processed automatically with the help of computers, which is faster and gives
accurate results. Data can thus be converted into different forms, graphic as
well as audio. Data processing is crucial for organizations to create better
business strategies. The most commonly used tools for data processing are
Storm, Hadoop, HPCC, and CouchDB.



Stages of Data Processing
Data processing consists of the following six stages.

1. Data collection: The collection of raw data is the first step of the data
processing cycle. The raw data collected has a huge impact on the output
produced. Hence, raw data should be gathered from defined and accurate
sources.

2. Data preparation: Data preparation, or data cleaning, is the process of
sorting and filtering the raw data to remove unnecessary and inaccurate data.
Raw data is checked for errors, duplication, miscalculations, or missing data
and transformed into a suitable form for further analysis and processing. This
ensures that only the highest-quality data is fed into the processing unit.

3. Data input: In this step, the raw data is converted into machine-readable
form and fed into the processing unit.

4. Data processing: In this step, the data is processed, for instance using
machine learning or artificial intelligence algorithms, to generate the desired
output.

5. Data interpretation or output: The data is finally transmitted and displayed
to the user in a readable form like graphs, tables, vector files, audio, video,
documents, etc.

6. Data storage: The last step of the data processing cycle is storage, where
data and metadata are stored for further use.
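The six stages above can be compressed into one toy pipeline. The raw readings and the cleaning rule (keep only digit strings) are invented for illustration.

```python
# One pass through the six-stage data processing cycle.
raw = ["  42 ", "17", "", "oops", "8"]                        # 1. collection

prepared = [r.strip() for r in raw if r.strip().isdigit()]    # 2. preparation (clean/filter)
values = [int(r) for r in prepared]                           # 3. input (machine-readable)
result = {"count": len(values), "total": sum(values)}         # 4. processing
report = f"{result['count']} readings, total {result['total']}"  # 5. interpretation/output
storage = {"data": values, "metadata": result}                # 6. storage

print(report)  # 3 readings, total 67
```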

What are Data Mining functionalities?


Data mining functionalities are used to specify what kinds of patterns are
present in our data during data mining tasks. We can divide data mining
tasks into two categories:
1. Descriptive mining tasks
2. Predictive mining tasks

In descriptive mining tasks, we try to find out the general properties present in
our data.
Suppose there is a mart near your home. One day you visit the mart and see
that the manager is observing customers' purchasing behavior: who is buying
what. Being a curious person, you go to him and ask him why he is doing this.
The manager replies that he is trying to identify products that are purchased
together so that he can rearrange the mart accordingly. For example, if you buy
bread, the next thing you may buy is eggs or butter; so, if these items are kept
close to the bread, the mart's sales may rise. This is known as association
analysis and is considered a descriptive data mining task.
Some of the descriptive data mining tasks are association, clustering,
summarization, etc.
1) Association
Association is used to find connections among a set of items. It mainly
tries to identify the relationships between objects.
For example:
If a retailer finds that bread and eggs are mostly bought together, he can put
eggs on sale to promote the sale of bread.
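A minimal sketch of association analysis, assuming invented transactions: count how often each pair of items appears together in the same basket.

```python
# Count co-occurring item pairs across shopping baskets; pairs with a
# high count are candidates for "bought together" associations.
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "eggs", "milk"},
    {"bread", "eggs"},
    {"bread", "butter"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):  # sort for a canonical pair order
        pair_counts[pair] += 1

print(pair_counts[("bread", "eggs")])  # 2 -> bread and eggs are often bought together
```

Real association mining (e.g. the Apriori family) also computes support and confidence thresholds, but pair counting is the core co-occurrence idea.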

2) Clustering
Clustering is a process to identify data objects that are similar to one
another.
For example:
A telecom company can cluster its customers based on age, residence, income,
etc. This helps the company understand its customers better, resolve their
issues, and provide better-customized services.

3) Summarization
Summarization is a technique for the generalization of data.
For example:
The shopping done by a customer can be summarized into total products, total
spending offers used, etc. Such high-level summarized information can be useful
for sales.
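This kind of summarization can be sketched with ordinary dictionary aggregation. The purchase records below are invented for illustration:

```python
# Hypothetical raw purchase records for one customer.
purchases = [
    {"product": "bread", "price": 2.5, "offer_used": False},
    {"product": "eggs",  "price": 3.0, "offer_used": True},
    {"product": "milk",  "price": 1.5, "offer_used": True},
]

# Generalize the detailed records into high-level totals.
summary = {
    "total_products": len(purchases),
    "total_spending": sum(p["price"] for p in purchases),
    "offers_used": sum(1 for p in purchases if p["offer_used"]),
}
print(summary)
```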

Predictive mining task

In predictive mining tasks, we try to draw inferences from the current
data in order to make predictions for the future.
For example:
Suppose your friend is a medical practitioner and he is trying to diagnose a
disease based on the medical test results of a patient. This can be considered
a predictive data mining task, where we try to predict or classify new data
based on historical data.
Some of the predictive data mining tasks are classification, prediction, time-series
analysis etc.
1) Classification
Classification is a process where we try to build a model that can
determine the class of an object based on its different attributes.
Here, a collection of records is available, and each record contains a set of
attributes.
Let’s take an example and try to understand it.
Classification can be used in direct marketing so that we can reduce marketing
costs by targeting a set of customers who are likely to buy a new product. Using
the available data, it is possible to know which customers purchased similar
products and which did not purchase in the past. Hence, the {purchase, don’t
purchase} decision forms the class attribute in this case. Once the class attribute is assigned,
demographic and lifestyle information of customers who purchased similar
products can be collected and promotion emails can be sent to them directly.
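A toy sketch of such a classification model, here a simple nearest-neighbour rule. All records, attribute names, and labels below are invented for illustration:

```python
# Hypothetical training records: (age, income in thousands) -> class label.
training = [
    ((25, 30), "purchase"),
    ((30, 45), "purchase"),
    ((55, 20), "don't purchase"),
    ((60, 25), "don't purchase"),
]

def classify(record):
    """Return the class of the closest training record (squared Euclidean distance)."""
    nearest = min(training,
                  key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], record)))
    return nearest[1]

print(classify((28, 40)))   # a younger, higher-income customer
```

A real system would use a trained model (decision tree, logistic regression, etc.), but the idea is the same: attributes in, class label out.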
2) Prediction
In the prediction task, we try to predict the possible values of missing
data. Here, we build a model based on the available data and this model is then
used in predicting future values of a new data set.
For example:
If we want to predict the price of a new house based on the historical data
available, such as the number of bedrooms, number of kitchens, number of
bathrooms, carpet area, old house prices, etc., then we have to build a model
that can predict the new house price from the given inputs. Prediction
analysis is also used in different areas, including fraud detection, medical diagnosis, etc.
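For a single input such as carpet area, such a prediction model can be sketched as an ordinary least-squares line fit. The figures below are hypothetical:

```python
# Hypothetical historical data: (carpet area in sq. ft., price in lakhs).
areas  = [500, 750, 1000, 1250, 1500]
prices = [25, 35, 45, 55, 65]

# Fit a straight line price = intercept + slope * area by least squares.
n = len(areas)
mean_x = sum(areas) / n
mean_y = sum(prices) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(areas, prices))
         / sum((x - mean_x) ** 2 for x in areas))
intercept = mean_y - slope * mean_x

def predict(area):
    """Predict the price of a new house from its carpet area."""
    return intercept + slope * area

print(predict(2000))
```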
3) Time series analysis
Time series analysis includes methods to analyze time-series data in order
to extract useful patterns, trends, rules, and statistics.
For example:
Stock price prediction is an important application of time-series analysis.
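A common first step in time-series analysis is smoothing the series with a moving average to reveal the trend. The price series below is invented:

```python
# Hypothetical daily stock prices.
prices = [100, 102, 101, 105, 107, 106, 110]

def moving_average(series, window=3):
    """Average each consecutive window of values to smooth out noise."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

print(moving_average(prices))
```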


What Is Data Preprocessing? Why Is It Important?


Data Preprocessing is the process of transforming raw data into a readable
format. It is an important step in data mining, as we cannot work with raw
data. The quality of the data should be checked before applying machine
learning or data mining algorithms.

Some common steps in data preprocessing include:


 Data cleaning: This step involves identifying and removing missing,
inconsistent, or irrelevant data. This can include removing duplicate
records, filling in missing values, and handling outliers.
 Data integration: This step involves combining data from multiple
sources, such as databases, spreadsheets, and text files. The goal of
integration is to create a single, consistent view of the data.
 Data transformation: This step involves converting the data into a format
that is more suitable for data mining. This can include normalizing
numerical data and creating dummy variables.
 Data reduction: Data reduction is a method of reducing the size of the
original data so that it can be represented in a much smaller space. Data reduction
can increase storage efficiency and performance and reduce storage costs.
 Data discretization: Data discretization is a method of converting a large
number of data values into a small number of intervals or labels so that the
evaluation and management of the data become easy.

Now, we can understand this concept with the help of an example

Suppose we have an attribute Age with the following values:

Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

After discretization, the values are grouped into intervals and each group is
given a label:

Attribute   Values                       Label after Discretization
Age         1, 5, 4, 9, 7                Child
Age         11, 14, 17, 13, 18, 19       Young
Age         31, 33, 36, 42, 44, 46       Mature
Age         70, 74, 77, 78               Old
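The Age discretization above amounts to a simple binning function; a minimal sketch (the interval boundaries are chosen to match the table):

```python
def discretize_age(age):
    """Map a numeric age onto the interval labels used in the table above."""
    if age <= 10:
        return "Child"
    elif age <= 30:
        return "Young"
    elif age <= 50:
        return "Mature"
    return "Old"

ages = [1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
        31, 33, 36, 42, 44, 46, 70, 74, 78, 77]
print([discretize_age(a) for a in ages])
```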

What is data cleaning in data mining?

Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data
cleaning is done. It involves the handling of missing data, noisy data, etc.

 (a). Missing Data:


This situation arises when some values are missing in the data. It can be
handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large
and multiple values are missing within a tuple.
2. Fill in the missing values:
There are various ways to do this task. You can choose to fill in the
missing values manually, or with the most probable value.
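Filling missing values with the attribute mean is one common "most probable value" choice; a minimal sketch (the income column is invented, with None marking a missing entry):

```python
# Hypothetical attribute column with missing entries.
incomes = [30, None, 45, None, 60]

# Replace each missing value with the mean of the known values.
known = [v for v in incomes if v is not None]
mean = sum(known) / len(known)
filled = [v if v is not None else mean for v in incomes]
print(filled)
```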

 (b). Noisy Data:


Noisy data is meaningless data that cannot be interpreted by machines. It
can be generated due to faulty data collection, data entry errors, etc. It can
be handled in the following ways:
1. Binning Method:
This method works on sorted data. The whole data is divided into
segments of equal size, and then various methods are performed to
complete the task. Each segment is handled separately.
2. Regression:
Here the data can be smoothed by fitting it to a regression function.

3. Clustering:
This approach groups similar data into clusters. It is used for
finding outliers and also for grouping the data.
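The binning method above can be sketched as smoothing by bin means: sort the data, split it into equal-size bins, and replace each value by its bin's mean. The values are illustrative:

```python
# Illustrative noisy values, sorted before binning.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

# Replace every value in a bin with the bin mean (smoothing by bin means).
smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]
    mean = sum(bin_values) / len(bin_values)
    smoothed.extend([mean] * len(bin_values))
print(smoothed)
```

Variants replace values by the bin median or by the nearest bin boundary instead of the mean.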



What is Data Reduction?
Data mining is applied to selected data in a large database. When data
analysis and mining are done on a huge amount of data, processing takes a very
long time. Data reduction can reduce the processing time for data analysis:
data reduction techniques are used to obtain a reduced representation of the
dataset that is much smaller in volume while maintaining the integrity of the
original data. By reducing the data, the efficiency of the data mining process
is improved.
There are various strategies for data reduction which are as follows –
1. Data cube aggregation − This technique is used to aggregate data in a simpler
form. Data cube aggregation is a multidimensional aggregation. We can easily
understand data cube aggregation with the help of an example.
For example, suppose you have the data of All Electronics sales per quarter for
the year 2018 to the year 2020. If you want to get the annual sale per year, you
just have to aggregate the sales per quarter for each year. In this way,
aggregation provides you with the required data, which is much smaller in size,
and thereby we achieve data reduction even without losing any data.
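The quarterly-to-annual aggregation described above can be sketched as follows (the sales figures are hypothetical):

```python
# Hypothetical All Electronics sales per quarter, keyed by year.
quarterly_sales = {
    2018: [100, 120, 90, 130],
    2019: [110, 125, 95, 140],
    2020: [105, 115, 100, 135],
}

# Aggregate the four quarterly figures into one annual figure per year.
annual_sales = {year: sum(quarters) for year, quarters in quarterly_sales.items()}
print(annual_sales)
```

Twelve data points are reduced to three without losing the information needed for annual analysis.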

2. Data Compression

Data compression employs modification, encoding, or converting the structure
of data in a way that consumes less space. Data compression involves building a
compact representation of information by removing redundancy and representing
data in binary form. Compression from which the original data can be restored
exactly is called lossless compression; in contrast, compression from which the
original form cannot be restored is called lossy compression. In lossy data
compression, the decompressed data may differ from the original data but are
useful enough to retrieve information from them.
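A lossless round-trip can be demonstrated with Python's standard zlib module: the decompressed data matches the original exactly, while the compressed form is much smaller for redundant input:

```python
import zlib

# Highly redundant data compresses well.
original = b"data warehousing " * 100
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossless: the original is recovered exactly from the compressed form.
print(len(original), len(compressed), restored == original)
```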
Numerosity Reduction: Numerosity reduction is a data reduction technique
that replaces the original data with a smaller form of data representation.
It reduces the number of data points in a dataset while still preserving the
most important information. This is beneficial when the dataset is too large
to be processed efficiently, or when it contains many irrelevant or redundant
data points. There are two techniques for numerosity reduction: parametric and
non-parametric methods.
Types of Numerosity Reduction

There are two types of Numerosity reduction, such as:

1. Parametric
2. Non-Parametric

1. Parametric - This method assumes a model into which the data fits. Data
model parameters are estimated, and only those parameters are stored,
and the rest of the data is discarded. Regression and Log-Linear methods
are used for creating such models.

2. Non-Parametric - These methods store reduced representations of the data
and include histograms, clustering, sampling, and data cube aggregation.

Histograms: A histogram represents data in terms of frequency. It uses
binning to approximate the data distribution.

Clustering: Clustering divides the data into groups/clusters. This technique
partitions the whole data into different clusters. In data reduction, the
cluster representations of the data are used to replace the actual data. It
also helps to detect outliers in the data.

Sampling: Sampling can be used for data reduction because it converts a large
data set into a much smaller random data sample (or subset).

Data cube aggregation: This technique aggregates the data into a simpler,
multidimensional summary form.
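As a concrete sketch of the sampling technique listed above, the standard library can draw a small random subset from a large dataset (the population here is a stand-in for real data):

```python
import random

random.seed(0)                      # fixed seed so the sketch is reproducible
population = list(range(10_000))    # stand-in for a large dataset
sample = random.sample(population, 100)   # sampling without replacement

print(len(sample), len(sample) / len(population))
```

The mining algorithm then runs on 100 points instead of 10,000, at the cost of some sampling error.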



Data Warehousing and Data Mining MCQ Quiz

1. __________ is a subject-oriented, integrated, time-variant, nonvolatile collection of
data in support of management decisions.
A. Data Mining.
B. Data Warehousing.
C. Web Mining.
D. Text Mining.
2. The data Warehouse is__________.
A. read only.
B. write only.
C. read write only.
D. none.
3. Expansion for DSS in DW is__________.
A. Decision Support system.
B. Decision Single System.
C. Data Storable System.
D. Data Support System.
4. The important aspect of the data warehouse environment is that data found within
the data warehouse is___________.
A. subject-oriented.
B. time-variant.
C. integrated.
D. All of the above.
5. The time horizon in Data warehouse is usually __________.
A. 1-2 years.
B. 3-4years.
C. 5-6 years.
D. 5-10 years.
6. The data is stored, retrieved & updated in ____________.
A. OLAP.
B. OLTP.
C. SMTP.
D. FTP.
7. __________describes the data contained in the data warehouse.
A. Relational data.
B. Operational data.
C. Metadata.
D. Informational data.
8. ____________predicts future trends & behaviors, allowing business managers to
make proactive, knowledge-driven decisions.
A. Data warehouse.
B. Data mining.
C. Datamarts.
D. Metadata.
9. __________ is the heart of the warehouse.
A. Data mining database servers.
B. Data warehouse database servers.
C. Data mart database servers.
D. Relational data base servers.
10. ________________ is the specialized data warehouse database.
A. Oracle.
B. DB2.
C. Informix.
D. Redbrick.

DATE:-14/April/2023 Sachin Raj Saxena (Assistant Professor in C.S Dept.)


Data Warehousing And Data Mining
Assignment - 1
1. List out the applications of Data mining?

2. Define data cleaning?

3. What is a data cube?

4. Define OLTP & OLAP?

5. Give the differences between a database and a data warehouse?

6. Explain Multidimensional data model with a neat diagram?

7. Define data warehouse. Draw the architecture of data warehouse and

explain the three tiers in detail?

8. Short Notes on - Generalization, Summarization & Discretization?

9. What is Data Warehouse Implementation explain it with the help of an

Diagram?

10. Explain the various schemas of a data warehouse?

