Data Warehousing and Data Mining Original Notes
Airline: In the airline system, it is used for operational purposes like crew assignment, analysis of route profitability, etc.
Banking: It is widely used in the banking sector to manage resources. A few banks also use it for market research and performance analysis of products and operations.
Public sector: In the public sector, a data warehouse is used for intelligence gathering. It helps government agencies maintain and analyze tax records and health policy records for every individual.
Telecommunication: A data warehouse is used in this sector for product promotions.
Data warehouse applications are designed to support users' ad-hoc data requirements, an activity recently dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary reporting, and trend analysis.
Data warehouses and their architectures vary depending upon the elements of an organization's situation.
Operational System
Flat Files
A flat file system is a system of files in which transactional data is stored, and every file in the system must have a different name.
Metadata
A set of data that defines and gives information about other data.
Metadata summarizes basic information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are examples of very basic document metadata.
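As a small illustration, here is a minimal Python sketch (standard library only) that reads this kind of basic file metadata; the file name is hypothetical.

```python
import os
from datetime import datetime

# Hypothetical file; substitute any real path.
path = "notes.txt"

info = os.stat(path)
metadata = {
    "size_bytes": info.st_size,                         # file size
    "modified": datetime.fromtimestamp(info.st_mtime),  # date modified
    "created": datetime.fromtimestamp(info.st_ctime),   # creation / metadata-change time
}
print(metadata)
```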
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.
We must clean and process operational data before putting it into the warehouse.
The figure illustrates an example where purchasing, sales, and stocks are
separated. In this example, a financial analyst wants to analyze historical data
for purchases and sales or mine historical information to make predictions about
customer behavior.
Single-Tier Architecture
As the figure shows, the only layer physically available is the source layer. In this method, data warehouses are virtual: the data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are directed to operational data after the middleware interprets them. In this way, queries affect transactional workloads.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier
architecture for a data warehouse system, as shown in fig:
Although it is typically called two-layer architecture to highlight the separation between physically available sources and the data warehouse, it in fact consists of four successive data flow stages: the source layer, data staging, the data warehouse layer, and the analysis layer.
Three-Tier Architecture
A fact constellation is a schema for representing a multidimensional model. It is a collection of multiple fact tables sharing some common dimension tables. It can be viewed as a collection of several star schemas and is hence also known as a galaxy schema. It is one of the widely used schemas for data warehouse design and is much more complex than the star and snowflake schemas. For complex systems, we require fact constellations.
Clustered systems
Clustered systems are similar to parallel systems in that both have multiple CPUs. However, a major difference is that clustered systems are created from two or more individual computer systems merged together. Basically, they are independent computer systems with common storage, and the systems work together.
Asymmetric clustering: In this system, node X is idle (standby mode) and monitors the other nodes in the network. All other nodes work together. If any node fails, node X takes over the task of the failed node.
Symmetric clustering: In this system, no node is idle in the network. All nodes work together and also monitor the other nodes. If any node fails, the nearest node takes over its task.
In a parallel system, multiple users give tasks to the system, and all the tasks are completed in parallel, as in asymmetric and symmetric systems.
High performance: The nodes work together on large tasks, and the overall performance of the system is improved; a large task is completed in less time.
Reliable: Tasks are completed without errors, and if any problem occurs in the system, it is easy to fix.
Easy configuration: These systems have high data transfer speed, and all computers are connected to a local area network (LAN). As all the computers are placed close to each other, they are easy to configure.
Distributed Database
1. Data Replication
2. Data Fragmentation
3. Hybrid
1. Data Replication
The same data is stored at more than one site. This improves the availability of the data: even if one site goes down, the data will still be available at the other sites.
It can also improve performance by providing faster access.
However, replication has the disadvantage of requiring more space to store duplicate data, and when one table is updated, all copies of it must also be updated to maintain consistency.
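A toy sketch of full replication, using made-up site names and a one-record table: reads can be served from any copy, but an update must touch every copy to keep them consistent.

```python
# Each site keeps its own full copy of the same table (full replication).
sites = {
    "site_A": {"cust_1": {"name": "Asha", "balance": 500}},
    "site_B": {"cust_1": {"name": "Asha", "balance": 500}},
    "site_C": {"cust_1": {"name": "Asha", "balance": 500}},
}

def replicated_update(key, field, value):
    """An update must be applied at every site to keep the copies consistent."""
    for copy in sites.values():
        copy[key][field] = value

def read_any(key):
    """A read can be served by whichever site is available."""
    for copy in sites.values():
        return copy[key]

replicated_update("cust_1", "balance", 750)
print(read_any("cust_1"))  # {'name': 'Asha', 'balance': 750}
```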
2. Data Fragmentation
The process of dividing the database into multiple smaller parts is called fragmentation. These fragments may be stored at different locations (sites).
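A toy sketch of horizontal fragmentation: rows of one logical table are split by a hypothetical region column, and each fragment would be stored at a different site.

```python
customers = [
    {"id": 1, "name": "Asha",  "region": "north"},
    {"id": 2, "name": "Bilal", "region": "south"},
    {"id": 3, "name": "Chen",  "region": "north"},
]

# Horizontal fragmentation: each site stores only the rows for its region.
fragments = {}
for row in customers:
    fragments.setdefault(row["region"], []).append(row)

print(fragments["north"])  # rows stored at the 'north' site
print(fragments["south"])  # rows stored at the 'south' site
```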
3. Hybrid Storage
Hybrid data storage combines both data replication and fragmentation to get the
benefits of both models.
========================================================
1. Scalability - When your data warehouse grows in terms of the number of users, the number of queries, and the complexity of the queries, ensure that the selected hardware can be scaled up.
2. Support - Vendor support is important for hardware maintenance. (The vendor is the one who supplies and maintains the hardware.)
Data processing occurs when raw data is collected and translated into usable information. The raw data is collected, filtered, sorted, processed, analyzed, stored, and then presented in a readable format. It is usually performed in a step-by-step process by a team of data scientists and data engineers in an organization.
1. Data Collection - The collection of raw data is the first step of the data processing cycle. The raw data collected has a huge impact on the output produced; hence, raw data should be gathered from defined and accurate sources.
2. Data Preparation - The raw data is cleaned and organized: errors, duplicates, and incomplete records are removed so that only high-quality data moves forward.
3. Data Input - In this step, the raw data is converted into machine-readable form and fed into the processing unit.
4. Data Processing - In this step, data is processed using machine learning and artificial intelligence algorithms to generate the desired output.
5. Data Output - The processed data is interpreted and presented to the user in a readable form such as graphs, tables, or reports.
6. Data Storage - The last step of the data processing cycle is storage, where data and metadata are stored for further use.
In descriptive mining tasks, we try to find out the general properties present in
our data.
Let's suppose there is a mart near your home. One day you visit that mart and see that the mart manager is trying to observe customers' purchasing behavior: who is buying what? Being a curious person, you go to him and ask why he is doing this.
The mart manager replies that he is trying to identify products that are purchased together so that he can rearrange the mart accordingly. For example, if you buy bread, the next thing you may buy is eggs or butter. So, if these items are kept close to the bread, the mart's sales may rise. This is known as association analysis and is considered a descriptive data mining task.
Some of the descriptive data mining tasks are Association, Clustering, Summarization, etc.
1) Association
Association is used to find the association or connection among a set of items. It mainly tries to identify relationships between objects.
For example:
If a retailer finds that bread and eggs are mostly bought together, he can put eggs on sale to promote the sale of bread.
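A minimal sketch of how such a rule can be measured, computing the support and confidence of bread → eggs over a few made-up transactions.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "eggs", "milk"},
    {"bread", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

n = len(transactions)
both  = sum(1 for t in transactions if {"bread", "eggs"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support    = both / n      # fraction of transactions with bread AND eggs
confidence = both / bread  # P(eggs | bread)

print(f"support(bread->eggs)    = {support:.2f}")     # 0.50
print(f"confidence(bread->eggs) = {confidence:.2f}")  # 0.67
```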
2) Clustering
Clustering is a process to identify data objects that are similar to one
another.
For example:
A telecom company can cluster its customers based on age, residence, income, etc. This will help the telecom company understand its customers better and hence resolve issues and provide better customized services.
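A short sketch of the telecom example using scikit-learn's KMeans (assuming scikit-learn is installed); the customer records are invented.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [age, monthly income].
X = np.array([
    [22, 1800], [25, 2100], [24, 1900],   # younger, lower income
    [45, 6500], [50, 7000], [48, 6800],   # older, higher income
])

# Group similar customers into 2 clusters.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)  # e.g. [0 0 0 1 1 1] -- cluster assignment per customer
```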
3) Summarization
Summarization is a technique for the generalization of data.
For example:
The shopping done by a customer can be summarized into total products purchased, total spending, offers used, etc. Such high-level summarized information can be useful for sales.
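A small pandas sketch (assuming pandas is available) that summarizes made-up purchase records into per-customer totals, as described above.

```python
import pandas as pd

# Hypothetical purchase records.
purchases = pd.DataFrame({
    "customer": ["A", "A", "B", "B", "B"],
    "amount":   [20.0, 35.0, 10.0, 15.0, 40.0],
})

# Summarize detail rows into high-level per-customer totals.
summary = purchases.groupby("customer")["amount"].agg(
    total_products="count", total_spending="sum"
)
print(summary)
```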
In predictive mining tasks, we draw inferences from the current data in order to make predictions about the future.
For example:
Let's suppose your friend is a medical practitioner who is trying to diagnose a disease based on a patient's medical test results. This can be considered a predictive data mining task, where we try to predict or classify new data based on historical data.
Some of the predictive data mining tasks are classification, prediction, time-series analysis, etc.
1) Classification
Classification is a process where we try to build a model that can determine the class of an object based on its different attributes. Here, a collection of records is available, and each record is represented by a set of attributes.
Let’s take an example and try to understand it.
Classification can be used in direct marketing to reduce marketing costs by targeting a set of customers who are likely to buy a new product. Using the available data, it is possible to know which customers purchased similar products in the past and which did not. Hence, the {purchase, don't purchase} decision forms the class attribute in this case. Once the class attribute is assigned, demographic and lifestyle information of customers who purchased similar products can be collected, and promotional emails can be sent to them directly.
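A minimal sketch of this idea with a scikit-learn decision tree; the features (age, income) and the {purchase, don't purchase} labels are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical historical customers: [age, income]; label 1 = purchased.
X = [[25, 30000], [40, 80000], [35, 60000], [22, 25000], [50, 90000]]
y = [0, 1, 1, 0, 1]

# Build a model that assigns the {purchase, don't purchase} class.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Classify a new prospect before sending a promotional email.
print(clf.predict([[38, 70000]]))  # e.g. [1] -> likely to purchase
```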
2) Prediction
In the prediction task, we try to predict the possible values of missing or unavailable data. Here, we build a model based on the available data, and this model is then used to predict future values for a new data set.
For example:
Suppose we want to predict the price of a new house based on available historical data such as the number of bedrooms, number of kitchens, number of bathrooms, carpet area, and old house prices. Then we have to build a model that can predict the new house price from the given inputs. Prediction analysis is also used in other areas, including fraud detection, medical diagnosis, etc.
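A sketch of the house-price example using scikit-learn's linear regression; every number below is made up.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical old houses: [bedrooms, bathrooms, carpet_area_sqft] -> price.
X = [[2, 1, 800], [3, 2, 1200], [4, 3, 1800], [3, 2, 1100]]
y = [150000, 220000, 330000, 210000]

# Fit a model on historical data, then predict a new house's price.
model = LinearRegression().fit(X, y)
print(model.predict([[3, 2, 1300]]))  # estimated price for the new house
```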
3) Time series analysis
Time series analysis includes methods to analyze time-series data in order
to extract useful patterns, trends, rules, and statistics.
For example:
Stock price prediction is an important application of time-series analysis.
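A minimal sketch of one basic time-series technique, a moving average, which smooths made-up daily closing prices to expose the underlying trend.

```python
# Hypothetical daily closing prices.
prices = [101, 103, 102, 105, 107, 110, 108]

def moving_average(series, window=3):
    """Average each value with its recent history to smooth out noise."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

print(moving_average(prices))  # smoothed series, e.g. starts at 102.0
```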
Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling missing data, noisy data, etc.
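A small pandas sketch (assuming pandas is available) of handling missing data, one of the cleaning steps mentioned above; the table is made up.

```python
import pandas as pd

# Hypothetical table with missing values (None becomes NaN).
df = pd.DataFrame({
    "age":    [25, None, 40, 38],
    "income": [30000, 45000, None, 41000],
})

# Handle missing data: fill gaps with the column mean (one common choice;
# rows with gaps could instead be dropped with df.dropna()).
df = df.fillna(df.mean())
print(df)
```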
3. Clustering:
This approach groups similar data into clusters. It is used for finding outliers and also for grouping data.
2. Data Compression
1. Parametric
2. Non-Parametric
1. Parametric - This method assumes a model into which the data fits. The model parameters are estimated, only those parameters are stored, and the rest of the data is discarded. Regression and log-linear methods are used for creating such models.
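A sketch of the parametric idea with NumPy: fit a regression line to invented points and keep only the two fitted parameters instead of the raw data.

```python
import numpy as np

# Hypothetical raw data: many (x, y) points lying roughly on a line.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 12.0])

# Fit y ~ slope*x + intercept; store only these 2 numbers, discard the points.
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)        # the stored model parameters

# The data can be approximately reconstructed from the parameters alone.
print(slope * x + intercept)
```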
Sampling: Sampling can be used for data reduction because it converts a large
data set into a much smaller random data sample (or subset).
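A one-step sketch of simple random sampling with Python's standard library; the data set is hypothetical.

```python
import random

# Hypothetical large data set reduced to a small random subset.
data = list(range(1, 10001))         # 10,000 records
sample = random.sample(data, k=100)  # simple random sample without replacement
print(len(sample), sample[:5])
```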