Data Warehousing & Data Mining-A View
and analysis. Data and information are extracted from non-homogeneous sources as
they are generated and processed using process managers (load/warehouse/query). This
makes it much easier and more efficient to run queries over data that originally came
from different sources. It also enables people to make informed decisions.
Data mining draws from the data warehouse, revealing patterns of information
in historical data, in terms of customer data or any other data in ways that we never
thought possible. It combines techniques like statistical analysis, data visualization,
induction and neural networks. Data mining systems improve an organization’s
effectiveness, efficiency and value by increasing the usefulness of the knowledge the
organization possesses.
Extract and load the data: Data extraction involves extracting data from source
systems and making it available to the data warehouse, whereas data load takes the
extracted data and loads it into the data warehouse.
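The extract-and-load step can be sketched as follows; this is a minimal illustration only, and the two non-homogeneous sources (a CSV-style sales feed and a list of dicts from an operational system) are invented for the example:

```python
# Minimal sketch of extract-and-load from two non-homogeneous sources.
import csv
import io

def extract_csv(text):
    """Extract rows from a CSV-style source into plain dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def load(warehouse, rows, source):
    """Load extracted rows into the warehouse, tagging each with its origin."""
    for row in rows:
        warehouse.append({**row, "_source": source})

warehouse = []
load(warehouse, extract_csv("id,amount\n1,10\n2,25"), "sales_feed")  # CSV feed
load(warehouse, [{"id": "3", "amount": "40"}], "ops_db")             # operational DB
print(len(warehouse))  # rows from both sources now live side by side
```

Tagging each row with its source makes it possible to trace any warehouse record back to the system it came from.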
Clean and transform data: This step performs consistency checks on the loaded data
and then structures it for query performance and to minimize operational costs.
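A hedged sketch of the clean-and-transform step: run consistency checks on loaded rows and restructure the survivors into typed fields (the record layout and checks are assumptions for illustration):

```python
# Clean-and-transform sketch: reject inconsistent records, type the rest.
def clean_and_transform(rows):
    clean = []
    for row in rows:
        # consistency checks: 'id' must be present, 'amount' must be numeric
        if not row.get("id") or not str(row.get("amount", "")).replace(".", "", 1).isdigit():
            continue  # reject the inconsistent record
        # transform: structure the record with proper types for querying
        clean.append({"id": int(row["id"]), "amount": float(row["amount"])})
    return clean

rows = [{"id": "1", "amount": "10"},   # valid
        {"id": "", "amount": "5"},     # incomplete
        {"id": "2", "amount": "bad"}]  # inconsistent
print(clean_and_transform(rows))
```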
Data warehousing-“Errors”: The possible errors encountered in a data warehouse
are:
• Incomplete errors: missing records, etc.
• Incorrect errors: wrong (though sometimes plausible-looking) codes, wrong calculations, etc.
• Incomprehensibility errors: unknown codes, spreadsheets and word-processing files, etc.
• Inconsistency errors: inconsistent use of different codes, overlapping codes, etc.
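Three of the four error categories can be illustrated with a small per-record audit (inconsistency errors need cross-record checks, so they are omitted here); the schema with `code`, `price`, `qty` and `total` fields is invented for the example:

```python
# Illustrative audit that flags error categories from the list above.
VALID_CODES = {"A", "B", "C"}  # assumed reference set of known codes

def audit(record):
    errors = []
    if record.get("code") is None:
        errors.append("incomplete")          # missing field -> incomplete error
    elif record["code"] not in VALID_CODES:
        errors.append("incomprehensible")    # unknown code -> incomprehensibility error
    if record.get("total") != record.get("price", 0) * record.get("qty", 0):
        errors.append("incorrect")           # wrong calculation -> incorrect error
    return errors

print(audit({"code": "Z", "price": 2, "qty": 3, "total": 7}))
```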
Back up and archive data: Data is backed up regularly, and older data is removed
from the system and archived in a format that allows it to be quickly restored if
required.
Query management: This manages queries and speeds them up by directing each
query to the most effective data source, while also monitoring actual query profiles.
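Query management can be sketched as a simple router; the routing rule (aggregate queries go to a pre-summarized table, everything else to detail data) and the table names are assumptions for illustration:

```python
# Sketch of query management: route each query to the most effective
# source and record a profile of what ran.
query_profiles = []

def route(query):
    # assumed rule: aggregate queries are served from a summary table
    source = "summary_table" if query.strip().upper().startswith("SELECT SUM") else "detail_table"
    query_profiles.append({"query": query, "source": source})  # monitor query profiles
    return source

print(route("SELECT SUM(amount) FROM sales"))
print(route("SELECT * FROM sales WHERE id = 1"))
```

In a real warehouse the recorded profiles would feed back into decisions about which summaries to maintain.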
Data warehouse architecture:
Data warehouse architecture (DWA) is a way of representing the overall
structure of data, communication, processing and presentation that exists for end-user
computing within the enterprise. The architecture is made up of a number of
interconnected parts:
• Operational database/External database layer: Operational systems process
data to support critical operational needs. To do that, operational databases have
been historically created to provide an efficient processing structure for a relatively
small number of well-defined business transactions.
• Information Access Layer: This is the layer that the end-user deals with directly.
In particular, it represents the tools that the end-user normally uses day to day.
e.g.: Excel, Lotus 1-2-3, etc.
• Data Access Layer: The Data Access Layer of the Data Warehouse Architecture
allows the Information Access Layer to talk to the Operational Layer.
• Data Directory (Meta-data) Layer: Meta-data is data about data within the
enterprise. A record description in a COBOL program is an example of meta-data.
• Data Staging Layer: Data staging is also called copy management or replication
management; in fact, it includes all of the processes necessary to select, edit,
summarize, combine and load data warehouse and information access data from
operational and/or external databases.
Data Mining:
Scope: Given databases of sufficient size and quality, data mining technology can
generate new business opportunities by providing these capabilities:
Automated prediction of trends and behaviors: A typical example of a predictive
problem is targeted marketing.
Automated discovery of previously unknown patterns: Data mining tools sweep
through databases and identify previously hidden patterns in one step. An example is
the analysis of retail sales.
Data Mining-Algorithms: Some of the most common data mining algorithms in use
today fall into two groups, based on when the technique was developed and when it
became ready to be used.
1. Classical Techniques: Statistics, neighborhoods and clustering that have been used
for decades.
Statistics: These are data driven and are used to discover patterns and build predictive
models.
(a) Histograms: One of the best ways to summarize data is to provide a histogram of
the data.
Ex 1: Counting the numbers of occurrences of different colors of eyes in our database.
Ex 2: Representing the majority of customers that are over the age of 50.
Figure – depicts a simple predictor (eye color).
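The eye-color example above amounts to a frequency count; a minimal sketch (the observed values are invented for illustration):

```python
# A histogram is a frequency count over an attribute's values.
from collections import Counter

eye_colours = ["brown", "blue", "brown", "green", "brown", "blue"]
histogram = Counter(eye_colours)
# the most frequent value acts as a simple predictor for an unseen record
print(histogram.most_common(1))
```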
Nearest Neighbor: Objects that are “near” to each other will have similar prediction
values as well. Thus if you know the prediction value of one of the objects you can
predict it for its nearest neighbors. One improvement that is usually made to
the basic nearest neighbor algorithm is to take a vote from the “k” nearest neighbors.
Ex: The nearest neighbors are shown graphically for three unclassified
records: A, B, and C.
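The k-nearest-neighbor vote described above can be sketched on one-dimensional data (the points and labels are invented for illustration):

```python
# k-nearest-neighbor sketch: predict by a vote among the k closest records.
from collections import Counter

def knn_predict(train, x, k=3):
    # train is a list of (value, label) pairs; "near" = small absolute distance
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    vote = Counter(label for _, label in nearest)
    return vote.most_common(1)[0][0]  # majority label among the k neighbors

train = [(1, "yes"), (2, "yes"), (3, "no"), (10, "no"), (11, "no")]
print(knn_predict(train, 2.5, k=3))
```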
Clustering: It is the method by which like records are grouped together. Usually this
is done to give the end user a high level view of what is going on in the database.
There are two main types.
Hierarchical and Non-Hierarchical Clustering: In hierarchical clustering, the
hierarchy of clusters is usually viewed as a tree where the smallest clusters merge
together to create the next highest level of clusters, and so on.
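The bottom-up merging that builds the cluster tree can be sketched on one-dimensional values (the data and the merge criterion, distance between cluster means, are assumptions for illustration):

```python
# Hierarchical (agglomerative) clustering sketch: start with singleton
# clusters and repeatedly merge the two closest, building the tree bottom-up.
def agglomerate(points, target_clusters):
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # find the pair of clusters whose means are closest
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(sum(clusters[i]) / len(clusters[i])
                        - sum(clusters[j]) / len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the closest pair
    return clusters

print(agglomerate([1, 2, 9, 10, 11], 2))
```

Stopping at different cluster counts corresponds to cutting the tree at different levels.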
Figure – depicts a decision tree.
• Rule Induction: The extraction of useful if-then rules from data based on
statistical significance.
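Extracting an if-then rule together with its confidence can be sketched as follows; the records and attribute names are invented for illustration, and real rule-induction systems would also test statistical significance:

```python
# Rule-induction sketch: measure the confidence of an if-then rule.
def induce_rule(records, antecedent, consequent):
    matches = [r for r in records if r.get(antecedent[0]) == antecedent[1]]
    hits = [r for r in matches if r.get(consequent[0]) == consequent[1]]
    confidence = len(hits) / len(matches) if matches else 0.0
    return f"IF {antecedent[0]}={antecedent[1]} THEN {consequent[0]}={consequent[1]}", confidence

records = [
    {"age": "young", "buys": "yes"},
    {"age": "young", "buys": "yes"},
    {"age": "young", "buys": "no"},
    {"age": "old", "buys": "no"},
]
rule, conf = induce_rule(records, ("age", "young"), ("buys", "yes"))
print(rule, round(conf, 2))
```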
These capabilities are now evolving to integrate directly with industry-
standard data warehouse and OLAP platforms.
Once the mining is complete, the results can be tested against the data held in
the vault to confirm the model’s validity. If the model works, its observations should
hold for the vaulted data.
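Validating a mined model against vaulted (held-out) data can be sketched as follows; the model here, which simply predicts the majority label seen in training, is an assumption chosen to keep the example small:

```python
# Validation sketch: a model mined from training data should also
# hold on the vaulted data that was withheld from mining.
from collections import Counter

def train_majority(labels):
    """'Mine' a trivial model: the most common label in the training data."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(model_label, vault_labels):
    """Test the model's observation against the vaulted data."""
    return sum(1 for y in vault_labels if y == model_label) / len(vault_labels)

model = train_majority(["yes", "yes", "no", "yes"])
print(accuracy(model, ["yes", "no", "yes", "yes"]))
```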