Data Warehousing and Data Mining - Thara - M.Tech Cse

This document provides information on data warehousing and data mining. It includes a table listing books and authors on these topics. It also defines and compares key terms such as data vs. information, open data vs. informational data, database systems vs. information systems, and proprietary systems vs. open systems. Finally, it provides a brief introduction to data warehousing, data marts, data mining, metadata, and data sets.


[DATA WAREHOUSING AND DATA MINING – THARA - M.TECH CSE]

SMVEC ASSIGNMENT NO 1 – DATA MINING AND DATA WAREHOUSING

D.THARA PARAMESWARI
Task 1: books with authors; Task 2: comparative analysis; Task 3: brief introduction to data warehousing, data mining, metadata, and data sets; Task 4: various tools; Task 5: big data workshop notes.
TASK 1 : BOOKS WITH AUTHORS ( rack numbers from the student session )

1. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Elsevier, 2nd Edition, 2008 – Rack No 230, 225
2. Alex Berson and Stephen J. Smith, "Data Warehousing, Data Mining & OLAP" – Rack No 204, 225, 217
3. K.P. Soman, Shyam Diwakar and V. Ajay, "Insight into Data Mining: Theory and Practice", Prentice Hall of India, Eastern Economy Edition, 2006 – Rack No 217, 230
4. G. K. Gupta, "Introduction to Data Mining with Case Studies", Prentice Hall of India, Eastern Economy Edition, 2006 – Rack No 217, 230
5. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, "Introduction to Data Mining", Pearson Education, 2007 – Rack No 230
6. Kargupta, Joshi, Sivakumar and Yesha, "Data Mining: Next Generation Challenges and Future Directions" – Rack No 209
7. Sam Anahory and Dennis Murray, "Data Warehousing in the Real World" – Rack No 230

TASK 2 : COMPARATIVE ANALYSIS OF THE FOLLOWING TERMS

1. DATA Vs INFORMATION
2. OPEN DATA Vs INFORMATIONAL DATA
3. DATABASE SYSTEM Vs INFORMATION SYSTEM
4. PROPRIETARY SYSTEM Vs OPEN SYSTEM

Data Vs Information
Data and information are both important to the design of databases. The term data describes raw facts (not yet processed) about something or someone. Data are raw facts with no context: just numbers and text from which the required information is derived.

Information is data with context: processed data, with value added by summarizing, organizing, and analyzing it.
For example :
Data : 211215
Information : 21/12/15 is the review date of the first phase of the project.
211215 could also be a figure in a salary range.
211215 could also be a zip code.
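A tiny sketch makes the point: the same raw string becomes three different pieces of information depending on the context applied (the three interpretations follow the example above):

```python
from datetime import datetime

raw = "211215"  # raw facts: six digits with no context

# Context 1: a review date in dd/mm/yy form
as_date = datetime.strptime(raw, "%d%m%y").date()

# Context 2: a figure in a salary range
as_salary = int(raw)

# Context 3: a zip code, kept as text so leading zeros would survive
as_zip = raw

print(as_date, as_salary, as_zip)  # 2015-12-21 211215 211215
```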

Data : Another example is data representing the growth of a company; the raw values are
6.34, 6.45, 6.39, 6.62, 6.57, 6.64, 6.71, 6.82, 7.12, 7.06.

Information : A graph of the processed data.
[Figure: line chart of SIRIUS Satellite Radio Inc. stock price ($5.80 to $7.20) over the last 10 days]
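Summarizing is one way the raw prices above become information; a minimal sketch:

```python
# Raw data: the ten daily closing prices from the example above.
prices = [6.34, 6.45, 6.39, 6.62, 6.57, 6.64, 6.71, 6.82, 7.12, 7.06]

# Processing adds value: summarize, average, and describe the trend.
low, high = min(prices), max(prices)
average = sum(prices) / len(prices)
change = prices[-1] - prices[0]
trend = "rising" if change > 0 else "falling"

# Information: a statement a reader can act on.
print(f"range {low:.2f}-{high:.2f}, average {average:.2f}, trend {trend}")
```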
Open data Vs Informational data
This comparison is similar to data vs. information: sequential steps turn data into information by summarizing it, averaging it, selecting part of it, graphing it, and adding context and value, which ultimately results in knowledge.

A simple example :
name = Sham
class = 12
age = 16
marks = 80
subject = Mathematics

The processed information for the data above is: Sham is 16 years old, is in the twelfth standard, and scored 80% in Mathematics.

Database System Vs Informational System

The generic goal is to transform data into information; at the core of an information system sits the database (the raw data). Data does not simply appear: it has to be captured. Captured data serve other purposes too, for example (i) Transaction Processing Systems (TPS) and (ii) Process Control Systems. A basic processing system is composed of five components: input, output, processing, feedback, and control.

[Diagram: raw data -> input -> processing -> output -> information]

Processing covers summarizing, computing averages, graphing, creating charts, and visualizing data. Such processing systems are also called navigation systems; a specialized geographic information system is one example.

Input : maps, addresses, points of interest ("Yellow Pages")
Processing : computing the shortest path, finding the nearest Chinese restaurant
Output : directions (each turn shown on a map, followed by arrows);
a list of Chinese restaurants sorted by distance
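The navigation example maps directly onto the input/processing/output pattern. A sketch with invented restaurant names and planar coordinates (a real GIS would use road networks and geodetic distances):

```python
import math

# Input: points of interest ("Yellow Pages"), each with an (x, y) coordinate.
restaurants = {
    "Golden Dragon": (2.0, 3.0),
    "Lucky Noodle": (0.5, 1.0),
    "Jade Garden": (4.0, 4.0),
}
user_location = (0.0, 0.0)

def distance(a, b):
    # Straight-line distance; a stand-in for real route computation.
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Processing: sort by distance. Output: the nearest restaurant comes first.
by_distance = sorted(restaurants,
                     key=lambda name: distance(user_location, restaurants[name]))
print(by_distance)  # ['Lucky Noodle', 'Golden Dragon', 'Jade Garden']
```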

Proprietary System Vs Open System

An open system is free to distribute. It:

Provides the original source code.
Allows unrestricted modification of the original source code.
Allows modified code to be redistributed, either as a modified version of the original code or as the original code along with any patch files that may be used to modify it.
May not discriminate against any person or group.
May not be restricted from any field of endeavor.
Applies the same license to all to whom the program is distributed.
May not be program or product specific.
May not restrict other software.
Must be platform or technology neutral.

By contrast, proprietary systems are closed-source software that follows few, if any, of these requirements. Most proprietary software has limited licenses, often costs money, cannot be redistributed, and cannot be altered. The primary advantage of an open-source system is that it is often at least moderately powerful compared to a proprietary system, yet costs nothing. Additionally, because an open-source system can be modified by individual users, any desired feature can be added by a user with sufficient programming skill, and those features can then be used by anybody who wants them. As such, an open-source system is flexible and can often meet the needs of anybody who chooses to use it. In comparison, the primary advantage of a proprietary system is that, while potentially expensive, it is often more powerful and feature-rich than open-source alternatives. A proprietary system also usually has the benefit of a dedicated technical support department that can assist with training as well as with troubleshooting and solving any issues that arise from using the product.

For Example :

When it comes to databases, one of the most popular open-source solutions is MySQL. MySQL has many of the
features that can be found in most commercial, proprietary database management systems (Lorini, 2010). MySQL is
robust, it has high availability, and has an available GUI management system comparable to Microsoft’s SQL Server
Management Studio. MSSQL Server, however, does have a few advantages unavailable in MySQL, such as partitioning
and external rights management, features which may not be used by a small business. As in most cases, the needs of the user must be identified before the database solution is chosen. It can generally be said that the smaller the business, the more likely an open-source solution will work best. As a company grows in size, however, more suitable solutions must be found: solutions that provide more support and a greater number of features. In the end, the user must decide
which solution works best for their purpose.

TASK 3 : BRIEF INTRODUCTION ABOUT THE FOLLOWING TERMS

1. DATA WAREHOUSING
2. DATA MARTING
3. DATA MINING
4. META DATA
5. DATA SETS

Data Warehousing : ( reference : NET exam book, Arihant Publications )

A data warehouse generalizes and consolidates data in multidimensional space. The construction of a data warehouse involves data cleaning, data integration, and data transformation, and can be viewed as an important preprocessing step for data mining. A data warehouse provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. It provides online analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining. A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process.
A Three-Tier Data Warehouse Architecture:
Data warehouses often adopt a three-tier architecture.
1. The bottom tier is a warehouse database that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources.
2. The middle tier is an OLAP server that is typically implemented using either
(i) a relational OLAP (ROLAP) model, an extended relational DBMS that maps operations on multidimensional data to standard relational operations, or
(ii) a multidimensional OLAP (MOLAP) model, a special-purpose server that directly implements multidimensional data and operations.
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools.
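The ROLAP idea, mapping a multidimensional operation onto standard relational operations, can be sketched with an in-memory SQLite store (the sales table and its values are invented for illustration):

```python
import sqlite3

# Bottom tier: a relational warehouse store (in-memory for the sketch).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("north", "Q1", 100.0), ("north", "Q2", 150.0),
    ("south", "Q1", 80.0),  ("south", "Q2", 120.0),
])

# Middle tier, ROLAP style: a roll-up over the (region, quarter) cube along
# the quarter dimension becomes an ordinary relational GROUP BY.
rollup = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rollup)  # [('north', 250.0), ('south', 200.0)]
```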



The following diagram depicts the three-tier architecture of a data warehouse:

[Diagram: three-tier data warehouse architecture (bottom tier: warehouse database; middle tier: OLAP server; top tier: front-end client layer)]

Data Mining: ( reference : NET exam book, Arihant Publications )

Data mining is also known as Knowledge Discovery from Data (KDD). Data mining means the extraction of knowledge, or the discovery of interesting patterns, from large data sets. Terms treated as synonyms for data mining include knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. The following essential steps make up the process of knowledge discovery:

1. Data cleaning : noise and inconsistencies are removed, making the data noise-free.
2. Data integration : multiple data sources may be combined, and the required information is stored in a coherent data store, as in data warehousing.
3. Data selection : data relevant to the analysis task are retrieved from the database (the coherent store).
4. Data transformation : the data are transformed or consolidated into forms appropriate for mining, for example by performing summary or aggregation operations.
5. Data mining : an essential process in which intelligent methods are applied to extract data patterns.
6. Pattern evaluation : the truly interesting patterns representing knowledge are identified, based on interestingness measures.
7. Knowledge presentation : visualization and knowledge representation techniques are used to present the mined knowledge to the user.
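The first four preprocessing steps can be sketched as a toy pipeline (the records, field names, and threshold are invented for illustration):

```python
# Two raw sources with noise (a missing amount) and an overlapping record.
source_a = [{"id": 1, "amount": 120.0}, {"id": 2, "amount": None}]
source_b = [{"id": 2, "amount": 80.0}, {"id": 3, "amount": 200.0}]

def clean(records):
    # 1. Data cleaning: drop records with missing values.
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    # 2. Data integration: combine sources into one coherent store,
    #    resolving duplicate ids (the later source wins).
    merged = {}
    for src in sources:
        for r in src:
            merged[r["id"]] = r
    return list(merged.values())

def select(records, threshold):
    # 3. Data selection: keep only rows relevant to the analysis task.
    return [r for r in records if r["amount"] >= threshold]

def transform(records):
    # 4. Data transformation: consolidate by aggregation.
    return sum(r["amount"] for r in records)

store = integrate(clean(source_a), clean(source_b))
total = transform(select(store, threshold=100.0))
print(total)  # 320.0
```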



[Flowchart: Data cleaning (remove noise) -> Data integration (combine multiple sources) -> Data selection (relevant data is selected) -> Data transformation (data transformed into a form appropriate for mining) -> Data mining (methods applied to extract patterns) -> Pattern evaluation (identify interesting patterns) -> Knowledge presentation (mined knowledge represented to the user)]
Architecture of a Typical Data Mining System

1. Database, data warehouse, World Wide Web, or other information repository
This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
2. Database or data warehouse server
The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.
3. Knowledge base
The knowledge base is the domain knowledge that is used to guide the search or to evaluate the interestingness of resulting patterns.
4. Data mining engine
This is a set of modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
5. Pattern evaluation module
This component employs interestingness measures and interacts with the data mining modules.
6. User interface
This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task.

[Diagram: layers of a data mining system, top to bottom: User interface; Pattern evaluation (consulting the Knowledge base); Data mining engine; Database or data warehouse server; Data cleaning, integration and selection; and, at the bottom, the database, data warehouse, World Wide Web, and other information repositories]


DATA MINING ARCHITECTURE

Data Marting ( reference [7] )

Data marts are created for the following reasons:

 To speed up queries by reducing the volume of data to be scanned.
 To structure data in a form suitable for a user access tool.
 To partition data in order to impose access control strategies.
 To segment data onto different hardware platforms.

The operational costs of data marting are high, and once a strategy is in place it can be difficult to change without incurring substantial redevelopment costs.

Before designing data marts, one must make sure that the appropriate strategies are really necessary for the particular situation.

To reduce the cost and make a data mart fit the bill, the following steps are to be followed:

 Identify whether there is a natural functional split within the organization.
 Identify whether there is a natural split of the data.
 Identify whether the proposed user access tool uses its own database structure.
 Identify whether any infrastructure issues dictate the use of data marts.
 Identify whether there are any access control issues that require data marts to provide Chinese walls.

Note : It is recommended that data first be loaded into an enterprise data warehouse and then be data marted.

Identify functional splits

Here one has to determine whether the business is structured in such a way as to benefit from functionally splitting the data.

For example, consider a retail organization in which each merchant is responsible for maximizing the sales of a group of products. The information in the data warehouse that has value here is:

 sales transactions at a daily level, to monitor actual sales
 sales forecasts on a weekly basis
 stock positions on a daily basis, to monitor stock levels
 stock movements on a daily basis, to monitor supplier or shrinkage issues

All this information forms substantial data volumes, yet by the nature of the role, the merchant is not interested in products that he or she is not responsible for.
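Partitioning along such a functional split can be sketched in a few lines (the departments and products are invented for illustration):

```python
# Illustrative sales records; "department" is the natural functional split.
sales = [
    {"department": "grocery", "product": "rice", "qty": 40},
    {"department": "electronics", "product": "radio", "qty": 5},
    {"department": "grocery", "product": "tea", "qty": 12},
]

# One data mart per department: each merchant's queries then scan only the
# subset of data that his or her role is responsible for.
marts = {}
for row in sales:
    marts.setdefault(row["department"], []).append(row)

print(sorted(marts))          # ['electronics', 'grocery']
print(len(marts["grocery"]))  # 2
```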



[Diagram: an enterprise data warehouse holding detailed information, summary info, and metadata, feeding separate data marts for Department 1, Department 2, and Department 3]

One may consider data marting the subset of data dealing with the product group of interest, because the merchant is unlikely to query the other products. A difficulty arises, however, when a product moves from one department to another. Further investigation is needed to confirm that the departmental split remains valid when a department requires additional information beyond its own subset.

Meta data ( reference [7], see the book list in Task 1 )

Metadata, in general, is data that describes other data. Metadata is used for:
 data transformation and load
 data management
 query generation
Data transformation and load:
Metadata may be used during data transformation and load to describe the source data and any changes that need to be made. The advantage of storing metadata about the data being transformed is that changes to the source data can be captured in the metadata and the transformation programs automatically regenerated. For each source data field the following information is required:
 source field
- unique identifier
- name
- type
- location (system and object)

The destination field needs to be described in a similar way to the source:

 destination field
- unique identifier
- name
- type
- table name

The other information that needs to be stored is the transformation or transformations that must be applied to turn the source data into the destination data:

 transformation(s)
- name
- language
- module name
- syntax
What makes this worse is that potentially every source system is encoded differently, with different data types and so on. Another common difficulty is how to deal with many-to-one, one-to-many, and many-to-many mappings from source to destination. If there are many of these mappings, you should probably be using a CASE tool or a third-party transformation mapping tool to handle the data transformation.
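A source-to-destination metadata record of the kind listed above, together with the transformation it names, might look like this (the identifiers, field names, and the ddmmyy format are invented for illustration):

```python
from datetime import datetime

# Metadata describing one source field, its destination, and the
# transformation between them, following the field lists above.
field_metadata = {
    "source": {"unique_identifier": "SRC-001", "name": "cust_dob",
               "type": "char(6)", "location": "legacy.orders"},
    "destination": {"unique_identifier": "DST-001", "name": "date_of_birth",
                    "type": "date", "tablename": "dim_customer"},
    "transformation": {"name": "char6_to_date", "language": "python",
                       "module_name": "transforms.dates"},
}

def char6_to_date(value):
    # The transformation the metadata names: ddmmyy text into a date.
    return datetime.strptime(value, "%d%m%y").date()

print(char6_to_date("010290"))  # 1990-02-01
```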

The advantage of transformation and mapping tools is that they will do all of the work described above, and more. Their main disadvantage is cost. The other prime disadvantage of many transformation tools is that the code they generate is not efficient.

Reference : [7] ( see the book list in Task 1 )

Data sets

There are several standard datasets that we will come back to repeatedly. Different datasets tend to expose new issues
and challenges, and it is interesting and instructive to have in mind a variety of problems when considering learning
methods. In fact, the need to work with different datasets is so important that a corpus containing around 100 example
problems has been gathered together so that different algorithms can be tested and compared on the same set of
problems.

Another problem with actual real-life datasets is that they are often proprietary. No corporation is going to share its
customer and product choice database with you so that you can understand the details of its data mining application
and how it works. Corporate data is a valuable asset, one whose value has increased enormously with the development
of data mining techniques.

The weather problem is a tiny dataset that we will use repeatedly to illustrate machine learning methods. Entirely
fictitious, it supposedly concerns the conditions that are suitable for playing some unspecified game. In general,
instances in a dataset are characterized by the values of features, or attributes, that measure different aspects of the
instance. In this case there are four attributes: outlook, temperature, humidity, and windy. 

Table 1.2 The Weather Data

Outlook    Temperature  Humidity  Windy  Play
Sunny      Hot          High      False  No
Sunny      Hot          High      True   No
Overcast   Hot          High      False  Yes
Rainy      Mild         High      False  Yes
Rainy      Cool         Normal    False  Yes
Rainy      Cool         Normal    True   No
Overcast   Cool         Normal    True   Yes
Sunny      Mild         High      False  No
Sunny      Cool         Normal    False  Yes
Rainy      Mild         Normal    False  Yes
Sunny      Mild         Normal    True   Yes
Overcast   Mild         High      True   Yes
Overcast   Hot          Normal    False  Yes
Rainy      Mild         High      True   No

The rules we have seen so far are classification rules: they predict the classification of the example in terms of whether
or not to play. It is equally possible to disregard the classification and just look for any rules that strongly associate
different attribute values. These are called association rules. Many association rules can be derived from the weather
data in Table 1.2. Some good ones are as follows:

If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny and humidity = high
Reference : http://searchbusinessanalytics.techtarget.com/feature/Simple-data-mining-examples-and-datasets
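Each of the four rules quoted above can be checked directly against Table 1.2; on this data every one of them holds with confidence 1.0. A sketch:

```python
# Table 1.2 as (outlook, temperature, humidity, windy, play) tuples.
rows = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

def confidence(antecedent, consequent):
    # Fraction of rows matching the antecedent that also match the consequent.
    matches = [r for r in rows if antecedent(r)]
    return sum(1 for r in matches if consequent(r)) / len(matches)

# "If temperature = cool then humidity = normal"
print(confidence(lambda r: r[1] == "cool", lambda r: r[2] == "normal"))  # 1.0

# "If humidity = normal and windy = false then play = yes"
print(confidence(lambda r: r[2] == "normal" and not r[3],
                 lambda r: r[4] == "yes"))  # 1.0
```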

TASK 5:

Various tools

General-purpose data mining tools such as Clementine and Enterprise Miner are designed to analyze large commercial databases. Although these tools were primarily designed to identify customer buying patterns in market basket data, they have also been used in analyzing scientific and engineering data, astronomical data, multimedia data, genomic data, and web data.



As an alternative, several research groups started to develop suites of programs that shared data formats and provided tools for evaluation and reporting. An early example of such an implementation is MLC++, a machine learning library in C++ with a command-line interface that featured several then-standard data analysis techniques from machine learning.

MLC++ was also designed as an object-oriented library, extendible through algorithms written by a user who could reuse parts of the library as desired. Command-line interfaces, limited interaction with the data analysis environment, and textual output of inferred models and their performance scores were not things a physician or medical researcher would get too excited about. To be optimally useful for researchers, data mining programs needed to provide built-in data visualization and the ability to easily interact with the program. With the evolution of graphical user interfaces and the operating systems that supported them, data mining programs started to incorporate these features. MLC++, for instance, was acquired by Silicon Graphics in the mid-1990s and turned into MineSet, at that time the most sophisticated data mining environment, with many interesting data and model visualizations. MineSet implemented an interface in which the data analysis schema was in a way predefined: the user could change the parameters of the analysis methods, but not the composition of the complete analysis pathway.

Clementine (http://www.spss.com/clementine), another popular commercial data mining suite, pioneered user control
over the analysis pathway by embedding various data mining tasks within separate components that were placed in the
analysis schema and then linked with each other to construct a particular analysis pathway. Several modern open-source
data mining tools use a similar visual programming approach that, because it is flexible and simple to use, may be
particularly appealing to data analysts and users with backgrounds other than computer science.

Flexibility and extensibility in analysis software arise from being able to use existing code to develop or extend one’s
own algorithms.

For example,
Weka (http://www.cs.waikato.ac.nz/ml/weka/), a popular data mining suite, offers a library of well-documented Java-based functions and classes that can be easily extended, given sufficient knowledge of Weka's architecture and of Java programming. A somewhat different approach has been taken by other packages, including R (http://www.r-project.org), one of the most widely known open-source statistical and data mining suites. Rather than being extended directly with functions in C (the language of its core), R implements its own scripting language with an interface to its C functions. Most extensions of R are then implemented as scripts, requiring no source-code compilation or use of a special development environment.

Recently, with advances in the design and performance of general-purpose scripting languages and their growing popularity, several data mining tools have incorporated these languages. The particular benefits of integration with a scripting language are speed (all computationally intensive routines are still implemented in some fast low-level programming language and are callable from the scripting language), flexibility (scripts may integrate functions from the core suite with functions from the scripting language's native library), and extensibility that goes beyond the data mining suite itself, through the use of other packages that interface with the scripting language. Although harder to learn and use for novices and those with little expertise in computer science or math than systems driven completely by graphical user interfaces, scripting in data mining environments is essential for fast prototyping and development of new techniques, and is a key to the success of packages like R.

Reference :
Blaz Zupan, PhD and Janez Demsar, PhD, "Open-Source Tools for Data Mining", Clin Lab Med, vol. 28, 2008, pp. 37-54.

