Data Warehousing and Data Mining - Thara - M.Tech Cse
SMVEC ASSIGNMENT NO 1 – DATA MINING AND DATA WAREHOUSING
D.THARA PARAMESWARI
Task 1 : book names with authors; Task 2 : comparative analysis; Task 3 : brief introduction to data warehousing, data
mining, metadata, and data sets; Task 4 : various tools; Task 5 : big data workshop notes.
S.NO  AUTHOR NAME – TITLE NAME – RACK NO ( STUD SESSION )
1. Jiawei Han and Micheline Kamber – "Data Mining: Concepts and Techniques", Elsevier, 2nd Edition, 2008 – 230, 225
2. Alex Berson and Stephen J. Smith – "Data Warehousing, Data Mining & OLAP" – 204, 225, 217
3. K.P. Soman, Shyam Diwakar and V. Ajay – "Insight into Data Mining: Theory and Practice", Prentice Hall of India, Eastern Economy Edition, 2006 – 217, 230
4. G. K. Gupta – "Introduction to Data Mining with Case Studies", Prentice Hall of India, Eastern Economy Edition, 2006 – 217, 230
5. Pang-Ning Tan, Michael Steinbach and Vipin Kumar – "Introduction to Data Mining", Pearson Education, 2007 – 230
6. Kargupta, Joshi, Sivakumar and Yesha – "Data Mining: Next Generation Challenges and Future Directions" – 209
7. Sam Anahory and Dennis Murray – "Data Warehousing in the Real World" – 230
TASK 1 :
1. DATA Vs INFORMATION
2. OPEN DATA Vs INFORMATIONAL DATA
3. DATABASE SYSTEM Vs INFORMATION SYSTEM
4. PROPRIETARY SYSTEM Vs OPEN SYSTEM
Data Vs Information
Data and information are both important to the design of databases. The term data describes raw facts (not yet
processed) about something or someone: numbers and text with no context, from which the required information is
derived. Information is data with context: processed data, with value added, that has been summarized, organized,
and analyzed.
For Example :
Data : 211215
Information : 21/12/15 is the review date of the first phase of the project;
211215 can also be a salary figure;
211215 can also be a zip code.
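To make the role of context concrete, here is a minimal Python sketch (an illustration added alongside this assignment, not taken from the cited texts; the date format and the three interpretations are assumptions):

```python
# The raw fact "211215" means nothing by itself; each applied context
# turns the same data into different information.
from datetime import datetime

raw = "211215"

# Context 1: a review date in day-month-year form.
as_date = datetime.strptime(raw, "%d%m%y")
print("Review date:", as_date.strftime("%d/%m/%y"))  # 21/12/15

# Context 2: a salary figure.
print("Salary:", int(raw))

# Context 3: a zip code (kept as text, since it is an identifier).
print("Zip code:", raw)
```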
Data : Another example is data representing the growth of a company's stock price. The data are
6.34, 6.45, 6.39, 6.62, 6.57, 6.64, 6.71, 6.82, 7.12, 7.06.
Information : [Figure: a graph of the processed data – the stock price of SIRIUS SATELLITE RADIO INC., rising from about $6.00 to $7.20.]
Another example of information derived from raw data : Sham is 16 years old, is in twelfth standard, and scored 80%
in mathematics.
raw data (input) -> processing -> information (output)
Processing includes summarizing, computing averages, graphing, creating charts, and visualizing data. Some processing
systems are also called navigation systems, for example a specialized geographic information system.
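As a minimal sketch of such processing (added here for illustration; the figures are the stock prices quoted above):

```python
# Processing raw data into information: summarizing and computing
# an average over the stock prices listed above.
from statistics import mean

raw_data = [6.34, 6.45, 6.39, 6.62, 6.57, 6.64, 6.71, 6.82, 7.12, 7.06]

average = mean(raw_data)             # computing the average
change = raw_data[-1] - raw_data[0]  # overall movement across the period
trend = "rising" if change > 0 else "falling"

# The same numbers, now carrying context: a summary a reader can act on.
print(f"Average price: ${average:.2f}")
print(f"Overall change: ${change:+.2f} ({trend})")
```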
Proprietary System Vs Open System
By contrast, proprietary systems are closed-source software that follows few, if any, of the openness requirements of
open-source software. Most proprietary software has a limited license, often costs money, cannot be redistributed, and
cannot be altered. The primary advantage of an open-source system is that it is often at least moderately powerful
compared to a proprietary system, yet free of charge.
For Example :
When it comes to databases, one of the most popular open-source solutions is MySQL. MySQL has many of the
features that can be found in most commercial, proprietary database management systems (Lorini, 2010). MySQL is
robust, has high availability, and has a GUI management system comparable to Microsoft's SQL Server
Management Studio. MSSQL Server, however, does have a few advantages unavailable in MySQL, such as partitioning
and external rights management, features which may not be needed by a small business. As in most cases, the needs of
the user must be identified before a database solution is chosen. Generally speaking, the smaller the business, the more
likely an open-source solution will work best. As a company grows in size, however, more suitable solutions must be
found: solutions that provide more support and a greater number of features. In the end, the user must decide which
solution works best for their purpose.
1. DATA WAREHOUSING
2. DATA MARTING
3. DATA MINING
4. META DATA
5. DATA SETS
Data mining is also known as Knowledge Discovery from Data ( KDD ). Data mining means the extraction of data, or the
finding of interesting patterns, in large data sets. Terms used as synonyms for data mining include knowledge mining
from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. The essential steps in the
process of knowledge discovery are listed below (a small code sketch of the pipeline follows the list):
1. Data cleaning : noise and inconsistencies are removed so that the data are noise free.
2. Data Integration : multiple data sources may be combined, and the required information stored in a coherent
data store, as in data warehousing.
3. Data selection : data relevant to the analysis task are retrieved from the database ( the coherent store ).
4. Data transformation : the data are transformed or consolidated into forms appropriate for mining by
performing summary or aggregation operations.
5. Data Mining : an essential process where intelligent methods are applied in order to extract data patterns.
6. Pattern Evaluation : the truly interesting patterns representing knowledge are identified, based on
interestingness measures.
7. Knowledge Presentation : visualization and knowledge representation techniques are used to present the
mined knowledge to the user.
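The following is a minimal pandas sketch of steps 1 to 4 (the tiny in-memory tables and the column names are illustrative assumptions, not data from the assignment):

```python
# A hypothetical walk through the first knowledge-discovery steps.
import pandas as pd

# Two "sources" waiting to be integrated (step 2).
sales = pd.DataFrame({"cust_id": [1, 2, 2, 3, 3],
                      "amount": [120.0, None, 80.0, 95.0, 95.0]})
customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "region": ["north", "south", "south"]})

# 1. Data cleaning: drop rows with missing values, remove duplicates.
sales = sales.dropna().drop_duplicates()

# 2. Data integration: combine the sources into one coherent store.
store = sales.merge(customers, on="cust_id")

# 3. Data selection: keep only the columns relevant to the analysis task.
selected = store[["region", "amount"]]

# 4. Data transformation: summarize/aggregate into a form suited for mining.
transformed = selected.groupby("region")["amount"].agg(["count", "mean"])

# Steps 5-7 (mining, pattern evaluation, presentation) would follow;
# as the simplest stand-in we present the aggregated pattern directly.
print(transformed)
```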
[Figure: the knowledge-discovery process – data from databases, data warehouses, the WWW, and other information
repositories is cleaned, integrated, and transformed into an appropriate form for mining; patterns are evaluated against
a knowledge base and presented through the user interface.]
Before designing a data mart, one must make sure that strategies appropriate for the particular situation are chosen.
To reduce cost and make the data mart fit your bill, the following steps are to be followed.
Note : it is recommended that data first be loaded into an enterprise data warehouse and then be data marted.
First, determine whether the business is structured in such a way as to benefit from functionally splitting the data.
For example, consider a retail organization in which each merchant is responsible for maximizing the sales of a group
of products.
This means that the information in a data warehouse will have its value when it contains:
sales transactions at the daily level, to monitor actual sales;
sales forecasts on a weekly basis;
stock positions on a daily basis, to monitor stock levels;
stock movements on a daily basis, to monitor supplier or shrinkage issues.
All this information can form substantial data volumes when, by the nature of the role, the merchant is not
interested in products that he or she is not responsible for.
[Figure: an enterprise data warehouse holding summary information, detailed information, and metadata, feeding
separate data marts for Department 1, Department 2, and Department 3.]
One may therefore consider data marting the subset of data dealing with the product group of interest, because the
merchant is unlikely to query the other products. A big question arises, however, when a product moves from one
department to another.
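A minimal sketch of such a functional split (the table layout and the product groups are assumptions for illustration):

```python
# Splitting warehouse data into a departmental data mart by product group.
import pandas as pd

# Enterprise warehouse: daily sales for all product groups.
warehouse = pd.DataFrame({
    "day": ["2015-12-01", "2015-12-01", "2015-12-02", "2015-12-02"],
    "product_group": ["grocery", "clothing", "grocery", "electronics"],
    "sales": [1200.0, 450.0, 1340.0, 980.0],
})

# Department 1's merchant is responsible only for these groups,
# so the data mart holds just that subset of the warehouse.
dept1_groups = {"grocery"}
dept1_mart = warehouse[warehouse["product_group"].isin(dept1_groups)]

print(dept1_mart)
```

If a product later moves to another department, the subset predicate changes and the mart must be rebuilt, which is exactly the concern raised above.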
With further investigation, the functional departmental split is valid, but it requires additional information. The
advantage of transformation and mapping tools is that they will do all of the above and more. The main disadvantage
of these tools is their cost; another prime disadvantage of many transformation tools is that the code they generate is
not efficient.
Data sets
There are several standard datasets that we will come back to repeatedly. Different datasets tend to expose new issues
and challenges, and it is interesting and instructive to have in mind a variety of problems when considering learning
methods. In fact, the need to work with different datasets is so important that a corpus containing around 100 example
problems has been gathered together so that different algorithms can be tested and compared on the same set of
problems.
Another problem with actual real-life datasets is that they are often proprietary. No corporation is going to share its
customer and product choice database with you so that you can understand the details of its data mining application
and how it works. Corporate data is a valuable asset, one whose value has increased enormously with the development
of data mining techniques.
The weather problem is a tiny dataset that we will use repeatedly to illustrate machine learning methods. Entirely
fictitious, it supposedly concerns the conditions that are suitable for playing some unspecified game. In general,
instances in a dataset are characterized by the values of features, or attributes, that measure different aspects of the
instance. In this case there are four attributes: outlook, temperature, humidity, and windy.
Table 1.2 The weather data
Outlook     Temperature   Humidity   Windy   Play
sunny       hot           high       false   no
sunny       hot           high       true    no
overcast    hot           high       false   yes
rainy       mild          high       false   yes
rainy       cool          normal     false   yes
rainy       cool          normal     true    no
overcast    cool          normal     true    yes
sunny       mild          high       false   no
sunny       cool          normal     false   yes
rainy       mild          normal     false   yes
sunny       mild          normal     true    yes
overcast    mild          high       true    yes
overcast    hot           normal     false   yes
rainy       mild          high       true    no
The rules we have seen so far are classification rules: they predict the classification of the example in terms of whether
or not to play. It is equally possible to disregard the classification and just look for any rules that strongly associate
different attribute values. These are called association rules. Many association rules can be derived from the weather
data in Table 1.2. Some good ones are as follows:
If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny and humidity = high
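The following is a small, self-contained Python sketch (an illustration, not the book's code) that finds single-attribute association rules holding without exception in the weather data by simple counting; the minimum-support threshold is an assumption:

```python
# Search the weather data for rules "attr = v -> attr2 = w" that hold
# for every matching instance (100% confidence) with a minimum support.
from itertools import permutations

ATTRS = ["outlook", "temperature", "humidity", "windy", "play"]
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def association_rules(rows, min_support=3):
    """Yield (lhs_attr, lhs_val, rhs_attr, rhs_val, support) tuples."""
    for a, b in permutations(range(len(ATTRS)), 2):
        for value in {row[a] for row in rows}:
            matching = [row for row in rows if row[a] == value]
            if len(matching) < min_support:
                continue
            consequents = {row[b] for row in matching}
            if len(consequents) == 1:  # the rule holds with no exceptions
                yield ATTRS[a], value, ATTRS[b], consequents.pop(), len(matching)

for la, lv, ra, rv, support in association_rules(ROWS):
    print(f"if {la} = {lv} then {ra} = {rv}  (support {support})")
```

Run on the table above, this prints, among others, "if temperature = cool then humidity = normal (support 4)", the first rule listed.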
TASK 5:
Various tools
General-purpose data mining tools such as Clementine and Enterprise Miner are designed to analyze large
commercial databases. Although these tools were primarily designed to identify customer buying patterns in market
basket data, they have also been used in analyzing scientific and engineering data, astronomical data, multimedia data,
genomic data, and web data.
MLC++ was also designed as an object-oriented library, extensible through algorithms written by a user who could
reuse parts of the library as desired. Command-line interfaces, limited interaction with the data analysis environment,
and textual output of inferred models and their performance scores were not things a physician or medical researcher
would get too excited about. To be optimally useful for researchers, data mining programs needed to provide built-in
data visualization and the ability to easily interact with the program. With the evolution of graphical user interfaces and
the operating systems that supported them, data mining programs started to incorporate these features. MLC++, for
instance, was acquired by Silicon Graphics in the mid-1990s and turned into MineSet, at that time the most
sophisticated data mining environment, with many interesting data and model visualizations. MineSet implemented an
interface whereby the data analysis schema was in a way predefined: the user could change the parameters of the
analysis methods, but not the composition of the complete analysis pathway.
Clementine (http://www.spss.com/clementine), another popular commercial data mining suite, pioneered user control
over the analysis pathway by embedding various data mining tasks within separate components that were placed in the
analysis schema and then linked with each other to construct a particular analysis pathway. Several modern open-source
data mining tools use a similar visual programming approach that, because it is flexible and simple to use, may be
particularly appealing to data analysts and users with backgrounds other than computer science.
Flexibility and extensibility in analysis software arise from being able to use existing code to develop or extend one’s
own algorithms.
For example,
Weka (http://www.cs.waikato.ac.nz/ml/weka/), a popular data mining suite, offers a library of well-documented Java-
based functions and classes that can be easily extended, provided sufficient knowledge of Weka's architecture and Java
programming. A somewhat different approach has been taken by other packages, including R (http://www.r-
project.org), one of the most widely known open-source statistical and data mining suites. Besides being extensible
with functions written in C (the language of its core), R implements its own scripting language with an interface to its
C functions. Most extensions of R are then implemented as scripts, requiring no source-code compilation or use of a
special development environment.
Recently, with advances in the design and performance of general purpose scripting languages and their growing
popularity, several data mining tools have incorporated these languages. The particular benefits of integration with a
scripting language are speed (all computationally intensive routines are still implemented in some fast low-level
programming language and are callable from the scripting language), flexibility (scripts may integrate functions from
the core suite with functions from the scripting language's native library), and extensibility that goes beyond the sole
use of the data mining suite, through the use of other packages that interface with that scripting language. Although
harder to learn and use for novices and those with little expertise in computer science or math than systems driven
completely by graphical user interfaces, scripting in data mining environments is essential for fast prototyping and
development of new techniques and is a key to the success of packages like R.
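As a brief illustration of this style of work (a sketch using scikit-learn, which is an assumption of this rewrite; the cited paper itself discusses suites such as R), a complete analysis takes a few lines of script while the computationally intensive routines run in compiled code:

```python
# A whole mining experiment in a few lines of scripting: the tree
# induction and cross-validation loops execute in fast low-level code.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)            # a bundled example dataset
model = DecisionTreeClassifier(max_depth=3)  # the "core suite" learner
scores = cross_val_score(model, X, y, cv=10) # 10-fold cross-validation
print(f"10-fold cross-validation accuracy: {scores.mean():.2f}")
```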
Reference :
Blaz Zupan, PhD and Janez Demsar, PhD, "Open-Source Tools for Data Mining", Clinics in Laboratory Medicine, vol. 28,
2008, pp. 37-54.