
Data Warehouse and Data Mining - Unit 2

Uploaded by

zrimreaper
Copyright
© © All Rights Reserved

INTRODUCTION TO DATA MINING

CHAPTER OUTLINE

• Motivation for data mining
• Introduction to data mining system
• Data mining functionalities
• KDD
• Data object and attribute types
• Statistical description of data
• Issues and Applications

MOTIVATION FOR DATA MINING

The major reason for using data mining techniques is the requirement of useful information and
knowledge from huge amounts of data. The information and knowledge gained can be used in many
applications such as business management, production control etc. Data mining came into existence
as a result of the natural evolution of information technology. The evolutionary path of the database
industry has developed the following functionalities:

• Data collection and database creation.
• Data management (including data storage and retrieval, and database transaction
processing).
• Data analysis and understanding (involving data warehousing and data mining).
1960 and early: Data Collection and Database Creation
♦ Primitive file processing

1970 to 1980: Database Management System
♦ Hierarchical and network database systems
♦ Relational database systems
♦ Data modeling: entity-relationship models etc.
♦ Query language: SQL, etc.
♦ Online transaction processing (OLTP)

1980 to present: Advanced Database System
♦ Advanced data models: extended-relational, object-relational etc.
♦ Integration of heterogeneous sources
♦ Managing uncertain data and data cleaning
♦ Web-based databases (XML, semantic web)
♦ Data streams and cyber-physical data systems
♦ Extremely large data management
♦ Database system tuning and adaptive systems

1980 to present: Advanced Database System
♦ Data warehouse and OLAP
♦ Data mining and knowledge discovery
♦ Mining complex types of data
♦ Data mining applications
♦ Data mining and society

Present to future: Information System
♦ Issues of data privacy and security
♦ Cloud computing and parallel data processing

Figure 2.1: Evolution of database system technology


Since the 1960s, database and information technology has evolved from primitive file processing
systems to powerful database systems. During the 1970s, relational database systems were developed,
and users gained access to data through query languages. Efficient methods for on-line transaction
processing (OLTP) were also developed. From the mid-1980s, many advanced database systems and
application-oriented database systems were developed. In the 1990s, heterogeneous database systems
and Internet-based global information systems such as the World Wide Web (WWW) also emerged
and came to play a vital role in the information industry.
Data can now be stored in many different types of databases. One database architecture that has
emerged is the data warehouse. It is a repository of multiple heterogeneous data sources,
organized under a unified schema to facilitate management decision making. Data warehouse
technology includes data cleaning, data integration, and On-Line Analytical Processing (OLAP).
Although OLAP tools support multidimensional analysis and decision making, additional data
analysis tools are required for in-depth analysis. The tremendous amount of data collected and
stored in large and numerous databases has led to the development of data mining tools, which
perform data analysis to convert huge volumes of data into useful knowledge.

INTRODUCTION TO DATA MINING SYSTEM

Data mining is the process of extracting information from huge sets of data to identify patterns,
trends, and useful data that allow a business to take data-driven decisions. In other words, data
mining is the process of investigating hidden patterns of information from various perspectives and
categorizing them into useful data. This data is collected and assembled in repositories such as data
warehouses for efficient analysis by data mining algorithms, which supports decision making and
other data requirements and eventually helps in cutting costs and generating revenue.
Data mining is the act of automatically searching large stores of information to find trends and
patterns that go beyond simple analysis procedures. Data mining uses complex mathematical
algorithms to segment data and to evaluate the probability of future events. Data mining is also called
Knowledge Discovery from Data (KDD). It is a process used by organizations to extract specific data from
huge databases to solve business problems. It primarily turns raw data into useful information.
The data mining process involves several components, and these components constitute a data
mining system architecture. The significant components of a data mining system are a data source,
data mining engine, data warehouse server, pattern evaluation module, graphical user interface,
and knowledge base. The architecture of a typical data mining system may have the following major
components.
• Database, Data Warehouse, World Wide Web, or Other Information Repository: This
is one or a set of databases, data warehouses, spreadsheets, or other kinds of
information repositories. Data cleaning and data integration techniques may be
performed on the data.
• Database or Data Warehouse Server: The database or data warehouse server is
responsible for fetching the relevant data, based on the user's data mining request.
• Knowledge Base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. It is typically stored in the form of a set
of rules. Such knowledge can include concept hierarchies, used to organize attributes or
attribute values into different levels of abstraction.
• Data Mining Engine: This is essential to the data mining system and ideally consists of
a set of functional modules for tasks such as characterization, association and
correlation analysis, classification, prediction, cluster analysis, outlier analysis, and
evolution analysis.
• Pattern Evaluation Module: This component typically employs interestingness measures
and interacts with the data mining modules so as to focus the search toward interesting
patterns. It may use interestingness thresholds to filter out discovered patterns.
• User Interface: This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data mining query or task.
In addition, this component allows the user to browse database and data warehouse
schemas or data structures, evaluate mined patterns, and visualize the patterns in
different forms.

[Figure: layered architecture, from top to bottom: User Interface; Pattern Evaluation; Data Mining
Engine (Pattern Evaluation and Data Mining Engine both use the Knowledge Base); Database or Data
Warehouse Server; data cleaning, integration and selection; Database, Data Warehouse, World Wide
Web, Other Data Repositories]

Figure 2.2: Architecture of typical data mining system


What Kinds of Data Are Mined in Data Mining?

There are a number of different data stores on which mining can be performed. These include
relational databases, data warehouses, transactional databases, advanced database systems, flat files,
and the World Wide Web. Advanced database systems include object-oriented and object-relational
databases, and specific application-oriented databases such as spatial databases, time-series
databases, text databases, and multimedia databases.
Relational Database

A relational database is a collection of multiple data sets formally organized into tables, records, and
columns, from which data can be accessed in various ways without having to reorganize the database
tables. Tables convey and share information, which facilitates data searchability, reporting, and
organization.
A relational database is a collection of tables, each of which is assigned a unique name. Each table
consists of a set of attributes (columns or fields) and usually stores a large number of tuples (records
or rows). Each tuple in a relational table represents a record identified by a unique key and described
by a set of attribute values.

Example:

Student (SID, Name, Address, Course-ID), where Course-ID is a foreign key that references the
Course table.

Course
Course-ID    Course-Name
C002         C++
C021         DBMS
C321         Account

Figure 2.3: Relationship between tables of relational database
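
The relationship in Figure 2.3 can be tried out with a small sketch. It uses Python's built-in sqlite3 module and the Course rows from the figure; the student rows, the lower-case table and column names, and the helper students_with_courses are assumptions made only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.execute(
    "CREATE TABLE course (course_id TEXT PRIMARY KEY, course_name TEXT NOT NULL)"
)
conn.execute(
    """CREATE TABLE student (
           sid TEXT PRIMARY KEY,
           name TEXT NOT NULL,
           course_id TEXT REFERENCES course(course_id)
       )"""
)

# Course rows as shown in Figure 2.3.
conn.executemany(
    "INSERT INTO course VALUES (?, ?)",
    [("C002", "C++"), ("C021", "DBMS"), ("C321", "Account")],
)
# Hypothetical student rows (not taken from the figure).
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [("S-01", "Student A", "C002"),
     ("S-02", "Student B", "C021"),
     ("S-03", "Student C", "C002")],
)


def students_with_courses(connection):
    """Join each student to a course through the foreign key."""
    query = """SELECT s.sid, s.name, c.course_name
               FROM student AS s JOIN course AS c ON s.course_id = c.course_id
               ORDER BY s.sid"""
    return connection.execute(query).fetchall()


rows = students_with_courses(conn)

# A student that points at a non-existent course is rejected by the key constraint.
try:
    conn.execute("INSERT INTO student VALUES ('S-04', 'Student D', 'C999')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

Each tuple in student is identified by its key (sid), and the join recovers the course name without storing it twice.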

Data Warehouses

A data warehouse is a technology that collects data from various sources within an
organization to provide meaningful business insights. The huge amount of data comes from multiple
places such as Marketing and Finance. The extracted data is used for analytical purposes and
helps in decision making for a business organization. The data warehouse is designed for the
analysis of data rather than transaction processing.
A data warehouse is usually modeled by a multidimensional database structure, where each dimension
corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some
aggregate measure, such as count or sales amount. The actual physical structure of a data warehouse
may be a relational data store or a multidimensional data cube. It provides a multidimensional view of
data and allows the pre-computation and fast accessing of summarized data.
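[Figure: 3D data cube with the dimensions city (including Kawasoti, Dhangadi and Pokhara), time
(quarters) and product (types). One cell is labelled <Pokhara, Q4, SSD>. The legible values of the
front face are listed below.]

Time (quarters)    TV     PC     AP    SSD
Q1                 605    825    14
Q2                 680    952    31    512
Q3                 812    1023   30    501
Q4                 927    1038   38    580

Figure 2.4: 3D cube of sales data (measure displayed is rupees_sold in thousands)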
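
As a rough illustration of how a cube stores an aggregate measure per cell and allows summarized data to be pre-computed, the sketch below rolls a handful of fact rows up along chosen dimensions. The fact rows, the dimension names and the function roll_up are hypothetical and are not the data of Figure 2.4.

```python
from collections import defaultdict

# Hypothetical fact rows: (city, quarter, product, rupees_sold in thousands).
facts = [
    ("Pokhara", "Q1", "TV", 605),
    ("Pokhara", "Q1", "PC", 825),
    ("Pokhara", "Q2", "TV", 680),
    ("Dhangadi", "Q1", "TV", 300),
    ("Dhangadi", "Q2", "PC", 410),
]

DIMENSIONS = ("city", "quarter", "product")


def roll_up(rows, keep):
    """Sum the measure over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        cell = tuple(row[p] for p in positions)
        totals[cell] += row[3]
    return dict(totals)


cube = roll_up(facts, DIMENSIONS)                      # one cell per (city, quarter, product)
by_city = roll_up(facts, ("city",))                    # summarized over time and product
by_quarter_product = roll_up(facts, ("quarter", "product"))  # summarized over city
```

Pre-computing views such as by_city is what makes summarized data fast to access; a real warehouse would do this inside the database or OLAP engine rather than in application code.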


Data Repositories

The Data Repository generally refers to a destination for data storage. However, many IT
professionals use the term more specifically to refer to a particular kind of setup within an IT structure,
for example, a group of databases where an organization has kept various kinds of information.

Object-Relational Database

A combination of an object-oriented database model and relational database model is called an
object-relational model. It supports classes, objects, inheritance, etc. One of the primary objectives of
the object-relational data model is to close the gap between the relational database and the
object-oriented model practices frequently used in many programming languages, for example, C++,
Java, C#, and so on.
Transactional Database
A transactional database consists of a file where each record represents a transaction. A transaction
typically includes a unique transaction identity number (Trans_ID), and a list of the items making up
the transaction. The transactional database may have additional tables associated with it, which
contain other information regarding the sale, such as the date of the transaction, the customer ID
number, the ID number of the sales person, and of the branch at which the sale occurred, and so on.
The term transactional database also refers to a database management system (DBMS) that can
undo (roll back) a database transaction if it is not completed properly. Although this was once a
distinguishing capability, today most relational database systems support
transactional database activities.
Product master data
Product_id    Product name    Product description                    Price
P1            Computer        Pentium 4 computer with Dell brand     40000
P2            Laptop          Dell 17 black cover laptop             80000
P3            Chair           Four legs comfortable chair            15000
P4            Table           Computer table                         10000

Customer master data
Customer_id    Customer name    Customer address
C1             Manmohan         Butwal-22, Rupandehi, Nepal
C2             Kumar            Ktm-32, Sukedhara, Nepal
C3             Abin             Pokhara-05, Lakeside, Nepal

Transactions
Sale_id    Product_id    Customer_id    Sale price    Sale date time
S1         P1            C1             40000         2072/10/09; 2:15:40
S2         P2            C1             80000         2071/11/30; 1:12:36
S3         P1            C3             35000         2070/02/29; 4:11:34
S4         P3            C2             15000         2072/12/21; 9:32:11

Figure 2.5: Example of master and transactional data in transactional database


Generally, master data does not change and does not need to be created with every transaction. For
example, if one customer purchases multiple products at different times, a transaction record needs
to be created for each sale, but the data about the customer stays the same. Figure 2.5 shows how
master data forms part of a transactional record. In this case, when products such as the computer
and the chair are sold, the transaction references the relevant product IDs and customer IDs. The
product and customer records, if they already exist, do not need to be recreated or modified for this
new transaction. The other data in the transaction, such as the unique identifier for the transaction
(i.e., sale ID) and the sale time, do, however, change with every transaction. Note that this simple
example shows the actual sale price in the transaction and the price of the product in the master data.
There are different ways to model this, but the example shows that the actual price can change
depending on the transaction and may be calculated from the product price in the master data
(after including discounts, etc.). The transaction with sale ID S3 in Figure 2.5 shows that a discount
has been applied to the retail price.
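
The link between master data and transactions can be checked with a short sketch that uses the rows of Figure 2.5. The dictionary layout and the function discounts are assumptions for illustration; only the IDs and prices come from the figure.

```python
# Product master data from Figure 2.5: created once, referenced by every sale.
product_master = {
    "P1": {"name": "Computer", "price": 40000},
    "P2": {"name": "Laptop", "price": 80000},
    "P3": {"name": "Chair", "price": 15000},
    "P4": {"name": "Table", "price": 10000},
}

# Transactions from Figure 2.5: each row only references master data by ID.
transactions = [
    {"sale_id": "S1", "product_id": "P1", "customer_id": "C1", "sale_price": 40000},
    {"sale_id": "S2", "product_id": "P2", "customer_id": "C1", "sale_price": 80000},
    {"sale_id": "S3", "product_id": "P1", "customer_id": "C3", "sale_price": 35000},
    {"sale_id": "S4", "product_id": "P3", "customer_id": "C2", "sale_price": 15000},
]


def discounts(sales, products):
    """Return, per sale, the master price minus the price actually charged."""
    result = {}
    for sale in sales:
        list_price = products[sale["product_id"]]["price"]
        result[sale["sale_id"]] = list_price - sale["sale_price"]
    return result


discount_by_sale = discounts(transactions, product_master)
```

Only S3 yields a non-zero difference (40000 in the master data against 35000 charged), which is the discount the text refers to.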
Advanced Database Systems and Advanced Database Applications

Relational database systems have been widely used in business applications. With the advances of
database technology, various kinds of advanced database systems have emerged and are undergoing
development to address the requirements of new database applications. The new database
applications include handling spatial data (such as maps), engineering design data (such as the
design of buildings, system components, or integrated circuits), hypertext and multimedia data
(including text, image, video, and audio data), time-related data (such as historical records or stock
exchange data), and the World Wide Web (a huge, widely distributed information repository made
available by the Internet). These applications require efficient data structures. In response to these
needs, advanced database systems and specific application-oriented database systems have been
developed.

Data Mining Process

For data mining to be effective, much careful work is needed in defining the aims of the enterprise
data mining and then in the selection, cleaning, transformation and perhaps separate storage of data
that is suitable for data mining. This separate database may be a data warehouse that stores
information needed by the enterprise's decision makers. Such a database should include not just
current data but also historical data. Data mining is a process of discovering various models,
summaries, and derived values from a given collection of data. As noted above, the data mining
process involves much hard work. We discuss two different approaches now. The first approach is an
adaptation of the well-known software development process and the second one is the Cross-Industry
Standard Process for Data Mining (CRISP-DM) approach.

Software Development Approach

A typical data mining process is likely to include the following steps. These steps are based on a
typical software development process. It should be noted that all such processes are iterative, and
work done at Step 1 may need to be revised based on some new information or new insight further
down in the process.
STATE THE PROBLEM
↓
COLLECT THE DATA
↓
PERFORM PREPROCESSING
↓
ESTIMATE THE MODEL (MINE THE DATA)
↓
INTERPRET THE MODEL AND DRAW THE CONCLUSION

Figure 2.6: Data mining process (general experimental approach)

1. State the Problem and Formulate the Hypothesis
Most data-based modelling studies are performed in a particular application domain. Hence,
domain-specific knowledge and experience are usually necessary in order to come up with a
meaningful problem statement. Unfortunately, many application studies tend to focus on the
data-mining technique at the expense of a clear problem statement. In this step, a modeler
usually specifies a set of variables for the unknown dependency and, if possible, a general
form of this dependency as an initial hypothesis. There may be several hypotheses formulated
for a single problem at this stage. The first step requires the combined expertise of the
application domain and of data-mining modeling. In practice, it usually means a close interaction
between the data-mining expert and the application expert. In successful data-mining
applications, this cooperation does not stop in the initial phase; it continues during the entire
data-mining process.
2. Collect the Data
This step is concerned with how the data are generated and collected. In general, there are two
distinct possibilities. The first is when the data-generation process is under the control of an
expert (modeler): this approach is known as a designed experiment. The second possibility is
when the expert cannot influence the data- generation process: this is known as the
observational approach. An observational setting, namely, random data generation, is
assumed in most data-mining applications. Typically, the sampling distribution is completely
unknown after data are collected, or it is partially and implicitly given in the data-collection
procedure. It is very important, however, to understand how data collection affects its
theoretical distribution, since such a priori knowledge can be very useful for modeling and,
later, for the final interpretation of results. Also, it is important to make sure that the data used
for estimating a model and the data used later for testing and applying a model come from the
same, unknown, sampling distribution. If this is not the case, the estimated model cannot be
successfully used in a final application of the results.
3. Preprocessing the Data
In the observational setting, data are usually collected from existing databases, data
warehouses, and data marts. Data preprocessing usually includes at least two common tasks:
a) Outlier detection (and removal): Outliers are unusual data values that are not
consistent with most observations. Commonly, outliers result from measurement
errors, coding and recording errors, and, sometimes, are natural, abnormal values.
Such nonrepresentative samples can seriously affect the model produced later. There
are two strategies for dealing with outliers: (i) detect and eventually remove outliers as
a part of the preprocessing phase, or (ii) develop robust modeling methods that are
insensitive to outliers.
b) Scaling, encoding, and selecting features: Data preprocessing includes several steps
such as variable scaling and different types of encoding. For instance, one feature with
the range [0, 1] and another with the range [100, 1000] will not have the same weight in
the applied technique, and they will also influence the final data-mining results
differently. Therefore, it is recommended to scale them and bring both features to the
same weight for further analysis. Also, application-specific encoding methods
usually achieve dimensionality reduction by providing a smaller number of
informative features for subsequent data modeling.
These two classes of preprocessing tasks are only illustrative examples of a large spectrum
of preprocessing activities in a data-mining process. Data-preprocessing steps should not
be considered completely independent from other data-mining phases. In every iteration of
the data-mining process, all activities, together, could define new and improved data sets for
subsequent iterations. Generally, a good preprocessing method provides an optimal
representation for a data-mining technique by incorporating a priori knowledge in the form of
application-specific scaling and encoding.
4. Estimate the Model
The selection and implementation of the appropriate data-mining technique is the main task
in this phase. This process is not straightforward; usually, in practice, the implementation is
based on several models, and selecting the best one is an additional task.
5. Interpret the Model and Draw Conclusions
In most cases, data-mining models should help in decision making. Hence, such models need
to be interpretable in order to be useful, because humans are not likely to base their decisions
on complex black-box models. Note that the goals of accuracy of the model and accuracy of its
interpretation are somewhat contradictory. Usually, simple models are more interpretable,
but they are also less accurate. Modern data-mining methods are expected to yield highly
accurate results using high-dimensional models. The problem of interpreting these models,
also very important, is considered a separate task, with specific techniques to validate the
results. A user does not want hundreds of pages of numeric results: such output is hard to
understand, summarize, interpret, and use for successful decision making.
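
The five steps above can be sketched end to end on a toy problem. Everything in this sketch is an assumption made for illustration: the synthetic data, the 1.5 × IQR outlier rule, min-max scaling as the scaling method, the two candidate models (a constant mean and a least-squares line) and the 70/30 split.

```python
import random
import statistics

rng = random.Random(0)

# Collect the data (synthetic here): x in [100, 1000), y roughly linear in x,
# plus one recording error that acts as an outlier.
rows = [(x, 3 * x + 50 + rng.gauss(0, 5)) for x in range(100, 1000, 20)]
rows.append((500, 100000.0))


def remove_outliers_iqr(data, col, k=1.5):
    """Drop rows whose value in `col` lies more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles([r[col] for r in data], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in data if low <= r[col] <= high]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def fit_mean(xs, ys):
    """Simplest candidate: always predict the mean of the training targets."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Least-squares straight line through the training points."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def mse(model, xs, ys):
    """Mean squared error of `model` on the given points."""
    return statistics.fmean((model(x) - y) ** 2 for x, y in zip(xs, ys))


# Preprocess: remove the outlier, then scale the feature to [0, 1].
clean = remove_outliers_iqr(rows, col=1)
scaled_x = min_max_scale([r[0] for r in clean])
points = list(zip(scaled_x, (r[1] for r in clean)))

# Estimate: fit several candidate models on one part, compare them on the rest.
rng.shuffle(points)
cut = int(0.7 * len(points))
train, test = points[:cut], points[cut:]
train_x, train_y = zip(*train)
test_x, test_y = zip(*test)

candidates = {"mean": fit_mean(train_x, train_y), "line": fit_line(train_x, train_y)}
errors = {name: mse(model, test_x, test_y) for name, model in candidates.items()}
best_name = min(errors, key=errors.get)
```

Training and test points come from the same shuffled sample, in line with the requirement in step 2 that both sets share one sampling distribution. The straight line is also easy to interpret, which matters for step 5.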
CRISP-DM Approach
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is claimed to be more
practical and successful, and it is widely adopted and used by industry members. This model consists
of six phases intended as a cyclical process (see Figure 2.7):
[Figure: cyclical process arranged around the Data: Business Understanding, Data Understanding,
Data Preparation, Modeling, Evaluation, Deployment]

Figure 2.7: CRISP-DM approach


1. Business Understanding: Business understanding includes determining business objectives,
assessing the current situation, establishing data mining goals, and developing a project plan.
2. Data Understanding: Once business objectives and the project plan are established, data
understanding considers data requirements. This step can include initial data collection, data
description, data exploration, and the verification of data quality. Data exploration such as
viewing summary statistics (which includes the visual display of categorical variables) can
occur at the end of this phase. Models such as cluster analysis can also be applied during this
phase, with the intent of identifying patterns in the data.
3. Data Preparation: Once the available data resources are identified, they need to be selected,
cleaned, built into the form desired, and formatted. Data cleaning and data transformation in
preparation for data modeling need to occur in this phase. Data exploration at a greater depth
can be applied during this phase, and additional models utilized, again providing the
opportunity to see patterns based on business understanding.
4. Modeling: Data mining software tools such as visualization (plotting data and establishing
relationships) and cluster analysis (to identify which variables go well together) are useful for
initial analysis. Tools such as generalized rule induction can develop initial association rules.
Once greater data understanding is gained (often through pattern recognition triggered by
viewing model output), more detailed models appropriate to the data type can be applied.
The division of data into training and test sets is also needed for modeling.
5. Evaluation: Model results should be evaluated in the context of the business objectives
established in the first phase (business understanding). This will lead to the identification of
other needs (often through pattern recognition), frequently reverting to prior phases of
CRISP-DM. Gaining business understanding is an iterative procedure in data mining, where the
results of various visualization, statistical, and artificial intelligence tools show the user new
relationships that provide a deeper understanding of organizational operations.
6. Deployment: Data mining can be used both to verify previously held hypotheses and for
knowledge discovery (identification of unexpected and useful relationships). Through the
knowledge discovered in the earlier phases of the CRISP-DM process, sound models can be
obtained that may then be applied to business operations for many purposes, including
prediction or identification of key situations. These models need to be monitored for changes
in operating conditions, because what might be true today may not be true a year from now. If
significant changes do occur, the model should be redone. It is also wise to record the results of
data mining projects so documented evidence is available for future studies.
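
The monitoring mentioned in the deployment phase can be as simple as comparing recent prediction errors with the errors measured at deployment time. The sketch below is one hypothetical way to do that; the function name needs_rebuild and the tolerance factor of 1.5 are assumptions, not part of CRISP-DM.

```python
import statistics


def needs_rebuild(baseline_errors, recent_errors, tolerance=1.5):
    """Flag the model when the recent mean absolute error exceeds
    `tolerance` times the mean absolute error seen at deployment."""
    baseline = statistics.fmean(abs(e) for e in baseline_errors)
    recent = statistics.fmean(abs(e) for e in recent_errors)
    return recent > tolerance * baseline


# Errors (prediction minus actual) recorded at deployment and in two later periods.
at_deployment = [1.0, -1.0, 2.0, -2.0]
stable_period = [1.0, 2.0, -1.0, -2.0]
changed_period = [5.0, -6.0, 7.0, -4.0]

stable_flag = needs_rebuild(at_deployment, stable_period)
changed_flag = needs_rebuild(at_deployment, changed_period)
```

A flag raised by such a check is a prompt to revisit the earlier phases, which matches the cyclical nature of the process.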

ISSUES IN DATA MINING

Data mining is not an easy task, as the algorithms used can get very complex and data is not always
available in one place. It needs to be integrated from various heterogeneous data sources. These
factors also create some issues. Data mining systems face many challenges and issues in today's
world. Some of them are:
1. Mining methodology and user interaction issues
2. Performance issues
3. Issues relating to the diversity of database types
Data Mining Issues

Mining Methodology and User Interaction
• Mining different kinds of knowledge in database
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad hoc data mining
• Presentation and visualization of data mining results
• Handling noisy or incomplete data
• Pattern evaluation

Performance Issues
• Efficiency and scalability of data mining algorithms
• Parallel, distributed and incremental mining algorithms

Diverse Data Types Issues
• Handling of relational and complex types of data
• Mining information from heterogeneous database and global information systems

Figure 2.8: Data mining issues

1. Mining Methodology and User Interaction Issues:
a) Mining different kinds of knowledge in databases: The needs of different users are not
the same. Different users may be interested in different kinds of knowledge. Therefore, it is
necessary for data mining to cover a broad range of knowledge discovery tasks
including data characterization, discrimination, association and correlation analysis,
classification, prediction, clustering, outlier analysis, and evolution analysis (which
includes trend and similarity analysis).
b) Interactive mining of knowledge at multiple levels of abstraction: The data mining
process needs to be interactive because it allows users to focus the search for patterns,
providing and refining data mining requests based on returned results.
c) Incorporation of background knowledge: Background knowledge can be used to guide
the discovery process and to express discovered patterns. It allows discovered
patterns to be expressed not only in concise terms but also at multiple levels of abstraction.
d) Data mining query languages and ad-hoc data mining: A data mining query language
that allows the user to describe ad-hoc mining tasks should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
e) Presentation and visualization of data mining results: Once patterns are discovered, they
need to be expressed in high-level languages and visual representations. These
representations should be easily understandable by users.
f) Handling noisy or incomplete data: Data cleaning methods are required that can
handle noise and incomplete objects while mining data regularities. Without data cleaning
methods, the accuracy of the discovered patterns will be poor.
g) Pattern evaluation: This refers to the interestingness of the discovered patterns. Many
discovered patterns may be uninteresting because they either represent common knowledge
or lack novelty, so the patterns need to be evaluated for interestingness.
2. Performance Issues:
a) Efficiency and scalability of data mining algorithms: In order to effectively extract
information from the huge amount of data in databases, data mining algorithms must be
efficient and scalable. In other words, the running time of a data mining algorithm
must be predictable and acceptable in large databases. From a database perspective on
knowledge discovery, efficiency and scalability are key issues in the implementation of
data mining systems.
b) Parallel, distributed, and incremental mining algorithms: Factors such as the huge
size of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions that are processed in parallel. The results
from the partitions are then merged. Incremental algorithms incorporate database updates
without having to mine the entire data again from scratch.
3. Issues Relating to the Diversity of Database Types:
a) Handling of relational and complex types of data: Because relational databases and
data warehouses are widely used, the development of efficient and effective data
mining systems for such data is important. However, other databases may contain
complex data objects, hypertext and multimedia data, spatial data, temporal data, or
transaction data.
b) Mining information from heterogeneous databases and global information systems:
Local- and wide-area computer networks (such as the Internet) connect many sources
of data, forming huge, distributed, and heterogeneous databases. The discovery of
knowledge from different sources of structured, semi-structured, or unstructured data
with diverse data semantics poses great challenges to data mining.
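
The partition, process and merge idea behind parallel mining, and the incremental update that avoids mining from scratch, can be sketched with simple item counting. The function names, the use of a thread pool and the toy transactions are assumptions for illustration; a real system would distribute far heavier mining work.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Count, for one partition, in how many transactions each item occurs."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


def mine_partitions(partitions):
    """Process the partitions concurrently, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(count_items, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def update_incrementally(total, new_transactions):
    """Fold newly arrived transactions into existing counts without a full re-scan."""
    updated = Counter(total)
    updated.update(count_items(new_transactions))
    return updated


partitions = [
    [["milk", "sugar"], ["milk"]],
    [["sugar", "tea"], ["milk", "tea"]],
]
total_counts = mine_partitions(partitions)

new_batch = [["tea"]]
updated_counts = update_incrementally(total_counts, new_batch)

# For comparison: mining everything again from scratch gives the same counts.
from_scratch = count_items(partitions[0] + partitions[1] + new_batch)
```

The merge step works here because counts are additive across partitions; mining tasks whose results are not additive need a more careful merge.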
DATA MINING FUNCTIONALITIES

Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks.
Data mining functions are used to define the trends or correlations contained in data mining
activities. Data mining tasks can be classified into two categories: descriptive and predictive.

1. Descriptive Task: Descriptive mining tasks characterize the general properties of the data in
the database. They provide knowledge for understanding what is happening within the data
without any previous idea. The common features of the data set are highlighted, for
example, count, average etc.
2. Predictive Task: Predictive mining tasks perform inference on the current data in order to
make predictions. They help to assign values to attributes that are unlabeled: based on
previous observations, the software estimates the characteristics that are absent. For example,
judging from the findings of a patient's medical examinations whether the patient is suffering
from a particular disease.
There are a number of data mining functionalities that organized and scientific methods offer.
Major data mining functionalities are described as follows:
1. OassfConcept Descriptions: Characterization and Discrimination
Gasses or definitions can be correlated with results. In simplified, descriptive and yet a«urate
ways, it can be helpful to define individual groups and concepts. It is important to liru data
with groups or related items. For example, computers and printers are types of goods for sale
in the Hardware Shop. These class or concept definitions are referred to as class/concept
descriptions.
• Data Characterization: Data characterization is a summary of the general
characteristics or features of the class under study (the target class); its output is
called a characteristic rule. The data corresponding to the user-specified class are
typically collected by a database query and then summarized at various abstraction
levels. For example, to study the characteristics of software products
whose sales increased by 15% two years ago, one can collect the data
related to such products by running SQL queries.
• Data Discrimination: Data discrimination compares the general features of the class
under study with those of one or more contrasting classes. The output of this process
can be represented in many forms, e.g., bar charts, curves, and pie charts.
2. Mining Frequent Patterns
Frequent patterns are patterns that occur frequently in data, that is, the
things found to be most common in the data. Several kinds of frequent patterns
can be observed in a dataset:
• Frequent Itemset: This refers to a set of items that are often seen together,
e.g., milk and sugar.
• Frequent Subsequence: This refers to a sequence of events that occurs often, such
as purchasing a phone followed by a back cover.
• Frequent Substructure: This refers to structural forms, such as trees
and graphs, that occur often and may be combined with itemsets or subsequences.
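As a rough illustration of frequent itemsets, the following Python sketch counts how often each pair of items appears together and keeps the pairs that reach a minimum count. The baskets, the function name frequent_pairs, and the threshold min_support are all invented for this example; practical systems use dedicated algorithms rather than this brute-force count.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that occur together in at least min_support transactions."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical shopping baskets, invented for illustration
baskets = [
    ["milk", "sugar", "bread"],
    ["milk", "sugar"],
    ["bread", "butter"],
    ["milk", "sugar", "butter"],
]

print(frequent_pairs(baskets, min_support=3))
```

With these baskets only the pair milk and sugar is bought together three times, so it is the only pair reported.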

3. Association Analysis
This process involves uncovering the relationships between data and deriving the rules of the
association. It is a way of discovering the relationship between various items. For example, it
can be used to determine the sales of items that are frequently purchased together.
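Association rules are commonly judged by two measures, support and confidence. The sketch below computes both for a rule of the form "antecedent implies consequent". The transactions and the function name rule_metrics are assumptions made for this example, not part of any particular tool.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Transactions that contain every item of the antecedent
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    # Transactions that contain the antecedent and the consequent together
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical sales transactions, invented for illustration
sales = [
    ["computer", "printer"],
    ["computer", "printer", "mouse"],
    ["computer", "mouse"],
    ["computer"],
]

support, confidence = rule_metrics(sales, ["computer"], ["printer"])
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Here half of all transactions contain both items (support), and half of the transactions that contain a computer also contain a printer (confidence).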
4. Correlation Analysis
Correlation is a mathematical technique that can show whether and how strongly pairs of
attributes are related to each other. For example, taller people tend to weigh more.
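One common way to measure the strength of a linear relationship is the Pearson correlation coefficient, which ranges from -1 to +1. The sketch below computes it by hand; the height and weight values are made up for illustration, and Pearson's coefficient is only one of several correlation measures.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical heights (cm) and weights (kg), invented for illustration
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 69, 75, 88]

r = pearson(heights, weights)
print(f"r = {r:.3f}")  # a value close to +1 indicates a strong positive relationship
```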
5. Classification and Prediction
Classification is the process of finding a model (or function) that describes and distinguishes
data classes or concepts. Mainly it is used to predict the class of objects whose class label is
unknown. The derived model is based on the analysis of a set of training data, that is, data
objects whose class label is known. The derived model may be represented in various forms,
such as classification (IF-THEN) rules, decision trees, mathematical formulae, neural
networks, etc.
For example, a decision tree performs the classification in the form of a tree structure. It breaks
the dataset down into smaller and smaller subsets while the decision tree is built up
incrementally. The final result is a tree with decision nodes and leaf nodes.
The following decision tree can be designed to declare whether an applicant is
eligible or not eligible to get a driving license.

                 Driving license

   If age is >= 18          If age is < 18

         Yes                       No

       Eligible               Not eligible

Figure 2.9: Decision tree
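The tree in Figure 2.9 has a single decision node, so it can be written directly as one IF-THEN rule. The function name below is hypothetical; the threshold of 18 comes from the figure.

```python
def driving_license_eligibility(age):
    """Apply the one-node decision tree of Figure 2.9 and return the class label."""
    if age >= 18:          # decision node: test the age attribute
        return "Eligible"  # leaf for the "Yes" branch
    return "Not eligible"  # leaf for the "No" branch


for applicant_age in (16, 18, 35):
    print(applicant_age, driving_license_eligibility(applicant_age))
```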


Whereas classification predicts categorical labels, prediction models continuous-valued
functions. That is, it is used to predict missing or unavailable numerical data values rather
than class labels. Regression analysis is a statistical methodology that is most often used for
numeric prediction, although other methods exist as well. Prediction also encompasses the
identification of distribution trends based on the available data. Classification and prediction
may need to be preceded by relevance analysis, which attempts to identify attributes that do
not contribute to the classification or prediction process. These attributes can then be
excluded.
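To make numeric prediction concrete, here is a minimal sketch of simple linear regression fitted by least squares. The experience and salary figures are invented, and fit_line is a hypothetical helper name; real studies would normally use a statistics library and more data.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: years of experience and salary (in thousands)
years = [1, 2, 3, 4]
salary = [30, 35, 40, 45]

slope, intercept = fit_line(years, salary)
predicted = slope * 6 + intercept  # predict the unknown salary for 6 years
print(f"salary = {slope:.1f} * years + {intercept:.1f}; prediction for 6 years: {predicted:.1f}")
```

Unlike the decision tree, the output here is a continuous value rather than a class label.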

6. Cluster Analysis
Unlike classification and prediction, which analyze class-labeled data objects, clustering
analyzes data objects without consulting a known class label. Clustering can be used to
generate such labels. The objects are clustered or grouped based on the principle of
maximizing the intraclass similarity and minimizing the interclass similarity. That is, clusters
of objects are formed so that objects within a cluster have high similarity in comparison to one
another, but are very dissimilar to objects in other clusters.

Figure 2.10: Three data clusters and outliers
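The grouping principle can be sketched with k-means, one widely used clustering method, applied here to one-dimensional values. The data points and the starting centers are assumptions chosen for illustration; k-means is only one of many clustering algorithms and its result depends on the starting centers.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for 1-D data; returns (centers, clusters)."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to the nearest current center
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical unlabeled values and starting centers, invented for illustration
points = [1, 2, 3, 10, 11, 12, 25, 26, 27]
centers, clusters = kmeans_1d(points, [0, 10, 30])
print(centers)
print(clusters)
```

No class labels are supplied; the three groups emerge only from the distances between values.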

7. Outlier Analysis
A database may contain data objects that do not comply with the general behavior or model of
the data. These data objects are outliers. Most data mining methods discard outliers as noise
or exceptions. However, in some applications such as fraud detection, the rare events can be
more interesting than the more regularly occurring ones. Outliers may be detected using
statistical tests that assume a distribution or probability model for the data, or using distance
measures where objects that are a substantial distance from any other cluster are considered
outliers. For example, outlier analysis may uncover fraudulent usage of credit cards by
detecting purchases of extremely large amounts for a given account number in comparison to
regular charges incurred by the same account.
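The credit card example can be sketched with a simple distance-based rule: flag a charge when it lies far from the median of the account's charges, measured in units of the median absolute deviation (MAD). The charge amounts, the function name find_outliers, and the multiplier k are assumptions for this sketch; real fraud detection uses far richer models.

```python
from statistics import median


def find_outliers(amounts, k=5.0):
    """Return values lying more than k median-absolute-deviations from the median."""
    center = median(amounts)
    mad = median([abs(a - center) for a in amounts])
    if mad == 0:
        return []  # all values (nearly) identical; nothing stands out under this rule
    return [a for a in amounts if abs(a - center) > k * mad]


# Hypothetical charges on one account, invented for illustration
charges = [40, 55, 38, 60, 45, 52, 2500]
print(find_outliers(charges))
```

The median is used instead of the mean because a single extreme charge would otherwise pull the center toward itself.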
8. Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior
changes over time. Distinct features of such an analysis include time-series data analysis,
sequence or periodicity pattern matching, and similarity-based data analysis. For example,
suppose you have the major stock market (time-series) data of the last several years available from the
Nepal Stock Exchange and you would like to invest in shares of high-tech industrial
companies. A data mining study of stock exchange data may identify stock evolution
regularities for overall stocks and for the stocks of particular companies. Such regularities may
help predict future trends in stock market prices, contributing to your decision-making
regarding stock investments.
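A moving average is one simple way to expose a trend in time-series data by smoothing short-term fluctuations. The closing prices below are invented for illustration and do not come from any real exchange.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values in the series."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# Hypothetical daily closing prices, invented for illustration
prices = [100, 102, 101, 105, 107, 110]
trend = moving_average(prices, window=3)
print(trend)  # steadily rising averages suggest an upward trend
```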
KNOWLEDGE DISCOVERY IN DATABASES (KDD)

Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a
collection of data. This widely used data mining technique is a process that includes data preparation
and selection, data cleansing, incorporating prior knowledge on data sets, and interpreting accurate
solutions from the observed results. Major KDD application areas include marketing, fraud
detection, telecommunication, and manufacturing. Data mining is the core part of the knowledge
discovery process. KDD is a process of finding knowledge in data; it does this by using data mining
methods (algorithms) in order to extract the required knowledge from large amounts of data.

[Figure: flow diagram with stages labeled Data Integration, Data Warehouse, Data Cleaning, Selection, Task-relevant Data, Pattern Evaluation, and Knowledge]
Figure 2.11: Knowledge Discovery Process (KDP)

Simply stated, data mining refers to extracting or "mining" knowledge from large amounts of data
stored in databases, data warehouses, or other information repositories. Many people treat data
mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD.
Alternatively, others view data mining as simply an essential step in the process of knowledge
discovery. Knowledge discovery consists of an iterative sequence of the following steps:
• Data Cleaning: The first step in the Knowledge Discovery Process is data cleaning, in which
noise and inconsistent data are removed.
o Cleaning in case of missing values.
o Cleaning noisy data, where noise is a random error or variance in the data.
o Cleaning with data discrepancy detection and data transformation tools.
• Data Integration: Data integration is defined as combining heterogeneous data from multiple
sources into a common source (data warehouse).
o Data integration using data migration tools.
o Data integration using data synchronization tools.
o Data integration using the ETL (Extract-Transform-Load) process.
• Data Selection: Data selection is defined as the process where data relevant to the
analysis is decided and retrieved from the database.
o Data selection using neural networks.
o Data selection using decision trees.
o Data selection using Naive Bayes.
o Data selection using clustering, regression, etc.
• Data Transformation: In data transformation, data are transformed into forms
appropriate for mining by performing summary or aggregation operations. Data
transformation is a two-step process:
o Data mapping: Assigning elements from the source base to the destination to capture
transformations.
o Code generation: Creation of the actual transformation program.
• Data Mining: In data mining, data mining methods (algorithms) are applied in order
to extract data patterns.
o Transforms task-relevant data into patterns.
o Decides the purpose of the model using classification or characterization.
• Pattern Evaluation: In pattern evaluation, data patterns are identified based on some
interestingness measures.
o Finds the interestingness score of each pattern.
o Uses summarization and visualization to make the data understandable to the user.
• Knowledge Presentation: In knowledge presentation, knowledge is represented to the
user using many knowledge representation techniques.
o Generate reports.
o Generate tables.
o Generate discriminant rules, classification rules, characterization rules, etc.
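The steps above can be sketched as a chain of small functions, one per step. Everything in this example is an assumption made for illustration: the record layout, the function names, the spending threshold, and the very simple "mining" rule that stands in for a real algorithm.

```python
from collections import defaultdict


def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: combine records from several sources into one list."""
    return [r for source in sources for r in source]


def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: aggregate the total amount per customer."""
    totals = defaultdict(float)
    for r in records:
        totals[r["customer"]] += r["amount"]
    return dict(totals)


def mine(totals, threshold):
    """Data mining (toy stand-in): label each customer by spending level."""
    return {c: ("high" if t >= threshold else "low") for c, t in totals.items()}


def evaluate(patterns):
    """Pattern evaluation: keep only the patterns judged interesting."""
    return sorted(c for c, label in patterns.items() if label == "high")


# Hypothetical records from two stores, invented for illustration
store_a = [
    {"customer": "C1", "amount": 120.0, "city": "Kathmandu"},
    {"customer": "C2", "amount": None, "city": "Pokhara"},  # missing value
]
store_b = [
    {"customer": "C1", "amount": 80.0, "city": "Kathmandu"},
    {"customer": "C3", "amount": 30.0, "city": "Lalitpur"},
]

combined = integrate(clean(store_a), clean(store_b))
relevant = select(combined, ["customer", "amount"])
totals = transform(relevant)
patterns = mine(totals, threshold=100.0)
interesting = evaluate(patterns)

# Knowledge presentation: report the result to the user
print("High-spending customers:", interesting)
```

In practice the process is iterative, so the output of a later step often sends the analyst back to an earlier one.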

DATA OBJECT AND ATTRIBUTE TYPES

Data sets are made up of data objects. A data object represents an entity. In a sales database, the
objects may be customers, store items, and sales; in a medical database, the objects may be patients;
in a university database, the objects may be students, professors, and courses. Data objects are
typically described by attributes. Data objects can also be referred to as samples, examples, instances,
data points, or objects. If the data objects are stored in a database, they are data tuples. That is, the
rows of a database correspond to the data objects, and the columns correspond to the attributes.
An attribute is a data field, representing a characteristic or feature of a data object. The nouns
attribute, dimension, feature, and variable are often used interchangeably in the literature. The term
dimension is commonly used in data warehousing. Machine learning literature tends to use the term
feature, while statisticians prefer the term variable. Data mining and database professionals
commonly use the term attribute, and we do here as well. Attributes describing a customer object can
include, for example, customer_ID, name, and address. Observed values for a given attribute are
known as observations. A set of attributes used to describe a given object is called an attribute vector
(or feature vector). The distribution of data involving one attribute (or variable) is called univariate.
A bivariate distribution involves two attributes, and so on.
The type of an attribute is determined by the set of possible values (nominal, binary, ordinal, or
numeric) the attribute can have. In the following subsections, we introduce each type.
1. Nominal Attributes
Nominal means "relating to names." The values of a nominal attribute are symbols or names
of things. Each value represents some kind of category, code, or state, and so nominal
attributes are also referred to as categorical. The values do not have any meaningful order. In
computer science, the values are also known as enumerations.
Because nominal attribute values do not have any meaningful order about them and are not
quantitative, it makes no sense to find the mean (average) value or median (middle) value for
such an attribute, given a set of objects. One thing that is of interest, however, is the attribute's
most commonly occurring value. This value, known as the mode, is one of the measures of
central tendency.
Example: Suppose that hair_color and marital_status are two attributes describing person
objects. In our application, possible values for hair_color are black, brown, blond, red, auburn,
gray, and white. The attribute marital_status can take on the values single, married, divorced,
and widowed. Both hair_color and marital_status are nominal attributes. Another example of
a nominal attribute is occupation, with the values teacher, dentist, programmer, farmer, and
so on.
2. Binary Attributes
A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0
typically means that the attribute is absent, and 1 means that it is present. Binary attributes are
referred to as Boolean if the two states correspond to true and false.
Example: Given the attribute smoker describing a patient object, 1 indicates that the patient
smokes, while 0 indicates that the patient does not. Similarly, suppose the patient undergoes a
medical test that has two possible outcomes. The attribute medical_test is binary, where a
value of 1 means the result of the test for the patient is positive, while 0 means the result is
negative.
3. Ordinal Attributes

An ordinal attribute is an attribute with possible values that have a meaningful order or
ranking among them, but the magnitude between successive values is not known.
Example: Suppose that drink_size corresponds to the size of drinks available at a fast-food
restaurant. This ordinal attribute has three possible values: small, medium, and large. The
values have a meaningful sequence (which corresponds to increasing drink size); however, we
cannot tell from the values how much bigger, say, a large is than a medium. Other examples
of ordinal attributes include grade (e.g., A+, A, A-, B+, and so on) and professional_rank.
Professional ranks can be enumerated in a sequential order: for example, assistant, associate,
and full for professors, and private, private first class, specialist, corporal, and sergeant for
army ranks.
4. Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer
or real values. Numeric attributes can be interval-scaled or ratio-scaled.

• Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units. The values of
interval-scaled attributes have order and can be positive, 0, or negative. Thus, in
addition to providing a ranking of values, such attributes allow us to compare and
quantify the difference between values.

Example: A temperature attribute is interval-scaled. Suppose that we have the outdoor
temperature value for a number of different days, where each day is an object. By
ordering the values, we obtain a ranking of the objects with respect to temperature. In
addition, we can quantify the difference between values. For example, a temperature of
20°C is five degrees higher than a temperature of 15°C. Calendar dates are another
example. For instance, the years 2002 and 2010 are eight years apart.
• Ratio-Scaled Attributes

A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if a
measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of
another value. In addition, the values are ordered, and we can also compute the
difference between values, as well as the mean, median, and mode.
Example: Unlike temperatures in Celsius and Fahrenheit, the Kelvin (K) temperature
scale has what is considered a true zero-point (0 K = -273.15°C): it is the point at which
the particles that comprise matter have zero kinetic energy. Other examples of ratio-
scaled attributes include count attributes such as years_of_experience (e.g., the objects
are employees) and number_of_words (e.g., the objects are documents). Additional
examples include attributes to measure weight, height, speed, longitude, and
monetary quantities (e.g., you are 100 times richer with $100 than with $1).
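The difference between the two numeric scales can be checked with a short calculation. In this sketch the two temperatures are arbitrary example values: their difference is meaningful on the Celsius (interval) scale, but a ratio is only meaningful after converting to Kelvin (ratio scale).

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin (0 K = -273.15 degrees C)."""
    return celsius + 273.15


t1_c, t2_c = 10.0, 20.0  # arbitrary example temperatures in Celsius

difference = t2_c - t1_c    # meaningful on an interval scale
naive_ratio = t2_c / t1_c   # 2.0, but 20 C is NOT "twice as hot" as 10 C
true_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)  # ratio on the Kelvin scale

print(f"difference: {difference} degrees")
print(f"naive Celsius ratio: {naive_ratio}")
print(f"Kelvin ratio: {true_ratio:.3f}")
```

The Kelvin ratio is only slightly above 1, which shows how misleading a ratio taken on an interval scale can be.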

STATISTICAL DESCRIPTION OF DATA

For data preprocessing to be successful, it is essential to have an overall picture of your data. Basic
statistical descriptions can be used to identify properties of the data and highlight which data values
should be treated as noise or outliers.
Statistics is the field of mathematics that deals with the collection, tabulation, analysis,
interpretation, and presentation of numerical data. It is a form of mathematical analysis that uses
quantitative models to describe a set of experimental data or real-life studies, and it deals with how
data can be used to solve complex problems. Some people consider statistics to be a distinct
mathematical science rather than a branch of mathematics. Statistics simplifies work with data and
gives a clear picture of it.
Descriptive statistics describes a population or sample through numerical calculation, graphs, or
tables. It is used to summarize data. It has the following two categories:
• Measure of central tendency and
• Measure of Variability or measure of dispersion
1. Measure of central tendency
A measure of central tendency, also known as a summary statistic, is used to represent the
center point or a typical value of a data set or sample set. In statistics, there are three
common measures of central tendency as shown below:
• Mean
• Median and
• Mode
Mean: It is a measure of the average of all values in a sample set. It is found by adding all data
points and dividing by the number of data points.
Example: The mean of 4, 1, and 7 is (4 + 1 + 7) / 3 = 12 / 3 = 4.
Median: The middle number, found by ordering all data points and picking out the one in the
middle (or, if there are two middle numbers, taking the mean of those two numbers).
Example 1: The median of 4, 1, and 7 is 4 because when the numbers are put in order (1, 4, 7), the
number 4 is in the middle.
Example 2: Find the median of this data: {10, 40, 20, 50}
Put the data in order first: {10, 20, 40, 50}.
There is an even number of data points, so the median is the average of the middle two data
points.
Median = (20 + 40) / 2 = 60 / 2 = 30
The median is 30.
Mode: The mode for a set of data is the value that occurs most frequently in the set. Hence, it
can be calculated for both qualitative and quantitative attributes. A dataset may have more
than one mode. A dataset with two modes is known as bimodal. In general, a
dataset with two or more modes is known as multimodal.
Example: The mode of {4, 2, 4, 3, 2, 2} is 2 because it occurs three times, which is more than
any other number.
2. Measure of Variability / Measure of Dispersion
A measure of variability, also known as a measure of dispersion, is used to describe the variability
in a sample or population. In statistics, there are three common measures of variability as
shown below:
• Range
• Variance and
• Dispersion
Range
It is a measure of how spread apart the values in a sample set or data set are.
Range = Maximum value - Minimum value
Example: In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9.
So, the range is 9 - 3 = 6.

Variance
The variance is a measure of variability. It is calculated by taking the average of squared
deviations from the mean. The sample variance formula looks like this:

$s^2 = \frac{\sum (X - \bar{x})^2}{n - 1}$

where
• $s^2$ = sample variance
• $\sum$ = sum of ...
• $X$ = each value
• $\bar{x}$ = sample mean
• $n$ = number of values in the sample


Dispersion

Dispersion is the state of being dispersed or spread. Statistical dispersion means the extent to
which numerical data is likely to vary about an average value. In other words, dispersion
helps to understand the distribution of the data.

As the name suggests, the measure of dispersion shows the scattering of the data. It tells how
the data values vary from one another and gives a clear idea about the distribution of the data.
The measure of dispersion shows the homogeneity or the heterogeneity of the distribution of
the observations.
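The measures described in this section can be checked with Python's standard statistics module. The sketch below reuses the worked examples from the text; value_range is a hypothetical helper, since the module has no range function, and the bimodal data set is invented for illustration.

```python
import statistics


def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)


# Measures of central tendency, using the examples from the text
print("mean:", statistics.mean([4, 1, 7]))
print("median:", statistics.median([10, 40, 20, 50]))
print("mode:", statistics.mode([4, 2, 4, 3, 2, 2]))
print("modes of a bimodal set:", statistics.multimode([1, 1, 2, 2, 3]))

# Measures of variability
sample = [4, 6, 9, 3, 7]
sample_variance = statistics.variance(sample)  # divides by n - 1, as in the formula
print("range:", value_range(sample))
print("sample variance:", sample_variance)
```

For the sample {4, 6, 9, 3, 7} the mean is 5.8, the squared deviations sum to 22.8, and dividing by n - 1 = 4 gives a sample variance of 5.7.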

DATA MINING APPLICATIONS

Applications and their usage:
• Communications: Data mining techniques are used in the communication sector to predict
customer behavior and offer highly targeted and relevant campaigns.
• Insurance: Data mining helps insurance companies price their products profitably and
promote new offers to their new or existing customers.
• Education: Data mining helps educators access student data, predict achievement levels,
and find students or groups of students who need extra attention, for example students
who are weak in mathematics.
• Manufacturing: With the help of data mining, manufacturers can predict wear and tear of
production assets. They can anticipate maintenance, which helps them minimize downtime.
• Banking: Data mining helps the finance sector get a view of market risks and manage
regulatory compliance. It helps banks identify probable defaulters when deciding
whether to issue credit cards, loans, etc.
• Retail: Data mining techniques help retail malls and grocery stores identify the best-selling
items and arrange them in the most prominent positions. They help store owners come up
with offers that encourage customers to increase their spending.
• Service Providers: Service providers such as mobile phone and utility companies use data
mining to predict when and why a customer is likely to leave them. They analyze billing
details, customer service interactions, and complaints made to the company to assign
each customer a probability score and offer incentives.
• E-Commerce: E-commerce websites use data mining to offer cross-sells and up-sells through
their websites. One of the most famous names is Amazon, which uses data mining
techniques to get more customers into its e-commerce store.
• Supermarkets: Data mining allows supermarkets to develop rules that predict whether their
shoppers are likely to be expecting. By evaluating buying patterns, they can find women
customers who are most likely pregnant and start targeting products like baby
powder, baby soap, diapers, and so on.
• Crime Investigation: Data mining helps crime investigation agencies decide where to deploy
the police workforce (where is a crime most likely to happen and when?), whom to search
at a border crossing, etc.
• Bioinformatics: Data mining helps to mine biological data from massive datasets gathered in
biology and medicine.

CHALLENGES OF DATA MINING

Nowadays, data mining and knowledge discovery are evolving into a crucial technology for business and
researchers in many domains. Although data mining is developing into an established and trusted
discipline, many pending challenges still have to be solved. Some of these challenges are given below.
1. Security and Social Challenges
Decision-making strategies are carried out through data collection and sharing, so they require
considerable security. Private information about individuals and sensitive information are
collected for customer profiles and for understanding user behavior patterns. Illegal access to
information and the confidential nature of information are becoming important issues.

2. User Interface
The knowledge discovered using data mining tools is useful only if it is
interesting and, above all, understandable by the user. Good visualization eases the
interpretation of data mining results and helps users better understand their requirements.
To obtain good visualization, much research is being carried out on ways to display and
manipulate mined knowledge for big data sets.

a) Mining based on Level of Abstraction: The data mining process needs to be interactive
because this allows users to concentrate on pattern finding, presenting and refining
requests for data mining based on returned results.

b) Integration of Background Knowledge: Background knowledge may be used to
guide the discovery process and to express the discovered patterns.
3. Mining Methodology Challenges

These challenges are related to data mining approaches and their limitations. The factors
that cause these problems are:
a) Versatility of the mining approaches,
b) Diversity of data available,
c) Dimensionality of the domain,
d) Control and handling of noise in data, etc.

Different approaches may be implemented differently depending on data considerations. Some
algorithms require noise-free data. Most data sets contain exceptions and invalid or incomplete
information, which complicates the analysis process and in some cases compromises the
precision of the results.

4. Complex Data

Real-world data is heterogeneous and could be multimedia data containing images, audio
and video, complex data, temporal data, spatial data, time series, natural language text, etc. It
is difficult to handle these various kinds of data and extract the required information. New
tools and methodologies are being developed to extract relevant information.
a) Complex data types: The database can include complex data elements, objects with
graphical data, spatial data, and temporal data. It is not practical for one system to
mine all these kinds of data.

b) Mining from Varied Sources: The data is gathered from different sources on a network.
The data sources may be of different kinds depending on how the data are stored, such as
structured, semi-structured, or unstructured.
5. Performance

The performance of a data mining system depends on the efficiency of the algorithms and
techniques it uses. Algorithms and techniques that are not well designed
degrade the performance of the data mining process.

a) Efficiency and Scalability of the Algorithms: The data mining algorithm must be
efficient and scalable to extract information from huge amounts of data in the database.
b) Improvement of Mining Algorithms: Factors such as the enormous size of the
database, the entire data flow, and the difficulty of data mining approaches motivate the
creation of parallel and distributed data mining algorithms.

ADVANTAGES AND DISADVANTAGES OF DATA MINING

Advantages of Data Mining

• Data mining enables organizations to obtain knowledge-based data.
• Data mining enables organizations to make profitable modifications in operation and
production.
• Compared with other statistical data applications, data mining is cost-efficient.
• Data mining helps the decision-making process of an organization.
• It facilitates the automated discovery of hidden patterns as well as the prediction of
trends and behaviors.
• It can be introduced into new systems as well as existing platforms.
• It is a quick process that makes it easy for new users to analyze enormous amounts of
data in a short time.

Disadvantages of Data Mining

• There is a risk that organizations may sell useful customer data to other
organizations for money.
• Much data mining analytics software is difficult to operate and requires advanced training
to work with.
• Different data mining instruments operate in distinct ways due to the different
algorithms used in their design. Therefore, the selection of the right data mining tool is
a very challenging task.
• Data mining techniques are not perfectly precise, which may lead to severe consequences
in certain conditions.
Exercise
1. Define data mining. Also describe data mining functionalities. What are the different
functions of Data Mining?
2. What is KDD? Explain with a suitable block diagram.
3. What do you mean by data object and attribute types? Explain.
4. Describe statistical description of data.
5. Describe a data mining system with a suitable example.
6. Describe advantages and disadvantages of data mining.
7. Differentiate between KDD and Data Mining.
8. Explain the application of mining used in WWW.
9. Explain the data mining query language with example.
10. Differentiate between Data Warehouse and Data Mining. Explain the stages of knowledge
discovery in database with example.
11. What are foundations of data mining? What is the scope of data mining?
12. Give a brief introduction to the data mining process. What is a Model in the field of Data Mining?
13. Name areas of applications of data mining. What are the most significant advantages of Data
Mining?
14. Name the different Data Mining techniques and explain the scope of Data Mining.
15. What is the fundamental difference between Data Warehousing and Data Mining?
16. Explain the different stages of Data Mining. What are the most significant disadvantages of
Data Mining?
17. What are the common issues faced during Data Mining? What are the required technological
drivers in Data Mining?
18. What are the key features of Data Mining? What are the different fields where data mining is
used?
19. What are the different problems that "Data Mining" can solve?
20. How do Data Mining and Data Warehousing work together?
