Data Warehouse and Data Mining - Unit 2
DATA MINING
The major reason for using data mining techniques is the requirement of useful information and
knowledge from huge amounts of data. The information and knowledge gained can be used in many
applications such as business management, production control etc. Data mining came into existence
as a result of the natural evolution of information technology. The evolutionary path in the database
industry has developed the following functionalities. They are:
1970 to 1980: Database Management System
♦ Hierarchical and network database systems
♦ Relational database systems
♦ Data modeling: entity-relationship models etc.
♦ Query language: SQL, etc.
♦ Online transaction processing (OLTP)
1980 to present: Advanced Database System
♦ Advanced data models: extended-relational, object-relational etc.
♦ Integration of heterogeneous sources
♦ Managing uncertain data and data cleaning
♦ Web-based databases (XML, semantic web)
♦ Data streams and cyber-physical data systems
♦ Extremely large data management
♦ Database system tuning and adaptive systems
♦ Issues of data privacy and security
♦ Cloud computing and parallel data processing
1980 to present: Advanced Database System
♦ Data warehouse and OLAP
♦ Data mining and knowledge discovery
♦ Mining complex types of data
♦ Data mining applications
♦ Data mining and society
Present to future: Information system
The process of extracting information to identify patterns, trends, and useful data that would allow
the business to take data-driven decisions from huge sets of data is called Data Mining. In other
words, Data Mining is the process of investigating hidden patterns of information from various
perspectives and categorizing them into useful data. This data is collected and assembled in
particular areas such as data warehouses, where efficient analysis and data mining algorithms help
decision making and other data requirements, eventually cutting costs and generating revenue.
Data mining is the act of automatically searching large stores of information to find trends and
patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical
algorithms to segment data and evaluate the probability of future events. Data Mining is also called
Knowledge Discovery from Data (KDD). It is a process used by organizations to extract specific data
from huge databases to solve business problems. It primarily turns raw data into useful information.
The data mining process involves several components, and these components constitute a data
mining system architecture. The significant components of data mining systems are a data source,
data mining engine, data warehouse server, the pattern evaluation module, graphical user interface,
and knowledge base. The architecture of a typical data mining system may have the following major
components.
• Database, Data Warehouse, World Wide Web, or Other Information Repository: This
is one or a set of databases, data warehouses, spreadsheets, or other kinds of
information repositories. Data cleaning and data integration techniques may be
performed on the data.
• Database or Data Warehouse Server: The database or data warehouse server is
responsible for fetching the relevant data, based on the user's data mining request.
• Knowledge Base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. It is simply stored in the form of a set
of rules. Such knowledge can include concept hierarchies, used to organize attributes or
attribute values into different levels of abstraction.
• Data Mining Engine: This is essential to the data mining system and ideally consists of
a set of functional modules for tasks such as characterization, association and
correlation analysis, classification, prediction, cluster analysis, outlier analysis, and
evolution analysis.
Figure: Architecture of a typical data mining system. Components shown, from top to bottom: User
Interface; Pattern Evaluation; Data Mining Engine; Database or Data Warehouse Server; data
cleaning, integration and selection; with the Knowledge Base alongside.
There are a number of different data stores on which mining can be performed. These include
relational databases, data warehouses, transactional databases, advanced database systems, flat files,
and the World Wide Web. Advanced database systems include object-oriented and object-relational
databases, and specific application-oriented databases such as spatial databases, time-series
databases, text databases, and multimedia databases.
Relational Database
A relational database is a collection of multiple data sets formally organized by tables, records, and
columns from which data can be accessed in various ways without having to reorganize the database
tables. Tables convey and share information, which facilitates data searchability, reporting, and
organization.
A relational database is a collection of tables, each of which is assigned a unique name. Each table
consists of a set of attributes (columns or fields) and usually stores a large number of tuples (records
or rows). Each tuple in a relational table represents a record identified by a unique key and described
by a set of attribute values.
Example:
Student table: columns ID, Name, Address and Course-ID. The Course-ID column holds foreign keys
that refer to the Course table. Rows carry student IDs such as S-12, S-14 and S-11, and Course-ID
values such as C002, C021 and C321.
Course table:
Course-ID    Course-Name
C002         C++
C021         DBMS
C321         Account
Figure 2.3: Relationship between tables of relational database
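The relationship between the two tables can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table and column names follow the figure, but the student rows (IDs, names, addresses) are hypothetical sample data, not the rows of the original figure.

```python
import sqlite3

# In-memory database; every student row below is hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce foreign keys

conn.execute("""
    CREATE TABLE course (
        course_id   TEXT PRIMARY KEY,
        course_name TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE student (
        student_id TEXT PRIMARY KEY,
        name       TEXT NOT NULL,
        address    TEXT,
        course_id  TEXT REFERENCES course(course_id)
    )
""")

conn.executemany(
    "INSERT INTO course VALUES (?, ?)",
    [("C002", "C++"), ("C021", "DBMS"), ("C321", "Account")],
)
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?)",
    [
        ("S-01", "Student A", "City X", "C002"),
        ("S-02", "Student B", "City Y", "C021"),
        ("S-03", "Student C", "City X", "C002"),
    ],
)

# Follow the foreign key from each student to the matching course row.
rows = conn.execute("""
    SELECT s.name, c.course_name
    FROM student AS s
    JOIN course AS c ON c.course_id = s.course_id
    ORDER BY s.student_id
""").fetchall()

for name, course_name in rows:
    print(name, "->", course_name)
```

The join works because each Course-ID value in the student table matches exactly one key in the course table, which is the relationship the figure depicts.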
Data Warehouses
A Data Warehouse is the technology that collects data from various sources within the
organization to provide meaningful business insights. The huge amount of data comes from multiple
places such as Marketing and Finance. The extracted data is utilized for analytical purposes and
helps in decision-making for a business organization. The data warehouse is designed for the
analysis of data rather than transaction processing.
A data warehouse is usually modeled by a multidimensional database structure, where each dimension
corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some
aggregate measure, such as count or sales amount. The actual physical structure of a data warehouse
may be a relational data store or a multidimensional data cube. It provides a multidimensional view of
data and allows the pre-computation and fast accessing of summarized data .
Figure: A multidimensional data cube. The axes include locations (such as Kawasoti and Dhangadi),
time (such as Q1) and product types (TV, PC, AP, SSD); each cell holds an aggregate value.
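The idea of pre-computing aggregates along chosen dimensions can be sketched in a few lines of Python. The records, the dimension names and the function `aggregate` are assumptions made for illustration; a real warehouse would use a dedicated OLAP engine rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical fact records: (location, quarter, product, units_sold).
sales = [
    ("Kawasoti", "Q1", "TV", 605),
    ("Kawasoti", "Q1", "PC", 825),
    ("Dhangadi", "Q1", "TV", 300),
    ("Dhangadi", "Q2", "PC", 400),
    ("Kawasoti", "Q2", "TV", 150),
]

DIMENSIONS = ("location", "quarter", "product")

def aggregate(records, group_by):
    """Sum units_sold over the dimensions named in group_by."""
    idx = [DIMENSIONS.index(d) for d in group_by]
    totals = defaultdict(int)
    for rec in records:
        key = tuple(rec[i] for i in idx)
        totals[key] += rec[3]
    return dict(totals)

# Two different views of the same cube.
by_product = aggregate(sales, ["product"])
by_location_quarter = aggregate(sales, ["location", "quarter"])

print(by_product)
print(by_location_quarter)
```

Each call corresponds to one summarized view of the cube: grouping by fewer dimensions gives a coarser summary, grouping by more gives a finer one.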
Data Repositories
A data repository generally refers to a destination for data storage. However, many IT
professionals use the term more specifically to refer to a particular kind of setup within an IT
structure, for example, a group of databases where an organization has kept various kinds of
information.
Object-Relational Database
Relational database systems have been widely used in business applications. With the advances of
database technology, various kinds of advanced database systems have emerged and are undergoing
development to address the requirements of new database applications. The new database
applications include handling spatial data (such as maps), engineering design data (such as the
design of buildings, system components, or integrated circuits), hypertext and multimedia data
(including text, image, video, and audio data), time-related data (such as historical records or stock
exchange data), and the World Wide Web (a huge, widely distributed information repository made
available by the Internet). These applications require efficient data structures. In response to these
needs, advanced database systems and specific application-oriented database systems have been
developed.
For data mining to be effective, much careful work is needed in defining the aims of the enterprise
data mining and then in selection, cleaning, transformation and perhaps separate storage of data that
is suitable for data mining. This separate database may be a data warehouse that stores information
that is needed by the enterprise's decision makers. Such a database should include not just current
data but also historical data.
Data Mining is a process of discovering various models, summaries, and derived values from a given
collection of data. As noted above, the data mining process involves much hard work. We discuss two
different approaches now. The first approach is an adaptation of the well-known software
development process and the second one is the Cross-Industry Standard Process for Data Mining
(CRISP-DM) approach.
A typical data mining process is likely to include the following steps. These steps are based on a
typical software development process. It should be noted that all such processes are iterative and
work done at Step 1 may need to be revised based on some new information or new insight further
down in the process.
Figure: Steps of the data mining process, including "State the problem" and "Perform preprocessing".
3. Perform preprocessing
In the observational setting, data are usually collected from the existing databases, data
warehouses, and data marts. Data preprocessing usually includes at least two common tasks:
a) Outlier detection (and removal): Outliers are unusual data values that are not
consistent with most observations. Commonly, outliers result from measurement
errors, coding and recording errors, and, sometimes, are natural, abnormal values.
Such nonrepresentative samples can seriously affect the model produced later. There
are two strategies for dealing with outliers: (i) detect and eventually remove outliers as
a part of the preprocessing phase, or (ii) develop robust modeling methods that are
insensitive to outliers.
b) Scaling, encoding, and selecting features: Data preprocessing includes several steps
like variable scaling and different types of encoding. For instance, one feature with
range [0, 1] and another with range [100, 1000] will not have an equivalent weight within
the applied technique. They will also influence the final data-mining results
differently. Therefore, it is recommended to scale them and bring both features to an
equivalent weight for further analysis. Also, application-specific encoding methods
usually achieve dimensionality reduction by providing a smaller number of
informative features for subsequent data modeling.
These two classes of preprocessing tasks are only illustrative samples of a large spectrum
of preprocessing activities during a data-mining process. Data-preprocessing steps should not
be considered completely independent from other data-mining phases. In every iteration of the
data-mining process, all activities, together, could define new and improved data sets for
subsequent iterations. Generally, a good preprocessing method provides an optimal
representation for a data-mining technique by incorporating a priori knowledge in the form of
application-specific scaling and encoding.
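The scaling step can be illustrated with min-max normalization, one common way (among several) to bring two features to an equivalent weight. The function name `min_max_scale` and the sample values are assumptions made for this sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

feature_a = [0.1, 0.5, 0.9]     # hypothetical feature already in [0, 1]
feature_b = [100, 550, 1000]    # hypothetical feature in [100, 1000]

scaled_b = min_max_scale(feature_b)
print(scaled_b)
```

After scaling, both features vary over the same interval, so a distance-based technique no longer lets the feature with the larger raw range dominate.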
4. Estimate the model
The selection and implementation of the appropriate data-mining technique is the main task
in this phase. This process is not straightforward; usually, in practice, the implementation is
based on several models, and selecting the best one is an additional task.
5. Interpret the model and draw conclusions
In most cases, data-mining models should help in decision making. Hence, such models need
to be interpretable in order to be useful because humans are not likely to base their decisions
on complex black-box models. Note that the goals of accuracy of the model and accuracy of its
interpretation are somewhat contradictory. Usually, simple models are more interpretable,
but they are also less accurate. Modern data-mining methods are expected to yield highly
accurate results using high-dimensional models. The problem of interpreting these models,
also very important, is considered a separate task, with specific techniques to validate the
results. A user does not want hundreds of pages of numeric results. He does not understand
them; he cannot summarize, interpret, and use them for successful decision making.
CRISP-DM Approach
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is claimed to be more
practical and successful, and is widely adopted and used by industry members. This model
consists of six phases intended as a cyclical process (see figure 2.7):
Figure 2.7: The CRISP-DM cycle. Phases: Business Understanding, Data Understanding, Data
Preparation, Modeling, Evaluation and Deployment, arranged around the Data.
Data mining is not an easy task, as the algorithms used can get very complex and data is not always
available at one place. It needs to be integrated from various heterogeneous data sources. These
factors also create some issues. Data mining systems face a lot of challenges and issues in today's
world. Some of them are:
1. Mining methodology and user interaction issues
2. Performance issues
3. Issues relating to the diversity of database types
Data Mining Issues
Mining methodology and user interaction issues:
• Mining different kinds of knowledge in database
• Interactive mining of knowledge at multiple levels of abstraction
• Incorporation of background knowledge
• Data mining query languages and ad hoc data mining
• Presentation and visualization of data mining results
• Handling noisy or incomplete data
• Pattern evaluation
Performance issues:
• Efficiency and scalability of data mining algorithms
• Parallel, distributed and incremental mining algorithms
Issues relating to the diversity of database types:
• Handling of relational and complex types of data
• Mining information from heterogeneous database and global information systems
Figure 2.8: Data mining issues
a) Handling of relational and complex types of data: Because relational databases and
data warehouses are widely used, the development of efficient and effective data
mining systems for such data is important. However, other databases may contain
complex data objects, hypertext and multimedia data, spatial data, temporal data, or
transaction data.
Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks.
Data mining functions are used to define the trends or correlations contained in data mining
activities. Data mining tasks can be classified into two categories: descriptive and predictive.
1. Descriptive Task: Descriptive mining tasks characterize the general properties of the data in
the database. They provide knowledge for understanding what is happening within the data
without a previous idea. The common data features are highlighted in the data set, for
example: count, average etc.
2. Predictive Task: Predictive mining tasks perform inference on the current data in order to
make predictions. They supply values for attributes whose labels are unknown: based on
previous tests, the software estimates the characteristics that are absent. For example: judging
from the findings of a patient's medical examinations whether he is suffering from any
particular disease.
There are a number of data mining functionalities that the organized and scientific methods offer.
Major data mining functionalities are described as follows:
1. Class/Concept Descriptions: Characterization and Discrimination
Data can be associated with classes or concepts. It can be helpful to describe individual classes
and concepts in summarized, descriptive and yet accurate ways. It is important to link data
with groups or related items. For example, computers and printers are classes of goods for sale
in the Hardware Shop. Such descriptions of a class or a concept are referred to as class/concept
descriptions.
• Data Characterization: The characterization of data is a summary of the general
characteristics or features of the objects in a target class (the class under study), which
produces what is called a characteristic rule. The data for the user-specified class are
typically collected by a database query and then summarized by predefined modules at
various abstraction levels. For example, to study the characteristics of a software product
whose sales increased by 15% two years ago, one can collect the data related to such
products by running SQL queries.
• Data Discrimination: It compares the general features of the class under study (the target
class) with the general features of one or more contrasting classes. The output of this
process can be represented in many forms, e.g., bar charts, curves and pie charts.
2. Mining Frequent Patterns
Frequent patterns are patterns that occur frequently in data, that is, the things found to be
most common in the data. There are different kinds of frequent patterns that can be observed
in a dataset.
• Frequent Itemset: This refers to a set of items that are often seen together, e.g., milk
and sugar.
• Frequent Subsequence: This refers to a sequence of events that often occurs in the same
order, such as purchasing a phone followed by a back cover.
• Frequent Substructure: It refers to different kinds of structural forms, such as trees
and graphs, that may be combined with itemsets or subsequences.
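A frequent itemset can be found by simply counting how often items appear together. The sketch below counts item pairs by brute force, which is adequate for a toy example but is not an efficient algorithm such as Apriori. The transactions and the threshold `MIN_SUPPORT` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"milk", "tea"},
    {"sugar", "tea", "milk"},
    {"bread", "tea"},
]
MIN_SUPPORT = 3  # a pair is "frequent" if it occurs in at least 3 baskets

pair_counts = Counter()
for basket in transactions:
    # Sort so that each pair is always counted under the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= MIN_SUPPORT}
print(frequent_pairs)
```

With these baskets only the pair (milk, sugar) reaches the threshold, matching the milk-and-sugar example in the text.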
3. Association Analysis
The process involves uncovering the relationship between data and deciding the rules of the
association. It is a way of discovering the relationship between various items. For example, it
can be used to determine the sales of items that are frequently purchased together.
4. Correlation Analysis
Correlation is a mathematical technique that can show whether and how strongly pairs of
attributes are related to each other. For example, taller people tend to weigh more.
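One common way to quantify such a relationship is the Pearson correlation coefficient, which ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship). The text does not name a specific measure, so the choice of Pearson's coefficient is an assumption, and the height and weight values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

heights_cm = [150, 160, 170, 180, 190]   # hypothetical sample
weights_kg = [50, 58, 66, 75, 85]        # hypothetical sample

r = pearson(heights_cm, weights_kg)
print(round(r, 3))
```

A value of r close to +1 indicates that weight rises almost linearly with height in this sample.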
5. Classification and Prediction
Classification is the process of finding a model (or function) that describes and distinguishes
data classes or concepts. Mainly it is used to predict the class of objects whose class label is
unknown. The derived model is based on the analysis of a set of training data. Data objects
whose class label is known are considered training data. The derived model may be
represented in various forms, such as classification (IF-THEN) rules, decision trees,
mathematical formulae, neural networks etc.
For example, a decision tree performs the classification in the form of a tree structure. It breaks
down the dataset into smaller and smaller subsets while the associated decision tree is built
up step by step. The final result is a tree with decision nodes and leaf nodes.
The following decision tree can be designed to declare a result, whether an applicant is
eligible or not eligible to get the driving license.
Figure: Decision tree for the driving license decision, with the outcomes Yes and No.
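A decision tree of this kind can be read as a chain of IF-THEN rules. The sketch below hand-codes one such tree; the attributes (age, written test, road test) and the age threshold of 18 are hypothetical, since the tests used in the original figure are not given.

```python
def driving_license_decision(age, passed_written_test, passed_road_test):
    """Walk a small hand-built decision tree and return 'Yes' or 'No'."""
    if age < 18:                  # root node: hypothetical minimum-age check
        return "No"
    if not passed_written_test:   # second decision node
        return "No"
    if not passed_road_test:      # third decision node
        return "No"
    return "Yes"                  # leaf node: applicant is eligible

print(driving_license_decision(25, True, True))
print(driving_license_decision(16, True, True))
```

In practice the tree would be learned from training data with known class labels rather than written by hand; the hand-written version only shows how a learned tree is applied to a new applicant.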
6. Cluster Analysis
Unlike classification and prediction, which analyze class-labeled data objects, clustering
analyzes data objects without consulting a known class label. Clustering can be used to
generate such labels. The objects are clustered or grouped based on the principle of
maximizing the intraclass similarity and minimizing the interclass similarity. That is, clusters
of objects are formed so that objects within a cluster have high similarity in comparison to one
another, but are very dissimilar to objects in other clusters.
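The grouping principle can be illustrated with k-means, one widely used clustering method (the text does not prescribe a particular algorithm). The sketch works on one-dimensional points with fixed starting centers so that the run is repeatable; the data and the function name `k_means_1d` are assumptions.

```python
def k_means_1d(points, centers, max_iter=100):
    """Cluster 1-D points around the given initial centers (Lloyd's algorithm)."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # centers stopped moving, so the clustering is stable
        centers = new_centers
    return centers, clusters

points = [1, 2, 3, 10, 11, 12]   # hypothetical unlabeled data
centers, clusters = k_means_1d(points, centers=[1, 2])
print(centers, clusters)
```

No class labels are supplied; the two groups emerge only from the distances between the points, and the resulting cluster membership can then serve as a generated label.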
7. Outlier Analysis
A database may contain data objects that do not comply with the general behavior or model of
the data. These data objects are outliers. Most data mining methods discard outliers as noise
or exceptions. However, in some applications such as fraud detection, the rare events can be
more interesting than the more regularly occurring ones. Outliers may be detected using
statistical tests that assume a distribution or probability model for the data, or using distance
measures where objects that are a substantial distance from any other cluster are considered
outliers. For example, Outlier analysis may uncover fraudulent usage of credit cards by
detecting purchases of extremely large amounts for a given account number in comparison to
regular charges incurred by the same account.
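The credit card example can be sketched with a simple robust statistical test. The version below flags values that lie far from the median, measured in units of the median absolute deviation (MAD); this is one possible test chosen for illustration, and the charges and the threshold of 5 are hypothetical.

```python
from statistics import median

def mad_outliers(values, threshold=5.0):
    """Return values whose distance from the median exceeds threshold * MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all, so nothing can be ranked as unusual
    return [v for v in values if abs(v - med) / mad > threshold]

# Hypothetical charges on one account; the last one is unusually large.
charges = [25, 40, 32, 28, 35, 30, 2500]
suspicious = mad_outliers(charges)
print(suspicious)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme charge would otherwise inflate the very statistics used to detect it.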
8. Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior
changes over time. Distinct features of such an analysis include time-series data analysis,
sequence or periodicity pattern matching, and similarity-based data analysis. For example,
you have the major stock market (time-series) data of the last several years available from the
Nepal Stock Exchange and you would like to invest in shares of high-tech industrial
companies. A data mining study of stock exchange data may identify stock evolution
regularities for overall stocks and for the stocks of particular companies. Such regularities may
help predict future trends in stock market prices, contributing to your decision-making
regarding stock investments.
KNOWLEDGE DISCOVERY IN DATABASES (KDD)
Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a
collection of data. This widely used data mining technique is a process that includes data preparation
and selection, data cleansing, incorporating prior knowledge on data sets and interpreting accurate
solutions from the observed results. Major KDD application areas include marketing, fraud
detection, telecommunication and manufacturing. Data mining is the core part of the knowledge
discovery process. KDD is a process of finding knowledge in data; it does this by using data mining
methods (algorithms) in order to extract the required knowledge from large amounts of data.
Figure 2.11: Knowledge Discovery Process (KDP)
Simply stated, data mining refers to extracting or "mining" knowledge from large amounts of data
stored in databases, data warehouses, or other information repositories. Many people treat data
mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD.
Alternatively, others view data mining as simply an essential step in the process of knowledge
discovery. Knowledge discovery consists of an iterative sequence of the following steps:
• Data cleaning: The first step in the Knowledge Discovery Process is data cleaning, in which
noise and inconsistent data are removed.
o Cleaning in case of missing values.
o Cleaning noisy data, where noise is a random or variance error.
o Cleaning with data discrepancy detection and data transformation tools.
• Data Integration: Data integration is defined as combining heterogeneous data from multiple
sources into a common source (data warehouse).
o Data integration using data migration tools.
o Data integration using data synchronization tools.
o Data integration using the ETL (Extract-Transform-Load) process.
• Data Selection: Data selection is defined as the process where data relevant to the
analysis is decided and retrieved from the database.
o Data selection using neural network.
o Data selection using Decision Trees.
o Data selection using Naive Bayes.
o Data selection using Clustering, Regression, etc.
• Data Transformation: In Data Transformation, data are transformed into forms
appropriate for mining by performing summary or aggregation operations. Data
Transformation is a two-step process:
o Data Mapping: Assigning elements from source base to destination to capture
transformations.
o Code generation: Creation of the actual transformation program.
• Data Mining: In Data Mining, data mining methods (algorithms) are applied in order
to extract data patterns.
o Transforms task-relevant data into patterns.
o Decides purpose of model using classification or characterization.
• Pattern Evaluation: In Pattern Evaluation, the patterns that represent knowledge are
identified based on some interestingness measures.
o Find interestingness score of each pattern.
o Uses summarization and visualization to make data understandable by user.
• Knowledge Presentation: In Knowledge Presentation, knowledge is presented to the
user using many knowledge representation techniques.
o Generate reports.
o Generate tables.
o Generate discriminant rules, classification rules, characterization rules, etc.
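The sequence of steps can be sketched as a chain of small Python functions. Everything here is a toy illustration under stated assumptions: the two source lists, the field names and the helper functions are hypothetical, and each step is reduced to its simplest possible form.

```python
from collections import Counter

# Two hypothetical sources with slightly different conventions.
store_a = [
    {"customer": "c1", "item": "Milk", "amount": 120.0},
    {"customer": "c2", "item": "sugar", "amount": None},   # missing value
]
store_b = [
    {"customer": "c3", "item": "milk", "amount": 80.0},
    {"customer": "c4", "item": "TEA", "amount": 60.0},
]

def clean(records):
    """Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: normalise item names to lower case."""
    return [{**r, "item": r["item"].lower()} for r in records]

def mine(records):
    """Data mining: count how often each item occurs."""
    return Counter(r["item"] for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet an interestingness threshold."""
    return {item: n for item, n in patterns.items() if n >= min_count}

data = integrate(clean(store_a), clean(store_b))
data = transform(select(data, ["item", "amount"]))
knowledge = evaluate(mine(data), min_count=2)
print(knowledge)  # knowledge presentation, here a plain printout
```

Because the process is iterative, a real project would revisit earlier steps (for example, a different cleaning rule) after seeing which patterns survive the evaluation.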
Data sets are made up of data objects. A data object represents an entity: in a sales database, the
objects may be customers, store items, and sales; in a medical database, the objects may be patients;
in a university database, the objects may be students, professors, and courses. Data objects are
typically described by attributes. Data objects can also be referred to as samples, examples, instances,
data points, or objects. If the data objects are stored in a database, they are data tuples. That is, the
rows of a database correspond to the data objects, and the columns correspond to the attributes.
An attribute is a data field, representing a characteristic or feature of a data object. The nouns
attribute, dimension, feature, and variable are often used interchangeably in the literature. The term
dimension is commonly used in data warehousing. Machine learning literature tends to use the term
feature, while statisticians prefer the term variable. Data mining and database professionals
commonly use the term attribute, and we do here as well. Attributes describing a customer object can
include, for example, customer_ID, name, and address. Observed values for a given attribute are
known as observations. A set of attributes used to describe a given object is called an attribute vector
(or feature vector). The distribution of data involving one attribute (or variable) is called univariate.
A bivariate distribution involves two attributes, and so on.
The type of an attribute is determined by the set of possible values (nominal, binary, ordinal, or
numeric) the attribute can have. In the following subsections, we introduce each type.
1. Nominal Attributes
Nominal means "relating to names." The values of a nominal attribute are symbols or names
of things. Each value represents some kind of category, code, or state, and so nominal
attributes are also referred to as categorical. The values do not have any meaningful order. In
computer science, the values are also known as enumerations.
Because nominal attribute values do not have any meaningful order about them and are not
quantitative, it makes no sense to find the mean (average) value or median (middle) value for
such an attribute, given a set of objects. One thing that is of interest, however, is the attribute's
most commonly occurring value. This value, known as the mode, is one of the measures of
central tendency.
Example: Suppose that hair_color and marital_status are two attributes describing person
objects. In our application, possible values for hair_color are black, brown, blond, red, auburn,
gray, and white. The attribute marital_status can take on the values single, married, divorced,
and widowed. Both hair_color and marital_status are nominal attributes. Another example of
a nominal attribute is occupation, with the values teacher, dentist, programmer, farmer, and
so on.
2. Binary Attributes
A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0
typically means that the attribute is absent, and 1 means that it is present. Binary attributes are
referred to as Boolean if the two states correspond to true and false.
Example: Given the attribute smoker describing a patient object, 1 indicates that the patient
smokes, while 0 indicates that the patient does not. Similarly, suppose the patient undergoes a
medical test that has two possible outcomes. The attribute medical_test is binary, where a
value of 1 means the result of the test for the patient is positive, while 0 means the result is
negative.
3. Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a meaningful order or
ranking among them, but the magnitude between successive values is not known.
Example: Suppose that drink_size corresponds to the size of drinks available at a fast-food
restaurant. This ordinal attribute has three possible values: small, medium, and large. The
values have a meaningful sequence (which corresponds to increasing drink size); however, we
cannot tell from the values how much bigger, say, a large is than a medium. Other examples
of ordinal attributes include grade (e.g., A+, A, A-, B+, and so on) and professional_rank.
Professional ranks can be enumerated in a sequential order: for example, assistant, associate,
and full for professors, and private, private first class, specialist, corporal, and sergeant for
army ranks.
4. Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer
or real values. Numeric attributes can be interval-scaled or ratio-scaled.
• Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units. The values of
interval-scaled attributes have order and can be positive, 0, or negative. Thus, in
addition to providing a ranking of values, such attributes allow us to compare and
quantify the difference between values.
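The attribute types differ in which summary operations make sense for them. The short sketch below shows one meaningful operation per type; all sample values, including the use of temperature in degrees Celsius as an interval-scaled attribute, are hypothetical illustrations.

```python
from statistics import mode

hair_color = ["black", "brown", "black", "red"]               # nominal
smoker = [1, 0, 0, 1]                                         # binary
drink_size = ["small", "large", "medium", "small", "large"]   # ordinal
temperature_c = [20, 25, 15]                                  # interval-scaled

# Nominal: only the most frequent value (the mode) is meaningful.
most_common_hair = mode(hair_color)

# Binary: the share of objects for which the attribute is present.
smoker_share = sum(smoker) / len(smoker)

# Ordinal: map values to ranks so that order-based statistics can be taken.
SIZE_ORDER = ["small", "medium", "large"]
ranks = sorted(SIZE_ORDER.index(s) for s in drink_size)
median_size = SIZE_ORDER[ranks[len(ranks) // 2]]

# Interval-scaled: differences between values are meaningful.
temperature_gap = temperature_c[1] - temperature_c[0]

print(most_common_hair, smoker_share, median_size, temperature_gap)
```

Note that the median is available for the ordinal attribute because its values can be ranked, whereas the nominal attribute supports only the mode.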
For data preprocessing to be successful, it is essential to have an overall picture of your data. Basic
statistical descriptions can be used to identify properties of the data and highlight which data values
should be treated as noise or outliers.
Statistics is a field of mathematics that deals with the collection, tabulation, analysis, interpretation,
and presentation of numerical data. It is a form of mathematical analysis that uses quantitative
models to describe a set of experimental data or real-life studies, and it deals with how data can be
used to solve complex problems. Some people consider statistics to be a distinct mathematical science
rather than a branch of mathematics. In practice, statistics gives a clear, summarized picture of the
data at hand.
Descriptive statistics describes a population or sample through numerical calculations, graphs, or
tables. It is used to summarize the data. There are two categories of descriptive measures:
• Measure of central tendency and
• Measure of variability or measure of dispersion
1. Measure of central tendency
A measure of central tendency is a summary statistic that represents the center point or a
typical value of a data set or sample set. In statistics, there are three common measures of
central tendency as shown below:
• Mean
• Median and
• Mode
Mean: It is a measure of the average of all values in a sample set. It is found by adding all data
points and dividing by the number of data points.
Example: The mean of 4, 1, and 7 is (4 + 1 + 7) / 3 = 12 / 3 = 4.
Median: The middle number, found by ordering all data points and picking out the one in the
middle (or, if there are two middle numbers, taking the mean of those two numbers).
Example 1: The median of 4, 1, and 7 is 4 because when the numbers are put in order (1, 4, 7), the
number 4 is in the middle.
Example 2: When a data set has an even number of points and its two middle values are 20 and 40,
Median = (20 + 40) / 2 = 60 / 2 = 30.
The median is 30.
Mode: The mode for a set of data is the value that occurs most frequently in the set. Hence, it
can be calculated for both qualitative and quantitative attributes. A dataset may have two
values that occur equally often; such datasets are known as bimodal. In general, a
dataset with two or more modes is known as multimodal.
Example: The mode of {4, 2, 4, 3, 2, 2} is 2 because it occurs three times, which is more than
any other number.
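The mode can be sketched in Python with a frequency count. The function name `modes` and the colour data are assumptions for illustration; the function returns a list so that bimodal and multimodal data sets are handled, and it works for qualitative values as well as numbers.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


print(modes([4, 2, 4, 3, 2, 2]))                        # [2]
# Hypothetical qualitative data with two modes (bimodal):
print(modes(["red", "blue", "red", "blue", "green"]))   # ['red', 'blue']
```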
2. Measure of Variability / Measure of dispersion
A measure of variability, also known as a measure of dispersion, is used to describe
variability in a sample or population. In statistics, there are three common measures of
variability:
• Range
• Variance and
• Dispersion
Range
It is a measure of how spread apart the values in a sample set or data set are.
Range = Maximum value - Minimum value
Example: In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9.
So, the range is 9 - 3 = 6.
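As a small Python sketch of the range formula (the function name `value_range` is an assumption for this example):

```python
def value_range(values):
    """Return the maximum value minus the minimum value."""
    if not values:
        raise ValueError("the range of an empty data set is undefined")
    return max(values) - min(values)


# The example from the text: 9 - 3 = 6
print(value_range([4, 6, 9, 3, 7]))  # 6
```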
Variance
The variance is a measure of variability. It is calculated by taking the average of squared
deviations from the mean. The sample variance formula looks like this:
$s^2 = \frac{\sum (X - \bar{x})^2}{n - 1}$
Formula Explanation
• $s^2$ = sample variance
• $\sum$ = sum of ...
• $X$ = each value
• $\bar{x}$ = sample mean
• $n$ = number of values in the sample
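The sample variance formula can be checked with a short Python sketch. The function name `sample_variance` is an assumption for this example, and the data set 4, 1, 7 is reused from the mean example: its mean is 4, the squared deviations are 0, 9 and 9, and their sum divided by n - 1 = 2 gives 9.

```python
def sample_variance(values):
    """Return the sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two data points")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


print(sample_variance([4, 1, 7]))  # 9.0
```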
Dispersion is the state of being dispersed or spread. Statistical dispersion means the extent
to which numerical data is likely to vary about an average value. In other words, dispersion
helps to understand the distribution of the data.
As the name suggests, the measure of dispersion shows the scattering of the data. It tells how
the data values vary from one another and gives a clear idea about the distribution of the data.
The measure of dispersion shows the homogeneity or the heterogeneity of the distribution of
the observations.
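To illustrate homogeneity versus heterogeneity, the sketch below compares two hypothetical data sets that share the same mean but are scattered very differently. The data values and the function name `describe_spread` are assumptions made for this example; `statistics.mean` and `statistics.variance` come from the Python standard library, and `statistics.variance` computes the sample variance (divisor n - 1).

```python
import statistics


def describe_spread(values):
    """Return (mean, sample variance) so that centre and scatter can be compared."""
    return statistics.mean(values), statistics.variance(values)


homogeneous = [9, 10, 11]     # hypothetical data: values close to the mean
heterogeneous = [1, 10, 19]   # hypothetical data: values far from the mean

# Both data sets have the same mean, but the second has a much larger variance.
print(describe_spread(homogeneous))
print(describe_spread(heterogeneous))
```

A small variance indicates homogeneous observations clustered around the average, while a large variance indicates heterogeneous observations.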
Application: Usage
Communications: Data mining techniques are used in the communication sector to predict customer
behavior in order to offer highly targeted and relevant campaigns.
Insurance: Data mining helps insurance companies to price their products profitably and
promote new offers to their new or existing customers.
Education: Data mining helps educators to access student data, predict achievement levels
and find students or groups of students who need extra attention. For example,
students who are weak in mathematics.
Manufacturing: With the help of data mining, manufacturers can predict wear and tear of
production assets. They can anticipate maintenance, which helps them to minimize
downtime.
Banking: Data mining helps the finance sector to get a view of market risks and manage
regulatory compliance. It helps banks to identify probable defaulters to decide
whether to issue credit cards, loans, etc.
Retail: Data mining techniques help retail malls and grocery stores identify and arrange the
most sellable items in the most prominent positions. It helps store owners to come up
with offers that encourage customers to increase their spending.
Service Providers: Service providers such as mobile phone and utility companies use data mining
to predict the reasons why a customer leaves their company. They analyze billing
details, customer service interactions, and complaints made to the company to assign
each customer a probability score and offer incentives.
E-Commerce: E-commerce websites use data mining to offer cross-sells and up-sells through their
websites. One of the most famous names is Amazon, which uses data mining
techniques to get more customers into its e-commerce store.
Super Markets: Data mining allows supermarkets to develop rules to predict whether their shoppers
are likely to be expecting. By evaluating buying patterns, they can find women
customers who are most likely pregnant. They can start targeting products like baby
powder, baby shop, diapers and so on.
Crime Investigation: Data mining helps crime investigation agencies to deploy the police workforce
(where is a crime most likely to happen and when?), decide who to search at a border
crossing, etc.
Bioinformatics: Data mining helps to mine biological data from massive datasets gathered in
biology and medicine.
2. User Interface
The knowledge discovered using data mining tools is useful only if it is
interesting and, above all, understandable by the user. Good visualization eases the
interpretation of mining results and helps users better understand their requirements. To
obtain good visualization, much research is being carried out on displaying and
manipulating mined knowledge for big data sets.
These challenges are related to data mining approaches and their limitations. The aspects of
mining approaches that cause problems are:
a) Versatility of the mining approaches,
b) Diversity of data available,
c) Dimensionality of the domain.
4. Complex Data
b) Mining from Varied Sources: The data is gathered from different sources on a network.
The data sources may be of different kinds depending on how the data is stored, such as
structured, semi-structured or unstructured.
5. Performance
The performance of a data mining system depends on the efficiency of the algorithms and
techniques it uses. Algorithms and techniques that are not up to the mark
degrade the performance of the data mining process.
a) Efficiency and Scalability of the Algorithms: The data mining algorithm must be
efficient and scalable to extract information from huge amounts of data in the database.
b) Improvement of Mining Algorithms: Factors such as the enormous size of the
database, the entire data flow and the difficulty of data mining approaches motivate the
creation of parallel and distributed data mining algorithms.
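The idea behind parallel and distributed mining can be sketched as "partition the data, mine each partition independently, then merge the partial results". The Python sketch below counts how many transactions contain each item. It is only an illustration under stated assumptions: the function names, the transaction data and the use of threads are all hypothetical choices for this example, and a real system would typically use separate processes or a cluster rather than threads.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Count, for one partition, how many transactions contain each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))  # count each item once per transaction
    return counts


def parallel_item_counts(transactions, workers=4):
    """Split the transactions into partitions, count each one, and merge the results."""
    size = max(1, -(-len(transactions) // workers))  # ceiling division
    partitions = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_items, partitions):
            total.update(partial)  # merge step: add the partial counts together
    return total


# Hypothetical market-basket data
baskets = [["milk", "bread"], ["milk"], ["bread", "eggs"], ["milk", "eggs"]]
print(parallel_item_counts(baskets, workers=2))
```

Because each partition is counted independently, the merge step only has to add the partial counts, which is what makes this kind of algorithm easy to scale out.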
• There is a possibility that organizations may sell useful customer data to other
organizations for money.
• Much data mining analytics software is difficult to operate and requires advanced training
to work with.
• Different data mining tools operate in distinct ways due to the different
algorithms used in their design. Therefore, the selection of the right data mining tool is
a very challenging task.
• Data mining techniques are not perfectly precise, so they may lead to severe consequences
in certain conditions.
Exercise
1. Define data mining. Also describe data mining functionalities. What are the different
functions of Data Mining?
2. What is KDD? Explain with suitable block diagram.
3. What do you mean by data object and attribute types? Explain.
4. Describe statistical description of data.
5. Describe data mining system with suitable example.
6. Describe advantages and disadvantages of data mining.
7. Differentiate between KDD and Data Mining.
8. Explain the application of mining used in WWW.
9. Explain the data mining query language with example.
10. Differentiate between Data Warehouse and Data Mining. Explain the stages of knowledge
discovery in database with example.
11. What are foundations of data mining? What is the scope of data mining?
12. Give a brief introduction to data mining process. What is a Model in the field of Data Mining?
13. Name areas of applications of data mining. What are the most significant advantages of Data
Mining?
14. Name the different Data Mining techniques and explain the scope of Data Mining.
15. What is the fundamental difference between Data Warehousing and Data Mining?
16. Explain the different stages of Data Mining. What are the most significant disadvantages of
Data Mining?
17. What are the common issues faced during Data Mining? What are the required technological
drivers in Data Mining?
18. What are the key features of Data Mining? What are the different fields where data mining is
used?
19. What are the different problems that "Data Mining" can solve?
20. How do Data Mining and Data Warehousing work together?