
UNIT – I (DWDM)
1. What is data mining?
Data mining might more appropriately have been named "knowledge mining from data," or
simply "knowledge mining." Many people treat data mining as a synonym for
another popularly used term, Knowledge Discovery from Data, or KDD.
Different views of Data mining

1) Data mining as simply an essential step in the process of knowledge discovery.

Fig 1: Data mining as a step in the process of knowledge discovery

The knowledge discovery process shown in the figure above consists of an iterative sequence of the
following steps (a compact code sketch follows the list):

1. Data cleaning (to remove noise and inconsistent data)


2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied in order to extract
data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on
some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present the mined knowledge to the user)
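
As an illustration only, the following minimal Python sketch walks through the same sequence on a toy sales table. The column names (region, price, units), the noisy value, and the pandas-based steps are assumptions for the example, not part of the text.

```python
# A minimal sketch of the KDD steps on a toy sales table using pandas.
# Column names (region, price, units) are illustrative, not from the text.
import pandas as pd

raw = pd.DataFrame({
    "region": ["North", "North", "South", None, "South"],
    "price":  [100.0, 98.0, 105.0, 9999.0, 101.0],   # 9999.0 is a noisy entry
    "units":  [3, 4, 2, 1, 5],
})

# 1-2. Data cleaning and integration: drop incomplete rows, remove an obvious outlier.
clean = raw.dropna()
clean = clean[clean["price"] < 1000]

# 3-4. Data selection and transformation: keep relevant columns, aggregate per region.
selected = clean[["region", "price", "units"]]
summary = selected.groupby("region").agg(avg_price=("price", "mean"),
                                         total_units=("units", "sum"))

# 5-7. "Mining" here is just a trivial summary; evaluation and presentation
# are a print-out. Real systems would apply association, classification, etc.
print(summary)
```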

2. Data mining should be applicable to any kind of data


1) Relational databases
2) Data warehouse
3) Transactional databases
4) Advanced databases
 Object oriented
 Object relational
 Application oriented databases
o Spatial
o Temporal
o Time-Series
o Text
o Multimedia databases
5) Flat files
6) WWW
1) Relational databases
 A Relational database is defined as the collection of data organized in tables
with rows and columns.
 Physical schema in Relational databases is a schema which defines the
structure of tables.
 Logical schema in Relational databases is a schema which defines the
relationship among tables.
 Standard API of relational database is SQL.
2) DWH:
 A data warehouse is defined as the collection of data integrated from multiple
sources that supports queries and decision making.
 There are three types of data warehouse: Enterprise data warehouse, Data Mart
and Virtual Warehouse.
 Two approaches can be used to update data in Data Warehouse: Query-driven
Approach and Update-driven Approach.
 Application: Business decision making, Data mining, etc.

3) Transactional databases:
 Transactional databases are collections of data organized by time stamps, dates,
etc., to represent transactions in databases.


 This type of database has the capability to roll back or undo its operation when
a transaction is not completed or committed.
 They are highly flexible systems where users can modify information without changing
any sensitive information.
 Follows ACID property of DBMS.

4) Advanced databases:
 Object oriented
 Object relational
 Application oriented databases
• Spatial
• Temporal
• Time-Series
• Text
• Multimedia databases
Multimedia Databases:
 Multimedia databases consist of audio, video, image, and text media.
 They can be stored in object-oriented databases.
 They are used to store complex information in pre-specified formats.
 Application: digital libraries, video-on-demand, news-on-demand, music databases, etc.

Spatial Database:
 Store geographical information.

 Stores data in the form of coordinates, topology, lines, polygons, etc.


 Application: Maps, Global positioning, etc.

Time-series Databases:
 Time-series databases contain stock exchange data, user-logged activities, and similar data.
 They handle arrays of numbers indexed by time, date, etc.
 They often require real-time analysis.
 Application: eXtremeDB, Graphite, InfluxDB, etc.


3. Data Mining Functionalities—What Kinds of Patterns Can Be Mined?


 Data mining functionalities are used to specify the kind of patterns to be found in data
mining tasks.
 Data mining tasks can be classified into two categories:
1) Descriptive
2) Predictive

 Descriptive tasks derive patterns that summarize the underlying relationships in the
data. Ex: correlations, trends, clusters, trajectories, and anomalies. These tasks are
explanatory in nature.
 Predictive tasks perform inference on the current data to make predictions, i.e., they
predict the value of a particular attribute based on the values of other attributes. Ex:
classification, regression.

Data mining functionalities, and the kinds of patterns they can discover, are described below:

1. Characterization and Discrimination
2. Association analysis
3. Classification and Prediction
4. Evolution analysis
5. Clustering
6. Outlier analysis

1. Concept/Class Description: Characterization and Discrimination

Data can be associated with classes or concepts.

Ex: In the AllElectronics store,

Classes of items for sale include computers and printers.
Concepts of customers include bigSpenders and budgetSpenders.

Summarized descriptions of a class or a concept are very useful. Such
descriptions of a class or a concept are called class/concept descriptions. These
descriptions can be derived via (1) data characterization, (2) data discrimination, or (3) both
data characterization and discrimination.

Data characterization is a summarization of the general characteristics or features of data.


Ex: study the characteristics of software products whose sales increased by 10% in the last year.

 Methods used for this are statistical measures, plots and OLAP operations.
 The output of data characterization can be presented in various forms.
Ex: pie charts, bar charts, curves, multidimensional data cubes, and multidimensional
tables,
Data discrimination is a comparison of the target class (the class under study) with one or a set
of comparative classes (called the contrasting classes).
Ex: the user may like to compare the general features of software products whose sales increased
by 10% in the last year with those whose sales decreased by at least 30% during the same period
 The methods used and the output presentation are the same as for characterization, although
discrimination descriptions should include comparative measures that help distinguish between the
target and contrasting classes.

2. Association Analysis
Association analysis is the discovery of association rules showing attribute-value
conditions that occur frequently together in a given set of data. This analysis is widely used for
market basket or transaction data analysis.
Association rules are of the form X => Y, interpreted as "database tuples that satisfy
the conditions in X are also likely to satisfy the conditions in Y".

Ex: Marketing manager of AllElectronics, would like to determine which items are frequently
purchased together within the same transactions. An example of such a rule, mined from the
AllElectronics transactional database
buys(X, "computer") => buys(X, "software") [support = 1%; confidence = 50%]

where X is a variable representing a customer. A confidence of 50% means that if a
customer buys a computer, there is a 50% chance that she will buy software as well, and a support of
1% means that 1% of all the transactions under analysis contain both computer and software.
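
A minimal sketch of how support and confidence could be computed for such a rule from a toy list of transactions (the item sets below are invented):

```python
# Hedged sketch: computing support and confidence for the rule
# buys(X, "computer") => buys(X, "software") from a toy transaction list.
transactions = [
    {"computer", "software", "printer"},
    {"computer"},
    {"software", "antivirus"},
    {"computer", "software"},
    {"printer"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
computer_only = sum(1 for t in transactions if "computer" in t)

support = both / n                  # fraction of all transactions containing both items
confidence = both / computer_only   # fraction of computer-buyers who also buy software

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```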
3. Classification and Prediction
Classification is the process of finding a model that describes and distinguishes data
classes or concepts, in order to predict the class of objects whose class label is unknown. The derived
model is based on the analysis of a set of training data (i.e., data objects whose class label is
known).
The derived model may be represented in various forms, such as classification (IF-
THEN) rules, decision trees, mathematical formulae, or neural networks

Fig: classification model can be represented in various forms, such as (a) IF-THEN rules,(b) a decision tree, or a (c) neural
network.

Ex: In AllElectronics, items can be classified into three classes (good response, mild response, and no
response) based on descriptive features of the items such as price, brand, place made, type,
and category.
Predicting missing or unavailable numeric data values is referred to as prediction.
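
The following sketch (assuming scikit-learn is available, with invented feature values and labels) shows a derived classification model in decision-tree form and its use to predict the class of an unlabeled item:

```python
# Hedged sketch of learning a classification model (a decision tree) from labeled
# training data; the feature values and class labels are invented for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [price, advertising_budget]; labels follow the AllElectronics example.
X_train = [[300, 10], [900, 80], [450, 30], [1200, 95], [200, 5], [800, 60]]
y_train = ["no response", "good response", "mild response",
           "good response", "no response", "mild response"]

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# The derived model can be inspected as IF-THEN style rules (here, a printed tree),
# and then used to predict the class of an item whose label is unknown.
print(export_text(model, feature_names=["price", "advertising_budget"]))
print(model.predict([[1000, 90]]))
```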

4. Evolution Analysis
It describes and models regularities or trends for objects whose behavior changes over
time. This may include characterization, discrimination, association and correlation analysis,
classification, prediction, and clustering.
Ex: Stock market data analysis to predict the future trends using previous years data for decision
making regarding stock investments.

5. Cluster Analysis
Cluster is a group of similar data points or objects for analysis. The objects within a
cluster have high similarities in comparison to one another but are very dissimilar to objects in
other clusters.

Ex: Cluster AllElectronics customer data with respect to customer locations in a city. These
clusters may represent individual target groups for marketing.

Fig: 2-D plot of customer data with respect to customer locations in a city, showing three data clusters.
Each cluster "center" is marked with a "+".
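
A minimal clustering sketch, assuming scikit-learn's k-means and made-up customer locations:

```python
# Hedged sketch: clustering 2-D customer locations into three groups with k-means.
# The coordinates are invented; any (x, y) location data would do.
from sklearn.cluster import KMeans

locations = [[1, 2], [1, 3], [2, 2],      # neighbourhood A
             [8, 8], [9, 8], [8, 9],      # neighbourhood B
             [5, 15], [6, 14], [5, 16]]   # neighbourhood C

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(locations)
print(km.labels_)           # cluster assignment of each customer
print(km.cluster_centers_)  # the "+" centers of the three clusters
```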
6. Outlier Analysis
A database may contain data objects that do not comply with the general behavior or
model of the data. These data objects are outliers (noise in the data). Outliers may be detected
using statistical tests.
Ex : Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of
extremely large amounts for a given account number in comparison to regular charges incurred
by the same account.
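
A small sketch of outlier detection with a simple statistical rule (Tukey's 1.5 × IQR fence); the charge amounts are invented:

```python
# Hedged sketch: flagging an unusually large credit-card charge with a simple
# statistical test (values beyond Q3 + 1.5*IQR are treated as outliers).
from statistics import quantiles

charges = [25, 27, 28, 29, 30, 32, 33, 35, 40, 3100]   # 3100 is the suspicious charge

q1, _, q3 = quantiles(charges, n=4)   # quartiles of the charge amounts
iqr = q3 - q1
outliers = [c for c in charges if c < q1 - 1.5 * iqr or c > q3 + 1.5 * iqr]
print(outliers)                       # -> [3100]
```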

4. Which technologies are used?: Data mining is an interdisciplinary field, the


confluence of a set of disciplines, including database systems, statistics, machine learning,
visualization, and information science. Because of the diversity of disciplines contributing to data
mining, data mining research is expected to generate a large variety of data mining systems. The main
contributing technologies are described below; data mining systems themselves can be categorized
according to various criteria, as discussed afterwards.


Statistics studies the collection, analysis, interpretation or explanation, and presentation of


data. Data mining has an inherent connection with statistics.

A statistical model is a set of mathematical functions that describe the behavior of the
objects in a target class in terms of random variables and their associated probability
distributions.

Statistical methods can also be used to verify data mining results. For example, after a
classification or prediction model is mined, the model should be verified by statistical
hypothesis testing. A statistical hypothesis test (sometimes called confirmatory data
analysis) makes statistical decisions using experimental data.

Machine learning investigates how computers can learn (or improve their performance)
based on data. A main research area is for computer programs to automatically learn to
recognize complex patterns and make intelligent decisions based on data.

Supervised learning is basically a synonym for classification. The supervision in the


learning comes from the labeled examples in the training data set.

Unsupervised learning is essentially a synonym for clustering. The learning process is


unsupervised since the input examples are not class labeled. Typically, we may use
clustering to discover classes within the data.

Semi-supervised learning is a class of machine learning techniques that make use of


both labeled and unlabeled examples when learning a model. In one approach, labeled
examples are used to learn class models and unlabeled examples are used to refine the
boundaries between classes.

Active learning is a machine learning approach that lets users play an active role in the
learning process. An active learning approach can ask a user (e.g., a domain expert) to
label an example, which may be from a set of unlabeled examples or synthesized by the
learning program.

Database systems research focuses on the creation, maintenance, and use of databases for
organizations and end-users. Particularly, database systems researchers have established
highly recognized principles in data models, query languages, query processing and
optimization methods, data storage, and indexing and accessing methods. Database systems
are often well known for their high scalability in processing very large, relatively structured
data sets.

Data warehouse integrates data originating from multiple sources and various timeframes. It
consolidates data in multidimensional space to form partially materialized data cubes. The
data cube model not only facilitates OLAP in multidimensional databases but also promotes
multidimensional data mining.

Information retrieval (IR) is the science of searching for documents or information in


documents. Documents can be text or multimedia, and may reside on the Web. The
differences between traditional information retrieval and database systems are twofold:
Information retrieval assumes that (1) the data under search are unstructured; and (2) the
queries are formed mainly by keywords, which do not have complex structures (unlike SQL
queries in database systems).

 Classification according to the kinds of databases mined:

Database systems can be classified according to different criteria (such as data


models, or the types of data or applications involved), each of which may require its own
data mining technique.
o classifying according to data models, we may have a relational, transactional,
object-relational, or data warehouse mining system.

o classifying according to the type of data handled, we may have a spatial, time-
series, text, stream data, or multimedia data mining system, or a World Wide Web
mining system.
 Classification according to the kinds of knowledge mined:
Data mining systems can be categorized according to the kinds of knowledge
they mine, that is, based on data mining functionalities, such as characterization,
discrimination, association and correlation analysis, classification, prediction, clustering,
outlier analysis, and evolution analysis.

 Classification according to the kinds of techniques utilized:


Data mining systems can be categorized according to the underlying data mining
techniques employed. These techniques can be described according to the
o degree of user interaction involved (e.g., autonomous systems, interactive
exploratory systems, query-driven systems)
or
o the methods of data analysis employed (e.g., database-oriented or data warehouse–
oriented techniques, machine learning, statistics, visualization, pattern recognition,
neural networks, and so on).

 Classification according to the applications adapted:

Data mining systems can also be categorized according to the applications they adapt.
For example, data mining systems may be tailored specifically for finance,
telecommunications, DNA, stock markets, e-mail, and so on.

5. Which Kinds of Applications Are Targeted?


It is impossible to enumerate all applications. We briefly discuss two highly successful and
popular application examples of data mining: business intelligence and search engines.
Business intelligence:
It is critical for businesses to acquire a better understanding of the commercial context of their
organization, such as their customers, the market, supply and resources, and competitors. Business
intelligence (BI) technologies provide historical, current, and predictive views of business operations.
Examples include reporting, online analytical processing, business performance management,
competitive intelligence, benchmarking, and predictive analytics.
Web search engine:
A Web search engine is a specialized computer server that searches for information on the
Web. The search results of a user query are often returned as a list (sometimes called hits). The hits
may consist of web pages, images, and other types of files. Some search engines also search and return
data available in public databases or open directories. Search engines differ from web directories in
that web directories are maintained by human editors whereas search engines operate algorithmically
or by a mixture of algorithmic and human input.

6. Major Issues in Data Mining

Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available at one place. It needs to be integrated from various heterogeneous data
sources. These factors also create some issues.

 Mining Methodology and User Interaction


 Performance Issues
 Diverse Data Types Issues
The following diagram describes the major issues.


Mining Methodology and User Interaction Issues


It refers to the following kinds of issues −

 Mining different kinds of knowledge in databases − Different users may be


interested in different kinds of knowledge. Therefore it is necessary for data mining
to cover a broad range of knowledge discovery task.

 Interactive mining of knowledge at multiple levels of abstraction − The data


mining process needs to be interactive because it allows users to focus the search for
patterns, providing and refining data mining requests based on the returned results.

 Incorporation of background knowledge − To guide discovery process and to


express the discovered patterns, the background knowledge can be used. Background
knowledge may be used to express the discovered patterns not only in concise terms
but at multiple levels of abstraction.

 Data mining query languages and ad hoc data mining − Data Mining Query
language that allows the user to describe ad hoc mining tasks, should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once the patterns are
discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.

 Handling noisy or incomplete data − The data cleaning methods are required to
handle the noise and incomplete objects while mining the data regularities. If the data
cleaning methods are not there then the accuracy of the discovered patterns will be
poor.

 Pattern evaluation − The patterns discovered should be interesting; patterns that merely
represent common knowledge or lack novelty are of little value, so interestingness measures
are needed to evaluate them.

Performance Issues
There can be performance-related issues such as follows −

 Efficiency and scalability of data mining algorithms − In order to effectively


extract information from huge amounts of data in databases, data mining algorithms
must be efficient and scalable.

 Parallel, distributed, and incremental mining algorithms − The factors such as


huge size of databases, wide distribution of data, and complexity of data mining
methods motivate the development of parallel and distributed data mining
algorithms. These algorithms divide the data into partitions, which are further
processed in a parallel fashion. Then the results from the partitions are merged.
Incremental algorithms update the existing mining results incrementally, without mining
the entire data again from scratch.

Diverse Data Types Issues


 Handling of relational and complex types of data − The database may contain
complex data objects, multimedia data objects, spatial data, temporal data etc. It is
not possible for one system to mine all these kinds of data.

 Mining information from heterogeneous databases and global information


systems − The data is available at different data sources on a LAN or WAN. These
data sources may be structured, semi-structured, or unstructured. Therefore, mining the
knowledge from them adds challenges to data mining.

7. Data Objects and Attribute Types


 Data sets are made up of data objects.
 A data object represents an entity—in a sales database, the objects may be customers, store
items, and sales; in a medical database, the objects may be patients; in a university database,
the objects may be students, professors, and courses.
 Data objects are typically described by attributes.
 Data objects can also be referred to as samples, examples, instances, data points, or objects.
 If the data objects are stored in a database, they are data tuples. That is, the rows of a database
correspond to the data objects, and the columns correspond to the attributes.

 An attribute is a data field, representing a characteristic or feature of a data object. The nouns attribute,
dimension, feature, and variable are often used interchangeably in the literature.
 The term dimension is commonly used in data warehousing. Machine learning literature tends to use
the term feature, while statisticians prefer the term variable. Data mining and database professionals
commonly use the term attribute, and we do here as well. Attributes describing a customer object can
include, for example, customer ID, name, and address. Observed values for a given attribute are known
as observations.
 A set of attributes used to describe a given object is called an attribute vector (or feature vector). The
distribution of data involving one attribute (or variable) is called univariate. A bivariate distribution
involves two attributes, and so on.
The type of an attribute is determined by the set of possible values it can take. The main attribute types are:
1. Nominal attribute
2. Binary attribute
3. Ordinal attribute
4. Numeric attribute


1. Nominal attribute:
Nominal means "relating to names." The values of a nominal attribute are symbols or names of things.
Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as
categorical. The values do not have any meaningful order. In computer science, the values are also known as
enumerations.
Ex: Suppose that hair color and marital status are two attributes describing person objects. In our
application, possible values for hair color are black, brown, blond, red, auburn, gray, and white. The attribute
marital status can take on the values single, married, divorced, and widowed.
2. Binary attribute:
A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means
that the attribute is absent and 1 means that it is present. Binary attributes are referred to as Boolean if the
two states correspond to true and false.
Ex: The attribute medical test is binary, where a value of 1 means the result of the test for the patient is
positive, while 0 means the result is negative.
A binary attribute is symmetric if both of its states are equally valuable and carry the same weight. One
such example could be the attribute gender having the states male and female.
A binary attribute is asymmetric if the outcomes of the states are not equally important, such as the
positive and negative outcomes of a medical test for HIV. By convention, we code the most important
outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other by 0 (e.g., HIV negative).

3. Ordinal attribute:
An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among
them, but the magnitude between successive values is not known.
Ex: Suppose that drink size corresponds to the size of drinks available at a fast-food restaurant. This
ordinal attribute has three possible values: small, medium, and large. The values have a meaningful sequence
(which corresponds to increasing drink size); however, we cannot tell from the values how much bigger one
size is than another.
4. Numeric attribute:
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values.
Numeric attributes can be interval-scaled or ratio-scaled.
a. Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-scaled
attributes have order and can be positive, 0, or negative.
Ex: A temperature attribute is interval-scaled. Suppose that we have the outdoor temperature value for a
number of different days, where each day is an object.
b. Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if a measurement is
ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. In addition, the
values are ordered, and we can also compute the difference between values, as well as the mean,
median, and mode.
Ex: Examples of ratio-scaled attributes include count attributes such as years of experience (e.g., the
objects are employees) and number of words (e.g., the objects are documents).
Discrete versus Continuous Attributes:
 A discrete attribute has a finite or countably infinite set of values, which may or may not be
represented as integers. The attributes hair color, smoker, medical test, and drink size each have a finite
number of values, and so are discrete. Note that discrete attributes may have numeric values, such as 0
and 1 for binary attributes or, the values 0 to 110 for the attribute age.
 If an attribute is not discrete, it is continuous. The terms numeric attribute and continuous attribute are
often used interchangeably in the literature. In practice, real values are represented using a finite number
of digits. Continuous attributes are typically represented as floating-point variables.
8. Basic Statistical Descriptions of Data
For data preprocessing to be successful, it is essential to have an overall picture of your data. Basic
statistical descriptions can be used to identify properties of the data and highlight which data values should be
treated as noise or outliers.
There are 3 areas of basic statistical descriptions:
1. Measures of central tendency - which measure the location of the middle or center of a data
distribution.
2. Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and
Interquartile Range.
3. Graphic Displays of Basic Statistical Descriptions of Data.

1. Measuring the Central Tendency: Mean, Median, and Mode


The most common and effective numeric measure of the "center" of a set of data is the (arithmetic) mean.
Let x1, x2, ..., xN be a set of N values or observations, such as for some numeric attribute X, like salary. The
mean of this set of values is

x¯ = (x1 + x2 + ... + xN) / N

Ex: Suppose we have the following values for salary (in thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110. Their sum is 696, so the mean is 696/12 = 58, that is, $58,000.

Sometimes, each value xi in a set may be associated with a weight wi for i = 1, ..., N. The weights reflect the
significance, importance, or occurrence frequency attached to their respective values. In this case, we can
compute the weighted arithmetic mean:

x¯ = (w1 x1 + w2 x2 + ... + wN xN) / (w1 + w2 + ... + wN)


For skewed (asymmetric) data, a better measure of the center of data is the median, which is the middle
value in a set of ordered data values. It is the value that separates the higher half of a data set from the lower
half.
Suppose that a given data set of N values for an attribute X is sorted in increasing order. If N is odd, then
the median is the middle value of the ordered set. If N is even, then the median is not unique; it is the two
middlemost values and any value in between. If X is a numeric attribute in this case, by convention, the median
is taken as the average of the two middlemost values.
Median. Let’s find the median of 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110. The data are already sorted in
increasing order. There is an even number of observations (i.e., 12); therefore, the median is not unique. It can
be any value within the two middlemost values of 52 and 56 (that is, within the sixth and seventh values in the
list). By convention, we assign the average of the two middlemost values as the median; that is, (52+56)/2 = 108
/2 = 54. Thus, the median is $54,000. Suppose that we had only the first 11 values in the list. Given an odd
number of values, the median is the middlemost value. This is the sixth value in this list, which has a value of
$52,000.
The mode for a set of data is the value that occurs most frequently in the set. Therefore, it can be
determined for qualitative and quantitative attributes. It is possible for the greatest frequency to correspond to
several different values, which results in more than one mode. Data sets with one, two, or three modes are
respectively called unimodal, bimodal, and trimodal. In general, a data set with two or more modes is
multimodal. At the other extreme, if each data value occurs only once, then there is no mode.
Mode. The data 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 are bimodal. The two modes are $52,000 and
$70,000

The midrange can also be used to assess the central tendency of a numeric data set. It is the average of the
largest and smallest values in the set.
Midrange. The midrange of the data 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 is
(30,000+110,000)/2 = $70,000
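
These measures can be checked with Python's standard statistics module, as in the following sketch using the same salary data:

```python
# Hedged sketch reproducing the salary example (values in thousands of dollars)
# with Python's standard statistics module.
from statistics import mean, median, multimode

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(salary))       # 58    -> $58,000
print(median(salary))     # 54.0  -> $54,000 (average of the two middle values)
print(multimode(salary))  # [52, 70], the data set is bimodal
print((min(salary) + max(salary)) / 2)   # midrange: 70.0 -> $70,000
```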
2. Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile
Range.
Range : Let x1,x2,...,xN be a set of observations for some numeric attribute, X. The range of the set is the
difference between the largest (max()) and smallest (min()) values.

Quartiles: Suppose that the data for attribute X are sorted in increasing numeric order. Imagine that we can
pick certain data points so as to split the data distribution into equal-size consecutive sets, as shown in the
figure below. These data points are called quantiles.

Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size
consecutive sets.
The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It corresponds
to the median.
The 100-quantiles are more commonly referred to as percentiles. They divide the data distribution into 100
equal-sized consecutive sets.
The quartiles give an indication of a distribution’s center, spread, and shape. The first quartile, denoted
by Q1, is the 25th percentile. It cuts off the lowest 25% of the data. The third quartile, denoted by Q3, is the
75th percentile—it cuts off the lowest 75% (or highest 25%) of the data. The second quartile is the 50th
percentile. As the median, it gives the center of the data distribution.
The distance between the first and third quartiles is a simple measure of spread that gives the range
covered by the middle half of the data. This distance is called the interquartile range (IQR) and is defined
as IQR = Q3 − Q1.

Ex : Interquartile range. The quartiles are the three values that split the sorted data set into four equal parts.
The data of Example 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 contain 12 observations, already sorted in
increasing order. Thus, the quartiles for this data are the third, sixth, and ninth values, respectively, in the sorted
list. Therefore, Q1 = $47,000 and Q3 is $63,000. Thus, the interquartile range is IQR = 63 − 47 = $16,000.
(Note that the sixth value is a median, $52,000, although this data set has two medians since the number of data
values is even.)
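
A small sketch reproducing the quartile and IQR computation; note that it follows the text's convention of taking the 3rd, 6th, and 9th sorted values, whereas library routines that interpolate (e.g., numpy.percentile) may give slightly different quartiles:

```python
# Hedged sketch: quartiles and interquartile range for the salary data, using the
# text's convention of picking the 3rd, 6th, and 9th values of the sorted list.
salary = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

q1 = salary[2]          # 3rd value  -> 47  ($47,000)
q2 = salary[5]          # 6th value  -> 52  (a median)
q3 = salary[8]          # 9th value  -> 63  ($63,000)
iqr = q3 - q1           # 16 -> $16,000
print(q1, q2, q3, iqr)
```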

Variance and Standard Deviation


The variance of N observations, x1, x2, ..., xN, for a numeric attribute X is

σ² = [(x1 − x¯)² + (x2 − x¯)² + ... + (xN − x¯)²] / N

where x¯ is the mean value of the observations. The standard deviation, σ, of the observations is the square
root of the variance, σ².

Ex: Variance and standard deviation. For the data 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110, we
found mean = x¯ = $58,000. To determine the variance and standard deviation, we set N = 12 and obtain
σ² ≈ 379.17 and σ ≈ 19.5, that is, a standard deviation of roughly $19,500.


 σ measures spread about the mean and should be considered only when the mean is chosen as the measure
of center.
 σ = 0 only when there is no spread, that is, when all observations have the same value. Otherwise, σ > 0.
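
A short sketch confirming the variance and standard deviation of the salary example with the standard library (population formulas, i.e., dividing by N, matching the formula above):

```python
# Hedged sketch: population variance and standard deviation of the salary data.
from statistics import pvariance, pstdev

salary = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(round(pvariance(salary), 2))   # about 379.17
print(round(pstdev(salary), 2))      # about 19.47 -> roughly $19,500
```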

3. Graphic Displays of Basic Statistical Descriptions of Data


These include quantile plots, quantile–quantile plots, histograms, and scatter plots. Such graphs are
helpful for the visual inspection of data, which is useful for data preprocessing.
A quantile plot is a simple and effective way to have a first look at a univariate data distribution. First, it
displays all of the data for the given attribute (allowing the user to assess both the overall behavior and unusual
occurrences). Second, it plots quantile information.
Let xi , for i = 1 to N, be the data sorted in increasing order so that x1 is the smallest observation and xN is
the largest for some ordinal or numeric attribute X. Each observation, xi , is paired with a percentage, fi , which
indicates that approximately fi × 100% of the data are below the value, xi. We say "approximately" because
there may not be a value with exactly a fraction, fi , of the data below xi . Note that the 0.25 percentile
corresponds to quartile Q1, the 0.50 percentile is the median, and the 0.75 percentile is Q3.

Example: Quantile plot. Following Figure shows a quantile plot for the unit price data of Table given below.


A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution against the
corresponding quantiles of another. It is a powerful visualization tool in that it allows the user to view whether
there is a shift in going from one distribution to another.
Suppose that we have two sets of observations for the attribute or variable unit price, taken from two
different branch locations. Let x1,...,xN be the data from the first branch, and y1,..., yM be the data from the
second, where each data set is sorted in increasing order.
If M = N (i.e., the number of points in each set is the same), then we simply plot yi against xi , where yi and
xi are both (i − 0.5)/N quantiles of their respective data sets.
If M < N (i.e., the second branch has fewer observations than the first), there can be only M points on the q-
q plot. Here, yi is the (i − 0.5)/M quantile of the y data, which is plotted against the (i − 0.5)/M quantile of the x
data. This computation typically involves interpolation.

Histograms (or frequency histograms) are at least a century old and are widely used. "Histos" means pole or
mast, and "gram" means chart, so a histogram is a chart of poles. Plotting histograms is a graphical method for
summarizing the distribution of a given attribute, X. If X is nominal, such as automobile model or item type,
then a pole or vertical bar is drawn for each known value of X. The height of the bar indicates the frequency
(i.e., count) of that X value. The resulting graph is more commonly known as a bar chart.

The range of values for X is partitioned into disjoint consecutive subranges. The subranges, referred to as
buckets or bins, are disjoint subsets of the data distribution for X. The range of a bucket is known as the width.
For example, a price attribute with a value range of $1 to $200 (rounded up to the nearest dollar) can be
partitioned into subranges 1 to 20, 21 to 40, 41 to 60, and so on. For each subrange, a bar is drawn with a height
that represents the total count of items observed within the subrange.

A scatter plot is one of the most effective graphical methods for determining if there appears to be a
relationship, pattern, or trend between two numeric attributes. To construct a scatter plot, each pair of
values is treated as a pair of coordinates in an algebraic sense and plotted as points in the plane.
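
A minimal plotting sketch, assuming matplotlib is available and using invented price data, that produces a histogram with $20-wide buckets and a scatter plot of two numeric attributes:

```python
# Hedged sketch: a histogram of a price attribute and a scatter plot of two numeric
# attributes with matplotlib; the values are invented for illustration.
import random
import matplotlib.pyplot as plt

random.seed(0)
price = [random.uniform(1, 200) for _ in range(300)]
units_sold = [250 - p + random.gauss(0, 20) for p in price]   # roughly decreasing trend

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(price, bins=range(0, 220, 20))          # buckets (bins) of width $20
ax1.set(title="Histogram", xlabel="price ($)", ylabel="count")
ax2.scatter(price, units_sold, s=10)
ax2.set(title="Scatter plot", xlabel="price ($)", ylabel="units sold")
plt.tight_layout()
plt.show()
```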

9. Data Visualization: Data visualization aims to communicate data clearly and effectively
through graphical representation.
Data visualization approaches,
 Pixel-oriented techniques
 Geometric projection techniques
 Icon-based techniques
 Hierarchical and graph-based techniques.
Pixel-oriented techniques: For a data set of m dimensions, pixel-oriented techniques create m
windows on the screen, one for each dimension. The m dimension values of a record are
mapped to m pixels at the corresponding positions in the windows. The colors of the pixels
reflect the corresponding values.


Ex: Pixel-oriented visualization. AllElectronics maintains a customer information table, which


consists of four dimensions: income, credit limit, transaction volume, and age.
The pixel colors are chosen so that the smaller the value, the lighter the shading. Using pixel-based
visualization, we can easily observe the following: credit limit increases as income increases;
customers whose income is in the middle range are more likely to purchase more from AllElectronics;
and there is no clear correlation between income and age.

10. Data Preprocessing: Data preprocessing is an important step in the data mining process.
It refers to the cleaning, transforming, and integrating of data in order to make it ready for analysis.
The goal of data preprocessing is to improve the quality of the data and to make it more suitable for
the specific data mining task.

Major tasks in Data preprocessing:

1) Data Cleaning

2) Data Integration

3) Data reduction and

4) Data Transformation

1) DATA Cleaning: Data cleaning routines attempt to fill in missing values, smooth out noise
while identifying outliers, and correct inconsistencies in the data. Various methods for handling
this problem:

Handling Missing Values: The various methods for handling the problem of missing values in
data tuples include:

(a) Ignoring the tuple: This is usually done when the class label is missing (assuming the mining
task involves classification or description). This method is not very effective unless the tuple
contains several attributes with missing values. It is especially poor when the percentage of
missing values per attribute varies considerably.


(b) Manually filling in the missing value: In general, this approach is time-consuming and may
not be a reasonable task for large data sets with many missing values, especially when the value
to be filled in is not easily determined.

(c) Using a global constant to fill in the missing value: Replace all missing attribute values by the
same constant, such as a label like “Unknown,” or -∞. If missing values are replaced by, say,
“Unknown,” then the mining program may mistakenly think that they form an interesting
concept, since they all have a value in common — that of “Unknown.” Hence, although this
method is simple, it is not recommended.

(d) Using the attribute mean for quantitative (numeric) values or attribute mode for categorical
(nominal) values, for all samples belonging to the same class as the given tuple: For example, if
classifying customers according to credit risk, replace the missing value with the average income
value for customers in the same credit risk category as that of the given tuple.

(e) Using the most probable value to fill in the missing value: This may be determined with
regression, inference-based tools using Bayesian formalism, or decision tree induction. For
example, using the other customer attributes in your data set, you may construct a decision tree
to predict the missing values for income.
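
A hedged pandas sketch of three of these strategies, (a) ignoring tuples, (c) filling with a global constant, and (d) filling with the attribute mean, on an invented table:

```python
# Hedged sketch of missing-value handling with pandas; the table is invented.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [40_000, np.nan, 55_000, 62_000, np.nan],
                   "risk":   ["low", "high", "low", np.nan, "high"]})

dropped   = df.dropna()                                 # (a) ignore incomplete tuples
constant  = df.fillna({"risk": "Unknown"})              # (c) global constant for a nominal attribute
mean_fill = df.fillna({"income": df["income"].mean()})  # (d) attribute mean for a numeric attribute
print(dropped, constant, mean_fill, sep="\n\n")
```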

Handling Noisy data: Noise is a random error or variance in a measured variable. Data
smoothing techniques are used for removing such noisy data.

Several Data smoothing techniques:


1 Binning methods: Binning methods smooth a sorted data value by consulting its "neighborhood",
that is, the values around it. The sorted values are distributed into a number of "buckets",
or bins. Because binning methods consult the neighborhood of values, they perform local
smoothing.
In this technique,
1. The data are first sorted.
2. The sorted list is then partitioned into equi-depth (equal-frequency) bins.
3. One can then smooth by bin means, smooth by bin medians, or smooth by bin
boundaries.

a. Smoothing by bin means: Each value in the bin is replaced by the mean
value of the bin.
b. Smoothing by bin medians: Each value in the bin is replaced by the bin
median.

c. Smoothing by boundaries: The min and max values of a bin are identified as
the bin boundaries. Each bin value is replaced by the closest boundary value.

In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For
example, the mean of the values 4, 8, and 15 in Bin 1 is 9. Therefore, each original value in
this bin is replaced by the value 9. Similarly, smoothing by bin medians can be employed, in
which each bin value is replaced by the bin median. In smoothing by bin boundaries, the
minimum and maximum values in a given bin are identified as the bin boundaries. Each bin
value is then replaced by the closest boundary value.
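
A small sketch of equi-depth binning with smoothing by bin means and by bin boundaries, using the sorted price values mentioned above:

```python
# Hedged sketch of equi-depth binning with smoothing by bin means and by bin
# boundaries, for the sorted price values 4, 8, 15, 21, 21, 24, 25, 28, 34.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]          # already sorted
depth = 3                                            # 3 values per bin
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

print(bins)            # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```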

2 Clustering: Outliers in the data may be detected by clustering, where similar values are
organized into groups, or ‘clusters’. Values that fall outside of the set of clusters may be
considered outliers.

3 Regression : smooth by fitting the data into regression functions.


 Linear regression involves finding the best line to fit two variables, so that one variable can
be used to predict the other.
 Multiple linear regression is an extension of linear regression, where more than two variables
are involved and the data are fit to a multidimensional surface.

2) Data integration: Data integration in data mining refers to the process of combining data from
multiple sources into a single, unified view. This can involve cleaning and transforming the data, as
well as resolving any inconsistencies or conflicts that may exist between the different sources. The
goal of data integration is to make the data more useful and meaningful for the purposes of analysis
and decision making. There are two major approaches for data integration – the "tight
coupling approach" and the "loose coupling approach".

Tight Coupling: This approach involves creating a centralized repository or data warehouse to store
the integrated data. The data is extracted from various sources, transformed and loaded into a data
warehouse. Data is integrated in a tightly coupled manner, meaning that the data is integrated at a high
level, such as at the level of the entire dataset or schema. This approach is also known as data
warehousing, and it enables data consistency and integrity, but it can be inflexible and difficult to
change or update.

 Here, a data warehouse is treated as an information retrieval component.


 In this coupling, data is combined from different sources into a single physical location through
the process of ETL – Extraction, Transformation, and Loading.

Loose Coupling: This approach involves integrating data at the lowest level, such as at the level of
individual data elements or records. Data is integrated in a loosely coupled manner, meaning that the
data is integrated at a low level, and it allows data to be integrated without having to create a central
repository or data warehouse. This approach is also known as data federation, and it enables data
flexibility and easy updates, but it can be difficult to maintain consistency and integrity across multiple
data sources.

 Here, an interface is provided that takes the query from the user, transforms it in a way the
source database can understand, and then sends the query directly to the source databases to
obtain the result.

 And the data only remains in the actual source databases.

There are three issues to consider during data integration: Schema Integration, Redundancy
Detection, and resolution of data value conflicts. These are explained in brief below.

1. Schema Integration

 Integrate metadata from different sources.


 Matching up real-world entities from multiple data sources is referred to as the entity
identification problem.

2. Redundancy Detection:

 An attribute may be redundant if it can be derived or obtained from another attribute or set of
attributes.
 Inconsistencies in attributes can also cause redundancies in the resulting data set.
 Some redundancies can be detected by correlation analysis.

3. Resolution of data value conflicts:

 This is the third critical issue in data integration.


 Attribute values from different sources may differ for the same real-world entity.
 An attribute in one system may be recorded at a lower level of abstraction than the "same"
attribute in another.

3) Data reduction: Data reduction is a technique used in data mining to reduce the size of a dataset
while still preserving the most important information. This can be beneficial in situations where the
dataset is too large to be processed efficiently, or where the dataset contains a large amount of
irrelevant or redundant information.

There are several different data reduction techniques that can be used in data mining, including:

1. Data Sampling: This technique involves selecting a subset of the data to work with, rather
than using the entire dataset. This can be useful for reducing the size of a dataset while still
preserving the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features into
a single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete data by
partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the dataset that
are most relevant to the task at hand.
It is important to note that data reduction involves a trade-off between the accuracy and the size
of the data: the more the data are reduced, the less accurate and the less generalizable the resulting
model may be.

In conclusion, data reduction is an important step in data mining, as it can help to improve the
efficiency and performance of machine learning algorithms by reducing the size of the dataset.
However, it is important to be aware of the trade-off between the size and accuracy of the data, and
carefully assess the risks and benefits before implementing it.

Methods of data reduction: These are explained as following below.

1. Data Cube Aggregation: This technique is used to aggregate data into a simpler form. For example,
imagine that the information gathered for your analysis covers the years 2012 to 2014 and includes the
revenue of your company every three months. If you are interested in annual sales rather than the
quarterly figures, the data can be summarized so that the resulting data report the total sales per year
instead of per quarter.
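
A minimal pandas sketch of this kind of aggregation, with invented quarterly revenue figures:

```python
# Hedged sketch: aggregating quarterly revenue (2012-2014) into annual totals.
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2012] * 4 + [2013] * 4 + [2014] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "revenue": [224, 408, 350, 586, 248, 512, 392, 610, 260, 530, 410, 644],
})

annual = quarterly.groupby("year", as_index=False)["revenue"].sum()
print(annual)   # one summarized row per year instead of four per quarter
```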

2. Dimension reduction: Whenever we come across data attributes that are only weakly relevant, we keep
only the attributes required for our analysis. This reduces data size, as it eliminates outdated or redundant
features.

 Step-wise Forward Selection: The selection begins with an empty set of attributes; at each step we
add the best of the remaining original attributes to the set, based on their relevance (often judged
with a statistical significance test, i.e., a p-value). A code sketch of this greedy procedure is given
after this list.

Suppose there are the following attributes in the data set in which few attributes are redundant.

Initial attribute Set: {X1, X2, X3, X4, X5, X6}


Initial reduced attribute set: { }

Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}

 Step-wise Backward Selection: This selection starts with a set of complete attributes in the
original data and at each point, it eliminates the worst remaining attribute in the set.

Suppose there are the following attributes in the data set in which few attributes are redundant.

Initial attribute Set: {X1, X2, X3, X4, X5, X6}


Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }

Step-1: {X1, X2, X3, X4, X5}


Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}


 Combination of Forward and Backward Selection: It allows us to remove the worst and
select the best attributes, saving time and making the process faster.
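
As referenced above, a sketch of greedy step-wise forward selection; the synthetic data and the use of a cross-validated decision tree as the relevance score are assumptions for illustration:

```python
# Hedged sketch of step-wise forward selection: greedily add the attribute that most
# improves cross-validated accuracy, stopping when no attribute helps.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                       # candidate attributes X1..X6
y = (X[:, 0] + X[:, 1] - X[:, 4] > 0).astype(int)   # only X1, X2, X5 matter here

selected, remaining = [], list(range(6))
best_score = 0.0
while remaining:
    scores = {j: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [j]], y, cv=5).mean()
              for j in remaining}
    j_best = max(scores, key=scores.get)
    if scores[j_best] <= best_score:                # stop when no attribute improves the score
        break
    best_score = scores[j_best]
    selected.append(j_best)
    remaining.remove(j_best)

print([f"X{j + 1}" for j in selected])              # e.g. a reduced set such as {X1, X2, X5}
```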

3. Data Compression: The data compression technique reduces the size of the files using different
encoding mechanisms (Huffman Encoding & run-length Encoding). We can divide it into two types
based on their compression techniques.

 Lossless Compression: Encoding techniques (e.g., Run-Length Encoding) allow a simple and
modest data size reduction. Lossless data compression uses algorithms to restore the precise
original data from the compressed data (see the sketch after this list).
 Lossy Compression: Methods such as the Discrete Wavelet transform technique, PCA
(principal component analysis) are examples of this compression. For e.g., the JPEG image
format is a lossy compression, but we can find the meaning equivalent to the original image. In
lossy-data compression, the decompressed data may differ from the original data but are useful
enough to retrieve information from them.
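
As noted above, a small sketch of lossless run-length encoding, where repeated values are stored as (value, count) pairs and the exact original data can be restored:

```python
# Hedged sketch of lossless compression with run-length encoding.
from itertools import groupby

def rle_encode(seq):
    # Each run of identical values becomes a (value, run_length) pair.
    return [(value, len(list(run))) for value, run in groupby(seq)]

def rle_decode(pairs):
    # Expand each (value, run_length) pair back into the original run.
    return [value for value, count in pairs for _ in range(count)]

data = list("AAAABBBCCDAA")
encoded = rle_encode(data)
print(encoded)                          # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
assert rle_decode(encoded) == data      # the exact original data are restored
```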

4. Numerosity Reduction: In this reduction technique, the actual data are replaced with a mathematical
model or a smaller representation of the data, so that only the model parameters need to be stored
(parametric methods), or with non-parametric representations such as clustering, histograms, and sampling.

4) Data Transformation: Data transformation in data mining refers to the process of converting
raw data into a format that is suitable for analysis and modeling. The goal of data transformation is to
prepare the data for data mining so that it can be used to extract useful insights and knowledge. Data
transformation typically involves several steps, including:

1. Smoothing: It is a process that is used to remove noise from the dataset using some algorithms. It
allows for highlighting important features present in the dataset. It helps in predicting the patterns.
When collecting data, it can be manipulated to eliminate or reduce any variance or any other noise
form. The concept behind data smoothing is that it will be able to identify simple changes to help
predict different trends and patterns. This serves as a help to analysts or traders who need to look at a
lot of data which can often be difficult to digest for finding patterns that they wouldn’t see otherwise.

2. Aggregation: Data collection or aggregation is the method of storing and presenting data in a
summary format. The data may be obtained from multiple data sources to integrate these data sources
into a data analysis description. This is a crucial step since the accuracy of data analysis insights is
highly dependent on the quantity and quality of the data used. Gathering accurate data of high quality
and a large enough quantity is necessary to produce relevant results. The collection of data is useful for
everything from decisions concerning financing or business strategy of the product, pricing,
operations, and marketing strategies. For example, sales data may be aggregated to compute
monthly and annual total amounts.

3. Discretization: It is a process of transforming continuous data into a set of small intervals. Most data
mining activities in the real world involve continuous attributes, yet many of the existing data mining
frameworks are unable to handle these attributes. Also, even if a data mining task can manage a
continuous attribute, it can significantly improve its efficiency by replacing the continuous attribute
with its discretized values. For example, age values may be grouped into intervals (1-10, 11-20, ...) or
into labels (young, middle-aged, senior).
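
A minimal pandas sketch of discretizing a continuous age attribute into intervals and labels (the cut points are illustrative):

```python
# Hedged sketch: discretizing a continuous age attribute into intervals and labels.
import pandas as pd

ages = pd.Series([6, 15, 23, 31, 47, 58, 66, 72])

intervals = pd.cut(ages, bins=[0, 10, 20, 40, 60, 120])       # (0, 10], (10, 20], ...
labels = pd.cut(ages, bins=[0, 20, 40, 60, 120],
                labels=["young", "adult", "middle-aged", "senior"])
print(pd.DataFrame({"age": ages, "interval": intervals, "label": labels}))
```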

4. Attribute Construction: Where new attributes are created & applied to assist the mining process
from the given set of attributes. This simplifies the original data & makes the mining more efficient.

5. Generalization: It converts low-level data attributes into high-level data attributes using a concept
hierarchy. For example, age values initially in numeric form (22, 25) may be converted into categorical
values (young, old), and categorical attributes, such as house addresses, may be generalized to
higher-level definitions, such as town or country.

6. Normalization: Data normalization involves converting all data variables into a given range.
Techniques that are used for normalization are listed below; a combined sketch follows the list:

 Min-Max Normalization:
o This transforms the original data linearly.
o Suppose that min_A is the minimum and max_A is the maximum value of an attribute A,
o v is the original value, and v' is the new value obtained after mapping v into the new range
[new_min_A, new_max_A].
o The transformation is: v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
 Z-Score Normalization:
o In z-score normalization (or zero-mean normalization) the values of an attribute (A), are
normalized based on the mean of A and its standard deviation
o A value, v, of attribute A is normalized to v' by computing v' = (v − Ā) / σ_A, where Ā and σ_A are
the mean and standard deviation of A.
 Decimal Scaling:
o It normalizes the values of an attribute by changing the position of their decimal points
o The number of points by which the decimal point is moved can be determined by the
absolute maximum value of attribute A.
o A value, v, of attribute A is normalized to v' by computing v' = v / 10^j,
o where j is the smallest integer such that Max(|v'|) < 1.
o Suppose the values of an attribute P vary from -99 to 99.
o The maximum absolute value of P is 99.
o For normalizing the values we divide each number by 100 (i.e., j = 2, the number of digits in the
largest absolute value), so that the values fall between -0.99 and 0.99.
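
A combined sketch of the three normalization methods on a small invented list of values:

```python
# Hedged sketch of min-max, z-score, and decimal-scaling normalization.
import math

values = [200, 300, 400, 600, 1000]

# Min-max normalization to the new range [0, 1].
mn, mx = min(values), max(values)
min_max = [(v - mn) / (mx - mn) for v in values]

# Z-score normalization: subtract the mean, divide by the standard deviation.
mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
z_score = [(v - mean) / std for v in values]

# Decimal scaling: divide by 10^j, with j the smallest integer making max(|v'|) < 1.
j = len(str(int(max(abs(v) for v in values))))      # max is 1000 -> j = 4
decimal_scaled = [v / 10 ** j for v in values]

print(min_max)         # [0.0, 0.125, 0.25, 0.5, 1.0]
print(z_score)
print(decimal_scaled)  # [0.02, 0.03, 0.04, 0.06, 0.1]
```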
