
DATA MINING

Unit-I

Introduction
Data mining is the process of discovering interesting patterns from massive amounts of
data. As a knowledge discovery process, it typically involves data cleaning, data integration,
data selection, data transformation, pattern discovery, pattern evaluation, and knowledge
presentation.

Data mining is used to extract valuable information from large sets of raw data. Terms with a similar meaning to data mining include knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.

Data mining is searching for knowledge (interesting patterns) in data. It is also known as knowledge discovery in databases (KDD).

Data mining is the process of discovering interesting patterns and knowledge from
large amounts of data. The data sources can include databases, data warehouses, the Web,
other information repositories, or data that are streamed into the system dynamically.

In general, data mining techniques are designed either to explain and understand the past or to predict the future.

Data mining techniques are used to take decisions based on facts rather than intuition.

What Kinds of Data Can Be Mined?


Database Data

A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.

Relational databases are one of the most commonly available and richest information
repositories, and thus they are a major data form in the study of data mining.

Data Warehouses

Data warehouses store data from multiple databases, which makes analysis easier. Data warehousing is a process for collecting and managing data electronically from varied sources to provide meaningful business insights. It is designed to analyze, report on, and integrate transaction data from different sources.

Data warehouses are constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing.

A data warehouse is usually modeled by a multidimensional data structure, called a
data cube.

A comparison of database vs. data warehouse is as follows:

• Purpose: A database is designed to record; a data warehouse is designed to analyze.
• Processing method: A database uses Online Transactional Processing (OLTP); a data warehouse uses Online Analytical Processing (OLAP).
• Usage: A database helps to perform fundamental operations for your business; a data warehouse allows you to analyze your business.
• Tables and joins: Tables and joins of a database are complex as they are normalized; tables and joins are simple in a data warehouse because they are denormalized.
• Orientation: A database is an application-oriented collection of data; a data warehouse is a subject-oriented collection of data.
• Storage limit: A database is generally limited to a single application; a data warehouse stores data from any number of applications.
• Availability: In a database, data is available in real time; in a data warehouse, data is refreshed from source systems as and when needed.
• Modeling: ER modeling techniques are used for designing a database; data modeling techniques are used for designing a data warehouse.
• Technique: A database captures data; a data warehouse analyzes data.
• Data type: Data stored in a database is up to date; a data warehouse stores current and historical data, which may not be up to date.
• Storage of data: A database uses a flat relational approach for data storage; a data warehouse uses a dimensional approach (e.g., star and snowflake schemas).
• Query type: Simple transaction queries are used in a database; complex queries are used in a data warehouse for analysis purposes.
• Data summary: Detailed data is stored in a database; a data warehouse stores highly summarized data.
Transactional Data

In general, each record in a transactional database captures a transaction, such as a customer’s purchase, a flight booking, or a user’s clicks on a web page. A transaction typically includes a unique transaction identity number (trans ID) and a list of the items making up the transaction, such as the items purchased in the transaction. A transactional database may have additional tables, which contain other information related to the transactions, such as item description, information about the salesperson or the branch, and so on.

Other Kinds of Data

There are many other kinds of data that have versatile forms and structures and rather
different semantic meanings. Such kinds of data can be seen in many applications: time-related
or sequence data (e.g., historical records, stock exchange data, and time-series and biological
sequence data), data streams (e.g., video surveillance and sensor data, which are continuously
transmitted), spatial data (e.g., maps), engineering design data (e.g., the design of buildings,
system components, or integrated circuits), hypertext and multimedia data (including text,
image, video, and audio data), graph and networked data (e.g., social and information
networks), and the Web (a huge, widely distributed information repository made available by
the Internet).

Data Mining Patterns


Data mining functionalities are used to specify the kinds of patterns to be found in data
mining tasks. In general, such tasks can be classified into two categories: descriptive and
predictive.

- Descriptive mining tasks characterize properties of the data in a target data set.

They answer questions such as: What is in the data? What does it look like? Are there any unusual patterns? What does the data suggest for customer segmentation? The user may have no idea which kinds of patterns are interesting.

- Predictive mining tasks perform induction on the current data in order to make predictions. They are suitable for previously unseen data.

They answer questions such as: Which product will give high profit? Which customers are likely to leave in the next six months? Does a patient have a specific disease, based on his medical test results?

1) Class/Concept Description: Characterization and Discrimination
Data entries can be associated with classes or concepts. For example, in the
AllElectronics store, classes of items for sale include computers and printers, and concepts of
customers include bigSpenders and budgetSpenders. It can be useful to describe individual
classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a
class or a concept are called class/concept descriptions. These descriptions can be derived
using (1) data characterization, by summarizing the data of the class under study (often called
the target class) in general terms, or (2) data discrimination, by comparison of the target class
with one or a set of comparative classes (often called the contrasting classes), or (3) both data
characterization and discrimination.

2) Mining Frequent Patterns, Associations, and Correlations


Frequent patterns, as the name suggests, are patterns that occur frequently in data.
There are many kinds of frequent patterns, including frequent itemsets, frequent subsequences
(also known as sequential patterns), and frequent substructures.

Mining frequent patterns leads to the discovery of interesting associations and correlations within data.

Example

Association analysis. Suppose that, as a marketing manager at AllElectronics, you want to know which items are frequently purchased together (i.e., within the same transaction). Additional analysis can be performed to uncover interesting statistical correlations between associated attribute–value pairs.
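
As an illustrative sketch (not from the text), frequent item pairs can be counted directly over a set of transactions. The baskets below are hypothetical, and min_support is a chosen threshold:

    from itertools import combinations
    from collections import Counter

    # Hypothetical transactions: each row is the set of items in one basket.
    transactions = [
        {"computer", "printer", "ink"},
        {"computer", "printer"},
        {"computer", "flash_drive"},
        {"printer", "ink"},
    ]
    min_support = 0.5  # a pair must appear in at least half of the baskets

    # Count every pair of items that occurs together in a basket.
    pair_counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    n = len(transactions)
    frequent = {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}
    print(frequent)  # {('computer', 'printer'): 0.5, ('ink', 'printer'): 0.5}

A frequent pair such as (computer, printer) is the starting point for an association rule like "computer => printer".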

3) Classification and Regression for Predictive Analysis


Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts.

Regression analysis is a statistical methodology that is most often used for numeric
prediction, although other methods exist as well. Regression also encompasses the
identification of distribution trends based on the available data.
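
A minimal sketch of both tasks using scikit-learn (assumed available); the customer records, class labels, and spending amounts are made up for illustration:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LinearRegression

    # Classification: predict a discrete class label for each customer.
    X_train = np.array([[25, 20_000], [40, 90_000], [35, 60_000], [22, 15_000]])  # [age, income]
    y_class = ["budgetSpender", "bigSpender", "bigSpender", "budgetSpender"]
    clf = DecisionTreeClassifier().fit(X_train, y_class)
    print(clf.predict([[30, 70_000]]))  # -> a predicted class label

    # Regression: predict a continuous value (e.g., monthly spending).
    y_spend = np.array([120.0, 900.0, 540.0, 80.0])
    reg = LinearRegression().fit(X_train, y_spend)
    print(reg.predict([[30, 70_000]]))  # -> a predicted numeric amount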

Cluster Analysis
Unlike classification and regression, which analyze class-labeled (training) data sets,
clustering analyzes data objects without consulting class labels.

Outlier Analysis

A data set may contain objects that do not comply with the general behavior or model
of the data. These data objects are outliers. Many data mining methods discard outliers as
noise or exceptions.

The analysis of outlier data is referred to as outlier analysis or anomaly mining.


Outliers may be detected using statistical tests that assume a distribution or probability model
for the data.

Technologies used for Data Mining

As a highly application-driven domain, data mining has incorporated many techniques from other domains such as statistics, machine learning, pattern recognition, database and data warehouse systems, information retrieval, visualization, algorithms, high performance computing, and many application domains.

Statistics

Statistics studies the collection, analysis, interpretation or explanation, and presentation of data. Data mining has an inherent connection with statistics.

A statistical model is a set of mathematical functions that describe the behavior of the
objects in a target class in terms of random variables and their associated probability
distributions. Statistical models are widely used to model data and data classes.

Database Systems and Data Warehouses

Database systems research focuses on the creation, maintenance, and use of databases
for organizations and end-users. Particularly, database systems researchers have established
highly recognized principles in data models, query languages, query processing and
optimization methods, data storage, and indexing and accessing methods.

A data warehouse integrates data originating from multiple sources and various
timeframes. It consolidates data in multidimensional space to form partially materialized data
cubes.

Information Retrieval
Information retrieval (IR) is the science of searching for documents or information in
documents. Documents can be text or multimedia, and may reside on the Web. The differences
between traditional information retrieval and database systems are two fold:
Information retrieval assumes that (1) the data under search are unstructured; and (2)
the queries are formed mainly by keywords, which do not have complex structures.

Data Mining Techniques

A number of data mining techniques have been developed and used, including association, classification, clustering, prediction, sequential patterns, and regression.

(i) Classification. Classification is a more complex data mining technique. This technique is used to obtain important and relevant information about data and metadata. This data mining technique helps to classify data in different classes.

a. Classification of data mining frameworks as per the type of data sources mined.
b. Classification of data mining frameworks as per the database involved.
c. Classification of data mining frameworks as per the kind of knowledge discovered.
d. Classification of data mining frameworks according to the data mining techniques used.

(ii) Clustering. Clustering is very similar to classification, but involves grouping chunks of
data together based on their similarities. Clustering is a division of information into
groups of connected objects.

(iii) Regression. Regression analysis is the data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to define the probability of a specific variable. Regression is primarily a form of planning and modelling. For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition. Primarily it gives the exact relationship between two or more variables in the given data set.

(iv) Outlier Detection. This type of data mining technique relates to the observation of data items in the data set which do not match an expected pattern or expected behaviour. This technique may be used in various domains like intrusion detection, fraud detection, etc. It is also known as Outlier Analysis or Outlier mining.

(v) Sequential Patterns: The sequential pattern is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the value of a sequence can be measured in terms of different criteria like length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize
similar patterns in transaction data over some time.

(vi) Prediction. Prediction is one of the most valuable data mining techniques. Prediction uses a combination of other data mining techniques such as trends, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.

(vii) Association Rules. Association is related to tracking patterns, but is more specific to
dependently linked variables. This data mining technique helps to discover a link
between two or more items. It finds a hidden pattern in the data set.

Machine Learning
Machine Learning is the science of getting computers to learn and act like humans do,
and improve their learning over time in autonomous fashion, by feeding them data and
information in the form of observations and real-world interactions.

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future based on the examples that we provide. The primary aim is to allow computers to learn automatically without human intervention or assistance and adjust actions accordingly.

Types of machine learning


Machine-learning algorithms use statistics to find patterns in massive amounts of data.
All combinations of machine learning algorithms consist of the following:
• Representation (a set of classifiers or the language that a computer understands)
• Evaluation (aka objective/scoring function)
• Optimization (the search method used to find the highest-scoring classifier; both off-the-shelf and custom optimization methods are used)
Machine learning algorithms are often categorized as follows.
(i) Supervised learning. In this type of machine learning, data scientists supply algorithms with labeled training data and define the variables they want the algorithm to assess for correlations. Both the input and the output of the algorithm are specified.

Supervised learning algorithms are good for the following tasks:


• Binary classification. Dividing data into two categories.
• Multi-class classification. Choosing between more than two types of
answers.
• Regression modeling. Predicting continuous values.
• Ensembling. Combining the predictions of multiple machine learning
models to produce an accurate prediction.

(ii) Unsupervised learning. This type of machine learning involves algorithms that train on unlabeled data. The algorithm scans through data sets looking for any meaningful connections. Neither the data the algorithms train on nor the outputs they produce are labeled or specified in advance.

Unsupervised learning algorithms are good for the following tasks:


• Clustering. Splitting the data set into groups based on similarity.
• Anomaly detection. Identifying unusual data points in a data set.
• Association mining. Identifying sets of items in a data set that frequently
occur together.
• Dimensionality reduction. Reducing the number of variables in a data set.

(iii) Semi-supervised learning. This approach to machine learning involves a mix of the
two preceding types. Data scientists may feed an algorithm mostly labeled training
data, but the model is free to explore the data on its own and develop its own
understanding of the data set.

Semi-supervised learning strikes a middle ground between the performance of supervised learning and the efficiency of unsupervised learning. Some areas where semi-supervised learning is used include:
• Machine translation. Teaching algorithms to translate language based on
less than a full dictionary of words.
• Fraud detection. Identifying cases of fraud when you only have a few
positive examples.
• Labeling data. Algorithms trained on small data sets can learn to apply data
labels to larger sets automatically.

(iv) Reinforcement learning. Reinforcement learning is typically used to teach a machine to complete a multi-step process for which there are clearly defined rules. Data scientists program an algorithm to complete a task and give it positive or negative cues as it works out how to complete the task. But for the most part, the algorithm decides on its own what steps to take along the way.

Reinforcement learning is often used in areas like:
• Robotics. Robots can learn to perform tasks in the physical world using this
technique.
• Video gameplay. Reinforcement learning has been used to teach bots to play
a number of video games.
• Resource management. Given finite resources and a defined goal,
reinforcement learning can help enterprises plan how to allocate resources.

Machine Learning Applications


Some of the machine learning applications are as follows:

Business Intelligence (BI) -- It is critical for businesses to acquire a better understanding of the commercial context of their organization, such as their customers, the market, supply and resources, and competitors. Business intelligence (BI) technologies provide historical, current, and predictive views of business operations.

BI and analytics vendors use machine learning in their software to identify potentially
important data points, patterns of data points and anomalies. Without data mining, many
businesses may not be able to perform effective market analysis, compare customer feedback
on similar products, discover the strengths and weaknesses of their competitors, retain highly
valuable customers, and make smart business decisions. Clearly, data mining is the core of
business intelligence.

Web Search Engines -- A Web search engine is a specialized computer server that searches for information on the Web. Web search engines are essentially very large data mining applications. Various data mining techniques are used in all aspects of search engines, ranging from crawling and indexing to searching.

Customer relationship management -- CRM software can use machine learning models to
analyze email and prompt sales team members to respond to the most important messages first.
More advanced systems can even recommend potentially effective responses.

Human resource information systems -- HRIS systems can use machine learning models to
filter through applications and identify the best candidates for an open position.

Self-driving cars -- Machine learning algorithms can even make it possible for a semi-autonomous car to recognize a partially visible object and alert the driver.

Virtual assistants -- Smart assistants typically combine supervised and unsupervised machine
learning models to interpret natural speech and supply context.

Image Recognition - Image recognition is one of the most common uses of machine learning.

Speech Recognition - Speech recognition is the translation of spoken words into text. It is also known as computer speech recognition or automatic speech recognition.

Medical diagnosis – Machine learning can be used in techniques and tools that help in the diagnosis of diseases.

Classification – Classification is the process of placing each individual under study into one of a number of classes.

Prediction – Machine learning can also be used in prediction systems.

Deep Learning
Deep Learning is a subfield of machine learning concerned with algorithms inspired by
the structure and function of the brain called artificial neural networks.

Steps in Data Mining Process

The term Knowledge Discovery in Databases, or KDD for short, refers to the broad
process of finding knowledge in data, and emphasizes the "high-level" application of particular
data mining methods. It is of interest to researchers in machine learning, pattern recognition,
databases, statistics, artificial intelligence, knowledge acquisition for expert systems, and data
visualization.

The knowledge discovery (data mining) process consists of the following steps.

1. Data cleaning - to remove noise and inconsistent data. The data we have collected are not clean and may contain errors, missing values, noise, or inconsistent data.

2. Data integration - where multiple data sources may be combined. First of all the data are
collected and integrated from all the different sources.

3. Data selection - where data relevant to the analysis task are retrieved from the database. We may not need all the data we have collected in the first step, so in this step we select only those data which we think are useful for data mining.

4. Data transformation – Transforming data to a proper format (where data are transformed
and consolidated into forms appropriate for mining by performing summary or aggregation
operations)

5. Data mining- an essential process where intelligent methods are applied to extract data
patterns. Techniques (Algorithms) like clustering and association analysis are among the
many different techniques used for data mining.

6. Pattern evaluation - to identify the truly interesting patterns representing knowledge based on interestingness measures. This step involves visualization, transformation, removal of redundant patterns, etc., applied to the patterns we generated.

7. Knowledge presentation - where visualization and knowledge representation techniques are used to present the mined knowledge to users. This step helps the user to make use of the knowledge acquired to take better decisions.

Steps 1 through 4 are different forms of data preprocessing, where data are prepared for
mining. The data mining step may interact with the user or a knowledge base. The interesting
patterns are presented to the user and may be stored as new knowledge in the knowledge base.

The preceding view shows data mining as one step in the knowledge discovery process,
albeit an essential one because it uncovers hidden patterns for evaluation.
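
As a hedged illustration, steps 1-4 might look like the following with pandas (assumed available); the tables, keys, and column names are hypothetical:

    import pandas as pd

    # Hypothetical source tables standing in for separate databases.
    sales = pd.DataFrame({"customer_id": [1, 1, 2, 2, 2],
                          "amount": [100.0, None, 250.0, 250.0, 40.0]})
    customers = pd.DataFrame({"customer_id": [1, 2], "age": [34, 51]})

    # 1. Data cleaning: fill a missing value, then drop exact duplicates.
    sales["amount"] = sales["amount"].fillna(sales["amount"].median())
    sales = sales.drop_duplicates()

    # 2. Data integration: combine the sources on a shared key.
    data = sales.merge(customers, on="customer_id")

    # 3. Data selection: keep only the attributes relevant to the task.
    data = data[["customer_id", "age", "amount"]]

    # 4. Data transformation: aggregate to one summary row per customer.
    print(data.groupby("customer_id")["amount"].agg(["sum", "mean"]))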

Major Issues in Data Mining


The major issues addressed in recent data mining research and development can be partitioned into five groups:
(i) mining methodology,
(ii) user interaction,
(iii) efficiency and scalability,
(iv) diversity of data types,
(v) data mining and society.

The issues continue to stimulate further investigation and improvement in data mining.

(i) Mining Methodology
Researchers have been vigorously developing new data mining methodologies. This
involves the investigation of new kinds of knowledge, mining in multidimensional space,
integrating methods from other disciplines, and the consideration of semantic ties among data
objects.

The various aspects of mining methodology are:-


i) Mining various and new kinds of knowledge
ii) Mining knowledge in multidimensional space
iii) Data mining—an interdisciplinary effort
iv) Boosting the power of discovery in a networked environment
v) Handling uncertainty, noise, or incompleteness of data
vi) Pattern evaluation and pattern- or constraint-guided mining

(ii) User Interaction

The user plays an important role in the data mining process. Interesting areas of
research include the following:-

i) Interactive mining
ii) Incorporation of background knowledge
iii) Ad hoc data mining and data mining query languages
iv) Presentation and visualization of data mining results

(iii) Efficiency and Scalability

Efficiency and scalability are always considered when comparing data mining
algorithms.

As data amounts continue to multiply, these two factors are especially critical.

i) Efficiency and scalability of data mining algorithms


ii) Parallel, distributed, and incremental mining algorithms

(iv) Diversity of Database Types

The wide diversity of database types brings about challenges to data mining. These include:

i) Handling complex types of data:


ii) Mining dynamic, networked, and global data repositories

(v) Data Mining and Society

How does data mining impact society? What steps can data mining take to preserve the
privacy of individuals? Do we use data mining in our daily lives without even knowing that we
do? These questions raise the following issues:

i) Social impacts of data mining


ii) Privacy-preserving data mining
iii) Invisible data mining

Data Objects and Attribute Types


Data sets are made up of data objects. Data objects are typically described by attributes.
Data objects can also be referred to as samples, examples, instances, data points, or objects. If
the data objects are stored in a database, they are data tuples. That is, the rows of a database
correspond to the data objects, and the columns correspond to the attributes.

Attribute

An attribute is a data field, representing a characteristic or feature of a data object. Attributes describing a customer object can include, for example, customer ID, name, and address.

The type of an attribute is determined by the set of possible values the attribute can have. The types are:-

Nominal attributes

Nominal means “relating to names.” The values of a nominal attribute are symbols or names of things. Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as categorical. The values do not have any meaningful order. In computer science, the values are also known as enumerations.

Binary Attributes

A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present. Binary attributes are referred to as Boolean if the two states correspond to true and false.

Ordinal Attributes

An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.

Numeric Attributes

A numeric attribute is quantitative; that is, it is a measurable quantity, represented in integer or real values. Numeric attributes can be interval-scaled or ratio-scaled.

Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-scaled attributes have order and can be positive, 0, or negative.

A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. In addition, the values are ordered, and we can also compute the difference between values, as well as the mean, median, and mode.

Discrete versus Continuous Attributes

A discrete attribute has a finite or countably infinite set of values, which may
or may not be represented as integers.
If an attribute is not discrete, it is continuous.

Data Visualization

Data visualization aims to communicate data clearly and effectively through graphical
representation. Data visualization has been used extensively in many applications.

We can take advantage of visualization techniques to discover data relationships that are otherwise not easily observable by looking at the raw data. Nowadays, people also use data visualization to create fun and interesting graphics.

Some of the visualization techniques are

(i) Pixel-Oriented Visualization Techniques


A simple way to visualize the value of a dimension is to use a pixel where the
color of the pixel reflects the dimension’s value.

(ii) Geometric Projection Visualization Techniques


A drawback of pixel-oriented visualization techniques is that they cannot
help us much in understanding the distribution of data in a multidimensional
space.

Geometric projection techniques help users find interesting projections of multidimensional data sets.

(iii) Icon-Based Visualization Techniques
Icon-based visualization techniques use small icons to represent
multidimensional data values. We look at two popular icon-based techniques:
Chernoff faces and stick figures.

(iv) Hierarchical Visualization Techniques


Hierarchical visualization techniques partition all dimensions into subsets
(i.e., subspaces). The subspaces are visualized in a hierarchical manner.

(v) Visualizing Complex Data and Relations

Data visualization techniques may be pixel-oriented, geometric-based, icon-based, or hierarchical. These methods apply to multidimensional relational data. Additional techniques have been proposed for the visualization of complex data, such as text and social networks.

Measuring Data Similarity and Dissimilarity


Measures of object similarity and dissimilarity are used in data mining applications
such as clustering, outlier analysis, and nearest-neighbor classification.

In data mining, a similarity measure is a way of measuring how closely data samples are related to each other. A dissimilarity measure, on the other hand, tells how distinct the data objects are.

Similarity and dissimilarity measures are collectively referred to as measures of proximity. A similarity measure for two objects, i and j, will typically return the value 0 if the objects are unalike. The higher the similarity value, the greater the similarity between objects.

A dissimilarity measure works the opposite way. It returns a value of 0 if the objects are
the same. The higher the dissimilarity value, the more dissimilar the two objects are.

Data Matrix versus Dissimilarity Matrix


Two data structures are commonly used: the data matrix (used to store the data objects) and the dissimilarity matrix (used to store dissimilarity values for pairs of objects).

A data matrix is made up of two entities or “things,” namely rows (for objects) and
columns (for attributes). Therefore, the data matrix is often called a two-mode matrix. The
dissimilarity matrix contains one kind of entity (dissimilarities) and so is called a one-mode
matrix.

Proximity Measures for Nominal Attributes

Nominal attributes can have two or more different states; e.g., an attribute ‘color’ can have values like ‘Red’, ‘Green’, ‘Yellow’, ‘Blue’, etc. Dissimilarity for nominal attributes is calculated as the ratio of the total number of mismatches between two data points to the total number of attributes.
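
Written as a formula (the standard textbook definition), where p is the total number of attributes and m is the number of attributes on which objects i and j match:

    d(i, j) = (p - m) / p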

Proximity Measures for Binary Attributes


Proximity measures refer to the Measures of Similarity and Dissimilarity. Similarity
and Dissimilarity are important because they are used by a number of data mining techniques,
such as clustering, nearest neighbour classification, and anomaly detection.

Dissimilarity and similarity measures can be defined for objects described by either symmetric or asymmetric binary attributes.
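
The standard way to express these measures uses a contingency count over the two objects i and j: let q be the number of attributes equal to 1 for both, r the number equal to 1 for i but 0 for j, s the number equal to 0 for i but 1 for j, and t the number equal to 0 for both. Then:

    d(i, j) = (r + s) / (q + r + s + t)    (symmetric binary attributes)
    d(i, j) = (r + s) / (q + r + s)        (asymmetric binary attributes; t is ignored)
    sim(i, j) = q / (q + r + s)            (the Jaccard coefficient, a similarity measure)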

Dissimilarity of Numeric Data: Minkowski Distance


Distance measures are commonly used for computing the dissimilarity of objects described by numeric attributes. These measures include the Euclidean, Manhattan, and Minkowski distances.

Euclidean Distance
The Euclidean distance d between two points x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn) is given by the following formula:
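
    d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)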

Manhattan Distance
The Manhattan distance, often called taxicab distance or city block distance, calculates the distance between real-valued vectors. Imagine vectors that describe objects on a uniform grid such as a chessboard. The Manhattan distance then refers to the distance between two vectors if they could only move at right angles. There is no diagonal movement involved in calculating the distance.
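
Its formula is:

    d(x, y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|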

Minkowski Distance
Minkowski distance is a distance measured between two points in N-dimensional space.
It is basically a generalization of the Euclidean distance and the Manhattan distance. It is
widely used in the field of Machine learning.
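
For an order p >= 1, the Minkowski distance is:

    d(x, y) = (|x1 - y1|^p + |x2 - y2|^p + ... + |xn - yn|^p)^(1/p)

With p = 1 it reduces to the Manhattan distance, and with p = 2 to the Euclidean distance. A minimal Python sketch computing all three (the sample points are made up):

    def minkowski(x, y, p):
        # Sum the p-th powers of the coordinate differences, then take the p-th root.
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

    x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
    print(minkowski(x, y, 1))  # Manhattan: 3 + 4 + 0 = 7.0
    print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 16 + 0) = 5.0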

Proximity Measures for Ordinal Attributes
An ordinal attribute is an attribute whose possible values have a meaningful order or ranking among them, but the magnitude between successive values is not known. To compute proximity, the states are first converted to numbers: each state of an ordinal attribute is assigned a number corresponding to its position in the order of attribute values.

Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height {tall, medium, short}.
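
Following the standard treatment, the rank r of a state (from 1 to M, where M is the number of states of the attribute) is mapped onto [0.0, 1.0] so that numeric distance measures can then be applied:

    z = (r - 1) / (M - 1)

For height {short, medium, tall}, the ranks 1, 2, 3 map to 0.0, 0.5, and 1.0.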

Cosine Similarity
Cosine similarity measures the similarity between two vectors of an inner product
space. It is measured by the cosine of the angle between two vectors and determines whether
two vectors are pointing in roughly the same direction. It is often used to measure document
similarity in text analysis.
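
For two vectors x and y, cosine similarity is the dot product divided by the product of the vector lengths:

    sim(x, y) = (x . y) / (||x|| ||y||)

A minimal Python sketch, using made-up term-frequency vectors for two documents:

    import math

    def cosine_similarity(x, y):
        dot = sum(a * b for a, b in zip(x, y))
        norm_x = math.sqrt(sum(a * a for a in x))
        norm_y = math.sqrt(sum(b * b for b in y))
        return dot / (norm_x * norm_y)

    doc1 = [5, 0, 3, 0]  # term counts for document 1
    doc2 = [3, 0, 2, 1]  # term counts for document 2
    print(cosine_similarity(doc1, doc2))  # ~0.96 -> the documents are very similar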

Data Pre-Processing
Data mining is the process of discovering interesting patterns and knowledge from
large amounts of data.

Data preprocessing is a data mining technique which is used to transform the raw data into a useful, efficient, and understandable format.

We cannot work with raw data. Raw data is highly susceptible to noise, missing values
and inconsistency. The quality of data affects the data mining results.

Preprocessing involves both data validation and data imputation.

The quality of the data should be checked before applying machine learning or data
mining algorithms.

Examples of Error Values

Out-of-range values (e.g., Income = –100)
Impossible data combinations (e.g., Gender: Male, Pregnant: Yes)

Purpose of Preprocessing
Preprocessing of data is mainly to check the data quality. The quality can be checked by
the following:
• Accuracy : To check whether the data entered is correct or not.
• Completeness : To check whether the data is available or not recorded.
• Consistency : To check whether the same data is kept consistently in all the places where it is stored.
• Timeliness : The data should be updated correctly.
• Believability : The data should be trustable.
• Interpretability : The understandability of the data.
Data Preprocessing methods

Data Preprocessing methods are divided into following categories:

1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation and Discretization

1. Data Cleaning:
Data cleaning is the process of removing incorrect or inaccurate data from the dataset; it also fills in or replaces missing values, smooths out noise while identifying outliers, and corrects inconsistencies in the data.

Handling of Missing Data

If many tuples (a tuple is a structure of data that has several parts) have no recorded value for several attributes, the missing values can be filled in for those attributes by various methods:

(i) Ignore the tuples:


This approach is suitable only when the dataset we have is quite large and multiple
values are missing within a tuple.

(ii) Fill the Missing values Manually:


Missing values can also be filled manually but it is not recommended when that dataset
is big. This approach is Time Consuming.

(iii) Use a global constant to fill in the missing value:


Standard values like “Not Available” or “NA” can be used to replace the missing values. Replace all missing attribute values by the same constant, such as a label like “Unknown” or −∞.

(iv) Use a measure of central tendency for the attribute (eg., the mean or median) to fill in the
missing value:
For normal (symmetric) data distributions, the attribute’s mean value can be used to
replace the missing value.

In the case of a skewed data distribution, the median value of the attribute can be used.

(v) Use the attribute mean or median for all samples belonging to the same class as the given
tuple:
For example, if classifying customers according to credit_risk, we may replace the
missing value with the mean income value for customers in the same credit risk category as
that of the given tuple. If the data distribution for a given class is skewed, the median value is a
better choice.

(vi) Use the most probable value to fill in the missing value:
While using regression or decision tree algorithms or Bayesian formalism, the missing
value can be replaced by the most probable value.
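
A minimal pandas sketch of methods (iii)-(v); the column names and values are hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        "credit_risk": ["low", "low", "high", "high"],
        "income": [42_000.0, None, 95_000.0, None],
        "city": ["Chennai", None, "Madurai", "Salem"],
    })

    # (iii) Global constant for a nominal attribute.
    df["city"] = df["city"].fillna("Unknown")

    # (iv) Central tendency: fill with the attribute's overall median.
    df["income_overall"] = df["income"].fillna(df["income"].median())

    # (v) Class-conditional fill: median income within the same credit_risk class.
    df["income"] = df.groupby("credit_risk")["income"].transform(
        lambda s: s.fillna(s.median()))
    print(df)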

Noisy Data
Noise generally means random error or unnecessary data points. Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc.
Here are some of the methods to handle noisy data.

(i) Binning Method:

This method is to smooth or handle noisy data.

This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size and then various methods are performed to complete the task. Each segment is handled separately. One can replace all data in a segment by its mean, or boundary values can be used to complete the task.

For example, you can bin the values for Age into categories such as 21-35, 36-59, and 60-
79.

There are three methods for smoothing data in the bin.

Smoothing by bin mean method:


In this method, the values in the bin are replaced by the mean value of the bin.

Smoothing by bin median:


In this method, the values in the bin are replaced by the median value of the bin.

Smoothing by bin boundary

In this method, the minimum and maximum values of the bin are taken as the bin boundaries, and each value in the bin is replaced by the closest boundary value.

The following example illustrates some binning techniques.
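
A minimal Python sketch of bin-mean and bin-boundary smoothing, using the classic sorted prices 4, 8, 15, 21, 21, 24, 25, 28, 34 partitioned into equal-frequency bins of three values each:

    data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
    bins = [data[i:i + 3] for i in range(0, len(data), 3)]  # e.g. [4, 8, 15], ...

    for b in bins:
        mean = sum(b) / len(b)
        by_mean = [round(mean, 1)] * len(b)
        # Boundary smoothing: each value moves to the nearer of min(b) / max(b).
        by_boundary = [b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
        print(b, "-> mean:", by_mean, " boundary:", by_boundary)
    # First bin: [4, 8, 15] -> mean: [9.0, 9.0, 9.0]  boundary: [4, 4, 15]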

(ii) Outlier analysis:

Outliers may be detected by clustering, where similar values are organized into groups or “clusters”. Intuitively, values that fall outside of the set of clusters may be considered outliers. Clustering is generally used in unsupervised learning.

(Figure: a 2-D plot of customer data with respect to customer locations in a city, showing three data clusters.)
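
A minimal sketch of clustering-based outlier detection, assuming scikit-learn is available; DBSCAN assigns the label -1 to points that belong to no dense cluster:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Hypothetical 2-D customer locations: three tight clusters plus one stray point.
    points = np.array([[1, 1], [1.2, 0.9], [5, 5], [5.1, 4.8],
                       [9, 1], [9.2, 1.1], [4, 9]])
    labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
    print(labels)                # e.g. [ 0  0  1  1  2  2 -1]
    print(points[labels == -1])  # the point(s) flagged as outliers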

(iii) Regression:

Regression is used to smooth the data and helps to handle unnecessary data. It also helps to decide which variable is suitable for our analysis.

In linear regression, one variable can be used to predict another (there is one independent variable).

In multiple linear regression, more than two variables are involved in the prediction (there are multiple independent variables).

2. Data Integration

Data integration is the process of combining multiple sources (multiple databases, data cubes, flat files) into a single dataset.

Databases and data warehouses typically have metadata – that is, data about the data.
Such metadata can be used to help avoid errors in schema integration.

Careful integration can help reduce and avoid redundancies and inconsistencies in the
resulting dataset. This can help improve the accuracy and speed of the subsequent data mining
process.

There are some problems to be considered during data integration.

• Entity identification problem: Identifying entities from multiple databases. For example, the system or the user should know that student_id in one database and student_name in another database belong to the same entity.

• Redundancy and Correlation Analysis:

Redundancy is another important issue in data integration. An attribute (such as annual revenue, for instance) may be redundant if it can be “derived” from another attribute or set of attributes. Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. For nominal data, we use the χ2 (chi-square) test. For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute’s values vary from those of another (a sketch of both checks follows this list).

• Tuple Duplication

In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level (e.g., where there are two or more identical tuples for a given unique data entry case). Inconsistencies often arise between various duplicates, due to inaccurate data entry or updating some but not all data occurrences.

• Detecting and resolving data value conflict: The data taken from different databases
while merging may differ. Like the attribute values from one database may differ from
another database. For example, the date format may differ like “MM/DD/YYYY” or
“DD/MM/YYYY”.
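
A minimal sketch of both redundancy checks, assuming numpy and scipy are available; the counts and values below are made up for illustration:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Chi-square test for two nominal attributes: a contingency table of
    # observed counts (rows: gender, columns: preferred reading).
    observed = np.array([[250, 200],
                         [50, 1000]])
    chi2, p_value, dof, expected = chi2_contingency(observed)
    print(chi2, p_value)  # a very small p-value suggests the attributes are correlated

    # Correlation coefficient for two numeric attributes.
    annual_revenue = [1.2, 2.4, 3.1, 4.0, 5.3]
    monthly_sales = [0.10, 0.21, 0.26, 0.33, 0.45]
    print(np.corrcoef(annual_revenue, monthly_sales)[0, 1])  # near +/-1 -> redundancy likely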

3. Data Reduction
Data mining is a technique that is used to handle huge amounts of data. While working with huge volumes of data, analysis becomes harder. To address this, we use data reduction techniques.

Data reduction obtains a reduced representation of data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results. It aims to increase the
storage efficiency and reduce data storage and analysis costs.

Some of the techniques in data reduction are

(i) Dimensionality reduction


Dimensionality reduction is the process of reducing the number of random variables or
attributes under consideration.
In dimensionality reduction, data encoding schemes are applied so as to obtain a reduced or “compressed” representation of the original data.

Attributes of the data are combined and merged without losing their original characteristics. This also reduces storage space and computation time. When data is highly dimensional, a problem called the “curse of dimensionality” occurs.

The effective methods of dimensionality reduction are:-


(i) Wavelet transforms and
(ii) PCA (Principal Component Analysis).
(iii) Attribute subset selection
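
A minimal sketch of PCA-based dimensionality reduction, assuming scikit-learn is available; the random matrix stands in for a real data set:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))        # 100 objects described by 10 attributes

    pca = PCA(n_components=3)             # keep the 3 strongest components
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape)                # (100, 3)
    print(pca.explained_variance_ratio_)  # fraction of variance each component keeps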

(ii) Numerosity reduction


In this method, the data is represented in a smaller form by reducing the original data volume. There is no loss of information in this reduction.

This makes it possible to store a model of the data instead of the whole data, for example regression models and log-linear models.

(iii) Data compression


Reducing data to a compressed representation is called data compression. The compression can be lossless or lossy. When there is no loss of information during compression, it is called lossless compression, whereas lossy compression discards some information, ideally only information that is unnecessary.

4. Data Transformation & Discretization


In data transformation, the data are transformed or consolidated into forms appropriate
for mining.

The change made in the format or the structure of the data is called data transformation.
This step can be simple or complex based on the requirements. There are some methods in
data transformation.

1. Smoothing:

With the help of algorithms, we can remove noise from the dataset, which helps in identifying the important features of the dataset. By smoothing we can detect even a simple change that helps in prediction. Such techniques include binning, clustering, and regression.

2. Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
3. Aggregation:

Where summary or aggregation operations are applied to the data. For Example, the
daily sales data may be aggregated so as to compute monthly and annual total amounts.

In this method, the data is stored and presented in the form of a summary. The data set, which may come from multiple sources, is integrated and described at the aggregated level. This is an important step, since the accuracy of the results depends on the quantity and quality of the data. When the quality and the quantity of the data are good, the results are more relevant.
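
A minimal pandas sketch of the daily-to-monthly roll-up described above; the dates and amounts are hypothetical:

    import pandas as pd

    daily = pd.DataFrame({
        "date": pd.to_datetime(["2024-01-03", "2024-01-17", "2024-02-05", "2024-02-20"]),
        "sales": [120.0, 80.0, 200.0, 150.0],
    })
    monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
    print(monthly)  # 2024-01: 200.0, 2024-02: 350.0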

4. Normalization:

It is done in order to scale the data values in a smaller specified range (-1.0 to 1.0 or
0.0 to 1.0)
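
A minimal sketch of min-max normalization to the range [0.0, 1.0], following the usual formula v' = (v - min) / (max - min); the values are made up:

    values = [200, 300, 400, 600, 1000]
    lo, hi = min(values), max(values)
    normalized = [(v - lo) / (hi - lo) for v in values]
    print(normalized)  # [0.0, 0.125, 0.25, 0.5, 1.0]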

5. Discretization:

This is done to replace the raw values of a numeric attribute by interval labels or conceptual labels.

The continuous data here is split into intervals. Discretization reduces the data size. For
example, rather than specifying the class time, we can set an interval like (3 pm-5 pm, 6 pm-8
pm).
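
A minimal pandas sketch of interval-based discretization, echoing the age bins mentioned earlier; the ages and interval edges are hypothetical:

    import pandas as pd

    ages = pd.Series([23, 31, 45, 52, 64, 70])
    groups = pd.cut(ages, bins=[20, 35, 59, 79], labels=["21-35", "36-59", "60-79"])
    print(groups.tolist())  # ['21-35', '21-35', '36-59', '36-59', '60-79', '60-79']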

Discretization techniques include


- Discretization by Binning
- Discretization by Histogram Analysis
- Discretization by Cluster, Decision Tree, and Correlation Analyses

6. Concept hierarchy generation for nominal data

Concept hierarchies can be used to reduce the data by collecting and replacing low-
level concepts with higher-level concepts.
