Unit I DATA MINING AAGAC
Unit-I
Introduction
Data mining is the process of discovering interesting patterns from massive amounts of
data. As a knowledge discovery process, it typically involves data cleaning, data integration,
data selection, data transformation, pattern discovery, pattern evaluation, and knowledge
presentation.
Data mining is used to extract valuable information from large sets of raw data.
Terms with a similar meaning to data mining are knowledge mining from data, knowledge
extraction, data/pattern analysis, data archaeology, and data dredging.
Data mining is searching for knowledge (interesting patterns) in data. It is also known as
knowledge discovery in databases (KDD).
Data mining is the process of discovering interesting patterns and knowledge from
large amounts of data. The data sources can include databases, data warehouses, the Web,
other information repositories, or data that are streamed into the system dynamically.
In general, data mining techniques are designed either to explain and understand the past
or to predict the future.
Data mining techniques are used to make decisions based on facts rather than intuition.
Relational databases are one of the most commonly available and richest information
repositories, and thus they are a major data form in the study of data mining.
Data Warehouses
Data warehouses store data from multiple databases, which makes the data easier to analyze.
Data warehousing is a process for collecting and managing data electronically from
varied sources to provide meaningful business insights. It is designed to analyze, report on,
and integrate transaction data from different sources.
Data warehouses are constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing.
A data warehouse is usually modeled by a multidimensional data structure, called a
data cube.
There are many other kinds of data that have versatile forms and structures and rather
different semantic meanings. Such kinds of data can be seen in many applications: time-related
or sequence data (e.g., historical records, stock exchange data, and time-series and biological
sequence data), data streams (e.g., video surveillance and sensor data, which are continuously
transmitted), spatial data (e.g., maps), engineering design data (e.g., the design of buildings,
system components, or integrated circuits), hypertext and multimedia data (including text,
image, video, and audio data), graph and networked data (e.g., social and information
networks), and the Web (a huge, widely distributed information repository made available by
the Internet).
- Descriptive mining tasks characterize properties of the data in a target data set.
It answers questions such as: What is in the data? What does the data look like? Are there any
unusual patterns? What does the data suggest for customer segmentation? The user may have
no idea which kinds of patterns would be interesting.
- Predictive mining tasks perform induction on the current data in order to make
predictions. It is suitable for making predictions about unseen data.
It answers questions such as: Which product will give a high profit? Which customers are
likely to leave in the next six months? Does a patient have a specific disease, based on his or
her medical test results?
1) Class/Concept Description: Characterization and Discrimination
Data entries can be associated with classes or concepts. For example, in the
AllElectronics store, classes of items for sale include computers and printers, and concepts of
customers include bigSpenders and budgetSpenders. It can be useful to describe individual
classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a
class or a concept are called class/concept descriptions. These descriptions can be derived
using (1) data characterization, by summarizing the data of the class under study (often called
the target class) in general terms, or (2) data discrimination, by comparison of the target class
with one or a set of comparative classes (often called the contrasting classes), or (3) both data
characterization and discrimination.
Regression analysis is a statistical methodology that is most often used for numeric
prediction, although other methods exist as well. Regression also encompasses the
identification of distribution trends based on the available data.
Cluster Analysis
Unlike classification and regression, which analyze class-labeled (training) data sets,
clustering analyzes data objects without consulting class labels.
Outlier Analysis
A data set may contain objects that do not comply with the general behavior or model
of the data. These data objects are outliers. Many data mining methods discard outliers as
noise or exceptions.
Statistics
A statistical model is a set of mathematical functions that describe the behavior of the
objects in a target class in terms of random variables and their associated probability
distributions. Statistical models are widely used to model data and data classes.
Database systems research focuses on the creation, maintenance, and use of databases
for organizations and end-users. Particularly, database systems researchers have established
highly recognized principles in data models, query languages, query processing and
optimization methods, data storage, and indexing and accessing methods.
A data warehouse integrates data originating from multiple sources and various
timeframes. It consolidates data in multidimensional space to form partially materialized data
cubes.
Information Retrieval
Information retrieval (IR) is the science of searching for documents or information in
documents. Documents can be text or multimedia, and may reside on the Web. The differences
between traditional information retrieval and database systems are twofold:
Information retrieval assumes that (1) the data under search are unstructured; and (2)
the queries are formed mainly by keywords, which do not have complex structures.
Data mining techniques have been developed and used, including association,
classification, clustering, prediction, sequential patterns, and regression.
(i) Classification. Classification is a more complex data mining technique. It is used to obtain
important and relevant information about data and metadata, and it helps to assign data to
different classes.
(ii) Clustering. Clustering is very similar to classification, but involves grouping chunks of
data together based on their similarities. Clustering is a division of information into
groups of connected objects.
(iii) Regression. Regression analysis is the data mining process used to identify and analyze
the relationship between variables in the presence of other factors. It is used to estimate the
probability of a specific variable. Regression is primarily a form of planning and modelling.
For example, we might use it to project certain costs, depending on other factors such as
availability, consumer demand, and competition. Primarily, it gives the relationship between
two or more variables in the given data set.
(iv) Outlier Detection. This type of data mining technique relates to the observation of data
items in the data set which do not match an expected pattern or expected behaviour.
This technique may be used in various domains such as intrusion detection, fraud
detection, etc. It is also known as outlier analysis or outlier mining.
(v) Sequential Patterns: The sequential pattern is a data mining technique specialized for
evaluating sequential data to discover sequential patterns. It involves finding
interesting subsequences in a set of sequences, where the value of a sequence can be
measured in terms of different criteria such as length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize
similar patterns in transaction data over time.
(vi) Prediction. Prediction is one of the most valuable data mining techniques. Prediction
uses a combination of other data mining techniques such as trends, clustering,
classification, etc. It analyzes past events or instances in the right sequence to predict a
future event.
(vii) Association Rules. Association is related to tracking patterns, but is more specific to
dependently linked variables. This data mining technique helps to discover a link
between two or more items. It finds a hidden pattern in the data set.
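To make the idea of an association rule concrete, the following small Python sketch computes
the support and confidence of a hypothetical rule {milk} -> {bread}; the transaction list and
item names are invented for illustration and are not from the text.

# Hypothetical market-basket transactions (invented for illustration).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

antecedent, consequent = {"milk"}, {"bread"}

n_total = len(transactions)
n_antecedent = sum(1 for t in transactions if antecedent <= t)
n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

support = n_both / n_total            # fraction of transactions containing both items
confidence = n_both / n_antecedent    # of those containing milk, how many also contain bread

print(f"support = {support:.2f}, confidence = {confidence:.2f}")

Support measures how often the items appear together in the data, while confidence measures
how often the rule holds when the antecedent is present.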
Machine Learning
Machine Learning is the science of getting computers to learn and act as humans do,
and to improve their learning over time in an autonomous fashion, by feeding them data and
information in the form of observations and real-world interactions.
The process of learning begins with observations or data, such as examples, direct
experience, or instruction, in order to look for patterns in data and make better decisions in the
future based on the examples that we provide. The primary aim is to allow computers to learn
automatically, without human intervention or assistance, and to adjust actions accordingly.
(ii) Unsupervised learning. This type of machine learning involves algorithms that
train on unlabeled data. The algorithm scans through data sets looking for any
meaningful connection. Unlike supervised learning, neither the data the algorithms
train on nor the predictions or recommendations they output are predetermined.
(iii) Semi-supervised learning. This approach to machine learning involves a mix of the
two preceding types. Data scientists may feed an algorithm mostly labeled training
data, but the model is free to explore the data on its own and develop its own
understanding of the data set.
Reinforcement learning is often used in areas like:
• Robotics. Robots can learn to perform tasks in the physical world using this
technique.
• Video gameplay. Reinforcement learning has been used to teach bots to play
a number of video games.
• Resource management. Given finite resources and a defined goal,
reinforcement learning can help enterprises plan how to allocate resources.
BI and analytics vendors use machine learning in their software to identify potentially
important data points, patterns of data points and anomalies. Without data mining, many
businesses may not be able to perform effective market analysis, compare customer feedback
on similar products, discover the strengths and weaknesses of their competitors, retain highly
valuable customers, and make smart business decisions. Clearly, data mining is the core of
business intelligence.
Web Search Engines -- A Web search engine is a specialized computer server that searches
for information on the Web. Web search engines are essentially very large data mining
applications. Various data mining techniques are used in all aspects of search engines, ranging
from crawling and indexing to searching.
Customer relationship management -- CRM software can use machine learning models to
analyze email and prompt sales team members to respond to the most important messages first.
More advanced systems can even recommend potentially effective responses.
Human resource information systems -- HRIS systems can use machine learning models to
filter through applications and identify the best candidates for an open position.
Self-driving cars -- Machine learning algorithms can even make it possible for a semi-
autonomous car to recognize a partially visible object and alert the driver.
Virtual assistants -- Smart assistants typically combine supervised and unsupervised machine
learning models to interpret natural speech and supply context.
Image Recognition - Image recognition is one of the most common uses of machine learning.
Speech Recognition - Speech recognition is the translation of spoken words into text. It is
also known as computer speech recognition or automatic speech recognition.
Medical diagnosis -- Machine learning can be used in techniques and tools that help in
the diagnosis of diseases.
Deep Learning
Deep Learning is a subfield of machine learning concerned with algorithms inspired by
the structure and function of the brain called artificial neural networks.
Steps in Data Mining Process
The term Knowledge Discovery in Databases, or KDD for short, refers to the broad
process of finding knowledge in data, and emphasizes the "high-level" application of particular
data mining methods. It is of interest to researchers in machine learning, pattern recognition,
databases, statistics, artificial intelligence, knowledge acquisition for expert systems, and data
visualization.
1. Data cleaning - to remove noise and inconsistent data. The data we have collected are not
clean and may contain errors, missing values, noisy or inconsistent data.
2. Data integration - where multiple data sources may be combined. First of all the data are
collected and integrated from all the different sources.
3. Data selection - where data relevant to the analysis task are retrieved from the database.
We may not need all the data we have collected in the first step, so in this step we select only
the data which we think will be useful for data mining.
4. Data transformation – Transforming data to a proper format (where data are transformed
and consolidated into forms appropriate for mining by performing summary or aggregation
operations)
5. Data mining- an essential process where intelligent methods are applied to extract data
patterns. Techniques (Algorithms) like clustering and association analysis are among the
many different techniques used for data mining.
Steps 1 through 4 are different forms of data preprocessing, where data are prepared for
mining. The data mining step may interact with the user or a knowledge base. The interesting
patterns are presented to the user and may be stored as new knowledge in the knowledge base.
The preceding view shows data mining as one step in the knowledge discovery process,
albeit an essential one because it uncovers hidden patterns for evaluation.
The following issues continue to stimulate further investigation and improvement in data mining.
(i) Mining Methodology
Researchers have been vigorously developing new data mining methodologies. This
involves the investigation of new kinds of knowledge, mining in multidimensional space,
integrating methods from other disciplines, and the consideration of semantic ties among data
objects.
(ii) User Interaction
The user plays an important role in the data mining process. Interesting areas of
research include the following:
i) Interactive mining
ii) Incorporation of background knowledge
iii) Ad hoc data mining and data mining query languages
iv) Presentation and visualization of data mining results
(iii) Efficiency and Scalability
Efficiency and scalability are always considered when comparing data mining
algorithms. As data amounts continue to multiply, these two factors are especially critical.
(iv) Diversity of Database Types
The wide diversity of database types brings about challenges to data mining.
(v) Data Mining and Society
How does data mining impact society? What steps can data mining take to preserve the
privacy of individuals? Do we use data mining in our daily lives without even knowing that we
do? These questions raise the following issues:
Attribute
Nominal Attributes
Binary Attributes
Ordinal Attributes
Numeric Attributes
A discrete attribute has a finite or countably infinite set of values, which may
or may not be represented as integers.
If an attribute is not discrete, it is continuous.
Data Visualization
Data visualization aims to communicate data clearly and effectively through graphical
representation. Data visualization has been used extensively in many applications.
(iii) Icon-Based Visualization Techniques
Icon-based visualization techniques use small icons to represent
multidimensional data values. We look at two popular icon-based techniques:
Chernoff faces and stick figures.
In data mining, the similarity measure is a way of measuring how data samples are
related or close to each other. On the other hand, the dissimilarity measure tells how
distinct the data objects are.
A dissimilarity measure works the opposite way. It returns a value of 0 if the objects are
the same. The higher the dissimilarity value, the more dissimilar the two objects are.
A data matrix is made up of two entities or “things,” namely rows (for objects) and
columns (for attributes). Therefore, the data matrix is often called a two-mode matrix. The
dissimilarity matrix contains one kind of entity (dissimilarities) and so is called a one-mode
matrix.
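As an illustration of the two structures, the following Python sketch (with invented numbers)
builds a small data matrix and derives the corresponding one-mode dissimilarity matrix using
SciPy:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix (two-mode): 4 objects (rows) x 2 attributes (columns)
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [5.0, 1.0]])

# Dissimilarity matrix (one-mode): pairwise Euclidean distances, 0 on the diagonal
D = squareform(pdist(X, metric="euclidean"))
print(D.round(2))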
Proximity Measures for Nominal Attributes
Nominal attributes can have two or more different states, e.g., an attribute ‘color’ can
have values such as ‘Red’, ‘Green’, ‘Yellow’, ‘Blue’, etc. Dissimilarity for nominal attributes is
calculated as the ratio of the total number of mismatches between two data points to the total
number of attributes.
Nominal means “relating to names.”
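A minimal Python sketch of this mismatch ratio; the two objects and their attribute values are
made up for illustration:

# Dissimilarity d(i, j) = (p - m) / p for nominal attributes,
# where p is the number of attributes and m the number of matching states.
def nominal_dissimilarity(obj1, obj2):
    p = len(obj1)                                      # total number of attributes
    m = sum(1 for a, b in zip(obj1, obj2) if a == b)   # number of matching states
    return (p - m) / p

x = ["Red", "Circle", "Large"]
y = ["Red", "Square", "Small"]
print(round(nominal_dissimilarity(x, y), 2))  # 2 mismatches out of 3 attributes -> 0.67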
Euclidean Distance
The Euclidean distance d between two points x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn) is
given by the following formula:

d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
Manhattan Distance
The Manhattan distance, often called Taxicab distance or City Block distance,
calculates the distance between real-valued vectors. Imagine vectors that describe objects on a
uniform grid, such as a chessboard. Manhattan distance then refers to the distance between two
vectors if they could only move at right angles. There is no diagonal movement involved in
calculating the distance.
Minkowski Distance
Minkowski distance is a distance measured between two points in N-dimensional space.
It is basically a generalization of the Euclidean distance and the Manhattan distance. It is
widely used in the field of Machine learning.
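The following Python sketch (with arbitrary sample points) illustrates how the Minkowski
distance generalizes the other two measures: order h = 1 gives the Manhattan distance and
h = 2 gives the Euclidean distance.

import numpy as np

def minkowski(x, y, h):
    # Minkowski distance of order h between two real-valued vectors
    return np.sum(np.abs(x - y) ** h) ** (1.0 / h)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

print(minkowski(x, y, 1))  # Manhattan (city block) distance: |1-4| + |2-6| + |3-3| = 7
print(minkowski(x, y, 2))  # Euclidean distance: sqrt(9 + 16 + 0) = 5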
Proximity Measures for Ordinal Attributes
An ordinal attribute is an attribute whose possible values have a meaningful order or
ranking among them, but the magnitude between successive values is not known. To compute
proximity, it is important to convert the states to numbers, where each state of an ordinal
attribute is assigned a number corresponding to its position in the order of attribute values.
Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height
{tall, medium, short}.
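One common convention, shown in the Python sketch below using the height attribute from
the example above, is to replace each state by its rank r and normalize it to the range [0, 1] as
(r - 1) / (M - 1), where M is the number of states; the normalization choice is an assumption
for illustration.

# Ordinal states in their meaningful order (spacing between them is unknown).
states = ["short", "medium", "tall"]
rank = {s: i + 1 for i, s in enumerate(states)}
M = len(states)

def ordinal_to_number(value):
    # Map a state to its normalized rank in [0, 1].
    return (rank[value] - 1) / (M - 1)

print(ordinal_to_number("short"), ordinal_to_number("medium"), ordinal_to_number("tall"))
# 0.0 0.5 1.0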
Cosine Similarity
Cosine similarity measures the similarity between two vectors of an inner product
space. It is measured by the cosine of the angle between two vectors and determines whether
two vectors are pointing in roughly the same direction. It is often used to measure document
similarity in text analysis.
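A minimal Python sketch of cosine similarity between two invented term-frequency vectors
(toy document vectors, not taken from the text):

import numpy as np

def cosine_similarity(x, y):
    # Cosine of the angle between two vectors: dot product over product of norms.
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

doc1 = np.array([3, 0, 1, 2])   # term counts for document 1
doc2 = np.array([1, 1, 0, 1])   # term counts for document 2
print(round(cosine_similarity(doc1, doc2), 2))  # roughly 0.77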
Data Pre-Processing
Data mining is the process of discovering interesting patterns and knowledge from
large amounts of data.
Data preprocessing is a data mining technique used to transform raw data into a useful,
efficient, and understandable format.
We cannot work with raw data. Raw data is highly susceptible to noise, missing values
and inconsistency. The quality of data affects the data mining results.
The quality of the data should be checked before applying machine learning or data
mining algorithms.
Purpose of Preprocessing
Preprocessing of data is mainly to check the data quality. The quality can be checked by
the following:
• Accuracy : To check whether the data entered is correct or not.
• Completeness : To check whether all the required data is available and recorded.
• Consistency : To check whether the same data is kept consistently in all the places
where it is stored.
• Timeliness : The data should be updated correctly and kept up to date.
• Believability : The data should be trustable.
• Interpretability : The understandability of the data.
Data Preprocessing methods
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation and Discretization
1. Data Cleaning:
Data cleaning is the process of removing incorrect or inaccurate data from the data set;
it also fills in missing values, smooths out noise while identifying outliers, and corrects
inconsistencies in the data.
Handling of Missing Data
If many tuples (a tuple is a structure of data that has several parts) have no recorded value for
one or more attributes, then the missing values can be filled in for the attribute by various
methods:
(iv) Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the
missing value:
For normal (symmetric) data distributions, the attribute’s mean value can be used to
replace the missing value.
Whereas in the case of a skewed data distribution, the median value of the attribute can be
used.
(v) Use the attribute mean or median for all samples belonging to the same class as the given
tuple:
For example, if classifying customers according to credit_risk, we may replace the
missing value with the mean income value for customers in the same credit risk category as
that of the given tuple. If the data distribution for a given class is skewed, the median value is a
better choice (see the sketch after this list).
(vi) Use the most probable value to fill in the missing value:
While using regression or decision tree algorithms or Bayesian formalism, the missing
value can be replaced by the most probable value.
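The following pandas sketch illustrates methods (iv) and (v) above. The column names
(income, credit_risk) follow the credit-risk example, but the table, its values, and the names of
the filled columns are invented for illustration:

import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [50000, None, 20000, None, 60000],
})

# (iv) Fill with an overall measure of central tendency (here the mean; use the
#      median instead for a skewed distribution).
df["income_mean_filled"] = df["income"].fillna(df["income"].mean())

# (v) Fill with the mean income of tuples belonging to the same credit_risk class.
df["income_class_filled"] = df.groupby("credit_risk")["income"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)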
Noisy Data
Noise generally means random error or unnecessary data points.
Noisy data is meaningless data that cannot be interpreted by machines.
It can be generated due to faulty data collection, data entry errors, etc.
Here are some of the methods to handle noisy data.
(i) Binning:
This method works on sorted data in order to smooth it. The whole data set is divided into
segments (bins) of equal size, and then various methods are performed on each segment. Each
segment is handled separately: one can replace all data in a segment by its mean, or boundary
values can be used to complete the task (see the sketch below).
For example, you can bin the values for Age into categories such as 21-35, 36-59, and 60-79.
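A small Python sketch of smoothing by bin means, as mentioned above; the sorted values and
the choice of three equal-size bins are invented for illustration:

import numpy as np

# Sorted data to be smoothed (invented values).
values = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))

# Partition the sorted data into three equal-frequency bins.
bins = np.array_split(values, 3)

# Smoothing by bin means: every value in a bin is replaced by the bin's mean.
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)  # [9. 9. 9. 22. 22. 22. 29. 29. 29.]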
(ii) Outlier analysis:
Outliers may be detected by clustering, for example, where similar values are
organized into groups or “clusters”. Intuitively, values that fall outside of the set of clusters
may be considered outliers. Clustering is generally used in unsupervised learning.
(Figure: A 2-D plot of customer data with respect to customer locations in a city, showing
three data clusters.)
(iii) Regression:
This is used to smooth the data and helps to handle data when unnecessary variation is
present. Regression also helps to decide which variable is suitable for our analysis.
In linear regression, one variable is used to predict the other (there is one
independent variable).
In multiple linear regression, more than two variables are involved in the prediction
(there are multiple independent variables).
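As a rough illustration of smoothing by regression, the following Python sketch fits a simple
linear regression with NumPy and replaces the noisy observations with the fitted values; the
data points are invented, and any regression routine could be used in place of polyfit:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])   # noisy observations

slope, intercept = np.polyfit(x, y, deg=1)      # fit y ~ slope * x + intercept
y_smoothed = slope * x + intercept              # replace values with the fitted line
print(y_smoothed.round(2))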
2. Data Integration
The process of combining multiple sources (multiple databases, data cubes, flat files)
into a single dataset.
Databases and data warehouses typically have metadata – that is, data about the data.
Such metadata can be used to help avoid errors in schema integration.
Careful integration can help reduce and avoid redundancies and inconsistencies in the
resulting dataset. This can help improve the accuracy and speed of the subsequent data mining
process.
• Tuple Duplication
In addition to detecting redundancies between attributes, duplication should also
be detected at the tuple level (e.g., where there are two or more identical tuples for a
given unique data entry case). Inconsistencies often arise between various duplicates,
due to inaccurate data entry or updating some but not all data occurrences.
• Detecting and resolving data value conflict: The data taken from different databases
while merging may differ. Like the attribute values from one database may differ from
another database. For example, the date format may differ like “MM/DD/YYYY” or
“DD/MM/YYYY”.
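A small pandas sketch of resolving such a conflict: two invented sources that store dates as
MM/DD/YYYY and DD/MM/YYYY are converted to one common representation before
merging; the table names and values are made up for illustration.

import pandas as pd

source_a = pd.DataFrame({"order_id": [1, 2], "date": ["03/04/2024", "12/25/2023"]})
source_b = pd.DataFrame({"order_id": [3, 4], "date": ["04/03/2024", "25/12/2023"]})

# Parse each source with its own date format so both end up in one representation.
source_a["date"] = pd.to_datetime(source_a["date"], format="%m/%d/%Y")
source_b["date"] = pd.to_datetime(source_b["date"], format="%d/%m/%Y")

merged = pd.concat([source_a, source_b], ignore_index=True)
print(merged)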
3. Data Reduction
Data mining is a technique that is used to handle huge amounts of data. While working
with a huge volume of data, analysis becomes harder; in such cases, we use data reduction
techniques.
Data reduction obtains a reduced representation of data set that is much smaller in
volume, yet produces the same (or almost the same) analytical results. It aims to increase the
storage efficiency and reduce data storage and analysis costs.
Combining and merging the attributes of the data without losing its original
characteristics also helps to reduce storage space and computation time. When the data is
highly dimensional, the problem called the “Curse of Dimensionality” occurs.
Another approach enables storing a model of the data instead of the whole data, for example
regression models and log-linear models.
4. Data Transformation and Discretization
The change made in the format or the structure of the data is called data transformation.
This step can be simple or complex based on the requirements. There are several methods of
data transformation.
1. Smoothing:
With the help of algorithms, we can remove noise from the data set, which helps in
knowing the important features of the data set. By smoothing we can find even a simple change
that helps in prediction. Such techniques include binning, clustering, and regression.
2. Attribute construction (or feature construction), where new attributes are constructed and
added from the given set of attributes to help the mining process.
3. Aggregation:
Summary or aggregation operations are applied to the data. For example, the
daily sales data may be aggregated so as to compute monthly and annual total amounts.
In this method, the data is stored and presented in the form of a summary. Data
from multiple sources is integrated and summarized for data analysis. This is an
important step, since the accuracy of the results depends on the quantity and quality of the data.
When the quality and the quantity of the data are good, the results are more relevant.
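A small pandas sketch of this kind of aggregation, rolling invented daily sales totals up into
monthly totals:

import pandas as pd

daily = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-17"]),
    "sales": [200.0, 150.0, 300.0, 120.0],
})

# Group the daily records by calendar month and sum the sales in each month.
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)   # one aggregated total per month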
4. Normalization:
It is done in order to scale the data values into a smaller specified range, such as -1.0 to 1.0 or
0.0 to 1.0.
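A minimal sketch of one common normalization method, min-max normalization, which
rescales invented values into the range 0.0 to 1.0:

import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: (v - min) / (max - min) maps the values into [0, 1].
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)   # 0.0, 0.125, 0.25, 0.5, 1.0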
5. Discretization:
This is done to replace the raw values of a numeric attribute by interval labels or
conceptual labels.
The continuous data here is split into intervals. Discretization reduces the data size. For
example, rather than specifying the class time, we can set an interval like (3 pm-5 pm, 6 pm-8
pm).
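A small pandas sketch of discretization that mirrors the class-time example; the time values
and the interval boundaries are invented for illustration:

import pandas as pd

times = pd.Series([15.5, 16.0, 18.5, 19.75])    # class times as hours of the day

# Replace raw numeric times with interval labels.
intervals = pd.cut(times, bins=[15, 17, 20], labels=["3 pm-5 pm", "6 pm-8 pm"])
print(intervals)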
Concept hierarchies can be used to reduce the data by collecting and replacing low-
level concepts with higher-level concepts.