DM Notes

Data mining is the process of extracting knowledge from large datasets, focusing on discovering patterns, predicting outcomes, and creating actionable information. It encompasses various tasks such as anomaly detection, association rule learning, clustering, classification, regression, and summarization, while facing challenges like handling noisy data and ensuring efficiency. Data mining is widely applied in sectors like finance, retail, telecommunications, and biology to derive insights and improve decision-making.

Uploaded by

kimrushda07

UNIT – 1

 What Is Data Mining?

Data mining refers to extracting, or mining, knowledge from large amounts of data. The term is
actually a misnomer: what is mined is knowledge, not data, so it would more appropriately have been
named knowledge mining from large amounts of data.

It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and transform
it into an understandable structure for further use.

 The key properties of data mining are:


 Automatic discovery of patterns.
 Prediction of likely outcomes.
 Creation of actionable information.
 Focus on large data sets and databases.

 The Scope of Data Mining

Data mining derives its name from the similarities between searching for valuable business
information in a large database — for example, finding linked products in gigabytes of store
scanner data — and mining a mountain for a vein of valuable ore. Both processes require either
sifting through an immense amount of material, or intelligently probing it to find exactly where
the value resides. Given databases of sufficient size and quality, data mining technology can
generate new business opportunities by providing these capabilities:

Created by Ashwini Gopwad


Automated prediction of trends and behaviors. Data mining automates the process of finding
predictive information in large databases. Questions that traditionally required extensive hands-
on analysis can now be answered directly from the data — quickly. A typical example of a
predictive problem is targeted marketing. Data mining uses data on past promotional mailings to
identify the targets most likely to maximize return on investment in future mailings. Other
predictive problems include forecasting bankruptcy and other forms of default, and identifying
segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns. Data mining tools sweep through
databases and identify previously hidden patterns in one step. An example of pattern discovery is
the analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions
and identifying anomalous data that could represent data entry keying errors.

 Tasks of Data Mining

Data mining involves six common classes of tasks:

 Anomaly detection (outlier/change/deviation detection) – The identification of unusual
data records that might be interesting, or data errors that require further investigation.
 Association rule learning (dependency modelling) – Searches for relationships between
variables. For example, a supermarket might gather data on customer purchasing habits.
Using association rule learning, the supermarket can determine which products are
frequently bought together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
 Clustering – The task of discovering groups and structures in the data that are in some
way or another "similar", without using known structures in the data.
 Classification – The task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
 Regression – Attempts to find a function which models the data with the least error.
 Summarization – Providing a more compact representation of the data set, including
visualization and report generation.
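
As an illustration of the regression task, here is a minimal sketch of fitting a straight line by ordinary least squares, i.e. choosing the slope and intercept that minimise the sum of squared errors. The function name `fit_line` and the spend/sales figures are made up for this example; they do not come from the notes.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend vs. sales (illustrative numbers only).
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(spend, sales)
print(slope, intercept)
```

Real data mining tools fit far richer models, but the principle is the same: pick the function whose error on the data is smallest.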

 Major Issues in Data Mining:

 Mining different kinds of knowledge in databases. - Different users have different needs and
may be interested in different kinds of knowledge. Data mining therefore has to cover a broad
range of knowledge discovery tasks.

 Interactive mining of knowledge at multiple levels of abstraction. - The data mining
process needs to be interactive, so that users can focus the search for patterns,
providing and refining data mining requests based on the returned results.

 Incorporation of background knowledge. - Background knowledge can be used to guide the
discovery process and to express the discovered patterns, not only in concise terms but also at
multiple levels of abstraction.


 Data mining query languages and ad hoc data mining. - A data mining query language that lets
the user describe ad hoc mining tasks should be integrated with a data warehouse query
language and optimized for efficient and flexible data mining.

 Presentation and visualization of data mining results. - Once patterns are discovered, they need
to be expressed in high-level languages and visual representations that users can easily
understand.

 Handling noisy or incomplete data. - Data cleaning methods are required to handle noise and
incomplete objects while mining data regularities. Without such methods, the accuracy of the
discovered patterns will be poor.

 Pattern evaluation. - This refers to the interestingness of the discovered patterns. Many
discovered patterns are uninteresting because they represent common knowledge or lack novelty,
so measures of interestingness are needed to assess them.

 Efficiency and scalability of data mining algorithms. - To extract information effectively
from the huge amounts of data in databases, data mining algorithms must be efficient and
scalable.

 Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the data
into partitions, which are processed in parallel, and the results from the partitions are then merged.
Incremental algorithms incorporate database updates without having to mine the data again from scratch.
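
The partition-and-merge idea, and the incremental update, can be sketched as follows. This is only an illustration under simplifying assumptions: the "mining" step is a plain item count, the partitions are processed one after another rather than on separate workers, and the helper names `count_items` and `merge_counts` are invented for this example.

```python
from collections import Counter


def count_items(transactions):
    """Count, within one partition, how many transactions contain each item."""
    counts = Counter()
    for basket in transactions:
        counts.update(set(basket))  # set() so an item counts once per basket
    return counts


def merge_counts(partial_counts):
    """Merge the per-partition results into one global result."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Two partitions of a hypothetical transaction database.
partitions = [
    [["bread", "milk"], ["bread", "butter"]],
    [["milk"], ["bread", "milk", "butter"]],
]

# Each partition could be handed to a separate worker; here they run in sequence.
partials = [count_items(p) for p in partitions]
totals = merge_counts(partials)

# Incremental step: fold in a new batch without recounting the old data.
new_batch = [["butter", "jam"]]
totals.update(count_items(new_batch))
print(totals)
```

Counting merges cleanly because partial counts simply add up. Not every mining result can be combined this easily, which is part of why this is listed as an open issue.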

 Major challenges in data mining:

1. Parallel, distributed, stream, and incremental mining methods
2. Handling high-dimensionality
3. Handling noise, uncertainty and incompleteness of data
4. Incorporation of constraints, expert knowledge and background knowledge in data mining
5. Pattern evaluation and knowledge integration
6. Mining diverse and heterogeneous kinds of data: e.g., bioinformatics, web, software/system
engineering, information networks
7. Application-oriented and domain specific data mining
8. Invisible data mining (embedded in other functional modules)
9. Protection of security, integrity, and privacy in data mining

 Difference between DBMS vs data mining:

A DBMS (Database Management System) is a complete system used for managing digital databases that
allows storage of database content, creation/maintenance of data, search and other functionalities. On the other
hand, Data Mining is a field in computer science, which deals with the extraction of previously unknown and
interesting information from raw data. Usually, the data used as the input for the Data mining process is stored
in databases. Users who are inclined toward statistics use Data Mining. They utilize statistical models to look
for hidden patterns in data. Data miners are interested in finding useful relationships between different data
elements, which is ultimately profitable for businesses.

 DBMS:
A DBMS, sometimes just called a database manager, is a collection of computer programs dedicated to
the management (i.e., organization, storage and retrieval) of all databases installed on a system (i.e.,
a hard drive or network). There are different types of Database Management Systems, and
some of them are designed for the proper management of databases configured for specific purposes.
Popular commercial Database Management Systems include Oracle, DB2 and Microsoft Access. All these
products provide means of allocating different levels of privileges to different users, making it possible
for a DBMS to be controlled centrally by a single administrator or to be allocated to several different people.
There are four important elements in any Database Management System: the modeling language,
data structures, the query language and the transaction mechanism. The modeling language defines the schema
of each database hosted in the DBMS. Several popular approaches, such as hierarchical, network,
relational and object, are currently in practice. Data structures help organize the data, such as individual records, files,
fields and their definitions, and objects such as visual media. The data query language is used to retrieve and
update the data, while the DBMS maintains the security of the database by checking login data and the access
rights of different users. SQL is a popular query language used in Relational Database Management Systems.
Finally, the transaction mechanism supports concurrency: it makes sure
that the same record is not modified by multiple users at the same time, thus keeping data integrity
intact. Additionally, a DBMS provides backup and other facilities.
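
To make the query language and transaction mechanism concrete, here is a small sketch using Python's built-in `sqlite3` module. The `accounts` table, the account names, and the `transfer` helper are hypothetical and exist only for this example. SQLite is used because it needs no server, not because the notes mention it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO accounts (owner, balance) VALUES (?, ?)",
    [("alice", 100.0), ("bob", 50.0)],
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts: both updates happen, or neither does."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE owner = ?",
            (amount, src),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE owner = ?", (src,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE owner = ?",
            (amount, dst),
        )


transfer(conn, "alice", "bob", 30.0)       # succeeds and is committed
try:
    transfer(conn, "bob", "alice", 500.0)  # fails and is rolled back
except ValueError:
    pass

balances = dict(conn.execute("SELECT owner, balance FROM accounts ORDER BY owner"))
print(balances)
```

The failed transfer leaves no partial update behind, which is the integrity guarantee the transaction mechanism is meant to provide.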

 Data mining:
Data mining is also known as Knowledge Discovery in Data (KDD). As mentioned above, it is a field of
computer science which deals with the extraction of previously unknown and interesting information from
raw data. Due to the exponential growth of data, especially in areas such as business, data mining has become
a very important tool for converting this wealth of data into business intelligence, as manual extraction of
patterns has become practically impossible over the past few decades. For example, it is currently used for
applications such as social network analysis, fraud detection and marketing. Data mining usually deals
with the following four tasks: clustering, classification, regression, and association. Clustering is identifying
similar groups in unlabeled data. Classification is learning rules that can be applied to new data, and
typically includes the following steps: data preprocessing, model design, learning/feature selection, and
evaluation/validation. Regression is finding functions that model the data with minimal error. Association is
looking for relationships between variables. Data mining is typically used to answer questions such as: which
products are likely to bring Wal-Mart the highest profit next year?

 What is the difference between DBMS and Data mining?


A DBMS is a full-fledged system for housing and managing a set of digital databases. Data mining, by contrast, is
a technique or concept in computer science which deals with extracting useful and previously unknown
information from raw data. Most of the time, this raw data is stored in very large databases. Data miners
therefore use the existing functionality of a DBMS to handle, manage and even pre-process raw data before
and during the data mining process. A DBMS alone cannot be used to analyse data, although
some DBMSs now have built-in data analysis tools or capabilities.

 Discuss with data mining techniques:

 Data Mining Applications
Here is the list of areas where data mining is widely used −

 Financial Data Analysis
 Retail Industry
 Telecommunication Industry
 Biological Data Analysis
 Other Scientific Applications
 Intrusion Detection
Financial Data Analysis

The financial data in banking and financial industry is generally reliable and of
high quality which facilitates systematic data analysis and data mining. Some of
the typical cases are as follows −

 Design and construction of data warehouses for multidimensional
data analysis and data mining.
 Loan payment prediction and customer credit policy analysis.
 Classification and clustering of customers for targeted marketing.
 Detection of money laundering and other financial crimes.
Retail Industry

Data mining has great application in the retail industry because the industry collects large
amounts of data on sales, customer purchasing history, goods transportation,
consumption and services. It is natural that the quantity of data collected will
continue to expand rapidly because of the increasing ease, availability and
popularity of the web.

Data mining in the retail industry helps in identifying customer buying patterns and
trends that lead to improved quality of customer service and good customer
retention and satisfaction. Here is the list of examples of data mining in the retail
industry −

 Design and Construction of data warehouses based on the benefits of
data mining.
 Multidimensional analysis of sales, customers, products, time and
region.
 Analysis of effectiveness of sales campaigns.
 Customer Retention.
 Product recommendation and cross-referencing of items.
Telecommunication Industry

Today the telecommunication industry is a rapidly growing industry,
providing various services such as fax, pager, cellular phone, internet messenger,
images, e-mail, web data transmission, etc. Due to the development of new
computer and communication technologies, the telecommunication industry is
rapidly expanding. This is why data mining has become very important
for helping to understand the business.

Data mining in the telecommunication industry helps in identifying
telecommunication patterns, catching fraudulent activities, making better use of
resources, and improving quality of service. Here is the list of examples for which
data mining improves telecommunication services −

 Multidimensional analysis of telecommunication data.
 Fraudulent pattern analysis.
 Identification of unusual patterns.
 Multidimensional association and sequential patterns analysis.
 Mobile Telecommunication services.
 Use of visualization tools in telecommunication data analysis.
Biological Data Analysis

In recent times, we have seen tremendous growth in fields of biology such as
genomics, proteomics, functional genomics and biomedical research. Biological
data mining is a very important part of bioinformatics. Following are the aspects
in which data mining contributes to biological data analysis −

 Semantic integration of heterogeneous, distributed genomic and
proteomic databases.
 Alignment, indexing, similarity search and comparative analysis of
multiple nucleotide sequences.
 Discovery of structural patterns and analysis of genetic networks and
protein pathways.
 Association and path analysis.
 Visualization tools in genetic data analysis.
Other Scientific Applications

The applications discussed above tend to handle relatively small and
homogeneous data sets, for which statistical techniques are appropriate. Huge
amounts of data have been collected from scientific domains such as geosciences
and astronomy. Large data sets are also being generated by fast
numerical simulations in fields such as climate and ecosystem modeling,
chemical engineering, and fluid dynamics. Following are the applications of data
mining in the field of scientific applications −

 Data warehouses and data preprocessing.
 Graph-based mining.
 Visualization and domain specific knowledge.
Intrusion Detection

Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has
become a major issue. Increased usage of the internet and the availability of
tools and tricks for intruding on and attacking networks have prompted intrusion detection
to become a critical component of network administration. Here is the list of areas
in which data mining technology may be applied for intrusion detection −

 Development of data mining algorithms for intrusion detection.
 Association and correlation analysis, aggregation to help select and
build discriminating attributes.
 Analysis of Stream data.
 Distributed data mining.
 Visualization and query tools.
Data Mining System Products

There are many data mining system products and domain specific data mining
applications. The new data mining systems and applications are being added to
the previous systems. Also, efforts are being made to standardize data mining
languages.

Choosing a Data Mining System

The selection of a data mining system depends on the following features −

 Data Types − The data mining system may handle formatted text,
record-based data, and relational data. The data could also be in
ASCII text, relational database data or data warehouse data.
Therefore, we should check what exact format the data mining
system can handle.
 System Issues − We must consider the compatibility of a data
mining system with different operating systems. One data mining
system may run on only one operating system or on several. There
are also data mining systems that provide web-based user interfaces
and allow XML data as input.
 Data Sources − Data sources refer to the data formats on which the data
mining system will operate. Some data mining systems may work
only on ASCII text files, while others work on multiple relational sources.
A data mining system should also support ODBC connections or OLE
DB for ODBC connections.
 Data Mining functions and methodologies − Some data
mining systems provide only one data mining function, such as
classification, while others provide multiple data mining functions
such as concept description, discovery-driven OLAP analysis,
association mining, linkage analysis, statistical analysis,
classification, prediction, clustering, outlier analysis, similarity
search, etc.
 Coupling data mining with databases or data warehouse
systems − Data mining systems need to be coupled with a database
or a data warehouse system. The coupled components are integrated
into a uniform information processing environment. Here are the
types of coupling listed below −
o No coupling
o Loose Coupling
o Semi tight Coupling
o Tight Coupling
 Scalability − There are two scalability issues in data mining −
o Row (Database size) Scalability − A data mining
system is considered row scalable if, when the number
of rows is enlarged 10 times, it takes no more than 10
times as long to execute the same query.
o Column (Dimension) Scalability − A data mining
system is considered column scalable if the mining
query execution time increases linearly with the number
of columns.
 Visualization Tools − Visualization in data mining can be
categorized as follows −
o Data Visualization
o Mining Results Visualization
o Mining process visualization
o Visual data mining
 Data Mining query language and graphical user interface − An
easy-to-use graphical user interface is important to promote user-
guided, interactive data mining. Unlike relational database systems,
which share SQL, data mining systems do not share a common
underlying data mining query language.
Trends in Data Mining

Data mining concepts are still evolving and here are the latest trends that we get
to see in this field −

 Application Exploration.
 Scalable and interactive data mining methods.
 Integration of data mining with database systems, data warehouse
systems and web database systems.
 Standardisation of data mining query language.
 Visual data mining.
 New methods for mining complex types of data.
 Biological data mining.
 Data mining and software engineering.
 Web mining.
 Distributed data mining.
 Real time data mining.
 Multi database data mining.
 Privacy protection and information security in data mining.

 Data mining Applications- case studies:
1. Housing loan prepayment prediction:
Data mining methods like attribute selection and attribute ranking analyze
the customer payment history and select important factors such as payment to
income ratio, credit history, the term of the loan, etc. The results help
banks decide their loan granting policy, and grant loans to customers according
to this factor analysis.

2. Crime detection:
Data mining detects outliers across vast amounts of data. Criminal data
includes all the details of crimes that have happened. Data mining studies the
patterns and trends and helps predict future events with better accuracy. Agencies
can find out which areas are more prone to crime, how many police personnel
should be deployed, which age groups should be targeted, which vehicle numbers
should be scrutinized, etc.

3. Mortgage Loan delinquency prediction:

Banks were among the first users of data mining technology, as it helps them with credit assessment.
Data mining analyses which services offered by banks are used by customers, what
type of customers use ATM cards, and what they generally buy using their
cards (for cross-selling). To reduce customer attrition, banks use data mining to
analyse the transactions customers make before they decide to change banks.
Outliers in transactions are also analysed for fraud detection.

 Basic DM task:

Data mining functionalities specify the kinds of patterns to be found in data
mining tasks. Data mining is widely applied to forecasting and to characterizing
data in large data sets.

Data mining tasks fall into two broad categories: descriptive and
predictive.

1. Descriptive data mining:
Descriptive data mining summarizes the general properties of the data. It
gives insight into what is going on inside the data without any prior
hypothesis, and brings out the common characteristics of the data.
2. Predictive Data Mining:
Predictive data mining lets users infer values that are not directly available.
An example is projecting the market for the coming quarters from
the results of previous quarters. In general, predictive analysis
forecasts or infers values from the data already available. For
instance, it can be used to judge from a patient's medical records
whether the patient suffers from a particular illness.

Key Data Mining Tasks

1) Characterization and Discrimination

 Data Characterization: Data characterization is a summary of the
general characteristics of objects in a target class, and produces what are
called characteristic rules.
The data relevant to a user-specified class are usually retrieved by a
database query and passed through a summarization module to extract the
essence of the data at various abstraction levels.
The results can be presented as, e.g., bar charts, curves, and pie charts.

 Data Discrimination: Data discrimination produces a set of rules called
discriminant rules, which compare the general characteristics of objects in the
target class with those of objects in a contrasting class.

2) Prediction

Prediction estimates data values or class labels that are unavailable. Regression
analysis is used to predict missing or unavailable numeric values in the data. If
it is the class label that is missing, the prediction is made using classification.
Prediction is popular because of its relevance to business intelligence. There are
thus two kinds of prediction: predicting a class label using a previously built
classification model, and predicting missing or unavailable data values using
prediction (regression) analysis.

3) Classification

Classification builds a model of predefined classes from data, and the model
is then used to classify new instances whose class is not known. The
instances used to build the model are known as training data. Such a
classification process yields a decision tree or a set of classification rules
that can be used to classify future data, for example predicting
the likely salary of an employee based on the salaries
of similar employees in the company.
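
As a rough illustration of learning a classification rule from training data, the sketch below trains a one-level decision tree (a "decision stump"). The feature (years of experience), the labels ("low"/"high" salary band), and the function names are assumptions made for this example. A real decision tree learner would consider many attributes and split recursively.

```python
def train_stump(samples):
    """Learn one rule of the form 'value >= threshold -> "high", else "low"'.

    samples is a list of (feature_value, label) pairs. The threshold chosen
    is the candidate that misclassifies the fewest training samples.
    """
    candidates = sorted({value for value, _ in samples})
    best_threshold, best_errors = None, len(samples) + 1
    for threshold in candidates:
        errors = sum(
            1
            for value, label in samples
            if ("high" if value >= threshold else "low") != label
        )
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def classify(threshold, value):
    """Apply the learned rule to a new, unlabeled instance."""
    return "high" if value >= threshold else "low"


# Hypothetical training data: (years of experience, salary band).
training = [(1, "low"), (2, "low"), (3, "low"), (6, "high"), (8, "high"), (10, "high")]
threshold = train_stump(training)
print(threshold, classify(threshold, 7), classify(threshold, 2))
```

The training samples play the role of the labeled instances described above, and `classify` is the resulting model applied to a new employee.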

4) Association Analysis

Association analysis discovers the relationships among data and the rules that
bind them. Two or more data attributes are associated when they frequently occur
together in transactions. The analysis produces what are called association rules,
which are commonly used in market basket analysis. Two measures are used to link
the attributes. One is confidence, which indicates the probability of the associated
items occurring together, and the other is support, which indicates how often the
association has occurred in the past.
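
The two measures described above are conventionally called support and confidence. A minimal sketch of how they can be computed for a rule such as "bread → milk" follows. The baskets are invented for illustration and the function names are assumptions of this example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also
    contain the consequent: support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(baskets, {"bread", "milk"})       # 2 of the 4 baskets
c = confidence(baskets, {"bread"}, {"milk"})  # 2 of the 3 baskets with bread
print(s, c)
```

A rule is usually reported only if both values exceed user-chosen minimum thresholds.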

5) Outlier Analysis

Data elements that cannot be grouped into a given class or cluster are outliers.
They are often referred to as anomalies or surprises, and they are important
to identify.

Although in some contexts outliers can be treated as noise and discarded, in
other areas they can reveal useful information, so analysing them can be very
valuable.
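
One simple way to flag outliers, shown here only as an illustrative sketch, is the z-score rule: a value is flagged if it lies more than a chosen number of standard deviations from the mean. The threshold of 2.0, the function name, and the transaction amounts are assumptions of this example, not a method prescribed by the notes.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts with one suspicious entry.
amounts = [20, 22, 19, 21, 23, 20, 22, 500]
outliers = zscore_outliers(amounts)
print(outliers)
```

Whether a flagged value is noise to discard or a signal worth investigating (for example, possible fraud) depends on the application.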

6) Cluster Analysis

Clustering is the arrangement of data into groups. Unlike classification, however,
class labels are not known in clustering, and it is up to the clustering algorithm to
find suitable classes. Clustering is often called unsupervised classification because
the grouping is not guided by given class labels. Many clustering methods
are based on the principle of maximizing the similarity between objects of the
same class (intra-class similarity) and minimizing the similarity between objects
in different classes (inter-class similarity).
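
As one concrete instance of this principle, the sketch below runs k-means (Lloyd's algorithm) on one-dimensional data. k-means is swapped in here as a well-known example and is not named in the notes. The data points, the starting centroids, and the function name are illustrative assumptions.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Group 1-D points around centroids by repeating two steps:
    assign each point to its nearest centroid, then move each centroid
    to the mean of the points assigned to it."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical data with two obvious groups and no class labels.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 12.0])
print(centroids, clusters)
```

No labels are supplied anywhere. The groups emerge purely from the distances between points, which is what makes the method unsupervised.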

7) Evolution & Deviation Analysis

Evolution and deviation analysis uncovers patterns and changes in behaviour over
time. With such analysis, we can find features such as time-series trends,
periodicity, and similarities between patterns. It is applied in fields ranging
from space science to retail marketing.
