
Unit 1 DM

Data mining is the process of extracting useful information from large datasets using techniques from statistics and machine learning, with applications in various industries such as finance and healthcare. It involves several steps including data cleaning, integration, selection, transformation, mining, evaluation, and representation. While data mining offers advantages like cost efficiency and improved decision-making, it also faces challenges such as data quality issues and privacy concerns.

Uploaded by

Chandana.C.N.
Copyright
© All Rights Reserved

Unit 1

Introduction to Data Mining


 Data mining is the process of extracting useful information from large sets of data. It
involves using various techniques from statistics, machine learning, and database
systems to identify patterns, relationships, and trends in the data.
 This information can then be used to make data-driven decisions, solve business
problems, and uncover hidden insights.
 Applications of data mining include customer profiling and segmentation, market
basket analysis, anomaly detection, and predictive modeling.
 Data mining tools and technologies are widely used in various industries, including
finance, healthcare, retail, and telecommunications.
 In general terms, “Mining” is the process of extraction of some valuable material from
the earth e.g. coal mining, diamond mining, etc.
 In the context of computer science, “Data Mining” can be referred to as knowledge
mining from data, knowledge extraction, data/pattern analysis, data archaeology,
and data dredging.
 It is basically the process carried out for the extraction of useful information from a
bulk of data or data warehouses.
 For example, banks typically use "data mining" to find out their prospective customers
who could be interested in credit cards, personal loans, or insurance as well.
 Since banks have the transaction details and detailed profiles of their customers, they
analyze all this data and try to find out patterns that help them predict that certain
customers could be interested in personal loans, etc.

Main Purpose of Data Mining

 Data mining integrates techniques from many other domains, such as statistics,
machine learning, pattern recognition, database and data warehouse systems,
information retrieval, and visualization, to gather more information about the data.
It helps predict hidden patterns, future trends, and behaviors, and so allows
businesses to make informed decisions.
 Technically, data mining is the computational process of analyzing data from different
perspectives, dimensions, angles and categorizing/summarizing it into meaningful
information.

Divya K,GFGC Shivamoga Page 1



Advantages of Data Mining:

o The Data Mining technique enables organizations to obtain knowledge-based data.


o Data mining enables organizations to make lucrative modifications in operation and
production.
o Compared with other statistical data applications, data mining is cost-efficient.
o Data mining helps the decision-making process of an organization.
o It facilitates the automated discovery of hidden patterns as well as the prediction of
trends and behaviors.
o It can be introduced into new systems as well as existing platforms.
o It is a quick process that makes it easy for new users to analyze enormous amounts of
data in a short time.

Disadvantages of Data Mining:


o There is a risk that organizations may sell useful customer data to other
organizations for money. For example, American Express has reportedly sold its
customers' credit card purchase data to other organizations.
o Much data mining analytics software is difficult to operate and requires advanced
training to use.
o Different data mining tools operate in distinct ways because of the different algorithms
used in their design. Selecting the right data mining tool is therefore a very
challenging task.
o Data mining techniques are not perfectly precise, which may lead to severe
consequences in certain conditions.

What is Data Mining?

 Data mining is the process of extracting information from huge sets of data to
identify patterns, trends, and useful data that allow a business to make data-driven
decisions.

 Data mining, often treated as a synonym for Knowledge Discovery in Databases (KDD),
refers to the nontrivial extraction of implicit, previously unknown, and potentially
useful information from data stored in databases.


Steps Involved in KDD Process:

KDD process

1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from
the collection.
 Cleaning in case of missing values.
 Cleaning noisy data, where noise is a random error or variance in a measured variable.
 Cleaning with data discrepancy detection and data transformation tools.
2. Data Integration: Data integration is defined as combining heterogeneous data from
multiple sources into a common source (data warehouse).
 Data integration using data migration tools.
 Data integration using data synchronization tools.
 Data integration using the ETL (Extract-Transform-Load) process.
3. Data Selection: Data selection is defined as the process where data relevant to the analysis
is decided and retrieved from the data collection.
 Data selection using Neural network.
 Data selection using Decision Trees.
 Data selection using Naive bayes.
 Data selection using Clustering, Regression, etc.
4. Data Transformation: Data transformation is defined as the process of transforming data
into the appropriate form required by the mining procedure.
Data transformation is a two-step process:
 Data mapping: Assigning elements from the source base to the destination to capture
transformations.
 Code generation: Creation of the actual transformation program.
5. Data Mining: Data mining is defined as the application of intelligent techniques to extract
potentially useful patterns.
 Transforms task-relevant data into patterns.
 Decides the purpose of the model using classification or characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying the truly interesting patterns
representing knowledge, based on given interestingness measures.
 Finds the interestingness score of each pattern.
 Uses summarization and visualization to make the data understandable to the user.
7. Knowledge Representation: Knowledge representation is defined as a technique which
utilizes visualization tools to represent data mining results.
 Generate reports.
 Generate tables.
 Generate discriminant rules, classification rules, characterization rules, etc.
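
The seven KDD steps above can be sketched as a small pipeline. This is only an illustrative sketch: the record layout, the function names (clean, integrate, select, transform, mine, evaluate) and the 0.5 support threshold are assumptions made for this example, not part of any standard library or of the notes.

```python
from collections import Counter

# Hypothetical raw records from two sources (illustrative data only).
store_a = [{"customer": "c1", "items": "bread,milk"},
           {"customer": "c2", "items": None}]            # record with a missing value
store_b = [{"customer": "c3", "items": "bread,milk,eggs"},
           {"customer": "c4", "items": "milk"}]

def clean(records):
    """Step 1, data cleaning: drop records whose item list is missing."""
    return [r for r in records if r["items"]]

def integrate(*sources):
    """Step 2, data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records):
    """Step 3, data selection: keep only the field relevant to the analysis."""
    return [r["items"] for r in records]

def transform(item_strings):
    """Step 4, data transformation: turn comma-separated strings into item sets."""
    return [frozenset(s.split(",")) for s in item_strings]

def mine(baskets):
    """Step 5, data mining: count how often each item occurs across baskets."""
    return Counter(item for basket in baskets for item in basket)

def evaluate(counts, n_baskets, min_support=0.5):
    """Step 6, pattern evaluation: keep items whose support meets the threshold."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}

baskets = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine(baskets), len(baskets))

# Step 7, knowledge representation: a simple text report.
for item, support in sorted(patterns.items()):
    print(f"{item}: support={support:.2f}")
```

Each function stands in for a whole family of real tools; the point is only the order in which the steps feed into each other.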

Difference between KDD and Data Mining


Parameter | KDD | Data Mining
--- | --- | ---
Definition | KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data. | Data mining refers to a process of extracting useful and valuable information or patterns from large data sets.
Objective | To find useful knowledge from data. | To extract useful information from data.
Techniques used | Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization. | Association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.
Output | Structured information, such as rules and models, that can be used to make decisions or predictions. | Patterns, associations, or insights that can be used to improve decision-making or understanding.
Focus | Focus is on the discovery of useful knowledge, rather than simply finding patterns in data. | Focus is on the discovery of patterns or relationships in data.
Role of domain expertise | Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results. | Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge.


DBMS:

 A DBMS (Database Management System), sometimes just called a database manager, is a
collection of computer programs dedicated to the management (i.e. organization, storage
and retrieval) of all databases installed on a system (i.e. hard drive or network).
 There are different types of Database Management Systems, and some of them are
designed for the proper management of databases configured for specific purposes.
 The most popular commercial Database Management Systems are Oracle, DB2 and
Microsoft Access.
 All these products provide means of allocating different levels of privileges to
different users, making it possible for a DBMS to be controlled centrally by a single
administrator or to be allocated to several different people.
 There are four important elements in any Database Management System: the modeling
language, data structures, query language and mechanism for transactions.
 The modeling language defines the schema of each database hosted in the DBMS.
 Currently several popular approaches, such as hierarchical, network, relational and
object, are in practice.
 Data structures help organize the data, such as individual records, files, fields and their
definitions, and objects such as visual media.
 The data query language maintains the security of the database by monitoring login data,
access rights of different users, and protocols to add data to the system.
 SQL is a popular query language that is used in Relational Database Management
Systems.
 Finally, the mechanism that allows for transactions helps with concurrency and multiplicity.
 That mechanism makes sure that the same record will not be modified by multiple
users at the same time, thus keeping the data integrity intact.
 Additionally, DBMSs provide backup and other facilities as well.
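
As a rough illustration of the query language and the transaction mechanism described above, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The accounts table, the owner names and the transfer helper are hypothetical and exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")
conn.executemany("INSERT INTO accounts (owner, balance) VALUES (?, ?)",
                 [("asha", 500.0), ("ravi", 100.0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between two accounts as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE owner = ?",
                         (amount, src))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE owner = ?",
                                      (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE owner = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

ok = transfer(conn, "asha", "ravi", 200.0)       # succeeds and is committed
failed = transfer(conn, "ravi", "asha", 1000.0)  # fails and is rolled back
balances = dict(conn.execute("SELECT owner, balance FROM accounts ORDER BY owner"))
print(ok, failed, balances)
```

The second transfer leaves both balances untouched, which is the integrity guarantee the transaction mechanism is meant to provide.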

What is the difference between DBMS and Data mining?

 A DBMS is a full-fledged system for housing and managing a set of digital databases.
 Data mining, however, is a technique or a concept in computer science that deals with
extracting useful and previously unknown information from raw data.
 Most of the time, this raw data is stored in very large databases.
 Data miners therefore use the existing functionalities of a DBMS to handle, manage and
even preprocess raw data before and during the data mining process. However, a DBMS
alone cannot be used to analyze data.
 That said, some DBMSs at present have built-in data analysis tools or capabilities.


Data Mining Techniques:


1. Classification
 Classification is a technique used to categorize data into predefined classes or categories
based on the features or attributes of the data instances.
 It involves training a model on labeled data and using it to predict the class labels of new,
unseen data instances.
 It is commonly used in fraud detection, customer segmentation, spam filtering etc.
 Ex: A bank can use classification to identify fraudulent transactions based on predefined
attributes such as transaction amount, location and time.
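
The train-then-predict idea above can be sketched with a k-nearest-neighbours classifier, one of many possible classification methods. The transactions, the two features (amount and hour of day) and the function name knn_predict are made up for illustration; a real system would also scale the features so that the amount does not dominate the distance.

```python
import math

# Hypothetical labeled transactions: (amount, hour_of_day) -> class label.
training = [((20.0, 14), "legit"), ((35.0, 10), "legit"), ((15.0, 19), "legit"),
            ((900.0, 3), "fraud"), ((1200.0, 2), "fraud"), ((950.0, 4), "fraud")]

def knn_predict(training, point, k=3):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(training, key=lambda row: math.dist(row[0], point))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

print(knn_predict(training, (1000.0, 3)))  # a large night-time transaction
print(knn_predict(training, (25.0, 12)))   # a small daytime transaction
```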

2. Regression
 Regression is employed to predict numeric or continuous values based on the
relationship between input variables and a target variable.
 It aims to find a mathematical function or model that best fits the data to make accurate
predictions.
 Ex: You may be interested in determining what a crop yield will be based on temperature,
rainfall and other independent variables.
 Regression can also be used to determine how strong the relationship is between each
input variable and the target variable.
 It is commonly used in demand forecasting, price optimization, trend analysis etc.
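
A minimal sketch of the crop-yield example, assuming a single input variable (rainfall) and ordinary least squares. The numbers are invented and deliberately lie on a straight line so the result is easy to check by hand; fit_line and correlation are names chosen for this example.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def correlation(xs, ys):
    """Pearson correlation: how strong the linear relationship is (-1 to 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rainfall = [100, 150, 200, 250, 300]      # hypothetical rainfall in mm
crop_yield = [2.0, 2.5, 3.0, 3.5, 4.0]    # hypothetical yield in tonnes per hectare

slope, intercept = fit_line(rainfall, crop_yield)
predicted = slope * 220 + intercept       # predicted yield for 220 mm of rainfall
strength = correlation(rainfall, crop_yield)
print(slope, intercept, predicted, strength)
```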


3. Clustering
 Clustering is a technique used to group similar data instances together based on
their intrinsic characteristics or similarities.
 It aims to discover natural patterns or structures in the data without any
predefined classes or labels.
 It can be used in a wide range of applications including marketing segmentation,
image processing and anomaly detection.
 There are various clustering algorithms available, for example K-means and
hierarchical clustering.
 A retailer can use clustering to group customers based on their purchasing
behavior.
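
The retailer example can be sketched with a bare-bones K-means loop. The customer data (visits per month, average spend) is invented, and the starting centroids are fixed by hand to keep the run deterministic; real implementations usually choose them randomly and repeat the run several times.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain K-means: assign each point to its nearest centroid, then recompute."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:  # mean of each coordinate over the cluster
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:        # keep the old centroid if no point was assigned to it
                new_centroids.append(centroids[i])
        centroids = new_centroids
    return centroids, clusters

# Hypothetical customers: (visits per month, average spend).
customers = [(1, 10), (2, 12), (1, 11), (10, 100), (11, 95), (12, 105)]
centroids, clusters = kmeans(customers, centroids=[customers[0], customers[3]])
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the two groups (occasional low spenders and frequent high spenders) emerge from the distances alone.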

4. Association Rule
 Association rule mining focuses on discovering interesting relationships or patterns
among a set of items in transactional or market basket data.
 It is typically used in market analysis to identify patterns of co-occurrence of products in
customer transactions.
 Ex: A retailer might use association rule mining to identify that customers who buy bread
also tend to buy milk.
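
The bread-and-milk rule can be quantified with support and confidence, the two standard measures for association rules. The five transactions below are invented for illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_support = support({"bread", "milk"}, transactions)          # 3 of 5 baskets
rule_confidence = confidence({"bread"}, {"milk"}, transactions)  # 3 of the 4 bread baskets
print(rule_support, rule_confidence)
```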


5. Anomaly Detection
 Anomaly detection, sometimes called outlier analysis, aims to identify rare or
unusual data instances that deviate significantly from the expected patterns.
 It is useful in detecting fraudulent transactions, network intrusions, manufacturing
defects, or any other abnormal behavior.
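
One simple way to flag such deviations is the z-score: how many standard deviations a value lies from the mean. The sketch below assumes a single numeric attribute and a threshold of 2.0, both chosen for this example; the transaction amounts are invented.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

amounts = [52, 48, 50, 51, 49, 47, 53, 50, 500]  # hypothetical transaction amounts
outliers = zscore_outliers(amounts)
print(outliers)
```

A large outlier inflates the mean and standard deviation it is judged against, so more robust measures (for example, ones based on the median) are often preferred in practice.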

6. Time Series Analysis


 Time series analysis focuses on analyzing and predicting data points collected
over time.
 It involves techniques such as forecasting, trend analysis, seasonality detection,
and anomaly detection in time-dependent datasets.
 It involves analyzing data points that are measured at regular intervals of time, to
identify patterns, trends etc.
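
A moving average is one of the simplest trend-analysis tools for regularly spaced data. In this sketch the monthly sales figures and the 3-month window are assumptions made for illustration.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values, to smooth out noise."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

monthly_sales = [10, 12, 14, 13, 15, 18, 20]  # hypothetical values at regular intervals
trend = moving_average(monthly_sales, 3)
naive_forecast = trend[-1]                    # crude forecast: carry the last average forward
print(trend, naive_forecast)
```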

7. Neural Networks
 Neural networks are a type of machine learning or AI model inspired by the
human brain's structure and function.
 They are composed of interconnected nodes (neurons) and layers that can learn
from data to recognize patterns, perform classification, regression, or other tasks.
 The input layer receives the input data, and the output layer produces the output of
the network.
 The hidden layers are responsible for the complex computations that make neural
networks so powerful.
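
The role of the three layers can be seen in a tiny network that computes XOR, a function no single neuron can represent. The weights here are set by hand purely for illustration; a real network would learn them from data, which this sketch does not show.

```python
def step(x):
    """Threshold activation: the neuron fires (1) only if its input is positive."""
    return 1 if x > 0 else 0

def forward(x1, x2):
    """Forward pass through a 2-input, 2-hidden-neuron, 1-output network."""
    # Hidden layer (hand-set weights): h1 acts as OR, h2 acts as AND.
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)
    h2 = step(1.0 * x1 + 1.0 * x2 - 1.5)
    # Output layer: fires when h1 is on and h2 is off, which is XOR.
    return step(1.0 * h1 - 1.0 * h2 - 0.5)

outputs = {(a, b): forward(a, b) for a in (0, 1) for b in (0, 1)}
print(outputs)
```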

8. Decision Trees
 Decision trees are graphical models that use a tree-like structure to represent
decisions and their possible consequences.
 They recursively split the data based on different attribute values to form a
hierarchical decision-making process.
 Nodes represent decisions or events, and edges represent the possible outcomes.
 Ex: Decision trees are used in risk assessment, customer segmentation etc.
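
The recursive splitting described above rests on one repeated operation: choosing the attribute value that best separates the classes. The sketch below shows that single step using Gini impurity, one common split criterion; the income and risk data are invented, and a full tree would apply the same step again to each resulting subset.

```python
def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Try each value as a 'value <= t' split and return (t, weighted impurity) of the best."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        if not left or not right:
            continue  # a split must send records to both branches
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

incomes = [20, 25, 30, 60, 70, 80]                   # hypothetical applicant incomes
risk = ["high", "high", "high", "low", "low", "low"]
split = best_threshold(incomes, risk)
print(split)
```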

9. Ensemble Methods
 Ensemble methods combine multiple models to improve prediction accuracy and
generalization. Techniques like Random Forests and Gradient Boosting utilize a
combination of weak learners to create a stronger, more accurate model.
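
The core idea of combining weak learners can be sketched with plain majority voting. This is not Random Forests or Gradient Boosting themselves, only the simplest form of combination; the three rules and the transaction fields (amount, hour, foreign) are hypothetical.

```python
from collections import Counter

def majority_vote(models, x):
    """Return the label predicted by most of the models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Three hypothetical weak rules, each looking at a single attribute.
rules = [
    lambda t: "fraud" if t["amount"] > 500 else "legit",
    lambda t: "fraud" if t["hour"] < 5 else "legit",
    lambda t: "fraud" if t["foreign"] else "legit",
]

first = majority_vote(rules, {"amount": 800, "hour": 14, "foreign": True})
second = majority_vote(rules, {"amount": 30, "hour": 3, "foreign": False})
print(first, second)
```

Each rule alone is easy to fool, but a transaction must trip at least two of them before the ensemble flags it.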


10. Text Mining


Text mining techniques are applied to extract valuable insights and knowledge from
unstructured text data. Text mining includes tasks such as text categorization, sentiment
analysis, topic modeling, and information extraction, enabling your organization to derive
meaningful insights from large volumes of textual data, such as customer reviews, social
media posts, emails, and articles.
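
Two of the tasks named above, counting terms and scoring sentiment, can be sketched with a tiny hand-made word list. The POSITIVE and NEGATIVE sets and the reviews are assumptions for illustration; real sentiment analysis uses far larger lexicons or trained models.

```python
import re
from collections import Counter

# Tiny hypothetical sentiment lexicon.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "poor", "slow"}

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment(text):
    """Count positive words minus negative words and map the score to a label."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

reviews = ["Great phone, I love the camera",
           "Battery is bad and charging is slow"]
term_counts = Counter(t for review in reviews for t in tokenize(review))
labels = [sentiment(review) for review in reviews]
print(term_counts.most_common(3), labels)
```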

12 common problems in Data Mining:


1. Poor data quality, such as noisy data, dirty data, missing values, and inexact or incorrect values.
2. Integrating conflicting or redundant data from different sources and forms:
multimedia files (audio, video and images), geo data, text, social, numeric, etc.
3. Proliferation of security and privacy concerns by individuals, organizations and governments.
4. Unavailability of data or difficult access to data.
5. Efficiency and scalability of data mining algorithms to effectively extract the information
from huge amounts of data in databases.
6. Dealing with huge datasets that require distributed approaches.
7. Dealing with non-static, unbalanced and cost-sensitive data.
8. Mining information from heterogeneous databases and global information systems.
9. Constant updating of models to handle data velocity or new incoming data.
10. High cost of buying and maintaining powerful software, servers and storage hardware
that handle large amounts of data.
11. Processing of large, complex and unstructured data into a structured format.
12. Sheer quantity of output from many data mining methods.

Data Mining Applications:

Data mining is primarily used by organizations with intense consumer demands, such as retail,
communication, financial, and marketing companies, to determine prices, consumer preferences,
product positioning, and the impact on sales, customer satisfaction, and corporate profits. Data
mining enables a retailer to use point-of-sale records of customer purchases to develop products
and promotions that help the organization attract customers.


1. Data Mining in Healthcare:


 Data mining in healthcare has excellent potential to improve the health system.
 It uses data and analytics for better insights and to identify best practices that will
enhance health care services and reduce costs.
 Analysts use data mining approaches such as Machine learning, Multi-dimensional
database, Data visualization, Soft computing, and statistics.
 Data mining can be used to forecast the number of patients in each category.
 The procedures ensure that the patients get intensive care at the right place and at the
right time. Data mining also enables healthcare insurers to recognize fraud and abuse.

2. Data Mining in Market Basket Analysis:


 Market basket analysis is a modeling method based on the hypothesis that if you buy
a specific group of products, then you are more likely to buy another group of
products.
 This technique may enable the retailer to understand the purchase behavior of a buyer.
This data may assist the retailer in understanding the requirements of the buyer and
altering the store's layout accordingly.
 Analytical comparisons of results can also be made between various stores and between
customers in different demographic groups.

3. Data Mining in Education:


 Educational data mining (EDM) is a newly emerging field, concerned with developing
techniques that explore knowledge from the data generated in educational environments.
 EDM objectives are recognized as predicting students' future learning behavior, studying
the impact of educational support, and promoting learning science.
 An institution can use data mining to make precise decisions and also to predict
students' results. With the results, the institution can concentrate on what to teach
and how to teach.


4. Data Mining in Manufacturing Engineering:


 Knowledge is the best asset possessed by a manufacturing company.
 Data mining tools can be beneficial to find patterns in a complex manufacturing process.
 Data mining can be used in system-level designing to obtain the relationships between
product architecture, product portfolio, and data needs of the customers.
 It can also be used to forecast the product development period, cost, and expectations,
among other tasks.

5. Data Mining in CRM (Customer Relationship Management):


 Customer Relationship Management (CRM) is all about acquiring and retaining
customers, enhancing customer loyalty, and implementing customer-oriented
strategies.
 To maintain a good relationship with the customer, a business organization needs to
collect data and analyze it. With data mining technologies, the collected data can be used
for analytics.

6. Data Mining in Fraud detection:


 Billions of dollars are lost to fraud.
 Traditional methods of fraud detection are time-consuming and complex.
 Data mining provides meaningful patterns and turns data into information.
 An ideal fraud detection system should protect the data of all the users.
 Supervised methods use a collection of sample records that are
classified as fraudulent or non-fraudulent.
 A model is constructed using this data, and the model is then used to identify whether a
new record is fraudulent or not.

7. Data Mining in Lie Detection:


 Apprehending a criminal is comparatively easy, but bringing out the truth from them is a
very challenging task.
 Law enforcement may use data mining techniques to investigate offenses, monitor
suspected terrorist communications, etc.
 This technique includes text mining also, and it seeks meaningful patterns in data, which
is usually unstructured text.
 The information collected from the previous investigations is compared, and a model for
lie detection is constructed.

8. Data Mining in Financial Banking:


 The digitalization of the banking system is expected to generate an enormous amount of
data with every new transaction.
 Data mining techniques can help bankers solve business-related problems in
banking and finance by identifying trends, causalities, and correlations in business
information and market costs that are not instantly evident to managers or executives,
because the data volume is too large or is produced too rapidly for experts to screen.
 Managers may use this data to better target, acquire, retain, segment,
and maintain profitable customers.


Issues and Challenges of Implementation in Data mining:


Although data mining is very powerful, it faces many challenges during its execution.
Various challenges could be related to performance, data, methods, and techniques, etc.
The process of data mining becomes effective when the challenges or problems are
correctly recognized and adequately resolved.

1. Incomplete and noisy data:


 The process of extracting useful data from large volumes of data is data mining.
 The data in the real-world is heterogeneous, incomplete, and noisy.
 Data in huge quantities will usually be inaccurate or unreliable.
 These problems may occur due to faulty data measuring instruments or because of human error.
 Suppose a retail chain collects phone numbers of customers who spend more than $500,
and the accounting employees put the information into their system.
 An employee may mistype a digit when entering the phone number, which results in
incorrect data.
 Some customers may not be willing to disclose their phone numbers, which results
in incomplete data. The data could also get changed due to human or system error.
 All these consequences (noisy and incomplete data) make data mining challenging.

2. Data Distribution:
 Real-world data is usually stored on various platforms in a distributed computing
environment. It might be in a database, individual systems, or even on the internet.
 In practice, it is quite difficult to bring all the data into a centralized data repository,
mainly due to organizational and technical concerns.
 For example, various regional offices may have their own servers to store their data. It is
not feasible to store all the data from all the offices on a central server.
 Therefore, data mining requires the development of tools and algorithms that allow the
mining of distributed data.

3. Complex Data:
 Real-world data is heterogeneous, and it could be multimedia data, including audio and video, images,
complex data, spatial data, time series, and so on.
 Managing these various types of data and extracting useful information is a tough task.
 Most of the time, new technologies, new tools, and methodologies would have to be refined to obtain
specific information.


4. Performance:
 The data mining system's performance relies primarily on the efficiency of algorithms
and techniques used.
 If the designed algorithm and techniques are not up to the mark, then the efficiency of
the data mining process will be affected adversely.

5. Data Privacy and Security:

 Data mining usually leads to serious issues in terms of data security, governance, and
privacy.
 For example, if a retailer analyzes the details of the purchased items, then it reveals data
about buying habits and preferences of the customers without their permission.

6. Data Visualization:

 In data mining, data visualization is a very important process because it is the primary
method that shows the output to the user in a presentable way.
 The extracted data should convey the exact meaning of what it intends to express.
 But many times, representing the information to the end-user in a precise and easy way is
difficult.
 Because the input data and the output information are complicated, efficient and
effective data visualization techniques need to be implemented.

Data Mining – Issues:


The data used in data mining is often integrated from various heterogeneous data sources,
which creates some issues. The major issues concern:
 Mining Methodology and User Interaction
 Performance Issues
 Diverse Data Types Issues

The following diagram describes the major issues.


1. Mining Methodology and User Interaction Issues:


It refers to the following kinds of issues −

 Mining different kinds of knowledge in databases –


 Different users may be interested in different kinds of knowledge.
 Therefore it is necessary for data mining to cover a broad range of knowledge
discovery tasks.

 Interactive mining of knowledge at multiple levels of abstraction –


 The data mining process needs to be interactive because it allows users to focus
the search for patterns, providing and refining data mining requests based on the
returned results.

 Incorporation of background knowledge –


 To guide discovery process and to express the discovered patterns, the
background knowledge can be used.
 Background knowledge may be used to express the discovered patterns not only
in concise terms but at multiple levels of abstraction.

 Data mining query languages and ad hoc data mining –


 Data Mining Query language that allows the user to describe ad hoc mining
tasks, should be integrated with a data warehouse query language and optimized
for efficient and flexible data mining.

 Presentation and visualization of data mining results –


 Once the patterns are discovered, they need to be expressed in high-level
languages and visual representations.
 These representations should be easily understandable.

 Handling noisy or incomplete data –


 The data cleaning methods are required to handle the noise and incomplete
objects while mining the data regularities.
 If the data cleaning methods are not there then the accuracy of the discovered
patterns will be poor.

 Pattern evaluation –
 The patterns discovered should be interesting. Many discovered patterns are not,
because they either represent common knowledge or lack novelty.


2. Performance Issues:

There can be performance-related issues such as follows −

 Efficiency and scalability of data mining algorithms –


 In order to effectively extract information from the huge amounts of data in
databases, data mining algorithms must be efficient and scalable.

 Parallel, distributed, and incremental mining algorithms –


 The factors such as huge size of databases, wide distribution of data, and
complexity of data mining methods motivate the development of parallel and
distributed data mining algorithms.
 These algorithms divide the data into partitions, which are then processed in
parallel.
 The results from the partitions are then merged. Incremental algorithms
incorporate database updates without mining the data again from scratch.
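
The partition, merge and incremental-update pattern can be sketched with item counting, which is additive and therefore easy to split up. The transactions and the function names are invented for this example, and the thread pool merely stands in for separate machines in a real distributed system.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def mine_partition(transactions):
    """Mine one partition on its own: count how often each item occurs."""
    return Counter(item for t in transactions for item in t)

def merge(partial_counts):
    """Merge the per-partition results into one global result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Hypothetical data already divided into two partitions.
partitions = [
    [{"bread", "milk"}, {"milk"}],
    [{"bread", "eggs"}, {"milk", "eggs"}],
]

# Parallel step: each partition is mined by its own worker.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(mine_partition, partitions))
counts = merge(partials)
initial = dict(counts)

# Incremental step: fold in a new batch without re-mining the old partitions.
new_batch = [{"bread", "milk"}]
counts.update(mine_partition(new_batch))
print(initial, dict(counts))
```

Not every mining result merges this cleanly; counts do because they are simple sums, which is one reason frequency-based methods adapt well to parallel and incremental settings.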

3. Diverse Data Types Issues:

 Handling of relational and complex types of data –


 The database may contain complex data objects, multimedia data objects, spatial
data, temporal data etc.
 It is not possible for one system to mine all these kinds of data.

 Mining information from heterogeneous databases and global information systems –


 The data is available at different data sources on LAN or WAN.
 These data sources may be structured, semi-structured or unstructured.
 Therefore mining the knowledge from them adds challenges to data mining.
