Unit 1 DM
Data mining integrates techniques from many other domains, such as statistics, machine
learning, pattern recognition, database and data warehouse systems, information
retrieval, and visualization, to gather more information about the data, to help predict
hidden patterns, future trends, and behaviors, and to allow businesses to make informed
decisions.
Technically, data mining is the computational process of analyzing data from different
perspectives and dimensions and categorizing/summarizing it into meaningful
information.
Data Mining is the process of extracting information from huge sets of data to identify
patterns, trends, and useful relationships that allow a business to make data-driven
decisions.
Data Mining, also known as Knowledge Discovery in Databases (KDD), refers to the nontrivial
extraction of implicit, previously unknown, and potentially useful information from data
stored in databases.
KDD process
1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from
the collection.
Cleaning handles missing values.
Cleaning smooths noisy data, where noise is a random or variance error.
Cleaning uses data discrepancy detection and data transformation tools, e.g. as sketched below.
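A minimal cleaning sketch in Python using the pandas library (the column names and values are hypothetical):

import pandas as pd

# Hypothetical sales data with a missing value and a noisy outlier
df = pd.DataFrame({
    "store": ["A", "B", "C", "D"],
    "daily_sales": [120.0, None, 135.0, 9999.0],  # None = missing, 9999 = noise
})

# Handle missing values: fill with the column median
df["daily_sales"] = df["daily_sales"].fillna(df["daily_sales"].median())

# Handle noisy data: clip values to a plausible range
df["daily_sales"] = df["daily_sales"].clip(lower=0, upper=500)

print(df)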
2. Data Integration: Data integration is defined as combining heterogeneous data from
multiple sources into a common source (a data warehouse).
Data integration using data migration tools.
Data integration using data synchronization tools.
Data integration using the ETL (Extract-Transform-Load) process, as in the sketch below.
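A minimal integration sketch, assuming two hypothetical in-memory sources joined on a shared key with pandas:

import pandas as pd

# Hypothetical heterogeneous sources, e.g. a CSV export and an operational table
sales = pd.DataFrame({"cust_id": [1, 2], "amount": [250, 90]})
customers = pd.DataFrame({"cust_id": [1, 2], "region": ["North", "South"]})

# Extract + Transform + Load: join on a common key into one "warehouse" table
warehouse = sales.merge(customers, on="cust_id", how="inner")
print(warehouse)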
3. Data Selection: Data selection is defined as the process where data relevant to the analysis
is decided upon and retrieved from the data collection.
Data selection can use neural networks.
Data selection can use decision trees.
Data selection can use Naive Bayes.
Data selection can use clustering, regression, etc.
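In its simplest form, selection just means retrieving the task-relevant rows and attributes; a minimal pandas sketch with hypothetical attributes:

import pandas as pd

df = pd.DataFrame({
    "region": ["North", "South", "North"],
    "year": [2022, 2023, 2023],
    "revenue": [100, 200, 150],
    "notes": ["a", "b", "c"],  # irrelevant to the analysis
})

# Keep only the rows and attributes relevant to the mining task
relevant = df.loc[df["year"] == 2023, ["region", "revenue"]]
print(relevant)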
4. Data Transformation: Data transformation is defined as the process of transforming data
into the form required by the mining procedure.
Data transformation is a two-step process:
Data mapping: assigning elements from the source base to the destination to capture
transformations.
Code generation: creation of the actual transformation program.
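For example, a minimal min-max normalization sketch (one common transformation; the attribute is hypothetical):

import pandas as pd

df = pd.DataFrame({"income": [30000, 60000, 90000]})

# Map the source attribute onto the [0, 1] range expected by the mining procedure
col = df["income"]
df["income_scaled"] = (col - col.min()) / (col.max() - col.min())
print(df)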
5. Data Mining: Data mining is defined as the application of intelligent techniques to extract
potentially useful patterns.
It transforms task-relevant data into patterns.
It decides the purpose of the model, e.g., classification or characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying truly interesting patterns
representing knowledge, based on given interestingness measures.
It finds the interestingness score of each pattern, as in the sketch below.
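A minimal sketch of one common interestingness measure, support, computed over hypothetical transactions:

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
pattern = {"bread", "milk"}

# Support = fraction of transactions that contain the pattern
support = sum(pattern <= t for t in transactions) / len(transactions)
print(f"support = {support:.2f}")  # 0.50

# Keep only patterns whose score clears a minimum threshold
is_interesting = support >= 0.4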
DBMS:
A DBMS is a full-fledged system for housing and managing a set of digital databases.
Data mining, in contrast, is a technique or concept in computer science that deals with
extracting useful and previously unknown information from raw data.
Most of the time, this raw data is stored in very large databases.
Therefore, data miners use the existing functionalities of a DBMS to handle, manage, and
even preprocess raw data before and during the data mining process. However, a DBMS
alone cannot be used to analyze data.
That said, some present-day DBMSs have built-in data analysis tools or capabilities.
2. Regression
Regression is employed to predict numeric or continuous values based on the
relationship between input variables and a target variable.
It aims to find a mathematical function or model that best fits the data to make accurate
predictions.
Ex: you may be interested in determining what a crop yield will be based on temperature,
rainfall, and other independent variables.
Regression also helps determine how strong the relationship is between each input
variable and the target.
It is commonly used in demand forecasting, price optimization, trend analysis, etc.
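A minimal sketch of the crop-yield example using scikit-learn's LinearRegression (all values are invented for illustration):

from sklearn.linear_model import LinearRegression

# Hypothetical records: [temperature (°C), rainfall (mm)] -> yield (tons/ha)
X = [[20, 300], [25, 200], [30, 150], [22, 280], [28, 180]]
y = [3.1, 2.6, 2.0, 3.0, 2.3]

model = LinearRegression().fit(X, y)

# Predict the yield for a new season and inspect the relationships
print(model.predict([[24, 250]]))
print(model.coef_)        # sign/size hints at each variable's relationship to yield
print(model.score(X, y))  # R^2: how well the model fits the data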
3. Clustering
Clustering is a technique used to group similar data instances together based on
their intrinsic characteristics or similarities.
It aims to discover natural patterns or structures in the data without any
predefined classes or labels.
It can be used in a wide range of applications, including market segmentation,
image processing, and anomaly detection.
Various clustering algorithms are available, e.g., K-means and hierarchical
clustering.
A retailer can use clustering to group customers based on their purchasing
behavior.
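A minimal sketch of the retailer example using scikit-learn's KMeans on hypothetical customer data:

from sklearn.cluster import KMeans

# Hypothetical customers: [annual spend, visits per month]
X = [[200, 1], [250, 2], [1800, 9], [2000, 10], [950, 5], [1000, 6]]

# Group customers into 3 segments based on purchasing behavior
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # segment assigned to each customer
print(kmeans.cluster_centers_)  # the "typical" customer of each segment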
4. Association Rule
Association rule mining focuses on discovering interesting relationships or patterns
among a set of items in transactional or market basket data.
It is typically used in market basket analysis to identify patterns of co-occurrence of
products in customer transactions.
Ex: A retailer might use association rule mining to identify that customers who buy bread
also tend to buy milk.
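A minimal sketch computing the confidence of that bread -> milk rule over hypothetical transactions:

transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "butter"},
    {"milk", "butter"},
]

# Confidence of bread -> milk:
# P(milk | bread) = count(bread and milk) / count(bread)
has_bread = [t for t in transactions if "bread" in t]
both = [t for t in has_bread if "milk" in t]
confidence = len(both) / len(has_bread)
print(f"confidence(bread -> milk) = {confidence:.2f}")  # 0.67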
5. Anomaly Detection
Anomaly detection, sometimes called outlier analysis, aims to identify rare or
unusual data instances that deviate significantly from the expected patterns.
It is useful in detecting fraudulent transactions, network intrusions, manufacturing
defects, or any other abnormal behavior.
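A minimal sketch of one simple approach, flagging values far from the mean (the transaction amounts are hypothetical):

from statistics import mean, stdev

# Hypothetical transaction amounts; the last one deviates sharply
amounts = [52.0, 48.5, 51.0, 49.9, 50.5, 47.8, 950.0]

mu, sigma = mean(amounts), stdev(amounts)

# Flag values more than 2 standard deviations from the mean as anomalies
anomalies = [x for x in amounts if abs(x - mu) > 2 * sigma]
print(anomalies)  # [950.0]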
7. Neural Networks
Neural networks are a type of machine learning or AI model inspired by the
human brain's structure and function.
They are composed of interconnected nodes (neurons) and layers that can learn
from data to recognize patterns, perform classification, regression, or other tasks.
The input layer receives the input data, and the output layer produces the network's
output.
Hidden layers are responsible for the complex computations that make neural networks
so powerful.
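A minimal sketch of a forward pass through these layers, using NumPy with random (untrained) weights:

import numpy as np

# A tiny network: 2 inputs -> 3 hidden neurons -> 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)  # input -> hidden weights
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)  # hidden -> output weights

def relu(z):
    return np.maximum(z, 0)

x = np.array([0.5, -1.2])    # input layer: the input data
hidden = relu(x @ W1 + b1)   # hidden layer: nonlinear computation
output = hidden @ W2 + b2    # output layer: the network's output
print(output)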
8. Decision Trees
Decision trees are graphical models that use a tree-like structure to represent
decisions and their possible consequences.
They recursively split the data based on different attribute values to form a
hierarchical decision-making process.
Nodes represent decisions or tests on attributes; edges represent the possible outcomes.
Ex: decision trees are used in risk assessment, customer segmentation, etc.
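A minimal sketch using scikit-learn's DecisionTreeClassifier on hypothetical loan-risk data:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical applicants: [income (k), existing debts] -> risk label
X = [[20, 3], [80, 0], [35, 2], [90, 1], [25, 4], [70, 0]]
y = ["high", "low", "high", "low", "high", "low"]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# The learned tree: internal nodes test attributes, edges are outcomes
print(export_text(tree, feature_names=["income", "debts"]))
print(tree.predict([[50, 1]]))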
9. Ensemble Methods
Ensemble methods combine multiple models to improve prediction accuracy and
generalization. Techniques like Random Forests and Gradient Boosting utilize a
combination of weak learners to create a stronger, more accurate model.
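A minimal sketch of a Random Forest on the same hypothetical risk data, combining many shallow trees by majority vote:

from sklearn.ensemble import RandomForestClassifier

# Hypothetical applicants: [income (k), existing debts] -> risk label
X = [[20, 3], [80, 0], [35, 2], [90, 1], [25, 4], [70, 0]]
y = ["high", "low", "high", "low", "high", "low"]

# An ensemble of 50 weak learners (shallow trees) voting together
forest = RandomForestClassifier(n_estimators=50, max_depth=2,
                                random_state=0).fit(X, y)
print(forest.predict([[50, 1]]))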
Data mining is primarily used by organizations with intense consumer demands, such as retail,
communications, financial, and marketing companies, to determine prices, consumer preferences,
product positioning, and their impact on sales, customer satisfaction, and corporate profits. Data
mining enables a retailer to use point-of-sale records of customer purchases to develop products
and promotions that help the organization attract customers.
2. Data Distribution:
Real-world data is usually stored on various platforms in a distributed computing
environment. It might be in databases, individual systems, or even on the internet.
Practically, it is quite a tough task to bring all the data into a centralized data repository,
mainly due to organizational and technical concerns.
For example, various regional offices may have their own servers to store their data, and it
is not feasible to store all the data from all the offices on a central server.
Therefore, data mining requires the development of tools and algorithms that allow the
mining of distributed data.
3. Complex Data:
Real-world data is heterogeneous, and it could be multimedia data, including audio and video, images,
complex data, spatial data, time series, and so on.
Managing these various types of data and extracting useful information is a tough task.
Most of the time, new technologies, tools, and methodologies have to be developed or refined
to obtain specific information.
4. Performance:
The data mining system's performance relies primarily on the efficiency of algorithms
and techniques used.
If the designed algorithm and techniques are not up to the mark, then the efficiency of
the data mining process will be affected adversely.
5. Data Privacy and Security:
Data mining usually leads to serious issues in terms of data security, governance, and
privacy.
For example, if a retailer analyzes the details of purchased items, it reveals data about
the buying habits and preferences of the customers without their permission.
6. Data Visualization:
In data mining, data visualization is a very important process because it is the primary
method that shows the output to the user in a presentable way.
The extracted data should convey the exact meaning of what it intends to express.
But many times, representing the information to the end-user in a precise and easy way is
difficult.
Because the input data and the output information are complicated, very efficient and
effective data visualization processes need to be implemented to present the results successfully.
Pattern evaluation –
The patterns discovered may be uninteresting because they either represent
common knowledge or lack novelty.
2. Performance Issues: