Module-1 DM
Module-1 : Topics: Introduction to Data mining: Definition, KDD, Challenges, Data Mining
Tasks - Data Mining Goals– Stages of the Data Mining Process–Data Mining Techniques–
Applications – Major Issues in Data mining.
In general terms, “Mining” is the process of extraction of some valuable material from the
earth e.g. coal mining, diamond mining, etc. In the context of computer science, “Data
Mining” can be referred to as knowledge mining from data, knowledge extraction,
data/pattern analysis, data archaeology, and data dredging. It is basically the process
carried out for the extraction of useful information from a bulk of data or data
warehouses. One can see that the term itself is a little confusing. In the case of coal or
diamond mining, the result of the extraction process is coal or diamond. But in the case of Data
Mining, the result of the extraction process is not data. Instead, data mining results are the
patterns and knowledge that we gain at the end of the extraction process. In that sense, we can
think of Data Mining as a step in the process of Knowledge Discovery or Knowledge
Extraction.
The process of extracting information from huge sets of data to identify patterns, trends, and
useful data that allow a business to make data-driven decisions is called
Data Mining.
Data mining is the process of extracting useful information from large sets of data. It
involves using various techniques from statistics, machine learning, and database
systems to identify patterns, relationships, and trends in the data.
Data mining is often treated as a synonym for Knowledge Discovery in Databases (KDD), although
strictly it is one step of that process. The knowledge discovery process includes Data cleaning,
Data integration, Data selection, Data transformation, Data mining, Pattern evaluation, and
Knowledge presentation.
DATA MINING
Data Mining is a process used by organizations to extract specific data from huge databases
to solve business problems. It primarily turns raw data into useful information.
1. Relational Database:
A relational database is a collection of multiple data sets formally organized by tables, records,
and columns from which data can be accessed in various ways without having to reorganize the
database tables. Tables convey and share information, which facilitates data searchability,
reporting, and organization.
2. Data warehouses:
A Data Warehouse is the technology that collects the data from various sources within the
organization to provide meaningful business insights. The huge amount of data comes from
multiple places such as Marketing and Finance. The extracted data is utilized for analytical
purposes and helps in decision-making for a business organization.
3. Data Repositories:
The Data Repository generally refers to a destination for data storage. However, many IT
professionals use the term more specifically to refer to a particular kind of setup within an IT
structure, for example, a group of databases where an organization has kept various kinds of
information.
4. Object-Relational Database:
5. Transactional Database:
A transactional database refers to a database management system (DBMS) that can
undo (roll back) a database transaction if it is not completed properly.
2. Data Integration
Data integration is defined as combining heterogeneous data from multiple sources into a common
source (a data warehouse). Data integration is carried out using data migration tools, data
synchronization tools, and the ETL (Extract, Transform, Load) process.
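As a rough illustration of combining heterogeneous sources into one common store, the sketch below merges records from two made-up departmental sources. The source names, field names, and the integrate() helper are assumptions for illustration only, not part of any particular ETL tool.

```python
# Hypothetical records from two departments, keyed by customer id.
marketing = [
    {"customer_id": 1, "campaign": "spring"},
    {"customer_id": 2, "campaign": "summer"},
]
finance = [
    {"customer_id": 1, "revenue": 120.0},
    {"customer_id": 3, "revenue": 75.5},
]

def integrate(*sources, key="customer_id"):
    """Merge records that share the same key into one combined record."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record[key], {}).update(record)
    # Return records sorted by key so the output order is stable.
    return [merged[k] for k in sorted(merged)]

warehouse = integrate(marketing, finance)
for row in warehouse:
    print(row)
```

Customers that appear in only one source are kept, with only the fields that source provides. A real integration step would also have to resolve conflicting values and differing key formats.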
3. Data Selection
Data selection is defined as the process where data relevant to the analysis is identified and
retrieved from the data collection. For this we can use neural networks, decision trees, Naive
Bayes, clustering, and regression methods.
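In its simplest form, selection means keeping only the rows and attributes that matter for the task. The sketch below assumes a small list of made-up records and a hypothetical select() helper. It does not use any of the model-based methods named above.

```python
# Hypothetical integrated records; only some fields matter for the analysis.
records = [
    {"id": 1, "age": 34, "city": "Pune", "spend": 220.0},
    {"id": 2, "age": 19, "city": "Delhi", "spend": 40.0},
    {"id": 3, "age": 52, "city": "Pune", "spend": 310.0},
]

def select(records, columns, condition):
    """Keep rows that satisfy the condition, projected onto the chosen columns."""
    return [{c: r[c] for c in columns} for r in records if condition(r)]

# Task-relevant data: age and spend of customers from one city.
task_data = select(records, columns=["age", "spend"],
                   condition=lambda r: r["city"] == "Pune")
print(task_data)
```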
4. Data Transformation
Data Transformation is defined as the process of transforming data into the form
required by the mining procedure. Data Transformation is a two-step process:
1. Data Mapping: Assigning elements from the source to the destination to capture
transformations.
2. Code generation: Creation of the actual transformation program.
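The two steps can be sketched as follows: a mapping table says which source field feeds which destination field and how it is converted, and a small factory function builds the transformation program from that table. All field names and the make_transformer() helper are hypothetical.

```python
# Step 1, data mapping: source field -> (destination field, conversion function).
# All field names here are made up for illustration.
mapping = {
    "cust_name": ("name", str.strip),
    "amt": ("amount", float),
    "dt": ("date", lambda s: s.replace("/", "-")),
}

def make_transformer(mapping):
    """Step 2, code generation: build the transformation program from the mapping."""
    def transform(record):
        return {dest: convert(record[src])
                for src, (dest, convert) in mapping.items()}
    return transform

transform = make_transformer(mapping)
row = transform({"cust_name": " Asha ", "amt": "19.50", "dt": "2024/01/31"})
print(row)
```

Keeping the mapping as data rather than hard-coding it means the same generator can be reused when the source layout changes.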
5. Data Mining
Data mining is defined as the application of techniques to extract potentially useful patterns. It
transforms task-relevant data into patterns, and decides the purpose of the model
using classification or characterization.
6. Pattern Evaluation
Pattern Evaluation is defined as identifying the truly interesting patterns representing knowledge
based on given measures. It finds the interestingness score of each pattern, and
uses summarization and visualization to make the data understandable by the user.
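The text does not name a specific interestingness measure. As one possible choice, the sketch below scores candidate patterns with lift (how much more often two itemsets occur together than expected if they were independent) and keeps only those with lift above 1. The transactions and candidate patterns are made up.

```python
# Made-up transactions used to score candidate patterns.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(antecedent, consequent):
    """Lift > 1 means the two itemsets co-occur more often than by chance."""
    return support(antecedent | consequent) / (support(antecedent) * support(consequent))

candidates = [({"bread"}, {"milk"}), ({"bread"}, {"butter"})]
# Keep only patterns whose interestingness score exceeds the threshold.
interesting = [(a, c) for a, c in candidates if lift(a, c) > 1.0]
print(interesting)
```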
7. Knowledge Representation
This involves presenting the results in a way that is meaningful and can be used to make
decisions.
KDD is an iterative process where evaluation measures can be enhanced, mining can be
refined, new data can be integrated and transformed in order to get different and more
appropriate results. Preprocessing of databases consists of Data cleaning and Data
Integration.
Data mining in healthcare has excellent potential to improve the health system. It uses data and
analytics for better insights and to identify best practices that will enhance health care services and
reduce costs.
Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group
of products, then you are more likely to buy another group of products. This technique may enable
the retailer to understand the purchase behavior of a buyer. This data may assist the retailer in
understanding the requirements of the buyer and altering the store's layout accordingly.
Educational data mining (EDM) is a newly emerging field, concerned with developing techniques that
explore knowledge from the data generated in educational environments. EDM objectives are
recognized as predicting students' future learning behavior, studying the impact of educational
support, and promoting learning science.
Knowledge is the best asset possessed by a manufacturing company. Data mining tools can be
beneficial to find patterns in a complex manufacturing process.
Customer Relationship Management (CRM) is all about acquiring and retaining customers,
enhancing customer loyalty, and implementing customer-oriented strategies. To maintain a good
relationship with the customer, a business organization needs to collect data and analyze it.
With data mining technologies, the collected data can be used for analytics.
Billions of dollars are lost to fraud. Traditional methods of fraud detection are
time-consuming and complex. Data mining finds meaningful patterns and turns data
into information. An ideal fraud detection system should protect the data of all the users.
Supervised methods use a collection of sample records that are labeled as
fraudulent or non-fraudulent.
Apprehending a criminal is not a big deal, but bringing out the truth from him is a very challenging
task. Law enforcement may use data mining techniques to investigate offenses, monitor suspected
terrorist communications, etc. This technique includes text mining also, and it seeks meaningful
patterns in data, which is usually unstructured text. The information collected from the previous
investigations is compared, and a model for lie detection is constructed.
The digitalization of the banking system is expected to generate an enormous amount of data with
every new transaction. Data mining techniques can help bankers solve business-related
problems in banking and finance by identifying trends, causalities, and correlations in business
information and market costs that are not instantly evident to managers or executives because the
data volume is too large or is produced too rapidly for experts to screen.
The process of extracting useful data from large volumes of data is data mining. The data in the
real-world is heterogeneous, incomplete, and noisy.
2. Data Distribution:
Real-world data is usually stored on various platforms in a distributed computing environment.
Data mining requires the development of tools and algorithms that allow the mining of distributed
data.
3. Complex Data:
Real-world data is heterogeneous, and it could be multimedia data, including audio and video,
images, complex data, spatial data, time series, and so on. Managing these various types of data
and extracting useful information is a tough task.
4. Performance:
The data mining system's performance relies primarily on the efficiency of algorithms and
techniques used. If the designed algorithm and techniques are not up to the mark, then the
efficiency of the data mining process will be affected adversely.
Data mining can lead to serious issues in terms of data security, governance, and privacy. For
example, if a retailer analyzes the details of the purchased items, it may reveal data about the
buying habits and preferences of the customers without their permission.
6. Data Visualization:
In data mining, data visualization is a very important process because it is the primary method that
shows the output to the user in a presentable way. The extracted data should convey the exact
meaning of what it intends to express.
The primary goal of data mining is to discover hidden patterns and relationships in the data that
can be used to make informed decisions or predictions.
This involves exploring the data using various techniques such as clustering, classification,
regression analysis, association rule mining, and anomaly detection.
Data mining tasks can be classified into two types: descriptive and predictive.
Data mining includes the utilization of refined data analysis tools to find previously unknown,
valid patterns and relationships in huge data sets. These tools can incorporate statistical models,
machine learning techniques, and mathematical algorithms, such as neural networks or decision
trees. Thus, data mining incorporates analysis and prediction.
1. Association
Association analysis is the discovery of association rules showing attribute-value conditions that
occur frequently together in a given set of data. Association analysis is widely used for market
basket or transaction data analysis. Association rule mining is a significant and exceptionally
active area of data mining research.
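A minimal sketch of rule discovery on market basket data is given below, restricted to rules between single items. It counts how often items and item pairs occur, then reports rules that meet a minimum support and minimum confidence. The baskets, the thresholds, and the rules() helper are assumptions for illustration. Practical algorithms such as Apriori handle larger itemsets far more efficiently.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk", "eggs"],
]

# Count in how many baskets each item and each unordered item pair occurs.
item_counts = Counter(item for b in baskets for item in set(b))
pair_counts = Counter(pair for b in baskets
                      for pair in combinations(sorted(set(b)), 2))

def rules(min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for rules lhs -> rhs that pass both thresholds."""
    n = len(baskets)
    found = []
    for (a, b), count in pair_counts.items():
        if count / n < min_support:
            continue  # the pair itself is not frequent enough
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                found.append((lhs, rhs, count / n, confidence))
    return sorted(found)

for lhs, rhs, sup, conf in rules():
    print(f"{lhs} -> {rhs}: support={sup:.2f}, confidence={conf:.2f}")
```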
2. Classification
Classification is the process of finding a set of models (or functions) that describe and
distinguish data classes or concepts, for the purpose of being able to use the model to predict the
class of objects whose class label is unknown. The derived model is based on the
analysis of a set of training data (i.e. data objects whose class label is known).
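To make the idea of predicting an unknown class label from labeled training data concrete, the sketch below uses a 1-nearest-neighbour rule: a new object gets the label of the closest training object. This is only one of many classification methods, chosen here for brevity. The data points and labels are made up.

```python
import math

# Training data: (features, class label) pairs whose labels are known (made-up values).
training = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((5.0, 5.0), "high"),
    ((6.0, 5.5), "high"),
]

def classify(point):
    """Predict the label of the closest training example (1-nearest neighbour)."""
    _, label = min(training, key=lambda pair: math.dist(pair[0], point))
    return label

print(classify((1.2, 1.1)))
print(classify((5.5, 5.0)))
```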
3. Prediction
Data Prediction is a two-step process, similar to that of data classification. However, for
prediction, we do not use the term “class label attribute” because the attribute for
which values are being predicted is continuous-valued (ordered) instead of categorical
(discrete-valued and unordered).
4. Clustering
Unlike classification and prediction, which analyze class-labeled data objects or attributes,
clustering analyzes data objects without consulting an identified class label. In general, the class
labels do not exist in the training data simply because they are not known to begin with.
Clustering can be used to generate these labels. The objects are clustered based on the principle
of maximizing the intra-class similarity and minimizing the interclass similarity. That is,
clusters of objects are created so that objects inside a cluster have high similarity to
one another, but are very dissimilar to objects in other clusters.
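One well-known way to group unlabeled objects by similarity is k-means, sketched below for one-dimensional values. The text does not prescribe a particular algorithm, so treat this as an illustrative choice. The values, the starting centers, and the kmeans_1d() helper are assumptions.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group values around the nearest center, then move each center to its cluster mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]   # made-up, unlabeled observations
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

No class labels are supplied. The two groups emerge only from how close the values are to each other, and the cluster index can then serve as a generated label.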
5. Regression
Regression can be defined as a statistical modeling method in which previously obtained data is
used to predict a continuous quantity for new observations. Such a model is also known as
a Continuous Value Classifier. Two common types of regression models are linear regression
and multiple linear regression.
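A linear regression with a single input can be fitted with the ordinary least-squares formulas for slope and intercept, as sketched below. The observations are made up and lie exactly on a line so the result is easy to check by hand. fit_line() is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Previously obtained data (made up; lies exactly on y = 2x + 1).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept   # predicted value for the new observation x = 5
print(slope, intercept, prediction)
```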
An artificial neural network (ANN), also referred to as simply a “Neural Network” (NN), is
a computational model inspired by biological neural networks. It consists of an interconnected
collection of artificial neurons. A neural network is a set of connected input/output units where
each connection has a weight associated with it. During the learning phase, the network
learns by adjusting the weights to be able to predict the correct class label of the input samples.
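The weight-adjustment idea can be illustrated with the smallest possible network, a single artificial neuron (perceptron) trained on the logical AND function. This is a deliberately simplified sketch, not a full multi-layer network. The training samples, the learning rate of 1, and the epoch count are assumptions.

```python
def predict(weights, bias, inputs):
    """Fire (output 1) when the weighted sum of the inputs plus bias is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=20, rate=1):
    """Learn by nudging each weight in proportion to the prediction error."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

# Labeled input samples for logical AND.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(samples)
print([predict(weights, bias, inputs) for inputs, _ in samples])
```

A single neuron can only learn linearly separable classes such as AND. Networks with hidden layers are needed for more complex class boundaries.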
7. Outlier Detection
A database may contain data objects that do not comply with the general behavior or model of
the data. These data objects are outliers. The investigation of outlier data is known as
outlier mining. An outlier may be detected using statistical tests which assume a
distribution or probability model for the data, or using distance measures where objects having
a small fraction of “close” neighbors in space are considered outliers.
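The distance-based approach can be sketched as follows: for each object, count what fraction of the other objects lie within a given radius, and flag the object when that fraction is too small. The readings, the radius, and the fraction threshold are made-up parameters.

```python
def distance_outliers(values, radius, min_fraction):
    """Flag values that have too few other values within the given radius."""
    outliers = []
    for i, v in enumerate(values):
        close = sum(1 for j, other in enumerate(values)
                    if j != i and abs(v - other) <= radius)
        if close / (len(values) - 1) < min_fraction:
            outliers.append(v)
    return outliers

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0]   # made-up sensor readings
print(distance_outliers(readings, radius=1.0, min_fraction=0.5))
```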
Problem Definition
The first stage of data mining is problem definition, which involves identifying a specific business
problem or objective to be achieved through data analysis. This could range from
improving customer retention rates to identifying opportunities for cost savings.
Data Collection
Once the problem has been clearly defined, data collection is the next stage of data mining. This
involves gathering relevant data from a variety of sources, including both internal and external
sources.
Data Analysis
Once you have collected and organized your data, the next data mining stage is data analysis. In
this step, statistical methods and algorithms are used to analyze the data in order to uncover
patterns, relationships, and insights. This can involve using techniques such as regression
analysis, clustering analysis, or decision trees to identify relationships between variables and
make predictions about future outcomes.
Evaluation
After completing the analysis stage of data mining, it’s important to evaluate the results against
your original problem definition. This allows you to determine whether your analysis has
addressed the initial problem or needs further refinement.
Deployment
The deployment stage is the final step in the data mining process. Once analysis has been
completed, it’s essential to integrate the results into business practice by incorporating them into
decision-making processes.