
Data Mining 445545

Data mining, also known as Knowledge Discovery in Database (KDD), is a technique used to extract valuable information from large datasets across various domains such as banking and pharmaceuticals. The KDD process involves several steps including data cleaning, integration, selection, transformation, mining, evaluation, and knowledge representation. While data mining offers advantages like cost efficiency and enhanced decision-making, it also poses challenges such as potential misuse of customer data and the complexity of mining tools.

Uploaded by

Fouziya A
Copyright
© All Rights Reserved

Data Mining

Definition
Data mining is one of the most useful techniques that help entrepreneurs, researchers, and individuals extract
valuable information from huge sets of data. Data mining is also called Knowledge Discovery in Databases (KDD).
Data mining is the process of extracting useful information that is stored in large databases.
It is a powerful tool that helps organizations retrieve useful information from their available data
warehouses.
Data mining can be applied to relational databases, object-oriented databases, data warehouses, structured and
unstructured databases, etc.
Data mining is used in numerous areas such as banking, insurance, and pharmaceuticals.

KDD and Data mining


The main goal of KDD is to extract knowledge from large databases with the help of data mining methods.
The different steps of KDD are as given below:
1. Data cleaning:
In this step, noise and irrelevant data are removed from the database.
2. Data integration:
In this step, the heterogeneous data sources are merged into a single data source.
3. Data selection:
In this step, the data relevant to the analysis is retrieved from the database.
4. Data transformation:
In this step, the selected data is transformed into forms that are suitable for data mining.
5. Data mining:
In this step, various techniques are applied to extract data patterns.
6. Pattern evaluation:
In this step, the extracted data patterns are evaluated.
7. Knowledge representation:
This is the final step of KDD, in which the discovered knowledge is presented to the user.
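The seven steps above can be sketched as a chain of small functions. This is only a toy illustration: the two sources, the field names, and the thresholds are all made up for the example, and the "mining" step is deliberately trivial.

```python
# Hypothetical raw records from two sources; None marks a missing (noisy) value.
store_a = [{"customer": "ann", "amount": 120.0}, {"customer": "bob", "amount": None}]
store_b = [{"customer": "cara", "amount": 80.0}, {"customer": "ann", "amount": 40.0}]


def clean(records):
    """Step 1, data cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]


def integrate(*sources):
    """Step 2, data integration: merge several sources into one list."""
    return [r for source in sources for r in source]


def select(records, min_amount):
    """Step 3, data selection: keep only records relevant to the analysis."""
    return [r for r in records if r["amount"] >= min_amount]


def transform(records):
    """Step 4, data transformation: aggregate the amounts per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals


def mine(totals, threshold):
    """Step 5, data mining: a trivial pattern, customers above a threshold."""
    return {customer: total for customer, total in totals.items() if total > threshold}


merged = integrate(clean(store_a), clean(store_b))
totals = transform(select(merged, min_amount=10.0))
patterns = mine(totals, threshold=100.0)

# Steps 6 and 7, pattern evaluation and knowledge representation: here we
# simply report the surviving patterns in a readable form.
for customer, total in sorted(patterns.items()):
    print(f"{customer} is a big spender (total {total:.2f})")
```

In a real project each function would be replaced by a much larger component, but the order of the calls mirrors the order of the KDD steps.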
Advantages of Data Mining
Data mining enables organizations to obtain knowledge-based information.
Data mining enables organizations to make profitable adjustments in operation and production.
Compared with other statistical data applications, data mining is cost-efficient.
Data mining helps the decision-making process of an organization.
It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviors.
It can be introduced in new systems as well as existing platforms.
It is a quick process that makes it easy for new users to analyze enormous amounts of data in a short time.
Disadvantages of Data Mining
There is a risk that organizations may sell useful customer data to other organizations for money. According to
one report, American Express sold its customers' credit card purchase data to other organizations.
Many data mining analytics tools are difficult to operate and require advanced training.
Different data mining tools operate in distinct ways because of the different algorithms used in their design.
Therefore, selecting the right data mining tool is a very challenging task.
Data mining techniques are not perfectly precise, which may lead to severe consequences in certain conditions.
Data Mining Applications

TYPES OF DATA THAT CAN BE MINED


1. Flat Files
Flat files are data files in text or binary form with a structure that can be easily extracted by data mining
algorithms.
Data stored in flat files has no relationships or paths among its parts; for example, if a relational database is
stored in flat files, there will be no relations between the tables.
Flat files are described by a data dictionary. E.g.: a CSV file.
Application: used in data warehousing to store data, used in carrying data to and from a server, etc.
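As a small illustration of how easily a flat file can be read by a program, the sketch below parses CSV text with Python's standard csv module. The column names and values are invented for the example, and an in-memory string stands in for a file on disk.

```python
import csv
import io

# Hypothetical flat file content; io.StringIO stands in for a real file on disk.
flat_file = io.StringIO(
    "customer_id,city,amount\n"
    "1,Pune,250\n"
    "2,Delhi,90\n"
)

# Each row becomes a dictionary keyed by the header line.
rows = list(csv.DictReader(flat_file))

# Every value is read as text, so numeric fields must be converted explicitly.
total_amount = sum(float(row["amount"]) for row in rows)
print(len(rows), "rows, total amount", total_amount)
```

Note that nothing in the file itself links one record to another, which is the lack of relationships described above.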

2. Relational Databases
A relational database is a collection of data organized in tables with rows and columns.
The physical schema of a relational database defines the structure of the tables.
The logical schema of a relational database defines the relationships among the tables.
The standard query language of relational databases is SQL.
Application: Data Mining, ROLAP model, etc.
3. Data Warehouse
A data warehouse is a collection of data integrated from multiple sources that supports queries and decision
making.
There are three types of data warehouse: Enterprise Data Warehouse, Data Mart, and Virtual Warehouse.
Two approaches can be used to update data in a data warehouse: the Query-driven Approach and the Update-driven
Approach.
Application: Business decision making, Data mining, etc.

4. Transactional Databases

A transactional database is a collection of data organized by time stamps, dates, etc. to represent transactions
in databases.
This type of database can roll back or undo its operations when a transaction is not completed or committed.
It is a highly flexible system in which users can modify information without changing any sensitive information.
It follows the ACID properties of a DBMS.
Application: Banking, Distributed systems, Object databases, etc.
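The rollback behavior can be sketched with Python's built-in sqlite3 module. The table, the account names, and the rule that a balance may not go negative are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; undo everything if any step fails."""
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        # This statement violates the CHECK constraint when src lacks funds.
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # Roll back the credit that was already applied to dst.
        conn.rollback()
        return False


failed = transfer(conn, "alice", "bob", 500)
balances_after_failure = dict(conn.execute("SELECT name, balance FROM accounts"))

succeeded = transfer(conn, "alice", "bob", 30)
balances_after_success = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances_after_failure, balances_after_success)
```

The failed transfer leaves both balances untouched even though the first UPDATE had already run, which is the atomicity part of ACID.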

5. Multimedia Databases
Multimedia databases consist of audio, video, image, and text media.
They can be stored in object-oriented databases.
They are used to store complex information in pre-specified formats.
Application: Digital libraries, video-on-demand, news-on-demand, musical databases, etc.
Spatial Database
Stores geographical information.
Stores data in the form of coordinates, topology, lines, polygons, etc.
Application: Maps, Global positioning, etc.

6. Time-series Databases
Time-series databases contain data such as stock exchange data and logged user activities.
They handle arrays of numbers indexed by time, date, etc.
They often require real-time analysis.
Examples: eXtremeDB, Graphite, InfluxDB, etc.

7. WWW
WWW refers to the World Wide Web, a collection of documents and resources such as audio, video, and text that
are identified by Uniform Resource Locators (URLs), linked by HTML pages, and accessed through web browsers over
the Internet.
It is the most heterogeneous repository, as it collects data from multiple resources.
It is dynamic in nature, as the volume of data is continuously increasing and changing.
Application: Online shopping, Job search, Research, Studying, etc.

Types of Data Mining


Each of the following data mining techniques addresses a different business problem and provides a different
insight into it. Understanding the type of business problem you need to solve therefore helps in knowing which
technique will yield the best results. The data mining types can be divided into two basic categories:
Predictive Data Mining Analysis
Descriptive Data Mining Analysis
1. Predictive Data Mining
As the name signifies, predictive data mining analyzes data to help anticipate what may happen later (or in the
future) in a business. Predictive data mining can be further divided into four types:
Classification Analysis
Regression Analysis
Time Series Analysis
Prediction Analysis
2. Descriptive Data Mining
The main goal of descriptive data mining tasks is to summarize or turn given data into relevant information.
Descriptive data mining tasks can also be further divided into four types:
Clustering Analysis
Summarization Analysis
Association Rules Analysis
Sequence Discovery Analysis
Here, we will discuss each of the data mining types in detail. Below are several different data mining techniques
that can help you find optimal outcomes.
1. CLASSIFICATION ANALYSIS
This type of data mining technique is generally used to retrieve important and relevant information about data
and metadata. It is also used to categorize different types of data into different classes. As the later sections
show, classification and clustering are similar data mining types, since clustering also groups data records into
segments known as classes. However, unlike in clustering, in classification the data analyst knows the different
classes in advance. Therefore, in classification analysis you apply algorithms to decide how new data should be
classified. A classic example of classification analysis is Outlook email, which uses certain algorithms to
characterize an email as legitimate or spam.
This technique is usually very helpful for retailers, who can use it to study the buying habits of their
customers. Retailers can also study past sales data and look for products that customers usually buy together.
They can then place those products near each other in their stores to save customers time and to increase sales.
2. REGRESSION ANALYSIS
In statistical terms, regression analysis is a process used to identify and analyze the relationship among
variables. It means one variable is dependent on another, but not vice versa. It is generally used for prediction
and forecasting purposes. It can also help you understand how the value of the dependent variable changes when
any of the independent variables is varied.
3. TIME SERIES ANALYSIS
A time series is a sequence of data points recorded at specific points in time, most often at regular intervals
(seconds, hours, days, months, etc.). Almost every organization generates a high volume of such data every day,
such as sales figures, revenue, traffic, or operating cost. Time series data mining can help generate valuable
information for long-term business decisions, yet it is underutilized in most organizations.
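One of the simplest ways to look at a time series is to smooth it with a moving average, which makes the underlying trend easier to see. The sketch below is a minimal illustration; the daily sales figures and the window size are made up.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive observations."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical daily sales figures, recorded at regular one-day intervals.
daily_sales = [10, 12, 11, 15, 20, 22]
smoothed = moving_average(daily_sales, window=3)
print(smoothed)
```

A rising smoothed series, as in this toy data, suggests an upward trend even when individual days fluctuate.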
4. PREDICTION ANALYSIS
This technique is generally used to predict the relationship that exists between the independent and dependent
variables, as well as among the independent variables alone. It can also be used to predict the profit that can
be achieved in the future depending on sales. Let us imagine that profit and sales are the dependent and
independent variables, respectively. Then, on the basis of past sales data, we can predict future profit using a
regression curve.
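The profit-versus-sales idea can be sketched with an ordinary least squares line, the simplest kind of regression curve. The numbers below are invented and chosen so that the fit is exact; real data would scatter around the line.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical past data: sales is the independent variable, profit the dependent one.
sales = [10.0, 20.0, 30.0, 40.0]
profit = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(sales, profit)

# Predict the profit for a future sales level that has not been observed yet.
predicted_profit = slope * 50.0 + intercept
print(slope, intercept, predicted_profit)
```

For this toy data the fitted line is profit = 0.2 * sales + 1, so the prediction for sales of 50 is about 11.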
5. CLUSTERING ANALYSIS
In data mining, this technique is used to create meaningful clusters of objects that share the same
characteristics. Many people confuse it with classification, but the difference is clear once you understand how
both techniques work. Unlike classification, which places objects into predefined classes, clustering places
objects in classes that it defines itself. To understand it in more detail, consider the following example:
Example
Suppose you are in a library that is full of books on different topics. The real challenge is to organize those
books so that readers have no problem finding books on any particular topic. Here, we can use clustering to keep
similar books on one shelf and then give each shelf a meaningful name or class. A reader looking for books on a
particular topic can then go straight to that shelf and does not need to roam the entire library to find the book
they want to read.
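A common clustering method is k-means, which the text does not name but which illustrates the idea of classes that are discovered rather than predefined. The sketch below works on one-dimensional data; the points, the two starting centers, and the fixed number of iterations are assumptions for the example.

```python
def k_means_1d(points, centers, iterations=10):
    """Very small k-means for numbers on a line."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its old center).
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical measurements with two visible groups; no class labels are given.
points = [1, 2, 3, 10, 11, 12]
centers, clusters = k_means_1d(points, centers=[0.0, 5.0])
print(centers, clusters)
```

The algorithm is never told which group a point belongs to; the two "shelves" emerge from the similarity of the values alone.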
6. SUMMARIZATION ANALYSIS
Summarization analysis is used to represent a group (or set) of data in a more compact and easier-to-understand
form. We can easily understand it with the help of an example:
Example
You have probably used summarization to create graphs or calculate averages from a given set (or group) of data.
This is one of the most familiar and accessible forms of data mining.
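A few summary statistics computed with Python's standard statistics module show the idea; the order values are made up for the example.

```python
import statistics

# Hypothetical order values for one day.
order_values = [4, 8, 15, 16, 23, 42]

# A compact description of the whole data set.
summary = {
    "count": len(order_values),
    "min": min(order_values),
    "max": max(order_values),
    "mean": statistics.mean(order_values),
    "median": statistics.median(order_values),
}
print(summary)
```

Six numbers are reduced to a handful of values that describe their size, range, and center.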
7. ASSOCIATION RULE LEARNING
In general, it can be considered a method that helps us identify interesting relations (dependency modeling)
between different variables in large databases. This technique can also help uncover hidden patterns in the
data, which can be used to identify the variables within the data. It also helps in detecting the co-occurrence
of different variables that appear very frequently in the dataset. Association rules are generally used for
examining and forecasting customer behavior, and are highly recommended for retail industry analysis. The
technique is also used in shopping basket data analysis, catalogue design, product clustering, and store layout.
In IT, programmers also use association rules to build programs capable of machine learning. In short, this data
mining technique finds the association between two or more items and discovers hidden patterns in the data set.
8. SEQUENCE DISCOVERY ANALYSIS
The primary goal of sequence discovery analysis is to discover interesting patterns in data on the basis of some
subjective or objective measure of how interesting they are. Usually, this task involves discovering frequent
sequential patterns with respect to a frequency support measure. It is often confused with time series analysis,
as both deal with adjacent observations that are order dependent. The difference is that time series analysis
works on numerical data, whereas sequence discovery analysis works on discrete values.
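The frequency support of a sequential pattern can be sketched as the fraction of sequences that contain the pattern's items in the given order, not necessarily next to each other. The purchase histories below are invented for the example.

```python
def occurs_in_order(sequence, pattern):
    """True if the items of `pattern` appear in `sequence` in the same order."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the first match, so each
    # later pattern item must be found after the earlier one.
    return all(item in remaining for item in pattern)


def sequence_support(sequences, pattern):
    """Fraction of sequences that contain the pattern in order."""
    return sum(occurs_in_order(s, pattern) for s in sequences) / len(sequences)


# Hypothetical purchase histories, one list per customer, oldest purchase first.
histories = [
    ["pc", "camera", "memory card"],
    ["pc", "printer", "camera"],
    ["camera", "pc"],
    ["printer"],
]

support_value = sequence_support(histories, ["pc", "camera"])
print(support_value)
```

The third customer bought both items but in the opposite order, so that history does not support the pattern.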
Kinds of Patterns
A data mining system is a tool that provides a lot of functionality for mining the data in a database. Data
mining functionalities are used to specify the kinds of patterns to be found in data mining tasks.
Data mining tasks can be classified into two categories:
1. Descriptive
Descriptive mining tasks identify patterns in data and characterize the general properties of the data in the
database.
2. Predictive
Predictive mining tasks predict unknown values based on known data by performing inference on the current data.
Data mining functionalities and the kinds of patterns they discover are described below.

1. Concept/Class Description: Characterization and Discrimination


Data can be associated with classes or concepts.
For example, in the AllElectronics store, classes of items for sale include computers and printers, and concepts
of customers include bigSpenders and budgetSpenders.
Descriptions of such classes or concepts can be derived via
(1) data characterization,
(2) data discrimination, or
(3) both data characterization and discrimination.
Data characterization – summarization of the general characteristics or features of a target class of data.
For example, to study the characteristics of software products whose sales increased by 10% in the last year, the
data related to such products can be collected by executing an SQL query.
The output of data characterization can be presented in various forms. Examples include pie charts, bar charts,
multidimensional data cubes, and multidimensional tables.
Data discrimination – comparison of the target class with one or a set of contrasting classes.
For example, compare the general features of software products whose sales increased by 10% in the last year with
those whose sales decreased by at least 30% during the same period.
2. Mining Frequent Patterns, Associations, and Correlations
Frequent patterns are patterns that occur frequently in data.
Frequent itemset: a set of items that frequently appear together in a transactional data set (example: milk and
bread).
Frequent subsequence: a frequently occurring ordered pattern, such as customers tending to purchase first a PC,
followed by a digital camera, and then a memory card; in general, a purchase of product A followed by a purchase
of product B.
Frequent substructure: refers to different structural forms, such as graphs or trees, which may be combined with
itemsets or subsequences.

Association analysis: find frequent patterns. E.g. sample analysis result.

Association rule:
E.g. buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%]

X is a variable representing a customer. A confidence of 50% means that if a customer buys a computer, there is a
50% chance that she will buy software as well. A support of 1% means that 1% of all the transactions under
analysis showed that computer and software were purchased together.
Association rules are discarded if they do not satisfy both a minimum support threshold and a minimum confidence
threshold.
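Support and confidence can be computed directly from a list of transactions, as the sketch below shows. The five transactions and the two thresholds are invented for the example and do not reproduce the 1% and 50% figures quoted above.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
]


def count_containing(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count_containing(antecedent | consequent) / len(transactions)


def confidence(antecedent, consequent):
    """Among transactions with the antecedent, the fraction that also have the consequent."""
    return count_containing(antecedent | consequent) / count_containing(antecedent)


rule_support = support({"computer"}, {"software"})
rule_confidence = confidence({"computer"}, {"software"})

# Assumed thresholds: a rule is kept only if it passes both of them.
MIN_SUPPORT, MIN_CONFIDENCE = 0.3, 0.6
rule_is_kept = rule_support >= MIN_SUPPORT and rule_confidence >= MIN_CONFIDENCE
print(rule_support, rule_confidence, rule_is_kept)
```

Here two of the five transactions contain both items (support 0.4), and two of the three computer purchases also include software (confidence about 0.67), so the rule passes both thresholds.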

3. Classification and Prediction


Classification
It is the process of finding a model that describes and distinguishes data classes or concepts, so that the model
can be used to predict the class of objects whose class label is unknown.
The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is
known).
The derived model can be represented as classification (IF-THEN) rules, decision trees, or neural networks
(Figure 1.10).
Prediction
Prediction is used to predict missing or unavailable numerical data values rather than class labels.
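A very small classifier shows the two phases, learning from labeled training data and then labeling new objects. The sketch learns a single IF-THEN rule (a one-level decision tree); the spending figures are invented, and the class names are borrowed from the bigSpenders and budgetSpenders example earlier in the text.

```python
def classify(value, threshold):
    """The learned IF-THEN rule: IF value > threshold THEN bigSpender."""
    return "bigSpender" if value > threshold else "budgetSpender"


def learn_threshold(training):
    """Pick the split point with the fewest errors on the training data.

    `training` is a list of (value, label) pairs with at least two
    distinct values; candidate splits are midpoints between neighbors.
    """
    values = sorted({value for value, _ in training})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(classify(value, threshold) != label for value, label in training)

    return min(candidates, key=errors)


# Hypothetical training data: yearly spend with a known class label.
training = [
    (200, "budgetSpender"),
    (350, "budgetSpender"),
    (900, "bigSpender"),
    (1200, "bigSpender"),
]

threshold = learn_threshold(training)

# Objects whose class label is unknown are labeled by the learned rule.
new_labels = [classify(spend, threshold) for spend in (100, 700)]
print(threshold, new_labels)
```

Real classifiers use many attributes and far more data, but the structure is the same: the labels in the training set drive the choice of model.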

4. Cluster Analysis
The class label is unknown: data is grouped to form new classes.
Objects are clustered on the principle of maximizing the intraclass similarity and minimizing the interclass
similarity.
5. Outlier Analysis
A database may contain data objects that do not comply with the general behavior or model of the data. These
objects are outliers.
Most methods discard outliers as noise or exceptions, but in some applications the rare events are the
interesting ones.
Outlier analysis is useful for fraud detection.
E.g. detecting purchases of extremely large amounts.
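One simple way to flag unusually large purchases is to compare each value with the median, using the median absolute deviation (MAD) as the yardstick. This is just one of many possible outlier tests; the purchase amounts and the cutoff factor k are assumptions for the example.

```python
import statistics


def mad_outliers(values, k=10.0):
    """Return values farther than k median absolute deviations from the median.

    If the MAD is zero (most values identical), every value that differs
    from the median is flagged.
    """
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    return [v for v in values if abs(v - center) > k * mad]


# Hypothetical purchase amounts on one credit card.
purchases = [20, 25, 22, 30, 27, 24, 5000]
suspicious = mad_outliers(purchases)
print(suspicious)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value barely moves them.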

6. Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.
E.g. identifying stock evolution regularities for the overall stock market and for the stocks of particular
companies.

Technologies Used in data mining

Data mining has incorporated many techniques from other fields such as machine learning, statistics, information
retrieval, data warehousing, pattern recognition, algorithms, and high-performance computing. Since it is a
highly application-driven domain, its interdisciplinary nature is very significant. Research and development in
data mining and its applications prove quite useful in implementing it. We will see the major technologies
utilized in data mining.

1. Statistics:
Data mining has an inherent connection with statistics. Statistics studies the collection, analysis,
interpretation, and presentation of data. Statistical models are used to model data and data classes: a model
describes the behavior of the objects in a class and the associated probabilities. Statistical models can be the
outcomes of data mining tasks such as classification and data characterization, or data mining tasks can be built
on top of statistical models.
Statistics uses mathematical analysis to represent, model, and summarize empirical data or real-world
observations.
Statistical analysis involves a collection of methods that are applied to large amounts of data to draw
conclusions and report trends.

Advantage:
Statistics can be used to model noise and missing data values. Statistics provides tools for forecasting,
predicting, and summarizing data, and is useful for pattern mining. After a classification model has been mined,
statistical hypothesis testing is used for verification. A hypothesis test makes decisions using the test data.
The result is statistically significant if it is unlikely to have occurred by chance.
Disadvantage:
When a statistical model is used on a large data set, the computational cost increases. When data mining is used
to handle large real-time and streamed data, computation costs increase dramatically.
2. Machine learning
Arthur Samuel defined machine learning as a field of study that gives computers the ability to learn without being
explicitly programmed.
When new data is entered into the computer, machine learning algorithms allow the learned model to grow or change.
In machine learning, an algorithm is constructed to make predictions from the available data (predictive
analysis). It is related to computational statistics.
The four types of machine learning are:

1. Supervised learning
Supervised learning is based on classification.
It is also called inductive learning. In this method, the desired outputs are included in the training dataset.
2. Unsupervised learning
Unsupervised learning is based on clustering. Clusters are formed on the basis of similarity measures, and the
desired outputs are not included in the training dataset.
3. Semi-supervised learning
Semi-supervised learning includes some desired outputs in the training dataset to generate the appropriate
functions. This method generally avoids the need for a large number of labeled examples (i.e. desired outputs).
4. Active learning
Active learning is a powerful approach for analyzing data efficiently.
The algorithm itself chooses the examples for which it wants the desired output and asks a user to provide it, so
the user plays an important role in this type of learning.
3. Information Retrieval:
This technique searches for information in documents, which may be text, multimedia, or resources residing on
the Web. It has two main characteristics:
The searched data is unstructured.
Queries are formed from keywords that don't have complex structures.
The most widely used information retrieval approach is the probabilistic model. Information retrieval combined
with data mining techniques is used to find relevant topics in documents or on the web.
Uses: A large amount of data, both text and multimedia, is available and streamed on the web because of the fast
growth of digitalization in the government sector, health care, and many other areas. Searching and analyzing
this data raises many challenges, and hence information retrieval is becoming increasingly important.

4. Database systems and data warehouse


Databases are used for recording data as well as for data warehousing.
Online Transaction Processing (OLTP) uses databases for day-to-day transactions.
To remove redundant data and save storage space, data is normalized and stored in the form of tables.
Entity-Relationship modeling techniques are used for relational database management system design.
Data warehouses are used to store historical data, which helps in making strategic business decisions.
A data warehouse is used for online analytical processing (OLAP), which helps to analyze the data.
5. Decision support system
A decision support system is a category of information system. It is very useful in decision making for
organizations.
It is an interactive, software-based system that helps decision makers extract useful information from data and
documents in order to make decisions.

Data Mining – Major Issues


Data mining is not an easy task, as the algorithms used can get very complex and data is not always available at one
place. It needs to be integrated from various heterogeneous data sources. These factors also create some issues.
Here in this tutorial, we will discuss the major issues regarding −
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues
The following diagram describes the major issues.
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
o Mining different kinds of knowledge in databases − Different users may be interested in different kinds of
knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
o Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be
interactive because this allows users to focus the search for patterns, providing and refining data mining
requests based on the returned results.
o Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to
express the discovered patterns, not only in concise terms but also at multiple levels of abstraction.
o Data mining query languages and ad hoc data mining − A data mining query language that allows the user to
describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for
efficient and flexible data mining.
o Presentation and visualization of data mining results − Once patterns are discovered, they need to be
expressed in high-level languages and visual representations. These representations should be easily
understandable.
o Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects
while mining the data regularities. Without data cleaning methods, the accuracy of the discovered patterns will
be poor.
o Pattern evaluation − Many discovered patterns are uninteresting because they either represent common knowledge
or lack novelty, so the patterns need to be evaluated for how interesting they are.
Performance Issues
There can be performance-related issues such as follows −
o Efficiency and scalability of data mining algorithms − In order to effectively extract information from huge
amounts of data in databases, data mining algorithms must be efficient and scalable.
o Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide
distribution of data, and the complexity of data mining methods motivate the development of parallel and
distributed data mining algorithms. These algorithms divide the data into partitions, which are then processed
in parallel, and the results from the partitions are merged. Incremental algorithms incorporate database
updates without mining the data again from scratch.
Diverse Data Types Issues
o Handling of relational and complex types of data − The database may contain complex data objects, multimedia
data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of
data.
o Mining information from heterogeneous databases and global information systems − The data is available at
different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured.
Therefore mining knowledge from them adds challenges to data mining.
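The partition-and-merge idea, and the incremental update, can be sketched with item counting. For simplicity the partitions below are processed one after another; in a real system each call to count_items could run in a separate worker. All transactions are invented for the example.

```python
from collections import Counter


def count_items(partition):
    """Count in how many transactions of one partition each item appears."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


# Hypothetical transaction database, already split into two partitions.
partitions = [
    [["milk", "bread"], ["milk"]],
    [["bread", "eggs"], ["milk", "bread"]],
]

# Each partition is mined independently, then the partial results are merged.
partial_counts = [count_items(p) for p in partitions]
merged_counts = sum(partial_counts, Counter())
counts_before_update = dict(merged_counts)

# Incremental step: a new batch arrives, and only that batch is mined.
new_batch = [["eggs", "milk"]]
merged_counts.update(count_items(new_batch))
print(counts_before_update, dict(merged_counts))
```

Counting merges cleanly because the partial counts simply add up; many real mining algorithms need a more careful merge step.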
Data Objects and Attribute Types
A data object represents an entity: in a sales database, the objects may be customers, store items, and sales; in
a medical database, the objects may be patients; in a university database, the objects may be students,
professors, and courses. Data objects are typically described by attributes. Data objects can also be referred to
as samples, examples, instances, data points, or objects. If the data objects are stored in a database, they are
data tuples. That is, the rows of a database correspond to the data objects, and the columns correspond to the
attributes. In this section, we define attributes and look at the various attribute types.

Attribute:
An attribute can be seen as a data field that represents a characteristic or feature of a data object. For a
customer object, the attributes can be customer ID, address, etc. The set of attributes used to describe a given
object is known as an attribute vector or feature vector.
Types of attributes:
Identifying attribute types is the first step of data preprocessing. We differentiate between the different types
of attributes and then preprocess the data. The attribute types are described below.
Qualitative (Nominal (N), Ordinal (O), Binary (B))
Quantitative (Numeric, Discrete, Continuous)

Qualitative Attributes:
1. Nominal Attributes – related to names: The values of a nominal attribute are names of things or some kind of
symbols. The values represent a category or state, which is why nominal attributes are also referred to as
categorical attributes. There is no order (rank, position) among the values of a nominal attribute.
Example :

2. Binary Attributes: Binary data has only 2 values/states, for example yes or no, affected or unaffected, true or
false.
Symmetric: Both values are equally important (Gender).
Asymmetric: Both values are not equally important (Result).
3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful sequence or ranking (order)
between them, but the magnitude between values is not known. The order of the values shows what is important but
does not indicate how important it is.

Quantitative Attributes:

1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented by integer or
real values. Numeric attributes are of 2 types, interval and ratio.
An interval-scaled attribute has values whose differences are interpretable, but it does not have a true
reference point, which we can call a zero point. Data on an interval scale can be added and subtracted but cannot
be multiplied or divided. Consider the example of temperature in degrees Centigrade: if the temperature on one
day is twice that of another day, we cannot say that one day is twice as hot as the other.
A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can
speak of a value as being a multiple (or ratio) of another value. The values are ordered, and we can also compute
the difference between values, as well as the mean, median, mode, interquartile range, and five-number summary.
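The temperature example can be checked with a few lines of arithmetic. Degrees Centigrade form an interval scale, while kelvins have a true zero point and form a ratio scale; the two day temperatures below are made up.

```python
def celsius_to_kelvin(celsius):
    """Convert to kelvins, a scale with a true zero point."""
    return celsius + 273.15


# Hypothetical temperatures on two days, in degrees Centigrade.
day_one_c, day_two_c = 10.0, 20.0

# Differences are meaningful on both scales and have the same size.
celsius_diff = day_two_c - day_one_c
kelvin_diff = celsius_to_kelvin(day_two_c) - celsius_to_kelvin(day_one_c)

# The ratio of the Centigrade numbers is 2, but the physically meaningful
# ratio, taken on the kelvin scale, is only about 1.035.
celsius_ratio = day_two_c / day_one_c
kelvin_ratio = celsius_to_kelvin(day_two_c) / celsius_to_kelvin(day_one_c)
print(celsius_diff, kelvin_diff, celsius_ratio, kelvin_ratio)
```

The difference survives the change of scale, but the factor of two does not, which is why ratios are not meaningful for interval-scaled attributes.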

2. Discrete: Discrete data can be numerical or categorical in form. These attributes have a finite or countably
infinite set of values.

Example:

3. Continuous: Continuous data has an infinite number of states and is typically represented as floating-point
values. For example, there are infinitely many values between 2 and 3.
Example :
