Data Mining Tutorial
The data mining tutorial provides basic and advanced concepts of data mining. Our data
mining tutorial is designed for learners and experts.
Data mining is one of the most useful techniques for helping entrepreneurs, researchers,
and individuals extract valuable information from huge sets of data. Data mining is
also called Knowledge Discovery in Databases (KDD). The knowledge discovery
process includes Data cleaning, Data integration, Data selection, Data transformation,
Data mining, Pattern evaluation, and Knowledge presentation.
Our Data mining tutorial includes all topics of Data mining such as applications, Data
mining vs Machine learning, Data mining tools, Social Media Data mining, Data mining
techniques, Clustering in data mining, Challenges in Data mining, etc.
In other words, we can say that Data Mining is the process of investigating data from
various perspectives to discover hidden patterns and categorize it into useful
information. This information is collected and assembled in particular areas such as
data warehouses, where efficient analysis and data mining algorithms support decision
making and other data requirements, ultimately cutting costs and generating revenue.
Data mining is the act of automatically searching large stores of data to find trends
and patterns that go beyond simple analysis procedures. Data mining utilizes complex
mathematical algorithms to segment the data and evaluate the probability of future
events. Data Mining is also called Knowledge Discovery of Data (KDD).
Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful information.
Data Mining is similar to Data Science: it is carried out by a person, in a specific
situation, on a particular data set, with an objective. This process includes various types
of services such as text mining, web mining, audio and video mining, pictorial data
mining, and social media mining. It is done through software that may be simple or
highly specialized. By outsourcing data mining, all the work can be done faster and with
low operating costs. Specialized firms can also use new technologies to collect data that
is impossible to locate manually. There are tons of information available on various
platforms, but very little knowledge is accessible. The biggest challenge is to analyze
the data to extract important information that can be used to solve a problem or aid
company development. There are many powerful tools and techniques available to mine
data and find better insights from it.
Types of Data Mining
Data mining can be performed on the following types of data:
Relational Database:
A relational database is a collection of multiple data sets formally organized into
tables, records, and columns, with well-defined relationships between the tables. Data
can be accessed and reassembled in many different ways without having to reorganize
the database tables.
Data warehouses:
A Data Warehouse is the technology that collects the data from various sources within
the organization to provide meaningful business insights. The huge amount of data
comes from multiple places such as Marketing and Finance. The extracted data is
utilized for analytical purposes and helps in decision-making for a business
organization. The data warehouse is designed for the analysis of data rather than
transaction processing.
Data Repositories:
The Data Repository generally refers to a destination for data storage. However, many IT
professionals use the term more specifically to refer to a particular kind of setup within an IT
structure. For example, a group of databases, where an organization has kept various
kinds of information.
Object-Relational Database:
One of the primary objectives of the object-relational data model is to close the gap
between relational databases and the object-oriented practices frequently used in many
programming languages, for example, C++, Java, C#, and so on.
Transactional Database:
A transactional database refers to a database management system (DBMS) that can
undo a database transaction if it is not completed successfully. Although this was a
unique capability long ago, today most relational database systems support
transactional operations.
The following are some areas where data mining is widely used:
Healthcare:
Data mining in healthcare has excellent potential to improve the health system. It uses
data and analytics for better insights and to identify best practices that will enhance
health care services and reduce costs. Analysts use data mining approaches such as
Machine learning, Multi-dimensional database, Data visualization, Soft computing, and
statistics. Data Mining can be used to forecast the volume of patients in each category. The procedures
ensure that the patients get intensive care at the right place and at the right time. Data
mining also enables healthcare insurers to recognize fraud and abuse.
Education:
Educational data mining (EDM) is a newly emerging field concerned with developing
techniques that discover knowledge from data generated in educational environments.
EDM objectives include predicting students' future learning behavior, studying the
impact of educational support, and advancing the science of learning. An institution can
use data mining to make precise decisions and to predict student outcomes. With these
results, the institution can concentrate on what to teach and how to teach it.
Manufacturing Engineering:
Knowledge is the best asset possessed by a manufacturing company. Data mining tools
can be beneficial to find patterns in a complex manufacturing process. Data mining can
be used in system-level designing to obtain the relationships between product
architecture, product portfolio, and data needs of the customers. It can also be used to
forecast the product development period, cost, and expectations among the other tasks.
Fraud Detection:
Billions of dollars are lost to fraud every year. Traditional methods of fraud detection
are time-consuming and complex. Data mining helps to discover meaningful patterns
and turn data into information. An ideal fraud detection system should
protect the data of all the users. Supervised methods consist of a collection of sample
records, and these records are classified as fraudulent or non-fraudulent. A model is
constructed from this data, and the model is then used to identify whether a new
record is fraudulent or not, as in the sketch below.
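To make the supervised approach concrete, here is a minimal sketch in Python, assuming scikit-learn is available; the feature names, synthetic records, and labels are hypothetical illustrations, not real fraud data.

```python
# A minimal sketch of supervised fraud detection, assuming scikit-learn
# is available. The features and labels below are synthetic illustrations.
from sklearn.ensemble import RandomForestClassifier

# Each record: [transaction amount, hour of day, transactions in last 24h]
records = [
    [120.0,  14, 2],   # normal
    [80.0,   10, 1],   # normal
    [5000.0,  3, 9],   # fraudulent
    [45.0,   16, 3],   # normal
    [7200.0,  2, 12],  # fraudulent
    [60.0,   11, 1],   # normal
]
labels = [0, 0, 1, 0, 1, 0]  # 1 = fraudulent, 0 = non-fraudulent

# Build a model from the labeled sample records ...
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(records, labels)

# ... then ask whether a new record looks fraudulent.
new_record = [[6500.0, 4, 10]]
print("fraudulent" if model.predict(new_record)[0] else "non-fraudulent")
```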
Challenges of Implementation in Data Mining
Noisy and Incomplete Data:
Data mining is the process of extracting useful information from large volumes of data.
Real-world data is heterogeneous, incomplete, and noisy. Data in huge quantities will
often be inaccurate or unreliable. These problems may occur due to errors in the
measuring instruments or because of human errors. Suppose a retail chain collects the
phone numbers of customers who spend more than $500, and the accounting employees
enter the information into their system. A person may mistype a digit when entering a
phone number, which results in incorrect data. Some customers may not be willing to
disclose their phone numbers, which results in incomplete data. The data could even get
changed due to human or system error. All these consequences (noisy and incomplete
data) make data mining challenging.
Data Distribution:
Real-world data is usually stored on various platforms in distributed computing
environments. It might be in databases, individual systems, or even on the internet.
Practically, it is quite a tough task to bring all the data into a centralized data repository,
mainly due to organizational and technical concerns. For example, various regional
offices may have their own servers to store their data, and it is not feasible to store all
the data from all the offices on a central server. Therefore, data mining requires the development
of tools and algorithms that allow the mining of distributed data.
Complex Data:
Real-world data is heterogeneous, and it could be multimedia data, including audio and
video, images, complex data, spatial data, time series, and so on. Managing these
various types of data and extracting useful information from them is a tough task. Most
of the time, new technologies, tools, and methodologies have to be developed or
refined to obtain the specific information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms
and techniques used. If the designed algorithm and techniques are not up to the mark,
then the efficiency of the data mining process will be affected adversely.
Data Privacy and Security:
Data mining usually leads to serious issues in terms of data security, governance, and
privacy. For example, if a retailer analyzes the details of purchased items, it reveals
information about the buying habits and preferences of customers without their
permission.
Data Visualization:
In data mining, data visualization is a very important process because it is the primary
method that shows the output to the user in a presentable way. The extracted data
should convey the exact meaning of what it intends to express. But many times,
representing the information to the end-user in a precise and easy way is difficult. Since
the input data and the output information are complicated, very efficient and effective
data visualization processes need to be implemented to make them successful.
There are many more challenges in data mining in addition to the above-mentioned problems.
More problems are disclosed as the actual data mining process begins, and the success of data
mining relies on getting rid of all these difficulties.
Prerequisites
Before learning the concepts of Data Mining, you should have a basic understanding of
statistics, database knowledge, and a basic programming language.
Audience
Our Data Mining Tutorial is prepared for all beginners or computer science graduates to
help them learn the basics to advanced techniques related to data mining.
Problems
We assure you that you will not find any difficulty while learning our Data Mining
tutorial. But if there is any mistake in this tutorial, kindly post the problem or error in the
contact form so that we can improve it.
In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.
1. Classification:
This technique is used to obtain important and relevant information about data and
metadata. It helps to classify data into different classes; a brief sketch follows below.
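As a brief illustration of the idea, the following sketch trains a decision tree classifier on scikit-learn's built-in Iris data set; the library choice and parameters are assumptions for demonstration only.

```python
# A brief, hedged illustration of classification using scikit-learn's
# built-in Iris data set; any labeled data set would work the same way.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Learn a mapping from attribute values to one of the known classes.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# Classify a previously unseen record into one of the classes.
print(clf.predict([[5.1, 3.5, 1.4, 0.2]]))  # e.g. class 0 (setosa)
```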
2. Clustering:
Clustering is the division of information into groups of connected objects. Describing
the data by a few clusters loses certain fine details but achieves simplification; in effect,
it models the data by its clusters. From a historical point of view, data modeling places
clustering in a framework rooted in statistics, mathematics, and numerical analysis.
From a machine learning point of view, clusters correspond to hidden patterns, the
search for clusters is unsupervised learning, and the resulting framework represents a
data concept. From a practical point of view, clustering plays an extraordinary role in
data mining applications such as scientific data exploration, text mining, information
retrieval, spatial database applications, CRM, web analysis, computational biology,
medical diagnostics, and much more.
In other words, we can say that Clustering analysis is a data mining technique to identify
similar data. This technique helps to recognize the differences and similarities between
the data. Clustering is very similar to classification, but it involves grouping chunks of
data together based on their similarities; a minimal sketch follows below.
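A minimal clustering sketch, assuming scikit-learn's k-means implementation; the 2-D points are synthetic and chosen only to show two natural groups.

```python
# A minimal clustering sketch with k-means; the 2-D points are
# synthetic and purely illustrative.
from sklearn.cluster import KMeans
import numpy as np

points = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.8, 1.0],   # one natural group
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.8],   # another natural group
])

# Group the data into two clusters of similar objects.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # representative centre of each cluster
```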
3. Regression:
Regression analysis is the data mining process used to identify and analyze the
relationship between variables in the presence of other factors. It is used to estimate
the likely value of a specific variable. Regression is primarily a form of planning and
modeling. For example, we might use it to project certain costs, depending on other
factors such as availability, consumer demand, and competition. Primarily, it gives the
exact relationship between two or more variables in the given data set; a sketch
follows below.
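The following hedged sketch fits a linear regression that projects cost from assumed factors such as consumer demand and availability; all numbers are invented for illustration.

```python
# A hedged sketch of regression: projecting cost from assumed factors
# such as demand and availability. All numbers are made up.
from sklearn.linear_model import LinearRegression

# Columns: [consumer demand index, availability index]
X = [[10, 5], [12, 4], [15, 3], [18, 2], [20, 1]]
cost = [100, 115, 140, 165, 185]  # observed costs

model = LinearRegression().fit(X, cost)

# The fitted coefficients express the relationship between the variables ...
print(model.coef_, model.intercept_)
# ... and can be used to project the cost for new factor values.
print(model.predict([[16, 3]]))
```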
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds
a hidden pattern in the data set.
Association rules are if-then statements that help to show the probability of
interactions between data items within large data sets in different types of databases.
Association rule mining has several applications and is commonly used to help discover
sales correlations in transactional data or in medical data sets.
The way the algorithm works is that you have various data, for example, a list of grocery
items that you have been buying for the last six months. It calculates the percentage of
items being purchased together. Three measures are commonly used:
o Lift:
This measure shows how much more often items A and B are purchased together
than expected if item B were purchased independently. It relates the confidence
to how often item B is purchased on its own:
Lift = Confidence / ((Item B) / (Entire dataset))
o Support:
This measure shows how often items A and B are purchased together, compared
to the overall dataset:
Support = (Item A + Item B) / (Entire dataset)
o Confidence:
This measure shows how often item B is purchased when item A is purchased
as well:
Confidence = (Item A + Item B) / (Item A)
A small worked example of these three measures follows below.
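Here is a small worked example that computes the three measures directly from a hypothetical list of grocery transactions, using plain Python.

```python
# A small worked example of support, confidence, and lift, computed
# directly from a hypothetical list of grocery transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
n = len(transactions)

count_a  = sum("bread" in t for t in transactions)            # Item A
count_b  = sum("milk" in t for t in transactions)             # Item B
count_ab = sum({"bread", "milk"} <= t for t in transactions)  # A and B

support    = count_ab / n                # (Item A + Item B) / (Entire dataset)
confidence = count_ab / count_a          # (Item A + Item B) / (Item A)
lift       = confidence / (count_b / n)  # Confidence / Support(Item B)

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

With these counts, support is 2/5 = 0.4, confidence is 2/3 ≈ 0.67, and lift is 0.67/0.8 ≈ 0.83, suggesting that bread and milk co-occur slightly less often than if they were purchased independently.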
5. Outlier Detection:
This type of data mining technique relates to the observation of data items in a data
set that do not match an expected pattern or behavior. The technique can be used in
various domains, such as intrusion detection, fraud detection, etc. It is also known as
Outlier Analysis or Outlier Mining. An outlier is a data point that diverges too much
from the rest of the dataset, and the majority of real-world datasets contain outliers.
Outlier detection plays a significant role in the data mining field and is valuable in
numerous areas like network intrusion identification, credit or debit card fraud
detection, detecting outlying values in wireless sensor network data, etc. A minimal
sketch follows below.
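As a minimal sketch of outlier analysis, the z-score approach below flags points that lie far from the mean; the data is synthetic, and the threshold of 2.5 is one common convention, not a fixed rule.

```python
# A minimal outlier-analysis sketch using z-scores with NumPy;
# the values are synthetic, and the threshold of 2.5 is one common choice.
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95, 10, 12, 11])

# A point whose value lies many standard deviations from the mean
# diverges too much from the rest of the dataset.
z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 2.5]

print(outliers)  # -> [95]
```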
6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating
sequential data in order to discover sequential patterns. It comprises finding interesting
subsequences in a set of sequences, where the value of a sequence can be measured in
terms of various criteria like length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize similar
patterns in transaction data over a period of time; a simplified sketch follows below.
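The following hedged sketch counts how often ordered pairs of items occur across hypothetical purchase sequences, a much-simplified version of sequential pattern mining.

```python
# A hedged sketch of sequential pattern discovery: counting how often
# ordered pairs of items appear across customer purchase sequences.
# The sequences are hypothetical.
from collections import Counter
from itertools import combinations

sequences = [
    ["laptop", "mouse", "keyboard"],
    ["laptop", "keyboard", "headset"],
    ["mouse", "laptop", "keyboard"],
]

pair_counts = Counter()
for seq in sequences:
    # combinations() preserves order, so ("laptop", "keyboard") means
    # laptop was bought before keyboard in that sequence.
    pair_counts.update(set(combinations(seq, 2)))

# Keep subsequences whose occurrence frequency meets a minimum threshold.
min_support = 2
frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent)  # e.g. ('laptop', 'keyboard') appears in all 3 sequences
```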
7. Prediction:
Prediction uses a combination of other data mining techniques such as trend analysis,
clustering, classification, etc. It analyzes past events or instances in the right sequence
to predict a future event.
Data mining is described as a process of finding hidden, valuable information by
evaluating the huge quantity of information stored in data warehouses, using multiple
data mining techniques such as Artificial Intelligence (AI), machine learning, and
statistics.
Let's examine the implementation process for data mining in detail:
1. Business Understanding:
Business understanding focuses on understanding the project objectives and
requirements from a business perspective, then converting this knowledge into a data
mining problem definition and a preliminary plan.
Tasks:
Assess situation:
o It requires a more detailed analysis of facts about all the resources, constraints,
assumptions, and other factors that ought to be considered.
o A business goal states the target in business terminology. For example,
increase catalog sales to existing customers.
o A data mining goal describes the project objectives in technical terms. For
example, predict how many items a customer will buy, given their demographic
details (age, salary, and city) and the price of the item over the past three years.
o It states the intended plan to accomplish the business and data mining goals.
o The project plan should describe the expected set of steps to be performed
during the rest of the project, including the selection of techniques and tools.
2. Data Understanding:
Data understanding starts with an initial data collection and proceeds with activities to
become familiar with the data, to identify data quality issues, to discover first insights
into the data, or to detect interesting subsets that may form hypotheses about hidden
information.
Tasks:
Describe data:
o It examines the surface properties of the acquired data, such as the format of
the data, the number of records, and the identities of the fields, and reports on
the results.
Explore data:
o It tackles data mining questions using querying, visualization, and reporting
techniques, such as distributions of key attributes and relationships between
pairs of attributes.
3. Data Preparation:
Tasks:
o Select data
o Clean data
o Construct data
o Integrate data
o Format data
Select data:
o It decides which data will be used for the analysis.
o The data selection criteria include relevance to the data mining objectives,
quality, and technical constraints such as limits on data volume or data types.
o It covers the selection of attributes (columns) as well as the selection of
records (rows) in a table.
Clean data:
o It raises the data quality to the level required by the selected analysis
techniques, for example by handling missing values or correcting erroneous
entries.
Construct data:
o It includes constructive data preparation operations, such as producing derived
attributes or generating entirely new records.
Integrate data:
o Integrating data refers to the methods whereby data is combined from multiple
tables or records to create new records or values.
Format data:
o It refers to primarily syntactic modifications made to the data that do not
change its meaning but might be required by the modeling tool.
4. Modeling:
In modeling, various modeling methods are selected and applied, and their parameters
are calibrated to optimal values. Some methods have specific requirements on the form
of the data, so stepping back to the data preparation phase may be necessary.
Tasks:
Select modeling technique:
o It selects the actual modeling technique that is to be used, for example, a
decision tree or a neural network.
o If multiple techniques are applied, this task is performed separately for each
technique.
Generate test design:
o Before building a model, generate a procedure or mechanism to test the
model's validity and quality. For example, in classification, error rates are
commonly used as quality measures for data mining models. Therefore, we
typically separate the data set into a train set and a test set, build the model on
the train set, and assess its quality on the separate test set, as in the sketch below.
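A small sketch of this test design, assuming scikit-learn: split the data, build the model on the train set, and use the error rate on the held-out test set as the quality measure.

```python
# A small sketch of the test-design task described above: split the data,
# build the model on the train set, and use the error rate on the held-out
# test set as the quality measure. Uses scikit-learn for illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

error_rate = 1.0 - model.score(X_test, y_test)
print(f"error rate on separate test set: {error_rate:.3f}")
```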
Build model:
o To create one or more models, we need to run the modeling tool on the
prepared data set.
Assess model:
o The analyst interprets the models according to their domain expertise, the data
mining success criteria, and the desired test design.
o It assesses the success of the application of modeling and evaluates the
models more technically.
o It contacts business analysts and domain specialists later to discuss the
outcomes of data mining in the business context.
5. Evaluation:
o It evaluates the model thoroughly and reviews the steps executed to build it,
to ensure that the business objectives are properly achieved.
o The main objective of the evaluation is to determine whether any significant
business issue has not been adequately considered.
o At the end of this phase, a decision on the use of the data mining results
should be reached.
Tasks:
o Evaluate results
o Review process
o Determine next steps
Evaluate results:
o It assesses the degree to which the model meets the organization's business
objectives.
o It tests the model on test applications in the actual implementation, when time
and budget constraints permit, and also assesses other data mining results
produced.
o It unveils additional challenges, suggestions, or information for future
directions.
Review process:
o The review process does a more detailed assessment of the data mining
engagement to determine whether any significant factor or task has somehow
been overlooked.
o It reviews quality assurance issues.
6. Deployment:
Tasks:
o Plan deployment
o Plan monitoring and maintenance
o Produce final report
o Review project
Plan deployment:
o In order to deploy the data mining results into the business, this task takes the
evaluation results and determines a strategy for deployment.
o It refers to documenting the procedure for later deployment.
Plan monitoring and maintenance:
o It is important when the data mining results become part of the day-to-day
business and its environment.
o It helps to avoid unnecessarily long periods of incorrect use of data mining
results.
o It needs a detailed analysis of the monitoring process.
Review project:
o The project review assesses what went right and what went wrong, what was
done well, and what needs to be improved.
Introduction
Data mining is a significant method where previously unknown and potentially useful
information is extracted from the vast amount of data. The data mining process involves
several components, and these components constitute a data mining system
architecture.
Different processes:
Before passing the data to the database or data warehouse server, the data must be
cleaned, integrated, and selected. As the information comes from various sources and in
different formats, it can't be used directly for the data mining procedure because the
data may not be complete and accurate. So, the data first needs to be cleaned and
unified. More information than needed will be collected from various data sources, and
only the data of interest will have to be selected and passed to the server. These
procedures are not as easy as we think. Several methods may be performed on the data
as part of selection, integration, and cleaning.
Database or Data Warehouse Server:
The database or data warehouse server contains the actual data that is ready to be
processed. Hence, the server is responsible for retrieving the relevant data, based on
the user's data mining request.
Data Mining Engine:
The data mining engine is the core of the data mining architecture. It comprises
instruments and software used to obtain insights and knowledge from data collected
from various data sources and stored within the data warehouse.
Pattern Evaluation Module:
This segment commonly employs interestingness measures that interact with the data
mining modules to focus the search towards interesting patterns. It might utilize an
interestingness threshold to filter out discovered patterns. Alternatively, the pattern
evaluation module might be integrated with the mining module, depending on the
implementation of the data mining techniques used. For efficient data mining, it is
generally recommended to push the evaluation of pattern interestingness as deep as
possible into the mining procedure to confine the search to only the interesting patterns.
Knowledge Base:
The knowledge base is helpful in the entire process of data mining. It might be helpful
to guide the search or evaluate the interestingness of the resulting patterns. The knowledge base
may even contain user views and data from user experiences that might be helpful in
the data mining process. The data mining engine may receive inputs from the
knowledge base to make the result more accurate and reliable. The pattern assessment
module regularly interacts with the knowledge base to get inputs, and also update it.
KDD Process in Data Mining
The main objective of the KDD process is to extract information from data in the context
of large databases. It does this by using Data Mining algorithms to identify what is
deemed knowledge.
The availability and abundance of data today make knowledge discovery and Data
Mining a matter of great significance and necessity. Given the recent development of
the field, it is not surprising that a wide variety of techniques is now accessible to
specialists and experts.
1. Building up an understanding of the application domain
This is the initial preliminary step. It sets the scene for understanding what should be
done with the various decisions like transformation, algorithms, representation, etc.
The individuals who are in charge of a KDD venture need to understand and
characterize the objectives of the end-user and the environment in which the knowledge
discovery process will occur (this involves relevant prior knowledge).
2. Choosing and creating a data set on which discovery will be performed
Once the objectives are defined, the data that will be utilized for the knowledge
discovery process should be determined. This incorporates discovering what data is
accessible, obtaining important data, and afterward integrating all the data for
knowledge discovery into one data set, including the attributes that will be considered
for the process. This process is important because Data Mining learns and discovers
from the available data; this is the evidence base for building the models. If some
significant attributes are missing, the entire study may be unsuccessful; in this respect,
the more attributes that are considered, the better. On the other hand, organizing,
collecting, and operating advanced data repositories is expensive, so there is a trade-off
with the opportunity for best understanding the phenomena. This trade-off is where the
interactive and iterative nature of the KDD process takes place: it begins with the best
available data set and later expands it, observing the impact in terms of knowledge
discovery and modeling.
3. Preprocessing and cleansing
In this step, data reliability is improved. It incorporates data cleansing, for example,
handling missing values and removing noise or outliers. It might involve complex
statistical techniques or the use of a Data Mining algorithm in this context. For example,
when one suspects that a specific attribute is of insufficient reliability or has much
missing data, this attribute could become the target of a supervised Data Mining
algorithm: a prediction model is created for the attribute, and the missing values are
then predicted, as in the sketch below. The extent to which one pays attention to this
step depends on numerous factors. Regardless, studying these aspects is significant
and often revealing in itself, especially with respect to enterprise data frameworks.
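The sketch below illustrates that idea under stated assumptions: pandas and scikit-learn are available, and the small table with a partially missing income attribute is entirely hypothetical.

```python
# A hedged sketch of the idea above: treat an attribute with missing
# values as the target of a supervised algorithm and predict the missing
# entries. The data and column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 38, 29],
    "income": [30_000, 42_000, np.nan, 65_000, np.nan, 35_000],
})

known   = df[df["income"].notna()]
missing = df[df["income"].isna()]

# Learn income from the other attributes, then fill in the gaps.
model = LinearRegression().fit(known[["age"]], known["income"])
df.loc[df["income"].isna(), "income"] = model.predict(missing[["age"]])
print(df)
```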
4. Data Transformation
In this stage, appropriate data for Data Mining is prepared and developed. Techniques
here incorporate dimension reduction (for example, feature selection and extraction,
and record sampling) as well as attribute transformation (for example, discretization of
numerical attributes and functional transformations). This step can be essential for the
success of the entire KDD project, and it is typically very project-specific. For example,
in medical assessments, the ratio of attributes may often be the most significant factor
rather than each one by itself. In business, we may need to consider effects beyond our
control as well as efforts and transient issues, for example, studying the effect of
advertising accumulation. However, even if we do not use the right transformation at
the start, we may obtain a surprising effect that hints at the transformation needed in
the next iteration. Thus, the KDD process feeds back into itself and prompts an
understanding of the transformation required. A small discretization sketch follows
below.
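As a small illustration of attribute transformation, the sketch below discretizes a numerical attribute into labeled intervals with pandas; the cut-points and labels are arbitrary assumptions.

```python
# A small illustration of attribute transformation: discretizing a
# numerical attribute into categorical bins with pandas. The cut-points
# are arbitrary assumptions for the example.
import pandas as pd

ages = pd.Series([22, 35, 47, 58, 63, 71])

# Discretization of a numerical attribute into labeled intervals.
age_groups = pd.cut(ages, bins=[0, 30, 50, 70, 120],
                    labels=["young", "middle", "senior", "elderly"])
print(age_groups.tolist())
```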
5. Choosing the appropriate Data Mining task
We now decide which type of Data Mining task to use, for example, classification,
regression, or clustering. This mostly depends on the KDD objectives and on the
previous steps.
6. Choosing the Data Mining algorithm
Having the technique, we now decide on the tactics. This stage incorporates choosing a
particular algorithm to be used for searching patterns, possibly including multiple
inducers. For example, considering precision versus understandability, the former is
better with neural networks, while the latter is better with decision trees. For each
strategy of meta-learning, there are several possibilities of how it can be implemented.
Meta-learning focuses on clarifying what causes a Data Mining algorithm to be
successful or not in a particular problem. Thus, this methodology attempts to
understand the conditions under which a Data Mining algorithm is most suitable. Each
algorithm has parameters and strategies of learning, such as ten-fold cross-validation
or another division for training and testing.
7. Employing the Data Mining algorithm
At last, the implementation of the Data Mining algorithm is reached. In this stage, we
may need to utilize the algorithm several times until a satisfying outcome is obtained,
for example, by tuning the algorithm's control parameters, such as the minimum
number of instances in a single leaf of a decision tree; a brief sketch follows below.
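A minimal sketch of this tuning loop, assuming scikit-learn: the algorithm is run several times while varying the minimum number of instances per leaf, scoring each run with ten-fold cross-validation.

```python
# A minimal sketch of running the algorithm several times while tuning a
# control parameter: here, the minimum number of instances in a single
# leaf of a decision tree (scikit-learn's min_samples_leaf).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

for min_leaf in (1, 5, 10, 20):
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
    score = cross_val_score(tree, X, y, cv=10).mean()  # ten-fold CV
    print(f"min_samples_leaf={min_leaf:>2}  accuracy={score:.3f}")
```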
8. Evaluation
In this step, we assess and interpret the mined patterns and rules with respect to the
objectives defined in the first step. Here we consider the preprocessing steps with
respect to their impact on the Data Mining algorithm results; for example, adding a
feature in step 4 and repeating from there. This step focuses on the comprehensibility
and utility of the induced model. The identified knowledge is also recorded for further
use. The last step involves the use of, and overall feedback on, the patterns and
discovery results acquired by Data Mining.
9. Using the discovered knowledge
Now we are prepared to incorporate the knowledge into another system for further
action. The knowledge becomes active in the sense that we may make changes to the
system and measure the effects. The success of this step determines the effectiveness
of the whole KDD process. There are numerous challenges in this step, such as losing
the "laboratory conditions" under which we have worked. For example, the knowledge
was discovered from a certain static snapshot (usually a set of data), but now the data
becomes dynamic. Data structures may change as certain quantities become
unavailable, and the data domain might be modified; for example, an attribute may
take a value that was not anticipated previously.