Unit2 Notes

Data mining is a process that involves discovering patterns and knowledge from large datasets through a series of steps including problem formulation, data collection, preprocessing, modeling, evaluation, and deployment. The CRISP-DM framework outlines six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment, each critical for successful data mining projects. Key functionalities of data mining include descriptive and predictive tasks, with various techniques such as classification, clustering, and association analysis used to extract valuable insights from data.

Data Mining is a process of discovering interesting patterns and knowledge from large amounts of data. The data sources can include databases, data warehouses, the web, and other information repositories, or data that are streamed into the system dynamically.

The general experimental procedure applied to a data-mining problem involves the following steps:
• State the problem and formulate the hypothesis – In this step, a modeler usually specifies a group of variables for an unknown dependency and, if possible, a general form of this dependency as an initial hypothesis. Several hypotheses may be formulated for one problem at this stage. This first step requires combined expertise in the application domain and in data mining, which in practice means close interaction between the data-mining expert and the application expert. In successful data-mining applications, this cooperation does not stop in the initial phase; it continues throughout the whole data-mining process.
• Collect data – This step concerns how the data are generated and collected. Generally, there are two distinct possibilities. The first is when the data-generation process is under the control of an expert (modeler); this approach is known as a designed experiment. The second is when the expert cannot influence the data-generation process; this is known as the observational approach. An observational setting, namely random data generation, is assumed in most data-mining applications. Typically, the sampling distribution is completely unknown after the data are collected, or it is partially and implicitly given by the data-collection procedure. It is vital, however, to understand how data collection affects the theoretical distribution, since such prior knowledge is often useful for modeling and, later, for the final interpretation of results. It is also important to make sure that the data used for estimating a model and the data used later for testing and applying the model come from the same, unknown sampling distribution. If this is not the case, the estimated model cannot be successfully used in the final application of the results.
• Data Preprocessing – In the observational setting, data are usually "collected" from existing databases, data warehouses, and data marts. Data preprocessing usually includes at least two common tasks (a small sketch of both appears after this list):
(i) Outlier detection (and removal): Outliers are unusual data values that are not consistent with most observations. Commonly, outliers result from measurement errors, coding and recording errors, and, sometimes, natural, abnormal values. Such non-representative samples can seriously affect the model produced later. There are two strategies for handling outliers: detect and eventually remove outliers as part of the preprocessing phase, or develop robust modeling methods that are insensitive to outliers.
(ii) Scaling, encoding, and selecting features: Data preprocessing includes several steps such as variable scaling and different types of encoding. For instance, one feature with range [0, 1] and another with range [100, 1000] will not have the same weight in the applied technique, and they will also influence the final data-mining results differently. It is therefore recommended to scale them, bringing both features to the same weight for further analysis. Also, application-specific encoding methods usually achieve dimensionality reduction by providing a smaller number of informative features for subsequent data modeling.
• Estimate the model – The selection and implementation of an appropriate data-mining technique is the main task in this phase. This process is not straightforward; in practice, the implementation is usually based on several models, and selecting the best one is an additional task.
• Interpret the model and draw conclusions – In most cases, data-mining models should support decision making. Hence, such models need to be interpretable in order to be useful, because humans are unlikely to base their decisions on complex "black-box" models. Note that the goals of model accuracy and interpretability are somewhat contradictory: simple models are usually more interpretable, but they are also less accurate. Modern data-mining methods are expected to yield highly accurate results using high-dimensional models. The problem of interpreting these models, also vital, is considered a separate task, with specific techniques to validate the results.
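
To make the preprocessing step concrete, here is a minimal sketch in plain Python of the two tasks above: z-score outlier filtering and min-max scaling. The sample values and the cut-off of 2.5 standard deviations are invented for illustration (3 is a common rule of thumb, but a single outlier in a tiny sample cannot reach a z-score of 3).

```python
def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

def remove_outliers(xs, z_max=2.5):
    """Drop values more than z_max standard deviations from the mean."""
    m, s = mean(xs), std(xs)
    return [x for x in xs if s == 0 or abs(x - m) / s <= z_max]

def min_max_scale(xs):
    """Map values linearly onto [0, 1] so features carry equal weight."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

feature_a = [0.2, 0.5, 0.9, 0.4, 0.7]   # already in [0, 1]
feature_b = [120, 850, 430, 990, 560,   # roughly in [100, 1000] ...
             300, 775, 640, 215, 12000] # ... plus one recording error

cleaned = remove_outliers(feature_b)    # drops 12000 (z is about 2.99)
scaled = min_max_scale(cleaned)         # now comparable in weight to feature_a
print(scaled)
```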

Six-Step Data Mining Process:


The six phases of CRISP-DM include:
#1) Business Understanding: In this step, the goals of the business are set, and the important factors that will help in achieving those goals are discovered.
#2) Data Understanding: This step collects the whole data and populates it in the tool (if any tool is used). The data is listed with its source, location, how it was acquired, and any issues encountered. The data is visualized and queried to check its completeness.
#3) Data Preparation: This step involves selecting the appropriate data, cleaning it, constructing attributes from the data, and integrating data from multiple databases.
#4) Modeling: In this step, a data mining technique such as a decision tree is selected, a test design for evaluating the chosen model is generated, models are built from the dataset, and the built model is assessed with experts to discuss the results.
#5) Evaluation: This step determines the degree to which the resulting model meets the business requirements. Evaluation can be done by testing the model on real applications. The model is reviewed for any mistakes or steps that should be repeated.
#6) Deployment: In this step, a deployment plan is made, a strategy to monitor and maintain the data mining model and check the usefulness of its results is formed, final reports are made, and the whole process is reviewed to check for mistakes and see whether any step should be repeated.
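
As a minimal sketch of how the six phases can map onto code, here is a toy walk-through assuming scikit-learn is available; the business question, the Iris dataset, and the decision tree are illustrative choices, not prescribed by these notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1) Business understanding: "can we classify a flower from its measurements?"
# 2) Data understanding: load the data and inspect its shape and completeness.
X, y = load_iris(return_X_y=True)

# 3) Data preparation: here, just a train/test split; a real project would
#    also clean, integrate, and construct attributes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 4) Modeling: fit the chosen technique (a decision tree, as in the notes).
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# 5) Evaluation: does the model meet the requirement? Review the errors.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 6) Deployment: persist and monitor the model; sketched here as a stub.
# import joblib; joblib.dump(model, "model.joblib")
```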

Stage 1: Business Understanding


In this stage, your job is to figure out what your company is trying to get out of this
data mining project. Is it to increase revenue? Find better prospects? Attract top
talent? Create more profitable marketing campaigns? It can truly be anything, so
long as you can arrive at an answer by analyzing data.

Stage 2: Data Understanding


Next up, it’s time to identify the datasets you need to answer your question. For
instance, if your goal is to increase revenue, you might need the current number of
customers, the number who have churned, and the average deal size.

Gather your high-quality data and store it in a format that you can easily access.
If you’re just getting started with data mining, you might use something as simple as
Google Sheets. If your business is growing, consider HubSpot’s data sync tool. If
you’re experienced, you might opt for a tool such as Tableau.
Stage 3: Data Preparation
Clean up the data, remove duplicates, and ensure it represents your business
accurately. To avoid errors, you might employ the help of a tool such as Operations
Hub and assign this task to one person. Allowing multiple people to collaborate on
one dataset at the same time may lead to duplicates and redundancies.
Check out our guides on data quality and data lifecycle management to ensure
you do everything you need to do in this stage.
Stage 4: Modeling
In the modeling stage, you use algorithms, artificial intelligence, and machine
learning to associate, categorize, regress, and cluster your data. If you have a data
analyst on staff, they might use the R and Python programming languages to carry
out these data mining techniques. They might also use data mining software.

If you’re just getting started, you might use the pivot table, filtering, and data
visualization tools in your spreadsheet software.
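
For that spreadsheet-style starting point, here is a small sketch of a pivot-table aggregation, assuming pandas is available; the deals table is invented for illustration.

```python
import pandas as pd

deals = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2"],
    "revenue": [1200, 1500, 900, 1100, 1700],
})

# Aggregate revenue by region and quarter, as a spreadsheet pivot table would.
pivot = deals.pivot_table(values="revenue", index="region",
                          columns="quarter", aggfunc="sum")
print(pivot)
```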

Stage 5: Evaluation
Next, it’s time to look at the results. Do your findings help you answer the business
question you established in stage one? If not, then it's time to try stage four again; it's totally normal to have to model the data several times before gleaning the right insights.

Stage 6: Deployment
Last, you compile all of your results in a presentation or dashboard and present it to
key stakeholders. You’ll all convene and figure out what to do based on what you
found in your data.

Data mining has its benefits, but it can sound like a lot to tackle for a beginner in the subject. One common point of confusion is the difference between data mining and data harvesting.

Major issues in Data Mining:


Mining different kinds of knowledge in databases – The needs of different users are not the same; different users may be interested in different kinds of knowledge. Therefore, data mining needs to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction – The data mining process needs to be interactive, because interactivity allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.
Incorporation of background knowledge – Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.
Data mining query languages and ad-hoc data mining – A data mining query language that allows users to describe ad-hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results – Once patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by users.
Handling noisy or incomplete data – Data cleaning methods are required that can handle noise and incomplete objects while mining data regularities. Without data cleaning methods, the accuracy of the discovered patterns will be poor.
Pattern evaluation – This refers to the interestingness of the discovered patterns. Patterns may fail to be interesting because they represent common knowledge or lack novelty, so measures of pattern interestingness are needed.
Efficiency and scalability of data mining algorithms – In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms – Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions that are processed in parallel; the results from the partitions are then merged (as sketched below). Incremental algorithms incorporate database updates without mining the entire data again from scratch.
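
A minimal sketch of the partition-process-merge pattern, using Python's standard multiprocessing module; counting item frequencies stands in for a real mining task, and the transactions are invented for illustration.

```python
from multiprocessing import Pool
from collections import Counter

def mine_partition(transactions):
    """Process one data partition independently (the parallel step)."""
    counts = Counter()
    for t in transactions:
        counts.update(t)
    return counts

if __name__ == "__main__":
    data = [["milk", "bread"], ["milk"], ["bread", "eggs"], ["milk", "eggs"]]
    partitions = [data[:2], data[2:]]        # divide the data into partitions

    with Pool(processes=2) as pool:
        partial = pool.map(mine_partition, partitions)

    merged = sum(partial, Counter())         # merge the partition results
    print(merged)
```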

Data mining activities can be divided into two categories:

o Descriptive Data Mining: It summarizes what is happening within the data, without any prior idea of what to look for. The common features of the data set are highlighted, for example counts and averages.
o Predictive Data Mining: It assigns labels or values to new, unlabeled data. Using previously available or historical data, data mining can make predictions about critical business metrics based on trends in the data. For example, predicting the volume of business next quarter based on performance in previous quarters over several years, or judging from the findings of a patient's medical examinations whether the patient is suffering from a particular disease.
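
A small sketch contrasting the two categories, assuming NumPy; the quarterly volumes are invented, and the linear trend fit is just one simple predictive technique among many.

```python
import numpy as np

volumes = np.array([120.0, 132.0, 141.0, 150.0, 158.0])  # past quarters

# Descriptive: summarize what is happening in the data.
print("count:", volumes.size, "average:", volumes.mean())

# Predictive: fit a linear trend and extrapolate to the next quarter.
quarters = np.arange(volumes.size)
slope, intercept = np.polyfit(quarters, volumes, 1)
print("next quarter estimate:", slope * volumes.size + intercept)
```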

Data mining functionalities


Data mining functionalities are used to specify the types of patterns to be discovered in data mining tasks. In general, data mining tasks can be classified into two types: descriptive and predictive. Descriptive mining tasks characterize the general features of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions.
There are various data mining functionalities, as follows −
• Data characterization − It is a summarization of the general characteristics of a target class of data. The data corresponding to the user-specified class are generally collected by a database query. The output of data characterization can be presented in multiple forms.
• Data discrimination − It is a comparison of the general characteristics of target-class data objects with the general characteristics of objects from one or a set of contrasting classes. The target and contrasting classes can be specified by the user, and the corresponding data objects fetched through database queries.
• Association Analysis − It analyses the set of items that generally occur together in a transactional dataset. Two parameters are used for determining the association rules (a worked example appears after this list):
o Support, which identifies how frequently the item set occurs in the database.
o Confidence, which is the conditional probability that an item occurs in a transaction given that another item occurs.
• Classification − Classification is the procedure of discovering a model that describes and distinguishes data classes or concepts, with the objective of using the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
• Prediction − It predicts some unavailable data values or future trends. A value can be predicted based on the attribute values of the object and the attribute values of the classes. It can be a prediction of missing numerical values or of increase/decrease trends in time-related information.
• Clustering − It is similar to classification, but the classes are not predefined; they are derived from the data attributes. It is unsupervised learning. The objects are clustered or grouped based on the principle of maximizing intraclass similarity and minimizing interclass similarity.
• Outlier analysis − Outliers are data elements that cannot be grouped into a given class or cluster. These are data objects whose behaviour deviates from the general behaviour of the other data objects. The analysis of this type of data can be essential for mining knowledge.
• Evolution analysis − It describes the trends for objects whose behaviour changes over time.
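
A worked example of the two association-rule parameters, support and confidence, in plain Python on a toy transactional dataset invented for illustration.

```python
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Conditional probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

# Rule: milk -> bread
print("support:", support({"milk", "bread"}))          # 2/4 = 0.50
print("confidence:", confidence({"milk"}, {"bread"}))  # 2/3, about 0.67
```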

In general terms, "mining" is the process of extraction. In the context of computer science, data mining can be referred to as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. There are other kinds of data, such as semi-structured or unstructured data, including spatial data, multimedia data, text data, and web data, which require different methodologies for data mining.
• Mining Multimedia Data: Multimedia data objects include image data, video data, audio data, website hyperlinks, and linkages. Multimedia data mining tries to find interesting patterns in multimedia databases. This includes processing digital data and performing tasks like image processing, image classification, video and audio data mining, and pattern recognition. Multimedia data mining is becoming one of the most interesting research areas, because data from social media platforms like Twitter and Facebook can be analyzed to derive interesting trends and patterns.
• Mining Web Data: Web mining is essential to discover crucial patterns and knowledge from the Web. Web content mining analyzes data from several websites, including the web pages and the multimedia data, such as images, in the web pages. Web mining is done to understand the content of web pages, the unique users of a website, unique hypertext links, web page relevance and ranking, web page content summaries, the time that users spend on a particular website, and user search patterns. Web mining also helps evaluate search engines and the search algorithms they use, and so it helps improve search efficiency and find the best search engine for users.
• Mining Text Data: Text mining is a subfield of data mining, machine learning, natural language processing, and statistics. Most of the information in our daily life is stored as text, such as news articles, technical papers, books, email messages, and blogs. Text mining helps us retrieve high-quality information from text through tasks such as sentiment analysis, document summarization, text categorization, and text clustering. We apply machine learning models and NLP techniques to derive useful information from the text, finding hidden patterns and trends through means such as statistical pattern learning and statistical language modeling. In order to perform text mining, we need to preprocess the text with techniques such as stemming and lemmatization before converting the textual data into data vectors (a small preprocessing sketch appears after this list).
• Mining Spatiotemporal Data: Data related to both space and time are spatiotemporal data. Spatiotemporal data mining retrieves interesting patterns and knowledge from such data, helping us, for example, to estimate the value of land or the age of rocks and precious stones, and to predict weather patterns. Spatiotemporal data mining has many practical applications, such as GPS in mobile phones, timers, Internet-based map services, weather services, satellites, RFID, and sensors.
• Mining Data Streams: Stream data are data that can change dynamically; they are noisy and inconsistent and contain multidimensional features of different data types, so they are often stored in NoSQL database systems. The volume of stream data is very high, and this is the challenge for the effective mining of stream data. While mining data streams we need to perform tasks such as clustering, outlier analysis, and the online detection of rare events.
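
As a small sketch of the text preprocessing described under text mining above, here documents are tokenized, crudely stemmed, and converted into bag-of-words count vectors in plain Python; the suffix-stripping "stemmer" is a toy stand-in for a real stemming or lemmatization library.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    # Toy stemmer: strip a few common English suffixes.
    for suffix in ("ing", "ers", "er", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

docs = [
    "Miners mine patterns from text",
    "Text mining finds hidden patterns",
]

stemmed = [[stem(t) for t in tokenize(d)] for d in docs]
vocab = sorted({t for doc in stemmed for t in doc})

# Bag-of-words vectors: one count per vocabulary term per document.
vectors = [[doc.count(term) for term in vocab] for doc in stemmed]
print(vocab)
print(vectors)
```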

The two primary methods for data analysis are qualitative data analysis
techniques and quantitative data analysis techniques. These data analysis
techniques can be used independently or in combination with each other to
help business leaders and decision-makers acquire business insights from
different data types.

Quantitative data is anything measurable, comprising specific quantities and numbers. Some examples of quantitative data include sales figures,
email click-through rates, number of website visitors, and percentage
revenue increase. Quantitative data analysis techniques focus on the
statistical, mathematical, or numerical analysis of (usually large) datasets.
This includes the manipulation of statistical data using computational
techniques and algorithms. Quantitative analysis techniques are often used
to explain certain phenomena or to make predictions.
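
As a small sketch of this kind of numerical summary in plain Python; the click and revenue figures are invented for illustration.

```python
clicks, emails_sent = 342, 5200
revenue_last, revenue_now = 84000.0, 96600.0

click_through_rate = clicks / emails_sent                       # a proportion
pct_revenue_increase = (revenue_now - revenue_last) / revenue_last * 100

print(f"click-through rate: {click_through_rate:.1%}")   # 6.6%
print(f"revenue increase: {pct_revenue_increase:.1f}%")  # 15.0%
```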

Qualitative data cannot be measured objectively, and is therefore open to more subjective interpretation. Some examples of qualitative data
include comments left in response to a survey question, things people have
said during interviews, tweets and other social media posts, and the text
included in product reviews. With qualitative data analysis, the focus is on
making sense of unstructured data (such as written text, or transcripts of
spoken conversations). Often, qualitative analysis will organize the data
into themes—a process which, fortunately, can be automated.
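
As a minimal sketch of automating that theme-coding step, here free-text comments are tagged by keyword match in plain Python; the themes, keywords, and comments are invented for illustration.

```python
themes = {
    "pricing": {"price", "expensive", "cheap", "cost"},
    "support": {"support", "help", "response", "agent"},
}

comments = [
    "The price is too expensive for what you get",
    "Support was quick to help with my issue",
]

# Tag each comment with every theme whose keywords it mentions.
for comment in comments:
    words = set(comment.lower().split())
    tags = [theme for theme, kws in themes.items() if words & kws]
    print(tags, "-", comment)
```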

Data analysts work with both quantitative and qualitative data, so it’s
important to be familiar with a variety of analysis methods. Let’s take a look
at some of the most useful techniques now.
