1st Slides

What is data mining?

• Data mining is the process of discovering interesting patterns, models,
and other kinds of knowledge in large data sets.
• Many other terms have a similar meaning to data mining—for
example, knowledge mining from data, KDD (i.e., Knowledge
Discovery from Data), pattern discovery, knowledge extraction, data
archaeology, data analytics, and information harvesting.
Data mining turns a large collection of data into knowledge:
• A search engine (e.g., Google) receives billions of queries every day.
• What novel and useful knowledge can a search engine learn from
such a huge collection of queries collected from users over time?
• Interestingly, some patterns found in user search queries can disclose
invaluable knowledge that cannot be obtained by reading individual
data items alone.
• For example, Google’s Flu Trends uses specific search terms as
indicators of flu activity.
• It found a close relationship between the number of people who
search for flu-related information and the number of people who
actually have flu symptoms.
• A pattern emerges when all of the search queries related to flu are
aggregated.
• Using aggregated Google search data, Flu Trends can estimate flu
activity up to two weeks faster than traditional systems can.
• This example shows how data mining can turn a large collection of
data into knowledge that can help meet a current global challenge.
Data mining: an essential step in knowledge discovery
• Many people treat data mining as a synonym for another popularly
used term, knowledge discovery from data, or KDD, whereas others
view data mining as merely an essential step in the overall process of
knowledge discovery.
• The overall knowledge discovery process is shown in the following
figure as an iterative sequence of the following steps:
Data mining: An essential step in the process of knowledge discovery.
1. Data preparation
a) Data cleaning (to remove noise and inconsistent data)
b) Data integration (where multiple data sources may be combined)
c) Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
d) Data selection (where data relevant to the analysis task are retrieved from the
database)
2. Data mining (an essential process where intelligent methods are
applied to extract patterns or construct models)
3. Pattern/model evaluation (to identify the truly interesting patterns or
models representing knowledge based on interestingness measures)
4. Knowledge presentation (where visualization and knowledge
representation techniques are used to present mined knowledge to
users)
• Steps 1(a) through 1(d) are different forms of data preprocessing, where data are
prepared for mining.
• The data mining step may interact with a user or a knowledge base.
• The interesting patterns are presented to the user and may be stored as new
knowledge in the knowledge base.
• The preceding view shows data mining as one step in the knowledge discovery
process, albeit an essential one because it uncovers hidden patterns or models
for evaluation.
• However, in industry, in media, and in the research milieu, the term data mining
is often used to refer to the entire knowledge discovery process (perhaps
because the term is shorter than knowledge discovery from data).
• Therefore, we adopt a broad view of data mining functionality: Data mining is the
process of discovering interesting patterns and knowledge from large amounts of
data.
• The data sources can include databases, data warehouses, the Web, other
information repositories, or data that are streamed into the system dynamically.
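As a rough sketch, the four steps above can be lined up in a few lines of Python with pandas; the file names, columns, and the simple aggregation standing in for the mining step are hypothetical placeholders, not a prescribed implementation.

```python
import pandas as pd

# Hypothetical inputs: file and column names are placeholders.
sales = pd.read_csv("sales.csv")
customers = pd.read_csv("customers.csv")

# 1. Data preparation.
data = sales.merge(customers, on="customer_id")                  # 1(b) integration
data = data.dropna().drop_duplicates()                           # 1(a) cleaning
data["month"] = pd.to_datetime(data["date"]).dt.to_period("M")   # 1(c) transformation
relevant = data[["customer_id", "month", "amount"]]              # 1(d) selection

# 2. Data mining: a simple aggregation stands in for an intelligent method here.
patterns = relevant.groupby("month")["amount"].agg(["mean", "sum"])

# 3. Pattern/model evaluation: a crude interestingness filter.
interesting = patterns[patterns["sum"] > patterns["sum"].median()]

# 4. Knowledge presentation: print (in practice, visualize).
print(interesting)
```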
Diversity of data types for data mining
• As a general technology, data mining can be applied to any kind of
data as long as the data are meaningful for a target application.
• However, different kinds of data may need rather different data
mining methodologies, from simple to rather sophisticated, making
data mining a rich and diverse field.
Structured vs. unstructured data
• Based on whether data have clear structures, we can categorize data
as structured vs. unstructured data.
• Data stored in relational databases, data cubes, data matrices, and
many data warehouses have uniform, record- or table-like structures,
defined by their data dictionaries, with a fixed set of attributes (or
fields, columns), each with a fixed set of value ranges and semantic
meaning.
• These data sets are typical examples of highly structured data.
• In many real applications, such a strict structural requirement can be
relaxed in multiple ways to accommodate the semistructured nature of
the data, such as to allow a data object to contain a set value, a small
set of heterogeneously typed values, or nested structures, or to allow
the structure of objects or subobjects to be defined flexibly and
dynamically (e.g., XML structures).
• There are many data sets that may not be as structured as relational
tables or data matrices.
• However, they do have certain structures with clearly defined
semantic meaning.
• For example, a transactional data set may contain a large set of
transactions each containing a set of items.
• A sequence data set may contain a large set of sequences each
containing an ordered set of elements that can in turn contain a set of
items.
• Many application data sets, such as shopping transaction data, time-
series data, gene or protein data, or Weblog data, belong to this
category.
• A more sophisticated type of semistructured data set is graph or
network data, where a set of nodes are connected by a set of edges
(also called links); and each node/link may have its own semantic
description or substructures.
• Each such category of structured and semistructured data sets may
have special kinds of patterns or knowledge to be mined, and many
dedicated data mining methods, such as sequential pattern mining,
graph pattern mining, and information network mining, have
been developed to analyze such data sets.
• Beyond such structured or semistructured data, there are also large
amounts of unstructured data, such as text data and multimedia (e.g.,
audio, image, video) data.
• Although some studies treat them as one-dimensional or
multidimensional byte streams, they do carry a lot of interesting
semantics.
• Domain-specific methods have been developed to analyze such data
in the fields of natural language understanding, text mining, computer
vision, and pattern recognition.
• Moreover, recent advances in deep learning have led to tremendous
progress in processing text, image, and video data.
• Nevertheless, mining hidden structures from unstructured data may
greatly help us understand and make good use of such data.
• The real-world data can often be a mixture of structured data,
semistructured data, and unstructured data.
• For example, an online shopping website may host information for a
large set of products, which can be essentially structured data stored
in a relational database, with a fixed set of fields on product name,
price, specifications, and so on.
• However, some fields may essentially be text, image, and video data,
such as product introduction, expert or user reviews, product images,
and advertisement videos.
• Data mining methods are often developed for mining some particular
type of data, and their results can be integrated and coordinated to
serve the overall goal.
Data associated with different applications
• Different applications may generate or need to handle very different
data sets and require rather different data analysis methods.
• Thus when categorizing data sets for data mining, we should take
specific applications into consideration.
• Take sequence data as an example.
Biological sequences such as DNA or protein sequences may have very
different semantic meaning from shopping transaction sequences or Web
click streams, calling for rather different sequence mining methods.
• A special kind of sequence data is time-series data, where a time series may
contain an ordered set of numerical values with equal time intervals. This is
also rather different from shopping transaction sequences, which may not
have fixed time gaps (a customer may shop at any time she likes).
• Data in some applications can be associated with spatial information,
time information, or both, forming spatial, temporal, and
spatiotemporal data, respectively.
• Special data mining methods, such as spatial data mining, temporal
data mining, spatiotemporal data mining, or trajectory pattern
mining, should be developed for mining such data sets as well.
• For graph and network data, different applications may also need
rather different data mining methods.
• For example, social networks (e.g., Facebook or LinkedIn data),
computer communication networks, biological networks, and
information networks (e.g., authors linking with keywords) may carry
rather different semantics and require different mining methods.
• Even for the same data set, finding different kinds of patterns or knowledge may
require different data mining methods.
• For example, for the same set of software (source) programs, finding plagiarized
subprogram modules or finding copy-and-paste bugs may need rather different
data mining techniques.
• Rich data types and diverse application requirements call for very diverse data
mining methods.
• Thus data mining is a rich and fascinating research domain, with lots of new
methods waiting to be studied and developed.
Stored vs. streaming data
• Usually, data mining handles finite, stored data sets, such as those stored in
various kinds of large data repositories.
• However, in some applications such as video surveillance or remote sensing, data
may stream in dynamically and constantly, as infinite data streams.
• Mining stream data requires rather different methods than mining stored
data, which may form another interesting theme in our study.
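To make the contrast concrete, here is a minimal one-pass sketch in Python: a running mean that sees each stream element exactly once and keeps constant-size state, unlike a stored data set that can be rescanned at will.

```python
# A one-pass (streaming) mean over a possibly unbounded stream.
def streaming_mean(stream):
    n, mean = 0, 0.0
    for x in stream:               # x could arrive from a sensor, indefinitely
        n += 1
        mean += (x - mean) / n     # incremental update, O(1) memory
        yield mean

for m in streaming_mean([3.0, 5.0, 7.0]):
    print(m)                       # 3.0, 4.0, 5.0
```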
Mining various kinds of knowledge
• Different kinds of patterns and knowledge can be uncovered via data
mining. In general, data mining tasks can be put into two categories:
descriptive data mining and predictive data mining.
• Descriptive mining characterizes properties of the data set of
interest, whereas predictive mining performs induction on the data set in
order to make predictions.
• Different data mining functionalities generate different kinds of
results that are often called patterns, models, or knowledge.
Multidimensional data summarization
• It is often tedious for a user to go over the details of a large set of
data.
• Thus it is desirable to automatically summarize a data set of
interest and compare it with contrasting sets at some high level.
• Such a summary description of a data set of interest is called
data summarization.
• Data summarization can often be conducted in a multidimensional
space.
• If the multidimensional space is well defined and frequently used,
such as product category, producer, location, or time, massive
amounts of data can be aggregated in the form of data cubes to
facilitate a user's drill-down or roll-up of the summarization space with
simple mouse clicks.
• The output of such multidimensional summarization can be presented
in various forms, such as pie charts, bar charts, curves,
multidimensional data cubes, and multidimensional tables, including
crosstabs.
• For structured data, multidimensional aggregation methods have
been developed to facilitate such precomputation or online
computation of multidimensional aggregations using data cube
technology.
• For unstructured data, such as text, this task becomes challenging.
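For the structured case, a crosstab-style roll-up can be sketched with a pandas pivot table; the sales records and dimension values below are invented for illustration.

```python
import pandas as pd

# Invented sales records over two summarization dimensions.
df = pd.DataFrame({
    "category": ["laptop", "laptop", "phone", "phone", "laptop", "phone"],
    "location": ["NY", "LA", "NY", "LA", "NY", "NY"],
    "sales":    [1200, 900, 600, 650, 1100, 700],
})

# One cell per (category, location); margins=True adds roll-up totals.
cube = pd.pivot_table(df, values="sales", index="category",
                      columns="location", aggfunc="sum", margins=True)
print(cube)
```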
Mining frequent patterns, associations, and correlations
• Frequent patterns, as the name suggests, are patterns that occur
frequently in data.
• There are many kinds of frequent patterns, including frequent
itemsets, frequent subsequences (also known as sequential patterns),
and frequent substructures.
• A frequent itemset typically refers to a set of items that often appear
together in a transactional data set—for example, milk and bread,
which are frequently bought together in grocery stores by many
customers.
• A frequently occurring subsequence, such as the pattern that
customers tend to purchase first a laptop, followed by a computer
bag, and then other accessories, is a (frequent) sequential pattern.
• A substructure can refer to different structural forms (e.g., graphs,
trees, or lattices) that may be combined with itemsets or
subsequences.
• If a substructure occurs frequently, it is called a (frequent) structured
pattern.
• Mining frequent patterns leads to the discovery of interesting
associations and correlations within data.
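As a minimal sketch, frequent 2-itemsets can be found by direct counting; the transactions and support threshold below are invented, and practical systems use smarter algorithms such as Apriori or FP-growth.

```python
from collections import Counter
from itertools import combinations

# Invented toy transactions: each is the set of items in one purchase.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
min_support = 0.5   # a pattern must appear in at least half the transactions

# Count every 2-itemset across all transactions.
counts = Counter(frozenset(pair)
                 for t in transactions
                 for pair in combinations(sorted(t), 2))

n = len(transactions)
frequent = {tuple(sorted(s)): c / n
            for s, c in counts.items() if c / n >= min_support}
print(frequent)   # e.g. {('bread', 'milk'): 0.75, ('bread', 'butter'): 0.5}
```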
Classification and regression for predictive analysis
• Classification is the process of finding a model (or function) that
describes and distinguishes data classes or concepts.
• The model is derived based on the analysis of a set of training data
(i.e., data objects for which the class labels are known).
• The model is used to predict the class labels of objects for which the
class labels are unknown.
• Depending on the classification methods, a derived model can be in
various forms, such as a set of classification rules (i.e., IF-THEN rules),
a decision tree, a mathematical formula, or a learned neural network
as can be seen in the following figure:
A classification model can be represented in various forms: (a) IF-THEN rules,
(b) a decision tree, or (c) a neural network.
• A decision tree is a tree structure, where each node denotes a test on
an attribute value, each branch represents an outcome of the test, and
tree leaves represent classes or class distributions.
• Decision trees can easily be converted to classification rules.
• A neural network, when used for classification, is typically a collection
of neuron-like processing units with weighted connections between the
units.
• There are many other methods for constructing classification models,
such as naïve Bayesian classification, support vector machines, and k-
nearest-neighbor classification.
• Whereas classification predicts categorical (discrete, unordered) labels,
regression models continuous-valued functions.
• That is, regression is used to predict missing or unavailable numerical
data values rather than (discrete) class labels.
• The term prediction refers to both numeric prediction and class label
prediction.
• Regression analysis is a statistical methodology that is most often used
for numeric prediction, although other methods exist as well.
• Regression also encompasses the identification of distribution trends
based on the available data.
• Classification and regression may need to be preceded by feature
selection or relevance analysis, which attempts to identify attributes
(often called features) that are significantly relevant to the classification
and regression process.
• Such attributes will be selected for the classification and regression
process.
• Other attributes, which are irrelevant, can then be excluded from
consideration.
Example: Classification and regression.
• Suppose a webstore sales manager wants to classify a large set of items in the
store, based on three kinds of responses to a sales campaign: good response, mild
response, and no response.
• You want to derive a model for each of these three classes based on the
descriptive features of the items, such as price, brand, place_made, type, and
category.
• The resulting classification should maximally distinguish each class from the
others, presenting an organized picture of the data set.
• Suppose that the resulting classification is expressed as a decision tree.
• The decision tree, for instance, may identify price as being the first important
factor that best distinguishes the three classes.
• Other features that help further distinguish objects of each class from one another
include brand and place_made.
• Such a decision tree may help the manager understand the impact of the given
sales campaign and design a more effective campaign in the future.
• Suppose instead that, rather than predicting categorical response
labels for each store item, you would like to predict the amount of
revenue that each item will generate during an upcoming sale, based
on the previous sales data.
• This is an example of regression analysis because the regression
model constructed will predict a continuous function (or ordered
value).
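A minimal scikit-learn sketch of both tasks on the webstore example; the feature encodings, labels, and revenue figures are invented.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Invented item features: [price, brand_id, place_made_id].
X = [[999, 0, 1], [49, 1, 0], [599, 0, 1], [19, 2, 0], [1299, 0, 1], [99, 1, 0]]
y = ["good", "no", "mild", "no", "good", "mild"]     # categorical response classes

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)  # classification
print(clf.predict([[799, 0, 1]]))                    # predicts a response class

# Regression: predict a continuous revenue amount instead of a class label.
revenue = [12000.0, 300.0, 4500.0, 80.0, 20000.0, 900.0]
reg = DecisionTreeRegressor(max_depth=2).fit(X, revenue)
print(reg.predict([[799, 0, 1]]))                    # predicts a revenue amount
```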
Cluster analysis
• Unlike classification and regression, which analyze class-labeled (training) data
sets, cluster analysis (also called clustering) groups data objects without
consulting class labels.
• In many cases, class-labeled data may simply not exist at the beginning.
• Clustering can be used to generate class labels for a group of data.
• The objects are clustered or grouped based on the principle of maximizing the
intraclass similarity and minimizing the interclass similarity.
• That is, clusters of objects are formed so that objects within a cluster have high
similarity in comparison to one another, but are rather dissimilar to objects in
other clusters.
• Each cluster so formed can be viewed as a class of objects, from which rules can
be derived.
• Clustering can also facilitate taxonomy formation, that is, the organization of
observations into a hierarchy of classes that group similar events together.
Example: Cluster analysis.
• Cluster analysis can be performed on the webstore customer data to
identify homogeneous subpopulations of customers.
• These clusters may represent individual target groups for marketing.
• The following figure shows a 2-D plot of customers with respect to
customer locations in a city. Three clusters of data points are evident.

A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters.
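A sketch of such a clustering with k-means; the customer coordinates are synthetic, drawn around three invented city locations.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D customer locations around three invented centers.
rng = np.random.default_rng(0)
locations = np.vstack([
    rng.normal(loc=(2, 2), scale=0.5, size=(30, 2)),
    rng.normal(loc=(8, 3), scale=0.5, size=(30, 2)),
    rng.normal(loc=(5, 8), scale=0.5, size=(30, 2)),
])

# No class labels are given; k-means groups points by proximity alone.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(locations)
print(kmeans.cluster_centers_)   # three centers, one per discovered cluster
```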
Deep learning
• For many data mining tasks, such as classification and clustering, a key step
often lies in finding "good features," that is, a vector representation of
each input data tuple.
• For example, in order to predict whether a regional disease outbreak will
occur, one might have collected a large number of features from the health
surveillance data, including the number of daily positive cases, the number
of daily tests, the number of daily hospitalizations, etc.
• Traditionally, this step (called feature engineering) often heavily relies on
domain knowledge.
• Deep learning techniques provide an automatic way for feature
engineering, which is capable of generating semantically meaningful
features (e.g., weekly positive rate) from the initial input features.
• The generated features often significantly improve the mining performance
(e.g., classification accuracy).
• Deep learning is based on neural networks.
• A neural network is a set of connected input-output units where each
connection has a weight associated with it.
• During the learning phase, the network learns by adjusting the
weights to be able to predict the correct target values (e.g., class
labels) of the input tuples.
• The core algorithm for learning such weights is called backpropagation,
which searches for a set of weights and bias values that can model
the data so as to minimize the loss function between the network's
predictions and the actual target outputs of the data tuples.
• Various forms (called architectures) of neural networks have been
developed, including feed-forward neural networks, convolutional
neural networks, recurrent neural networks, graph neural networks,
and many more.
• Deep learning has broad applications in computer vision, natural
language processing, machine translation, social network analysis,
and so on.
• It has been used in a variety of data mining tasks, including
classification, clustering, outlier detection, and reinforcement
learning.
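The weight-adjustment loop can be sketched in a few lines of numpy: a tiny feed-forward network with one hidden layer, trained by backpropagation on invented data under a squared-error loss.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))              # 4 input tuples, 3 features (invented)
y = np.array([[0.], [1.], [1.], [0.]])   # target values

W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)   # hidden layer, 5 units
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)   # output layer
lr = 0.1

for _ in range(1000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    loss = np.mean((pred - y) ** 2)             # squared-error loss

    # Backpropagation: gradients of the loss w.r.t. every weight and bias.
    g_pred = 2 * (pred - y) / len(X)
    g_W2, g_b2 = h.T @ g_pred, g_pred.sum(0)
    g_h = g_pred @ W2.T * (1 - h ** 2)          # tanh derivative
    g_W1, g_b1 = X.T @ g_h, g_h.sum(0)

    # Adjust weights and biases to reduce the loss.
    W1 -= lr * g_W1; b1 -= lr * g_b1
    W2 -= lr * g_W2; b2 -= lr * g_b2

print(loss)   # should be much smaller than at the start
```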
Outlier analysis
• A data set may contain objects that do not comply with the general
behavior or model of the data.
• These data objects are outliers.
• Many data mining methods discard outliers as noise or exceptions.
• However, in some applications (e.g., fraud detection) the rare events
can be more interesting than the more regularly occurring ones.
• The analysis of outlier data is referred to as outlier analysis or
anomaly mining.
• Outliers may be detected using statistical tests that assume a
distribution or probability model for the data, or using distance
measures where objects that are remote from any other cluster are
considered outliers.
• Rather than using statistical or distance measures, density-based
methods may identify outliers in a local region even though those objects
look normal from a global statistical distribution view.
Example: Outlier analysis. Outlier analysis may uncover fraudulent
usage of credit cards by detecting purchases of unusually large
amounts for a given account number in comparison to regular charges
incurred by the same account.
• Outlier values may also be detected with respect to the locations and
types of purchase, or the purchase frequency.
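A minimal statistical sketch of this idea: flag charges whose z-score exceeds a cutoff; the amounts and the 2-standard-deviation cutoff are invented.

```python
import numpy as np

# Invented charge amounts for one account; the last is unusually large.
charges = np.array([23.5, 41.0, 18.2, 55.9, 30.1, 27.4, 2500.0])

mean, std = charges.mean(), charges.std()
z = (charges - mean) / std            # distance from the mean, in std units
outliers = charges[np.abs(z) > 2]     # an assumed cutoff of 2 standard deviations
print(outliers)                       # flags the 2500.0 charge
```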
Are all mining results interesting?
• Data mining has the potential to generate a lot of results. A question
can be, “Are all of the mining results interesting?”
• Each type of data mining function has its own measures for
evaluating mining quality.
• Nevertheless, there are some shared philosophies and principles.
• Take pattern mining as an example.
• Pattern mining may generate thousands or even millions of patterns,
or rules.
• You may wonder,
• “What makes a pattern interesting?
• Can a data mining system generate all of the interesting patterns?
• Or, can the system generate only the interesting ones?”
• To answer the first question, a pattern is interesting if it is
(1) easily understood by humans,
(2) valid on new or test data with some degree of certainty,
(3) potentially useful, and
(4) novel.
• A pattern is also interesting if it validates a hypothesis that the user
sought to confirm.
• Several objective measures of pattern interestingness exist.
• These are based on the structure of discovered patterns and the
statistics underlying them.
• An objective measure for association rules of the form X ⇒ Y is rule
support, representing the percentage of transactions from a
transaction database that the given rule satisfies.
• This is taken to be the probability P(X ∪ Y), where X ∪ Y indicates
that a transaction contains both X and Y, that is, the union of
itemsets X and Y.
• Another objective measure for association rules is confidence, which
assesses the degree of certainty of the detected association.
• This is taken to be the conditional probability P(Y | X), that is, the
probability that a transaction containing X also contains Y.
• More formally, support and confidence are defined as
support(X ⇒ Y) = P(X ∪ Y)
confidence(X ⇒ Y) = P(Y | X) = support(X ∪ Y) / support(X)
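Both measures can be computed directly from a transaction set; the toy transactions and rule below are invented.

```python
# Support and confidence for the rule X => Y over a toy transaction set.
transactions = [
    {"milk", "bread"}, {"milk"}, {"milk", "bread", "eggs"}, {"bread"},
]
X, Y = {"milk"}, {"bread"}

both = sum(1 for t in transactions if X | Y <= t)   # contain both X and Y
only_x = sum(1 for t in transactions if X <= t)     # contain X

support = both / len(transactions)    # P(X ∪ Y) = 0.5
confidence = both / only_x            # P(Y | X) ≈ 0.67
print(support, confidence)
```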
• In general, each interestingness measure is associated with a
threshold, which may be controlled by the user.
• For example, rules that do not satisfy a confidence threshold of, say,
50% can be considered uninteresting.
• Rules below the threshold likely reflect noise, exceptions, or minority
cases and are probably of less value.
• There are also other objective measures.
• For example, one may like the set of items in an association rule to be
strongly correlated.
• Although objective measures help identify interesting patterns, they
are often insufficient unless combined with subjective measures that
reflect a particular user’s needs and interests.
• For example, patterns describing the characteristics of customers who
shop frequently online should be interesting to the marketing
manager, but may be of little interest to other analysts studying the
same database for patterns on employee performance.
• Furthermore, many patterns that are interesting by objective
standards may represent common sense and, therefore, are actually
uninteresting.
• Subjective interestingness measures are based on user beliefs about the
data.
• These measures find patterns interesting if the patterns are
unexpected (contradicting a user’s belief) or offer strategic
information on which the user can act. In the latter case, such
patterns are referred to as actionable.
• For example, patterns like “a large earthquake often follows a cluster
of small quakes” may be highly actionable if users can act on the
information to save lives.
• Patterns that are expected can be interesting if they confirm a
hypothesis that the user wishes to validate or they resemble a user’s
hunch.
• The second question—“Can a data mining system generate all of the
interesting patterns?”—refers to the completeness of a data mining
algorithm.
• It is often unrealistic and inefficient for a pattern mining system to
generate all possible patterns since there could be a very large
number of them.
• However, one may worry about missing some important patterns if the
system stops short.
• To solve this dilemma, user-provided constraints and interestingness
measures should be used to focus the search.
• With well-defined interestingness measures and user-provided
constraints, it is quite realistic to ensure the completeness of pattern
mining.
• Finally, the third question—“Can a data mining system generate only
interesting patterns?”—is an optimization problem in data mining.
• It is highly desirable for a data mining system to generate only
interesting patterns.
• This would be efficient for both the data mining system and the user
because the system may spend much less time generating far fewer
but interesting patterns, whereas the user will not need to sift
through a large number of patterns to identify the truly interesting
ones.
Data mining: confluence of multiple disciplines
• As a discipline that studies efficient and effective methods for
uncovering patterns and knowledge from various kinds of massive
data sets for many applications, data mining naturally represents a
confluence of multiple disciplines, including
• machine learning,
• statistics,
• pattern recognition,
• natural language processing,
• database technology,
• human computer interaction (HCI),
• algorithms,
• high-performance computing,
• social sciences,
• and many application domains
• The interdisciplinary nature of data mining research and development
contributes significantly to the success of data mining and its
extensive applications.
• On the other hand, data mining is not only nurtured by the
knowledge and development of these disciplines; the dedicated
research, development, and application of data mining to various
kinds of big data have also substantially impacted the development of
these disciplines in recent years.
Statistics and data mining
• Statistics studies the collection, analysis, interpretation or explanation, and
presentation of data.
• Data mining has an inherent connection with statistics.
• A statistical model is a set of mathematical functions that describe the behavior of
the objects in a target class in terms of random variables and their associated
probability distributions.
• Statistical models are widely used to model data and data classes.
• For example, in data mining tasks such as data characterization and classification,
statistical models of target classes can be built.
• In other words, such statistical models can be the outcome of a data mining task.
• Alternatively, data mining tasks can be built on top of statistical models.
• For example, we can use statistics to model noise and missing data values.
• Then, when mining patterns in a large data set, the data mining process can use the
model to help identify and handle noisy or missing values in the data.
• Statistics research develops tools for prediction and forecasting using
data and statistical models.
• Statistical methods can be used to summarize or describe a collection
of data.
• Statistics is useful for mining various patterns from data and for
understanding the underlying mechanisms generating and affecting
the patterns.
• Inferential statistics (or predictive statistics) models data in a way
that accounts for randomness and uncertainty in the observations
and is used to draw inferences about the process or population under
investigation.
• Statistical methods can also be used to verify data mining results.
• For example, after a classification or prediction model is mined, the
model should be verified by statistical hypothesis testing.
• A statistical hypothesis test (sometimes called confirmatory data
analysis) makes statistical decisions using experimental data.
• A result is called statistically significant if it is unlikely to have
occurred by chance.
• If the classification or prediction model holds under such testing, this
increases confidence in the soundness of the model.
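One common verification sketch is a paired t-test over per-fold accuracies of two models, using scipy; the accuracy numbers below are invented.

```python
from scipy import stats

# Invented per-fold accuracies of two models on the same test folds.
model_a = [0.91, 0.89, 0.93, 0.90, 0.92]
model_b = [0.85, 0.86, 0.88, 0.84, 0.87]

# Paired t-test: is the difference unlikely to have occurred by chance?
t_stat, p_value = stats.ttest_rel(model_a, model_b)
print(p_value < 0.05)   # True here: the difference is statistically significant
```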
• Applying statistical methods in data mining is far from trivial.
• Often, a serious challenge is how to scale up a statistical method over a large data
set.
• Many statistical methods have high complexity in computation.
• When such methods are applied on large data sets that are also distributed on
multiple logical or physical sites, algorithms should be carefully designed and tuned
to reduce the computational cost.
• This challenge becomes even tougher for online applications, such as online query
suggestions in search engines, where data mining is required to continuously handle
fast, real-time data streams.
• Data mining research has developed many scalable and effective solutions for the
analysis of massive data sets and data streams.
• Moreover, different kinds of data sets and different applications may require rather
different analysis methods.
• Effective solutions have been proposed and tested, which has led to many
new, scalable data mining-based statistical analysis methods.
Machine learning and data mining
• Machine learning investigates how computers can learn (or improve
their performance) based on data.
• Machine learning is a fast-growing discipline, with many new
methodologies and applications developed in recent years, from
support vector machines to probabilistic graphical models and deep
learning.
• In general, machine learning addresses two classical problems:
supervised learning and unsupervised learning.
Supervised learning:
• A classic example of supervised learning is classification.
• The supervision in the learning comes from the labeled examples in
the training data set.
• For example, to automatically recognize handwritten postal codes on
mail, the learning system takes a set of handwritten postal code
images and their corresponding machine-readable translations as the
training examples, and learns (i.e., computes) a classification model.
Unsupervised learning:
• A classic example of unsupervised learning is clustering.
• The learning process is unsupervised since the input examples are not
class-labeled.
• Typically, we may use clustering to discover groups within the data.
• For example, an unsupervised learning method can take, as input, a
set of images of handwritten digits.
• Suppose that it finds 10 clusters of data.
• These clusters may hopefully correspond to the 10 distinct digits of 0
to 9, respectively.
• However, since the training data are not labeled, the learned model
cannot tell us the semantic meaning of the clusters found.
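Both settings can be sketched on scikit-learn's bundled handwritten-digits data: the classifier learns from the labels, while k-means groups the same images without them and cannot name the digit each cluster represents.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)   # images of handwritten digits 0-9

# Supervised: the labels y supervise the learning of a classification model.
clf = LogisticRegression(max_iter=2000).fit(X, y)

# Unsupervised: the same images without labels; 10 clusters are found, but
# the learned model cannot tell us which digit each cluster corresponds to.
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
print(clf.predict(X[:1]), clusters[0])
```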
• With respect to these two basic problems, data mining and machine
learning do share many similarities.
• However, data mining differs from machine learning in several major
aspects.
• First, even on similar tasks such as classification and clustering, data
mining often works on very large data sets, or even on infinite data
streams; scalability is therefore an important concern, and many efficient
and highly scalable data mining or stream mining algorithms have had to
be developed to accomplish such tasks.
• Second, in many data mining problems, the data sets are usually large, but
the training data can still be rather small since it is expensive for experts to
provide quality labels for many examples.
• Therefore, data mining has to put a lot of effort into developing weakly
supervised methods.
• These include methodologies like semisupervised learning, with a small set
of labeled data but a large set of unlabeled data; integration or ensembles
of multiple weak models obtained from nonexperts (e.g., those obtained by
crowdsourcing); distant supervision, such as using popularly available and
general (but distantly relevant to the problem to be solved) knowledge bases
(e.g., Wikipedia, DBpedia); active learning, by carefully selecting examples to
ask human experts; or transfer learning, by integrating models learned from
similar problem domains.
• Data mining has been extending such weakly supervised methods for
constructing quality classification models on large data sets with a very
limited set of high quality training data.
• Third, machine learning methods may not be able to handle many kinds of
knowledge discovery problems on big data.
• On the other hand, data mining, by developing effective solutions for concrete
application problems, goes deep into the problem domain and expands far
beyond the scope covered by machine learning.
• For example, many application problems, such as business transaction data
analysis, software program execution sequence analysis, and chemical and
biological structural analysis, need effective methods for mining frequent
patterns, sequential patterns, and structured patterns.
• Data mining research has generated many scalable, effective, and diverse mining
methods for such tasks.
• As another example, the analysis of large-scale social and information networks
poses many challenging problems that may not fit the typical scope of many
machine learning methods due to the information interaction across links and
nodes in such networks.
• Data mining has developed a lot of interesting solutions to such problems.
Data mining and data science
• With the tremendous amount of data in almost every discipline and
various kinds of applications, big data and data science have become
buzzwords in recent years.
• Big data generally refers to huge amounts of structured and
unstructured data of various forms, and data science is an
interdisciplinary field that uses scientific methods, processes,
algorithms and systems to extract knowledge and insights from
massive data of various forms.
• Clearly, data mining plays an essential role in data science.
• For most people, data science is a concept that unifies statistics,
machine learning, data mining, and their related methods in order to
understand and analyze massive data.
• It employs techniques and theories drawn from many fields within the
context of mathematics, statistics, information science, and computer
science.
• For many industry people, the term “data science” often refers to
business analytics, business intelligence, predictive modeling, or any
meaningful use of data, and is being taken as a glamorized term to re-
brand statistics, data mining, machine learning, or any kind of data
analytics.
Data mining and other disciplines
• Besides statistics, machine learning, and database technology, data mining has close relationships
with many other disciplines as well.
• The majority of real-world data are unstructured, in the form of natural language text, images,
or audio-video data.
• Therefore, natural language processing, computer vision, pattern recognition, audio-video signal
processing, and information retrieval will offer critical help at handling such data.
• Actually, handling any special kinds of data will need a lot of domain knowledge to be integrated
into the data mining algorithm design.
• For example, mining biomedical data will need the integration of knowledge from biological
sciences, medical sciences, and bioinformatics.
• Mining geospatial data will need much knowledge and techniques from geography and geospatial
data sciences.
• Mining software bugs in large software programs will need to integrate software engineering with
data mining.
• Mining social media and social networks will need knowledge and skills from social sciences and
network sciences.
• Such examples can go on and on since data mining will penetrate almost every application domain.
• One major challenge in data mining is efficiency and scalability since
we often have to handle huge amounts of data with critical time and
resource constraints.
• Data mining is critically connected with efficient algorithm design
such as low-complexity, incremental, and streaming data mining
algorithms.
• It often needs to explore high-performance computation, parallel
computation, and distributed computation, with advanced hardware
and cloud computing or cluster computing environments.
• Data mining is also closely tied with human-computer interaction.
• Users need to interact with a data mining system or process in an
effective way, telling the system what to mine, how to incorporate
background knowledge, how to mine, and how to present the mining
results in an easy-to-understand way (e.g., via interpretation and
visualization) and an easy-to-interact-with way (e.g., with a friendly
graphical user interface and interactive mining).
• Actually, nowadays, there are not only many interactive data mining systems
but also many more data mining functions hidden in various kinds of
application programs.
• It is unrealistic to expect everyone in our society to understand and master
data mining techniques.
• It is also often impossible for industries to expose their large data sets.
• Many systems have data mining functions built within so that people can
perform data mining or use data mining results simply by mouse clicking.
• For example, intelligent search engines and online retailers perform such
invisible data mining by collecting their data and users' search or purchase
histories, incorporating data mining into their components to improve
performance, functionality, and user satisfaction.
• When your grandma shops online, she may be surprised to receive
some smart recommendations.
• This is likely the result of such invisible data mining.
