DMW - Unit 1
2. Do you think that data mining is also the result of the evolution of
Machine learning research? Can you present such views based on the
historical progress of this discipline? Address the same for the fields of
statistics and pattern recognition.
ANS:
No, data mining is not simply an evolution of machine learning research. Data mining has incorporated many technologies such as statistics, machine learning, pattern recognition, data warehousing, etc. Data mining is the process of analyzing patterns in large sets of data to obtain useful information for making decisions based on that analysis; in effect, it is knowledge discovery on large data. Machine learning is the application of Artificial Intelligence (AI) to systems so that machines can learn from experience. RDBMSs (Relational Database Management Systems) evolved from simple file processing into data storage, retrieval, and transaction processing for structured data organized into relations. Data formats have also changed and evolved over time; examples include audio files, video files, tweets, log messages, and other high-volume formats. The amount of data and the variety of formats that databases must handle have grown and changed, and this led to the need for data mining capabilities over huge amounts of data.
• Knowledge presentation, where visualization and knowledge representation techniques are used to present the mined knowledge to the user.
ANS:
Characterization is a summarization of the general characteristics or
features of a target class of data. For example, the characteristics of
students can be produced, generating a profile of all the University first year
computing science students, which may include such information as a high
GPA and large number of courses taken.
Discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes. For example, the general features of students with high GPAs may be compared with the general features of students with low GPAs. The resulting description could be a general comparative profile of the students, such as: 75% of the students with high GPAs are fourth-year computing science students, while 65% of the students with low GPAs are not.
Association is the discovery of association rules showing attribute-value
conditions that occur frequently together in a given set of data. For
example, a data mining system may find association rules
like major(X, “computing science”) ⇒ owns(X, “personal computer”)
[support = 12%, confidence = 98%]
where X is a variable representing a student. The rule indicates that of the
students under study, 12% (support) major in computing science and own a
personal computer. There is a 98% probability (confidence, or certainty) that
a student in this group owns a personal computer. Typically, association
rules are discarded as uninteresting if they do not satisfy both a minimum
support threshold and a minimum confidence threshold. Additional analysis
can be performed to uncover interesting statistical correlations between
associated attribute-value pairs.
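To make support and confidence concrete, here is a minimal Python sketch (an illustrative addition with hypothetical toy transactions, not the student data above) showing how the two measures of a rule A ⇒ B could be computed:

# Minimal sketch: computing support and confidence of a rule A => B.
# Transactions are represented as Python sets of items (an assumption of this sketch).

def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """support(A and B) / support(A)."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

# Hypothetical toy data (not the dataset from the example above).
transactions = [
    {"cs_major", "owns_pc"},
    {"cs_major", "owns_pc"},
    {"math_major", "owns_pc"},
    {"cs_major"},
]
A, B = {"cs_major"}, {"owns_pc"}
print(support(transactions, A | B))    # 0.5  -> 50% of transactions contain both items
print(confidence(transactions, A, B))  # 0.666... -> 2 of the 3 cs_major transactions also own a PC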
Classification is the process of finding a model (or function) that describes
and distinguishes data classes or concepts, for the purpose of being able to
use the model to predict the class of objects whose class label is unknown. It
predicts categorical (discrete, unordered) labels.
Regression, unlike classification, is a process to model continuous-valued
functions. It is used to predict missing or unavailable numerical data values
rather than (discrete) class labels.
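To make the contrast concrete, the following minimal sketch (assuming scikit-learn is available; the training data and labels are purely hypothetical) fits a classifier that predicts a categorical label and a regressor that predicts a continuous value:

# Minimal sketch: classification predicts a discrete label, regression a continuous value.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Hypothetical training data: each row is [GPA, number_of_courses].
X = [[3.8, 6], [2.1, 3], [3.5, 5], [1.9, 2]]

# Classification: categorical (discrete, unordered) labels.
clf = DecisionTreeClassifier().fit(X, ["high", "low", "high", "low"])
print(clf.predict([[3.0, 4]]))   # -> a class label such as 'high' or 'low'

# Regression: a continuous-valued target (e.g., next semester's GPA).
reg = LinearRegression().fit(X, [3.9, 2.0, 3.6, 1.8])
print(reg.predict([[3.0, 4]]))   # -> a numeric estimate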
Clustering analyzes data objects without consulting a known class label. The
objects are clustered or grouped based on the principle of maximizing the
intraclass similarity and minimizing the interclass similarity. Each cluster that
is formed can be viewed as a class of objects. Clustering can also facilitate
taxonomy formation, that is, the organization of observations into a
hierarchy of classes that group similar events together.
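As a small illustration, the sketch below (assuming scikit-learn; the 2-D points are hypothetical) groups unlabeled points with k-means so that objects within a cluster are similar to each other and each discovered cluster can be treated as a class:

# Minimal sketch: clustering unlabeled points without consulting any class label.
from sklearn.cluster import KMeans

# Hypothetical, unlabeled 2-D data points.
points = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
          [8.0, 8.2], [7.9, 7.8], [8.1, 8.0]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # e.g. [0 0 0 1 1 1] -- each discovered cluster acts as a class
print(kmeans.cluster_centers_)   # centroids of the two groups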
Outlier analysis is the analysis of outliers, which are objects that do not comply with the general behavior or model of the data. Examples include fraud detection based on a large dataset of credit card transactions.
is used for data or information retrieval and does not have the means for finding association rules. Similarly, simple statistical analysis cannot handle large amounts of data such as the customer records of a department store.
labels while the latter predicts missing or unavailable, and often numerical, data values. This pair of tasks is similar in that both are tools for prediction.
9. What are the major challenges of mining a huge amount of data (e.g., billions of tuples) in comparison with mining a small amount of data (e.g., a data set of a few hundred tuples)?
ANS:
Challenges to data mining regarding data mining methodology and user interaction issues include the following: mining different kinds of knowledge in databases, interactive mining of knowledge at multiple levels of abstraction, incorporation of background knowledge, data mining query languages and ad hoc data mining, presentation and visualization of data mining results, handling noisy or incomplete data, and pattern evaluation. Below are descriptions of the first three challenges mentioned:

Mining different kinds of knowledge in databases: Different users are interested in different kinds of knowledge and will require a wide range of data analysis and knowledge discovery tasks such as data characterization, discrimination, association, classification, clustering, trend and deviation analysis, and similarity analysis. Each of these tasks will use the same database in different ways and will require different data mining techniques.

Interactive mining of knowledge at multiple levels of abstraction: Interactive mining, with the use of OLAP operations on a data cube, allows users to focus the search for patterns, providing and refining data mining requests based on returned results. The user can then interactively view the data and discover patterns at multiple granularities and from different angles.

Incorporation of background knowledge: Background knowledge, or information regarding the domain under study such as integrity constraints and deduction rules, may be used to guide the discovery process and allow discovered patterns to be expressed in concise terms and at different levels of abstraction. This helps to focus and speed up a data mining process or judge the interestingness of discovered patterns.
10. Outline the major research challenges of data mining in one specific
application domain, such as stream/sensor data analysis, spatiotemporal
data analysis, or bioinformatics.
ANS:
Let us take spatiotemporal data analysis as an example. With the ever-increasing amount of data available from sensor networks, web-based map services, location-sensing devices, etc., the rate at which such data are being generated far exceeds our ability to extract useful knowledge from them to facilitate decision making and to better understand the changing environment. It is a great challenge to utilize existing data mining techniques, and to create novel techniques, to effectively exploit the rich spatiotemporal relationships/patterns embedded in these datasets, because both the temporal and spatial dimensions can add substantial complexity to data mining tasks. First, the spatial and temporal relationships are information bearing and therefore need to be considered in data mining. Some spatial and temporal relationships are implicitly defined and must be extracted from the data. Such extraction introduces some degree of fuzziness and/or uncertainty that may have an impact on the results of the data mining process. Second, working at the level of stored data is often undesirable, and thus complex transformations are required to describe the units of analysis at higher conceptual levels. Third, interesting patterns are more likely to be discovered at the lowest resolution/granularity level, but large support is more likely to exist at higher levels. Finally, how to express domain-independent knowledge and how to integrate spatiotemporal reasoning mechanisms into data mining systems are still open problems.
11. Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
(c) What is the midrange of the data?
ANS:
(a) What is the mean of the data? What is the median?
The (arithmetic) mean of the data is x̄ = (1/n) Σᵢ xᵢ = 809/27 ≈ 30. The median (the middle value of the ordered set, since the number of values in the set is odd) is 25.
(b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).
This data set has two values that occur with the same highest frequency and is, therefore, bimodal. The modes (values occurring with the greatest frequency) of the data are 25 and 35.
(c) What is the midrange of the data?
The midrange (average of the largest and smallest values in the data set) is (70 + 13)/2 = 41.5.
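The figures above can be verified with a few lines of Python (the statistics module is part of the standard library):

# Checking the mean, median, mode(s), and midrange of the age data from Question 11.
import statistics

age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
       30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

print(sum(age) / len(age))         # mean: 809/27 ≈ 29.96 ≈ 30
print(statistics.median(age))      # median: 25 (the 14th of 27 ordered values)
print(statistics.multimode(age))   # modes: [25, 35] -> the data is bimodal
print((max(age) + min(age)) / 2)   # midrange: (70 + 13) / 2 = 41.5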
12. Suppose that a hospital tested the age and body fat data for 18
randomly selected adults with the following results.
(a) Calculate the mean, median, and standard deviation of age and %fat.
(b) Draw the boxplots for age and %fat.
ANS:
(a) Calculate the mean, median, and standard deviation of age and %fat.
For the variable age the mean is 46.44, the median is 51, and the standard deviation is 12.85. For the variable %fat the mean is 28.78, the median is 30.7, and the standard deviation is 8.99.
Figure: Boxplots of the variables age and %fat (Question 12).
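The following minimal sketch (assuming numpy and matplotlib are available; the 18 (age, %fat) pairs come from the exercise table and are not reproduced here) shows how the summary statistics and boxplots could be produced:

# Sketch for Question 12: summary statistics and boxplots for age and %fat.
import numpy as np
import matplotlib.pyplot as plt

def summarize_and_plot(age, fat):
    """age and fat are the 18 values from the exercise table."""
    for name, values in (("age", age), ("%fat", fat)):
        data = np.asarray(values, dtype=float)
        # ddof=0 gives the population standard deviation; use ddof=1 for the sample version.
        print(name, data.mean(), np.median(data), data.std(ddof=0))
    plt.boxplot([age, fat])               # one box per variable
    plt.xticks([1, 2], ["age", "%fat"])
    plt.show()

# summarize_and_plot(age_values, fat_values)  # pass in the values from the exercise table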
16. What kind of data can be mined?
ANS:
1. Flat Files
2. Relational Databases
3. DataWarehouse
4. Transactional Databases
5. Multimedia Databases
6. Spatial Databases
7. Time Series Databases
8. WWW
1. Flat Files
o Flat files are data files in text or binary form with a structure that can be easily extracted by data mining algorithms.
o Data stored in flat files have no relationships or paths among themselves; for example, if a relational database is stored as flat files, there are no relations between the tables.
o Flat files are described by a data dictionary. E.g.: CSV files.
o Applications: storing data in data warehousing, carrying data to and from servers, etc.
2. Relational Databases
o A relational database is a collection of data organized in tables with rows and columns.
o The physical schema of a relational database defines the structure of the tables.
o The logical schema of a relational database defines the relationships among the tables.
o The standard API of relational databases is SQL.
o Applications: data mining, the ROLAP model, etc.
3. DataWarehouse
o A data warehouse is a collection of data integrated from multiple sources that supports queries and decision making.
o There are three types of data warehouse: enterprise data warehouse, data mart, and virtual warehouse.
o Two approaches can be used to update data in a data warehouse: the query-driven approach and the update-driven approach.
o Applications: business decision making, data mining, etc.
4. Transactional Databases
o A transactional database is a collection of data organized by timestamps, dates, etc., where each record represents a transaction.
o This type of database has the capability to roll back or undo an operation when a transaction is not completed or committed.
o It is a highly flexible system where users can modify information without changing any sensitive information.
o It follows the ACID properties of a DBMS.
o Applications: banking, distributed systems, object databases, etc.
5. Multimedia Databases
o Multimedia databases consist of audio, video, image, and text media.
o They can be stored on object-oriented databases.
o They are used to store complex information in pre-specified formats.
o Applications: digital libraries, video-on-demand, news-on-demand, musical databases, etc.
6. Spatial Databases
o Spatial databases store geographical information.
o They store data in the form of coordinates, topology, lines, polygons, etc.
o Applications: maps, global positioning, etc.
7. Time-Series Databases
o Time-series databases contain data such as stock exchange data and user-logged activities.
o They handle arrays of numbers indexed by time, date, etc.
o They often require real-time analysis.
o Examples: eXtremeDB, Graphite, InfluxDB, etc.
8. WWW
o WWW (the World Wide Web) is a collection of documents and resources such as audio, video, text, etc., which are identified by Uniform Resource Locators (URLs), linked by HTML pages, and accessed via web browsers over the Internet.
o It is the most heterogeneous repository, as it collects data from multiple sources.
o It is dynamic in nature, as the volume of data is continuously increasing and changing.
o Applications: online shopping, job search, research, studying, etc.
6. Pattern Evaluation: Pattern evaluation is defined as identifying the truly interesting patterns representing knowledge, based on given interestingness measures.
o Find an interestingness score for each pattern (see the sketch below).
o Use summarization and visualization to make the data understandable by the user.
o Generate reports.
o Generate tables.
o Generate discriminant rules, classification rules, characterization rules, etc.
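As an illustration of scoring interestingness, lift is one commonly used measure (an illustrative choice; the notes above do not prescribe a specific score). A minimal sketch with hypothetical support values:

# Minimal sketch: lift as one possible interestingness score for a pattern A => B.
def lift(support_ab, support_a, support_b):
    """lift(A => B) = P(A and B) / (P(A) * P(B)); values > 1 suggest positive correlation."""
    return support_ab / (support_a * support_b)

# Hypothetical support values, for illustration only.
print(lift(support_ab=0.12, support_a=0.15, support_b=0.60))  # ~1.33 -> mildly interesting
print(lift(support_ab=0.09, support_a=0.15, support_b=0.60))  # 1.0 -> A and B look independent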
18. Apply your knowledge to justify the statement “Data Mining is the backbone of Machine Learning”.
ANS:
19. Examine how Data Mining is a sub-process of KDD.
ANS:
Knowledge Discovery in Databases (KDD) is considered an automated, exploratory analysis and modeling of vast data repositories. KDD is the organized process of identifying valid, useful, and understandable patterns from huge and complex data sets.
Data Mining is the core of the KDD process, involving the application of algorithms that explore the data, develop a model, and find previously unknown patterns. The model is then used to extract knowledge from the data, analyze it, and make predictions.