The document discusses data mining and provides answers to questions about it. It defines data mining and explains that it is not a hype but rather the natural evolution of information technology due to large amounts of available data. It describes data mining as involving the integration of techniques from multiple disciplines like databases, statistics, machine learning, etc. rather than just being a simple transformation of those individual technologies. It also explains how database technology evolved over time in a way that led to the development of data mining capabilities.

Uploaded by Priya Bhalerao

DMW – Unit 1

1. What is data mining? In your answer, address the following:

(a) Is it another hype?
(b) Is it a simple transformation or application of technology developed from databases, statistics, machine learning, and pattern recognition?
(c) We have presented a view that data mining is the result of the evolution of database technology.

ANS:
Data mining refers to the process of extracting or "mining" interesting knowledge or patterns from large amounts of data.

(a) Is it another hype?
Data mining is not another hype. Instead, the need for data mining has arisen due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. Thus, data mining can be viewed as the result of the natural evolution of information technology.

(b) Is it a simple transformation of technology developed from databases, statistics, and machine learning?
No. Data mining is more than a simple transformation of technology developed from databases, statistics, and machine learning. Instead, it involves an integration, rather than a simple transformation, of techniques from multiple disciplines such as database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis.

(c) Explain how the evolution of database technology led to data mining.
Database technology began with the development of data collection and database creation mechanisms, which led to effective mechanisms for data management, including data storage and retrieval, and query and transaction processing. The large number of database systems offering query and transaction processing eventually and naturally led to the need for data analysis and understanding. Hence, data mining began its development out of this necessity.

2. Do you think that data mining is also the result of the evolution of machine learning research? Can you present such views based on the historical progress of this discipline? Address the same for the fields of statistics and pattern recognition.

ANS:
No, data mining is not simply an evolution of machine learning research. Data mining has incorporated many technologies such as statistics, machine learning, pattern recognition, data warehouses, etc. Data mining is the process of analyzing patterns in large sets of data to get useful information on which decisions can be based. Data mining is like knowledge discovery on large data. Machine learning is applying Artificial Intelligence (AI) to systems, so that machines can learn from experience. RDBMS (Relational Database Management Systems) evolved from simple file processing into data storage, retrieval, and transaction processing for structured data organized into relations. Data formats have changed and evolved into new formats over time.

Examples: audio files, video files, tweets, log messages, and other high-density formats. The amount of data and the formats databases must process have increased and changed. This led to data mining capabilities over huge amounts of data.

3. Describe the steps involved in data mining when viewed as a process of knowledge discovery.

ANS:
The steps involved in data mining when viewed as a process of knowledge discovery are as follows:
• Data cleaning, a process that removes or transforms noise and inconsistent data
• Data integration, where multiple data sources may be combined
• Data selection, where data relevant to the analysis task are retrieved from the database
• Data transformation, where data are transformed or consolidated into forms appropriate for mining
• Data mining, an essential process where intelligent and efficient methods are applied in order to extract patterns
• Pattern evaluation, a process that identifies the truly interesting patterns representing knowledge based on some interestingness measures
• Knowledge presentation, where visualization and knowledge representation techniques are used to present the mined knowledge to the user
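These steps can be sketched as a chain of small functions. This is a minimal illustration in plain Python; the example records, the attribute names (`item`, `qty`, `store`), and the trivial "bulk purchase" pattern are all hypothetical, not part of the standard definition:

```python
def clean(records):
    # Data cleaning: drop records with missing values (a crude noise filter)
    return [r for r in records if None not in r.values()]

def integrate(*sources):
    # Data integration: combine records from multiple sources
    return [r for src in sources for r in src]

def select(records, keys):
    # Data selection: keep only the attributes relevant to the analysis task
    return [{k: r[k] for k in keys} for r in records]

def transform(records):
    # Data transformation: consolidate into (item, quantity) pairs for mining
    return [(r["item"], r["qty"]) for r in records]

def mine(pairs, min_qty):
    # Data mining: extract a trivial "pattern", namely items bought in bulk
    return {item for item, qty in pairs if qty >= min_qty}

# Two hypothetical source databases
db1 = [{"item": "milk", "qty": 4, "store": "A"},
       {"item": "bread", "qty": 1, "store": "A"}]
db2 = [{"item": "milk", "qty": 5, "store": "B"},
       {"item": "eggs", "qty": None, "store": "B"}]

patterns = mine(transform(select(clean(integrate(db1, db2)),
                                 ["item", "qty"])), min_qty=3)
print(patterns)  # → {'milk'}
```

Pattern evaluation and knowledge presentation would follow, scoring the mined patterns for interestingness and presenting them to the user.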

4. How is a data warehouse different from a database? How are they similar?

ANS:
Differences between a data warehouse and a database: A data warehouse is a repository of information collected from multiple sources over a history of time, stored under a unified schema, and used for data analysis and decision support; whereas a database is a collection of interrelated data that represents the current status of the stored data. There could be multiple heterogeneous databases, where the schema of one database may not agree with the schema of another. A database system supports ad hoc query and on-line transaction processing. For more details, please refer to the section "Differences between operational database systems and data warehouses."

Similarities between a data warehouse and a database: Both are repositories of information, storing huge amounts of persistent data.

5. Define each of the following data mining functionalities: characterization, discrimination, association and correlation analysis, classification, regression, clustering, and outlier analysis. Give examples of each data mining functionality, using a real-life database that you are familiar with.

ANS:
Characterization is a summarization of the general characteristics or features of a target class of data. For example, the characteristics of students can be produced, generating a profile of all the university first-year computing science students, which may include such information as a high GPA and a large number of courses taken.

Discrimination is a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes. For example, the general features of students with high GPAs may be compared with the general features of students with low GPAs. The resulting description could be a general comparative profile of the students, such as: 75% of the students with high GPAs are fourth-year computing science students, while 65% of the students with low GPAs are not.

Association is the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. For example, a data mining system may find association rules like

major(X, "computing science") ⇒ owns(X, "personal computer") [support = 12%, confidence = 98%]

where X is a variable representing a student. The rule indicates that of the students under study, 12% (support) major in computing science and own a personal computer. There is a 98% probability (confidence, or certainty) that a student in this group owns a personal computer. Typically, association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold. Additional analysis can be performed to uncover interesting statistical correlations between associated attribute-value pairs.
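The support and confidence figures in such a rule come directly from counts over the data. A minimal sketch in Python (the student counts below are hypothetical, chosen so the numbers work out to roughly the 12% / 98% in the example rule):

```python
# Hypothetical student records: (majors_in_cs, owns_pc), one tuple per student.
students = ([(True, True)] * 49 + [(True, False)] * 1 +
            [(False, True)] * 150 + [(False, False)] * 200)

n = len(students)                             # 400 students in total
both = sum(cs and pc for cs, pc in students)  # majors in CS AND owns a PC
cs_total = sum(cs for cs, _ in students)      # majors in CS

support = both / n            # P(major = CS and owns PC)
confidence = both / cs_total  # P(owns PC | major = CS)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# → support = 12%, confidence = 98%
```

A rule is kept only if both figures clear their respective minimum thresholds.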
Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. It predicts categorical (discrete, unordered) labels.

Regression, unlike classification, is a process to model continuous-valued functions. It is used to predict missing or unavailable numerical data values rather than (discrete) class labels.

Clustering analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. Each cluster that is formed can be viewed as a class of objects. Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.

Outlier analysis is the analysis of outliers, which are objects that do not comply with the general behavior or model of the data. Examples include fraud detection based on a large dataset of credit card transactions.

6. Present an example where data mining is crucial to the success of a business. What data mining functionalities does this business need (e.g., think of the kinds of patterns that could be mined)? Can such patterns be generated alternatively by data query processing or simple statistical analysis?

ANS:
A department store, for example, can use data mining to assist with its target marketing mail campaign. Using data mining functions such as association, the store can use the mined strong association rules to determine which products bought by one group of customers are likely to lead to the buying of certain other products. With this information, the store can then mail marketing materials only to those kinds of customers who exhibit a high likelihood of purchasing additional products. Data query processing is used for data or information retrieval and does not have the means for finding association rules. Similarly, simple statistical analysis cannot handle large amounts of data such as those of customer records in a department store.

7. Explain the difference and similarity between discrimination and classification, between characterization and clustering, and between classification and regression.

ANS:
Discrimination differs from classification in that the former refers to a comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes, while the latter is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts for the purpose of being able to use the model to predict the class of objects whose class label is unknown. Discrimination and classification are similar in that they both deal with the analysis of class data objects.

Characterization differs from clustering in that the former refers to a summarization of the general characteristics or features of a target class of data, while the latter deals with the analysis of data objects without consulting a known class label. This pair of tasks is similar in that they both deal with grouping together objects or data that are related or have high similarity in comparison to one another.

Classification differs from regression in that the former predicts categorical (discrete, unordered) labels while the latter predicts missing or unavailable, and often numerical, data values. This pair of tasks is similar in that they both are tools for prediction.

8. Outliers are often discarded as noise. However, one person's garbage could be another's treasure. For example, exceptions in credit card transactions can help us detect the fraudulent use of credit cards. Using fraud detection as an example, propose two methods that can be used to detect outliers and discuss which one is more reliable.

ANS:
There are many outlier detection methods. More details can be found in Chapter 12. Here we propose two methods for fraud detection:

a) Statistical methods (also known as model-based methods): Assume that the normal transaction data follow some statistical (stochastic) model; data not following the model are then outliers.

b) Clustering-based methods: Assume that the normal data objects belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any cluster.

It is hard to say which one is more reliable. The effectiveness of statistical methods depends heavily on whether the assumptions made for the statistical model hold true for the given data, and the effectiveness of clustering methods depends heavily on which clustering method we choose.
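The statistical approach in (a) can be sketched with a robust variant that uses the median absolute deviation rather than mean/standard deviation (which a single extreme fraud amount would otherwise distort). The transaction amounts below are hypothetical; a clustering-based detector as in (b) would typically rely on a library algorithm such as DBSCAN instead:

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Flag values whose robust z-score (based on the median absolute
    deviation, MAD) exceeds `threshold`."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    # 0.6745 makes the MAD comparable to a standard deviation for normal data
    return [v for v in values if mad and 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical card transaction amounts; 9000.0 is the planted fraud case.
amounts = [12.5, 30.0, 25.0, 18.0, 22.0, 40.0, 15.0, 9000.0, 28.0, 35.0]
print(mad_outliers(amounts))  # → [9000.0]
```

Whether such a model fits depends, as noted above, on the assumption that normal amounts cluster tightly around a central value.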

9. What are the major challenges of mining a huge amount of data (e.g., billions of tuples) in comparison with mining a small amount of data (e.g., a data set of a few hundred tuples)?

ANS:
Challenges to data mining regarding data mining methodology and user interaction issues include the following: mining different kinds of knowledge in databases, interactive mining of knowledge at multiple levels of abstraction, incorporation of background knowledge, data mining query languages and ad hoc data mining, presentation and visualization of data mining results, handling noisy or incomplete data, and pattern evaluation. Below are descriptions of the first three challenges mentioned:

Mining different kinds of knowledge in databases: Different users are interested in different kinds of knowledge and will require a wide range of data analysis and knowledge discovery tasks such as data characterization, discrimination, association, classification, clustering, trend and deviation analysis, and similarity analysis. Each of these tasks will use the same database in different ways and will require different data mining techniques.

Interactive mining of knowledge at multiple levels of abstraction: Interactive mining, with the use of OLAP operations on a data cube, allows users to focus the search for patterns, providing and refining data mining requests based on returned results. The user can then interactively view the data and discover patterns at multiple granularities and from different angles.

Incorporation of background knowledge: Background knowledge, or information regarding the domain under study such as integrity constraints and deduction rules, may be used to guide the discovery process and allow discovered patterns to be expressed in concise terms and at different levels of abstraction. This helps to focus and speed up a data mining process and to judge the interestingness of discovered patterns.

One challenge to data mining regarding performance issues is the efficiency and scalability of data mining algorithms. Data mining algorithms must be efficient and scalable in order to effectively extract information from large amounts of data in databases within predictable and acceptable running times. Another challenge is the parallel, distributed, and incremental processing of data mining algorithms. The need for parallel and distributed data mining algorithms has been brought about by the huge size of many databases, the wide distribution of data, and the computational complexity of some data mining methods. Due to the high cost of some data mining processes, incremental data mining algorithms incorporate database updates without the need to mine the entire data again from scratch.

10. Outline the major research challenges of data mining in one specific application domain, such as stream/sensor data analysis, spatiotemporal data analysis, or bioinformatics.

ANS:
Let us take spatiotemporal data analysis as an example. With the ever-increasing amount of data available from sensor networks, web-based map services, location-sensing devices, etc., the rate at which such data are being generated far exceeds our ability to extract useful knowledge from them to facilitate decision making and to better understand the changing environment. It is a great challenge to utilize existing data mining techniques, and to create novel techniques as well, to effectively exploit the rich spatiotemporal relationships and patterns embedded in the datasets, because both the temporal and spatial dimensions can add substantial complexity to data mining tasks. First, the spatial and temporal relationships are information bearing and therefore need to be considered in data mining. Some spatial and temporal relationships are implicitly defined and must be extracted from the data. Such extraction introduces some degree of fuzziness and/or uncertainty that may have an impact on the results of the data mining process. Second, working at the level of stored data is often undesirable, and thus complex transformations are required to describe the units of analysis at higher conceptual levels. Third, interesting patterns are more likely to be discovered at the lowest resolution/granularity level, but large support is more likely to exist at higher levels. Finally, how to express domain-independent knowledge and how to integrate spatiotemporal reasoning mechanisms into data mining systems are still open problems.

11. Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
(c) What is the midrange of the data?

ANS:
(a) What is the mean of the data? What is the median?
The (arithmetic) mean of the data is x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ = 809/27 ≈ 30. The median (the middle value of the ordered set, as the number of values in the set is odd) of the data is 25.

(b) What is the mode of the data? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
This data set has two values that occur with the same highest frequency and is, therefore, bimodal. The modes (values occurring with the greatest frequency) of the data are 25 and 35.

(c) What is the midrange of the data?
The midrange (average of the largest and smallest values in the data set) of the data is (70 + 13)/2 = 41.5.
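These values are easy to verify with Python's standard statistics module:

```python
from statistics import mean, median, multimode

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

print(round(mean(ages), 2))         # → 29.96 (≈ 30)
print(median(ages))                 # → 25
print(multimode(ages))              # → [25, 35]  (bimodal)
print((min(ages) + max(ages)) / 2)  # → 41.5 (midrange)
```

`multimode` (Python 3.8+) returns every value tied for the highest frequency, which directly exposes the bimodality.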

12. Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with the following results.

(a) Calculate the mean, median, and standard deviation of age and %fat.
(b) Draw the boxplots for age and %fat.

ANS:
(a) Calculate the mean, median, and standard deviation of age and %fat.
For the variable age, the mean is 46.44, the median is 51, and the standard deviation is 12.85. For the variable %fat, the mean is 28.78, the median is 30.7, and the standard deviation is 8.99.

(b) Draw the boxplots for age and %fat.
See Figure 2.2.

Figure 2.2: A boxplot of the variables age and %fat in Exercise 2.2.4

13. Demonstrate the types of data attributes with suitable examples.

ANS:

14. Use your knowledge to create a student database of at least 10 tuples with nominal, binary, discrete, and ordinal attributes.

ANS:
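As a sketch of such a database in plain Python (all student names and values are hypothetical): `name` and `branch` are nominal, `hosteller` is binary, `roll_no` and `courses` are discrete, and `grade` is ordinal:

```python
# Hypothetical student tuples: (roll_no, name, branch, hosteller, courses, grade)
# roll_no: discrete, name/branch: nominal, hosteller: binary,
# courses: discrete, grade: ordinal (A > B > C)
students = [
    (1,  "Asha",   "CS",   True,  5, "A"),
    (2,  "Rohan",  "IT",   False, 4, "B"),
    (3,  "Meera",  "CS",   True,  6, "A"),
    (4,  "Kiran",  "ENTC", False, 5, "C"),
    (5,  "Sneha",  "CS",   True,  4, "B"),
    (6,  "Vivek",  "IT",   True,  5, "B"),
    (7,  "Pooja",  "ENTC", False, 6, "A"),
    (8,  "Aditya", "CS",   False, 4, "C"),
    (9,  "Nikita", "IT",   True,  5, "A"),
    (10, "Sameer", "CS",   False, 6, "B"),
]

# An ordinal attribute has a meaningful order, which a rank mapping captures:
grade_rank = {"A": 3, "B": 2, "C": 1}
top = [name for _, name, _, _, _, g in students if grade_rank[g] == 3]
print(len(students), top)  # → 10 ['Asha', 'Meera', 'Pooja', 'Nikita']
```

The rank mapping is what distinguishes the ordinal `grade` from the purely nominal `branch`: only the former can be meaningfully sorted or compared.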

15. Briefly outline how to compute the dissimilarity between objects described by the following:
(a) Nominal attributes
(b) Asymmetric binary attributes

ANS:
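The standard textbook measures can be sketched as follows: for nominal attributes, d(i, j) = (p - m)/p, where p is the total number of attributes and m the number of matches; for asymmetric binary attributes, d(i, j) = (r + s)/(q + r + s), where q counts (1,1) matches, r counts (1,0) mismatches, s counts (0,1) mismatches, and the negative (0,0) matches are ignored. The example objects below are hypothetical:

```python
def nominal_dissimilarity(x, y):
    """d(i, j) = (p - m) / p for objects described by nominal attributes."""
    p = len(x)
    m = sum(a == b for a, b in zip(x, y))  # number of matching states
    return (p - m) / p

def asym_binary_dissimilarity(x, y):
    """d(i, j) = (r + s) / (q + r + s); (0, 0) matches carry no weight."""
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))
    return (r + s) / (q + r + s)

# Two objects with three nominal attributes; they differ in one of three.
print(nominal_dissimilarity(["red", "M", "Pune"], ["red", "F", "Pune"]))

# Two objects with five asymmetric binary attributes (e.g., medical tests,
# where a shared negative result says little about similarity).
print(asym_binary_dissimilarity([1, 0, 1, 0, 0], [1, 1, 0, 0, 0]))
```

The first call yields 1/3 (one mismatch out of three attributes); the second yields 2/3, since q = 1, r = 1, s = 1 and the two (0,0) pairs are excluded from the denominator.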

16. What kind of data can be mined?

ANS:
1. Flat Files
2. Relational Databases
3. Data Warehouses
4. Transactional Databases
5. Multimedia Databases
6. Spatial Databases
7. Time-Series Databases
8. WWW

1. Flat Files
o Flat files are data files in text or binary form with a structure that can be easily extracted by data mining algorithms.
o Data stored in flat files have no relationships or paths among themselves; for example, if a relational database is stored in flat files, there will be no relations between the tables.
o Flat files are represented by a data dictionary. E.g.: CSV files.
o Application: used in data warehousing to store data, used in carrying data to and from servers, etc.
2. Relational Databases
o A relational database is a collection of data organized in tables with rows and columns.
o The physical schema of a relational database defines the structure of the tables.
o The logical schema of a relational database defines the relationships among the tables.
o The standard API of a relational database is SQL.
o Application: data mining, the ROLAP model, etc.
3. Data Warehouses
o A data warehouse is a collection of data integrated from multiple sources that supports queries and decision making.
o There are three types of data warehouse: enterprise data warehouse, data mart, and virtual warehouse.
o Two approaches can be used to update data in a data warehouse: the query-driven approach and the update-driven approach.
o Application: business decision making, data mining, etc.
4. Transactional Databases
o A transactional database is a collection of data organized by time stamps, dates, etc. to represent transactions in databases.
o This type of database has the capability to roll back or undo an operation when a transaction is not completed or committed.
o It is a highly flexible system where users can modify information without changing any sensitive information.
o It follows the ACID properties of DBMS.
o Application: banking, distributed systems, object databases, etc.
5. Multimedia Databases
o Multimedia databases consist of audio, video, image, and text media.
o They can be stored in object-oriented databases.
o They are used to store complex information in pre-specified formats.
o Application: digital libraries, video-on-demand, news-on-demand, musical databases, etc.
6. Spatial Databases
o They store geographical information.
o They store data in the form of coordinates, topology, lines, polygons, etc.
o Application: maps, global positioning, etc.
7. Time-Series Databases
o Time-series databases contain data such as stock exchange data and user-logged activities.
o They handle arrays of numbers indexed by time, date, etc.
o They require real-time analysis.
o Application: eXtremeDB, Graphite, InfluxDB, etc.
8. WWW
o WWW refers to the World Wide Web, a collection of documents and resources like audio, video, text, etc. that are identified by Uniform Resource Locators (URLs), linked by HTML pages, and accessible through web browsers via the Internet.
o It is the most heterogeneous repository, as it collects data from multiple sources.
o It is dynamic in nature, as the volume of data is continuously increasing and changing.
o Application: online shopping, job search, research, studying, etc.

17. Illustrate the KDD process with a suitable diagram.

ANS: KDD process

1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
o Cleaning in case of missing values.
o Cleaning noisy data, where noise is a random or variance error.
o Cleaning with data discrepancy detection and data transformation tools.
2. Data Integration: Data integration is defined as combining heterogeneous data from multiple sources into a common source (data warehouse).
o Data integration using data migration tools.
o Data integration using data synchronization tools.
o Data integration using the ETL (Extract-Transform-Load) process.
3. Data Selection: Data selection is defined as the process where data relevant to the analysis is decided upon and retrieved from the data collection.
o Data selection using neural networks.
o Data selection using decision trees.
o Data selection using naive Bayes.
o Data selection using clustering, regression, etc.
4. Data Transformation: Data transformation is defined as the process of transforming data into the appropriate form required by the mining procedure. It is a two-step process:
o Data mapping: assigning elements from the source base to the destination to capture transformations.
o Code generation: creation of the actual transformation program.
5. Data Mining: Data mining is defined as the application of clever techniques to extract potentially useful patterns.
o Transforms task-relevant data into patterns.
o Decides the purpose of the model, using classification or characterization.
6. Pattern Evaluation: Pattern evaluation is defined as identifying the truly interesting patterns representing knowledge, based on given measures.
o Finds the interestingness score of each pattern.
o Uses summarization and visualization to make the data understandable by the user.
7. Knowledge Representation: Knowledge representation is defined as the technique that utilizes visualization tools to represent data mining results.
o Generate reports.
o Generate tables.
o Generate discriminant rules, classification rules, characterization rules, etc.

Note:
 KDD is an iterative process where evaluation measures can be enhanced, mining can be refined, and new data can be integrated and transformed in order to get different and more appropriate results.
 Preprocessing of databases consists of data cleaning and data integration.

18. Apply your knowledge to justify "Data Mining is the backbone of Machine Learning".

ANS:

19. Examine how Data Mining is a sub-process of KDD.

ANS:
Knowledge Discovery in Databases (KDD) is considered a programmed, exploratory analysis and modeling of vast data repositories. KDD is the organized procedure of recognizing valid, useful, and understandable patterns from huge and complex data sets.

Data Mining is the root of the KDD procedure, involving the algorithms that investigate the data, develop the model, and find previously unknown patterns. The model is used for extracting knowledge from the data, analyzing the data, and making predictions.

20. Show that data preprocessing is required to measure the quality of data, with proper justification.

ANS:
