
DATA MINING

MODULE 1
An Introduction (Weeks 1 to 4)

WILBERT P. BENEDICTO, MIS

INSTRUCTOR 1

INSTITUTE OF INFORMATION SYSTEMS AND TECHNOLOGY
DATA MINING: COURSE DESCRIPTION
Data Mining studies algorithms and computational paradigms that allow
computers to find patterns and regularities in databases, perform prediction
and forecasting, and generally improve their performance through
interaction with data. It is currently regarded as the key element of a more
general process called Knowledge Discovery that deals with extracting useful
knowledge from raw data.

The knowledge discovery process includes data selection, cleaning, coding, the
use of different statistical and machine learning techniques, and visualization
of the generated structures.

The course will cover all these issues and will illustrate the whole process by
examples. Special emphasis will be given to the Machine Learning methods
as they provide the real knowledge discovery tools. Important related
technologies, such as data warehousing and on-line analytical processing (OLAP),
will also be discussed.
DATA MINING: COURSE INTENDED
LEARNING OUTCOMES

1. Understand the motivation and the application of data mining in the 21st
Century and how it has encouraged the move toward the era of data-driven
applications.
2. Articulate the concepts and various forms of data warehouse and Online
Analytical Processing and their role in Data Mining.
3. Demonstrate skills in data pre-processing and recommend suitable pre-
processing methods in dealing with various data problems.
4. Demonstrate skills in data visualization and create informative dashboards
to support fact-based decision making.
5. Become familiar with the stages of CRISP-DM and understand the purpose of
using CRISP-DM as the methodology in a Data Mining task.
6. Demonstrate skill in Association Rule mining using the Apriori Algorithm and
Unsupervised Learning using K-means Clustering, and how they are applied to
real-world data.
Table of Contents
Chapt.1: 21ST CENTURY: MOVING TOWARDS THE INFORMATION AGE …1
1.1. The Data Explosion
1.2. What is Data Mining?
1.3. The 4V's of Big Data
1.3.1. Why is it Important?
1.4. Data Mining Motivations and Objective
1.5. Data mining Phases
1.6. Various Kinds of Data that can be mined
1.7. Data Mining Task
1.7.1. Description
1.7.2. Estimation
1.7.3. Prediction
1.7.4. Classification
1.7.5. Clustering
1.7.6. Association
1.8. Technologies in data mining
1.8.1. Statistics
1.8.2. Machine Learning
1.8.3. Database systems and warehouses
1.8.4. Information retrieval
1.9. The UCI Repository of Datasets

Chapt.2: Getting to Know your Data: The Data Warehouse…………………25

2.1. The Data Warehouse
2.2. Data Warehouse System vs. Operational Database System
2.3. OLTP vs. OLAP
2.4. The Data Representation
2.4.1. Data Objects and Attribute types
2.4.2. Nominal Attributes
2.4.3. Binary Attributes
2.4.4. Ordinal Attributes
2.4.5. Numeric Attributes
2.4.6. Discrete and Continuous Attributes

CHAPTER 1:
THE INFORMATION AGE

“Necessity, who is the mother of invention” – Plato

We live in a world where vast amounts of data are collected daily. Analyzing
such data is an important need. In this chapter, we look at how data mining can
meet this need by providing tools to discover knowledge from data, and we
observe how data mining can be viewed as a result of the natural evolution of
information technology.

CHAPTER 1 provides an introduction to the multidisciplinary field of
data mining. It discusses the evolutionary path of information technology,
which has led to the need for data mining, and the importance of its
applications.

It examines the data types to be mined, including relational, transactional,
and data warehouse data, as well as complex data types such as time-series,
sequences, data streams, spatiotemporal data, multimedia data, text data,
graphs, social networks, and Web data.

The chapter presents a general classification of data mining tasks, based on
the kinds of knowledge to be mined, the kinds of technologies used, and the
kinds of applications that are targeted.

21ST CENTURY:
MOVING TOWARDS THE INFORMATION AGE
“We are living in the information age” is a popular saying; however, we are
actually living in the data age. Terabytes or petabytes of data pour into our
computer networks, the World Wide Web (WWW), and various data storage
devices every day from business, society, science and engineering, medicine,
and almost every other aspect of daily life. This explosive growth of available data
volume is a result of the computerization of our society and the fast
development of powerful data collection and storage tools. Businesses worldwide
generate gigantic data sets, including sales transactions, stock trading records,
product descriptions, sales promotions, company profiles and performance, and
customer feedback.

For example, large stores, such as Wal-Mart (in the US) or local stores (e.g., PureGold),
handle hundreds of millions of transactions per week at thousands of branches
around the country or the world. Scientific and engineering practices generate
high orders of petabytes of data in a continuous manner, from remote sensing,
process measuring, scientific experiments, system performance, engineering
observations, and environment surveillance. Global backbone
telecommunication networks carry tens of petabytes of data traffic every day. The
medical and health industry generates tremendous amounts of data from medical
records, patient monitoring, and medical imaging.

Billions of Web searches supported by search engines process tens of petabytes of
data daily. Communities and social media have become increasingly important
data sources, producing digital pictures and videos, blogs, Web communities, and
various kinds of social networks.

1.1. The Data Explosion


The list of sources that generate huge amounts of data is endless. This explosively
growing, widely available, and gigantic body of data makes our time truly the
data age. Powerful and versatile tools are badly needed to automatically uncover
valuable information from the tremendous amounts of data and to transform
such data into organized knowledge. This necessity has led to the birth of data
mining. The field is young, dynamic, and promising. Data mining has made and will
continue to make great strides in our journey from the data age toward the
coming information age.

1.1.1. Data Mining turns a large collection of data into knowledge.
Search engines such as Google receive hundreds of millions of search queries
every day. Each query can be viewed as a transaction where the user
describes her or his information need. What novel and useful knowledge can
a search engine learn from such a huge collection of queries collected from
users over time? Interestingly, some patterns found in user search queries can
disclose invaluable knowledge that cannot be obtained by reading
individual data items alone.

For example, Google’s Flu Trends uses specific search terms as indicators of
flu activity. It found a close relationship between the number of people who
search for flu-related information and the number of people who actually
have flu symptoms. A pattern emerges when all of the search queries related to
flu are aggregated. Using aggregated Google search data, Flu Trends can
estimate flu activity up to two weeks faster than traditional systems can. This
example shows how data mining can turn a large collection of data into
knowledge that can help meet a current global challenge.

1.2. So… What is Data Mining?


Simply stated, data mining refers to extracting or “mining” knowledge from large
amounts of data. The term is actually a misnomer. Remember that the mining of
gold from rocks or sand is referred to as gold mining rather than rock or sand
mining. Thus, “data mining” should have been more appropriately named
“knowledge mining from data”, which is unfortunately somewhat long. “Knowledge
mining”, a shorter term, may not reflect the emphasis on mining from large
amounts of data. Nevertheless, mining is a vivid term characterizing the process
that finds a small set of precious nuggets from a great deal of raw material.

Thus, such a misnomer which carries both “data” and “mining” became a
popular choice. There are many other terms carrying a similar or slightly
different meaning to data mining, such as knowledge mining from databases,
knowledge extraction, data/pattern analysis, data archaeology, and data
dredging.

1.3. The 4Vs of Big Data (Volume, Velocity, Variety, Veracity)
Big data is a term that describes the large volume of data – both structured and
unstructured – that inundates a business on a day-to-day basis. But it’s not the
amount of data that’s important. It’s what organizations do with the data that
matters.

The term “big data” refers to data that is so large, fast or complex that it’s
difficult or impossible to process using traditional methods. The act of accessing and
storing large amounts of information for analytics has been around a long time.

To do this companies need to understand the 4Vs of big data – volume, velocity,
variety, and veracity (figure 1.0) – and develop tools and processes to manage
data and turn it into actionable insights.

Figure 1.0: The 4Vs of Big Data, consisting of Volume (amount of data), Velocity
(speed of data), Variety (types of data), and Veracity (quality of data).

Volume

Volume refers to the amount of data being generated, and in the age of big data,
more data is being generated every minute than ever before.

Organizations collect data from a variety of sources, including business
transactions, smart (IoT) devices, industrial equipment, videos, social media,
and more. In the past, storing it would have been a problem – but cheaper
storage on platforms like data lakes and Hadoop has eased the burden.

Velocity

Velocity refers to the speed of the data being generated and the rate at which it’s
being processed in terms of both collection and analysis.

With the growth in the Internet of Things, data streams into businesses at an
unprecedented speed and must be handled in a timely manner. RFID tags,
sensors and smart meters are driving the need to deal with these torrents of
data in near-real time.

Veracity

Veracity refers to the quality, reliability or uncertainty of the data. Is the data
trustworthy? Is it outdated? Has it been modified in any way? Basically, is it
accurate? Data must be cleaned, current, and of high quality and reliability for
it to be accurately analyzed.

Variety

Variety refers to the broad range of different types of data that can come from many
different sources. Today, data comes not only from computers, but also devices
such as smartphones and appliances, among others. Additionally, with the
popularity of social media and other online platforms, vast amounts of
unstructured data are being created (e.g., tweets, photos, videos, social
media posts, online comments, etc.).

1.3.1. Why is Big Data Important?


The importance of big data doesn’t revolve around how much data you have, but
what you do with it. You can take data from any source and analyze it to find
answers that enable cost reductions, time reductions, new product development
and optimized offerings, and smart decision making. When you combine big
data with high-powered analytics, you can accomplish business-related tasks
such as:

 Determining root causes of failures, issues and defects in near-real time.


 Generating coupons at the point of sale based on the customer’s buying
habits.
 Recalculating entire risk portfolios in minutes.
 Detecting fraudulent behavior before it affects your organization.
 So on…

1.4. Data Mining Motivations and Objectives
Data mining has attracted a great deal of attention in the information industry
and in society as a whole in recent years, due to the wide availability of huge
amounts of data and the imminent need for turning such data into useful
information and knowledge. The information and knowledge gained can be
used for applications ranging from market analysis, fraud detection, and
customer retention, to production control and science exploration.

“Data mining can be viewed as a result of the natural evolution of


information technology”
The database and data management industry evolved through the development of
several critical functionalities (Figure 2.0): data collection and database creation, data
management, and advanced data analysis (involving data warehousing and data
mining).

Huge volumes of data have been accumulated beyond databases and data
warehouses. The early development of data collection and database creation
mechanisms served as a prerequisite for the later development of effective
mechanisms for data storage and retrieval, as well as query and transaction
processing. Nowadays numerous database systems offer query and
transaction processing as common practice. Advanced data analysis has naturally
become the next step.

In summary, the abundance of data, coupled with the need for powerful data
analysis tools, has been described as a data rich but information poor situation.
The fast-growing, tremendous amount of data, collected and stored in large
and numerous data repositories, has far exceeded our human ability for
comprehension without powerful tools. As a result, data collected in large data
repositories become “data tombs”—data archives that are seldom visited.

Consequently, important decisions are often made based not on the


information-rich data stored in data repositories but rather on a decision maker’s
intuition, simply because the decision maker does not have the tools to extract
the valuable knowledge embedded in the vast amounts of data.

The widening gap between data and information calls for the systematic
development of data mining tools that can turn data tombs into “golden
nuggets” of knowledge.

Data Collection and Database Creation (1960s and earlier)
 Primitive file processing

Database Management System (1970s to early 1980s)
 Hierarchical and network database systems
 Relational database systems
 Data modeling: entity-relationship models, etc.
 Indexing and accessing methods
 Query languages: SQL, etc.
 User interfaces, forms, and reports
 Query processing and optimization
 Transactions, concurrency control, and recovery
 Online transaction processing (OLTP)

Advanced Database Systems (Mid 1980s to present)
 Advanced data models: extended-relational, object relational, deductive, etc.
 Managing complex data: spatial, temporal, multimedia, sequence and structured, scientific, engineering, moving objects, etc.
 Data streams and cyber-physical data systems
 Web-based databases (XML, semantic web)
 Managing uncertain data and data cleaning
 Integration of heterogeneous sources
 Text database systems and integration with information retrieval
 Cloud computing and parallel data processing

Advanced Data Analysis (Late 1980s to present)
 Data warehouse and OLAP
 Data mining and knowledge discovery: classification, clustering, outlier analysis, association and correlation, comparative summary, discrimination analysis, pattern discovery, trend and deviation analysis, etc.
 Mining complex types of data: streams, sequence, text, spatial, temporal, multimedia, Web, networks, etc.
 Data mining applications: business, society, retail, banking, telecommunications, science and engineering, blogs, daily life, etc.
 Data mining and society: invisible data mining, privacy-preserving data mining, mining social and information networks, recommender systems, etc.

Future Generation of Information Systems (Present to Future)

Figure 2.0: Database and Data Management Industry Evolution

1.5. Data Mining Phases
Many people treat data mining as a synonym for another popularly used term,
Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as
simply an essential step in the process of knowledge discovery. Knowledge
discovery as a process is depicted in Figure 3.0 and consists of an iterative
sequence of the following steps:

Step 1. Data cleaning (to remove noise and inconsistent data)


Step 2. Data integration (where multiple data sources may be combined)
Step 3. Data selection (where data relevant to the analysis task are retrieved
from the database)
Step 4. Data transformation (where data are transformed or consolidated into
forms appropriate for mining by performing summary or aggregation
operations, for instance)
Step 5. Data mining (an essential process where intelligent methods are
applied in order to extract data patterns)
Step 6. Pattern evaluation (to identify the truly interesting patterns representing
knowledge based on some interestingness measures)
Step 7. Knowledge presentation (where visualization and knowledge
representation techniques are used to present the mined knowledge
to the user)

Steps 1 to 4 are different forms of data preprocessing, where the data are
prepared for mining. The data mining step may interact with the user or a
knowledge base. The interesting patterns are presented to the user and may be
stored as new knowledge in the knowledge base. Note that according to this
view, data mining is only one step in the entire process, but an essential one
because it uncovers hidden patterns for evaluation.
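To make these steps concrete, here is a minimal sketch in Python (using the pandas
library) of what Steps 1 to 4 might look like before a mining algorithm is applied;
the file names and column names are hypothetical and not part of the module.

    import pandas as pd

    # Steps 1-2: data cleaning and integration (hypothetical CSV files)
    sales = pd.read_csv("sales.csv")          # e.g., transactional data
    customers = pd.read_csv("customers.csv")  # e.g., customer master data
    data = sales.merge(customers, on="customer_id", how="inner")  # integration
    data = data.drop_duplicates().dropna(subset=["amount"])       # cleaning

    # Step 3: data selection - keep only the attributes relevant to the task
    selected = data[["customer_id", "age", "region", "amount"]]

    # Step 4: data transformation - summarize/aggregate into a mining-ready form
    transformed = (selected
                   .groupby(["customer_id", "age", "region"], as_index=False)["amount"]
                   .sum()
                   .rename(columns={"amount": "total_spend"}))

    # Step 5 onward: this table is what a mining algorithm would consume
    print(transformed.head())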

We can agree that data mining is a step in the knowledge discovery process.
However, in industry, in media, and in the database research environment, the
term data mining is becoming more popular than the longer term of
knowledge discovery from data. Therefore, in this module, we choose to use
the term data mining. We adopt a broad view of data mining functionality:
data mining is the process of discovering interesting knowledge from large
amounts of data stored in databases, data warehouses, or other information
repositories.

Figure 3.0: The knowledge discovery process. Various data sources are cleaned and
integrated into target data; pre-processing (data selection and transformation)
produces preprocessed data; data mining extracts patterns; and evaluation/
interpretation of the patterns yields knowledge. (Adapted from: U. Fayyad, et al.
(1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in
Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press.)

1.6. Various kinds of Data that can be mined
In principle, data mining should be applicable to any kind of data repository, as well
as to transient data, such as data streams. Thus, the scope of our examination
of data repositories will include relational databases, data warehouses,
transactional databases, advanced database systems, flat files, data streams,
and the World Wide Web.

Advanced database systems include object-relational databases and specific


application-oriented databases, such as spatial databases, time-series
databases, text databases, and multimedia databases.

The challenges and techniques of mining may differ for each of the repository
systems. Although this module assumes that you have basic knowledge of
information systems, we provide a brief introduction to each of the major data
repository systems listed above.

1.6.1. Relational Database


A database system, also called a database management system (DBMS), consists
of a collection of interrelated data, known as a database, and a set of
software programs to manage and access the data.

A relational database is a collection of tables, each of which is assigned a unique


name. Each table consists of a set of attributes (columns or fields) and usually
stores a large set of tuples (records or rows). Each tuple in a relational table
represents an object identified by a unique key and described by a set of
attribute values.

A semantic data model, such as an entity-relationship (ER) data model, is often


constructed for relational databases. An ER data model represents the
database as a set of entities and their relationships.

Relational data can be accessed by database queries written in a relational


query language, such as SQL, or with the assistance of graphical user
interfaces.
Note: This topic is covered in your Database Management subject.

Relational databases are one of the most commonly available and rich
information repositories, and thus they are a major data form in our study of
data mining.
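As a small illustration of tables, tuples, and relational queries (the customer
table and its values below are invented for this sketch, which uses Python's
built-in sqlite3 module):

    import sqlite3

    conn = sqlite3.connect(":memory:")        # throwaway in-memory database
    cur = conn.cursor()

    # A relation (table) with attributes (columns); each row is a tuple with a unique key
    cur.execute("CREATE TABLE customer (cust_id TEXT PRIMARY KEY, name TEXT, city TEXT, age INTEGER)")
    cur.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", [
        ("c_1001", "Ana", "Manila", 28),
        ("c_1002", "Ben", "Cebu", 35),
        ("c_1003", "Cara", "Davao", 41),
    ])

    # Relational data accessed through a relational query language (SQL)
    cur.execute("SELECT name, city FROM customer WHERE age > 30 ORDER BY age")
    print(cur.fetchall())
    conn.close()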

1.6.2. Data Warehouses
A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and that usually resides at a single site.

Data warehouses are constructed via a process of data cleaning, data


integration, data transformation, data loading, and periodic data refreshing. This
process will be discussed more in the next Chapter. Figure 4.0 shows the
typical framework for construction and use of a data warehouse.

Figure 4.0: Typical framework for construction and use of a data warehouse. Data
from multiple sources (e.g., Chicago, New York, Toronto, Vancouver) is cleaned,
integrated, transformed, loaded, and periodically refreshed into the data
warehouse, which clients then access through query and analysis tools.

To facilitate decision making, the data in a data warehouse are organized


around major subjects, such as customer, item, supplier, and activity. The
data are stored to provide information from a historical perspective (such as
from the past 5–10 years) and are typically summarized.

A data warehouse is usually modeled by a multidimensional database structure,


where each dimension corresponds to an attribute or a set of attributes in the
schema, and each cell stores the value of some aggregate measure, such as
count or sales amount.
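A rough feel for this multidimensional organization can be obtained with pandas:
the sketch below, with invented sales figures, aggregates one measure (total
sales) over two dimensions (item and branch), much like a single slice of a data
cube.

    import pandas as pd

    # Hypothetical fact records: each row is a sale described by its dimensions
    facts = pd.DataFrame({
        "item":   ["phone", "phone", "laptop", "laptop", "phone"],
        "branch": ["Toronto", "New York", "Toronto", "Chicago", "Chicago"],
        "year":   [2023, 2023, 2024, 2024, 2024],
        "sales":  [500, 700, 1200, 900, 450],
    })

    # Each cell holds an aggregate measure (total sales) for one combination
    # of the 'item' and 'branch' dimensions
    cube_slice = facts.pivot_table(values="sales", index="item",
                                   columns="branch", aggfunc="sum", fill_value=0)
    print(cube_slice)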

1.6.3. Transactional Databases
In general, a transactional database consists of a file where each record
represents a transaction. A transaction typically includes a unique transaction
identity number (trans_ID) and a list of the items making up the transaction
(such as items purchased in a store).
Table 1.0: Fragment of a transactional database for sales

Trans_ID Item_List_ID
T100 I1, I2, I4, I277
T101 I2, I5, I89
……. …….

The transactional database may have additional tables associated with it,
which contain other information regarding the sale, such as the date of the
transaction, the customer ID number, the ID number of the salesperson and
of the branch at which the sale occurred, and so on.

1.6.4. Advanced Data and Information Systems and Advanced Applications


Relational database systems have been widely used in business applications.
With the progress of database technology, various kinds of advanced data
and information systems have emerged and are undergoing development to
address the requirements of new applications.

The new database applications include handling spatial data (such as maps),
engineering design data (such as the design of buildings, system components, or
integrated circuits), hypertext and multimedia data (including text, image, video,
and audio data), time-related data (such as historical records or stock
exchange data), stream data (such as video surveillance and sensor data,
where data flow in and out like streams), and the World Wide Web (a huge,
widely distributed information repository made available by the Internet).

1.6.5. Object-Relational Database


Object-relational databases are constructed based on an object-relational
data model. Because most sophisticated database applications need to
handle complex objects and structures, object-relational databases are
becoming increasingly popular in industry and applications.

Conceptually, the object-relational data model inherits the essential


concepts of object-oriented databases, where, in general terms, each entity is
considered as an object.

For data mining in object-relational systems, techniques need to be


developed for handling complex object structures, complex data types, class and
subclass hierarchies, property inheritance, and methods and procedures.

1.6.6. Temporal Databases, Sequence Databases, and Time-Series Databases
A Temporal database typically stores relational data that include time-related
attributes. These attributes may involve several timestamps, each having
different semantics.

A Sequence database stores sequences of ordered events, with or without a


concrete notion of time.

A Time-series database stores sequences of values or events obtained over


repeated measurements of time (e.g., hourly, daily, weekly).

For instance, the mining of banking data may aid in the scheduling of bank
tellers according to the volume of customer traffic. Stock exchange data can
be mined to uncover trends that could help you plan investment strategies.

1.6.7. Spatial Databases and Spatiotemporal Databases


Spatial databases contain spatial-related information. Examples include
geographic (map) databases, very large-scale integration (VLSI) or
computer-aided design databases, and medical and satellite image
databases.

Spatial data may be represented in raster format, consisting of n-dimensional


bit maps or pixel maps.

“What kind of data mining can be performed on spatial databases?”

Spatial Data mining may uncover patterns describing the characteristics of


houses located near a specified kind of location, such as a park, for instance.
Moreover, spatial classification can be performed to construct models for
prediction based on the relevant set of features of the spatial objects.

1.6.8. Text and Multimedia Databases


Text databases are databases that contain word descriptions for objects. These
word descriptions are usually not simple keywords but rather long sentences
or paragraphs, such as product specifications, error or bug reports, warning
messages, summary reports, notes, or other documents. Text databases may
be highly unstructured; some text databases may be somewhat structured
(that is, semi-structured), whereas others are relatively well structured (such as
library catalogue databases).

Multimedia databases store image, audio, and video data. They are used in
applications such as picture content-based retrieval, voice-mail systems,
video-on-demand systems, the World Wide Web, and speech-based user
interfaces that recognize spoken commands.

Because video and audio data require real-time retrieval at a steady and
predetermined rate in order to avoid picture or sound gaps and system
buffer overflows, such data are referred to as continuous-media data.

For multimedia data mining, storage and search techniques need to be


integrated with standard data mining methods. Promising approaches
include the construction of multimedia data cubes, the extraction of multiple
features from multimedia data, and similarity-based pattern matching.

1.6.9. Heterogeneous Databases and Legacy Databases


A heterogeneous database consists of a set of interconnected, autonomous
component databases. The components communicate in order to exchange
information and answer queries.

A legacy database is a group of heterogeneous databases that combines different


kinds of data systems, such as relational or object-oriented databases,
hierarchical databases, network databases, spreadsheets, multimedia
databases, or file systems.

The heterogeneous databases in a legacy database may be connected by


intra or inter-computer networks.

Data mining techniques may provide an interesting solution to the


information exchange problem by performing statistical data distribution and
correlation analysis, and transforming the given data into higher, more
generalized, conceptual levels (such as fair, good, or excellent for student
grades), from which information exchange can then more easily be
performed.

1.6.10. Data Streams


Many applications involve the generation and analysis of a new kind of data,
called stream data, where data flow in and out of an observation platform (or
window) dynamically.

Typical examples of data streams include various kinds of scientific and


engineering data, time-series data, and data produced in other dynamic
environments, such as network traffic, stock exchange, telecommunications,
Web click streams, video surveillance, and weather or environment
monitoring.

Mining data streams involves the efficient discovery of general patterns and
dynamic changes within stream data. For example, we may like to detect
intrusions of a computer network based on the anomaly of message flow,
which may be discovered by clustering data streams, dynamic construction
of stream models, or comparing the current frequent patterns with that at a
certain previous time.

1.6.11. The World Wide Web
The World Wide Web and its associated distributed information services, such
as Yahoo! and Google, provide rich, worldwide, on-line information
services, where data objects are linked together to facilitate interactive
access.

For example, understanding user access patterns will not only help improve
system design but also lead to better marketing decisions. Capturing user
access patterns in such distributed information environments is called Web
usage mining (Weblog mining).

Web mining is the development of scalable and effective Web data analysis
and mining methods. It may help us learn about the distribution of information
on the Web in general, characterize and classify Web pages, and uncover
Web dynamics and the association and other relationships among different
Web pages, users, communities, and Web-based activities.

1.7. Data Mining Task
The following listing shows the most common data mining tasks.

1. Description
2. Estimation
3. Prediction
4. Classification
5. Clustering
6. Association.

1.7.1. Descriptive Task


Sometimes researchers and analysts are simply trying to find ways to describe
patterns and trends lying within the data. For example, an investigator may
uncover evidence that those who have been laid off are less likely to support
the incumbent in the presidential election. Descriptions of patterns and trends
often suggest possible explanations for such patterns and trends.

Data mining models should be as transparent as possible. That is, the results of the
data mining model should describe clear patterns that are agreeable to
intuitive interpretation and explanation.

Some data mining methods are more suited to transparent interpretation


than others. For example, decision trees provide an intuitive and human-
friendly explanation of their results. However, neural networks are
comparatively opaque to non-specialists, due to the nonlinearity and
complexity of the model.

High-quality description can often be accomplished with exploratory data


analysis, a graphical method of exploring the data in search of patterns and
trends.

1.7.2. Estimation Task


In estimation, we approximate the value of a numeric target variable using a
set of numeric and/or categorical predictor variables. Models are built using
“complete” records, which provide the value of the target variable, as well as
the predictors.

Then, for new data (or observations), estimates of the value of the target
variable are made, based on the values of the predictors.

For example, we might be interested in estimating the systolic blood pressure


reading of a hospital patient, based on the patient’s age, gender, body mass
index, and blood sodium levels. The relationship between systolic blood
pressure and the predictor variables in the training set would provide us with
an estimation model. We can then apply that model to new cases.
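A minimal sketch of such an estimation model, using scikit-learn's linear
regression and a few invented patient records (gender is omitted for brevity),
might look like this:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical "complete" records: [age, body_mass_index, blood_sodium]
    X_train = np.array([[45, 24.0, 139], [60, 29.5, 142], [35, 22.1, 138], [52, 27.3, 141]])
    y_train = np.array([120, 145, 112, 133])   # systolic blood pressure (target variable)

    model = LinearRegression().fit(X_train, y_train)   # learn from the training set

    # Estimate the target value for a new, unseen patient
    new_patient = np.array([[50, 26.0, 140]])
    print(round(model.predict(new_patient)[0], 1))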

Examples of estimation tasks in business and research include:


 Estimating the number of points per game LeBron James will score when
double-teamed in the play-offs.
 Estimating the grade point average (GPA) of a graduate student, based on
that student’s undergraduate GPA.
 Estimating the percentage decrease in rotary movement sustained by a
National Football League (NFL) running back with a knee injury.

1.7.3. Prediction Task


Prediction is similar to classification and estimation, except that for prediction,
the results lie in the future. Examples of prediction tasks in business and
research include:
 Predicting the price of a stock 3 months into the future;
 Predicting the percentage increase in traffic deaths next year if the speed
limit is increased;
 Predicting the winner of a tournament, based on a comparison of the team
statistics;
 Predicting whether a particular molecule in drug discovery will lead to a
profitable new drug for a pharmaceutical company.

Any of the methods and techniques used for classification and estimation
may also be used, under appropriate circumstances, for prediction.

1.7.4. Classification Task


Classification is similar to estimation, except that the target variable is categorical
rather than numeric. In classification, there is a target categorical variable,
such as income bracket, which, for example, could be partitioned into three
classes or categories: high income, middle income, and low income.

For example, consider the excerpt from a data set in Table 2.0.
Table 2.0: Excerpt from dataset for classifying income

Subject   Age   Gender   Occupation        Income Bracket
001       47    F        Software Engr.    High
002       28    M        Marketing Const.  Middle
003       35    M        Unemployed        Low
….        ….    ….       ….                ….

Suppose the researcher would like to be able to classify the income bracket of
new individuals, not currently in the above database, based on the other
characteristics associated with that individual, such as age, gender, and
occupation. This task is a classification task, very nicely suited to data mining
methods and techniques.

The algorithm would proceed roughly as follows. First, examine the data set
containing both the predictor variables and the (already classified) target
variable, income bracket. In this way, the algorithm “learns about” which
combinations of variables are associated with which income brackets. For
example, older females may be associated with the high-income bracket.
This data set is called the training set.

Then the algorithm would look at new records, for which no information about
income bracket is available. On the basis of the classifications in the training set,
the algorithm would assign classifications to the new records. For example, a
63-year-old female professor might be classified in the high-income bracket.
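The sketch below mirrors that procedure with a decision tree from scikit-learn;
the handful of training records and their encoding are invented purely for
illustration.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Tiny made-up training set: predictors plus an already-classified target
    train = pd.DataFrame({
        "age":        [47, 28, 35, 63, 22, 51],
        "gender":     ["F", "M", "M", "F", "F", "M"],
        "occupation": ["software", "marketing", "unemployed", "professor", "student", "manager"],
        "income":     ["high", "middle", "low", "high", "low", "middle"],   # target
    })

    # Categorical predictors are one-hot encoded so the tree can use them
    X_train = pd.get_dummies(train[["age", "gender", "occupation"]])
    y_train = train["income"]
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # "learns" the combinations

    # Classify a new record (a 63-year-old female professor)
    new = pd.DataFrame({"age": [63], "gender": ["F"], "occupation": ["professor"]})
    X_new = pd.get_dummies(new).reindex(columns=X_train.columns, fill_value=0)
    print(clf.predict(X_new))   # e.g., ['high']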

1.7.5. Clustering Task


Clustering refers to the grouping of records, observations, or cases into classes
of similar objects. A cluster is a collection of records that are similar to one another,
and dissimilar to records in other clusters. Clustering differs from classification in
that there is no target variable for clustering. The clustering task does not try
to classify, estimate, or predict the value of a target variable. Instead,
clustering algorithms seek to segment the whole data set into relatively
homogeneous subgroups or clusters, where the similarity of the records within
the cluster is maximized, and the similarity to records outside of this cluster is
minimized.
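As a small illustration, the following sketch segments a few invented customer
records into clusters with k-means from scikit-learn; the number of clusters (3)
is chosen arbitrarily here.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customers described by [annual_income, spending_score]; no target variable
    X = np.array([[15, 39], [16, 81], [17, 6], [18, 77], [19, 40],
                  [70, 90], [72, 88], [75, 10], [78, 15], [80, 92]])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)          # each record is assigned to a cluster

    print(labels)                           # cluster membership per customer
    print(kmeans.cluster_centers_)          # cluster centers (prototypes of each segment)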

Clustering is often performed as a preliminary step in a data mining process, with


the resulting clusters being used as further inputs into a different technique
downstream, such as neural networks.
Note: Clustering will be further discussed in the upcoming chapters

1.7.6. Association Task


The association task for data mining is the job of finding which attributes “go
together.” Most prevalent in the business world, where it is known as affinity
analysis or market basket analysis, the task of association seeks to uncover rules
for quantifying the relationship between two or more attributes.

Association rules are of the form “If antecedent then consequent,” together with a
measure of the support and confidence associated with the rule.

For example, a particular supermarket may find that, of the 1000 customers
shopping on a Thursday night, 200 bought diapers, and of those 200 who
bought diapers, 50 bought beer. Thus, the association rule would be “If buy
diapers, then buy beer,” with a support of 200/1000 =20% and a confidence of
50/200=25%.
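The same arithmetic can be verified with a few lines of Python; the transaction
list below is invented, but the support and confidence formulas are exactly
those used above.

    # Hypothetical market-basket data: each inner list is one customer's transaction
    transactions = [
        ["diapers", "beer", "milk"],
        ["diapers", "bread"],
        ["diapers", "beer"],
        ["bread", "milk"],
        ["diapers", "milk"],
    ]

    n = len(transactions)
    diapers = sum("diapers" in t for t in transactions)
    diapers_and_beer = sum("diapers" in t and "beer" in t for t in transactions)

    # Rule: "If buy diapers, then buy beer"
    support = diapers_and_beer / n           # fraction of all transactions containing both items
    confidence = diapers_and_beer / diapers  # fraction of diaper transactions that also contain beer
    print(f"support = {support:.0%}, confidence = {confidence:.0%}")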

Examples of association tasks in business and research include:


 investigating the proportion of subscribers to your company’s cell phone
plan that respond positively to an offer of a service upgrade;
 examining the proportion of children whose parents read to them who
are themselves good readers;
 predicting degradation in telecommunications networks;
 finding out which items in a supermarket are purchased together, and
which items are never purchased together;
 Determining the proportion of cases in which a new drug will exhibit
dangerous side effects.
Note: Data Mining task will be further discussed in the upcoming chapters

1.8. Technologies in data mining
As a highly application-driven domain, data mining has incorporated many
techniques from other domains such as statistics, machine learning, pattern
recognition, database and data warehouse systems, information retrieval, visualization,
algorithms, high performance computing, and many application domains (Figure
5.0).

The interdisciplinary nature of data mining research and development


contributes significantly to the success of data mining and its extensive
applications. In this section, we give examples of several disciplines that
strongly influence the development of data mining methods.

Figure 5.0: Data mining adopts techniques from many domains: statistics, machine
learning, pattern recognition, database systems, data warehouses, information
retrieval, visualization, algorithms, high-performance computing, and applications.

1.8.1. Statistics in data mining


Statistics studies the collection, analysis, interpretation or explanation, and
presentation of data. Data mining has an inherent connection with statistics.

For example, in data mining tasks like data characterization and


classification, statistical models of target classes can be built. In other words,
such statistical models can be the outcome of a data mining task.

Statistics is useful for mining various patterns from data as well as for
understanding the underlying mechanisms generating and affecting the
patterns. Inferential statistics (or predictive statistics) models data in a way that
accounts for randomness and uncertainty in the observations and is used to
draw inferences about the process or population under investigation.

Applying statistical methods in data mining is far from trivial. Often, a serious
challenge is how to scale up a statistical method over a large data set.

1.8.2. Machine Learning


Machine learning investigates how computers can learn (or improve their
performance) based on data. A main research area is for computer
programs to automatically learn to recognize complex patterns and make
intelligent decisions based on data.

For example, a typical machine learning problem is to program a computer


so that it can automatically recognize handwritten postal codes on mail after
learning from a set of examples.

Here, we illustrate classic problems in machine learning that are highly


related to data mining:

1. Supervised Learning is basically a synonym for classification. The supervision


in the learning comes from the labeled examples in the training data set.
 For example, in the postal code recognition problem, a set of
handwritten postal code images and their corresponding machine-
readable translations are used as the training examples, which
supervise the learning of the classification model.

2. Unsupervised Learning is essentially a synonym for clustering. The learning


process is unsupervised since the input examples are not class labeled.
 For example, an unsupervised learning method can take, as input, a
set of images of handwritten digits. Suppose that it finds 10 clusters of
data. These clusters may correspond to the 10 distinct digits of 0 to 9,
respectively. However, since the training data are not labeled, the
learned model cannot tell us the semantic meaning of the clusters
found.

3. Semi-supervised Learning is a class of machine learning techniques that


make use of both labeled and unlabeled examples when learning a
model. In one approach, labeled examples are used to learn class
models and unlabeled examples are used to refine the boundaries
between classes.

4. Active Learning is a machine learning approach that lets users play an active
role in the learning process. An active learning approach can ask a user
(e.g., a domain expert) to label an example, which may be from a set of
unlabeled examples or synthesized by the learning program. The goal is to
optimize the model quality by actively acquiring knowledge from human
users, given a constraint on how many examples they can be asked to
label.

You can see there are many similarities between data mining and machine
learning. For classification and clustering tasks, machine learning research
often focuses on the accuracy of the model.
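To see the contrast in code, the sketch below runs a supervised classifier and an
unsupervised clustering method on the small handwritten-digits dataset that ships
with scikit-learn; it is meant only to illustrate the presence and absence of class
labels, not a recommended workflow.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.cluster import KMeans

    X, y = load_digits(return_X_y=True)                 # images (X) and their digit labels (y)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Supervised learning: the labels y_tr "supervise" the training of a classifier
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    print("classification accuracy:", round(clf.score(X_te, y_te), 2))

    # Unsupervised learning: k-means sees only X; the 10 clusters carry no digit names
    kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
    print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(10)])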

1.8.3. Database Systems and Data Warehouses


Database systems research focuses on the creation, maintenance, and use of
databases for organizations and end-users.
A data warehouse integrates data originating from multiple sources and various
timeframes.
Many data mining tasks need to handle large data sets or even real-time, fast
streaming data. Therefore, data mining can make good use of scalable
database technologies to achieve high efficiency and scalability on large
data sets. Moreover, data mining tasks can be used to extend the capability
of existing database systems to satisfy advanced users’ sophisticated data
analysis requirements.

1.8.4. Information Retrieval


Information retrieval (IR) is the science of searching for documents or
information in documents.

The differences between information retrieval and database systems are


twofold: Information retrieval assumes that:

1. The data under search are unstructured;

2. The queries are formed mainly by keywords, which do not have complex
structures (unlike SQL queries in database systems).

1.9. The UCI Repository of Datasets
Most of the commercial datasets used by companies for data mining are,
unsurprisingly, not available for others to use. However, there are a number of
‘libraries’ of datasets that are readily available for downloading from the
World Wide Web free of charge by anyone.

The best known of these is the ‘Repository’ of datasets maintained by the


University of California at Irvine, generally known as the ‘UCI Repository’.

The URL for the Repository is: https://archive.ics.uci.edu/ml/index.php


You can download any of these datasets to use throughout this course.

It contains thousands of datasets on topics as diverse as predicting the age


of abalone from physical measurements, predicting good and bad credit
risks, classifying patients with a variety of medical conditions and learning
concepts from the sensor data of a mobile robot. Some datasets are
complete, i.e. include all possible instances, but most are relatively small
samples from a much larger number of possible instances. Datasets with missing
values and noise are included.
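For instance, the classic iris dataset can be loaded directly from the Repository
with pandas; the file path below follows the Repository's usual layout at the
time of writing and may change.

    import pandas as pd

    # Direct link to the raw iris data file in the UCI Repository (path may change over time)
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

    iris = pd.read_csv(url, header=None, names=columns)
    print(iris.shape)                  # number of instances and attributes
    print(iris["class"].value_counts())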

Important Note: In the great majority of cases the datasets in the UCI
Repository give good results when processed by standard algorithms.
Datasets that lead to poor results tend to be associated with unsuccessful
projects and so may not be added to the Repository. The achievement of good
results with selected datasets from the Repository is no guarantee of the
success of a method with new data, but experimentation with such datasets
can be a valuable step in the development of new methods.

MODULE 1
ACTIVITY 1:

[ACTIVITY #1 AND ACTIVITY TITLE HERE]

A Partial Requirement In

DATA MINING –ISP112

[date submitted here]

[STUDENT NAME HERE]


[course, year, section]

WILBERT P. BENEDICTO, MIS


Subject Instructor

CHAPTER 2
GETTING TO KNOW YOUR DATA:
THE DATA-WAREHOUSE

CHAPTER 2 introduces the difference between an operational database
system and a data warehouse system and how they differ in storing data. This chapter
also introduces the OLAP system and the OLTP system. It also discusses data
objects, attributes, and general data features.

CHAPTER 2: GETTING TO KNOW
YOUR DATA
2.1. The Data Warehouse
A data warehouse means different things to different people. Some definitions are
limited to data; others refer to people, processes, software, tools, and data. One
of the global definitions is the following:
“The data warehouse is a collection of integrated, subject-oriented databases designed to
support the decision-support functions (DSF), where each unit of data is relevant to some
moment in time.”

The existence of a data warehouse is not a prerequisite for data mining; in practice,
the task of data mining, especially for some large companies, is made a lot
easier by having access to a data warehouse. A primary goal of a data
warehouse is to increase the “intelligence” of a decision process and the knowledge
of the people involved in this process.

A data warehouse can be viewed as an organization’s repository of data, set up


to support strategic decision-making. The function of the data warehouse is to
store the historical data of an organization in an integrated manner that reflects the
various facets of the organization and business. Typically, data warehouses are
huge, storing billions of records.

2.2. Data Warehouse System vs. Operational Database System


Because most people are familiar with operational database systems, it is easy to understand
what a data warehouse is by comparing these two kinds of systems.

The major task of operational database systems is to perform transaction and query
processing. These systems are called online transaction processing (OLTP) systems.
They cover most of the day-to-day operations of an organization such as
purchasing, inventory, manufacturing, banking, payroll, registration, and
accounting.

Data warehouse systems, on the other hand, serve users or knowledge workers in the
role of data analysis and decision making. Such systems can organize and present
data in various formats in order to accommodate the diverse needs of
different users. These systems are known as online analytical processing (OLAP)
systems.

2.3. OLTP vs. OLAP
Because operational databases store huge amounts of data, you may
wonder, “Why not perform online analytical processing directly on such
databases instead of spending additional time and resources to construct a
separate data warehouse?” A major reason for such a separation is to help
promote the high performance of both systems.

An operational database is designed and tuned from known tasks and


workloads like indexing and hashing using primary keys, searching for particular
records, and optimizing “canned” queries. On the other hand, data
warehouse queries are often complex. They involve the computation of large
data groups at summarized levels, and may require the use of special data
organization, access, and implementation methods based on multidimensional
views.

The separation of operational databases from data warehouses is based on


the different structures, contents, and uses of the data in these two systems.

Decision support requires historic data, whereas operational databases do not


typically maintain historic data.
Table 3: Comparison of OLTP and OLAP System

Feature          OLTP                                 OLAP
Characteristic   operational processing               informational processing
Orientation      transaction                          analysis
Access           read/write                           mostly read
Focus            data in                              information out
Summarization    primitive, highly detailed           summarized, consolidated
Function         day-to-day operations                long-term informational requirements, decision support
User             clerk, DBA, database professional    knowledge worker (e.g., manager, executive, analyst)
Data             current, guaranteed up-to-date       historic; accuracy maintained over time

However, many vendors of operational relational database management


systems are beginning to optimize such systems to support OLAP queries. As this
trend continues, the separation between OLTP and OLAP systems is expected
to decrease.

2.4. The Data Representation
It’s tempting to jump straight into mining, but first, we need to get the data
ready. This involves having a closer look at attributes and data values. Real-
world data are typically noisy, enormous in volume (often several terabytes or
more), and may originate from a hodge-podge of heterogeneous sources.

This chapter is about getting familiar with your data. Knowledge about your
data is useful for data preprocessing in the next chapter which is the first major
task of the data mining process.

You will want to know the following:


1. What are the types of attributes or fields that make up your data?
2. What kind of values does each attribute have?
3. Which attributes are discrete, and which are continuous-valued?
4. What do the data look like?
5. How are the values distributed?
6. Are there ways we can visualize the data to get a better sense of it all?
7. Can we spot any outliers?
8. Can we measure the similarity of some data objects with respect to others?

Gaining such insight into the data will help with the subsequent analysis.

2.4.1. Data Objects and Attribute types


Datasets are made up of data objects (Table 4.0). A data object represents an
entity—in a sales database, the objects may be customers, store items, and
sales; in a medical database, the objects may be patients; in a university
database, the objects may be students, professors, and courses.
Data objects are typically described by attributes. Data objects can also be
referred to as samples, examples, instances, data points, or objects. If the data
objects are stored in a database, they are data tuples. That is, the rows of a
database correspond to the data objects, and the columns correspond to
the attributes.

Table 4.0: Store Sales

store_id   sales    state     status
s_1001     $6,500   Kansas    open
s_1002     $7,400   Alabama   open
s_1003     $6,920   Texas     close
…          …        …         …

(Each column is an attribute; each row is a data object, also called an observation, instance, or data point; the table as a whole is the dataset.)
What is an attribute?
An attribute is a data field, representing a characteristic or feature of a data
object. Attributes describing a customer object, for example, can include
customer ID, name, and address. Observed values for a given attribute are
known as observations. A set of attributes used to describe a given object is
called an attribute vector (or feature vector). The distribution of data
involving one attribute (or variable) is called univariate. A bivariate
distribution involves two attributes, and so on.
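In pandas terms, the rows of a DataFrame play the role of data objects and the
columns play the role of attributes; the sketch below rebuilds a store-sales table
similar to Table 4.0 (with the same toy values) and inspects its attributes.

    import pandas as pd

    # Rows = data objects (observations/instances); columns = attributes
    stores = pd.DataFrame({
        "store_id": ["s_1001", "s_1002", "s_1003"],
        "sales":    [6500, 7400, 6920],
        "state":    ["Kansas", "Alabama", "Texas"],
        "status":   ["open", "open", "close"],
    })

    print(list(stores.columns))   # the attribute (feature) names
    print(stores.dtypes)          # rough attribute types inferred by pandas
    print(stores.iloc[0])         # one data object and its attribute vector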

2.4.2. Nominal Attributes


Nominal means “relating to names.” The values of a nominal attribute are
symbols or names of things. Each value represents some kind of category, code,
or state, and so nominal attributes are also referred to as categorical. The values
do not have any meaningful order.

Table 5.0: Student Enrollment


student_id course age status
s_1001 BS Infotech. 18 Enrolled
s_1002 BS InfoSys. 22 Pending
s_1003 BS Infotech. 19 Enrolled
… … … …

Example in Table 5.0: the attributes ‘course’ and ‘status’ are nominal (categorical)
attributes. The values do not have any meaningful order and are not
quantitative in nature, so it makes no sense to find the mean (average) value or
median (middle) value for such an attribute, given a set of objects.

2.4.3. Binary Attributes


A binary attribute is a nominal attribute with only two categories or states: 0 or 1,
where 0 typically means that the attribute is absent, and 1 means that it is
present. Binary attributes are referred to as Boolean if the two states correspond
to true and false.
Table 6: Covid19 Patient Test Result
patient_id age region test_result
p_1001 19 Kansas 1
p_1002 30 Alabama 0
p_1003 60 Texas 0
… … … …

Example in Table 6.0: the attribute ‘test_result’ can be considered a binary
attribute: in this example, a Covid19 patient who undergoes a test can
be either positive (1) or negative (0).

Important Note:
A binary attribute is symmetric if both of its states are equally valuable and
carry the same weight; that is, there is no preference on which outcome
should be coded as 0 or 1. One such example could be the attribute gender
having the states male and female.
A binary attribute is asymmetric if the outcomes of the states are not equally
important, such as the positive and negative outcomes of a medical test for
HIV. By convention, we code the most important outcome, which is usually
the rarest one, by 1 (e.g., Covid19 positive) and the other by 0 (e.g., Covid19
negative).

2.4.4. Ordinal Attributes


An ordinal attribute is an attribute with possible values that have a meaningful
order or ranking among them, but the magnitude between successive values
is not known.
Suppose that drink size corresponds to the size of drinks available at a fast-
food restaurant. This attribute has three possible values: small,
medium, and large. The values have a meaningful sequence (which
corresponds to increasing drink size); however, the magnitude between
successive values is not known.
Ordinal attributes are useful for registering subjective assessments of qualities that
cannot be measured objectively; thus ordinal attributes are often used in surveys
for ratings. In one survey, participants were asked to rate how satisfied they
were as customers. Customer satisfaction had the following ordinal
categories: 0: very dissatisfied, 1: somewhat dissatisfied, 2: neutral, 3: satisfied,
and 4: very satisfied.

Table 7.0: Customer review

cust_id product_id store star_rating


c_1001 p_102 Kansas 5
c_1002 p_102 Alabama 4
c_1003 p_103 Texas 2
… … … …

Example in Table 7.0: the attribute ‘star_rating’ is an ordinal attribute, for it
represents an order or ranking from 5 to 1 (5 being the highest).
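If such ratings are handled in pandas, an ordered categorical type keeps the
ranking explicit; a minimal sketch, using the made-up survey categories above:

    import pandas as pd

    ratings = pd.Series(["satisfied", "very dissatisfied", "neutral", "very satisfied", "neutral"])

    # Declare the values as ordered categories, from lowest to highest satisfaction
    levels = ["very dissatisfied", "somewhat dissatisfied", "neutral", "satisfied", "very satisfied"]
    ordered = pd.Categorical(ratings, categories=levels, ordered=True)

    print(ordered.codes)                      # rank codes 0..4 (the order is meaningful)
    print(pd.Series(ordered).value_counts())  # counts per category (a mean would not be meaningful)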

2.4.5. Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values. Numeric attributes can be interval-scaled or
ratio-scaled.
 Interval-scaled attributes are measured on a scale of equal-size units. The
values of interval-scaled attributes have order and can be positive, 0, or
negative. Thus, in addition to providing a ranking of values, such attributes
allow us to compare and quantify the difference between values.

Table 8.0: Outdoor Temperature (State: Kansas)

day         Time      Weather         Temperature (Celsius)
Monday      9:00am    Cloudy          35
Tuesday     12:00pm   Partly-cloudy   38
Wednesday   3:00pm    Sunny           40
…           …         …               …

Example in table 8.0: A ‘temperature’ attribute is interval-scaled. Suppose


that we have the outdoor temperature value for a number of different
days, where each day is an object. By ordering the values, we obtain a
ranking of the objects with respect to temperature. In addition, we can quantify
the difference between values.
For example, a temperature of 38◦C is three degrees higher than a
temperature of 35◦C. Calendar dates and times are
another example. For instance, the days Monday and Wednesday are
two (2) days apart.

 A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That
is, if a measurement is ratio-scaled, we can speak of a value as being a
multiple (or ratio) of another value. In addition, the values are ordered, and
we can also compute the difference between values, as well as the mean,
median, and mode.

Table 9.0: Product Delivery Time

product_id   Name         store     delivery_time (from day 0)
p_1001       Shoes        Kansas    3
p_1002       Smartphone   Alabama   7
p_1003       powerbank    Texas     6
…            …            …         …

Example in Table 9.0: the attribute ‘delivery_time’ is ratio-scaled, starting
from day zero (0). In the example, the product ‘Shoes’ has a delivery time from
day 0 to day 3. Other examples of ratio-scaled attributes include count
attributes such as years of job experience and the number of words in a
document.

2.4.6. Discrete and Continuous Attributes

There are two types of quantitative data, also referred to as numeric
data: continuous and discrete. As a general rule, counts are discrete
and measurements are continuous.

 Discrete data is a count that can't be made more precise. Typically it


involves integers. For instance, the number of children (or adults, or pets)
in your family is discrete data, because you are counting whole, indivisible
entities.

Table 10.0: Family Members per Household

household_id   family_members   num_of_pets   since
h_1001         4                1             2009
h_1002         5                4             2011
h_1003         3                2             2012
…              …                …             …

Example in Table 10.0: the attributes ‘family_members’ and ‘num_of_pets’
represent discrete data, since you can't have 2.5 kids or
1.3 pets.

 Continuous data, on the other hand, could be divided and reduced to


finer and finer levels. For example, you can measure the height of your
kids at progressively more precise scales—meters, centimeters, millimeters,
and beyond—so height is continuous data.

Table 11.0: Members of Household h_1001

household_id   family_member_id   Remark    BMI    age
h_1001         fm_1               Mother    25.6   34
h_1001         fm_2               Father    26.9   32
h_1001         fm_3               Sibling   18.4   12
…              …                  …         …      …

Example in Table 11.0: the attribute ‘BMI’ represents continuous data, since
BMI can be measured to finer and finer precision. Height and weight are other examples.

MODULE 1
ACTIVITY 2:

[ACTIVITY #2 AND ACTIVITY TITLE HERE]

A Partial Requirement In

DATA MINING –ISP112

[date submitted here]

[STUDENT NAME HERE]


[course, year, section]

WILBERT P. BENEDICTO, MIS


Subject Instructor

