Module 1-Data Mining Introduction (Student Edition)
MODULE 1
An Introduction (Weeks 1 to 4)
INSTRUCTOR 1
The course will cover all these issues and will illustrate the whole process with
examples. Special emphasis will be given to machine learning methods, as they
provide the real knowledge discovery tools. Important related technologies,
such as data warehousing and online analytical processing (OLAP), will also be
discussed.
DATA MINING: COURSE INTENDED
LEARNING OUTCOMES
1. Understand the motivation and the application of data mining in the 21st
century and how it encouraged the move into the era of data-driven
applications.
2. Articulate the concepts and various forms of data warehouses and online
analytical processing (OLAP) and their role in data mining.
3. Demonstrate skills in data preprocessing and recommend suitable
preprocessing methods for dealing with various data problems.
4. Demonstrate skills in data visualization and create informative dashboards
to support fact-based decision making.
5. Familiarize with the stages of CRISP-DM and understand the purpose of using
CRISP-DM as the methodology in a data mining task.
6. Demonstrate skill in association rule mining using the Apriori algorithm and
in unsupervised learning using k-means clustering, and how they are applied to
real-world data.
Table of Contents
Chapter 1: 21ST CENTURY: MOVING TOWARDS THE INFORMATION AGE
1.1. The Data Explosion
1.2. What is Data Mining?
1.3. The 4V's of Big Data
1.3.1. Why is it Important?
1.4. Data Mining Motivations and Objective
1.5. Data mining Phases
1.6. Various Kinds of Data that can be mined
1.7. Data Mining Tasks
1.7.1. Description
1.7.2. Estimation
1.7.3. Prediction
1.7.4. Classification
1.7.5. Clustering
1.7.6. Association
1.8. Technologies in data mining
1.8.1. Statistics
1.8.2. Machine Learning
1.8.3. Database systems and warehouses
1.8.4. Information retrieval
1.9. The UCI Repository of Datasets
CHAPTER 1:
THE INFORMATION AGE
We live in a world where vast amounts of data are collected daily. Analyzing
such data is an important need. We look at how data mining can meet this
need by providing tools to discover knowledge from data, and we will observe
how data mining can be viewed as a result of the natural evolution of
information technology.
CHAPTER 1 provides an introduction to the multidisciplinary field of
data mining. It discusses the evolutionary path of information technology,
which has led to the need for data mining, and the importance of its
applications.
21ST CENTURY:
MOVING TOWARDS THE INFORMATION AGE
“We are living in the information age” is a popular saying; however, we are
actually living in the data age. Terabytes or petabytes of data pour into our
computer networks, the World Wide Web (WWW), and various data storage
devices every day from business, society, science and engineering, medicine,
and almost every other aspect of daily life. This explosive growth of available data
volume is a result of the computerization of our society and the fast
development of powerful data collection and storage tools. Businesses worldwide
generate gigantic data sets, including sales transactions, stock trading records,
product descriptions, sales promotions, company profiles and performance, and
customer feedback.
For example, large stores, such as Wal-Mart (in the US) or local stores (e.g., PureGold),
handle hundreds of millions of transactions per week at thousands of branches
around the country or the world. Scientific and engineering practices generate
high orders of petabytes of data in a continuous manner, from remote sensing,
process measuring, scientific experiments, system performance, engineering
observations, and environment surveillance. Global backbone
telecommunication networks carry tens of petabytes of data traffic every day. The
medical and health industry generates tremendous amounts of data from medical
records, patient monitoring, and medical imaging.
1.2. What is Data Mining?
Data mining turns a large collection of data into knowledge.
Search engines such as Google receive hundreds of millions of search queries
every day. Each query can be viewed as a transaction where the user
describes her or his information need. What novel and useful knowledge can
a search engine learn from such a huge collection of queries collected from
users over time? Interestingly, some patterns found in user search queries can
disclose invaluable knowledge that cannot be obtained by reading
individual data items alone.
For example, Google’s Flu Trends uses specific search terms as indicators of
flu activity. It found a close relationship between the number of people who
search for flu-related information and the number of people who actually
have flu symptoms. A pattern emerges when all of the search queries related to
flu are aggregated. Using aggregated Google search data, Flu Trends can
estimate flu activity up to two weeks faster than traditional systems can. This
example shows how data mining can turn a large collection of data into
knowledge that can help meet a current global challenge.
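To make the idea concrete, here is a minimal sketch (in Python, with made-up weekly counts) of how aggregated query volumes can be compared against reported cases: individual queries are noisy, but their weekly aggregate can track the flu curve closely.

```python
# Hypothetical weekly aggregates: flu-related query counts vs. reported cases.
flu_queries_per_week = [120, 180, 420, 950, 1400, 900, 400]
reported_cases_per_week = [10, 15, 40, 95, 150, 88, 35]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# A value near 1.0 means the aggregated queries rise and fall with the cases.
print(round(pearson_r(flu_queries_per_week, reported_cases_per_week), 3))
```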
Many regard “data mining” as a misnomer: a more accurate name would be
“knowledge mining from data”, but the shorter term, which carries both “data”
and “mining”, became the popular choice. There are many other terms carrying
a similar or slightly different meaning to data mining, such as knowledge
mining from databases, knowledge extraction, data/pattern analysis, data
archaeology, and data dredging.
1.3. The 4Vs of Big Data (Volume, Velocity, Veracity, Variety)
Big data is a term that describes the large volume of data – both structured and
unstructured – that inundates a business on a day-to-day basis. But it’s not the
amount of data that’s important. It’s what organizations do with the data that
matters.
The term “big data” refers to data that is so large, fast or complex that it’s
difficult or impossible to process using traditional methods. The act of accessing and
storing large amounts of information for analytics has been around a long time.
To do this, companies need to understand the 4Vs of big data – volume, velocity,
variety, and veracity (figure 1.0) – and develop tools and processes to manage
data and turn it into actionable insights.
Figure 1.0: The 4Vs of big data: volume (amount of data), velocity (speed of data), variety (types of data), and veracity (quality of data).
Volume
Volume refers to the amount of data being generated, and in the age of big data,
more data is being generated every minute than ever before.
Velocity
Velocity refers to the speed of the data being generated and the rate at which it’s
being processed in terms of both collection and analysis.
Veracity
Veracity refers to the quality, reliability or uncertainty of the data. Is the data
trustworthy? Is it outdated? Has it been modified in any way? Basically, is it
accurate? Data must be cleaned, current, and of high quality and reliability for
it to be accurately analyzed.
Variety
Variety refers to the broad range of different types of data that can come from many
different sources. Today, data comes not only from computers, but also devices
such as smartphones and appliances, among others. Additionally, with the
popularity of social media and other online platforms, vast amounts of
unstructured data are being created (e.g., tweets, photos, videos, social
media posts, online comments, etc.).
1.4. Data Mining Motivations and Objectives
Data mining has attracted a great deal of attention in the information industry
and in society as a whole in recent years, due to the wide availability of huge
amounts of data and the imminent need for turning such data into useful
information and knowledge. The information and knowledge gained can be
used for applications ranging from market analysis, fraud detection, and
customer retention, to production control and science exploration.
Huge volumes of data have been accumulated beyond databases and data
warehouses. The early development of data collection and database creation
mechanisms served as a prerequisite for the later development of effective
mechanisms for data storage and retrieval, as well as query and transaction
processing. Nowadays numerous database systems offer query and
transaction processing as common practice. Advanced data analysis has naturally
become the next step.
In summary, the abundance of data, coupled with the need for powerful data
analysis tools, has been described as a data rich but information poor situation.
The fast-growing, tremendous amount of data, collected and stored in large
and numerous data repositories, has far exceeded our human ability for
comprehension without powerful tools. As a result, data collected in large data
repositories become “data tombs”—data archives that are seldom visited.
The widening gap between data and information calls for the systematic
development of data mining tools that can turn data tombs into “golden
nuggets” of knowledge.
Figure 2.0: The evolution of database technology began with data collection and database creation (1960s and earlier), when primitive file processing was the norm.
1.5. Data Mining Phases
Many people treat data mining as a synonym for another popularly used term,
Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as
simply an essential step in the process of knowledge discovery. Knowledge
discovery as a process is depicted in Figure 3.0 and consists of an iterative
sequence of the following steps:
1. Data cleaning (removing noise and inconsistent data)
2. Data integration (combining multiple data sources)
3. Data selection (retrieving the data relevant to the analysis task)
4. Data transformation (consolidating data into forms appropriate for mining)
5. Data mining (applying intelligent methods to extract data patterns)
6. Pattern evaluation (identifying the truly interesting patterns that represent knowledge)
7. Knowledge presentation (presenting the mined knowledge to users, e.g., through visualization)
Steps 1 to 4 are different forms of data preprocessing, where the data are
prepared for mining. The data mining step may interact with the user or a
knowledge base. The interesting patterns are presented to the user and may be
stored as new knowledge in the knowledge base. Note that according to this
view, data mining is only one step in the entire process, but an essential one
because it uncovers hidden patterns for evaluation.
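As an illustration only (the module does not prescribe a toolkit), the sketch below uses scikit-learn to show preprocessing and mining as consecutive, chained steps of one process:

```python
# A minimal sketch of "preprocessing, then mining" as one chained process.
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

kdd_like = Pipeline([
    ("clean", SimpleImputer(strategy="mean")),  # data cleaning: fill in missing values
    ("transform", StandardScaler()),            # data transformation: rescale attributes
    ("mine", DecisionTreeClassifier()),         # data mining: learn patterns
])
kdd_like.fit(X, y)
print(kdd_like.score(X, y))  # pattern evaluation (here, simply training accuracy)
```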
We can agree that data mining is a step in the knowledge discovery process.
However, in industry, in media, and in the database research environment, the
term data mining is becoming more popular than the longer term of
knowledge discovery from data. Therefore, in this module, we choose to use
the term data mining. We adopt a broad view of data mining functionality:
data mining is the process of discovering interesting knowledge from large
amounts of data stored in databases, data warehouses, or other information
repositories.
Figure 3.0: Data mining as a step in the knowledge discovery process: patterns mined from various data sources are evaluated and interpreted to yield knowledge.
1.6. Various kinds of Data that can be mined
In principle, data mining should be applicable to any kind of data repository, as well
as to transient data, such as data streams. Thus, the scope of our examination
of data repositories will include relational databases, data warehouses,
transactional databases, advanced database systems, flat files, data streams,
and the World Wide Web.
The challenges and techniques of mining may differ for each of the repository
systems. Although this module assumes that you have basic knowledge of
information systems, we provide a brief introduction to each of the major data
repository systems listed above.
Relational databases are one of the most commonly available and rich
information repositories, and thus they are a major data form in our study of
data mining.
1.6.1. Data Warehouses
A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and usually residing at a single site.
Figure 4.0: Typical framework of a data warehouse: data from multiple sources (e.g., a data source from Vancouver) are consolidated for access by clients.
1.6.2. Transactional Databases
In general, a transactional database consists of a file where each record
represents a transaction. A transaction typically includes a unique transaction
identity number (trans_ID) and a list of the items making up the transaction
(such as items purchased in a store).
Table 1.0: Fragment of a transactional database for sales
Trans_ID    List of Item_IDs
T100        I1, I2, I4, I277
T101        I2, I5, I89
...         ...
The transactional database may have additional tables associated with it,
which contain other information regarding the sale, such as the date of the
transaction, the customer ID number, the ID number of the salesperson and
of the branch at which the sale occurred, and so on.
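For illustration, a transactional database of this shape can be sketched in Python as a mapping from transaction IDs to item lists, mirroring Table 1.0 (the item IDs are the hypothetical ones from the table):

```python
# A transactional database as an in-memory structure: one record per transaction.
transactions = {
    "T100": ["I1", "I2", "I4", "I277"],
    "T101": ["I2", "I5", "I89"],
}

# A simple scan: which transactions contain item I2?
hits = [tid for tid, items in transactions.items() if "I2" in items]
print(hits)  # ['T100', 'T101']
```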
The new database applications include handling spatial data (such as maps),
engineering design data (such as the design of buildings, system components, or
integrated circuits), hypertext and multimedia data (including text, image, video,
and audio data), time-related data (such as historical records or stock
exchange data), stream data (such as video surveillance and sensor data,
where data flow in and out like streams), and the World Wide Web (a huge,
widely distributed information repository made available by the Internet).
1.6.3. Temporal Databases, Sequence Databases, and Time-Series Databases
A temporal database typically stores relational data that include time-related
attributes. These attributes may involve several timestamps, each having
different semantics.
For instance, the mining of banking data may aid in the scheduling of bank
tellers according to the volume of customer traffic. Stock exchange data can
be mined to uncover trends that could help you plan investment strategies.
Multimedia databases store image, audio, and video data. They are used in
applications such as picture content-based retrieval, voice-mail systems,
video-on-demand systems, the World Wide Web, and speech-based user
interfaces that recognize spoken commands.
Because video and audio data require real-time retrieval at a steady and
predetermined rate in order to avoid picture or sound gaps and system
buffer overflows, such data are referred to as continuous-media data.
Mining data streams involves the efficient discovery of general patterns and
dynamic changes within stream data. For example, we may like to detect
intrusions of a computer network based on the anomaly of message flow,
which may be discovered by clustering data streams, dynamic construction
of stream models, or by comparing the current frequent patterns with those at a
certain previous time.
1.6.4. The World Wide Web
The World Wide Web and its associated distributed information services, such
as Yahoo! and Google, provide rich, worldwide, online information
services, where data objects are linked together to facilitate interactive
access.
For example, understanding user access patterns will not only help improve
system design but also lead to better marketing decisions. Capturing user
access patterns in such distributed information environments is called Web
usage mining (Weblog mining).
Web mining is the development of scalable and effective Web data analysis
and mining methods. It may help us learn about the distribution of information
on the Web in general, characterize and classify Web pages, and uncover
Web dynamics and the association and other relationships among different
Web pages, users, communities, and Web-based activities.
1.7. Data Mining Tasks
The following listing shows the most common data mining tasks.
1. Description
2. Estimation
3. Prediction
4. Classification
5. Clustering
6. Association
Data mining models should be as transparent as possible. That is, the results of the
data mining model should describe clear patterns that are amenable to
intuitive interpretation and explanation.
In estimation, a model is built using complete records, which provide the value
of the target variable. Then, for new data (observations), estimates of the
value of the target variable are made, based on the values of the predictors.
Any of the methods and techniques used for classification and estimation
may also be used, under appropriate circumstances, for prediction.
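As a minimal sketch of this workflow (the data and the choice of ordinary least squares are illustrative assumptions, not the module's prescribed method):

```python
# Estimation: fit a model on records with known target values, then
# estimate the target for a new observation from its predictor values.
from sklearn.linear_model import LinearRegression

X_train = [[25], [35], [45], [55]]           # predictor: age (hypothetical)
y_train = [30_000, 45_000, 60_000, 75_000]   # target: income (hypothetical)

model = LinearRegression().fit(X_train, y_train)
print(model.predict([[40]]))  # estimated income for a new 40-year-old
```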
For example, consider the excerpt from a data set in Table 2.0.
Table 2.0: Excerpt from dataset for classifying income
Suppose the researcher would like to be able to classify the income bracket of
new individuals, not currently in the above database, based on the other
characteristics associated with that individual, such as age, gender, and
occupation. This task is a classification task, very nicely suited to data mining
methods and techniques.
The algorithm would proceed roughly as follows. First, examine the data set
containing both the predictor variables and the (already classified) target
variable, income bracket. In this way, the algorithm “learns about” which
combinations of variables are associated with which income brackets. For
example, older females may be associated with the high-income bracket.
This data set is called the training set.
Then the algorithm would look at new records, for which no information about
income bracket is available. On the basis of the classifications in the training set,
the algorithm would assign classifications to the new records. For example, a
63-year-old female professor might be classified in the high-income bracket.
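The following sketch mirrors that workflow with a decision tree (an illustrative choice) and a made-up training set; the ages, occupations, and brackets are hypothetical stand-ins for Table 2.0:

```python
# A minimal sketch of classification: learn from labeled records, then
# assign a class to a new, previously unseen record.
from sklearn.tree import DecisionTreeClassifier

# hypothetical training set: (age, gender, occupation) -> income bracket
train = [(63, "F", "professor", "high"),
         (30, "M", "clerk",     "low"),
         (58, "F", "professor", "high"),
         (24, "M", "clerk",     "low")]

gender_code = {"F": 0, "M": 1}
occupation_code = {"professor": 0, "clerk": 1}

def encode(age, gender, occupation):
    """Turn one record into a numeric attribute vector."""
    return [age, gender_code[gender], occupation_code[occupation]]

X = [encode(a, g, o) for a, g, o, _ in train]
y = [bracket for _, _, _, bracket in train]

clf = DecisionTreeClassifier().fit(X, y)  # the algorithm "learns about" the training set
print(clf.predict([encode(63, "F", "professor")]))  # e.g., ['high']
```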
Association rules are of the form “If antecedent then consequent,” together with a
measure of the support and confidence associated with the rule.
For example, a particular supermarket may find that, of the 1000 customers
shopping on a Thursday night, 200 bought diapers, and of those 200 who
bought diapers, 50 bought beer. Thus, the association rule would be “If buy
diapers, then buy beer,” with a support of 200/1000 =20% and a confidence of
50/200=25%.
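The arithmetic of the rule can be reproduced in a few lines (using the support definition given above, i.e., the fraction of all baskets containing the antecedent):

```python
# Support and confidence for the rule "if buy diapers, then buy beer".
n_baskets = 1000
diapers = 200          # baskets containing diapers
diapers_and_beer = 50  # baskets containing both diapers and beer

support = diapers / n_baskets            # 200/1000 = 20%
confidence = diapers_and_beer / diapers  # 50/200  = 25%
print(f"support={support:.0%}, confidence={confidence:.0%}")
```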
1.8. Technologies in data mining
As a highly application-driven domain, data mining has incorporated many
techniques from other domains such as statistics, machine learning, pattern
recognition, database and data warehouse systems, information retrieval, visualization,
algorithms, high performance computing, and many application domains (Figure
5.0).
Figure 5.0: Technologies contributing to data mining: statistics, machine learning, pattern recognition, database systems, visualization, information retrieval, high-performance computing, and application domains.
Statistics is useful for mining various patterns from data as well as for
understanding the underlying mechanisms generating and affecting the
patterns. Inferential statistics (or predictive statistics) models data in a way that
accounts for randomness and uncertainty in the observations and is used to
draw inferences about the process or population under investigation.
Applying statistical methods in data mining is far from trivial. Often, a serious
challenge is how to scale up a statistical method over a large data set.
Active learning is a machine learning approach that lets users play an active
role in the learning process. An active learning approach can ask a user
(e.g., a domain expert) to label an example, which may be from a set of
unlabeled examples or synthesized by the learning program. The goal is to
optimize the model quality by actively acquiring knowledge from human
users, given a constraint on how many examples they can be asked to
label.
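A minimal sketch of one common active learning strategy, uncertainty sampling, is shown below; the data and the "domain expert" are simulated, and logistic regression is an illustrative model choice:

```python
# Active learning by uncertainty sampling: within a fixed labeling budget,
# repeatedly ask the oracle to label the example the model is least sure about.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
oracle = (X[:, 0] + X[:, 1] > 0).astype(int)  # simulated domain expert

labeled = [int(np.argmin(oracle)), int(np.argmax(oracle))]  # one example per class
pool = [i for i in range(200) if i not in labeled]          # unlabeled pool
budget = 20                                  # how many labels we may ask for

for _ in range(budget):
    model = LogisticRegression().fit(X[labeled], oracle[labeled])
    proba = model.predict_proba(X[pool])[:, 1]
    pick = pool[int(np.argmin(np.abs(proba - 0.5)))]  # closest to a 50/50 guess
    pool.remove(pick)
    labeled.append(pick)  # oracle labels it; the model retrains next iteration

print(model.score(X, oracle))
```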
You can see there are many similarities between data mining and machine
learning. For classification and clustering tasks, machine learning research
often focuses on the accuracy of the model.
Information retrieval (IR) is the science of searching for documents or for
information in documents. It differs from database systems in two ways: (1) the
data under search are unstructured, and (2) the queries are formed mainly by
keywords, which do not have complex structures (unlike SQL queries in
database systems).
1.9. The UCI Repository of Datasets
Most of the commercial datasets used by companies for data mining are,
unsurprisingly, not available for others to use. However, there are a number of
‘libraries’ of datasets that anyone can download from the World Wide Web free
of charge.
Important Note: In the great majority of cases the datasets in the UCI
Repository give good results when processed by standard algorithms.
Datasets that lead to poor results tend to be associated with unsuccessful
projects and so may not be added to the Repository. The achievement of good
results with selected datasets from the Repository is no guarantee of the
success of a method with new data, but experimentation with such datasets
can be a valuable step in the development of new methods.
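As a quick way to experiment, scikit-learn bundles a few classic datasets that originated in the UCI Repository, such as Iris, which avoids any assumptions about download URLs; a minimal sketch:

```python
# Load the Iris dataset (originally from the UCI Repository) via scikit-learn.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)    # (150, 4): 150 observations, 4 attributes
print(iris.target_names)  # the three classes to be learned
print(iris.feature_names) # attribute names, e.g., sepal length (cm)
```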
MODULE 1
ACTIVITY 1:
A Partial Requirement In
CHAPTER 2
GETTING TO KNOW YOUR DATA:
THE DATA WAREHOUSE
2.1. The Data Warehouse
A data warehouse means different things to different people. Some definitions are
limited to data; others refer to people, processes, software, tools, and data. One
of the global definitions is the following:
“The data warehouse is a collection of integrated, subject-oriented databases designed to
support the decision-support functions (DSF), where each unit of data is relevant to some
moment in time.”
The existence of a data warehouse is not a prerequisite for data mining; in
practice, the task of data mining, especially for some large companies, is made
a lot easier by having access to a data warehouse. A primary goal of a data
warehouse is to increase the “intelligence” of a decision process and the knowledge
of the people involved in this process.
The major task of operational database systems is to perform transaction and query
processing. These systems are called online transaction processing (OLTP) systems.
They cover most of the day-to-day operations of an organization such as
purchasing, inventory, manufacturing, banking, payroll, registration, and
accounting.
Data warehouse systems, on the other hand, serve users or knowledge workers in the
role of data analysis and decision making. Such systems can organize and present
data in various formats in order to accommodate the diverse needs of
different users. These systems are known as online analytical processing (OLAP)
systems.
2.2. OLTP vs. OLAP
Because operational databases store huge amounts of data, you may
wonder, “Why not perform online analytical processing directly on such
databases instead of spending additional time and resources to construct a
separate data warehouse?” A major reason for such a separation is to help
promote the high performance of both systems.
2.3. The Data Representation
It’s tempting to jump straight into mining, but first, we need to get the data
ready. This involves having a closer look at attributes and data values. Real-
world data are typically noisy, enormous in volume (often several terabytes or
more), and may originate from a hodge-podge of heterogeneous sources.
This chapter is about getting familiar with your data. Knowledge about your
data is useful for data preprocessing (covered in the next chapter), which is the
first major task of the data mining process.
Gaining such insight into the data will help with the subsequent analysis.
A dataset is arranged in rows and columns: each column is an attribute, and each row is a data object (also called an observation, instance, or data point).

store_id    sales     state      status
s_1001      $6,500    Kansas     open
s_1002      $7,400    Alabama    open
s_1003      $6,920    Texas      close
...         ...       ...        ...
What is an attribute?
An attribute is a data field, representing a characteristic or feature of a data
object. Attributes describing a customer object, for example, can include
customer ID, name, and address. Observed values for a given attribute are
known as observations. A set of attributes used to describe a given object is
called an attribute vector (or feature vector). The distribution of data
involving one attribute (or variable) is called univariate. A bivariate
distribution involves two attributes, and so on.
Example (Table 5.0): the attributes ‘course’ and ‘status’ are nominal (categorical)
attributes. Their values do not have any meaningful order and are not
quantitative in nature; it makes no sense to find the mean (average) value or
median (middle) value for such an attribute, given a set of objects.
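A small sketch makes the point: the mode (most frequent value) is defined for a nominal attribute, while the mean is only meaningful for numeric data. The ‘course’ values below are hypothetical stand-ins for Table 5.0:

```python
# Mode vs. mean: what you can and cannot compute for a nominal attribute.
from collections import Counter

course = ["BSIT", "BSCS", "BSIT", "BSIS", "BSIT"]  # nominal: no order
sales = [6500, 7400, 6920]                          # numeric: mean is meaningful

print(Counter(course).most_common(1))  # [('BSIT', 3)] -> the mode
print(sum(sales) / len(sales))         # 6940.0 -> the mean
# sum(course) would raise TypeError: arithmetic is undefined for nominals
```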
Important Note:
A binary attribute is symmetric if both of its states are equally valuable and
carry the same weight; that is, there is no preference on which outcome
should be coded as 0 or 1. One such example could be the attribute gender
having the states male and female.
A binary attribute is asymmetric if the outcomes of the states are not equally
important, such as the positive and negative outcomes of a medical test for
HIV. By convention, we code the most important outcome, which is usually
the rarest one, by 1 (e.g., Covid19 positive) and the other by 0 (e.g., Covid19
negative).
2.3.5. Numeric Attributes
A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values. Numeric attributes can be interval-scaled or
ratio-scaled.
Interval-scaled attributes are measured on a scale of equal-size units. The
values of interval-scaled attributes have order and can be positive, 0, or
negative. Thus, in addition to providing a ranking of values, such attributes
allow us to compare and quantify the difference between values.
Table 8.0: Interval-scaled temperature readings

Day         Time      Weather         Temperature (Celsius)   State
Monday      9:00am    Cloudy          35                      Kansas
Tuesday     12:00pm   Partly-cloudy   38                      Kansas
Wednesday   3:00pm    Sunny           40                      Kansas
...         ...       ...             ...                     ...
p_1002 Smartphone Alabama 7
p_1003 powerbank Texas 6
… … … …
There are two types of quantitative data, also referred to as numeric data:
continuous and discrete. As a general rule, counts are discrete and
measurements are continuous.
Table 11.0: Fragment of a household dataset

Household_ID   Member_ID   Relation   BMI    Age
h_1001         fm_1        Mother     25.6   34
h_1001         fm_2        Father     26.9   32
h_1001         fm_3        Sibling    18.4   12
...            ...         ...        ...    ...
Example (Table 11.0): the attribute ‘BMI’ represents continuous data, since BMI
is measured on a precise, continuous scale. Other continuous attributes include
height and weight.
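As a minimal sketch, BMI is weight in kilograms divided by height in meters squared, so it can take arbitrarily precise values, unlike a count (the figures below are hypothetical):

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: a continuous measurement."""
    return weight_kg / height_m ** 2

print(round(bmi(70.0, 1.65), 1))  # 25.7: continuous, arbitrarily precise
family_size = 3                   # a count: discrete
```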
MODULE 1
ACTIVITY 2:
A Partial Requirement In